=Paper= {{Paper |id=None |storemode=property |title=Learning Virtual Agents for Decision-Making in Business Simulators |pdfUrl=https://ceur-ws.org/Vol-627/mass_2.pdf |volume=Vol-627 |dblpUrl=https://dblp.org/rec/conf/mallow/GarciaBF10 }} ==Learning Virtual Agents for Decision-Making in Business Simulators== https://ceur-ws.org/Vol-627/mass_2.pdf
     Learning Virtual Agents for Decision-Making in
                   Business Simulators
         Javier García and Fernando Fernández                                       Fernando Borrajo
              Universidad Carlos III de Madrid                               Universidad Autónoma de Madrid
            Avenida de la Universidad, 30, 28911                  Crta. de Colmenar Viejo, km. 14, 28049, Madrid, Spain
                  Leganés, Madrid, Spain                                     Email: fernando.borrajo@uam.es
            Email: fjgpolo,ffernand@inf.uc3m.es



   Abstract—In this paper we describe SIMBA, a simulator for business administration, as a
Multi-Agent platform for the design, implementation and evaluation of virtual agents. SIMBA
creates a complex competitive environment in which intelligent agents play the role of
business decision makers. An important feature of the SIMBA architecture is that humans can
interact with virtual agents. Decision making in SIMBA is a challenge, since it requires
handling large and continuous state and action spaces. In this paper, we propose to tackle
this problem using Reinforcement Learning (RL) and K-Nearest Neighbors (KNN) approaches. RL
requires the use of generalization techniques to be applied in large state and action
spaces. We present different combinations in the choice of the generalization method based
on Vector Quantization (VQ) and CMAC. We demonstrate that learning agents are very
competitive, and that they can outperform human expert decision strategies from the business
literature.

                       I. INTRODUCTION

   Business simulators are a promising tool for research. The main characteristic of SIMBA
(SIMulator for Business Administration) [2] is that it emulates business reality. It can be
used from a competitive point of view, since different companies compete among themselves to
improve their results. In this paper, SIMBA is considered as a multi-agent framework where
the different agents manage their companies in different ways. SIMBA can include several
autonomous agents to play the role of competing teams and, based on the research on decision
making patterns of human teams, further research is made to improve the complexity and
effectiveness of such intelligent agents.
   Decision making in SIMBA requires handling more than 100 continuous state variables, and
more than 10 continuous decision variables, which makes the problem hard even for business
administration experts. The motivation of this paper is the design, implementation and
evaluation of virtual agents in SIMBA using different machine learning (ML) approaches. The
goal is that the developed agents can outperform human-like behavior when competing against
hand-coded and random virtual agents, but also against expert human players.
   Human players have experienced the consequences of their decisions in competition with
the developed virtual agents. But, given that the agents try to “win” in all cases, they
make the game too hard for novice players. So “pedagogical” objectives for human players
competing with our virtual agents are not directly included in the goal of this paper.
Designing virtual agents whose behavior challenges human players adequately is a key issue
in computer games development [23]. Games are boring when they are too easy and frustrating
when they are too hard [8]. The difficulty of the game is critically important for its
“pedagogical” worth. The game difficulty must be such that it is “just barely too difficult”
for the subject. If the game is too easy or too hard, its “pedagogical” worth appears to be
lower. So most games allow human players to adjust the basic difficulty (easy, medium, hard).
   However, agents developed to outperform human-like behavior can do very well under
narrow circumstances [15] (e.g., chess and Deep Blue, or Othello and Logistello). Deep Blue
defeated World Chess Champion Garry Kasparov in an exhibition match. Campbell and Hsu
describe the architecture and implementation of their chess machine in [7]. A few months
after this chess success, Othello became the new game to fall to computers when Michael
Buro’s program Logistello defeated the World Othello Champion Takeshi Murakami. In [3], Buro
discusses the learning algorithms used in his program. Thus, the goal of this paper is the
development of virtual “business” agents that are able to beat hand-coded and random virtual
agents, but also human business experts.
   To do so, we use two different learning approaches. The first one is Instance Based
Learning (IBL). In this paper we propose the Adaptive KNN algorithm, a variation of KNN,
where experience tuples are stored and selected automatically to generate new behaviors.
   However, decision making for business administration is an episodic task where decisions
are taken sequentially. Therefore we also propose to use Reinforcement Learning (RL). The
RL agents developed need to apply generalization techniques to perform the learning process,
given that both the state and action spaces are continuous. In this paper, we propose two
different generalization methods in order to tackle the large state and action spaces. The
first one, Extended Vector Quantization for Q-Learning, uses Vector Quantization (VQ) to
discretize both the state and action spaces, extending previous works where VQ was used only
to discretize the state space [6]. Some tasks have been solved by coarsely discretizing the
action variables [14], but to our knowledge, this is the first time that VQ is used to
discretize the action space.
The second generalization approach, CMAC-VQQL, is based on the combination of VQ to
discretize the action space and CMAC (Cerebellar Model Articulation Controller) [1], which
is motivated by CMAC’s demonstrated capability to generalize the state space.
   Section II describes SIMBA. Section III introduces the learning approaches proposed,
while Section IV shows how these approaches have been used to learn the virtual agents for
decision making in SIMBA. Section V shows comparative results of the virtual agents, when
competing among them but also when competing against expert human players. Section VI
summarizes the related work. Last, Section VII concludes.

                            II. SIMBA
   In this section, the SIMBA simulator is described in detail.

A. SIMBA’s Architecture

   Figure 1 shows the architecture of the business simulator from a Multi-Agent perspective.
The architecture designed enables multiple players to interact with the simulator, including
both software agents and human players. The main components of the system are:
   • Simulation Server: Once all decisions are taken for the current round, it computes the
     values of the variables in the marketplace for every player. Finally, it sends the
     results computed to each player. The player (software or human) uses these results to
     choose the best decisions in the next round of the simulation.
   • Simulation Control: It manages the software agents and their decisions. It receives
     the decisions taken by the software agents and sends them to the Simulation Server. The
     Simulation Server returns the computed results to the Simulation Control. The Simulation
     Control sends the results to the corresponding software agent.
   • Software Agents: They represent an alternative to human players. In every step, the
     software agents receive the results computed by the Simulation Server. The software
     agents use this information to take the decisions for the next round of the simulation.

                  Fig. 1.   SIMBA’s Architecture

B. Business Human Strategies
   Different business strategies appear in the business literature, and they all could be
followed to manage the companies in SIMBA, as will be shown in Section V. We describe some
classical ones:
   1. Incremental decisions. This type of business strategy is based on incremental changes
for all decision variables, which typically range from 10% to 20%. This business strategy
is considered a conservative behavior.
   2. Risk decisions. It is based on strong changes in business decisions. It has strong
impacts on market reactions, and is useful to detect gaps and market opportunities.
   3. Reactive. An organization with this type of strategy attempts to locate and maintain
a secure niche in a relatively stable product or service area [11].
   4. Low cost strategy. With this strategy, managers try to gain a competitive advantage by
focusing the energy of all the departments on driving the organization’s costs down below
the costs of its rivals [12].
   5. Differentiation and specialization. A differentiation strategy is seen when a company
offers a service or product that is perceived as unique or distinctive from its
competitors [12].
   Which management strategy is chosen at any moment depends on the organization’s
strengths and its competitors’ weaknesses.

C. Autonomous Decision Making in SIMBA
   The goal of this section is to describe how a SIMBA software agent can be implemented.
To do this, we describe the state and action spaces, the transition function to transit
between states and the variable to maximize.
   State Space. The state computed in every round or simulation step is composed of 174
continuous variables. Table I shows some of the features that compose the state space.
   Action Space. The players (software or humans) must approach the decisions on the
different functional areas of their companies. Each market in the competition requires the
use of 25 variables. This is an indicator of SIMBA’s capacity to approach the complexity of
managerial decision-making. In our experiments, we consider a subspace of the total action
space and we use only the ten variables shown in Table I. This reduction was suggested by
the experts, because the discarded variables are not very significant. All the actions that
the agents can perform are constrained by the semantics of the business model. For
instance, a company cannot sell its product if it does not have stock.
   Transition function. The different players participate in a simulation in a step by step
round mode. Each simulation step is called a period, which is equivalent to three real
months. When a round ends, the time machine is run. By doing this, the simulator integrates
the previous period’s situation, the teams’ decisions, and the parameters of the general
economic environment together with those of each geographic market, and orders the
Simulation Server to generate output information for the new period.
   Variable to maximize. The agents try to maximize the result of the exercise (profit).
From an RL point of view, the objective is to maximize the total reward received. In this
case, we
define the immediate reward as the result of the exercise in a period or step. Therefore,
there is no delayed reward and, as in other classical domains such as Keepaway [17],
immediate rewards received in every simulation step are relevant.

                            TABLE I
      A SUBSET OF FEATURES OF THE STATE AND ACTION SPACES.

   FEATURES of the State Space      FEATURES of the Action Space
   Account value                    Selling price
   Human resources                  Advertising expenses
   Material cost                    Network sales budget
   Operating margin                 Commercial information
   Financial expenses               Training budget
   Pre-tax income                   Production scheduled
   Tax                              Material order
   Training expenses                Research and Development budget
   Bank overdraft                   Loan
   Economic productivity            Term loan
   Advertising prediction
   Effort sales network

                            TABLE II
                     ADAPTIVE KNN ALGORITHM

Adaptive KNN
1. Gather experience tuples
  1.1. Generate the set C of experience tuples of the type <s, a, r> from an interaction of
       the agent in the environment, where s ∈ S, a ∈ A and r ∈ ℜ is the immediate reward.
2. During a new interaction between the agent and the environment
  2.1. Get state s from simulator
  2.2. Select the K nearest neighbors of s in the set C
    2.2.1. For each tuple c_i ∈ C, where c_i = <s_i, a, r>, calculate d(s, s_i)
    2.2.2. Order d(s, s_i) from lowest to highest
    2.2.3. Select first K tuples, C_K
  2.3. Select the tuple c_b with the best r, where c_b = <s_b, a_b, r_b> and c_b ∈ C_K
  2.4. Modify a_b from c_b, a_m = a_b ± random∆
  2.5. Execute action a_m obtaining reward r′ ∈ ℜ
  2.6. Update set C using the new experience
    2.6.1. Select the tuple c_w ∈ C_K with the worst reward r_w
    2.6.2. If r′ > r_w then replace the tuple c_w = <s_w, a, r_w> with the tuple
           <s, a_m, r′>
3. Return C

                            TABLE III
                    EXTENDED VQQL ALGORITHM

Extended VQQL
1. Gather experience tuples
  1.1. Generate the set C of experience tuples of the type <s_1, a, s_2, r> from an
       interaction of the agent in the environment, where s_1, s_2 ∈ S, a ∈ A and r ∈ ℜ is
       the immediate reward.
2. Reduce the dimension of the state space
  2.1. Let C_s be the set of states in C
  2.2. Apply a feature selection approach using C_s to reduce the number of features in the
       state space. The resulting feature selection process is defined as a projection
       Γ : S → S′
  2.3. Set C_s′ = Γ(C_s)
3. Discretize the state space
  3.1. Use GLA to obtain a state space discretization, D_s′ = {s′_1, s′_2, ..., s′_n},
       s′_i ∈ S′, from C_s′.
  3.2. Let VQ^{S′} : S′ → D_s′ be the function that given any state in S′ returns the
       discretized value in D_s′.
4. Discretize the action space
  4.1. Let C_a be the set of actions in C
  4.2. Use GLA to obtain an action space discretization, D_a = {a_1, a_2, ..., a_m},
       a_i ∈ A, from C_a
  4.3. Let VQ^A : A → D_a be the function that given any action in A returns the
       discretized value in D_a
5. Learn the Q-Table
  5.1. Map the set C of experience tuples to a set C′. For each tuple <s_1, a, s_2, r> in C,
       introduce in C′ the tuple <VQ^{S′}(Γ(s_1)), VQ^A(a), VQ^{S′}(Γ(s_2)), r>
  5.2. Apply the Q-Learning update function defined in equation 1 to learn a Q table
       Q : D_s′ × D_a → ℜ, using the set of experience tuples C′
6. Return Q, Γ, VQ^{S′}, and VQ^A

  III. PROPOSED ALGORITHMS FOR LEARNING VIRTUAL AGENTS

  In this section we describe the new learning algorithms proposed, based on KNN and RL.

A. Adaptive KNN

    In this paper, we propose a variant of KNN called Adaptive KNN (Table II). In this
variant, we can distinguish two phases. In the first one, a data set C is obtained during
an interaction between the agent and the environment. This data set C is composed of tuples
of the form <s, a, r> where s ∈ S, a ∈ A and r ∈ ℜ is the immediate reward. In the second
one, the set C obtained in the previous phase is improved during a new interaction between
the agent and the environment. In each step of this second phase, the simulator returns the
current state s where the agent is. The algorithm selects the K nearest neighbors to the
state s in C. Among these K neighbors, it selects the tuple with the best reward obtained
in phase one. It then slightly modifies the action of this tuple and executes it. If the
new reward obtained is better than the worst reward among the K neighbors, the new
experience replaces the worst of these K tuples. Thus, the algorithm adapts the initial set
C obtained in phase one, to get increasingly better results in the second phase.
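
The second phase of Adaptive KNN (Table II, steps 2.1 to 2.6) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `adaptive_knn_step`, the callable `env_step` standing in for the simulator, the Euclidean distance, and the uniform perturbation of width `delta` are all assumptions made for illustration.

```python
import math
import random

def adaptive_knn_step(C, s, env_step, k=3, delta=0.1, rng=random):
    """One step of phase two: choose an action for state s and update C in place.

    C        : list of [state, action, reward] experience tuples (output of phase one)
    env_step : hypothetical callable (state, action) -> reward, standing in for the simulator
    delta    : assumed maximum absolute perturbation of each action variable
    """
    # 2.2 Select the K nearest neighbours of s (Euclidean distance is an assumption).
    neighbours = sorted(C, key=lambda c: math.dist(s, c[0]))[:k]
    # 2.3 Among them, take the tuple with the best reward.
    best = max(neighbours, key=lambda c: c[2])
    # 2.4 Perturb its action: a_m = a_b +/- random delta.
    a_m = [a + rng.uniform(-delta, delta) for a in best[1]]
    # 2.5 Execute the modified action and observe the new reward.
    r_new = env_step(s, a_m)
    # 2.6 Replace the worst neighbour if the new experience is better.
    worst = min(neighbours, key=lambda c: c[2])
    if r_new > worst[2]:
        idx = next(i for i, c in enumerate(C) if c is worst)
        C[idx] = [list(s), a_m, r_new]
    return a_m, r_new
```

Because a tuple is only replaced when the new reward is strictly better than the worst reward among the selected neighbours, the size of C stays constant and the rewards stored around frequently visited states can only improve.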
B. RL Approaches

   Among many different RL algorithms, Q-learning has been widely used in the literature
[20]. In Q-Learning, the update function is performed following equation 1, where α is a
learning parameter, and γ is a discount factor that reduces the relevance of future
decisions.

  $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (1)$$

  Except in very small environments it is impossible to enumerate the state and action
spaces. In this section we explain two new approaches for the state and action space
generalization problem.
   1) Extended VQQL for state and action space generalization: Applying VQ techniques makes
it possible to find a more compact representation of the state and action space [6]. A
vector quantizer Q of dimension K and size N is a mapping from a vector (state or action)
in the K-dimensional Euclidean space, ℝ^K, into a finite set C containing N states,
Q : ℝ^K → C where C = {y_1, y_2, ..., y_N}, y_i ∈ ℝ^K. In this way, given C and a state
x ∈ ℝ^K, VQ(x) assigns x to the closest state from C, VQ(x) = arg min_{y∈C} {dist(x, y)}.
   To design the vector quantizer we use the Generalized Lloyd Algorithm (GLA). The
Extended VQQL algorithm is shown in Table III.
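
As a rough illustration of the building blocks just introduced, the sketch below implements the nearest-codeword mapping VQ(x), a basic Generalized Lloyd Algorithm, and the update of equation 1 on discretized indices. The function names (`nearest`, `gla`, `q_update`), the random codebook initialisation, the fixed iteration count, the Euclidean distance and the dictionary-based Q table are assumptions for illustration, not details given in the paper.

```python
import math
import random

def nearest(codebook, x):
    """VQ(x): index of the codeword closest to x (Euclidean distance assumed)."""
    return min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))

def gla(data, n, iters=20, seed=0):
    """Generalized Lloyd Algorithm: alternate nearest-codeword assignment and centroid update."""
    rng = random.Random(seed)
    # Initialising the codebook with n random samples is an assumption.
    codebook = [list(p) for p in rng.sample(data, n)]
    for _ in range(iters):
        cells = [[] for _ in range(n)]
        for x in data:
            cells[nearest(codebook, x)].append(x)
        for i, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook

def q_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.9):
    """Equation 1 applied to discretized indices; Q is a dict keyed by (state_idx, action_idx)."""
    best_next = max(Q.get((s_next, b), 0.0) for b in range(n_actions))
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, a)]
```

Under these assumptions, one codebook would be built from the projected states and another from the actions, and each experience tuple would be converted with `nearest` to a pair of state indices and an action index before calling `q_update`.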
   The Extended VQQL algorithm uses VQ to generalize the state and action spaces. In the
Extended VQQL algorithm, two vector quantizers are designed for each agent. The first one
is used to generalize the state space and the second one is used to generalize the action
space. The vector quantizers are designed from the input data C obtained during an
interaction between the agent and the environment. The data set C is composed of tuples of
the form <s_1, a, s_2, r> where s_1 and s_2 are in the state space S, a is in the action
space A and r is the immediate reward. In many problems, s is composed of a large number of
features. In these cases, we suggest applying feature selection to reduce the number of
features in the state space. Feature selection is a technique for selecting a subset of
relevant features. So feature selection is used to select the relevant features of S to
obtain a subset S′. This feature selection process is defined as Γ : S → S′. The set of
states s′ ∈ S′, C_s′, is used as input for the Generalized Lloyd Algorithm to obtain the
first vector quantizer. The vector quantizer VQ^{S′} is a mapping from a vector s′ ∈ S′
into a vector s′ ∈ D_s′, where D_s′ is the state space discretization
D_s′ = {s′_1, s′_2, ..., s′_n} for s′_i ∈ S′. The set of actions a ∈ A, C_a, is used as
input for the GLA to obtain the second vector quantizer.
   The vector quantizer VQ^A is a mapping from a vector a ∈ A into a vector a ∈ D_a, where
D_a is the action space discretization D_a = {a_1, a_2, ..., a_m} for a_i ∈ A. In the last
part of the algorithm, the Q-table is learned from the obtained discretizations using the
set C′ of experience tuples. To obtain the set C′ from C, each tuple in C is mapped to the
new representation. Therefore, every state s in C is first projected to the space S′ and
then discretized, i.e. VQ^{S′}(Γ(s)); every action a ∈ A in C is also discretized, VQ^A(a).
   2) CMAC-VQQL for state and action space generalization: CMAC is a form of coarse coding
[20]. In CMAC the features are grouped into partitions of the input state space. Each such
partition is called a tiling and each element of a partition is called a tile. Each tile is
a binary feature. The tilings are overlaid, each offset from the others. In each tiling,
the state is in one tile. The approximate value function, Q_a, is represented not as a
table, but as a parameterized form with parameter vector θ_t. This means that the
approximate value function Q_a depends totally on θ_t. In CMAC, each tile has an associated
weight. The set of all these weights is what makes up the vector θ. The approximate value
function, Q_a(s), is calculated as in equation 2.

  $$Q_a(s) = \vec{\theta}^{\,T} \vec{\phi} = \sum_{i=0}^{n} \theta(i)\,\phi(i) \qquad (2)$$

   The CMAC-VQQL algorithm, described in Table IV, combines two generalization techniques.
It uses CMAC to generalize the state space and VQ to generalize the action space. In this
case, a data set C is obtained during an interaction between the agent and the environment.
This data set C is composed of tuples of the form <s_1, a, s_2, r> where s_1 and s_2 are in
the state space S, a is in the action space A and r is the immediate reward. As before, s
is composed of a large number of features. Feature selection is used to select a subset S′
of the relevant features of S.
   The set of actions a ∈ A, C_a, is used as input for the GLA to obtain the vector
quantizer. The vector quantizer VQ^A is a mapping from a vector a ∈ A into a vector
a ∈ D_a, where D_a is the action space discretization D_a = {a_1, a_2, ..., a_m} for
a_i ∈ A. Later, the CMAC is built from C_s′ taking into account the obtained action space
D_a. For each state variable x′_i in s′ ∈ S′ the tile width and the number of tiles per
tiling are selected taking into account their ranges. In our work, a separate value
function for each of the discrete actions is used. In CMAC, each tile has an associated
weight. The set of these weights is what makes up the vector θ. In the last part, the Q
function is approximated by equation 2.

                            TABLE IV
                      CMAC-VQQL ALGORITHM

CMAC-VQQL
1. Gather experience tuples
  1.1. Generate the set C of experience tuples of the type <s_1, a, s_2, r> from an
       interaction of the agent in the environment, where s_1, s_2 ∈ S, a ∈ A and r ∈ ℜ is
       the immediate reward.
2. Reduce the dimension of the state space
  2.1. Let C_s be the set of states in C
  2.2. Apply a feature selection approach using C_s to reduce the number of features in the
       state space. The resulting feature selection process is defined as a projection
       Γ : S → S′
  2.3. Set C_s′ = Γ(C_s)
3. Discretize the action space
  3.1. Let C_a be the set of actions in C
  3.2. Use GLA to obtain an action space discretization, D_a = {a_1, a_2, ..., a_m},
       a_i ∈ A, from C_a
  3.3. Let VQ^A : A → D_a be the function that given any action in A returns the
       discretized value in D_a
4. Design CMAC
  4.1. Design a CMAC function approximator from C_s′ taking into account the obtained
       action space D_a.
5. Approximate the Q function
  5.1. Map the set C of experience tuples to a set C′. For each tuple <s_1, a, s_2, r> ∈ C,
       introduce in C′ the tuple <Φ(Γ(s_1)), VQ^A(C_a), Φ(Γ(s_2)), r>, where Φ is the binary
       vector of features
  5.2. Update the vector weights θ for the action VQ^A(C_a) using Φ(Γ(s_1)), Φ(Γ(s_2))
       and r.
  5.3. Apply the approximate value function defined in equation 2 to approximate the Q
       function for the action VQ^A(C_a) using θ and Φ(Γ(C_s)).
6. Return Q, Γ, θ, and VQ^A

                  IV. VIRTUAL AGENTS IN SIMBA

   In the following evaluation performed, we assume that 6 companies are controlled by
agents of different types. These agents are: Random Agents, that assign to each decision
variable a random value following a uniform distribution; Hand-Coded Agents, that modify
their decision variables by increasing their values using the Consumer Price Index (CPI);
RL Agents, using the Extended VQQL and CMAC-VQQL algorithms described in Section III-B; and
Adaptive KNN Agents, using the algorithm described in Section III-A.
   3) Executing the Extended VQQL Algorithm: Executing the Extended VQQL algorithm to learn
the VQ Agents requires performing the 5 steps of the algorithm:
   Step 1: Gather experience tuples. To gather experience, we explore the domain using hand-coded agents. Specifically, we obtain the experiences generated by a hand-coded agent managing company 1 against five hand-coded agents managing companies 2, 3, 4, 5 and 6 respectively.

   Step 2: Reduce the dimension of the state space. The goal of this step is to select, from among all features in the state space, those features most related to the reward (the result of the exercise). To perform this phase, we use the data-mining tool WEKA [22] with the attribute selection method CfsSubsetEval. This method evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. The resulting description of the state space after the attribute selection process is shown in Table I.

   Step 3: State space discretization. Now, we use the GLA to discretize the state space.

                              TABLE V
RESULTS FOR DIFFERENT CONFIGURATIONS OF EXTENDED VQQL (IN MILLIONS OF EUROS).

             Decisions: 128        Decisions: 64         Decisions: 32
States       Mean      Std         Mean      Std         Mean      Std
128          4.3       1.73        5.43      4.2E-04     7.73      0.09
64           6.21      3.15        7.51      0.32        8.14      0.51
32           4.94      3.68        5.69      0.06        7.62      0.28

                              TABLE VI
RESULTS FOR DIFFERENT CONFIGURATIONS OF CMAC-VQQL (IN MILLIONS OF EUROS).

                 Decisions: 64         Decisions: 32         Decisions: 8
Configuration    Mean      Std         Mean      Std         Mean      Std
1                4.87      1.62        6.49      0.04        7.0       0.12
2                6.23      0.13        5.82      0.90        6.25      0.37
3                5.26      0.20        5.95      3.2E-04     6.24      0.96
   Step 4: Discretize the action space. Again, we use the GLA to discretize the action space. The action space is composed of the features shown in Table I.

   Step 5: Learn the Q table. Once both the state and action spaces are discretized, the Q function is learned using the mapped experience tuples and the Q-Learning update function. The resulting Q table has n rows (where n is the number of discretized states) and m columns (where m is the number of discretized actions).

   4) Executing CMAC-VQQL Algorithm: Executing the CMAC-VQQL algorithm to learn the CMAC Agents requires performing the 5 steps of the algorithm as described in Table IV. Steps 1 and 2 of CMAC-VQQL are the same as steps 1 and 2 of Extended VQQL (gathering experience and reducing the dimension of the state space). Step 3 of CMAC-VQQL (action space discretization) is the same as step 4 of Extended VQQL. Step 4 is the design of the CMAC function approximator. In our experiments we use single-dimensional tilings. For each state variable, 32 tilings were overlaid, each offset from the others by 1/32 of the tile width. For each state variable, we specified the width of the tiles based on the width of the generalization that we desired. In the experiments we use three different configurations. The size of the primary vector θ is 754272 in Configuration #1 (x1tiles + x2tiles + ... + x12tiles), 1364320 in Configuration #2, and 2440704 in Configuration #3. In our work, we use a separate value function for each of the generalized actions. Last, step 5 of the algorithm, learning the Q approximations, can be performed.

A. Adaptive KNN in SIMBA

   To apply the Adaptive KNN algorithm to create a SIMBA software agent, we use the same state space, action space, and transition and reward functions as for the RL agent. We also use the same experience tuples as for the RL agent, although during the learning process the set is updated following step 6 of the algorithm (as described in Table II).

                              V. RESULTS

   In the experiments, the learning agent always manages the first company of the six involved in the simulations. Each experiment consists of 10 simulations or episodes with 20 rounds, and we obtain the mean value and the standard deviation for the result of the exercise during the 20 periods. In this situation, a hand-coded agent that manages company 1 against five hand-coded agents that manage companies 2, 3, 4, 5 and 6 respectively obtains a mean value of the result of the exercise of 2,901,002.13 euros. A random agent in the same situation obtains -2,787,382.78 euros.
   In the experiments with human experts, simulations have 8 rounds.

A. RL and KNN Results

   In the first set of experiments we use the Extended VQQL algorithm to learn an agent that manages company 1 and plays against five hand-coded agents that manage companies 2, 3, 4, 5 and 6 respectively. The results for different discretization sizes of the state (rows) and action (columns) spaces are shown in Table V.
   The best result is obtained when we use a vector quantizer of 64 centroids (or states) to generalize the state space and a vector quantizer of 32 centroids (or actions) to generalize the action space.
   In the second set of experiments we use the CMAC-VQQL algorithm. The results for the different CMAC configurations described in section IV-4 (rows) combined with the different sizes of the action space obtained by VQ (columns) are shown in Table VI.
   The best result is obtained when we use Configuration #1 of CMAC to generalize the state space and a vector quantizer of 8 centroids to generalize the action space. This value is smaller than the one obtained with Extended VQQL but, again, all the configurations obtain better results than the hand-coded agent.
   In the next set of experiments we use the KNN algorithm to build an agent. The results for the different KNN configurations are shown in Table VII.
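The Extended VQQL configurations compared above rest on Steps 3 to 5 of Section IV-3: quantize states and actions with the GLA, map every experience tuple < s1, a, s2, r > to centroid indices, and learn an n x m Q table with the Q-Learning update. The following is only a minimal sketch of that idea, not the authors' implementation. It assumes squared Euclidean distance for the quantizer, and the learning rate, discount factor and number of sweeps are illustrative values that this section does not specify. The function names (nearest, gla, learn_q_table, greedy_action) and the toy data are hypothetical.

```python
import random


def nearest(centroids, vec):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(centroids[i], vec)))


def gla(points, k, iters=20, seed=0):
    """Generalized Lloyd Algorithm sketch: returns k centroids for points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for p in points:
            cells[nearest(centroids, p)].append(p)
        for i, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous centroid
                centroids[i] = [sum(x) / len(cell) for x in zip(*cell)]
    return centroids


def learn_q_table(tuples, state_cb, action_cb,
                  alpha=0.1, gamma=0.9, sweeps=50, seed=0):
    """Batch Q-Learning over (s1, a, s2, r) tuples mapped to centroid indices.

    alpha, gamma and sweeps are assumed values chosen for illustration.
    """
    n, m = len(state_cb), len(action_cb)
    q = [[0.0] * m for _ in range(n)]  # n discretized states x m actions
    mapped = [(nearest(state_cb, s1), nearest(action_cb, a),
               nearest(state_cb, s2), r) for s1, a, s2, r in tuples]
    rng = random.Random(seed)
    for _ in range(sweeps):
        rng.shuffle(mapped)
        for i, j, k, r in mapped:
            target = r + gamma * max(q[k])
            q[i][j] += alpha * (target - q[i][j])
    return q


def greedy_action(q, state_cb, action_cb, state):
    """Action centroid with the highest Q value for the quantized state."""
    row = q[nearest(state_cb, state)]
    best = max(range(len(row)), key=row.__getitem__)
    return action_cb[best]


# Toy data invented for illustration: 2-feature states, 1-feature actions.
experience = [
    ((0.0, 0.1), (1.0,), (0.1, 0.0), 1.0),
    ((0.1, 0.0), (0.0,), (0.0, 0.1), 0.0),
    ((5.0, 5.1), (0.0,), (5.1, 5.0), 2.0),
    ((5.1, 5.0), (1.0,), (5.0, 5.1), 0.0),
]
state_cb = gla([t[0] for t in experience] + [t[2] for t in experience], k=2)
action_cb = gla([t[1] for t in experience], k=2)
q = learn_q_table(experience, state_cb, action_cb)
print(greedy_action(q, state_cb, action_cb, (0.05, 0.05)))
```

In this sketch the agent's decision is the action centroid itself, so a 64-32 configuration would correspond to calling gla with k=64 for states and k=32 for actions.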
                              TABLE VII
RESULTS FOR DIFFERENT CONFIGURATIONS OF KNN (IN MILLIONS OF EUROS).

               K = 5                 K = 10                K = 15
Learning       Mean      Std         Mean      Std         Mean      Std
Adaptive       6.44      3.99        9.81      0.21        9.89      0.32
No adaptive    7.86      1.15        5.20      1.11        7.47      4.36




   The columns of Table VII show the results for different values of K (5, 10 and 15 respectively). The first row presents the results of the Adaptive KNN algorithm, as described in Table II. The second row shows the results of a classical KNN approach, without the adaptation of the training set, i.e. without executing steps five and six of the Adaptive KNN algorithm. The best results are obtained with the adaptive version, for K=10 and K=15. In these cases, we obtain a mean value for the result of the exercise of 9.8 million euros, which is higher than the ones obtained with RL.
   In previous experiments, the learning agent always learned to manage the first company of the six involved in the simulations. However, the behavior of each company depends on its initial state and on historical data (periods -1, -2, etc.). Therefore, learning performance may vary from one company to another. To evaluate this issue, we repeat the learning process for the best learning configurations, for each of the six companies. Each experiment consists of 10 simulations with 20 rounds, and we obtain the mean value and the standard deviation for the result of the exercise during the 20 periods. The results shown in Figure 2 demonstrate that the Extended VQQL agent and the Adaptive KNN agent obtain similar results, and both obtain better results than the hand-coded agent.

Fig. 2. Mean value and standard deviation for the result of the exercise.

   Now, we compare the behavior of the best RL agent with the behavior of the best Adaptive KNN agent obtained in previous experiments. In this experiment, all the companies have the same initial state and historical data, so the result is independent of the company managed. This experiment consists of 10 simulations with 20 rounds, and we obtain the mean value and the standard deviation for the result of the exercise during the 20 periods. Figure 3 shows the mean value and the standard deviation for each kind of agent.

Fig. 3. RL Agents vs. Adaptive KNN Agents.

   For the Adaptive KNN agent, the average value grows from the first period and rises up to 16 million euros. However, we see that the standard deviation is very high, so the behavior of the agent managing different companies is very different. The result for the Extended VQQL agent shows two clearly differentiated behaviors: before period 8, and after period 8. In the first part, the result of the exercise always grows, and dominates the result of the Adaptive KNN agent. However, from period 8, the result of the exercise for the Extended VQQL agent stabilizes at a value of around 10 million, and it is dominated by the other agent from period 10. Interestingly, we have reviewed all the simulations performed, and this behavior always appears. We believe that the RL agent is affected by the CPI and the evolution of the market and that, with time, the actions obtained by the VQ algorithm become outdated (note that 8 periods are equivalent to two years). Therefore, if we focus on the early periods, the RL agents typically behave better than the KNN ones.

B. RL and KNN Agents vs. Human Experts

   In this section, we present experiments where software agents play against a human expert during 8 periods. The human expert is a full-time associate professor in Strategic and Business Organization at Universidad Autónoma de Madrid (UAM), where he is Director of the Master of Business Administration (Executive) and Director of the Doctorate Program in Financial Economics.
   In all the experiments, we use the best RL and Adaptive KNN agents obtained in the previous section. In the first experiment, the human expert uses the incremental decision strategy described in section II. The results are shown in Table VIII.

                              TABLE VIII
RESULTS FOR INCREMENTAL DECISION STRATEGY (IN MILLIONS OF EUROS)

Agent                    Simulation 10%    Simulation 20%
Extended VQQL 64-32      7.27              7.28
Adaptive KNN K=15        1.58              1.58
Human Expert             0.56              -0.18

   In this case, the Extended VQQL agent obtains the best results. Furthermore, given that only 8 periods are run, the RL agent performs much better than the Adaptive KNN agent.
The human expert obtains the worst results (independently of the increment used).
   In the second experiment, two different simulations with 8 rounds each are performed. The human expert combines the different business strategies described in section II. The results are shown in Table IX.

                              TABLE IX
AGENTS VS. HUMAN EXPERT (IN MILLIONS OF EUROS)

Agent                    Simulation 1    Simulation 2
Extended VQQL 64-32      5.36            6.09
Adaptive KNN K=15        3.47            2.53
Human Expert             -0.32           -1.30

   In all the experiments, the software agents obtain better results than the human expert. From a qualitative point of view, the virtual agents usually compete in the same market scope. They are very effective and efficient, and it is almost impossible to beat them under the parameter setting used in these simulations. The best strategies usually make decisions in different market scopes, using high or low strategies (for instance, low cost or differentiation and specialization). This means that, by using more competitive strategies, the gap between the performance of the virtual agents and the human experts could be reduced.

                              VI. BACKGROUND

   Business gaming usage has grown globally and has a long and varied history [4]. The first modern business simulation game can be dated back to 1932 in Europe and 1955 in North America. In 1932, Mary Birshstein, while teaching at the Leningrad Institute, got the idea to adapt the concept of war games to the business environment. In North America the first business simulator dates back to 1955, when RAND Corporation developed a simulation exercise that focused on the U.S. Air Force logistics system [9]. However, the first known use of a business simulator for pedagogical purposes in a university course was at the University of Washington, in a business policy course in 1957 [21].
   From this point, the number of business simulation games in use grew rapidly. A 2004 e-mail survey of university business school professors in North America reported that 30.6% of 1,085 survey respondents were current business simulation users, while another 17.1% of the respondents were former business game users [5].
   Over the years Artificial Intelligence (AI) and simulation have grown closer together. AI is used increasingly in complex simulation, and simulation is contributing to the development of AI [24]. The need for an increased level of reality and fidelity in domain-specific games calls for the use of methods that bring realism and intelligence to actors and scenarios (also in business simulators). Intelligent software agents, called "autonomous" avatars or virtual players, are now being embodied in business games. Software agents can interact with each other and their environment, producing new states, business information and events. In addition, these agents not only provide information but also may affect the environment and direction of the simulation [19].
   Machine learning techniques (decision trees, reinforcement learning, ...) have been used widely to develop software agents [18]. In [19], for example, the software agents use decision trees to learn different behaviors. In this case, virtual players could take on the role of an executive or salesperson from a supplier firm, a union leader, or any other role relevant to the simulation exercise. In [10] the learning agents use a typical genetic-based learning classifier system, XCS (eXtended learning Classifier System). In that work, RL techniques are used, allowing decision-making agents to learn from the reward obtained from executed actions and, in this way, to find an optimal behavior policy. In stochastic business games, the players take actions in order to maximize their benefits. While the game evolves, the players learn more about the best strategy to follow. Thus, RL can be used to improve the behavior of the players in a stochastic business game [13]. However, in all these cases, the business simulator games used did not involve the huge state and action spaces that SIMBA involves.
   In complex domains with large state and action spaces it is necessary to apply generalization techniques such as VQ or CMAC. VQ has been used successfully in many other domains [6]. In addition, CMAC [17] is extensively used to generalize the state space, but the research on problems where the actions are chosen from a continuous space is much more limited. KNN has also been used in the scope of Business Intelligence. In [16], the authors investigate the relationship among corporate strategies, environmental forces, and the Balanced Scorecard (BSC) performance measures using KNN. In this case, the authors used the same initial set of experience throughout and did not try to adapt it using the new experience generated during the game.
   An important issue that makes SIMBA different from other classical RL domains (like Keepaway [17]) is that it is not defined a priori as a cooperative or competitive domain. In SIMBA, the number of adversaries is very high, and so is the number of variables involved in the state and action spaces. In addition, in SIMBA software agents can play against humans. It is hard to find all these issues in other classical domains, so the learning process in SIMBA represents a real challenge.

                              VII. CONCLUSION

   This paper introduces SIMBA as a business simulator whose architecture enables different players, including both software agents and human players, to manage companies in different markets. The simulator generates a competitive environment, where the different agents try to maximize their companies' profits. SIMBA represents a complex domain with large state and action spaces. Therefore, the learning approaches applied to generate the virtual agents must handle that handicap. We have demonstrated that the proposals presented, based on Lazy Learning and RL, achieve the goal of being very competitive
when compared with previous hand-coded strategies. Furthermore, we demonstrate that when competing with a human expert who follows classical management strategies, the learning agents are able to outperform the human.
   In the case of RL, the choice of the generalization method has a strong effect on the results that we obtain. For this reason, the state and action space representation is chosen with great care, and we have proposed two new methods: Extended VQQL and CMAC-VQQL. This is the first time that VQ is used to discretize the action space, and some preliminary results have shown that it is also useful in other domains, like autonomous helicopter control. The results obtained by the learning approaches to generate virtual agents in SIMBA are promising for Autonomous Decision Making.

                              ACKNOWLEDGMENT
   This work has been partially supported by the Spanish MICINN project TIN2008-06701-C03-03 and by the Spanish TRACE project TRA2009-0080. The authors would like to thank the people from Simuladores Empresariales S.L.

                              REFERENCES
 [1] J. S. Albus. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement, and Control, 97(3):220–227, 1975.
 [2] F. Borrajo, Y. Bueno, I. de Pablo, B. n. Santos, F. Fernández, J. García, and I. Sagredo. Simba: A simulator for business education and research. Decision Support Systems, June 2009.
 [3] M. Buro. Improving heuristic mini-max search by supervised learning. Artificial Intelligence, 134:85–99, 2002.
 [4] A. J. Faria, D. Hutchinson, W. J. Wellington, and S. Gold. Developments in business gaming: A review of the past 40 years. Simulation & Gaming, 40(4):464–487, August 2009.
 [5] A. J. Faria and W. J. Wellington. A survey of simulation game users, former-users, and never-users. Simul. Gaming, 35(2):178–207, 2004.
 [6] F. Fernández and D. Borrajo. Two steps reinforcement learning. International Journal of Intelligent Systems, 23(2):213–245, 2008.
 [7] F.-H. Hsu. Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton University Press, Princeton, NJ, USA, 2002.
 [8] R. Hunicke and V. Chapman. AI for dynamic difficulty adjustment in games. 2004.
 [9] J. R. Jackson. Learning from experience in business decision games. California Management Review, 1:23–29, 1959.
[10] M. Kobayashi and T. Terano. Learning agents in a business simulator. In Proceedings 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation, 2003.
[11] R. E. Miles and C. C. Snow. Organizational strategy, structure, and process. McGraw-Hill, New York, 1978.
[12] M. E. Porter. Competitive Advantage: Creating and Sustaining Superior Performance. Free Press, New York, 1 edition, June 1985.
[13] K. K. Ravulapati and J. Rao. A reinforcement learning approach to stochastic business games. IIE Transactions, 36:373–385, 2004.
[14] J. Santamaria, R. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6:163–217, 1998.
[15] J. Schaeffer and H. J. van den Herik. Games, computers, and artificial intelligence. Artif. Intell., 134(1-2):1–7, 2002.
[16] M. H. Sohn, T. You, S.-L. Lee, and H. Lee. Corporate strategies, environmental forces, and performance measures: a weighting decision support system using the k-nearest neighbor technique. Expert Syst. Appl., 25(3):279–292, 2003.
[17] P. Stone, R. S. Sutton, and G. Kuhlmann. Reinforcement learning for RoboCup-soccer keepaway. Adaptive Behavior, 13(3):165–188, 2005.
[18] P. Stone and M. Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, June 2000.
[19] G. J. Summers. Today's business simulation industry. Simul. Gaming, 35(2):208–241, 2004.
[20] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). MIT Press, May 1998.
[21] H. J. Watson. Computer Simulation in Business. John Wiley & Sons, Inc., New York, NY, USA, 1981.
[22] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
[23] J. Yang, S. Min, C.-O. Wong, J. Kim, and K. Jung. Dynamic game level generation using on-line learning. In Edutainment'07: Proceedings of the 2nd international conference on Technologies for e-learning and digital entertainment, pages 916–924, Berlin, Heidelberg, 2007. Springer-Verlag.
[24] L. Yilmaz, T. Ören, and N.-G. Aghaee. Intelligent agents, simulation, and gaming. Simul. Gaming, 37(3):339–349, 2006.