<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BeClone: Behavior Cloning with Inference for Real-Time Strategy Games</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Derek Martin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arnav Jhala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>North Carolina State University Raleigh</institution>
          ,
          <addr-line>NC 27606</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Behavior cloning (BC) techniques that combine self-play with imitation learning from experts to refine self-play models have shown performance improvements in robotics simulation domains. In this paper, we investigate the performance of this technique on real-time strategy (RTS) game tasks. Two challenges with this approach are the training time of agents and real-time adaptation to opponent strategies. We present BeClone, a framework for training agents in two phases. The first phase uses behavior cloning to learn base policies. The second phase uses an advantage actor-critic (A2C) reinforcement learning algorithm to adapt the base strategies through self-play and explore the action space. We demonstrate the BeClone framework on the microRTS domain by comparing the performance of its agents against the baseline A2C agent proposed by Huang and Ontañón. Our results on the resource gathering benchmark show improvement in agent performance in terms of both rewards and training time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Modern video games provide researchers with complex
and diverse environments in which to develop and test agent
models and algorithms. Real-time strategy (RTS) games have
been an important genre for AI researchers
due to characteristics like partial observability, simultaneous
play, and reasoning across multiple competencies. Unlike
in turn-based games like Chess or Go, the time taken to
act is an important factor in building successful agents.
Reinforcement learning algorithms that have been successful
in developing agents for turn-based games face challenges
in RTS domains. Partial observability of the game world
also increases the complexity of agent models because
strategies learned during limited self-play simulations are not
sufficient for reacting in real time to opponent strategies. These
challenges for RTS games have been well documented in
recent years [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        To overcome these challenges, a combination of heuristic
search and action policy learning algorithms has been
utilized. Heuristic search algorithms have an authoring
bottleneck in terms of fine-tuning heuristics and defining
action operators for planning algorithms. The portfolio search
approach [
        <xref ref-type="bibr" rid="ref7 ref8">8, 7</xref>
        ] that utilizes an ensemble of heuristics to
determine best actions has shown promise in search-based
methods for RTS agent modeling. There has been initial work
in addressing the authoring bottleneck of search-based
algorithms in RTS games by learning hierarchical task network
like structures [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. This paper will focus on reinforcement
learning, specifically deep reinforcement learning models
that have been recently utilized to model various
competencies in RTS game playing agents [
        <xref ref-type="bibr" rid="ref10 ref20">10, 20</xref>
        ]. For RL algorithms
in this domain, two hurdles have been sparse rewards and
the exploration of undesirable or sub-optimal states. RL algorithms for
RTS agents need efficient representations and directed
exploration algorithms to improve performance [
        <xref ref-type="bibr" rid="ref12 ref20">20, 12</xref>
        ]. This
paper describes a two-phase framework. In the first phase,
imitation learning, specifically behavior cloning, guides the
agent toward a desirable base policy by observing
expert game replays. In the second phase, the agent
uses the advantage actor-critic (A2C) framework to learn
adaptations to the base policy that maximize rewards through
self-play. Alone, pure imitation learning can only perform
as well as the experts it observes, and pure RL, as previously
stated, struggles with sparse rewards and the exploration of
undesirable states. We use the global state representation proposed
by Huang and Ontañón (HA) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] to train an agent with the
proposed framework. Our results on the benchmark domain
show improvement both in terms of rewards and learning
time (with regard to the first successful harvest and return of
resources).
      </p>
      <p>We present experimental results on the microRTS domain.</p>
      <p>
        microRTS is an open-source simplified RTS game that
implements the key challenges of RTS games and was designed
specifically as an open sandbox for AI research [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. It
allows researchers to configure the game environment for their
experiments. Figure 1 shows a frame from microRTS. The green
squares are resources that the worker units (gray circles)
gather and return to the base (white squares). The numbers on
the units indicate the number of resources that each unit
carries. After workers have gathered enough resources, they
can build barracks (gray squares) that produce light, heavy,
or ranged units. These units share a relationship similar to
rock, paper, scissors, where light units work best against
ranged, ranged work best against heavy, and heavy work best
against light [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Our experiments show that our approach
does significantly better on the small, 4x4 map and slightly
better or the same on the larger, 6x6 and 8x8 maps.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Real-time Strategy Games: In real-time strategy games,
players control armies composed of a variety of units,
usually in a partially observable environment. The environment is partially
observable in the sense that the player cannot
continuously monitor the opponent’s actions to predict their
strategy. Some RTS games also limit how much of the map
the player can see (fog of war). These factors (real-time,
multi-agent, and partially observable) combined make RTS
games challenging for humans and learning agents [
        <xref ref-type="bibr" rid="ref20 ref6">6, 20</xref>
        ].
The real-time factor forces the player to be quick when
making decisions because players can act
simultaneously and actions have different durations. Not only do
players have a short amount of time to make decisions,
but they also have a large action and state space to
consider when making those decisions. Players build their armies
based on some plan or strategy, causing their action space
to grow, or branch, exponentially. For example, 10 units with
5 possible actions each result in a branching factor of roughly 10
million (5^10) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Finally, since RTS games are usually partially
observable, players have to work under uncertainty.
      </p>
      <p>
        RL in RTS: Reinforcement learning is a good fit to model
several competencies within RTS games such as spatial
reasoning [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Sharma et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] combine
reinforcement learning with case-based reasoning (CBR) in the
MADRTS game domain. They present a multilayered
architecture that focuses on different competencies of the agent;
in the middle of this architecture resides a case-based
reinforcement learner that learns to make tactical decisions
over the game’s action space. Their experiments showed
that their learning agent not only succeeded on individual
tasks, but also had significant performance gains when
using knowledge transferred from previous tasks. This work
shows the effectiveness of using competency-specific expert
agents and reinforcement learning to perform well in
real-time strategy games. One challenge in this work is the time
needed to train the agent on individual competencies independently
and then tune the case-based module that determines the
policy for prioritizing recommendations from each module when
making the final decision. As shown by Weber et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], this
agent architecture leads to agents that struggle to
maintain a coherent plan due to constant queries and
responses to the case base.
      </p>
      <p>
        Similar to Sharma et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Lee et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] use case-based
reasoning (CBR) with reinforcement learning, but they apply it
in the RTS game WARGUS. Their system uses CBR to fulfill
the requirements necessary for reinforcement learning to
operate in RTS games. Those requirements are: (1) a state
space that can be represented as a Markov decision
process, in which future states depend on the actions taken in previous
states; (2) an abstracted state space that is reduced and
guarantees real-time performance; and (3) an online state space
that allows the agent to use information about the game state.
The state space and cases are built by learning player
behaviors and then using this knowledge to simplify the state space and
make decisions. Their experiments show that their method
outperforms the dynamic, adaptive, human-like scripts for
WARGUS. This is another example of how useful
transfer learning through observing other or previous agents is
when using reinforcement learning in RTS games.
      </p>
      <p>
        The work that is most relevant to this paper is by HA [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
They use deep reinforcement learning in microRTS to learn the
best state representation for training an RL agent on the
resource gathering competency. They compare two types of
representations: global and local. The global representation
trains the agent to select both the unit and its
action, and restricts the agent to one
action at a time. The local representation focuses on
individual units in the game that are controlled by the agent:
each unit learns which action to perform for itself. They
demonstrate their algorithm on three maps (4x4, 6x6, and 8x8)
and report that the local representation performs better than the global
representation. They also report a significant impact of map
size on agent performance, which they attribute to the
navigation challenges added by the larger maps. Their agent
struggles on larger maps and with control of a larger number of
units due to the sparsity of rewards for the RL algorithm. This
indicates that more efficient representations are needed to
improve the agent’s performance on the resource gathering
task. We show how a two-step algorithm, which performs
imitation learning to learn base policies from human gameplay
observations and then applies a self-play advantage actor-critic
algorithm (A2C) to fine-tune the base policies, improves
the performance of the RL agent on this resource gathering
benchmark.
      </p>
      <p>
        Imitation Learning from Observation: Imitation
learning is the process of an agent learning how to perform a task
by observing an expert first then trying to imitate the
expert. One of the earliest studies using imitation learning is
Abbeel and Ng’s [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] work on apprenticeship learning (another
name for imitation learning), in which an agent learns the reward
functions for Gridworld and a driving simulator. In this study
they characterize how long it takes for an apprentice agent
to converge and perform similarly to the expert agents that it
observes. In the Gridworld experiments, the task is to find a small set
of cells with positive rewards, and the number of
expert samples is varied. In the first experiment, both versions of their
apprentice agent converged to performance similar to that of the
observed expert agents. In the second Gridworld experiment
they compare the two previous apprentice agents to three
other agents (two parameterized agents and one “mimic the
expert” agent) and find that their agents outperformed the
other three agents. For the second experiment, a car driving
simulation, Abbeel and Ng [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] used imitation learning to
try to teach their agent different driving styles. After letting
the agent observe a number of “expert” driving styles for
approximately 2 minutes, the agent attempts to learn a
policy that best approximates each driving style. Their results
show that the agent is able to successfully mimic each
driving style.
      </p>
      <p>
        Torabi et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] propose a BC from observations
framework. In this framework, their agent uses an inverse dynamics
model to learn how actions affect the environment, combined with
behavior cloning. The agent first interacts with the
environment to gather prior experience, a base understanding of the
environment, and then uses expert state demonstrations only (no
expert actions are provided) to improve its model. They ran
experiments in OpenAI Gym and compared their
framework against three other imitation learning algorithms: BC,
Feature Expectation Matching (FEM), and Generative
Adversarial Imitation Learning. Their approach is the only one
that does not use the demonstrator’s actions while training,
and they found that their framework works as well as or
better than the methods that require access to the
demonstrator’s actions. We based our BC approach on this approach,
except that we included the action because our goal is a specific
task. When we scale this research to a full game, we plan
to remove access to the demonstrator’s actions or use
self-supervised imitation learning to ensure our agent is more
general.
      </p>
      <p>
        Imitation learning has also been done through observations
from video, like Aytar et al.’s [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] study. In this study,
their agents learn how to play three games, Montezuma’s
Revenge, Pitfall! and Private Eye, with sparse rewards by
observing YouTube videos of humans playing those games.
They found that pure RL agents are not able to collect the
sparse rewards in Montezuma’s Revenge and Pitfall! and are
only able to gather the first two rewards in Private Eye. A
key takeaway from this work is that their approach uses a
combination of rewards from multiple modalities (video and
      </p>
      <p>Table 1. Global Representation features: Hit Points, Resources, Owner, Unit Type, Action.</p>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>
        We present a framework that trains an agent on the resource
gathering benchmark in the microRTS domain. The algorithm
operates in two phases. In the first phase, the agent learns a base
policy from gameplay observations of a hard-coded
agent. In the second phase, the agent refines the base policy
through the advantage actor-critic algorithm with self-play
simulations. We treat the game map as a grid, and each
feature map conforms to this grid. Each cell of the feature map
represents some information about the current game state,
such as resources held or team. The full list of features
and associated values is given in Table 1.
      </p>
      <p>
        Imitation learning is used to create a baseline policy for later
use in reinforcement learning. The first half of training relies on
observing the gameplay of expert agents. These expert agents
could be human players, such as the ones used by Weber et
al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or hand-programmed planning agents with different
heuristic functions that demonstrate a variety of behaviors.
This informs the agent of possible actions to take in given
game states and of basic policies for resource gathering.
      </p>
      <p>
        Global Representation: The environment is represented
as an observation matrix of size width (w) × height
(h) × number of features (nf). There are 5 feature maps in
total, which keep track of hit points, resources held, team,
action, and parameter. Each feature map has the same size as
the game map, which gives the observation matrix its size. To
complete the input tensor, each cell of the feature maps
contains a vector with the size of the largest feature (nc). In this
paper, the largest feature has 7 values, so we use this size for
each cell. The final one-hot encoded tensor has a size of
nf × w × h × nc. This is the same representation that HA use
in their experiments.
      </p>
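      <p>The encoding described above can be illustrated with a minimal sketch. It assumes each feature map is given as a grid of small non-negative integers; the function name <monospace>one_hot_observation</monospace>, the example map, and the example feature value are hypothetical and are not taken from the microRTS code base or from HA's implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
NUM_FEATURES = 5  # hit points, resources held, team, action, parameter
NUM_CLASSES = 7   # size of the largest feature (nc), as stated in the text


def one_hot_observation(feature_maps, num_classes=NUM_CLASSES):
    """Turn integer feature maps [nf][w][h] into a one-hot tensor [nf][w][h][nc]."""
    tensor = []
    for fmap in feature_maps:
        plane = []
        for column in fmap:
            cells = []
            for value in column:
                if not 0 <= value < num_classes:
                    raise ValueError(f"feature value {value} out of range")
                one_hot = [0] * num_classes
                one_hot[value] = 1
                cells.append(one_hot)
            plane.append(cells)
        tensor.append(plane)
    return tensor


# Hypothetical 4x4 map in which every feature value is 0 except one cell.
width, height = 4, 4
maps = [[[0] * height for _ in range(width)] for _ in range(NUM_FEATURES)]
maps[1][2][3] = 5  # illustrative value for "resources held" at cell (2, 3)

obs = one_hot_observation(maps)
shape = (len(obs), len(obs[0]), len(obs[0][0]), len(obs[0][0][0]))
print(shape)  # (nf, w, h, nc)
```
]]></preformat>
      <p>Padding every feature to the size of the largest one wastes some entries, but it keeps the tensor rectangular, which is what a convolutional or fully connected network input typically expects.</p>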
      <p>
        This representation is used to select the unit to control
and the action for the selected unit to perform. Similar to
HA [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], we only focus on harvesting resources, so the only
actions we consider are Wait (No Operation), Move,
Harvest, and Return. Some of the actions require a parameter,
such as a direction. For example, if the unit in the first row of
Figure 2 wants to harvest resources, then the model needs to
output Harvest and left as the action and parameter values.
The agent also has to predict the location of the unit it wants
to control.
      </p>
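      <p>As a result, the action vector is
<inline-formula><tex-math>a_t = [a_t^{x}, a_t^{y}, a_t^{\mathrm{action}}, a_t^{\mathrm{param}}]</tex-math></inline-formula>,
where <inline-formula><tex-math>a_t^{x}</tex-math></inline-formula> is the predicted unit x-coordinate,
<inline-formula><tex-math>a_t^{y}</tex-math></inline-formula> is the predicted unit y-coordinate,
<inline-formula><tex-math>a_t^{\mathrm{action}}</tex-math></inline-formula> is the selected action in frame t, and
<inline-formula><tex-math>a_t^{\mathrm{param}}</tex-math></inline-formula> is the parameter of the selected action.</p>
      <p>The agent learns which moves are valid and invalid for the
selected unit. Also, the agent is only able to issue one
command per turn. To summarize, the global representation
encodes the entire game state, and the RL agent uses this
information to select what unit to control, what action to perform,
and the direction in which to perform the action.</p>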
      <p>
        Advantage Actor Critic: The advantage actor-critic
(A2C) algorithm is a reinforcement learning algorithm for
learning generalized policies. We chose this framework
because it enables the use of distributed RL by utilizing more
than one actor and a critic, and we could focus on adapting to
local contexts such as risk management [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. However, for
this paper we only focus on resource harvesting to
understand the effects of directing the exploration policy of our
RL agent using imitation learning. We base our
implementation on Mnih et al.’s [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] A3C algorithm to implement our
A2C algorithm similar to HA [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The A2C algorithm has
two components: the actor and the critic. The actor
maintains the policy and handles the decision making, while the
critic evaluates the actor’s decisions (calculating the loss used to
update the actor’s policy) and estimates the value function
of the environment. Together they work to maximize the
expected reward [
        <xref ref-type="bibr" rid="ref1">1</xref>
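        ]. After every episode, that is, once
<inline-formula><tex-math>t_{\max}</tex-math></inline-formula> steps have elapsed or a terminal
state is reached, the policy and value function are updated.
The policy is updated via
<disp-formula><tex-math>\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')\, A(s_t, a_t; \theta, \theta_v),</tex-math></disp-formula>
where <inline-formula><tex-math>A(s_t, a_t; \theta, \theta_v)</tex-math></inline-formula> is the estimated advantage function,
calculated by
<disp-formula><tex-math>\sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V(s_{t+k}; \theta_v) - V(s_t; \theta_v),</tex-math></disp-formula>
where k represents the number of states visited during the
episode and has a maximum of <inline-formula><tex-math>t_{\max}</tex-math></inline-formula>.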
      </p>
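      <p>As a worked illustration of the advantage estimate, the following sketch computes the k-step advantage from a list of rewards and two value estimates. The function name and the numbers are hypothetical, and the reward indexing follows the n-step form of Mnih et al., which is an assumption about the intended formula rather than a detail confirmed here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def n_step_advantage(rewards, value_start, value_end, gamma):
    """k-step advantage: sum_i gamma^i * r_i + gamma^k * V(s_end) - V(s_start).

    rewards     -- the k rewards collected from the start state onward
    value_start -- critic estimate V(s_t)
    value_end   -- critic estimate V(s_{t+k}); use 0.0 for a terminal state
    gamma       -- discount factor
    """
    k = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** k) * value_end - value_start


# Hypothetical 3-step rollout that ends in a terminal state.
adv = n_step_advantage([0.0, 0.0, 1.0], value_start=0.5, value_end=0.0, gamma=0.5)
print(adv)  # 0.25 * 1.0 - 0.5 = -0.25
```
]]></preformat>
      <p>A negative advantage, as in this example, means the rollout did worse than the critic expected, so the update lowers the probability of the actions that were taken.</p>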
      <p>
        Behavior Cloning: Behavior cloning (BC) trains an agent
using data collected from expert playthroughs (by humans or
other agents). BC with inference means that, during training, the
agent tries to infer the expert agent’s selected action given a
sampled state and its resulting state (the outcome of the expert’s
action). Our hypothesis was that letting our agent
observe how expert agents harvest resources would give it a
baseline policy to use when focusing on harvesting resources
or any other task that we provide samples for. Using BC,
we provide demonstrations that direct the agent’s
exploration toward important areas of the action space and form a
baseline policy. Directing its exploration also helps
to minimize the chance of the agent exploring undesirable
states [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The resulting policy gives the agent a baseline
and, if trained long enough, the agent would eventually perform exactly
like the experts, unlike agents that rely on random exploration,
as most RL algorithms do [
        <xref ref-type="bibr" rid="ref12 ref2">12, 2</xref>
        ].
      </p>
      <p>
        As mentioned earlier, reinforcement learning struggles in
environments with sparse rewards, and learning how to
harvest from scratch means the agent will experience sparse
rewards as it learns how to harvest by itself. This exploration
for resources is also combined with learning to navigate
the grid with respect to the relative locations of the base
and resources. By starting with a learned player model, the agent
learns the basic movements and navigation primitives. There
are some limitations to using learned player models. If
the observed agents are not sufficiently diverse in their
play, the agent could learn the biases of
the observed agents. Beyond the microRTS domain, the learning
agent would also be sensitive to the variety and size of maps
on which the observed agents perform. Also, we are
only concerned with navigation and harvesting tasks in this
work, but when the player model takes into account all available
actions (Wait, Move, Return, Harvest, Produce, and
Attack), which involve additional competencies such as production, attack,
and defense, it becomes more challenging for the agent to learn
coherent policies due to interleaved actions. This challenge has
previously been documented and addressed to some extent
in the StarCraft community [
        <xref ref-type="bibr" rid="ref11 ref9">9, 11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Training</title>
      <p>Training through Behavior Cloning: As previously
stated, the first half of BeClone’s training involves BC
using the hard-coded agents supplied by Ontañón et al. with
the microRTS source code. We used the WorkerRush and
WorkerDefense agents to learn a model. We chose
these agents because they have at least one worker unit
dedicated to harvesting resources as quickly as possible.
Therefore, our agent only needs to observe the game state and then
attempt to infer the action of the worker harvesting
resources. While training with BC, our A2C model is treated
as an ordinary feed-forward neural network. The hard-coded
agent’s actions are used as labels to produce the network
error, and those values are then backpropagated to update the
weights (treated as the policy) of the networks. The agents
(WorkerRush and WorkerDefense) play against the
Idle agent in the microRTS tournament environment for 100
iterations (each lasting 2000 time steps or until a terminal game state), so BeClone
can focus solely on harvesting resources without having to
worry about being attacked.</p>
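      <p>The supervised step described above, in which the expert's action serves as the label, can be sketched as follows. The sketch assumes a cross-entropy error summed over four output heads (unit x, unit y, action, parameter); the text does not name the error function, so this choice, the function names, and the example values are assumptions for illustration only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bc_loss(head_logits, expert_action):
    """Sum of per-head cross-entropy between the network output and the expert label."""
    return sum(
        -math.log(softmax(logits)[label])
        for logits, label in zip(head_logits, expert_action)
    )


def grad_wrt_logits(logits, label):
    """Gradient of one head's cross-entropy w.r.t. its logits: softmax minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


# Hypothetical untrained network on a 4x4 map: all four heads output uniform logits.
head_logits = [[0.0] * 4, [0.0] * 4, [0.0] * 4, [0.0] * 4]
expert_action = [1, 2, 2, 3]  # illustrative labels: x, y, action index, parameter index

loss = bc_loss(head_logits, expert_action)
grad = grad_wrt_logits(head_logits[2], expert_action[2])
print(loss, grad)
```
]]></preformat>
      <p>The gradient with respect to the logits is what backpropagation would push through the rest of the network; it is negative only at the expert's label, so each update raises the probability of the expert's choice.</p>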
      <p>Training through Reinforcement Learning: After
learning a baseline policy through BC, we use A2C to
complete the rest of the training. We transferred the weights from
the model in the previous section to initialize the full
network, then followed the implementation of HA for
updating the network. After initialization, the observation tensor
is fed into the policy network and the value network
(see Figure 3 for the network architecture). The value network
takes the observation tensor and outputs an estimated
value for the current state. When the observation tensor is
fed into the policy network, the output is split into 4 logits
vectors. Each vector is then converted into a probability
distribution from which values are sampled to produce the action vector
<inline-formula><tex-math>a_t = [a_t^{x}, a_t^{y}, a_t^{\mathrm{action}}, a_t^{\mathrm{param}}]</tex-math></inline-formula>.</p>
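      <p>The sampling step can be sketched as below: each of the four logits vectors is turned into a categorical distribution with a softmax, and one index is drawn per head. The function names, the seed, and the logits are hypothetical; the actual network outputs and head sizes depend on the map and on HA's implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(head_logits, rng):
    """Draw one index per head: [unit x, unit y, action, parameter]."""
    action = []
    for logits in head_logits:
        probs = softmax(logits)
        index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        action.append(index)
    return action


# Hypothetical logits for a 4x4 map with 4 actions and 4 parameter values.
rng = random.Random(0)
head_logits = [
    [0.0, 0.0, 0.0, 0.0],  # unit x-coordinate
    [0.0, 0.0, 0.0, 0.0],  # unit y-coordinate
    [2.0, 0.0, 0.0, 0.0],  # action head, biased toward index 0
    [0.0, 0.0, 0.0, 0.0],  # parameter head
]
action = sample_action(head_logits, rng)
print(action)
```
]]></preformat>
      <p>Sampling rather than taking the most likely index keeps some exploration in the self-play phase, which is the usual reason for this design in actor-critic methods.</p>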
      <p>To calculate the loss over the episode, we use the formula:
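<disp-formula><label>(1)</label><tex-math>\log \pi_\theta(a_t \mid s_t) = \log \pi_\theta(a_t^{x} \mid s_t) + \log \pi_\theta(a_t^{y} \mid s_t) + \log \pi_\theta(a_t^{\mathrm{action}} \mid s_t) + \log \pi_\theta(a_t^{\mathrm{param}} \mid s_t)</tex-math></disp-formula>
Here <inline-formula><tex-math>s_t</tex-math></inline-formula> is the state at time step t and
<inline-formula><tex-math>\pi_\theta(a_t \mid s_t)</tex-math></inline-formula> is the probability of action
<inline-formula><tex-math>a_t</tex-math></inline-formula>.
To compute entropy we use the formula:
<disp-formula><label>(2)</label><tex-math>\mathrm{entropy} = -\sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t)</tex-math></disp-formula>
This is the sum of the entropy values of
<inline-formula><tex-math>a^{x}, a^{y}, a^{\mathrm{action}}, a^{\mathrm{param}}</tex-math></inline-formula>.</p>
      <p>To calculate the error, or gradient, of the network:
<disp-formula><label>(3)</label><tex-math>\underbrace{(G_t - v_{\theta'}(s_t)) \nabla_\theta \log \pi_\theta(a_t \mid s_t)}_{\text{policy gradient}} + \underbrace{\alpha (G_t - v_{\theta'}(s_t)) \nabla_{\theta'} v_{\theta'}(s_t)}_{\text{value estimation gradient}} + \underbrace{\beta \sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t)}_{\text{entropy regularization}}</tex-math></disp-formula>
      </p>
      <p>Here <inline-formula><tex-math>\theta'</tex-math></inline-formula> and
<inline-formula><tex-math>\theta</tex-math></inline-formula> are the weights for the value network and
the policy network, <inline-formula><tex-math>\alpha</tex-math></inline-formula> and
<inline-formula><tex-math>\beta</tex-math></inline-formula> are hyperparameters for
controlling the value estimation and entropy regularization
gradients, and <inline-formula><tex-math>G_t</tex-math></inline-formula> is the discounted reward. We used the same
parameters as Huang and Ontañón; the full
list of parameters is given in Table 2.</p>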
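      <p>The three quantities above can be combined into a single scalar objective, sketched below. This is one common way to write an A2C loss and is an assumption, not the exact update used by HA or in this paper: the value term is written as a squared error, the entropy is subtracted as a bonus, and the coefficient values in the example are hypothetical rather than those of Table 2.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_log_prob(head_logits, action):
    """log pi(a|s) as the sum of the four per-head log-probabilities."""
    return sum(math.log(softmax(l)[a]) for l, a in zip(head_logits, action))


def total_entropy(head_logits):
    """Sum over heads of -sum_a pi(a|s) * log pi(a|s)."""
    total = 0.0
    for logits in head_logits:
        probs = softmax(logits)
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total


def a2c_loss(head_logits, action, value, discounted_return, alpha, beta):
    """Scalar loss to minimize: policy term + value term - entropy bonus."""
    advantage = discounted_return - value  # G_t - v(s_t), treated as a constant
    policy_loss = -advantage * joint_log_prob(head_logits, action)
    value_loss = 0.5 * alpha * advantage ** 2
    return policy_loss + value_loss - beta * total_entropy(head_logits)


# Hypothetical example: uniform policy over 4 options in each of the 4 heads.
head_logits = [[0.0] * 4 for _ in range(4)]
action = [0, 1, 2, 3]
loss = a2c_loss(head_logits, action, value=0.5, discounted_return=1.0,
                alpha=0.5, beta=0.01)
print(loss)
```
]]></preformat>
      <p>Sign conventions differ between implementations depending on whether the expression is maximized or minimized, so the signs here should be checked against whichever optimizer convention is in use.</p>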
    </sec>
    <sec id="sec-5">
      <title>Experiment</title>
      <p>In this section we will compare our approach to HA’s
approach, on the same maps.</p>
      <p>Setup: We used similar maps to HA. We made the 4x4,
6x6, and 8x8 maps shown in Figure 4. They consist of two
worker units, two bases (the player’s base and the enemy base),
and one resource block containing 230 resources. The
resource block contains more than 200 resources
because, ideally, over the 2000-time-step episode the two
workers could harvest at most 200 resources on the
4x4 map. Each action takes 10 time steps to complete, so
to harvest and return one resource each worker needs
to harvest a resource (10 steps) and return the resource (10
steps), for a total of 20 time steps. Over the whole episode
(2000 time steps), the two workers would therefore be
able to harvest at most 200 resources: 2*2000/20 = 200.</p>
      <p>Comparison: BeClone, when compared to HA, almost
doubled the average reward on the 4x4 map, had a slight
increase in average reward on the 6x6 map, and showed no change on
the 8x8 map. BeClone also affected the harvest and return
times, but the direction of the change depended on the map size.
For the 4x4 map, there was a
significant increase in average reward but also a significant
increase in the average time taken for harvesting and returning
resources. This may be an acceptable trade-off because even
though it takes longer on average, it also performs almost
twice as well as HA’s harvesting agent. For the
larger maps (6x6 and 8x8), however, our approach significantly
decreased the time taken to harvest and return a resource (when the
agent was successfully able to return the resource), although
there was no significant increase in the average reward for these
maps.</p>
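      <p>The upper bound of 200 resources used in the setup can be reproduced with a few lines of arithmetic. The function name and its parameters are hypothetical; like the estimate in the text, the calculation is idealized and ignores the time workers spend moving between the resource and the base.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def max_harvest(num_workers, episode_steps, steps_per_action, actions_per_trip=2):
    """Idealized upper bound on resources returned in one episode.

    One trip is a harvest action followed by a return action,
    and each trip delivers a single resource.
    """
    steps_per_trip = steps_per_action * actions_per_trip
    return num_workers * (episode_steps // steps_per_trip)


bound = max_harvest(num_workers=2, episode_steps=2000, steps_per_action=10)
print(bound)  # 2 * 2000 / 20 = 200
```
]]></preformat>
      <p>Because the bound is below the 230 resources in the block, the block cannot be exhausted within an episode, so running out of resources never caps the reward.</p>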
      <p>This improvement was a result of imitation learning.
Imitation learning improved on HA’s global representation agent
because BeClone observed an “expert” and learned what to
do in certain game states, whereas the base agent had to learn
what to do through trial and error. Also, if given enough
time to train, BeClone would eventually learn the hard-coded
agents’ behaviors. This gave BeClone a strong baseline
policy to operate with. Using A2C then enabled the agent to
try to maximize its performance. However, during the
reinforcement learning phase of training and testing, we noticed
that the agent would often move the worker units to the base
or try to return resources before harvesting a resource. This
is why we believe that BeClone performed roughly the same
as HA’s global agent on the larger maps. We believe that
changing the reward structure so that harvesting a resource
weighs more than returning it, or adding negative
rewards when the agent attempts to return a resource without
carrying one, could potentially improve BeClone’s
performance on the larger 6x6 and 8x8 maps.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>
        Our results show that BeClone made a significant
improvement on the smaller map and performed similarly on the larger maps
compared to the previously best-performing agents on the benchmark [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
These results indicate that BC is a promising
addition to the A2C model. Without shaping
the reward structure or invalid action masking, we believe
that the BC model on the larger maps learned the path to
and from the base and showed improved performance on the
resource gathering task. It did struggle on the larger maps
to successfully return the resources to the base, similar
to prior work. We often noticed that BeClone would move
one or both worker units to the base and would not move
them far from it. One issue
with the agents in the original formulation of the task was
that the rewards were not set up so that harvesting
and navigation had to be combined, which would encourage
efficient navigation with respect to timely harvesting and
delivery of resources back to the base. In the original setup,
the agent receives no penalty for returning to the base
without a resource, leading to this inefficiency in the learned
agent. In the imitation learning phase, the reward for correct
movement and delivery is tied to replicating the
example movements of observed bots that are more efficient
in combining movement with timely and correct delivery of
resources. This is an important insight for future work.
      </p>
      <p>In summary, we have introduced a framework that
combines BC with actor-critic reinforcement learning to improve
performance over current benchmarks of resource
gathering tasks in RTS game environments. Immediate next steps
for this work are in scaling up learning in terms of domain
and task complexity. Instead of directly cloning behavior,
using cloned behavior tasks as a partial plan and inferring
actions that are appropriate for the current context is a
direction worth exploring. On the BC part, there are several
interesting avenues to pursue on the formal characteristics
of quantity and diversity of examples, learning from a
variety of human traces, difference between observations from
experts vs. novice players, and learning from state
observation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Vijay R</given-names>
            <surname>Konda</surname>
          </string-name>
          and
          <string-name>
            <given-names>John N</given-names>
            <surname>Tsitsiklis</surname>
          </string-name>
          .
          <article-title>“Actor-critic algorithms”</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          .
          <year>2000</year>
          , pp.
          <fpage>1008</fpage>
          -
          <lpage>1014</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew Y</given-names>
            <surname>Ng</surname>
          </string-name>
          . “
          <article-title>Apprenticeship learning via inverse reinforcement learning</article-title>
          ”
          .
          <source>In: Proceedings of the twenty-first international conference on Machine learning</source>
          .
          <year>2004</year>
          , p.
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Manu</given-names>
            <surname>Sharma</surname>
          </string-name>
          et al. “
          <article-title>Transfer Learning in Real-Time Strategy Games Using Hybrid CBR/RL</article-title>
          .” In: IJCAI. Vol.
          <volume>7</volume>
          .
          <year>2007</year>
          , pp.
          <fpage>1041</fpage>
          -
          <lpage>1046</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jaeyong</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bonjung</given-names>
            <surname>Koo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kyungwhan</given-names>
            <surname>Oh</surname>
          </string-name>
          . “
          <article-title>State space optimization using plan recognition and reinforcement learning on RTS game</article-title>
          ”.
          <source>In: Proceedings of the International Conference on Artificial Intelligence, Knowledge Engineering, and Data Bases</source>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
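          <string-name>
            <given-names>Ben G</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Mateas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arnav</given-names>
            <surname>Jhala</surname>
          </string-name>
          . “
          <article-title>Case-based goal formulation</article-title>
          ”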
          .
          <source>In: Proceedings of the AAAI Workshop on Goal-Driven Autonomy</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ben George</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Mateas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arnav</given-names>
            <surname>Jhala</surname>
          </string-name>
          . “
          <article-title>Building Human-Level AI for Real-Time Strategy Games</article-title>
          .”
          <source>In: AAAI Fall Symposium: Advances in Cognitive Systems</source>
          . Vol.
          <volume>11</volume>
          .
          <year>2011</year>
          , p.
          <fpage>01</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>David</given-names>
            <surname>Churchill</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Buro</surname>
          </string-name>
          . “
          <article-title>Incorporating search algorithms into RTS game agents”</article-title>
          .
          <source>In: AI and Interactive Digital Entertainment Conference, AIIDE (AAAI)</source>
          . Citeseer
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>David</given-names>
            <surname>Churchill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Abdallah</given-names>
            <surname>Saffidine</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Buro</surname>
          </string-name>
          . “
          <article-title>Fast heuristic search for RTS game combat scenarios”</article-title>
          .
          <source>In: Eighth Artificial Intelligence and Interactive Digital Entertainment Conference. Citeseer</source>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ben George</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Mateas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arnav</given-names>
            <surname>Jhala</surname>
          </string-name>
          . “
          <article-title>Learning from Demonstration for Goal-Driven Autonomy</article-title>
          .” In: AAAI.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Michael A</given-names>
            <surname>Leece</surname>
          </string-name>
          and
          <string-name>
            <given-names>Arnav</given-names>
            <surname>Jhala</surname>
          </string-name>
          . “
          <article-title>Reinforcement learning for spatial reasoning in strategy games”</article-title>
          .
          <source>In: Ninth Artificial Intelligence and Interactive Digital Entertainment Conference</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Michael A</given-names>
            <surname>Leece</surname>
          </string-name>
          and
          <string-name>
            <given-names>Arnav</given-names>
            <surname>Jhala</surname>
          </string-name>
          . “
          <article-title>Sequential pattern mining in StarCraft: Brood War for short and long-term goals</article-title>
          ”
          .
          <source>In: Tenth Artificial Intelligence and Interactive Digital Entertainment Conference</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Javier</given-names>
            <surname>García</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Fernández</surname>
          </string-name>
          . “
          <article-title>A comprehensive survey on safe reinforcement learning</article-title>
          ”.
          <source>In: Journal of Machine Learning Research</source>
          <volume>16</volume>
          .
          <issue>1</issue>
          (
          <year>2015</year>
          ), pp.
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Volodymyr</given-names>
            <surname>Mnih</surname>
          </string-name>
          et al. “
          <article-title>Asynchronous methods for deep reinforcement learning”</article-title>
          .
          <source>In: International conference on machine learning</source>
          .
          <year>2016</year>
          , pp.
          <fpage>1928</fpage>
          -
          <lpage>1937</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Santiago</given-names>
            <surname>Ontañón</surname>
          </string-name>
          . “
          <article-title>Combinatorial multi-armed bandits for real-time strategy games”</article-title>
          .
          <source>In: Journal of Artificial Intelligence Research</source>
          <volume>58</volume>
          (
          <year>2017</year>
          ), pp.
          <fpage>665</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Yusuf</given-names>
            <surname>Aytar</surname>
          </string-name>
          et al. “
          <article-title>Playing hard exploration games by watching youtube”</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          .
          <year>2018</year>
          , pp.
          <fpage>2930</fpage>
          -
          <lpage>2941</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Lasse</given-names>
            <surname>Espeholt</surname>
          </string-name>
          et al. “
          <article-title>Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures”</article-title>
          .
          <source>In: International Conference on Machine Learning. PMLR</source>
          .
          <year>2018</year>
          , pp.
          <fpage>1407</fpage>
          -
          <lpage>1416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Michael Ong</given-names>
            <surname>Leece</surname>
          </string-name>
          . “
          <article-title>Learning Hierarchical Abstractions from Human Demonstrations for Application-Scale Domains</article-title>
          ”
          .
          <source>PhD thesis</source>
          .
          <source>UC Santa Cruz</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Santiago</given-names>
            <surname>Ontañón</surname>
          </string-name>
          et al. “
          <article-title>The first microRTS artificial intelligence competition</article-title>
          ”.
          <source>In: AI Magazine</source>
          <volume>39</volume>
          .
          <issue>1</issue>
          (
          <year>2018</year>
          ), pp.
          <fpage>75</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Faraz</given-names>
            <surname>Torabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Garrett</given-names>
            <surname>Warnell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Stone</surname>
          </string-name>
          . “
          <article-title>Behavioral cloning from observation</article-title>
          ”. In: arXiv preprint arXiv:1805.01954
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Shengyi</given-names>
            <surname>Huang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Santiago</given-names>
            <surname>Ontañón</surname>
          </string-name>
          . “
          <article-title>Comparing Observation and Action Representations for Deep Reinforcement Learning in MicroRTS</article-title>
          ”. In: arXiv preprint arXiv:1910.12134
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Richoux</surname>
          </string-name>
          . “
          <article-title>microPhantom: Playing microRTS under uncertainty and chaos</article-title>
          ”. In: arXiv preprint arXiv:2005.11019
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>