<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Imitation Learning on Atari using Non-Expert Human Annotations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ameya Panse</string-name>
          <email>ameya.panse@tcs.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tushar Madheshia</string-name>
          <email>tushar.madheshia@tcs.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Sriraman</string-name>
          <email>anand.sriraman@tcs.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shirish Karande</string-name>
          <email>shirish.karande@tcs.com</email>
        </contrib>
        <aff>TCS Research - TRDDC, Hadapsar Industrial Estate, Pune, Maharashtra, India</aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we explore the problem of learning a policy from non-expert human demonstrators. We use a consensus algorithm to estimate consensus actions and learn worker skill levels. We iteratively update the skill levels while training an RL agent, using learned weights for the demonstrations over the entire training period. We perform our experiments in the Arcade Learning Environment (ALE), available through OpenAI Gym, and show initial results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Deep reinforcement learning has been shown to be very
successful in solving problems such as playing Atari games
        <xref ref-type="bibr" rid="ref7">(Mnih et al. 2013)</xref>
        and Go (Silver et al. 2016). However,
initial learning via reinforcement learning can be extremely
slow and requires a large number of interactions with the
environment to achieve substantial performance.
Incorporating human knowledge can help accelerate the training of
reinforcement learning agents. Imitation learning allows an
agent to learn from human demonstrations by mimicking
their behaviour on a task.
      </p>
      <p>
        Many algorithms have been proposed for imitation
learning. However, most of the prior work, e.g. DAgger (Ross,
Gordon, and Bagnell 2011) and its extension AggreVaTe
        <xref ref-type="bibr" rid="ref9">(Ross and Bagnell 2014)</xref>
        , uses expert demonstrations to teach
an agent. In this paper, we look at the problem of learning
from non-expert human demonstrators. We model the
humans’ skill levels and learn the consensus actions at the
various states. By using a learned weighting of the
demonstrations, we can perform better than by treating all
demonstrations equally. We also use demonstration data
throughout the training of the agent, rather than only as a bootstrapping
method to improve initial performance. We base our work
upon demonstrations performed for Atari games by
non-expert volunteers. In the crowdsourcing literature, several
algorithms have been proposed to obtain consensus labels from a
set of worker labels. EM-based approaches such as
(Welinder and Perona 2010) have been popular for modeling
worker skill levels and obtaining the consensus label jointly.
Deep neural networks have also been used to obtain crowd
consensus
        <xref ref-type="bibr" rid="ref1">(Albarqouni et al. 2016)</xref>
        . We use an approach
similar to Welinder and Perona, but modify the algorithm to
obtain workers’ action probability distributions at each state.
      </p>
      <p>Copyright © 2018 for this paper by its authors. Copying permitted
for private and academic purposes.</p>
      <p>
        <xref ref-type="bibr" rid="ref4">(Gao et al. 2018)</xref>
        comes closest to our approach, in that
they also learn from imperfect demonstrations throughout
training. We differ from their approach in that we use an
iterative algorithm to learn the consensus policy across
demonstrations and weight the demonstrations by modeling the
workers’ skill levels. Our loss function and regularization
methods are also different.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <sec id="sec-2-1">
        <title>Reinforcement Learning</title>
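        <p>The Reinforcement Learning problem that we consider is
defined by a Markov Decision Process (MDP). An MDP is
characterized by a tuple <inline-formula><tex-math>\langle S, A, R, T, \gamma \rangle</tex-math></inline-formula>, where <inline-formula><tex-math>S</tex-math></inline-formula> is the
set of states, <inline-formula><tex-math>A</tex-math></inline-formula> is the set of actions, <inline-formula><tex-math>R(s, a)</tex-math></inline-formula> is the reward
function, <inline-formula><tex-math>T(s, a, s') = P(s' \mid s, a)</tex-math></inline-formula> is the transition
probability, and <inline-formula><tex-math>\gamma</tex-math></inline-formula> is the discount factor. An agent in a particular state
interacts with the environment by taking an action, and
receives a reward while transitioning to the next state.</p>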
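        <p>The goal of the agent is to learn a policy <inline-formula><tex-math>\pi^{*}</tex-math></inline-formula> that
maximizes the future discounted reward:</p>
        <p><disp-formula><tex-math>\pi^{*} = \arg\max_{\pi} \sum_{t} \gamma^{t}\, \mathbb{E}_{s_t, a_t}\left[ R_t \right]</tex-math></disp-formula></p>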
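        <p>As a small numerical illustration of the discounted reward inside this objective, the sketch below sums the rewards of a single episode weighted by powers of the discount factor. The function name and the reward values are hypothetical and are not taken from our experiments.</p>
        <preformat><![CDATA[
```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Accumulate from the last step backwards, so that each step
    # needs only one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards R_0, R_1, R_2 of a three-step episode.
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```
]]></preformat>
        <p>The policy objective is the expectation of this quantity over the trajectories generated by the policy; in practice it would be estimated by averaging over sampled episodes.</p>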
      </sec>
      <sec id="sec-2-2">
        <title>Proximal Policy Optimization</title>
        <p>In policy gradient methods, the policy gradient is estimated
and used in a stochastic gradient ascent algorithm.
Proximal Policy Optimization (Schulman et al. 2017) is a variant of
policy gradient methods in which the size of each policy update
is constrained while maximizing the clipped
objective.</p>
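        <p>The clipped objective can be sketched in a few lines of plain Python. This is a minimal illustration rather than our training code: the function name is hypothetical, the probability ratios and advantage estimates are made-up inputs, and the default clipping range of 0.2 is an assumption borrowed from common PPO practice.</p>
        <preformat><![CDATA[
```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Empirical mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the probability ratio to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Keep the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# Hypothetical ratios pi_new / pi_old and advantage estimates.
value = clipped_surrogate(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
print(round(value, 6))
```
]]></preformat>
        <p>Taking the minimum removes the incentive to move the ratio outside the clipping range, which is what limits the size of each policy update.</p>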
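        <p><disp-formula><tex-math>L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right)\hat{A}_t \right) \right]</tex-math></disp-formula>
The policy parameters are obtained as <inline-formula><tex-math>\arg\max_{\theta} L^{CLIP}(\theta)</tex-math></inline-formula>, subject to a constraint on</p>
        <p><disp-formula><tex-math>L^{KL} = \mathbb{E}_t\left[ KL\left( \pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_{\theta}(\cdot \mid s_t) \right) \right]</tex-math></disp-formula>
where KL is the Kullback-Leibler divergence. The
constraint is applied by using a penalty with coefficient <inline-formula><tex-math>\beta</tex-math></inline-formula> as follows:
<disp-formula><tex-math>\arg\max_{\theta} L^{PPO}(\theta), \qquad L^{PPO} = L^{CLIP} - \beta L^{KL}</tex-math></disp-formula></p>
        <p>The joint probability distribution over the observed states
<inline-formula><tex-math>S = \{s_i\}</tex-math></inline-formula>, worker annotations <inline-formula><tex-math>Z = \{z_{ij}\}</tex-math></inline-formula>, policy parameters
<inline-formula><tex-math>\theta</tex-math></inline-formula>, worker parameters <inline-formula><tex-math>W = \{w_j\}</tex-math></inline-formula> and difficulty parameters
<inline-formula><tex-math>D = \{d_i\}</tex-math></inline-formula> is
<disp-formula><label>(3)</label><tex-math>p(Z, W, D, \theta \mid S) = p(\theta) \prod_{i} \Big( p(d_i) \prod_{k} p(a_k \mid s_i, \theta) \Big) \prod_{j} p(w_j) \prod_{i,j} p(z_{ij} \mid s_i, d_i, w_j)</tex-math></disp-formula></p>
        <p>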
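Let <inline-formula><tex-math>S = \{s_i\}</tex-math></inline-formula> be the set of states observed. At each state <inline-formula><tex-math>s_i</tex-math></inline-formula>,
a worker <inline-formula><tex-math>j</tex-math></inline-formula> that sees the state takes an action <inline-formula><tex-math>z_{ij}</tex-math></inline-formula>. The
workers are asked to complete an episode, and every state-action
pair of the worker is recorded. Let <inline-formula><tex-math>\pi_j</tex-math></inline-formula> be the policy of
worker <inline-formula><tex-math>j</tex-math></inline-formula>. If we have multiple annotations for each state, then
it is easy to set up a standard consensus algorithm and to
estimate the consensus policy <inline-formula><tex-math>\pi_{cns}</tex-math></inline-formula> to be used for guided
exploration. However, in most practical cases, it is infeasible
to assume that every state has even a single annotation, let
alone multiple. Hence we need to extrapolate <inline-formula><tex-math>\pi_j</tex-math></inline-formula> to the states
not seen by the worker to arrive at a consensus. We make
use of deep neural networks for this generalization of the
policy to unseen states. We used a convolutional neural
network with three convolutional layers, similar to the Deep
Q-Network in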
          <xref ref-type="bibr" rid="ref8">(Mnih et al. 2015)</xref>
          .
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Parameterized Policy and Distillation</title>
        <p>We consider the parameterized policy of our agent <inline-formula><tex-math>\pi_{\theta}</tex-math></inline-formula>, where
<inline-formula><tex-math>\theta</tex-math></inline-formula> are the parameters, such as the weights and biases of our
network. We want to make use of the confidence values of
each action produced by the network to obtain better estimates of
the skill and difficulty parameters. Hence, the policy is learnt
in conjunction with the other parameters.</p>
        <p>Hinton, Vinyals, and Dean (2015) introduced Knowledge
Distillation, wherein a small student network learns
accurately from a large teacher network by matching soft
labels. Inspired by this, our primary contribution in Eq. (2)
makes use of the consensus policy to guide the exploration of
the parameterized policy by adding a regularization loss that
matches the soft actions of <inline-formula><tex-math>\pi_{\theta}</tex-math></inline-formula> and <inline-formula><tex-math>\pi_{cns}</tex-math></inline-formula>.</p>
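        <p><disp-formula><label>(2)</label><tex-math>L^{D}(\theta) = \hat{\mathbb{E}}_s\left[ \left( \pi_{\theta}(\cdot \mid s) - \pi_{cns}(\cdot \mid s) \right)^{2} \right]</tex-math></disp-formula>
We scale the distillation loss by a coefficient <inline-formula><tex-math>\alpha</tex-math></inline-formula>. We reduce <inline-formula><tex-math>\alpha</tex-math></inline-formula> over time,
since the optimal policy need not match the crowd policy.
We estimate the distillation loss from a number of random
samples of the observed states.</p>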
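        <p>A minimal sketch of how the distillation loss of Eq. (2) and its decaying scale could be computed is given below. It assumes that the squared difference is summed over the actions of each sampled state, and it uses a geometric decay for the scale; both choices, as well as the function names and the numbers, are illustrative assumptions and not a description of our implementation.</p>
        <preformat><![CDATA[
```python
def distillation_loss(policy_probs, consensus_probs):
    """Mean over sampled states of the squared distance between the
    agent's and the consensus action distributions.

    Both arguments are lists with one action distribution per state.
    """
    total = 0.0
    for agent_dist, consensus_dist in zip(policy_probs, consensus_probs):
        # Assumption: sum the squared differences over the actions.
        total += sum((p - c) ** 2 for p, c in zip(agent_dist, consensus_dist))
    return total / len(policy_probs)


def distillation_scale(iteration, initial=1.0, decay=0.9):
    """Hypothetical geometric decay of the scale applied to the loss."""
    return initial * decay ** iteration


# One sampled state with two actions: the agent is undecided,
# the consensus prefers the first action.
loss = distillation_loss([[0.5, 0.5]], [[1.0, 0.0]])
print(loss)  # 0.25 + 0.25 = 0.5
print(distillation_scale(0) * loss)
```
]]></preformat>
        <p>Reducing the scale over the course of training lets the reward-driven objective dominate once the agent no longer benefits from matching the crowd.</p>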
      </sec>
      <sec id="sec-2-4">
        <title>Worker Skill and Difficulty</title>
        <p>Let <inline-formula><tex-math>w_j</tex-math></inline-formula> be the parameters encoding the skill level of
worker <inline-formula><tex-math>j</tex-math></inline-formula>. The skill level should represent the confidence
in the worker’s actions. For example, an expert should have
a high skill level compared to a non-expert. For each
worker, we estimate their skill level. The action
probabilities of a state are weighted according to the skill levels of
the workers annotating the state and the inherent difficulty
of the state <inline-formula><tex-math>i</tex-math></inline-formula>, encoded by <inline-formula><tex-math>d_i</tex-math></inline-formula>.</p>
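        <p>We model the worker skill as <inline-formula><tex-math>0 \le w_j \le 1</tex-math></inline-formula>, where a highly
skilled worker has <inline-formula><tex-math>w_j</tex-math></inline-formula> close to 1. We assume a prior given by a mixture of
Beta distributions to model different types of workers (high
skill, low skill, spammers). We model the difficulty level of
state <inline-formula><tex-math>i</tex-math></inline-formula> as <inline-formula><tex-math>0 \le d_i \le 1</tex-math></inline-formula>. Further parameterization of
the worker and difficulty parameters based on the time elapsed, so as
to take into account the improvement of the worker, is being
considered as part of future work.</p>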
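        <p>To make the mixed Beta prior concrete, the sketch below evaluates the density of a three-component mixture over the skill level. The mixture weights and shape parameters are purely illustrative assumptions, including the choice of a uniform component for spammers; they are not the values used in our experiments.</p>
        <preformat><![CDATA[
```python
import math


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


# Hypothetical mixture: (weight, a, b) for each type of worker.
SKILL_PRIOR = [
    (0.5, 8.0, 2.0),  # high skill: most mass close to 1
    (0.3, 2.0, 8.0),  # low skill: most mass close to 0
    (0.2, 1.0, 1.0),  # spammers: modeled here as uniform
]


def skill_prior(w, components=SKILL_PRIOR):
    """Density of the mixture prior p(w) at skill level w."""
    return sum(weight * beta_pdf(w, a, b) for weight, a, b in components)


for w in (0.1, 0.5, 0.9):
    print(w, round(skill_prior(w), 4))
```
]]></preformat>
        <p>Because the mixture weights sum to one, the prior remains a proper density on the unit interval.</p>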
      </sec>
      <sec id="sec-2-5">
        <title>Parameter Estimation</title>
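        <p>Let <inline-formula><tex-math>A = \{a_k\}</tex-math></inline-formula> be the set of all possible actions. We
define the joint probability distribution <inline-formula><tex-math>p(Z, W, D, \theta \mid S)</tex-math></inline-formula> over the worker annotations <inline-formula><tex-math>Z</tex-math></inline-formula>,
the worker parameters <inline-formula><tex-math>W</tex-math></inline-formula>, the difficulty parameters <inline-formula><tex-math>D</tex-math></inline-formula> and the policy parameters <inline-formula><tex-math>\theta</tex-math></inline-formula>,
given the observed states <inline-formula><tex-math>S</tex-math></inline-formula>.</p>
        <p>
          We now estimate the parameters by an alternating
maximization algorithm
          <xref ref-type="bibr" rid="ref2">(Branson, Van Horn, and Perona 2017)</xref>
          :
        </p>
        <p><disp-formula><label>(4)</label><tex-math>\hat{\pi}_{cns}(a_k \mid s_i) = p(a_k \mid s_i, \hat{\theta}) \prod_{j} p(z_{ij} \mid a_k, \hat{d}_i, \hat{w}_j)</tex-math></disp-formula></p>
        <p><disp-formula><label>(5)</label><tex-math>\hat{a}_i = \arg\max_{a_k} \hat{\pi}_{cns}(a_k \mid s_i)</tex-math></disp-formula></p>
        <p><disp-formula><label>(6)</label><tex-math>\hat{d}_i = \arg\max_{d_i}\ p(d_i) \prod_{j} p(z_{ij} \mid \hat{a}_i, d_i, \hat{w}_j)</tex-math></disp-formula></p>
        <p><disp-formula><label>(7)</label><tex-math>\hat{w}_j = \arg\max_{w_j}\ p(w_j) \prod_{i} p(z_{ij} \mid \hat{a}_i, \hat{d}_i, w_j)</tex-math></disp-formula></p>
        <p><disp-formula><label>(8)</label><tex-math>\hat{\theta} = \arg\max_{\theta} \left( L^{PPO}(\theta) - \alpha L^{D}(\theta) \right)</tex-math></disp-formula></p>
        <p>Here <inline-formula><tex-math>p(a_k \mid s_i, \hat{\theta})</tex-math></inline-formula> is the confidence output from the agent,
<inline-formula><tex-math>p(d_i)</tex-math></inline-formula> and <inline-formula><tex-math>p(w_j)</tex-math></inline-formula> are priors, <inline-formula><tex-math>\alpha</tex-math></inline-formula> is the scale of the distillation loss, and
<inline-formula><tex-math>p(z_{ij} \mid a, \hat{d}_i, \hat{w}_j)</tex-math></inline-formula> is the probability of
worker <inline-formula><tex-math>j</tex-math></inline-formula> taking action <inline-formula><tex-math>z_{ij}</tex-math></inline-formula> given that <inline-formula><tex-math>a</tex-math></inline-formula> is the optimal action.
Eq. (8) is solved by stochastic gradient ascent.</p>
        <p><disp-formula><tex-math>p(z_{ij} \mid a, \hat{d}_i, \hat{w}_j) = \begin{cases} w_j (1 - d_i) &amp; \text{if } z_{ij} = a \\ \dfrac{1 - w_j (1 - d_i)}{|A| - 1} &amp; \text{otherwise} \end{cases}</tex-math></disp-formula></p>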
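        <p>The alternating updates can be illustrated with a short sketch. It implements the annotation likelihood, the consensus policy of Eq. (4) for a single state (normalised here so that it sums to one, which is an added convenience), the consensus action of Eq. (5), and the skill update of Eq. (7). The skill update is solved by a simple grid search with a flat prior, which is a stand-in chosen for brevity and not necessarily the optimiser we used; all function names and numbers are hypothetical.</p>
        <preformat><![CDATA[
```python
def annotation_likelihood(z, a, d, w, num_actions):
    """p(z | a, d, w): the worker picks the optimal action a with
    probability w * (1 - d), and any other action uniformly otherwise."""
    agree = w * (1.0 - d)
    if z == a:
        return agree
    return (1.0 - agree) / (num_actions - 1)


def consensus_policy(agent_probs, annotations, d, skills):
    """Eq. (4) for one state, normalised to sum to one.

    agent_probs: the agent's confidence for each action.
    annotations: dict mapping worker id -> action taken at this state.
    """
    num_actions = len(agent_probs)
    scores = []
    for a, confidence in enumerate(agent_probs):
        score = confidence
        for j, z in annotations.items():
            score *= annotation_likelihood(z, a, d, skills[j], num_actions)
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]


def consensus_action(probs):
    """Eq. (5): index of the most probable action."""
    return max(range(len(probs)), key=probs.__getitem__)


GRID = [k / 20 for k in range(1, 20)]  # candidate values 0.05 ... 0.95


def update_skill(worker_annotations, prior=lambda w: 1.0):
    """Eq. (7) by grid search (assumption: flat prior by default).

    worker_annotations: list of (z, a_hat, d_hat, num_actions) tuples.
    """
    def objective(w):
        value = prior(w)
        for z, a_hat, d_hat, n in worker_annotations:
            value *= annotation_likelihood(z, a_hat, d_hat, w, n)
        return value
    return max(GRID, key=objective)


# Hypothetical state with four actions and three annotating workers.
skills = {0: 0.9, 1: 0.8, 2: 0.3}
probs = consensus_policy([0.25] * 4, {0: 2, 1: 2, 2: 1}, d=0.1, skills=skills)
print(consensus_action(probs))  # the two skilled workers agree on action 2
```
]]></preformat>
        <p>The difficulty update of Eq. (6) has the same form as the skill update, with the product taken over the workers annotating a state instead of the states annotated by a worker.</p>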
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiment Setup and Results</title>
      <p>
        For our preliminary experiment, we wanted to choose three
types of Atari games: one where humans were better than
RL agents, one where the agent was significantly better,
and one where both were performing similarly. We
obtained the scores from
        <xref ref-type="bibr" rid="ref6">(McKenzie et al. 2017)</xref>
        for
human performance and from (Salimans et al. 2017) for
the agent performance. Based on the ratio of human to
agent score, we chose Bowling (ratio = 5.35), Seaquest (ratio =
11.44), Bankheist (ratio = 1.01) and Breakout (ratio = 0.08),
which were available on OpenAI Gym
        <xref ref-type="bibr" rid="ref3">(Brockman et al.
2016)</xref>
        . Bowling and Seaquest have a high ratio, indicating
that humans perform better than machines on these games.
Hence, there is room to imitate humans to better train our
agent. Since Bankheist has a ratio close to one, we do not
expect much difference between pure reinforcement learning
and our algorithm. For Breakout, our algorithm does worse
due to the fact that human performance is below the RL
performance.
      </p>
      <p>We invited 21 participants to volunteer and stored their
gameplay data, including actions performed, states and
rewards generated by the environment. The volunteers
consisted of our colleagues, family and friends. We only
explained the controls of the game to the players and did not
elaborate on the specific game mechanics. This was done
to ensure that they explored the game’s reward mechanisms
and would improve during the course of their episode.</p>
      <p>After collecting the demonstration data, we trained the
agent in four configurations. In the first configuration, we
did not incorporate the distillation loss during training, to
simulate vanilla RL training without demonstration data.
Next, we assigned equal skill levels to all the workers and did not
update them during training. This was a baseline to
understand how the algorithm performed if the worker skill level
was not modeled and all demonstrations were treated as
oracle demonstrations. Finally, we ran two configurations where
we updated the worker skill levels at a low frequency (an
update every 10 iterations) and at a high frequency (an update
every iteration). We trained all configurations for 50 iterations,
each iteration being 4 epochs.</p>
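      <p>The four configurations differ only in whether the distillation loss is used and in how often the worker skill levels are re-estimated. The sketch below writes this down as a small configuration table. The configuration names and keys are hypothetical, and placing the low-frequency updates at iterations 10, 20, and so on is an assumption about the exact schedule.</p>
      <preformat><![CDATA[
```python
NUM_ITERATIONS = 50

# Hypothetical names for the four training configurations.
CONFIGS = {
    "no_demonstrations": {"use_distillation": False, "skill_update_period": None},
    "equal_skill": {"use_distillation": True, "skill_update_period": None},
    "low_frequency": {"use_distillation": True, "skill_update_period": 10},
    "high_frequency": {"use_distillation": True, "skill_update_period": 1},
}


def skill_update_iterations(num_iterations, period):
    """1-based iterations at which worker skills are re-estimated.

    A period of None means that the skills are never updated.
    """
    if period is None:
        return []
    return [it for it in range(1, num_iterations + 1) if it % period == 0]


for name, config in CONFIGS.items():
    updates = skill_update_iterations(NUM_ITERATIONS, config["skill_update_period"])
    print(name, len(updates))
```
]]></preformat>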
      <p>In situations where human knowledge is useful for
imitation, as seen in Figure 1, updating the worker
parameters with a low frequency gave us the fastest training
improvement and the highest average score. In the case where
human and RL performance was similar (Bankheist), we do
not see a significant difference between our algorithms and
pure reinforcement learning. In situations where
human performance is significantly worse than RL
performance (like Breakout), our algorithm takes a hit, since
incorporating human skills worsens the performance.</p>
      <p>However, treating all workers as experts (equal high skill)
led to the worst performance in all cases, which suggests that
worker modeling is necessary for high performance levels.</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion and Limitations</title>
      <p>The number of observed states was very high.
Hence, while updating the worker and difficulty parameters,
it was infeasible to run over all observed states due to
memory constraints. Instead, we sampled all states where we had
multiple crowd inputs and matched those with an equal
number of randomly sampled states, for a total of 600 states.</p>
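      <p>A minimal sketch of this sampling step is shown below. It assumes that the random states are drawn without replacement from the states that have at most one annotation; the function name, the fixed seed, and the toy counts are hypothetical.</p>
      <preformat><![CDATA[
```python
import random


def sample_states_for_update(annotation_counts, rng=None):
    """Select the states used to update worker and difficulty parameters.

    annotation_counts: dict mapping state id -> number of crowd annotations.
    Returns every state with multiple annotations, followed by an equal
    number of randomly chosen other states (fewer if not enough exist).
    """
    rng = rng or random.Random(0)
    multi = [s for s, n in annotation_counts.items() if n > 1]
    others = [s for s, n in annotation_counts.items() if n <= 1]
    extra = rng.sample(others, min(len(multi), len(others)))
    return multi + extra


# Toy example: states 0-2 have two annotations, states 3-9 have one.
counts = {state: (2 if state < 3 else 1) for state in range(10)}
selected = sample_states_for_update(counts)
print(len(selected))  # 3 multiply-annotated states + 3 random states = 6
```
]]></preformat>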
      <p>The frequency of the parameter updates has an impact
on the learning time and on the performance of the agent,
and this relationship is not monotonic. Neither too low a frequency
(Equal Skill Worker) nor too high a frequency (High
Frequency Skill Update) produces the best results.
An adaptive method of updates might boost the performance
significantly.</p>
      <p>In this paper, we have introduced a novel formulation for
continuous use of non-expert demonstration data for RL.
We have shown that modeling the worker skill levels, and
using weighted demonstrations during training helps speed
up the training significantly. In the future, we plan to scale
up our experiments, by optimizing our web system and
getting more demonstrations from a public crowd. We also plan
on exploring more ways of modeling worker behaviour, e.g.
learning in-game, shared worker parameters across games
and modeling the interference created by the game delivery
system like lag, jitter, etc.</p>
      <p>Ross, S.; Gordon, G.; and Bagnell, D. 2011. A reduction of
imitation learning and structured prediction to no-regret
online learning. In Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics, 627–635.</p>
      <p>Salimans, T.; Ho, J.; Chen, X.; and Sutskever, I. 2017.
Evolution strategies as a scalable alternative to reinforcement
learning. arXiv preprint arXiv:1703.03864.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Albarqouni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Baur</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Achilles</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Belagiannis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Demirci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Navab</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images</article-title>
          .
          <source>IEEE Transactions on Medical Imaging</source>
          <volume>35</volume>
          (
          <issue>5</issue>
          )
          :
          <fpage>1313</fpage>
          -
          <lpage>1321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
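          <string-name>
            <surname>Branson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Van Horn</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>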
          <year>2017</year>
          .
          <article-title>Lean crowdsourcing: Combining humans and machines in an online system</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>7474</fpage>
          -
          <lpage>7483</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Brockman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pettersson</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>OpenAI Gym</article-title>
          .
          <source>arXiv preprint arXiv:1606.01540</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; et al.
          <year>2018</year>
          .
          <article-title>Reinforcement learning from imperfect demonstrations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05313</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Distilling the knowledge in a neural network</article-title>
          .
          <source>arXiv preprint arXiv:1503.02531</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>McKenzie</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Loxley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Billingsley</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Competitive reinforcement learning in Atari games</article-title>
          .
          <source>In Australasian Joint Conference on Artificial Intelligence</source>
          ,
          <fpage>14</fpage>
          -
          <lpage>26</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wierstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Playing Atari with deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1312.5602</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fidjeland</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ostrovski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; et al.
          <year>2015</year>
          .
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          (
          <issue>7540</issue>
          ):
          <fpage>529</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bagnell</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Reinforcement and imitation learning via interactive no-regret learning</article-title>
          .
          <source>arXiv preprint arXiv:1406.5979</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>