<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Human-in-the-Loop Applied Machine Learning, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Unveiling the Role of Expert Guidance: A Comparative Analysis of User-centered Imitation Learning and Traditional Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amr Gomaa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bilal Mahdy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence (DFKI)</institution>
          ,
          <addr-line>Saarbrücken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saarland Informatics Campus, Saarland University</institution>
          ,
          <addr-line>Saarbrücken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>0</volume>
      <fpage>4</fpage>
      <lpage>06</lpage>
      <abstract>
<p>Integration of human feedback plays a key role in improving the learning capabilities of intelligent systems. This comparative study delves into the performance, robustness, and limitations of imitation learning compared to traditional reinforcement learning methods within these systems. Recognizing the value of human-in-the-loop feedback, we investigate the influence of expert guidance and suboptimal demonstrations on the learning process. Through extensive experimentation and evaluation conducted in a pre-existing simulation environment built on the Unity platform, we meticulously analyze the effectiveness and limitations of these learning approaches. The insights gained from this study contribute to the advancement of human-centered artificial intelligence by highlighting the benefits and challenges associated with incorporating human feedback into the learning process. Ultimately, this research promotes the development of models that can effectively address complex real-world problems.</p>
      </abstract>
      <kwd-group>
        <kwd>Human-in-the-loop Learning</kwd>
        <kwd>Learning From Demonstrations</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Imitation Learning</kwd>
        <kwd>Personalization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Related Work</title>
      <p>
        Human-centered artificial intelligence (HCAI) is an exciting new area of research that is
attracting increasing attention from researchers of both artificial intelligence (AI) and human-computer
interaction (HCI) [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. Despite the significant progress made in developing autonomous
systems, these systems still rely heavily on human operators, local or remote, to intervene and
help or take control in situations where the system cannot proceed, highlighting the need for
HCAI techniques to promote trust, control, and reliability between users and machines [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
However, developing and implementing these concepts remains a challenging and complex
task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As a result, there is still much room for improvement and further research in this
field [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Several approaches have proposed ways to incorporate human knowledge into neural
networks as a way of initialization, to guide network refinement, and to extract symbolic
information from the network [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. More recent attempts have tried to combine deep learning
with knowledge bases in joint models (e.g., for construction and population) [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Some work
has focused on integrating neural networks with classical planning by mapping subsymbolic
input to symbolic representations that automatic planners can use [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Recently, reinforcement learning (RL) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] has reemerged as a promising machine learning
approach within the field of autonomous systems (e.g., ChatGPT). These methods have
demonstrated increasing effectiveness in optimizing reward functions for complex environments.
However, shaping appropriate reward functions for intricate tasks and encompassing their
aspects remains a challenge [11]. In contrast, humans excel at rapidly acquiring complex skills
by observing and imitating others. Similarly, autonomous agents can take advantage of this
concept, known as learning from demonstrations (LfD) [12], to address the challenges mentioned
above using imitation learning (IL) methods using expert demonstrations [13]. Behavioral
cloning (BC) [14] and Generative Adversarial Imitation Learning (GAIL) are the state-of-the-art
and most prominent approaches employed to tackle imitation learning problems where the
agent has access to state and action information from the demonstrations [15]. Significant
progress has been made in Reinforcement Learning (RL) and Imitation Learning (IL) domains.
Torabi et al. [16] introduced an advanced adaptation of behavioral cloning known as Behavioral
Cloning from Observation, where the agent solely observes demonstration states without access
to the corresponding demonstration actions. In a separate study by Taylor [17], several methods
were proposed to facilitate the agent’s optimal utilization of knowledge from suboptimal human
demonstrations, including Learning from Human Demonstrations and Learning from Human
Feedback. Fang et al. [18] compared reinforcement and imitation learning for indoor visual
navigation. Unlike previous works, ours focuses solely on analyzing the efficacy of imitation
learning techniques, to assess the importance of learning from demonstrations as a
human-in-the-loop learning paradigm in a highly complex environment, regardless of the application
domain.
      </p>
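<p>The behavioral cloning idea discussed above reduces learning from demonstrations to supervised learning on recorded (state, action) pairs. The sketch below is a minimal PyTorch version with stand-in random data and an assumed network size, purely for illustration, not the paper's actual setup:</p>

```python
import torch
import torch.nn as nn

# Minimal behavioral-cloning sketch (illustrative assumptions throughout):
# the policy maps observations to action logits and is trained by
# supervised learning on recorded (state, action) demonstration pairs.
torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Stand-in demonstration data; real demonstrations would come from a
# human expert playing the game.
demo_states = torch.randn(64, 4)
demo_actions = torch.randint(0, 3, (64,))

losses = []
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(policy(demo_states), demo_actions)  # imitate demo actions
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

GAIL differs in that, instead of regressing directly onto demonstration actions, a discriminator provides a learned reward signal that the policy then optimizes with reinforcement learning.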
      <p>Thus, this paper contributes to the field of imitation and reinforcement learning,
evaluating its performance, robustness, and limitations. We conduct a detailed
investigation into the performance of these state-of-the-art imitation learning techniques in the context
of a simulated Bird Hunter game using Unity ml-agents1 and Pytorch2 to evaluate and compare
their effectiveness with traditional RL techniques; we investigate the impact of expert guidance
and suboptimal demonstrations on imitation learning performance compared to traditional
reinforcement learning across diverse environmental complexities. We utilize the Proximal Policy
Optimization (PPO) [19] and Soft Actor-Critic (SAC) [20] methods for our investigation,
as they are the most widely used reinforcement learning techniques, especially in simulation
frameworks such as Unity. We provide valuable insights into the comparative efficacy of IL and
traditional RL, contributing to the development of intelligent systems in various environmental contexts.</p>
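<p>For reference, the clipped surrogate objective that PPO [19] optimizes can be written compactly. The function below is a generic sketch of that standard objective, not the ml-agents implementation:</p>

```python
import torch

# PPO clipped surrogate loss (standard form from Schulman et al. [19]).
# ratio = pi_new(a|s) / pi_old(a|s); advantage estimates A(s, a).
def ppo_clip_loss(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    # Take the pessimistic (smaller) objective, negated for minimization.
    return -torch.min(unclipped, clipped).mean()
```

The clipping keeps the updated policy close to the policy that collected the data, which is the source of PPO's stable convergence noted in the results.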
      <sec id="sec-1-1">
        <title>Figure 1</title>
        <p>[Figure 1: the grayscale backdrop camera view (middle), and the high-complexity environment (right).]</p>
        <p>1: https://github.com/Unity-Technologies/ml-agents. 2: https://pytorch.org/.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>Our study adopts a systematic and progressive approach to comprehensively evaluate the
effectiveness of imitation learning with suboptimal and expert demonstrations, as well as its
comparison to reinforcement learning techniques such as PPO and SAC. Incremental
complexities are introduced to the base environment, incorporating new parameters and analytical
challenges at each stage, such as transitioning from grayscale to a colored environment and
introducing various bird species with distinct reward schemes. In reinforcement learning,
the agent interacts with an environment by selecting actions and receiving feedback through
observations and rewards. The observations provide information about the current state of the
environment, while the rewards serve as feedback signals that indicate the desirability of the
agent’s actions. Therefore, for each level of environment complexity, we establish the states of
observation and action, as well as the corresponding reward structure (i.e., reward shaping).</p>
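<p>The agent-environment interaction loop described above can be sketched as follows. The toy environment and random policy here are purely illustrative (one-dimensional aiming over ten positions), not the Bird Hunter simulation itself:</p>

```python
import random

# Toy illustration of the RL interaction loop: the agent selects actions,
# the environment returns an observation and a reward indicating the
# desirability of the action. All names here are illustrative.
class ToyBirdEnv:
    def reset(self):
        self.bird = random.randint(0, 9)   # bird position (the state)
        return self.bird

    def step(self, action):
        hit = action == self.bird
        reward = 1.0 if hit else -0.01     # feedback signal
        return self.reset(), reward, hit   # next observation, reward, done

random.seed(0)
env = ToyBirdEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    action = random.randint(0, 9)          # a real agent would use a policy
    obs, reward, done = env.step(action)
    total += reward                        # cumulative reward metric
```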
      <sec id="sec-2-1">
        <title>Base Environment (Low-complexity Environment).</title>
        <p>We conducted our study using a preexisting 2D simulated Bird Hunter game to train an
autonomous agent (see Figure 1). Initially,
a grayscale backdrop was used, with the bird represented as a white box on a black background.
The camera sensor captured grayscale images at a resolution of 50 pixels along each axis (x and y),
resulting in an observation space of 2500 pixels (50 x 50 x 1), where the 1 denotes a single-channel
image. The agent’s actions consisted of discrete pixel-coordinate pairs for movement, with
shooting performed automatically rather than treated as a separate action. The reward function (as
seen in Equation 1) assigns a reward of (+1) for hitting a bird and a negative reward of (-0.01)
for missing.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Limited Ammunition with Multiple Bird Environment (High Complexity Environment).</title>
        <p>In this environment, we enhance the complexity by assigning meaning to the colors in the
agent’s observation, rather than simply introducing a color channel to the environment. In
addition to the existing yellow bird as the primary target, two new types of birds are introduced.
The red bird serves as a bonus, appearing when the agent successfully hits two yellow birds,
while the black bird acts as a bomb, exploding upon contact (see Figure 1). Consequently, the
reward function is updated to include additional rewards for the red bird (+2) and the black bird
(-0.5), as seen in Equation 2.</p>
        <p>Furthermore, we introduce new parameters to enhance the agent’s convergence towards
pinpoint accuracy. The parameter ammo_total is introduced to determine a preset amount of
ammunition available for shooting. Another time-dependent parameter, ammo_t, specifies
the ammunition available to the player at time t. Furthermore, the reload duration t_reload is
incorporated to determine the time steps required to complete a reload action. At each time
step t, if ammunition is available (ammo_t &gt; 0), the agent is compelled to shoot. Otherwise,
a reload action is enforced, resetting the available ammunition to ammo_total after t_reload time
steps, as seen in Equation 3.</p>
        <p>ammo_t = ammo_{t−1} − 1 if ammo_{t−1} &gt; 0; ammo_total if t mod (t_empty + t_reload) = 0;
and 0 otherwise (3), where t_empty denotes the time step at which the ammunition ran out.</p>
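<p>A plausible reading of the ammunition update in Equation 3 can be sketched as below. The symbol names (ammo_t, ammo_total, t_empty, t_reload) are stand-ins, since the original notation did not survive extraction:</p>

```python
# Sketch of the ammunition update (Equation 3), under assumed symbol names:
# shoot while ammunition remains, stay empty during the reload window, and
# refill to ammo_total once the reload period elapses.
def ammo_update(ammo_prev, t, t_empty, ammo_total, t_reload):
    if ammo_prev > 0:
        return ammo_prev - 1              # each shot consumes one round
    if t % (t_empty + t_reload) == 0:
        return ammo_total                 # reload completed: refill
    return 0                              # still reloading
```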
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Discussion and Results</title>
      <p>In this section, we present the results obtained from different environment settings using various
RL and IL approaches. The comparison between approaches in each respective environment is
based on the evaluation metrics traditionally used to assess RL agents, as outlined below:
• Cumulative Reward Function: the mean reward obtained by the agent over a specified
number of steps. A higher value indicates better performance.
• Episode Length: the time taken for the agent to complete an episode, where episodes end
when any bird is shot. A lower value indicates better performance.
• Entropy: a measure of the agent’s uncertainty in choosing an action given the observed
state. A lower value indicates better performance.</p>
      <sec id="sec-3-1">
        <title>3.1. Low and Medium Complexity Setting</title>
        <p>First, we compare the performance of the SAC and PPO RL algorithms in the grayscale
environment and the RGB environment (i.e., the low- vs. medium-complexity environments), then choose
the superior RL algorithm for the comparison of RL to IL approaches. Figure 2 compares the
RL algorithms in terms of the metrics mentioned above. While PPO’s entropy is lower than that
of SAC, indicating a relatively more stable choice of actions, SAC converged faster than PPO in
terms of cumulative reward and step count. Thus, SAC is used in further comparisons. Next,
we compare traditional RL (i.e., SAC) to IL techniques (i.e., BC and GAIL). Figure 3
and Table 1 show the results of the RL and IL comparison. While RL converges
faster than both BC and GAIL, the IL techniques achieve better entropy, indicating more
stable learning and more consistent action choices. We also observed that the GAIL technique
alone is unstable and hard to converge in this medium-complexity environment, even when
trained for a very large number of steps (i.e., greater than a million).</p>
        <p>Lastly, the RGB environment was evaluated by comparing two types of demonstrations
used to train the imitation learning techniques (BC + GAIL): one from a proficient, experienced
user and the other from a suboptimal user. Both demonstrations came from the same user to
ensure consistency: the user shot as accurately as possible for the expert
demonstration and intentionally missed a few birds to record the suboptimal
demonstration. As a manipulation check, examination of the reward function showed that the competent
expert performed the task with high accuracy, achieving a mean reward of 0.997 with no missed
shots. In contrast, the suboptimal demonstration had a mean reward of 0.81, indicating
a higher frequency of missed shots. These demonstrations aimed to evaluate the performance of
imitation learning under the same environment complexity and conditions. Figure 4 illustrates
that the agent trained with the expert demonstration exhibited faster learning, greater
consistency, and more confident action selection. In contrast, the agent trained with the suboptimal
demonstration eventually converged, but took twice as long as with the expert demonstration.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. High Complexity Setting</title>
        <p>To further assess the performance of the GAIL, BC, and RL algorithms, we performed
evaluations in a highly complex environment. This environment included multiple birds with
different rewards and limited ammunition, as described in the Methodology section. Building on the
insights gained from the evaluations of imitation learning techniques above, we
modified the training approach for BC and GAIL. Instead of relying solely on demonstrations,
these algorithms were trained with a combination of intrinsic and extrinsic rewards. This
adjustment was made to address the tendency of BC and GAIL to deviate from an optimal
policy when trained with demonstrations only. The RL and IL comparison results in this highly
complex environment are presented in Figure 5 and Table 2, which compare the
RL, BC, and GAIL algorithms. These results offer insight into the performance and effectiveness
of each algorithm in this challenging setting.</p>
        <p>Traditional RL. In this highly complex environment, the traditional RL algorithm failed to
capture an effective bird-shooting strategy. Instead, it resorted to “cheating” the environment
by learning the average spawn locations of the red and yellow birds. The agent then focused
solely on shooting at these specific spots, barely moving the cursor. Remarkably, the traditional
RL agent achieved a score close to that of a human player using this method. This highlights
the ability of RL algorithms to exploit loopholes given sufficient time.</p>
        <p>Behavioral Cloning. In contrast, the BC algorithm encountered significant difficulties in
matching the score of a human player. Since the recorded demonstration did not exploit
the environment loophole but instead moved the cursor around, aiming at the red and
yellow birds while avoiding the black ones, the agent struggled to replicate the demonstrated
behavior and failed to converge or show improvement after a substantial number of iterations.
This underscores the limitations of imitation learning algorithms that rely solely on expert
demonstrations, and their reduced capacity for exploratory behavior compared to traditional RL.</p>
        <p>GAIL. Initially, the GAIL algorithm faced challenges similar to those of the BC algorithm. However, due
to its combined approach, GAIL was able to break free from the recorded behavior and discover the
same environment loophole exploited by the traditional RL algorithm. Ultimately, GAIL achieved
the highest score among all algorithms, surpassing even the recorded human demonstrations,
while achieving the lowest model entropy. This aligns with the notion that GAIL is particularly
effective in environments of high complexity and dimensionality.</p>
      </sec>
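<p>The entropy metric used throughout this comparison is the Shannon entropy of the policy's action distribution at a given state; lower values correspond to more confident action selection. A minimal PyTorch sketch:</p>

```python
import torch

# Shannon entropy of a policy's action distribution, computed from the
# action logits; lower entropy means more confident action choices.
def policy_entropy(logits: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log(probs)).sum(dim=-1)
```

For example, uniform logits over four actions give the maximum entropy ln(4) ≈ 1.386, while a sharply peaked distribution gives a value near zero.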
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>In conclusion, we compared policy optimization techniques and model architectures across
various environment complexities, providing valuable insights and avenues for future
research. PPO demonstrated stable convergence and lower model entropy, indicating increased
confidence in action selection. However, SAC exhibited superior sample efficiency and faster
convergence, emphasizing the stability-efficiency trade-off and making SAC favorable when time is
limited. The imitation learning algorithms converged more slowly but had lower model entropy,
relying heavily on expert demonstrations, which limited loophole exploitation. Traditional
reinforcement learning algorithms discovered loopholes through reward-shaping complexity rather
than learning the intended behavior. GAIL performed well by effectively capturing expert
demonstrations, achieving higher scores and lower model entropy. This highlights the potential of
imitation learning to overcome reinforcement learning limitations. On the other hand,
reinforcement learning outperformed imitation learning in simple low-complexity environments where
reward shaping is not challenging. Future research should explore performance in different
domains and develop hybrid approaches that take advantage of multiple algorithms to enhance
convergence, stability, and exploration capabilities.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partially funded by the German Ministry of Education and Research (BMBF) under
the TeachTAM project (Grant Number: 01IS17043) and the CAMELOT project (Grant Number:
01IW20008).</p>
    </sec>
    <sec id="sec-6">
      <title>References [11]-[20]</title>
      <p>[11] D. Hadfield-Menell, S. Milli, P. Abbeel, S. Russell, A. Dragan, Inverse reward design, 2017. arXiv:1711.02827.</p>
      <p>[12] B. Argall, S. Chernova, M. Veloso, B. Browning, A survey of robot learning from demonstration, Robotics and Autonomous Systems 57 (2009) 469–483.</p>
      <p>[13] C. Finn, S. Levine, P. Abbeel, Guided cost learning: Deep inverse optimal control via policy optimization, 2016. arXiv:1603.00448.</p>
      <p>[14] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, M. Hebert, Learning monocular reactive UAV control in cluttered natural environments, 2013.</p>
      <p>[15] J. Ho, S. Ermon, Generative adversarial imitation learning, 2016. arXiv:1606.03476.</p>
      <p>[16] F. Torabi, G. Warnell, P. Stone, Behavioral cloning from observation, 2018. arXiv:1805.01954.</p>
      <p>[17] M. E. Taylor, Improving reinforcement learning with human input, 2018.</p>
      <p>[18] Q. Fang, X. Xu, X. Wang, Y. Zeng, Target-driven visual navigation in indoor scenes using reinforcement learning and imitation learning, CAAI Transactions on Intelligence Technology 7 (2022) 167–176.</p>
      <p>[19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, 2017. arXiv:1707.06347.</p>
      <p>[20] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International Conference on Machine Learning, PMLR, 2018, pp. 1861–1870.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Toward human-centered ai: a perspective from human-computer interaction</article-title>
          , interactions
          <volume>26</volume>
          (
          <year>2019</year>
          )
          <fpage>42</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nowak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lukowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Horodecki</surname>
          </string-name>
          ,
          <article-title>Assessing artificial intelligence for humanity: Will ai be the our biggest ever advance? or the biggest threat [opinion]</article-title>
          ,
          <source>IEEE Technology and Society Magazine</source>
          <volume>37</volume>
          (
          <year>2018</year>
          )
          <fpage>26</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Bryson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Theodorou</surname>
          </string-name>
          ,
          <source>How Society Can Maintain Human-Centric Artificial Intelligence</source>
          , Springer Singapore, Singapore,
          <year>2019</year>
          , pp.
          <fpage>305</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shneiderman</surname>
          </string-name>
          ,
          <article-title>Human-centered artificial intelligence: Reliable, safe</article-title>
          &amp; trustworthy,
          <source>International Journal of Human-Computer Interaction</source>
          <volume>36</volume>
          (
          <year>2020</year>
          )
          <fpage>495</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Shavlik</surname>
          </string-name>
          ,
          <article-title>Combining symbolic and neural learning</article-title>
          ,
          <source>Machine Learning</source>
          <volume>14</volume>
          (
          <year>1994</year>
          )
          <fpage>321</fpage>
          -
          <lpage>331</lpage>
          . URL: http://link.springer.com/10.1007/BF00993982. doi:
          <volume>10</volume>
          .1007/BF00993982.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Von Rueden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Garcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bauckhage</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Schuecker,</surname>
          </string-name>
          <article-title>Informed machine learningtowards a taxonomy of explicit integration of knowledge into machine learning</article-title>
          ,
          <source>Learning</source>
          <volume>18</volume>
          (
          <year>2019</year>
          )
          <fpage>19</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ratner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <article-title>Knowledge base construction in the machine-learning era</article-title>
          ,
          <source>Queue</source>
          <volume>16</volume>
          (
          <year>2018</year>
          )
          <volume>50</volume>
          :
          <fpage>79</fpage>
          -
          <lpage>50</lpage>
          :
          <fpage>90</fpage>
          . URL: http://doi.acm.
          <source>org/10</source>
          .1145/3236386.3243045. doi:
          <volume>10</volume>
          .1145/ 3236386.3243045.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Adel</surname>
          </string-name>
          ,
          <article-title>Deep learning methods for knowledge base population</article-title>
          ,
          <source>Ph.D. thesis, LMU</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Asai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fukunaga</surname>
          </string-name>
          ,
          <article-title>Classical planning in deep latent space: Bridging the subsymbolicsymbolic boundary</article-title>
          ,
          <source>in: Proceedings of the Conference on Artificial Intelligence (AAAI'18)</source>
          , AAAI Press,
          <year>2018</year>
          , pp.
          <fpage>6094</fpage>
          -
          <lpage>6101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <source>Reinforcement learning: An introduction</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>