<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Human-AI Co-Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dimitrios Koutrintzes</string-name>
          <email>dkoutrintzes@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christos Spatharis</string-name>
          <email>cspatharis@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Dagioglou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Informatics and Telecommunications, National Centre for Scientific Research 'Demokritos'</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>19</volume>
      <issue>2024</issue>
      <abstract>
        <p>State-of-the-art AI methods allow us to develop agents that collaborate and co-learn with humans. The possibility to transfer knowledge from an expert to a novice human-AI team has the potential to streamline training, increase productivity and foster a more effective collaborative environment where individuals build on each other's strengths. In this context, we present an experimentation pipeline that can be followed during human-aware AI design and development in the case of transfer learning from expert to novice human-AI teams. Moreover, we tackle two intricate research questions of 'when to stop training' and 'what expert knowledge' to transfer. Our results of a study with two expert human participants demonstrate the complexities of the process and offer relevant guidelines for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>Human-AI collaboration</kwd>
        <kwd>Human-AI co-learning</kwd>
        <kwd>deep Reinforcement Learning</kwd>
        <kwd>Transfer learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Industry 5.0 brings forward a social integration of technology into the factory floor [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Humans
and society at large come at the centre of artificial intelligence (AI) systems, across their entire
life-cycle (from design to deployment and maintenance) through values-driven design, the
satisfaction of principles of ethical and trustworthy AI and importantly through cultivating a
congruous mentality among the stakeholders (developers, integrators, regulators, etc.).
Furthermore, human-centric digitisation challenges us to re-imagine and redesign industrial tasks in a
way that human and artificial agency are interwoven into a sustainable and resilient fabric.
      </p>
      <p>
        ‘Human-AI collaboration’ (HAIC) is an increasingly popular term, describing many different
things, and possibly shifting our attention from several ethical issues related to the integration
of AI in our society [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In the present work, HAIC, similarly to human-robot collaboration
(HRC), is used to describe systems where humans and AI (embodied or not) share goals and
perform interdependent actions and is a different paradigm compared to co-existence, interaction
and cooperation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. AI collaborators, from games [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] to robots in industrial set-ups [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], need
to incorporate qualities and capabilities that support fluent and seamless collaboration. Similar
to human joint action [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], AI agents need to support processes for common perceptual and
cognitive grounding, transparent agency attribution and co-learning [
        <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">8, 9, 10, 11</xref>
        ].
      </p>
      <p>
        The successful development of agents that collaborate with humans depends both on the
performance of state-of-the-art methods and the study of human behaviour. Deep reinforcement
learning (dRL) methods have allowed us to develop agents that collaborate and co-learn in real
time and in the real world [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. During the collaboration, it becomes possible to study how humans
perceive their interaction with an agent and how they behave in this context.
      </p>
      <p>
        The present work is related to the study of agents’ capabilities for co-learning and transfer
learning (TL). Like human de novo learning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], co-learning demands long training periods
and involves considerable physical effort and cognitive load. Transferring knowledge from
expert human-AI teams (HAIT) to novice HAITs can alleviate these complexities and support
retaining expert knowledge [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In HAIC, it is possible that the source and target of TL are
human individuals, while the environment remains constant. This means that TL is not between
different tasks or settings, but between people working in the same context. The expertise, skills
or insights of one person can be used to accelerate the learning and performance of another. This
‘person-to-person TL’ has the potential to streamline training, increase productivity and foster
a more effective collaborative environment where individuals build on each other’s strengths
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        The process of transferring knowledge comprises several challenges that can be perhaps
formulated as a trade-off between transferring the knowledge to perform and transferring the
knowledge to learn. Time- and cost-efficiency reasons might require opting for performance. On
the other hand, learning to learn leaves more space to individualisation, avoids ‘experts-biases’
and can lead to more sustainable and resilient behaviours in the long run, especially in tasks
that involve motor learning [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. To address such research questions we need adequate HAIC
studies that explore different TL techniques [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and evaluate both the performance of HAITs
in the collaborative task, as well as individualisation margins and subjective human attitudes.
      </p>
      <p>Beyond different TL methods, within the context of each method there are many design and
development considerations during the stage of training an AI agent with the expert human.
There is a number of choices, seemingly technical, from model initialisation to deciding when
to stop the training that can impact knowledge transfer and merit investigation on their own.</p>
      <p>In this context, in the present work we explore the training process of AI agents with
expert humans, as well as with trained expert agents, and demonstrate the complexity of the
design choices at hand. Our main research question is about “How to evaluate behaviour and
when to stop training an expert AI agent and transfer this knowledge to novice HAITs (for
further behavioural studies)?”. We provide design considerations that allow human-aware TL
during human-AI co-learning. We then present the results of a study with two expert human
participants that demonstrate the complexities of deciding what knowledge to transfer and
when to transfer it.</p>
      <p>The rest of the paper is organised as follows. Section 2 presents the work related to the present
paper. Section 3 describes the methods of the present work, including design considerations
for human-aware transfer learning in human-AI co-learning, the used co-learning task, the
details of the AI agents, as well as the experimental design and conditions. Section 4 reports
the related results. Finally, Section 5 discusses our findings and Section 6 concludes this report
with future challenges and research directions.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <p>
        Recent advancements in deep reinforcement learning (deep RL) have enabled the deployment
of complex systems that operate in real-time in dynamic environments. The success of deep RL
arises from its ability to learn complex motions and behaviours that are challenging to generate
using traditional hard-coded solutions. For example, dRL has been successfully implemented
for various robotic capabilities such as: navigation [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], robot arm control [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], grasping [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ],
drone maneuvering [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and human-robot co-learning [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. One popular paradigm in
co-learning tasks is Soft Actor-Critic (SAC) [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] due to its robustness in balancing exploration and
exploitation, which is crucial in interactive environments [
        <xref ref-type="bibr" rid="ref12 ref13 ref23">12, 23, 13</xref>
        ].
      </p>
      <p>
        Recently, there has been a growing emphasis on developing agents that engage with humans.
Several studies focus on games [
        <xref ref-type="bibr" rid="ref24 ref25 ref26">24, 25, 26</xref>
        ], however, due to their competitive nature, these
agents typically rely on choosing the best actions against a nearly optimal opponent. Conversely,
in collaborative or social settings, modelling and leveraging human behaviour to work alongside
agents is a highly challenging task [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. In the context of human-AI co-learning, both entities
can learn from each other and grow together over time [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. It must be noted that most HAIC
studies operate within well-defined discrete environments, such as Overcooked [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. In contrast,
our work addresses a continuous environment that necessitates collaboration between humans
and agents to generate multi-modal trajectories and collectively achieve a goal.
      </p>
      <p>
        A challenge of deploying dRL models in diverse or previously unexplored environments [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]
is to do so without requiring training from scratch. Especially in HAIC tasks, this process is
time-consuming and demanding, and the extended training periods can negatively affect the
performance of the team. These limitations highlight the importance of a different paradigm
that allows agents to reuse knowledge from one task to a related, yet distinct, task. In RL tasks,
this paradigm is referred to as transfer learning (TL).
      </p>
      <p>
        Multiple approaches have been proposed to facilitate knowledge transfer in RL settings [
        <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
        ].
TL aims to learn an optimal policy for a target task by leveraging external information from
a set of source tasks, as well as internal information from the target task. One of the most
prominent TL approaches, reward shaping (RS) [32, 33], uses external knowledge to modify the
reward distribution in the target task, by incorporating a reward-shaping function. By providing
additional rewards along with the environment’s own rewards, RS directs the agent toward more
optimal trajectories. Learning from demonstrations (LfD) allows RL agents to learn to perform
tasks by observing expert demonstrations. This approach can be further decomposed into offline
and online LfD, where the former uses demonstrations for pre-training the models [34, 35],
while the latter directly employs expert demonstrations to guide the agent’s behaviour for more
efficient exploration [36]. Finally, in policy transfer, a pre-trained policy on a source task is
directly applied to the target task. Policy transfer is further divided into TL via policy
distillation [37] and TL via policy reuse [38]. Policy distillation involves learning a model by
minimizing the divergence from multiple expert policies, while policy reuse leverages previously
learned policies by allowing the agent to draw from past experiences with some probability.
      </p>
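      <p>As an illustration of the reward-shaping idea above, a minimal potential-based sketch; the potential function, discount factor and target position below are hypothetical, not taken from any cited work:</p>

```python
# Hypothetical sketch of potential-based reward shaping: a potential function
# phi encodes external knowledge (here, distance to an assumed goal), and the
# shaped reward r' = r + gamma * phi(s') - phi(s) preserves the optimal policy.

GAMMA = 0.99  # assumed discount factor

def phi(state):
    # Hypothetical potential: negative Euclidean distance of the ball to the target.
    x, y = state
    tx, ty = -1.0, -1.0  # assumed target position
    return -(((x - tx) ** 2 + (y - ty) ** 2) ** 0.5)

def shaped_reward(r, state, next_state):
    # Shaping bonus added on top of the environment's own reward r.
    return r + GAMMA * phi(next_state) - phi(state)

# A step that moves the ball closer to the target receives a positive bonus:
bonus = shaped_reward(-1.0, (1.0, 1.0), (0.5, 0.5)) - (-1.0)
```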
      <p>
        Transferring knowledge in the context of HAIC comprises several complexities. Given the
nature of human (motor) learning, a core question is when to stop expert training and transfer the
knowledge. Just converging to a desired performance might not be the pursued goal. Injecting
variability in the transferred knowledge might be necessary to promote learning and leverage
future behaviour [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>The human involvement in HAIC tasks demands considerable time and effort from human
experts due to their active involvement throughout the entire agent training process. This
significantly constrains the fine-tuning of dRL model hyperparameters. Various studies [39,
40, 41] offer valuable insights on the selection of hyperparameters based on the methodologies
and environmental contexts. However, the assumption of hyperparameter tuning does not
directly apply to HAIC tasks, as the inclusion of humans in the training loop renders exhaustive
hyperparameter search infeasible, given the time and energy constraints involved.</p>
      <p>Finally, every hypothesis on the efficiency of a TL method needs to be evaluated through
its impact on the entire team, taking into account multiple teams. Evaluating the performance
of HAIC teams requires a holistic examination of the human-AI team performance and
co-learning dynamics during the collaboration, as well as analyzing individual contributions [42, 43].
Moreover, both objective and subjective metrics need to be incorporated in the evaluation process
to comprehensively assess the effectiveness of the team, as well as individual human behaviours
and experiences [44].</p>
    </sec>
    <sec id="sec-4">
      <title>3. Methods</title>
      <sec id="sec-4-1">
        <title>3.1. Design considerations for human-aware transfer learning in human-AI co-learning</title>
        <p>
          The ultimate goal of this work is to build AI agents that possess human-AI co-learning
capabilities. Transferring knowledge from expert HAITs can facilitate reasonable training periods for
a novice HAIT. Different TL methods are expected to result in different HAIT behaviour and
affect human behaviour and perceived interaction qualities. Which TL method allows faster
or more stable learning in the long run? Which TL method promotes individualisation and
alleviates superstitious learning as a result of expert-behaviour bias? These are examples of
research questions that can be pursued through rigorous testing during human studies that
attempt to capture ‘what knowledge has been transferred’. Within the context of TL, we need
to consider two design/development stages. These are presented in the table of Figure 1 and
capture our experience during HAIT studies ([
          <xref ref-type="bibr" rid="ref13 ref45">13, 45</xref>
          ] and other unpublished data). The overall
goal of HAIT studies is to either inform the next round of design and development or to actually
choose a deployable system (first row of the table in Figure 1). Experimental design and AI
model parameters need to be considered.
        </p>
        <p>In the case of TL methods, there is another experimental stage of design/development that
precedes that of HAIT studies. This stage is related to training expert HAIT teams (or possibly
expert AI-only teams) and aims at producing the knowledge to be transferred (policy,
demonstrations, etc.). Any design choices here will determine the AI agent’s model parameters in the
HAIT studies. In a sense, this is a set of design considerations, besides behavioural experimental
design, that needs to be controlled. Rigorous design and reporting of this stage is necessary
for guaranteeing transparency of the methods and reproducibility of the results, as well as for
facilitating comparison among studies and methods. We identify two main complexities, as a
result of having a human in the loop.</p>
        <p>Effort of choosing the AI model’s hyperparameters. In the absence of humans, iterative
testing of different sets of hyperparameters is carried out until a desirable performance is achieved.
In the event that human experts need to train the models, exhaustively investigating appropriate
parameters requires enormous effort and time. Alternative approaches such as using two
independent agents (instead of an agent and a human) for team training could be pursued.
However, based on our experience this might not work due to failing to represent important task
aspects (such as the collaborative nature of a task). On the other hand, the use of collaborative
agents introduces issues of decoupling the information later on.</p>
        <p>When to stop the training? While in an ‘AI-alone’ system the goal is to converge to the best
possible performance, in the case of transferring knowledge for co-learning this might not be
the desired outcome. Instead, a certain degree of variance in the transferred knowledge might
be pursued as a means of facilitating individualisation. As mentioned earlier, individualisation
prevents ‘experts-biases’ and can lead to more sustainable and resilient behaviours in the long
run. Such a design choice will also affect the set of chosen parameters.</p>
        <p>In the rest of the paper we focus on the ‘expert HAIT training’ and we demonstrate through an
experimental paradigm the process and results towards defining the knowledge to be transferred
before proceeding to a HAIT study.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Human-AI co-learning task</title>
        <p>
          A co-learning task, in a virtual environment, has been used to study HAIT behaviour, and
specifically expert HAIT training [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. A human collaborates with an RL agent to move a ball
from a starting to a target position (Figure 2), along a virtual tray. The tray rotates around two
axes; the human player controls the rotation around the y-axis, while the agent controls the
rotation around x-axis. The controlled variables are the angles θ, φ of the tray’s rotation. A
‘game’ lasts for a maximum of 40 seconds, and the team wins if the ball reaches the target before then.
A pair of human and agent actions is applied for 200 ms, resulting in up to 200 time-steps per game.
        </p>
        <p>Both team members can take three discrete actions: a) rotate the tray clockwise, b) rotate the
tray counter-clockwise, or c) leave the tray’s angle unchanged. The human collaborator applies
these actions through the keyboard by pressing ‘Right Arrow’ (&gt;), ‘Left Arrow’ (&lt;) or nothing,
accordingly. The tray rotates 30 degrees towards both sides, and each action causes an angle
change of around 5 degrees.</p>
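        <p>A minimal sketch of the action-to-angle mapping described above; the exact environment dynamics are not given, so the clamping behaviour and the exact 5-degree step are assumptions based on the stated ranges:</p>

```python
# Sketch of one tray axis: three discrete actions change the angle by about
# 5 degrees, clamped to the +/- 30 degree range stated above (values assumed).

MAX_ANGLE = 30.0
ANGLE_STEP = 5.0

ACTION_DELTAS = {-1: -ANGLE_STEP, 0: 0.0, 1: ANGLE_STEP}  # ccw / none / cw

def apply_action(angle, action):
    # Update the controlled angle and keep it within the physical range.
    return max(-MAX_ANGLE, min(MAX_ANGLE, angle + ACTION_DELTAS[action]))
```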
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Deep RL agent</title>
        <p>The rotation of the tray around the x-axis is controlled by an AI agent. A discrete version of
Soft Actor-Critic (SAC) has been used [46] to be more consistent with the discrete inputs provided
by the human user via a keyboard. A continuous 8-dimensional state space has been designed
to represent the environment’s configuration at each time-step, comprising: the ball’s position
(x, y) and speed (ẋ, ẏ), and the tray’s angles (θ, φ) and rotational velocities (θ̇, φ̇).</p>
        <p>
          The above quantities have been normalized to the [−1, 1] range to ensure training stability for
the neural networks. The dRL agent’s action space is 1-dimensional and discrete, a ∈ {−1, 0, 1},
corresponding to counter-clockwise, no, or clockwise change of the tray’s angle accordingly.
The agent receives a reward r = −1 for each elapsed time-step, and a reward r = 10 when the
team reaches the target. Figure 3 depicts the co-learning loop between the human and the RL
agent.
        </p>
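        <p>The stated reward scheme and normalization can be sketched as follows; the raw value ranges fed to the normalization are assumptions, since the paper only states the [−1, 1] target range:</p>

```python
# Sketch of the per-time-step reward signal and a generic [-1, 1] normalization.

def step_reward(reached_target):
    # r = -1 for each elapsed time-step, r = 10 when the team reaches the target.
    return 10.0 if reached_target else -1.0

def normalize(value, low, high):
    # Map a raw quantity from an assumed [low, high] range to [-1, 1].
    return 2.0 * (value - low) / (high - low) - 1.0
```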
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Human-AI co-learning process</title>
        <p>Two experts were involved in the training process. They are regarded as experts due to their
profound understanding of the game’s environment, and agent’s behaviour. Their expertise has
been developed through extensive gameplay, with each expert having spent a minimum of 100
hours playing the game.</p>
        <p>The human-AI co-learning process is presented in Figure 4. Each team had to complete six
experimental blocks, where each block comprised:</p>
        <p>-A testing batch of 5 games, where the agent followed a deterministic approach by applying
the argmax action of its policy. No interaction data were collected for the replay buffer during
this phase.</p>
        <p>-A training batch of 5 games, each of which was followed by a round of 250 offline gradient
updates (OGU). Here, the policy followed a stochastic approach by sampling from the Actor’s
categorical distribution. Interactions were stored in a buffer and used for OGU after each game.
Training began after the third game of the first block to ensure that the buffer had enough data.
So, in total there were 7000 OGUs across the entire learning procedure.</p>
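        <p>Under one consistent reading of this schedule (updates run after games 3-5 of block 1 and after every training game thereafter; this reading is an assumption), the stated total can be checked as:</p>

```python
# Sanity check of the stated total of 7000 offline gradient updates (OGU).

BLOCKS = 6
GAMES_PER_TRAINING_BATCH = 5
OGU_PER_GAME = 250
SKIPPED_GAMES = 2  # assumed: games 1-2 of block 1 only fill the buffer

def total_offline_updates():
    training_games = BLOCKS * GAMES_PER_TRAINING_BATCH - SKIPPED_GAMES
    return training_games * OGU_PER_GAME  # (30 - 2) * 250 = 7000
```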
        <p>At the end of the sixth block, one more testing batch was included. For all experiments, and
for both experts, the same initialization weights have been used. This was motivated by the
variability in the subsequent performance that this initial agent induces. The initialization
weights were selected randomly without any evaluation of their impact on the subsequent team
performance. None of the experts had any previous experience with this initialized agent.</p>
        <p>Each human expert repeated the co-learning procedure three times under two conditions. In
Experiment 1 the humans collaborated with a SAC agent. This has been the baseline condition
that was used during hyperparameter tuning. Observations during this experimental condition
are exploited to define the final set of hyperparameters as well as the expert HAITs behaviour to
be transferred. Note that the number of the experimental blocks (6) was chosen after observing
the convergence of performance during experimentation in the condition of Experiment 1.
While behaviours of individual experts could certainly be transferred, we wanted to explore the
effect of combining experts’ knowledge as a means of considering various, potentially different,
expert behaviours. In Experiment 2 the initialised agents were pre-trained using a combination
of two replay buffers sourced from the two experts (during Experiment 1). Each expert selected
their replay buffer based on their assessment of the best run. During the pre-training phase, the
agent underwent 2500 gradient updates. The final replay buffer contained approximately 3500
interactions, and new interactions during the collaboration replaced the old ones. In addition to
human-AI teams, in Experiment 3, we subjected three pairs of independent agents to the
co-learning procedure. Each pair consisted of a ‘novice’ SAC agent and a SAC agent pre-trained
by the experts during Experiment 1 (direct policy transfer).</p>
        <p>The performance of the teams across the games is evaluated by the achieved score in each
game. This is computed by starting from the maximum of 200 and discounting one point for each
time-step played. For example, a successful game of 10 seconds (50 time-steps) would result in a score of 150, while
an unsuccessful game would result in a score of 0. Moreover, the performance is qualitatively
evaluated through occupancy grids of the ball throughout the games.</p>
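        <p>The scoring rule can be written out explicitly; the maximum of 200 follows from the 40-second game length at 200 ms per action pair, consistent with the 50-time-step example yielding 150:</p>

```python
# Score sketch: one point discounted per elapsed time-step from the 200-step
# maximum; an unsuccessful game scores 0.

MAX_STEPS = 200  # 40 s game at one action pair per 200 ms

def game_score(steps_played, won):
    return MAX_STEPS - steps_played if won else 0
```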
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Results</title>
      <sec id="sec-5-1">
        <title>4.1. Hyperparameter tuning</title>
        <p>The hyperparameter values reported in Table 1 were defined after a series of experiments to
optimize the performance of the agent for the HAIC task. This time-consuming process involved
continuous human interaction in the training loop, requiring iterative testing and validation.
The human-in-the-loop involvement not only slowed down the cycle, but also introduced
variability, necessitating numerous trials to converge to optimal settings.</p>
        <p>In the following, we discuss the various hyperparameters adjusted during training and analyze
their impact on the overall performance of the method.</p>
        <p>Target entropy. The target entropy is crucial in SAC as it balances exploration and
exploitation. The equation for the target entropy contains a multiplier and the
entropy-maximizing term that gives all actions the same probability. A high multiplier
maximizes exploration but gives the agent a more random behaviour, while a
low multiplier allows more intense exploitation with minimal exploration. Based on our
experience, using multipliers taken from other set-ups [46] does not work, and experimentation is
needed with different multipliers [47] considering the context and the goal of each task. In our
case, we pursued a balance between exploration and exploitation considering a specific training
time and the desired variability in the experts’ behaviour.</p>
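        <p>A common formulation of the discrete-SAC target entropy consistent with the description above is a multiplier times the maximum entropy of a uniform distribution over the actions; the exact formula used here is not given in the paper, so this is an assumption:</p>

```python
# Assumed target-entropy sketch for discrete SAC with three actions.
import math

NUM_ACTIONS = 3  # counter-clockwise, none, clockwise

def target_entropy(multiplier):
    # log(|A|) is the entropy of a uniform policy over |A| actions;
    # the multiplier trades exploration (high) against exploitation (low).
    return multiplier * math.log(NUM_ACTIONS)
```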
        <p>Buffer sizes. The replay buffer (RB) is used to store past experiences to update the policy
in SAC. We tested various buffer sizes and our findings align with previous research [48],
showing that small RB sizes discard useful experiences over time, while very large sizes can also
negatively affect performance by including outdated and irrelevant experiences. Considering
that in each block the maximum number of experiences we can collect is 1000 (200 time-steps ×
5 games per block), we opted to use a replay buffer size of 3500 experiences. This size allows
us to discard initial sub-optimal experiences (from the first two blocks) and keep the latest
ones, ensuring that as the policy progresses, the co-learning process prioritizes the most recent
experiences gathered.</p>
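        <p>The chosen buffer behaviour (fixed capacity of 3500 experiences, oldest replaced first) can be sketched with a simple FIFO structure; the class below is an illustrative stand-in, not the paper's implementation:</p>

```python
from collections import deque

BUFFER_SIZE = 3500  # as chosen above

class ReplayBuffer:
    # Minimal FIFO replay buffer sketch: once full, each new interaction
    # silently evicts the oldest stored experience.
    def __init__(self, capacity=BUFFER_SIZE):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def __len__(self):
        return len(self.data)

buf = ReplayBuffer()
for i in range(4000):
    buf.add(i)
# Capacity reached: the earliest 500 experiences have been discarded.
```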
        <p>Update frequencies. A significant effort was dedicated to identifying the optimal frequency
for performing offline gradient updates (OGU) on the neural networks using experiences stored
in the replay buffer. According to our findings, performing multiple OGUs after many rounds of
interaction (e.g. after each block) between the human and the RL agent might result in stagnant
policies that contribute to a poorly filled replay buffer, which will not be as effective for the
update process. On the other hand, updating very frequently (e.g. during the game), while the
RL agent actively gathers experiences, could end up with the opposite result: the policy changes
while playing and confuses the co-learning process. In our case, executing multiple OGUs at the
end of each game achieved a good balance between policy exploitation and updating, hence we
believe that the update frequency depends greatly on the pipeline of the overall methodology.</p>
        <p>AI-only training. In an attempt to further optimize the hyperparameters, we considered
experimenting without human involvement in the training loop. To that end, we implemented
an agent-vs-agent training scheme, where one agent controlled one axis of the tray while
another controlled the other axis. The goal was to identify an optimal set of hyperparameters
through multiple experiments that could be conducted seamlessly in the absence of the human
expert. However, the independent agents failed to learn the task, highlighting the complexity of
hyperparameter tuning in multi-agent systems and the irreplaceable value of human expertise
in collaborative tasks.</p>
        <p>Overall, hyperparameter tuning in such tasks is particularly challenging as human
involvement is required. Additionally, the human must decide when to stop testing a set of
hyperparameters and move on to a new one, making the process even more complex. Despite the
difficulties, balancing exploration and exploitation and leveraging human expertise are essential
for effective training.</p>
        <p>Two human experts collaborated with the AI system (SAC agent) from scratch. No prior
experience was used in the replay buffer. Figure 5 (left) shows that both experts exhibit comparable
behaviour, converging to a high score towards the last testing blocks as a manifestation of expert
behaviour. On the other hand, blocks 1 to 4 show significant variability in the performance of
both experts, suggesting that in the initial stages of training, the human-AI team still explores
various strategies to jointly reach the target.</p>
        <p>[Figure 5: Team scores across test blocks 1-7. Left: Experiment 1, expert human - SAC agent teams; right: Experiment 2, expert human - pre-trained SAC agent teams.]</p>
        <p>Figure 6 shows the ball’s occupancy frequencies across the test blocks. In the first block,
the team performs poorly, with the ball remaining in the upper part of the tray, where cells
are highly occupied. In contrast, the second block shows more exploratory behavior, as the
team searches through the tray in order to find ways to reach the goal. As testing blocks
proceed, the coverage becomes sparser, indicating that HAIT has converged towards a more
direct approach to reaching the target, demonstrated by the ‘X’-shaped occupancy grid, with
increased occupancy in cells near the target, such as the lower left cell.</p>
        <p>[Figure 6 panels, per test block: B1 Wins=0 Dur.=858; B2 Wins=0 Dur.=861; B3 Wins=1 Dur.=731; B4 Wins=5 Dur.=149; B5 Wins=5 Dur.=177; B6 Wins=5 Dur.=232; B7 Wins=5 Dur.=189]</p>
        <p>During the second experiment we wanted to study the effect of TL within experts, as a possible
means of introducing some variability due to the different experts. A replay buffer that combined
knowledge from the two replay buffers of the two experts during Experiment 1 was used. These
data were used to pre-train the SAC agent before interacting with the human experts [36].
Specifically, we conducted 2500 OGUs prior to starting the games, and then followed the same
experimental setup as in Experiment 1.</p>
        <p>The results, depicted in Figure 5 (right), demonstrate the effectiveness of TL in the
collaborative task. Despite some poor performance (of one of the two HAITs) in the very first test
block, the overall performance of the teams across the rest of the blocks is consistently high.
This shows the robustness and the capability of the offline TL scheme to produce high-quality
solutions. The observed variability in the initial blocks can be attributed both to variance in
human performance and to some limited continuation of learning, as can be seen in the
heatmaps of Figure 7 (Blocks 1 and 2).</p>
        <p>Furthermore, in Figure 7, it can be noticed that the evolution of the occupancy grids differs
from Experiment 1. In particular, from the very first test block, the HAIT tends to explore
the lower half of the grid, while frequently reaching the lower left cell near the goal. In the
following test blocks, the team appears to quickly achieve optimal behaviour (i.e., an ‘X’-shaped
occupancy grid), successfully reaching the goal from any starting position. This knowledge
reuse approach essentially continues the learning process from where it concluded at the end
of Experiment 1, and the variability introduced in the pre-training procedure (joined expert
buffers) due to possibly different expert behaviours is not sufficient to trigger a significant drop
in the initial HAIT performance.</p>
        <p>Figure 7 (test blocks): B1 | Wins=1 | Dur.=779; B2 | Wins=5 | Dur.=257; B3 | Wins=5 | Dur.=196; B4 | Wins=5 | Dur.=302; B5 | Wins=5 | Dur.=140; B6 | Wins=5 | Dur.=159; B7 | Wins=5 | Dur.=125</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.4. Experiment 3: Pre-trained - novice SAC agents co-learning</title>
        <p>In a final set of experiments, we employed TL with a focus on direct policy transfer, in order to observe
the performance of two agents collaborating to achieve a common goal. Specifically, one agent
was initialized with a pre-trained policy from Experiment 1, while the other agent started with
no prior experience. Only the second agent participated in the training process, while the
parameters of the pre-trained agent remained fixed throughout the entire procedure. The
pre-trained agent was frozen to retain the behaviour learned from the experts, thereby better simulating
the condition of the previous experiments, where we consider that the behaviour of the
expert humans has reached a certain plateau.</p>
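<p>Freezing one agent while only the other learns can be sketched as follows (an illustrative sketch; `env_step` and the agent interfaces are hypothetical stand-ins, not the study’s code):</p>

```python
def colearn_step(env_step, frozen_agent, novice_agent, obs):
    """One interaction step: the frozen agent acts with fixed parameters
    and no exploration, while only the novice agent stores experience
    and performs gradient updates."""
    a_frozen = frozen_agent.act(obs, deterministic=True)  # policy never changes
    a_novice = novice_agent.act(obs)                      # still exploring
    next_obs, reward, done = env_step(a_frozen, a_novice)
    novice_agent.remember(obs, a_novice, reward, next_obs, done)
    novice_agent.learn()  # gradient update on the novice only
    return next_obs, done
```

In a deep-RL implementation the freeze would amount to excluding the pre-trained network’s parameters from the optimizer.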
        <p>Figure 8 shows the scores obtained during the test blocks. Despite the first agent being
equipped with a (sub-)optimal expert policy, the overall team performance was poor compared
to the previous experiments. Specifically, the agents failed to achieve the high scores seen
previously and exhibited high variability, indicating that the team had not developed the
necessary collaborative skills. This further highlights the crucial role of having a human expert
in the loop.</p>
        <p>Figure 8: Experiment 3: pre-trained - novice SAC agents co-learning (test blocks)</p>
        <p>As anticipated, the grids shown in Figure 9 fail to demonstrate meaningful behaviours, as the
team is rarely able to successfully reach the target. Instead, there is a noticeable tendency for
the agents to remain stuck along the edges of the grid for extended periods.</p>
        <p>Figure 9 (test blocks): B1 | Wins=0 | Dur.=831; B2 | Wins=0 | Dur.=832; B3 | Wins=3 | Dur.=412; B4 | Wins=0 | Dur.=832; B5 | Wins=0 | Dur.=836; B6 | Wins=0 | Dur.=833; B7 | Wins=0 | Dur.=930</p>
        <p>Additionally, we considered another evaluation scheme between a pre-trained and a novice
SAC agent. Specifically, we pre-trained one agent offline using batches from the replay buffer
that contained experiences from both experts, similar to Experiment 2. Following the same
procedure as in Experiment 3, the pre-trained agent had its weights frozen while the novice
agent underwent training. The results exhibited the same behaviour as before,
demonstrating poor performance and further supporting the need for a human expert in the
training procedure. Furthermore, all AI-only schemes were also tested with the pre-trained agent
participating in the training process, but produced similar or worse results.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Discussion</title>
      <p>Recent advances in AI, such as in dRL, allow us to develop systems where humans and AI
agents/robots learn together and collaborate to achieve common goals in scenarios where their
actions are interdependent. The design, development and validation of ‘human-AI collaboration’
(HAIC) systems comprises not only the development of the AI methods but also a rigorous
study of ‘what works for human collaborators’ and for the human-AI team (HAIT) at large. Such
an approach has been inherent to fields such as human-robot interaction, but is now widely
appreciated in the context of AI ethical assessment processes and the human-centric design
elements required by Industry 5.0.</p>
      <p>Along with the capability for collaboration comes the necessity of developing methods
that allow HAITs to co-learn. Although the technology to support this exists, learning is by nature a long
process. The possibility to transfer knowledge from an expert HAIT to a novice HAIT
could shorten training periods, increase productivity and prevent the loss of expert knowledge.</p>
      <p>In the present work, we have first listed several considerations for designing, developing and
deploying human-AI co-learning systems. These considerations come out of our experience
and follow practices of human-aware AI design. One of the most important aspects of having
humans-in-the-loop is that any ‘final’ solution needs to be validated with many users, in order to
evaluate not only the suitability of the chosen TL methods but also the entire collaborative process
as experienced by humans. This is already a complicated procedure that requires careful and
controlled experimental designs, due to the great variability that humans naturally exhibit.
Moreover, the execution of HAIT studies for TL presupposes that expert HAIT knowledge has
been captured (by one method or another); what is then evaluated is the effect of the transferred
knowledge on the co-learning process.</p>
      <p>Based on our experience, an important step that is necessary before any HAIT study of
TL from expert HAITs relates to the very procedure of ‘knowledge collection’ from expert
HAITs. A major complexity lies in choosing appropriate hyperparameters for the AI
models, as the human-in-the-loop nature of HAIC makes hyperparameter fine-tuning a costly
procedure. As shown in Experiment 3 (Section 4.4), exploring suitable
hyperparameters through AI-AI co-learning might not be possible, as it was not in our case. The
hyperparameters chosen in Section 4.1 were the result of tens of hours of game training
that involved the expert human players. As mentioned earlier, each expert has spent over 100
hours of training. This means that the exploration of hyperparameters is constrained both by
the effort required of each individual expert and by the fact that only a few experts may be
available.</p>
      <p>In this context, method designers and developers need to decide what constitutes a
‘satisfactory behaviour’ for a given task and context, and terminate the exploration of hyperparameters
based on tailored criteria rather than optimal performance. This has been the question
that we pursued in the presented work: “when to stop training the expert agents and transfer
the knowledge to novice teams of humans and experts?”. The choice of hyperparameters in
Section 4.1 and the results of Experiment 1 (Section 4.2) mirror our choices for stopping
the training procedure. Two important criteria for doing so are related to the characteristics of
the learning curve during HAIT co-learning and to the duration of the procedure, which will affect
the time required of each participant in later HAIT studies. Specifically, having in mind
that we want to study the effect of TL in novice HAITs using a learning-from-demonstrations
approach, the hyperparameters were chosen so as to:</p>
      <p>• have a learning curve that is neither steep nor shallow (regulated by the target entropy).
Such a curve allows the final buffers to include demonstrations that mirror the entire
learning process, including both bad and good games.</p>
      <p>• exclude from the buffers the very initial games that contained sub-optimal experiences.</p>
      <p>• not intervene in the learning process in an obtrusive and destructive way through an
inappropriate frequency of off-line gradient updates.</p>
      <p>• not exceed 6 experimental blocks in the future HAIT studies with novice HAITs.</p>
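<p>For the first criterion, the steepness of the learning curve in discrete SAC is largely governed by the target entropy, commonly set as a fraction of the maximum policy entropy [46, 47]. A small illustration (the 0.98 coefficient is a common choice in the literature, not necessarily the value used in this work):</p>

```python
import math

def target_entropy(n_actions, coef=0.98):
    """Target entropy for discrete SAC: a fraction of the maximum
    entropy of a uniform policy over n_actions discrete actions.
    Higher values keep the policy more exploratory and the learning
    curve shallower; lower values make it steeper and greedier."""
    max_entropy = math.log(n_actions)  # entropy of the uniform policy
    return coef * max_entropy
```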
      <p>Generally, the question of “what knowledge to transfer”, which is related to the TL method used,
involves the following possibilities:</p>
      <p>• Transfer knowledge from optimal performance towards the end of learning. This approach
could help the novice players receive refined strategies, potentially leading to quicker
adaptation to expert-level behaviours and higher performance.</p>
      <p>• Transfer knowledge from an earlier stage where greater variability exists, which could
possibly allow more individualisation to the behaviour of novice users. This approach
could be more flexible and adaptable to different users’ needs, preferences, and learning
styles.</p>
      <p>• Combine the knowledge of two or more experts, which could also provide some source
of variability in the behaviour. By integrating diverse expert experiences, novice players
could benefit from a richer set of policies, potentially leading to faster convergence as the
RL agent has explored the state space more deeply. Note that such variability was shown
to leave experts’ behaviour unaffected (Section 4.3).</p>
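<p>The first two options amount to sampling the expert buffer from different stages of learning; a hypothetical sketch, assuming transitions are stored in chronological order (function and parameter names are illustrative):</p>

```python
def knowledge_slice(buffer, stage="late", fraction=0.5):
    """Select expert transitions to transfer, assuming the buffer is
    chronological. 'late' favours refined end-of-learning behaviour,
    'early' keeps the more variable exploratory phase, and 'all'
    keeps everything, e.g. before merging several experts."""
    cut = int(len(buffer) * fraction)
    if stage == "late":
        return buffer[-cut:]   # refined strategies near convergence
    if stage == "early":
        return buffer[:cut]    # more variable, exploratory experiences
    return list(buffer)
```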
      <p>As a final note, we believe that the co-learning paradigm presented satisfies the needs of an
experimental set-up. Results produced in such environments can definitely guide design and
development in other contexts and tasks, as well as inform human-aware AI design. However,
the design of each system must be treated uniquely, based on the specific characteristics of each
environment and the participating actors.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions</title>
      <p>In the present work, we have demonstrated the complex dynamics involved in developing agents
capable of collaborating and co-learning with human experts. Specifically, we have presented
an experimentation pipeline that can be followed during human-aware AI design in the case of
transfer learning from expert to novice HAITs. Moreover, we tackled the two intricate research
questions of ‘when to stop training’ and ‘what expert knowledge to transfer’. By reporting
the results of the process we followed, we aim to contribute to future research designs that
are aligned with the needs of Industry 5.0 and trustworthy AI.</p>
      <p>The next step in our research involves examining how the choices outlined above affect the
transfer of knowledge to novice HAITs. Future studies will focus on assessing human behaviour
and subjective perceptions of collaboration in human-AI interactions, in addition to objective
team performance. By evaluating the transfer learning capabilities of our method with novice
HAITs, we aim to validate our findings and further refine our approach.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This research was (co-)funded by the European Union under GA no. 101135782 (MANOLO
project). Views and opinions expressed are however those of the authors only and do not
necessarily reflect those of the European Union or CNECT. Neither the European Union nor
CNECT can be held responsible for them.</p>
      <p>Pattern Analysis and Machine Intelligence 45 (2023).</p>
      <p>[31] M. Islam, The impact of transfer learning on ai performance across domains, Journal of AI General Science (JAIGS) 1 (2024).</p>
      <p>[32] A. Ng, D. Harada, S. J. Russell, Policy invariance under reward transformations: Theory and application to reward shaping, in: Int. Conf. on Machine Learning, 1999.</p>
      <p>[33] M. Vecerík, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. M. O. Heess, T. Rothörl, T. Lampe, M. A. Riedmiller, Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards, ArXiv (2017).</p>
      <p>[34] S. Schaal, Learning from demonstration, in: Proc. of the 9th Int. Conf. on Neural Information Processing Systems, 1996, pp. 1040–1046.</p>
      <p>[35] M. Yang, O. Nachum, Representation matters: Offline pretraining for sequential decision making, in: Int. Conf. on Machine Learning, 2021.</p>
      <p>[36] T. Hester, M. Vecerík, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, G. Dulac-Arnold, J. P. Agapiou, J. Z. Leibo, A. Gruslys, Deep q-learning from demonstrations, in: AAAI Conf. on AI, 2017.</p>
      <p>[37] G. E. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, ArXiv (2015).</p>
      <p>[38] F. Fernández, M. Veloso, Probabilistic policy reuse in a reinforcement learning agent, in: Proc. of the 5th Int. Joint Conf. on Autonomous Agents and Multiagent Systems, 2006.</p>
      <p>[39] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger, Deep reinforcement learning that matters, in: Proc. of the 32nd AAAI Conf. on AI and 13th Innovative Applications of AI Conf. and 8th AAAI Symposium on Educational Advances in AI, 2018.</p>
      <p>[40] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, A. Madry, Implementation matters in deep rl: A case study on ppo and trpo, in: Int. Conf. on Learning Representations, 2020.</p>
      <p>[41] T. Eimer, M. Lindauer, R. Raileanu, Hyperparameters in reinforcement learning and how to tune them, in: Proc. of the 40th Int. Conf. on Machine Learning, 2023.</p>
      <p>[42] K. van den Bosch, T. Schoonderwoerd, R. Blankendaal, M. Neerincx, Six challenges for human-ai co-learning (2019).</p>
      <p>[43] P. Chattopadhyay, D. Yadav, V. Prabhu, A. Chandrasekaran, A. Das, S. Lee, D. Batra, D. Parikh, Evaluating visual conversational agents via cooperative human-ai games, in: AAAI Conf. on Human Computation &amp; Crowdsourcing, 2017.</p>
      <p>[44] G. Hoffman, Evaluating fluency in human–robot collaboration, IEEE Transactions on Human-Machine Systems 49 (2019) 209–218.</p>
      <p>[45] D. Koutrintzes, Knowledge transfer in human-artificial intelligence collaboration, Master’s thesis, University of Piraeus, 2023.</p>
      <p>[46] P. Christodoulou, Soft actor-critic for discrete action settings, arXiv preprint arXiv:1910.07207 (2019).</p>
      <p>[47] Y. Xu, D. Hu, L. Liang, S. McAleer, P. Abbeel, R. Fox, Target entropy annealing for discrete soft actor-critic, 2021.</p>
      <p>[48] R. Liu, J. Y. Zou, The effects of memory replay in reinforcement learning, 2018 56th Annual Allerton Conf. on Communication, Control, and Computing (Allerton) (2017) 478–485.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Commission</surname>
          </string-name>
          , D.-G. for Research, Innovation,
          <string-name>
            <given-names>M.</given-names>
            <surname>Breque</surname>
          </string-name>
          , L. De Nul,
          <string-name>
            <given-names>A.</given-names>
            <surname>Petridis</surname>
          </string-name>
          ,
          <article-title>Industry 5.0 - Towards a sustainable, human-centric and resilient European industry</article-title>
          ,
          <source>Publications Office of the European Union</source>
          ,
          <year>2021</year>
          . doi:10.2777/308407.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <article-title>Enough with “human-ai collaboration”</article-title>
          ,
          <source>in: Extended Abstracts of the 2023 CHI Conf. on Human Factors in Computing Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bütepage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kragic</surname>
          </string-name>
          ,
          <article-title>Human-robot collaboration: From psychology to social robotics</article-title>
          ,
          <source>ArXiv abs/1705.10146</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sadigh</surname>
          </string-name>
          ,
          <article-title>Diverse conventions for human-ai collaboration</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Daronnat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halvey</surname>
          </string-name>
          ,
          <article-title>Impact of agents' errors on performance, reliance and trust in human-agent collaboration</article-title>
          ,
          <source>in: Proc. of the Human Factors and Ergonomics Society Annual Meeting</source>
          , volume
          <volume>64</volume>
          ,
          SAGE Publications Sage CA: Los Angeles, CA,
          <year>2020</year>
          , pp.
          <fpage>405</fpage>
          -
          <lpage>409</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Borboni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. V. V.</given-names>
            <surname>Reddy</surname>
          </string-name>
          , I. Elamvazuthi,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>AL-Quraishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Azhar</surname>
          </string-name>
          <string-name>
            <surname>Ali</surname>
          </string-name>
          ,
          <article-title>The expanding role of ai in collaborative robots for industrial applications: a systematic review of recent works</article-title>
          ,
          <source>Machines</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>111</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sebanz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bekkering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Knoblich</surname>
          </string-name>
          ,
          <article-title>Joint action: bodies and minds moving together</article-title>
          ,
          <source>Trends in cognitive sciences 10</source>
          (
          <year>2006</year>
          )
          <fpage>70</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Van Zoelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Van Den Bosch</surname>
          </string-name>
          , M. Neerincx,
          <article-title>Becoming team members: Identifying interaction patterns of mutual adaptation for human-robot co-learning,</article-title>
          <source>Frontiers in Robotics and AI</source>
          <volume>8</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>van den Bosch</surname>
          </string-name>
          , T. Schoonderwoerd,
          <string-name>
            <given-names>R.</given-names>
            <surname>Blankendaal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neerincx</surname>
          </string-name>
          ,
          <article-title>Six challenges for human-ai co-learning</article-title>
          ,
          <source>in: Adaptive Instructional Systems: 1st Int. Conf., AIS 2019, Held as Part of the 21st HCI Int. Conf., HCII 2019, Orlando, FL, USA, July 26-31, 2019, Proc. 21</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>572</fpage>
          -
          <lpage>589</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Holter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>El-Assady</surname>
          </string-name>
          ,
          <article-title>Deconstructing human-ai collaboration: Agency, interaction, and adaptation</article-title>
          ,
          <source>arXiv preprint arXiv:2404.12056</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vössing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kühl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lind</surname>
          </string-name>
          , G. Satzger,
          <article-title>Designing transparency for effective human-ai collaboration</article-title>
          ,
          <source>Information Systems Frontiers</source>
          <volume>24</volume>
          (
          <year>2022</year>
          )
          <fpage>877</fpage>
          -
          <lpage>895</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shafti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tjomsland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dudley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Faisal</surname>
          </string-name>
          ,
          <article-title>Real-world human-robot collaborative reinforcement learning*</article-title>
          ,
          <source>IEEE/RSJ Int. Conf. on Intel. Robots and Systems (IROS)</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Tsitos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dagioglou</surname>
          </string-name>
          ,
          <article-title>Enhancing team performance with transfer-learning during real-world human-robot collaboration</article-title>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Krakauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Hadjiosif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Haith</surname>
          </string-name>
          ,
          <article-title>Motor learning</article-title>
          ,
          <source>Compr Physiol</source>
          <volume>9</volume>
          (
          <year>2019</year>
          )
          <fpage>613</fpage>
          -
          <lpage>663</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Spitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kühl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goutier</surname>
          </string-name>
          , Training novices:
          <article-title>The role of human-ai collaboration and knowledge transfer</article-title>
          ,
          <source>arXiv preprint arXiv:2207.00497</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Dhawale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Ölveczky</surname>
          </string-name>
          ,
          <article-title>The role of variability in motor learning</article-title>
          ,
          <source>Annual review of neuroscience 40</source>
          (
          <year>2017</year>
          )
          <fpage>479</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Transfer learning in deep reinforcement learning: A survey</article-title>
          , arXiv preprint arXiv:2009.07888
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Honerkamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Welschehold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Valada</surname>
          </string-name>
          ,
          <article-title>Learning kinematic feasibility for mobile manipulation through deep rl</article-title>
          ,
          <source>IEEE Robotics and Automation Letters</source>
          <volume>6</volume>
          (
          <year>2021</year>
          )
          <fpage>6289</fpage>
          -
          <lpage>6296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lischuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Prazenica</surname>
          </string-name>
          ,
          <article-title>A deep rl approach for inverse kinematics solution of a high degree of freedom robotic manipulator</article-title>
          ,
          <source>Robotics</source>
          <volume>11</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M. Q.</given-names>
            <surname>Mohammed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Chyi</surname>
          </string-name>
          ,
          <article-title>Review of deep reinforcement learning-based object grasping: Techniques, open challenges, and recommendations</article-title>
          ,
          <source>IEEE Access 8</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kaufmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bauersfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Loquercio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koltun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scaramuzza</surname>
          </string-name>
          ,
          <article-title>Champion-level drone racing using deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>620</volume>
          (
          <year>2023</year>
          )
          <fpage>982</fpage>
          -
          <lpage>987</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Haarnoja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Learning to walk via deep reinforcement learning</article-title>
          ,
          <source>ArXiv</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lygerakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dagioglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karkaletsis</surname>
          </string-name>
          ,
          <article-title>Accelerating human-agent collaborative reinforcement learning</article-title>
          ,
          <source>in: Proc. of the 14th PErvasive Technologies Related to Assistive Environments Conf.</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Maddison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>van den Driessche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schrittwieser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Panneershelvam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanctot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dieleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grewe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graepel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <article-title>Mastering the game of go with deep neural networks and tree search</article-title>
          ,
          <source>Nature</source>
          <volume>529</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sandholm</surname>
          </string-name>
          ,
          <article-title>Superhuman AI for multiplayer poker</article-title>
          ,
          <source>Science</source>
          <volume>365</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Meta Fundamental AI Research Diplomacy Team (FAIR)†</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Farina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Flaherty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fried</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Jacob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Komeili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Konath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mitts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Renduchintala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Spisak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zijlstra</surname>
          </string-name>
          ,
          <article-title>Human-level play in the game of diplomacy by combining language models with strategic reasoning</article-title>
          ,
          <source>Science</source>
          <volume>378</volume>
          (
          <year>2022</year>
          )
          <fpage>1067</fpage>
          -
          <lpage>1074</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Seshia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dragan</surname>
          </string-name>
          ,
          <article-title>On the utility of learning about humans for human-AI coordination</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y. C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y. J.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>Human-AI co-learning for data-driven AI</article-title>
          ,
          <source>ArXiv</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>La</surname>
          </string-name>
          ,
          <article-title>Review of deep reinforcement learning for robot manipulation</article-title>
          ,
          <source>in: 3rd IEEE Int. Conf. on Robotic Computing (IRC)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>590</fpage>
          -
          <lpage>595</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Transfer learning in deep RL: A survey</article-title>
          ,
          <source>IEEE Trans. on</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>