Human-Aware Design for Transferring Knowledge During Human-AI Co-Learning

Dimitrios Koutrintzes1, Christos Spatharis1 and Maria Dagioglou1
1 Institute of Informatics and Telecommunications, National Centre for Scientific Research ‘Demokritos’

HAII5.0: Embracing Human-Aware AI in Industry 5.0, Workshop at ECAI 2024, Oct. 19, 2024, Santiago de Compostela, ES
dkoutrintzes@iit.demokritos.gr (D. Koutrintzes); cspatharis@iit.demokritos.gr (C. Spatharis); mdagiogl@iit.demokritos.gr (M. Dagioglou)
ORCID: 0009-0003-7401-6347 (D. Koutrintzes); 0009-0001-2791-2291 (C. Spatharis); 0000-0002-3357-2844 (M. Dagioglou)

Abstract
State-of-the-art AI methods allow us to develop agents that collaborate and co-learn with humans. The possibility to transfer knowledge from an expert to a novice human-AI team has the potential to streamline training, increase productivity and foster a more effective collaborative environment where individuals build on each other’s strengths. In this context, we present an experimentation pipeline that can be followed during human-aware AI design and development in the case of transfer learning from expert to novice human-AI teams. Moreover, we tackle the two intricate research questions of ‘when to stop training’ and ‘what expert knowledge’ to transfer. The results of a study with two expert human participants demonstrate the complexities of the process and offer relevant guidelines for future research.

Keywords
Human-AI collaboration, Human-AI co-learning, deep Reinforcement Learning, Transfer learning, Human-aware design, Expert’s behaviour

1. Introduction

Industry 5.0 brings forward a social integration of technology into the factory floor [1]. Humans and society at large are placed at the centre of artificial intelligence (AI) systems, across their entire life-cycle (from design to deployment and maintenance), through values-driven design, the satisfaction of the principles of ethical and trustworthy AI and, importantly, through cultivating a congruous mentality among the stakeholders (developers, integrators, regulators, etc.). Furthermore, human-centric digitisation challenges us to re-imagine and redesign industrial tasks in a way that human and artificial agency are interwoven into a sustainable and resilient fabric.

‘Human-AI collaboration’ (HAIC) is an increasingly popular term, describing many different things, and possibly shifting our attention away from several ethical issues related to the integration of AI in our society [2]. In the present work, HAIC, similarly to human-robot collaboration (HRC), is used to describe systems where humans and AI (embodied or not) share goals and perform interdependent actions; this is a different paradigm compared to co-existence, interaction and cooperation [3]. AI collaborators, from games [4, 5] to robots in industrial setups [6], need to incorporate qualities and capabilities that support fluent and seamless collaboration. Similar to human joint action [7], AI agents need to support processes for common perceptual and cognitive grounding, transparent agency attribution and co-learning [8, 9, 10, 11].

The successful development of agents that collaborate with humans depends both on the performance of state-of-the-art methods and on the study of human behaviour.
Deep reinforcement learning (dRL) methods have made it possible to develop agents that collaborate and co-learn in real time and in the real world [12, 13]. During the collaboration, it becomes possible to study how humans perceive their interaction with an agent and how they behave in this context.

The present work is related to the study of agents’ capabilities for co-learning and transfer learning (TL). Like human de novo learning [14], co-learning demands long training periods and involves considerable physical effort and cognitive load. Transferring knowledge from expert human-AI teams (HAITs) to novice HAITs can alleviate these complexities and support retaining expert knowledge [15]. In HAIC, it is possible that the source and target of TL are human individuals, while the environment remains constant. This means that TL is not between different tasks or settings, but between people working in the same context. The expertise, skills or insights of one person can be used to accelerate the learning and performance of another. This ‘person-to-person TL’ has the potential to streamline training, increase productivity and foster a more effective collaborative environment where individuals build on each other’s strengths [15].

The process of transferring knowledge comprises several challenges that can perhaps be formulated as a trade-off between transferring the knowledge to perform and transferring the knowledge to learn. Time and cost efficiency reasons might require opting for performance. On the other hand, learning to learn leaves more space for individualisation, avoids ‘expert biases’ and can lead to more sustainable and resilient behaviours in the long run, especially in tasks that involve motor learning [16]. To address such research questions we need adequate HAIC studies that explore different TL techniques [17] and evaluate the performance of HAITs in the collaborative task, as well as individualisation margins and subjective human attitudes.

Beyond the choice among different TL methods, within the context of each method there are many design and development considerations during the stage of training an AI agent with the expert human. There are a number of seemingly technical choices, from model initialisation to deciding when to stop training, that can impact knowledge transfer and merit investigation on their own.

In this context, in the present work we explore the training process of AI agents with expert humans, as well as with trained expert agents, and demonstrate the complexity of the design choices at hand. Our main research question is: “How to evaluate behaviour, and when to stop training an expert AI agent and transfer this knowledge to novice HAITs (for further behavioural studies)?”. We provide design considerations that allow human-aware TL during human-AI co-learning. We then present the results of a study with two expert human participants that demonstrate the complexities of deciding what knowledge to transfer and when to transfer it.

The rest of the paper is organised as follows. Section 2 presents the work related to the present paper. Section 3 describes the methods of the present work, including design considerations for human-aware transfer learning in human-AI co-learning, the co-learning task used, the details of the AI agents, and the experimental design and conditions. Section 4 reports the related results. Finally, Section 5 discusses our findings and Section 6 concludes this report with future challenges and research directions.
2. Background

Recent advancements in deep reinforcement learning (dRL) have enabled the deployment of complex systems that operate in real time in dynamic environments. The success of dRL arises from its ability to learn complex motions and behaviours that are challenging to generate using traditional hard-coded solutions. For example, dRL has been successfully implemented for various robotic capabilities such as navigation [18], robot arm control [19], grasping [20], drone manoeuvring [21] and human-robot co-learning [12]. One popular algorithm in co-learning tasks is Soft Actor-Critic (SAC) [22], due to its robustness in balancing exploration and exploitation, which is crucial in interactive environments [12, 23, 13].

Recently, there has been a growing emphasis on developing agents that engage with humans. Several studies focus on games [24, 25, 26]; however, due to their competitive nature, these agents typically rely on choosing the best actions against a nearly optimal opponent. Conversely, in collaborative or social settings, modelling and leveraging human behaviour to work alongside agents is a highly challenging task [27]. In the context of human-AI co-learning, both entities can learn from each other and grow together over time [28]. It must be noted that most HAIC studies operate within well-defined discrete environments, such as Overcooked [27]. In contrast, our work addresses a continuous environment that necessitates collaboration between humans and agents to generate multi-modal trajectories and collectively achieve a goal.

A challenge of deploying dRL models in diverse or previously unexplored environments [29] is to do so without requiring training from scratch. Especially in HAIC tasks, this process is time-consuming and demanding, and the extended training periods can negatively affect the performance of the team. These limitations highlight the importance of a different paradigm that allows agents to reuse knowledge from one task in a related, yet distinct, task. In RL tasks, this paradigm is referred to as transfer learning (TL). Multiple approaches have been proposed to facilitate knowledge transfer in RL settings [30, 31]. TL aims to learn an optimal policy for a target task by leveraging external information from a set of source tasks, as well as internal information from the target task. One of the most prominent TL approaches, reward shaping (RS) [32, 33], uses external knowledge to modify the reward distribution in the target task by incorporating a reward-shaping function. By providing additional rewards along with the environment’s own rewards, RS directs the agent toward more optimal trajectories. Learning from demonstrations (LfD) allows RL agents to learn to perform tasks by observing expert demonstrations. This approach can be further decomposed into offline and online LfD, where the former uses demonstrations for pre-training the models [34, 35], while the latter directly employs expert demonstrations to guide the agent’s behaviour for more efficient exploration [36]. Finally, in policy transfer, a policy pre-trained on a source task is directly applied to the target task. Policy transfer is further divided into TL via policy distillation [37] and TL via policy reuse [38]. Policy distillation involves learning a model by minimizing the divergence from multiple expert policies, while policy reuse leverages previously learned policies by allowing the agent to act according to them with some probability.
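To make the last distinction concrete, the sketch below illustrates the action-selection rule behind probabilistic policy reuse in the spirit of [38]; the policy objects, the reuse probability and its decay schedule are hypothetical placeholders for illustration, not part of the method used later in this paper.

```python
import random

def select_action(state, past_policy, current_policy, p_reuse):
    """Probabilistic policy reuse (illustrative): with probability p_reuse act
    according to a previously learned policy, otherwise follow the policy that
    is currently being learned."""
    if random.random() < p_reuse:
        return past_policy(state)      # exploit transferred knowledge
    return current_policy(state)       # act with, and keep improving, the new policy

# The reuse probability is typically decayed across episodes, so that the
# agent gradually comes to rely on its own policy.
p_reuse, decay = 0.9, 0.95
for episode in range(100):
    # ... run the episode, calling select_action(...) at every time-step ...
    p_reuse *= decay
```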
Transferring knowledge in the context of HAIC involves several complexities. Given the nature of human (motor) learning, a core question is when to stop expert training and transfer the knowledge. Just converging to a desired performance might not be the pursued goal. Injecting variability into the transferred knowledge might be necessary to promote learning and leverage future behaviour [16].

HAIC tasks also demand considerable time and effort from human experts, due to their active involvement throughout the entire agent training process. This significantly constrains the fine-tuning of dRL model hyperparameters. Various studies [39, 40, 41] offer valuable insights into the selection of hyperparameters based on the methodologies and environmental contexts. However, such insights do not directly apply to HAIC tasks, as the inclusion of humans in the training loop renders exhaustive hyperparameter search infeasible, given the time and energy constraints involved.

Finally, every hypothesis on the efficiency of a TL method needs to be evaluated through its impact on the entire team, taking into account multiple teams. Evaluating the performance of HAIC teams requires a holistic examination of the human-AI team performance and co-learning dynamics during the collaboration, as well as an analysis of individual contributions [42, 43]. Moreover, both objective and subjective metrics need to be incorporated in the evaluation process to comprehensively assess the effectiveness of the team, as well as individual human behaviours and experiences [44].

3. Methods

3.1. Design considerations for human-aware transfer learning in human-AI co-learning

The ultimate goal of this work is to build AI agents that possess human-AI co-learning capabilities. Transferring knowledge from expert HAITs can facilitate reasonable training periods for a novice HAIT. Different TL methods are expected to result in different HAIT behaviour and affect human behaviour and perceived interaction qualities. Which TL method allows faster or more stable learning in the long run? Which TL method promotes individualisation and alleviates superstitious learning as a result of expert-behaviour bias? These are examples of research questions that can be pursued through rigorous testing during human studies that attempt to capture ‘what knowledge has been transferred’.

Within the context of TL, we need to consider two design/development stages. These are presented in the table of Figure 1 and capture our experience during HAIT studies ([13, 45] and other unpublished data). The overall goal of HAIT studies is to either inform the next round of design and development or to actually choose a deployable system (first row of the table in Figure 1). Experimental design and AI model parameters need to be considered.

In the case of TL methods, there is another experimental stage of design/development that precedes that of HAIT studies. This stage is related to training expert HAITs (or possibly expert AI-only teams) and aims at producing the knowledge to be transferred (policy, demonstrations, etc.). Any design choices here will determine the AI agent’s model parameters in the HAIT studies. In a sense, this is a set of design considerations, besides behavioural experimental design, that needs to be controlled.
Rigorous design and reporting of this stage is necessary for guaranteeing transparency of the methods and reproducibility of the results, as well as for facilitating comparison among studies and methods.

Figure 1: Design considerations for human-aware transfer learning in human-AI co-learning.

We identify two main complexities, as a result of having a human in the loop.

Effort of choosing the AI model’s hyperparameters. In the absence of humans, iterative testing of different sets of hyperparameters is carried out until a desirable performance is achieved. In the event that human experts need to train the models, exhaustively investigating appropriate parameters requires enormous effort and time. Alternative approaches, such as using two independent agents (instead of an agent and a human) for team training, could be pursued. However, based on our experience, this might not work because important task aspects (such as the collaborative nature of the task) are not represented. On the other hand, the use of collaborative agents introduces issues of decoupling the information later on.

When to stop the training? While in an ‘AI-alone’ system the goal is to converge to the best possible performance, in the case of transferring knowledge for co-learning this might not be the desired outcome. Instead, a certain degree of variance in the transferred knowledge might be pursued as a means of facilitating individualisation. As mentioned earlier, individualisation prevents ‘expert biases’ and can lead to more sustainable and resilient behaviours in the long run. Such a design choice will also affect the set of chosen parameters.

In the rest of the paper we focus on the ‘expert HAIT training’ stage and demonstrate, through an experimental paradigm, the process and results towards defining the knowledge to be transferred before proceeding to a HAIT study.

3.2. Human-AI co-learning task

A co-learning task in a virtual environment has been used to study HAIT behaviour, and specifically expert HAIT training [23]. A human collaborates with an RL agent to move a ball from a starting to a target position (Figure 2), along a virtual tray. The tray rotates around two axes; the human player controls the rotation around the y-axis, while the agent controls the rotation around the x-axis. The controlled variables are the angles θ, φ of the tray’s rotation. A ‘game’ lasts for a maximum of 40 seconds and the team wins if the ball reaches the target before then. A pair of human and agent actions is applied for 200 ms, resulting in a maximum of 200 time-steps per game.

Figure 2: HAIC task virtual environment. A white ball (d=1 unit) travels from one of three starting positions (enumerated white circles) to a target position (green hole), along a 10x10-unit tray. The ball is constrained within the tray by a 1-unit-high wall. Two obstacles placed across the main diagonal of the tray force the ball to move through a 1.4-unit ‘gate’ along its trip from the start to the goal positions. A countdown timer is presented to users during a game (bottom left), while study and time statistics are shown at the end of each game.

Both team members can take three discrete actions: a) rotate the tray clockwise, b) rotate the tray counter-clockwise, or c) leave the tray’s angle unchanged. The human collaborator applies these actions through the keyboard by pressing ‘Right Arrow’ (>), ‘Left Arrow’ (<) or nothing, respectively. The tray rotates up to 30 degrees towards each side, and each action causes an angle change of around 5 degrees.
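A minimal sketch of the action-to-angle mapping described above is given below; the exact increment, the axis assignment and, in particular, the simulator’s ball physics are simplifications and assumptions of this illustration rather than the actual implementation.

```python
import numpy as np

ANGLE_STEP = 5.0     # approximate angle change (degrees) per action
ANGLE_LIMIT = 30.0   # the tray rotates up to 30 degrees towards each side
DT = 0.2             # a human-agent action pair is applied for 200 ms
MAX_STEPS = 200      # a 40-second game at 200 ms per time-step

def apply_actions(theta, phi, human_action, agent_action):
    """One 200 ms control step: map the two discrete actions in {-1, 0, +1}
    (counter-clockwise, no change, clockwise) to clipped angle increments.
    The ball's rolling dynamics and its collisions with the walls, the
    obstacles and the target hole are resolved by the simulator and are not
    reproduced here."""
    theta = float(np.clip(theta + human_action * ANGLE_STEP, -ANGLE_LIMIT, ANGLE_LIMIT))
    phi = float(np.clip(phi + agent_action * ANGLE_STEP, -ANGLE_LIMIT, ANGLE_LIMIT))
    return theta, phi
```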
3.3. Deep RL agent

The rotation of the tray around the x-axis is controlled by an AI agent. A discrete version of Soft Actor-Critic (SAC) has been used [46], to be consistent with the discrete inputs provided by the human user via the keyboard. A continuous 8-dimensional state space has been designed to represent the environment’s configuration at each time-step, comprising the ball’s position $(x, y)$ and velocity $(\dot{x}, \dot{y})$, and the tray’s angles $(\phi, \theta)$ and rotational velocities $(\dot{\phi}, \dot{\theta})$. The above quantities have been normalized to the $[-1, 1]$ range to ensure training stability for the neural networks. The dRL agent’s action space is 1-dimensional and discrete, $a \in \{-1, 0, 1\}$, corresponding to a counter-clockwise, no, or clockwise change of the tray’s angle, respectively. The agent receives a reward of $r = -1$ for each elapsed time-step, and a reward of $r = 10$ when the team reaches the target. Figure 3 depicts the co-learning loop between the human and the RL agent.

Figure 3: Co-learning process between an RL agent and a human expert.

3.4. Human-AI co-learning process

Two experts were involved in the training process. They are regarded as experts due to their profound understanding of the game’s environment and the agent’s behaviour. Their expertise has been developed through extensive gameplay, with each expert having spent a minimum of 100 hours playing the game.

The human-AI co-learning process is presented in Figure 4. Each team had to complete six experimental blocks, where each block comprised:
- A testing batch of 5 games, where the agent followed a deterministic approach by applying the argmax action of its policy. No interaction data were collected for the replay buffer during this phase.
- A training batch of 5 games, each of which was followed by a round of 250 off-line gradient updates (OGU). Here, the policy followed a stochastic approach by sampling from the Actor’s categorical distribution. Interactions were stored in a buffer and used for OGU after each game.

Training began after the third game of the first block, to ensure that the buffer had enough data. So, in total there were 7000 OGU across the entire learning procedure. At the end of the sixth block, one more testing batch was included.

Figure 4: An experimental block consists of 5 test games (test batch) and 5 training games (train batch). Each training game is followed by an off-line gradient update (OGU).

For all experiments, and for both experts, the same initialization weights have been used. This was motivated by the variability in subsequent team performance that the initial agent can induce. The initialization weights were selected randomly, without any evaluation of their impact on the subsequent team performance. None of the experts had any previous experience with this initialized agent.

Each human expert repeated the co-learning procedure three times under each of two conditions. In Experiment 1 the humans collaborated with a SAC agent. This has been the baseline condition that was used during hyperparameter tuning. Observations during this experimental condition were exploited to define the final set of hyperparameters, as well as the expert HAITs’ behaviour to be transferred. Note that the number of experimental blocks (6) was chosen after observing the convergence of performance during experimentation in the condition of Experiment 1. While behaviours of individual experts could certainly be transferred, we wanted to explore the effect of combining experts’ knowledge as a means of considering various, potentially different, expert behaviours.
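For clarity, the experimental schedule of Figure 4, as described above, is summarised in the sketch below; the three callables stand in for the task and the SAC agent and are hypothetical interfaces, not the actual implementation.

```python
GAMES_PER_BATCH = 5
N_BLOCKS = 6
OGU_PER_GAME = 250
MIN_GAMES_BEFORE_TRAINING = 3   # updates start after the 3rd game of block 1

def run_schedule(play_test_game, play_train_game, gradient_update):
    """Run the block structure of Figure 4 (sketch)."""
    total_updates = 0
    train_games = 0
    for _ in range(N_BLOCKS):
        for _ in range(GAMES_PER_BATCH):        # test batch: argmax policy,
            play_test_game()                    # nothing stored in the buffer
        for _ in range(GAMES_PER_BATCH):        # train batch: sampled actions,
            play_train_game()                   # interactions stored in the buffer
            train_games += 1
            if train_games >= MIN_GAMES_BEFORE_TRAINING:
                for _ in range(OGU_PER_GAME):   # 250 off-line gradient updates
                    gradient_update()           # after each training game
                total_updates += OGU_PER_GAME
    for _ in range(GAMES_PER_BATCH):            # one extra test batch after block 6
        play_test_game()
    return total_updates                        # (6 * 5 - 2) * 250 = 7000
```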
In Experiment 2 the initialised agents were pre-trained using a combination of two replay buffers sourced from the two experts (during Experiment 1). Each expert selected their replay buffer based on their assessment of the best run. During the pre-training phase, the agent underwent 2500 gradient updates. The final replay buffer contained approximately 3500 interactions, and new interactions during the collaboration replaced the old ones.

In addition to human-AI teams, in Experiment 3 we subjected three pairs of independent agents to the co-learning procedure. Each pair consisted of a ‘novice’ SAC agent and a SAC agent as pre-trained by the experts during Experiment 1 (direct policy transfer).

The performance of the teams across the games is evaluated by the achieved score in each game. This is computed by starting from the maximum of 200 points and discounting one point for each time-step played. For example, a successful game of 10 seconds (50 time-steps) would result in a score of 150, while an unsuccessful game would result in a score of 0. Moreover, the performance is qualitatively evaluated through occupancy grids of the ball throughout the games.

4. Results

4.1. Hyperparameter tuning

The hyperparameter values reported in Table 1 were defined after a series of experiments to optimize the performance of the agent for the HAIC task. This time-consuming process involved continuous human interaction in the training loop, requiring iterative testing and validation. The human-in-the-loop involvement not only slowed down the cycle, but also introduced variability, necessitating numerous trials to converge to optimal settings.

Table 1: Final set of hyperparameters.

Hyperparameter              Value
Number of layers            2 fully connected layers
Hidden layer units          [32, 32]
Replay buffer size          3500
Off-line gradient updates   250
Batch size                  256
Discount rate               0.99
Learning rate               0.0003
Optimizer                   Adam
Weight initializer          Xavier
Target entropy              0.5 * (-log(1/|A|))

In the following, we discuss the various hyperparameters adjusted during training and analyze their impact on the overall performance of the method.

Target entropy. The target entropy H is crucial in SAC, as it balances exploration and exploitation. The target entropy is computed as a multiplier times the maximum entropy of the action distribution, i.e. the entropy obtained when all actions have the same probability (with |A| = 3 actions, 0.5 * (-log(1/|A|)) = 0.5 * log 3, roughly 0.55 for the natural logarithm). A high multiplier maximizes exploration but makes the agent behave more randomly, while a low multiplier allows more intense exploitation with minimal exploration. Based on our experience, using multipliers reported for other set-ups [46] does not work, and experimentation with different multipliers is needed [47], considering the context and the goal of each task. In our case, we pursued a balance between exploration and exploitation, considering a specific training time and the desired variability in the experts’ behaviour.

Buffer sizes. The replay buffer (RB) is used to store past experiences to update the policy in SAC. We tested various buffer sizes and our findings align with previous research [48], showing that small RB sizes discard useful experiences over time, while very large sizes can also negatively affect performance by including outdated and irrelevant experiences. Considering that in each block the maximum number of experiences we can collect is 1000 (200 time-steps x 5 games per block), we opted to use a replay buffer size of 3500 experiences. This size allows us to discard initial sub-optimal experiences (from the first two blocks) and keep the latest ones, ensuring that as the policy progresses, the co-learning process prioritizes the most recent experiences gathered.
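A minimal sketch of such a bounded buffer, under the assumption of uniform mini-batch sampling, is given below; it is an illustration of the behaviour described above, not the project’s actual code.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded FIFO replay buffer: once the capacity of 3500 interactions is
    reached, new experiences replace the oldest ones, so the buffer holds
    roughly the last three and a half blocks of training games (at most 1000
    stored time-steps per block)."""

    def __init__(self, capacity=3500):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        # Uniformly sample a mini-batch for one off-line gradient update.
        return random.sample(list(self.memory), batch_size)
```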
Update frequencies. A significant effort was dedicated to identifying the optimal frequency for performing offline gradient updates (OGU) on the neural networks using experiences stored in the replay buffer. According to our findings, performing multiple OGUs after many rounds of interaction (e.g. after each block) between the human and the RL agent might result in stagnant policies that contribute to a poorly filled replay buffer, which will not be as effective for the update process. On the other hand, updating very frequently (e.g. during the game), while the RL agent actively gathers experiences, could end up with the opposite result: the policy changes while playing and confuses the co-learning process. In our case, executing multiple OGUs at the end of each game achieved a good balance between policy exploitation and updating; hence, we believe that the update frequency depends greatly on the pipeline of the overall methodology.

AI-only training. In an attempt to further optimize the hyperparameters, we considered experimenting without human involvement in the training loop. To that end, we implemented an agent-vs-agent training scheme, where one agent controlled one axis of the tray while another controlled the other axis. The goal was to identify an optimal set of hyperparameters through multiple experiments that could be conducted seamlessly in the absence of the human expert. However, the independent agents failed to learn the task, highlighting the complexity of hyperparameter tuning in multi-agent systems and the irreplaceable value of human expertise in collaborative tasks.

Overall, hyperparameter tuning in such tasks is particularly challenging as human involvement is required. Additionally, the human must decide when to stop testing a set of hyperparameters and move on to a new one, making the process even more complex. Despite the difficulties, balancing exploration and exploitation and leveraging human expertise are essential for effective training.

4.2. Experiment 1: Human - SAC agent co-learning

Two human experts collaborated with the AI system (SAC agent) from scratch. No prior experience was used in the replay buffer. Figure 5 (left) shows that both experts exhibit comparable behaviour, converging to a high score towards the last testing blocks as a manifestation of expert behaviour. On the other hand, blocks 1 to 4 show significant variability in the performance of both experts, suggesting that in the initial stages of training, the human-AI team still explores various strategies to jointly reach the target.

Figure 5: Scores of HAIT during the test games for Experiment 1 (left) and Experiment 2 (right). For each expert, the games across all three runs are collapsed in each boxplot. The median for each run is shown by the open black circles.
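Beyond the per-game scores, the analyses below also rely on ball-occupancy grids (Figures 6, 7 and 9). A minimal sketch of how such a grid can be computed is given below; the 10x10 cell resolution and the coordinate convention are assumptions of this illustration, not details reported for the actual system.

```python
import numpy as np

def occupancy_grid(ball_positions, n_cells=10, tray_size=10.0):
    """Count how often the ball visits each cell of the tray across the test
    games of a block. ball_positions is an iterable of (x, y) ball positions
    in tray units, with the origin assumed at the lower-left corner."""
    grid = np.zeros((n_cells, n_cells), dtype=int)
    cell = tray_size / n_cells
    for x, y in ball_positions:
        row = min(int(y // cell), n_cells - 1)
        col = min(int(x // cell), n_cells - 1)
        grid[row, col] += 1
    return grid
```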
Figure 6 shows the ball’s occupancy frequencies across the test blocks. In the first block, the team performs poorly, with the ball remaining in the upper part of the tray, where cells are highly occupied. In contrast, the second block shows more exploratory behavior, as the team searches through the tray in order to find ways to reach the goal. As testing blocks proceed, the coverage becomes sparser, indicating that the HAIT has converged towards a more direct approach to reaching the target, demonstrated by the ‘X’-shaped occupancy grid, with increased occupancy in cells near the target, such as the lower left cell.

Figure 6: Ball occupancy frequency across the seven blocks during the test batches of one HAIT in Experiment 1 (per-block wins and durations: B1 Wins=0 Dur.=858, B2 Wins=0 Dur.=861, B3 Wins=1 Dur.=731, B4 Wins=5 Dur.=149, B5 Wins=5 Dur.=177, B6 Wins=5 Dur.=232, B7 Wins=5 Dur.=189).

4.3. Experiment 2: Human - Pre-trained SAC agent co-learning

During the second experiment we wanted to study the effect of TL within experts, as a possible means of introducing some variability due to the different experts. A replay buffer that combined knowledge from the replay buffers of the two experts during Experiment 1 was used. These data were used to pre-train the SAC agent before interacting with the human experts [36]. Specifically, we conducted 2500 OGUs prior to starting the games, and then followed the same experimental setup as in Experiment 1.

The results, depicted in Figure 5 (right), demonstrate the effectiveness of TL in the collaborative task. Despite some poor performance (of one of the two HAITs) in the very first test block, the overall performance of the teams across the rest of the blocks is consistently high. This shows the robustness and the capability of the offline TL scheme to produce high-quality solutions. The observed variability in the initial blocks can be attributed both to variance in human performance and to some limited continuation of learning, as can be seen in the heatmaps of Figure 7 (Blocks 1 and 2).

Furthermore, in Figure 7, it can be noticed that the evolution of the occupancy grids differs from Experiment 1. In particular, from the very first test block, the HAIT tends to explore the lower half of the grid, while frequently reaching the lower left cell near the goal. In the following test blocks, the team appears to quickly achieve optimal behaviour (i.e., the ‘X’-shaped occupancy grid), successfully reaching the goal from any starting position. This knowledge reuse approach essentially continues the learning process from where it concluded at the end of Experiment 1, and the variability introduced in the pre-training procedure (joint expert buffers) due to possibly different expert behaviours is not sufficient to trigger a significant drop in the initial HAIT performance.

Figure 7: Ball occupancy frequency across the seven blocks during the test batches of one HAIT in Experiment 2 (per-block wins and durations: B1 Wins=1 Dur.=779, B2 Wins=5 Dur.=257, B3 Wins=5 Dur.=196, B4 Wins=5 Dur.=302, B5 Wins=5 Dur.=140, B6 Wins=5 Dur.=159, B7 Wins=5 Dur.=125).

4.4. Experiment 3: Pre-trained - novice SAC agents co-learning

In a final set of experiments, we employed TL with a focus on direct policy transfer to observe the performance of two agents collaborating to achieve a common goal. Specifically, one agent was initialized with a pre-trained policy from Experiment 1, while the other agent started with no prior experience. Only the second agent participated in the training process, while the parameters of the pre-trained agent remained fixed throughout the entire procedure. The pre-trained agent was frozen in order to retain the behaviour learned by the experts, better simulating in this way the condition of the previous experiments, where we consider that the behaviour of the expert humans has reached a certain plateau.
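In implementation terms, keeping the expert agent fixed amounts to freezing its network parameters, for example as in the PyTorch-style sketch below (an illustration assuming a PyTorch actor network, not the project’s actual code).

```python
import torch

def freeze(actor: torch.nn.Module) -> torch.nn.Module:
    """Freeze a pre-trained actor network so that only the novice agent is
    updated during the co-learning procedure (direct policy transfer)."""
    for param in actor.parameters():
        param.requires_grad = False   # no gradients flow into the expert policy
    actor.eval()                      # also fix layers such as dropout, if present
    return actor
```

Equivalently, the frozen network’s parameters can simply be excluded from the optimiser used during the off-line gradient updates.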
Figure 8 shows the scores obtained during the test blocks. Despite the first agent being equipped with a (sub-)optimal expert policy, the overall team performance was poor compared to the previous experiments. Specifically, the agents failed to achieve the high scores seen previously and exhibited high variability, indicating that the team had not developed the necessary collaborative skills. This further highlights the crucial role of having a human expert in the loop.

Figure 8: Scores of agent teams during the test games for Experiment 3.

As anticipated, the grids shown in Figure 9 fail to demonstrate meaningful behaviours, as the team is rarely able to successfully reach the target. Instead, there is a noticeable tendency for the agents to remain stuck along the edges of the grid for extended periods.

Figure 9: Ball occupancy frequency across the seven blocks during the test batches of one run in Experiment 3 (per-block wins and durations: B1 Wins=0 Dur.=831, B2 Wins=0 Dur.=832, B3 Wins=3 Dur.=412, B4 Wins=0 Dur.=832, B5 Wins=0 Dur.=836, B6 Wins=0 Dur.=833, B7 Wins=0 Dur.=930).

Additionally, we considered another evaluation scheme between a pre-trained and a novice SAC agent. Specifically, we pre-trained one agent offline using batches from the replay buffer that contained experiences from both experts, similar to Experiment 2. Following the same procedure as in Experiment 3, the pre-trained agent had its weights frozen while the novice agent underwent training. The results exhibited the same behaviour as the previous scheme, demonstrating poor performance and further supporting the need for a human expert in the training procedure. Furthermore, all AI-only schemes were also tested with the pre-trained agent participating in the training process, but produced similar or worse results.

5. Discussion

Recent advances in AI, such as in dRL, allow us to develop systems where humans and AI agents/robots learn together and collaborate to achieve common goals in scenarios where their actions are interdependent. The design, development and validation of ‘human-AI collaboration’ (HAIC) systems comprises not only the development of the AI methods but also a vigorous study of ‘what works for human collaborators’ and for the human-AI team (HAIT) at large. Such an approach has been inherent to fields such as human-robot interaction, but is now widely appreciated in the context of AI ethical assessment processes and the human-centric design elements required by Industry 5.0.

Along with the capability for collaboration comes the necessity of developing methods that allow HAITs to co-learn. Although the technology to support this exists, learning is a long process by nature. The possibility to transfer knowledge from an expert HAIT to a novice HAIT could shorten training periods, increase productivity and prevent loss of expert knowledge.

In the present work, we have first listed several considerations for designing, developing and deploying human-AI co-learning systems. These considerations come out of our experience and follow practices of human-aware AI design.
One of the most important aspects of having humans-in-the-loop is that any ‘final’ solution needs to be validated with many users, in order to evaluate not only the suitability of the chosen TL methods but also the entire collaborative process as experienced by humans. This is already a complicated procedure that needs careful and controlled experimental designs, due to the very nature of humans, who exhibit great variability. Moreover, the execution of HAIT studies for TL presupposes that expert HAIT knowledge has been captured (by one method or another), and what is then evaluated is the effect of the transferred knowledge on the co-learning process.

Based on our experience, an important step that is necessary before any HAIT study for TL from expert HAITs is related to the very procedure of ‘knowledge collection’ from expert HAITs. A major complexity is related to choosing appropriate hyperparameters for the AI models, as the human-in-the-loop nature of HAIC makes hyperparameter fine-tuning a costly procedure. As shown in Experiment 3 (Section 4.4), exploring suitable hyperparameters through AI-AI co-learning might not be possible, as it was not in our case. The hyperparameters chosen in Section 4.1 have been the result of tens of hours of game training that involved the expert human players. As mentioned earlier, each expert has spent over 100 hours of training. This means that the exploration of hyperparameters is constrained both by the effort needed from each individual expert and by the fact that only a few experts might be available.

In this context, method designers and developers need to decide what constitutes a ‘satisfactory behaviour’ for a given task and context, and terminate the exploration of hyperparameters based on tailored criteria rather than optimal performance. This has been the question that we pursued in the presented work: “when to stop training the expert agents and transfer the knowledge to novice teams of humans and expert agents?”. The choice of hyperparameters in Section 4.1 and the results of Experiment 1 (Section 4.2) actually mirror our choices for stopping the training procedure. Two important criteria for doing so are related to the characteristics of the learning curve during the HAIT co-learning and to the duration of the procedure, which will affect the time required of each participant in HAIT studies later on. Specifically, bearing in mind that we want to study the effect of TL on novice HAITs using a learning-from-demonstrations approach, the hyperparameters were chosen so as to:
• have a learning curve that is neither steep nor shallow (regulated by the target entropy). Such a curve allows the final buffers to include demonstrations that mirror the entire learning process, including both bad and good games.
• exclude from the buffers the very initial games that had sub-optimal experiences.
• not intervene in an obtrusive and destructive way in the learning process through an inappropriate frequency of the off-line gradient updates.
• not exceed 6 experimental blocks in the future HAIT studies with novice HAITs.

More generally, ‘what knowledge to transfer’, which is related to the TL method used, involves the following possibilities:
• Transfer knowledge from optimal performance towards the end of learning. This approach could help novice players receive refined strategies, potentially leading to quicker adaptation to expert-level behaviors and higher performance.
• Transfer knowledge from an earlier stage where greater variability exists, which could possibly allow more individualisation of the behaviour of novice users. This approach could be more flexible and adaptable to different users’ needs, preferences, and learning styles.
• Combine the knowledge of two or more experts, which could also provide some source of variability in the behaviour. By integrating diverse expert experiences, novice players could benefit from a richer set of policies, potentially leading to faster convergence, as the RL agent has explored the state space more deeply. Note that such variability was shown to leave experts’ behaviour unaffected (Section 4.3).

As a final note, we believe that the co-learning paradigm presented satisfies the needs of an experimental set-up. Results produced in such environments can definitely guide design and development in other contexts and tasks, as well as inform human-aware AI design. However, the design of each system must be treated uniquely, based on the specific characteristics of each environment and the participating actors.

6. Conclusions

In the present work, we have demonstrated the complex dynamics involved in developing agents capable of collaborating and co-learning with human experts. Specifically, we have presented an experimentation pipeline that can be followed during human-aware AI design in the case of transfer learning from expert to novice HAITs. Moreover, we tackled the two intricate research questions of ‘when to stop training’ and ‘what expert knowledge to transfer’. By reporting the results of the process we followed, we aim to contribute to future research designs that are in line with the needs of Industry 5.0 and trustworthy AI.

The next step in our research involves examining how the choices outlined above affect the transfer of knowledge to novice HAITs. Future studies will focus on assessing human behavior and subjective perceptions of collaboration in human-AI interactions, in addition to objective team performance. By evaluating the transfer learning capabilities of our method with novice HAITs, we aim to validate our findings and further refine our approach.

Acknowledgments

This research was (co-)funded by the European Union under GA no. 101135782 (MANOLO project). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or CNECT. Neither the European Union nor CNECT can be held responsible for them.

References

[1] E. Commission, D.-G. for Research, Innovation, M. Breque, L. De Nul, A. Petridis, Industry 5.0 – Towards a sustainable, human-centric and resilient European industry, Publications Office of the European Union, 2021. doi:10.2777/308407.
[2] A. Sarkar, Enough with “human-ai collaboration”, in: Extended Abstracts of the 2023 CHI Conf. on Human Factors in Computing Systems, 2023, pp. 1–8.
[3] J. Bütepage, D. Kragic, Human-robot collaboration: From psychology to social robotics, ArXiv abs/1705.10146 (2017).
[4] B. Sarkar, A. Shih, D. Sadigh, Diverse conventions for human-ai collaboration, Advances in Neural Information Processing Systems 36 (2024).
[5] S. Daronnat, L. Azzopardi, M. Halvey, Impact of agents’ errors on performance, reliance and trust in human-agent collaboration, in: Proc. of the Human Factors and Ergonomics Society Annual Meeting, volume 64, SAGE Publications Sage CA: Los Angeles, CA, 2020, pp. 405–409.
[6] A. Borboni, K. V. V. Reddy, I. Elamvazuthi, M. S. AL-Quraishi, E. Natarajan, S. S. Azhar Ali, The expanding role of ai in collaborative robots for industrial applications: a systematic review of recent works, Machines 11 (2023) 111.
[7] N. Sebanz, H. Bekkering, G. Knoblich, Joint action: bodies and minds moving together, Trends in cognitive sciences 10 (2006) 70–76.
[8] E. M. Van Zoelen, K. Van Den Bosch, M. Neerincx, Becoming team members: Identifying interaction patterns of mutual adaptation for human-robot co-learning, Frontiers in Robotics and AI 8 (2021).
[9] K. van den Bosch, T. Schoonderwoerd, R. Blankendaal, M. Neerincx, Six challenges for human-ai co-learning, in: Adaptive Instructional Systems: 1st Int. Conf., AIS 2019, Held as Part of the 21st HCI Int. Conf., HCII 2019, Orlando, FL, USA, July 26–31, 2019, Proc. 21, Springer, 2019, pp. 572–589.
[10] S. Holter, M. El-Assady, Deconstructing human-ai collaboration: Agency, interaction, and adaptation, arXiv preprint arXiv:2404.12056 (2024).
[11] M. Vössing, N. Kühl, M. Lind, G. Satzger, Designing transparency for effective human-ai collaboration, Information Systems Frontiers 24 (2022) 877–895.
[12] A. Shafti, J. Tjomsland, W. Dudley, A. A. Faisal, Real-world human-robot collaborative reinforcement learning, IEEE/RSJ Int. Conf. on Intel. Robots and Systems (IROS) (2020).
[13] A. C. Tsitos, M. Dagioglou, Enhancing team performance with transfer-learning during real-world human-robot collaboration (2022).
[14] J. W. Krakauer, A. M. Hadjiosif, J. Xu, A. L. Wong, A. M. Haith, Motor learning, Compr Physiol 9 (2019) 613–663.
[15] P. Spitzer, N. Kühl, M. Goutier, Training novices: The role of human-ai collaboration and knowledge transfer, arXiv preprint arXiv:2207.00497 (2022).
[16] A. K. Dhawale, M. A. Smith, B. P. Ölveczky, The role of variability in motor learning, Annual review of neuroscience 40 (2017) 479–498.
[17] Z. Zhu, K. Lin, J. Zhou, Transfer learning in deep reinforcement learning: A survey, arXiv preprint arXiv:2009.07888 (2020).
[18] D. Honerkamp, T. Welschehold, A. Valada, Learning kinematic feasibility for mobile manipulation through deep rl, IEEE Robotics and Automation Letters 6 (2021) 6289–6296.
[19] A. Malik, Y. Lischuk, T. Henderson, R. Prazenica, A deep rl approach for inverse kinematics solution of a high degree of freedom robotic manipulator, Robotics 11 (2022).
[20] M. Q. Mohammed, K. L. Chung, C. S. Chyi, Review of deep reinforcement learning-based object grasping: Techniques, open challenges, and recommendations, IEEE Access 8 (2020).
[21] E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Mueller, V. Koltun, D. Scaramuzza, Champion-level drone racing using deep reinforcement learning, Nature 620 (2023) 982–987.
[22] T. Haarnoja, A. Zhou, S. Ha, J. Tan, G. Tucker, S. Levine, Learning to walk via deep reinforcement learning, ArXiv (2018).
[23] F. Lygerakis, M. Dagioglou, V. Karkaletsis, Accelerating human-agent collaborative reinforcement learning, in: Proc. of the 14th PErvasive Technologies Related to Assistive Environments Conf., 2021, pp. 90–92.
[24] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of go with deep neural networks and tree search, Nature 529 (2016).
[25] N. Brown, T. Sandholm, Superhuman ai for multiplayer poker, Science 365 (2019).
[26] M. F. A. R. D. T. (FAIR)†, A. Bakhtin, N. Brown, E. Dinan, G. Farina, C. Flaherty, D. Fried, A. Goff, J. Gray, H. Hu, A. P. Jacob, M. Komeili, K. Konath, M. Kwon, A. Lerer, M. Lewis, A. H. Miller, S. Mitts, A. Renduchintala, S. Roller, D. Rowe, W. Shi, J. Spisak, A. Wei, D. Wu, H. Zhang, M. Zijlstra, Human-level play in the game of diplomacy by combining language models with strategic reasoning, Science 378 (2022) 1067–1074.
[27] M. Carroll, R. Shah, M. K. Ho, T. L. Griffiths, S. A. Seshia, P. Abbeel, A. Dragan, On the utility of learning about humans for human-AI coordination, 2019.
[28] Y. C. Huang, Y. T. Cheng, L. L. Chen, J. Y. J. Hsu, Human-ai co-learning for data-driven ai, ArXiv (2019).
[29] H. Nguyen, H. La, Review of deep reinforcement learning for robot manipulation, in: 3rd IEEE Int. Conf. on Robotic Computing (IRC), 2019, pp. 590–595.
[30] Z. Zhu, K. Lin, A. K. Jain, J. Zhou, Transfer learning in deep rl: A survey, IEEE Trans. on Pattern Analysis and Machine Intelligence 45 (2023).
[31] M. Islam, The impact of transfer learning on ai performance across domains, Journal of AI General science (JAIGS) 1 (2024).
[32] A. Ng, D. Harada, S. J. Russell, Policy invariance under reward transformations: Theory and application to reward shaping, in: Int. Conf. on Machine Learning, 1999.
[33] M. Vecerík, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. M. O. Heess, T. Rothörl, T. Lampe, M. A. Riedmiller, Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards, ArXiv (2017).
[34] S. Schaal, Learning from demonstration, in: Proc. of the 9th Int. Conf. on Neural Information Processing Systems, 1996, pp. 1040–1046.
[35] M. Yang, O. Nachum, Representation matters: Offline pretraining for sequential decision making, in: Int. Conf. on Machine Learning, 2021.
[36] T. Hester, M. Vecerík, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, G. Dulac-Arnold, J. P. Agapiou, J. Z. Leibo, A. Gruslys, Deep q-learning from demonstrations, in: AAAI Conf. on AI, 2017.
[37] G. E. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, ArXiv (2015).
[38] F. Fernández, M. Veloso, Probabilistic policy reuse in a reinforcement learning agent, in: Proc. of the 5th Int. Joint Conf. on Autonomous Agents and Multiagent Systems, 2006.
[39] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger, Deep reinforcement learning that matters, in: Proc. of the 32nd AAAI Conf. on AI and 13th Innovative Applications of AI Conf. and 8th AAAI Symposium on Educational Advances in AI, 2018.
[40] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, A. Madry, Implementation matters in deep rl: A case study on ppo and trpo, in: Int. Conf. on Learning Representations, 2020.
[41] T. Eimer, M. Lindauer, R. Raileanu, Hyperparameters in reinforcement learning and how to tune them, in: Proc. of the 40th Int. Conf. on Machine Learning, 2023.
[42] K. van den Bosch, T. Schoonderwoerd, R. Blankendaal, M. Neerincx, Six challenges for human-ai co-learning (2019).
[43] P. Chattopadhyay, D. Yadav, V. Prabhu, A. Chandrasekaran, A. Das, S. Lee, D. Batra, D. Parikh, Evaluating visual conversational agents via cooperative human-ai games, in: AAAI Conf. on Human Computation & Crowdsourcing, 2017.
[44] G. Hoffman, Evaluating fluency in human–robot collaboration, IEEE Transactions on Human-Machine Systems 49 (2019) 209–218.
[45] D. Koutrintzes, Knowledge transfer in human-artificial intelligence collaboration, Master’s thesis, University of Piraeus, 2023.
[46] P. Christodoulou, Soft actor-critic for discrete action settings, arXiv preprint arXiv:1910.07207 (2019).
[47] Y. Xu, D. Hu, L. Liang, S. McAleer, P. Abbeel, R. Fox, Target entropy annealing for discrete soft actor-critic, 2021.
[48] R. Liu, J. Y. Zou, The effects of memory replay in reinforcement learning, 2018 56th Annual Allerton Conf. on Communication, Control, and Computing (Allerton) (2017) 478–485.