Supporting Impaired People with a Following Robotic
Assistant by means of End-to-End Visual Target Navigation
and Reinforcement Learning Approaches
Nguyen Ngoc Dat1, Valerio Ponzi1, Samuele Russo2 and Francesco Vincelli1
1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Rome, Italy
2 Department of Psychology, Sapienza University of Rome, Via dei Marsi 78, 00185 Rome, Italy


Abstract
We present an improvement in visual object tracking and navigation for mobile robots, implementing the advantage actor-critic (A2C) reinforcement learning architecture on top of the Gym-Gazebo framework. This work provides an easier way to integrate reinforcement learning algorithms for navigation and object tracking tasks in robotics. We train the convolutional-recurrent model employed for policy estimation in an end-to-end manner. The robot is able to follow a simulated human walking in an indoor environment by using the sequence of images provided by the robot camera; the input of the algorithm is acquired and processed directly in the ROS-Gazebo environment. The policy learned by the robot agent proved to generalize well to environments of different size and shape from the training one. Moreover, the policy allows the robot to avoid obstacles while following the tracked target. Thanks to these improvements, we can straightforwardly apply the tracking system on a real-world robot for a person-following task in indoor environments.

Keywords
Visual Object Tracking, Human Tracking, Person Following, Human Robot Interaction, ROS, Gazebo, Gym-Gazebo, Visual Navigation, Reinforcement Learning, Advantage Actor Critic



1. Introduction

For some categories of subjects defined as "fragile", the presence of an assistant who can guide and help them can be a valuable support. In some cases this assistant is not only useful but necessary, especially for subjects who show spatial orientation problems. The idea of this study is to accompany the person with a robot that follows and helps them whenever they need to regain visuospatial orientation. For example, when the subject walks down the street, the robot recognizes whether the path taken is the right one and, if not, tells the person that they have taken the wrong path and suggests the appropriate one. This functionality is very useful, and sometimes indispensable, when the patient has cognitive problems with implications for executive functions and for visuospatial orientation abilities [1, 2, 3]. Where self-monitoring and visuospatial planning skills are lacking, such a robot can help the patient on the one hand to maintain greater autonomy and independence, and on the other to reinforce, strengthen and rehabilitate visuospatial skills. In fact, the use of the robot would not be limited to replacing the action in which the person has failed: it would also help the person reflect on the mistake made, asking questions that prompt reflection and reasoning. Furthermore, this activation of reasoning can occur even before the person is about to commit a visuospatial error. For example, the robot can prompt the person to plan the path that is shorter or easier to reach. The robot can also act as a "companion" with which the person talks, so that the person then chooses and decides independently, with the help of the robot's questions, the place where they prefer to go. In fact, patients with mild cognitive impairment easily act on impulse, without adequately planning the path and without having reflected on the goals for which they choose to follow it. From a psychological and neuropsychological point of view, the use of robots plays a fundamental role in supporting, on the one hand, the self-determination and autonomy of the person and, on the other, in slowing down their decline and activating robot-mediated neuropsychological enhancement processes.

Object detection is the process of detecting an object in the frames of a video sequence, while object tracking is the process of finding the direction of an object moving around a scene [4]. The main steps of the tracking process are: (1) detection of moving objects, (2) tracking of the detected object from the current frame to the next, and (3) analysis of the tracked objects to recognize their behaviour. Visual tracking plays an important role in many computer vision applications, including image and video processing, pattern recognition, information retrieval, automation and control.
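To make these three steps concrete, below is a deliberately naive, self-contained Python sketch based on frame differencing and centroid following. It is far simpler than the learned tracker developed in this paper and is only meant to illustrate the detect-track-analyze loop; none of its names belong to the system described here.

import numpy as np

def detect_moving(prev, curr, thresh=25):
    """Step (1): crude motion detection by frame differencing.
    Returns the centroid of the changed pixels, or None."""
    diff = np.abs(curr.astype(int) - prev.astype(int)) > thresh
    ys, xs = np.nonzero(diff)
    if len(xs) == 0:
        return None
    return float(xs.mean()), float(ys.mean())

def track(frames):
    """Steps (2)-(3): follow the detected centroid frame to frame,
    then report the net direction of motion in degrees."""
    trajectory, prev = [], None
    for curr in frames:
        if prev is not None:
            c = detect_moving(prev, curr)
            if c is not None:
                trajectory.append(c)
        prev = curr
    if len(trajectory) >= 2:
        dx = trajectory[-1][0] - trajectory[0][0]
        dy = trajectory[-1][1] - trajectory[0][1]
        return trajectory, float(np.degrees(np.arctan2(dy, dx)))
    return trajectory, None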







The tracking procedure appears in many applications, such as mobile robotics, solar forecasting, particle tracking in microscopy images, biological applications and surveillance, to cite the most common ones [5, 6]. Much of the existing work on object tracking concerns passive trackers, where it is assumed that the object of interest is always in the image scene and there is no need to handle camera control during tracking. This approach is not suitable for some use cases, e.g., tracking performed by a mobile robot with a mounted camera or by a drone. For such applications, one should instead pursue active tracking, which unifies the two sub-tasks, i.e., object tracking and camera control. With a passive tracker it is difficult to jointly tune a pipeline made of two separate sub-tasks, and the tracking task may also require substantial human effort for bounding-box labeling. Moreover, the implementation of camera control is non-trivial and can incur expensive trial-and-error system tuning in the real world, as shown in [7, 8]. Active object tracking thus adds camera control to traditional object tracking; so far, little research has focused on this approach to visual object tracking.

2. Related Works

Despite the success of traditional trackers based on low-level, hand-crafted features, models based on deep convolutional neural networks (CNNs) have dominated recent visual tracking research. The success of these models largely depends on the capability of the CNN to learn a good feature representation of the tracking target. Unfortunately, in a busy scene with occluding objects, this approach can fail to find the long-term temporal correlations expressing target motion across frames. In this work we explore and investigate a more general strategy for visual tracking based on reinforcement learning and convolutional recurrent networks. The main intuition behind this method is that, during active tracking, the historical visual semantics and tracking proposals encode information pertinent to future predictions. Such features require continuous and accurate predictions in both the spatial and temporal domains over a long period of time, demanding a novel network architecture design as well as proper training algorithms. We formulate visual tracking as a sequential decision-making process and explore a novel framework, referred to as Deep RL Tracker (DRLT), which processes video frames as a whole and directly outputs the actions that keep the camera on the target in each frame. Our model integrates a convolutional network with a recurrent network (Figure 1) and builds up a spatial-temporal representation of the input frames. It fuses past recurrent states with current visual features to predict the target object's movements along the input sequence of frames over time. We employ an end-to-end algorithm that allows the model to be trained to maximize tracking performance in the long run. This procedure uses backpropagation to train the neural network components and an off-policy actor-critic reinforcement learning algorithm [9] to train the policy network. Recent research in visual object tracking relies on game engines as simulation environments in which to train the neural network models that are then deployed on physical robotic platforms in real-world environments. We note that game engines are not suitable for mobile robot applications such as the person-following task considered in this work: they only allow controlling the camera position and orientation, with no notion of how to control the robot's motion and navigation in response to tracking outputs, since a game engine provides no robot hardware APIs. For this reason, our approach relies on a simulation environment based on the ROS/Gazebo framework for the training process, which provides suitable APIs for controlling the camera through the navigation of the robotic platform carrying the camera sensor. Our main goal is to teach the mobile manipulator TIAGo, from PAL Robotics, to follow a human target walking in an indoor environment, for assistance and health-care tasks. The TIAGo robot and the human tracking target are modeled in the Gazebo 3D simulator. A sequence of images acquired by the robot's camera sensor is passed as input to the observation encoder; the sequence encoder then collects and encodes the temporal correlation of the extracted feature representations; finally, the advantage actor-critic (A2C) RL algorithm is used to optimize the actor and critic networks through the policy-gradient and value losses, and the output of the reinforcement learning algorithm is used to sample the next action the robot has to perform to follow the human trajectory.

Target-driven visual navigation is a relatively new task in robotics research, and only recently have end-to-end systems been specifically developed to address it. A naive approach would be to use a classic map-based navigation algorithm together with an image or object recognition model. To overcome the limits of that approach, map-less methods have been proposed which solve navigation and target approaching jointly [10, 11]. These systems, like ours, do not build a geometric map of the area; instead, they implicitly acquire the minimum knowledge of the environment necessary for navigation. This is done by directly mapping visual inputs to motion, i.e., pixels to actions. The DRL framework proves very promising for this purpose. In deep reinforcement learning and agent-based models, a reward function is defined based on the robot's perceived state and performed actions.
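For concreteness, the following is a minimal PyTorch sketch of the pipeline just described: the Figure 1 network (convolutional observation encoder, LSTM sequence encoder, actor and critic heads) together with the standard A2C policy-gradient and value losses used to optimize it. Layer sizes, the input resolution and the loss coefficients are illustrative assumptions, not the exact values used in this work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackerPolicy(nn.Module):
    """Sketch of the Figure 1 pipeline: a 4-layer convolutional observation
    encoder, an LSTM sequence encoder, and fully connected actor/critic
    heads. All sizes here are illustrative."""
    def __init__(self, n_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 64 * 2 * 2 = 256 features for 84x84 RGB inputs
        self.lstm = nn.LSTM(256, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, n_actions)   # policy logits
        self.critic = nn.Linear(hidden, 1)          # state-value estimate

    def forward(self, frames, state=None):
        # frames: (batch, time, 3, 84, 84)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)
        return self.actor(out), self.critic(out).squeeze(-1), state

def a2c_loss(logits, values, actions, returns,
             value_coef=0.5, entropy_coef=0.01):
    """Standard A2C objective: policy gradient weighted by the advantage
    A = R - V(s), plus a value-regression loss and an entropy bonus."""
    dist = torch.distributions.Categorical(logits=logits)
    advantage = returns - values.detach()
    policy_loss = -(dist.log_prob(actions) * advantage).mean()
    value_loss = F.mse_loss(values, returns)
    entropy = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy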







Figure 1: Overview of the network architecture. The encoder block contains 4 convolutional layers. The sequence encoder is an LSTM layer, which extracts features over time. The actor and critic networks are fully connected layers.



The robot learns sequential decision making so as to accumulate more reward during operation. The overall problem is typically formulated as a Markov decision process (MDP), and the optimal action-selection rules are learned using dynamic programming techniques. These methods are attractive because they do not require supervision and they imitate the natural human learning experience; however, they require complex and lengthy learning processes. DRL for Robotic Applications and Visual Navigation. RL has become popular in recent times. In [12] a solution is proposed to the navigation problem of nonholonomic mobile robots with continuous control, based on deep RL. Moreover, training the robot for the motion task in a virtual environment speeds up learning and generalization, and avoids the costs and risks of a trial-and-error learning approach in a real-world setup. Equipped with deep ConvNets, RL has shown impressive success on visual tracking tasks, as also shown in [13]. However, those works are distinct from ours, as they do not formulate the tracking procedure in an end-to-end manner and do not consider camera control. A further step towards generalization is taken by [14], which introduces a framework integrating a deep neural network based object recognition module. With this module, the agent can identify the target object regardless of where the photo of that object was taken. However, it is still trained or fine-tuned in the same environments where it is tested, and therefore it is still not able to generalize to unseen scenarios.

3. System set-up and simulation environment

The first step in the development of this work was the setup of a simulation environment in which the TIAGo robot model and a walking human model (actor) can be reproduced. As described in the previous section, the goal of the project is the development and improvement of a visual object tracking system that actively tracks a human walking in an indoor environment, trained end-to-end by means of an actor-critic RL algorithm. To avoid potential issues with the different operating systems running on our machines and with the continuous updating of dependencies and deprecated packages, which can prove troublesome when developing against a Robot Operating System (ROS) code base, a Docker container was built to develop, run, manage and sync the modules that make up the whole project (see Section 3.1). Gazebo was chosen as the simulation environment since it plugs directly into the ROS framework. Robot simulation is an essential tool in every robotics toolbox: a well-designed simulator makes it possible to rapidly test algorithms, design robots, perform regression testing, and train AI systems in realistic scenarios. For the integration of the RL algorithm into the ROS/Gazebo framework we relied on Gym-Gazebo, a toolkit that extends OpenAI Gym for robotics, providing different learning techniques and algorithms that can be compared under the same virtual conditions.

3.1. Docker

Docker is a software platform that allows one to build, test, and deploy applications quickly. Docker packages software into standardized units called containers that contain everything the software needs to run, including libraries, system tools, code, and runtime. We had to build a fairly complex Docker container, syncing many packages and tools that depend on each other; a sketch of such a setup is given below.
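As a flavour of what this container involves, here is a minimal, hypothetical Dockerfile combining a ROS Melodic (Python 2.7 based) image with a Python 3/PyTorch environment; the base image, package names and versions are illustrative and do not reproduce the project's exact manifest.

# Hypothetical sketch: ROS Melodic (Python 2.7 based) plus a separate
# Python 3 environment for PyTorch and the RL code. Versions and
# package names are illustrative only.
FROM osrf/ros:melodic-desktop-full

# Python 3 side: PyTorch for the actor-critic model
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    pip3 install torch==1.8.0 gym

# Python 2 side: ROS build tools for the TIAGo workspace
RUN apt-get install -y python-rosdep python-catkin-tools

# The TIAGo public workspace would be cloned and built here
WORKDIR /tiago_public_ws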






Figure 2: Scheme of the project framework in the Docker container: an NVIDIA Docker container wrapping ROS Melodic (TIAGo APIs, Gazebo and their dependencies), Gym-Gazebo (Gym environment and RL framework), and Python 3.6 (PyTorch 1.8.0 and the actor-critic architecture).
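Figure 2 places the Gym environment between ROS/Gazebo and the learning code; the RL framework drives it through the standard Gym interface. The sketch below shows the generic episode loop; the environment id is a hypothetical placeholder, since Gym-Gazebo environment names depend on the robot and world registered in the local installation.

import gym
import gym_gazebo  # registers the Gazebo-backed environments with Gym

# Hypothetical id: actual Gym-Gazebo ids depend on the registered setup.
env = gym.make("GazeboTiagoPersonFollow-v0")

for episode in range(100):
    obs = env.reset()            # camera image from the simulated robot
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()   # stand-in for the A2C policy
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print("episode %d: return %.2f" % (episode, total_reward))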



Among these interdependencies: the TIAGo ROS packages (tiago_public_ws) require ROS Melodic; ROS Melodic requires Python 2.7; on the RL side, Gym-Gazebo requires the PyTorch machine learning framework, and PyTorch in turn requires Python 3.6. We therefore need to set up both Python 2 and Python 3 within the same system. Having an NVIDIA GPU available, we leverage the NVIDIA Container Toolkit, which allows users to build and run GPU-accelerated Docker containers.
GPU-enabled applications need access to both kernel-level device drivers and user-level CUDA libraries, and different applications may require different CUDA versions. One way to solve this problem is to install the GPU drivers inside the container and map the physical NVIDIA GPU device on the underlying Docker host (e.g., /dev/nvidia0) into the container. The problem with this approach is that the versions of the driver and the libraries inside the container need to match precisely; otherwise the application will fail. In that case, users still have to worry about which drivers and libraries are installed on each host computer to ensure compatibility with the containerized applications.
NVIDIA Docker, instead, provides driver-agnostic CUDA images. This Docker plug-in enables GPU applications running in containers to share graphics acceleration devices on the Docker host without worrying about version mismatches between libraries and device drivers. In particular, we employed the rocker toolkit [15] which, building upon the nvidia-docker2 package, provides an easy way to run the Docker container with a graphical user interface and GPU acceleration. The structure of the Docker container is shown in detail in the diagram of Figure 2.

3.2. ROS - Gazebo

ROS is a collection of libraries, drivers, and tools for the effective development and building of robot systems. It has a Linux-like command tool, an inter-process communication system, and numerous application-related packages. The main features of the ROS infrastructure are the following (a minimal node is sketched after this list):

• Nodes. The executable processes that participate in the communication. They can be programmed in C++ or Python. For this project, Python was chosen for its speed and simplicity and for its built-in integration with the PyTorch deep learning library.
• Topics. The inter-process communication follows a publish/subscribe model, and the communication channels are called topics. Each topic is defined by a specific name and a single type of message that can be posted to it. Nodes can subscribe to topics to receive the information published on them, or publish information for other nodes to receive.
• Callbacks. The interrupt service routines triggered when a node subscribed to a topic detects that something has been published on that channel. Data processing is done in these routines, such as saving the position of the robot in the callback generated by the topic on which an odometry sensor publishes its readings.
• Launch files. Files that run and manage multiple nodes for the robot and its sensors, the simulation environment and the data visualization software, such as RViz or the Gazebo simulator. They are encoded in XML format.
• URDF (Unified Robot Description Format). Definition files of a robot. The links of the robot's kinematic structure are connected to each other by joints; these files also define the dynamic, kinematic, visual and collision properties of the robot. Like launch files, they are written in XML.

Gazebo is a simulator specifically designed for robotics. Its convenient design makes it possible to quickly test algorithms, robots and AI applications. In Gazebo all the elements present in reality are simulated: the sensors and actuators act according to the environment. In this work, Gazebo has been used to design the virtual indoor environment where all the experiments with the TIAGo robot are performed.
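The node/topic/callback pattern above looks as follows in rospy. The topic name /mobile_base_controller/odom is an assumption about the robot's odometry topic; substitute whatever the local robot actually publishes.

#!/usr/bin/env python
# Minimal rospy node illustrating the publish/subscribe pattern: it
# subscribes to an odometry topic and stores the robot position in the
# callback. The topic name is an assumption; adjust to your robot.
import rospy
from nav_msgs.msg import Odometry

last_position = None

def odom_callback(msg):
    """Callback: runs every time a new Odometry message is published."""
    global last_position
    last_position = msg.pose.pose.position
    rospy.loginfo("robot at x=%.2f y=%.2f", last_position.x, last_position.y)

if __name__ == "__main__":
    rospy.init_node("odom_listener")
    rospy.Subscriber("/mobile_base_controller/odom", Odometry, odom_callback)
    rospy.spin()   # keep the node alive, servicing callbacks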







3.3. TIAGo Robot

To best fit our task of person following via a visual object tracking algorithm, we considered several robot platforms, such as TIAGo and TurtleBot. Unlike TIAGo, TurtleBot comes with many APIs, packages and open-source ROS plugins. Nevertheless, we chose the TIAGo platform: although it does not provide as many open-source APIs as TurtleBot, it is a more advanced and modern robot platform whose kinematic structure better suits our goal. Since the aim of our work is to make a mobile robot learn in an end-to-end manner how to actively detect and track a person walking in an indoor space, the height of TIAGo is more suitable, because the camera mounted on the robot's head can capture the full body of the person to follow, thus providing better input data to the reinforcement learning algorithm. Moreover, TIAGo is a service robot specifically designed to work in indoor environments, so our application can easily be deployed on the real platform in the real world.
The main components of the TIAGo robot are described below (see Figure 3).

• The mobile base uses a differential drive system and has a maximum speed of 1 m/s. It is designed for indoor operation. At the base is the laser sensor, whose sensing range varies with the model (iron, steel or titanium) between 5.6 and 25 meters. To detect what lies behind the robot, there are three sonars with a detection range of 1 meter.
• The body is the central part of the TIAGo robot and is made up of the arm and torso. The torso has a prismatic joint that allows the height of the robot to be increased by 35 cm. The arm has 7 degrees of freedom, a length of 87 cm at maximum extension, and a payload capacity of 3 kg.
• The head comprises the neck, whose 2 DoF allow TIAGo to look in any direction. In place of the eyes there is an RGB-D camera that provides color and depth images, making it possible to reconstruct the environment using point clouds. The technology is the same as that used by the popular Kinect cameras; in particular, the robot is equipped with a built-in Asus Xtion camera.

PAL Robotics offers several compatible ROS packages and libraries that allow the TIAGo robot to perform complex perception, navigation, manipulation and human-robot interaction tasks. The platform is also equipped with a Jetson TX2 kit that guarantees power-efficient computing resources, well suited to deep learning applications.

Figure 3: TIAGo robot with structural components highlighted.

3.4. Animated human model: actor

The Gazebo simulator allows defining animated models (called 'actors' in Gazebo), which are useful when one wants entities that follow predefined paths in simulation without being affected by the physics engine. They have a 3D visualization that can be seen by RGB cameras, and 3D meshes that can be detected by GPU-based depth sensors, which makes them suitable for computer vision applications. A closed-loop trajectory is defined for each of the considered train and test cases, and additional plugin files are used to control the animations based on feedback from the environment. Actors extend common models, adding animation capabilities. There are two types of animation, which can be used separately or combined:

• Skeleton animation, which is relative motion between links in one model;
• Motion along a trajectory, which carries all of the actor's links around the world as one group.

Both types of motion can be combined to achieve a skeleton animation that moves in the world. Gazebo supports two skeleton-animation file formats: COLLADA (.dae) and Biovision Hierarchy (.bvh). The actor model defined in the project loads a COLLADA file described within the <skin> tag. Sometimes it is useful to combine different skins with different animations: Gazebo allows one to take the skin from one file and the animation from another, as long as they have compatible skeletons. Scripted trajectories are the high-level animation type for actors and consist of specifying a series of poses to be reached at specific times. Gazebo
takes care of interpolating the motion between them so that the movement is fluid. The trajectory is defined inside the .world file containing all the models that are visualized in the simulation. Inside the