Supporting Impaired People with a Following Robotic
Assistant by means of End-to-End Visual Target Navigation
and Reinforcement Learning Approaches
Nguyen Ngoc Dat1, Valerio Ponzi1, Samuele Russo2 and Francesco Vincelli1
1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Rome, Italy
2 Department of Psychology, Sapienza University of Rome, Via dei Marsi 78, 00185 Rome, Italy


Abstract
We present an improvement in visual object tracking and navigation for mobile robots, implementing the advantage actor-critic (A2C) reinforcement learning architecture on top of the Gym-Gazebo framework. This work provides an easier way to integrate reinforcement learning algorithms for navigation and object tracking tasks in robotics. We train the convolutional-recurrent model employed for policy estimation in an end-to-end manner. The robot is able to follow a simulated human walking in an indoor environment by using the sequence of images provided by the robot camera; the input of the algorithm is acquired and processed directly in the ROS-Gazebo environment. The policy learned by the robot agent proved to generalize well to environments of different size and shape from the training one. Moreover, the policy allows the robot to avoid obstacles while following the tracked target. Thanks to these improvements, we can straightforwardly apply the tracking system on a real-world robot for a person-following task in indoor environments.

Keywords
Visual Object Tracking, Human Tracking, Person Following, Human Robot Interaction, ROS, Gazebo, Gym-Gazebo, Visual Navigation, Reinforcement Learning, Advantage Actor Critic



1. Introduction

For some categories of subjects defined as "fragile", the presence of an assistant who can guide and help them can be a valuable support. In some cases this assistant is not only useful but necessary, especially for subjects who show spatial orientation problems. The idea of this study is to accompany the person with a robot that follows and helps them whenever they need to regain visuospatial orientation. For example, when the subject walks down the street, the robot recognizes whether the path taken is the right one and, if not, tells the person that they have taken the wrong path and suggests the appropriate one. This functionality is very useful, and sometimes indispensable, when the patient has cognitive problems with implications for executive functions and for visuospatial orientation abilities [1, 2, 3]. Where self-monitoring and visuospatial planning skills are lacking, such a robot can help the patient on the one hand to maintain greater autonomy and independence, and on the other to reinforce, strengthen and rehabilitate visuospatial skills. In fact, the use of the robot would not be limited to replacing the action in which the person has failed: it would also help the person reflect on the mistake made, asking questions that prompt reflection and reasoning. Furthermore, this activation of reasoning can occur even before the person is about to commit a visuospatial error. For example, the robot can prompt the person to plan the path that is shorter or easier to reach. The robot can also act as a "companion" with which the person talks, so that the person then chooses and decides independently, with the help of the robot's questions, the place where they prefer to go. In fact, patients with mild cognitive impairment easily act on impulse, without adequately planning the path and without having reflected on the goals for which they choose to follow it. From a psychological and neuropsychological point of view, the use of robots plays a fundamental role in supporting, on the one hand, the self-determination and autonomy of the person and, on the other, in slowing down their decline and activating robot-mediated neuropsychological enhancement processes.

Object detection is the process of detecting an object in the frames of a video sequence, while object tracking is the process of finding the direction of an object moving around a scene [4]. The main steps of the tracking process are: (1) detection of moving objects, (2) tracking of the detected object from the current frame to the next, and (3) analysis of the tracked objects to recognize their behaviour. Visual tracking plays an important role in many computer vision applications, including image and video processing, pattern recognition, information retrieval, automation and control.
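To make these three steps concrete, below is a deliberately naive, self-contained Python sketch based on frame differencing and centroid following. It is far simpler than the learned tracker developed in this paper and is only meant to illustrate the detect-track-analyze loop; none of its names belong to the system described here.

import numpy as np

def detect_moving(prev, curr, thresh=25):
    """Step (1): crude motion detection by frame differencing.
    Returns the centroid of the changed pixels, or None."""
    diff = np.abs(curr.astype(int) - prev.astype(int)) > thresh
    ys, xs = np.nonzero(diff)
    if len(xs) == 0:
        return None
    return float(xs.mean()), float(ys.mean())

def track(frames):
    """Steps (2)-(3): follow the detected centroid frame to frame,
    then report the net direction of motion in degrees."""
    trajectory, prev = [], None
    for curr in frames:
        if prev is not None:
            c = detect_moving(prev, curr)
            if c is not None:
                trajectory.append(c)
        prev = curr
    if len(trajectory) >= 2:
        dx = trajectory[-1][0] - trajectory[0][0]
        dy = trajectory[-1][1] - trajectory[0][1]
        return trajectory, float(np.degrees(np.arctan2(dy, dx)))
    return trajectory, None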







The tracking procedure appears in many applications, such as mobile robotics, solar forecasting, particle tracking in microscopy images, biological applications and surveillance, to cite the most common ones [5, 6]. Much of the existing work on object tracking concerns passive trackers, where it is assumed that the object of interest is always in the image scene and there is no need to handle camera control during tracking. This approach is not suitable for some use cases, e.g., tracking performed by a mobile robot with a mounted camera or by a drone. For such applications, one should instead pursue active tracking, which unifies the two sub-tasks, i.e., object tracking and camera control. With a passive tracker it is difficult to jointly tune a pipeline made of two separate sub-tasks, and the tracking task may also require substantial human effort for bounding-box labeling. Moreover, the implementation of camera control is non-trivial and can incur expensive trial-and-error system tuning in the real world, as shown in [7, 8]. Active object tracking thus adds camera control to traditional object tracking; so far, little research has focused on this approach to visual object tracking.

2. Related Works

Despite the success of traditional trackers based on low-level, hand-crafted features, models based on deep convolutional neural networks (CNNs) have dominated recent visual tracking research. The success of these models largely depends on the capability of the CNN to learn a good feature representation of the tracking target. Unfortunately, in a busy scene with occluding objects, this approach can fail to find the long-term temporal correlations expressing target motion across frames. In this work we explore and investigate a more general strategy for visual tracking based on reinforcement learning and convolutional recurrent networks. The main intuition behind this method is that, during active tracking, the historical visual semantics and tracking proposals encode information pertinent to future predictions. Such features require continuous and accurate predictions in both the spatial and temporal domains over a long period of time, demanding a novel network architecture design as well as proper training algorithms. We formulate visual tracking as a sequential decision-making process and explore a novel framework, referred to as Deep RL Tracker (DRLT), which processes video frames as a whole and directly outputs the actions that keep the camera on the target in each frame. Our model integrates a convolutional network with a recurrent network (Figure 1) and builds up a spatial-temporal representation of the input frames. It fuses past recurrent states with current visual features to predict the target object's movements along the input sequence of frames over time. We employ an end-to-end algorithm that allows the model to be trained to maximize tracking performance in the long run. This procedure uses backpropagation to train the neural network components and an off-policy actor-critic reinforcement learning algorithm [9] to train the policy network. Recent research in visual object tracking relies on game engines as simulation environments in which to train the neural network models that are then deployed on physical robotic platforms in real-world environments. We note that game engines are not suitable for mobile robot applications such as the person-following task considered in this work: they only allow controlling the camera position and orientation, with no notion of how to control the robot's motion and navigation in response to tracking outputs, since a game engine provides no robot hardware APIs. For this reason, our approach relies on a simulation environment based on the ROS/Gazebo framework for the training process, which provides suitable APIs for controlling the camera through the navigation of the robotic platform carrying the camera sensor. Our main goal is to teach the mobile manipulator TIAGo, from PAL Robotics, to follow a human target walking in an indoor environment, for assistance and health-care tasks. The TIAGo robot and the human tracking target are modeled in the Gazebo 3D simulator. A sequence of images acquired by the robot's camera sensor is passed as input to the observation encoder; the sequence encoder then collects and encodes the temporal correlation of the extracted feature representations; finally, the advantage actor-critic (A2C) RL algorithm is used to optimize the actor and critic networks through the policy-gradient and value losses, and the output of the reinforcement learning algorithm is used to sample the next action the robot has to perform to follow the human trajectory.

Target-driven visual navigation is a relatively new task in robotics research, and only recently have end-to-end systems been specifically developed to address it. A naive approach would be to use a classic map-based navigation algorithm together with an image or object recognition model. To overcome the limits of that approach, map-less methods have been proposed which solve navigation and target approaching jointly [10, 11]. These systems, like ours, do not build a geometric map of the area; instead, they implicitly acquire the minimum knowledge of the environment necessary for navigation. This is done by directly mapping visual inputs to motion, i.e., pixels to actions. The DRL framework proves very promising for this purpose. In deep reinforcement learning and agent-based models, a reward function is defined based on the robot's perceived state and performed actions.
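For concreteness, the following is a minimal PyTorch sketch of the pipeline just described: the Figure 1 network (convolutional observation encoder, LSTM sequence encoder, actor and critic heads) together with the standard A2C policy-gradient and value losses used to optimize it. Layer sizes, the input resolution and the loss coefficients are illustrative assumptions, not the exact values used in this work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackerPolicy(nn.Module):
    """Sketch of the Figure 1 pipeline: a 4-layer convolutional observation
    encoder, an LSTM sequence encoder, and fully connected actor/critic
    heads. All sizes here are illustrative."""
    def __init__(self, n_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 64 * 2 * 2 = 256 features for 84x84 RGB inputs
        self.lstm = nn.LSTM(256, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, n_actions)   # policy logits
        self.critic = nn.Linear(hidden, 1)          # state-value estimate

    def forward(self, frames, state=None):
        # frames: (batch, time, 3, 84, 84)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)
        return self.actor(out), self.critic(out).squeeze(-1), state

def a2c_loss(logits, values, actions, returns,
             value_coef=0.5, entropy_coef=0.01):
    """Standard A2C objective: policy gradient weighted by the advantage
    A = R - V(s), plus a value-regression loss and an entropy bonus."""
    dist = torch.distributions.Categorical(logits=logits)
    advantage = returns - values.detach()
    policy_loss = -(dist.log_prob(actions) * advantage).mean()
    value_loss = F.mse_loss(values, returns)
    entropy = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy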







Figure 1: Overview of the network architecture. The encoder block contains 4 convolutional layers. The sequence encoder is an LSTM layer, which extracts features over time. The actor and critic networks are fully connected layers.



The robot learns sequential decision making so as to accumulate more reward during operation. The overall problem is typically formulated as a Markov decision process (MDP), and the optimal action-selection rules are learned using dynamic programming techniques. These methods are attractive because they do not require supervision and they imitate the natural human learning experience; however, they require complex and lengthy learning processes. DRL for Robotic Applications and Visual Navigation. RL has become popular in recent times. In [12] a solution is proposed to the navigation problem of nonholonomic mobile robots with continuous control, based on deep RL. Moreover, training the robot for the motion task in a virtual environment speeds up learning and generalization, and avoids the costs and risks of a trial-and-error learning approach in a real-world setup. Equipped with deep ConvNets, RL has shown impressive success on visual tracking tasks, as also shown in [13]. However, those works are distinct from ours, as they do not formulate the tracking procedure in an end-to-end manner and do not consider camera control. A further step towards generalization is taken by [14], which introduces a framework integrating a deep neural network based object recognition module. With this module, the agent can identify the target object regardless of where the photo of that object was taken. However, it is still trained or fine-tuned in the same environments where it is tested, and therefore it is still not able to generalize to unseen scenarios.

3. System set-up and simulation environment

The first step in the development of this work was the setup of a simulation environment in which the TIAGo robot model and a walking human model (actor) can be reproduced. As described in the previous section, the goal of the project is the development and improvement of a visual object tracking system that actively tracks a human walking in an indoor environment, trained end-to-end by means of an actor-critic RL algorithm. To avoid potential issues with the different operating systems running on our machines and with the continuous updating of dependencies and deprecated packages, which can prove troublesome when developing against a Robot Operating System (ROS) code base, a Docker container was built to develop, run, manage and sync the modules that make up the whole project (see Section 3.1). Gazebo was chosen as the simulation environment since it plugs directly into the ROS framework. Robot simulation is an essential tool in every robotics toolbox: a well-designed simulator makes it possible to rapidly test algorithms, design robots, perform regression testing, and train AI systems in realistic scenarios. For the integration of the RL algorithm into the ROS/Gazebo framework we relied on Gym-Gazebo, a toolkit that extends OpenAI Gym for robotics, providing different learning techniques and algorithms that can be compared under the same virtual conditions.

3.1. Docker

Docker is a software platform that allows one to build, test, and deploy applications quickly. Docker packages software into standardized units called containers that contain everything the software needs to run, including libraries, system tools, code, and runtime. We had to build a fairly complex Docker container, syncing many packages and tools that depend on each other; a sketch of such a setup is given below.
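As a flavour of what this container involves, here is a minimal, hypothetical Dockerfile combining a ROS Melodic (Python 2.7 based) image with a Python 3/PyTorch environment; the base image, package names and versions are illustrative and do not reproduce the project's exact manifest.

# Hypothetical sketch: ROS Melodic (Python 2.7 based) plus a separate
# Python 3 environment for PyTorch and the RL code. Versions and
# package names are illustrative only.
FROM osrf/ros:melodic-desktop-full

# Python 3 side: PyTorch for the actor-critic model
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    pip3 install torch==1.8.0 gym

# Python 2 side: ROS build tools for the TIAGo workspace
RUN apt-get install -y python-rosdep python-catkin-tools

# The TIAGo public workspace would be cloned and built here
WORKDIR /tiago_public_ws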






Figure 2: Scheme of the project framework in the Docker container: an NVIDIA Docker container wrapping ROS Melodic (TIAGo APIs, Gazebo and their dependencies), Gym-Gazebo (Gym environment and RL framework), and Python 3.6 (PyTorch 1.8.0 and the actor-critic architecture).
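Figure 2 places the Gym environment between ROS/Gazebo and the learning code; the RL framework drives it through the standard Gym interface. The sketch below shows the generic episode loop; the environment id is a hypothetical placeholder, since Gym-Gazebo environment names depend on the robot and world registered in the local installation.

import gym
import gym_gazebo  # registers the Gazebo-backed environments with Gym

# Hypothetical id: actual Gym-Gazebo ids depend on the registered setup.
env = gym.make("GazeboTiagoPersonFollow-v0")

for episode in range(100):
    obs = env.reset()            # camera image from the simulated robot
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()   # stand-in for the A2C policy
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print("episode %d: return %.2f" % (episode, total_reward))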



Among these interdependencies: the TIAGo ROS packages (tiago_public_ws) require ROS Melodic; ROS Melodic requires Python 2.7; on the RL side, Gym-Gazebo requires the PyTorch machine learning framework, and PyTorch in turn requires Python 3.6. We therefore need to set up both Python 2 and Python 3 within the same system. Having an NVIDIA GPU available, we leverage the NVIDIA Container Toolkit, which allows users to build and run GPU-accelerated Docker containers.
GPU-enabled applications need access to both kernel-level device drivers and user-level CUDA libraries, and different applications may require different CUDA versions. One way to solve this problem is to install the GPU drivers inside the container and map the physical NVIDIA GPU device on the underlying Docker host (e.g., /dev/nvidia0) into the container. The problem with this approach is that the versions of the driver and the libraries inside the container need to match precisely; otherwise the application will fail. In that case, users still have to worry about which drivers and libraries are installed on each host computer to ensure compatibility with the containerized applications.
NVIDIA Docker, instead, provides driver-agnostic CUDA images. This Docker plug-in enables GPU applications running in containers to share graphics acceleration devices on the Docker host without worrying about version mismatches between libraries and device drivers. In particular, we employed the rocker toolkit [15] which, building upon the nvidia-docker2 package, provides an easy way to run the Docker container with a graphical user interface and GPU acceleration. The structure of the Docker container is shown in detail in the diagram of Figure 2.

3.2. ROS - Gazebo

ROS is a collection of libraries, drivers, and tools for the effective development and building of robot systems. It has a Linux-like command tool, an inter-process communication system, and numerous application-related packages. The main features of the ROS infrastructure are the following (a minimal node is sketched after this list):

• Nodes. The executable processes that participate in the communication. They can be programmed in C++ or Python. For this project, Python was chosen for its speed and simplicity and for its built-in integration with the PyTorch deep learning library.
• Topics. The inter-process communication follows a publish/subscribe model, and the communication channels are called topics. Each topic is defined by a specific name and a single type of message that can be posted to it. Nodes can subscribe to topics to receive the information published on them, or publish information for other nodes to receive.
• Callbacks. The interrupt service routines triggered when a node subscribed to a topic detects that something has been published on that channel. Data processing is done in these routines, such as saving the position of the robot in the callback generated by the topic on which an odometry sensor publishes its readings.
• Launch files. Files that run and manage multiple nodes for the robot and its sensors, the simulation environment and the data visualization software, such as RViz or the Gazebo simulator. They are encoded in XML format.
• URDF (Unified Robot Description Format). Definition files of a robot. The links of the robot's kinematic structure are connected to each other by joints; these files also define the dynamic, kinematic, visual and collision properties of the robot. Like launch files, they are written in XML.

Gazebo is a simulator specifically designed for robotics. Its convenient design makes it possible to quickly test algorithms, robots and AI applications. In Gazebo all the elements present in reality are simulated: the sensors and actuators act according to the environment. In this work, Gazebo has been used to design the virtual indoor environment where all the experiments with the TIAGo robot are performed.
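The node/topic/callback pattern above looks as follows in rospy. The topic name /mobile_base_controller/odom is an assumption about the robot's odometry topic; substitute whatever the local robot actually publishes.

#!/usr/bin/env python
# Minimal rospy node illustrating the publish/subscribe pattern: it
# subscribes to an odometry topic and stores the robot position in the
# callback. The topic name is an assumption; adjust to your robot.
import rospy
from nav_msgs.msg import Odometry

last_position = None

def odom_callback(msg):
    """Callback: runs every time a new Odometry message is published."""
    global last_position
    last_position = msg.pose.pose.position
    rospy.loginfo("robot at x=%.2f y=%.2f", last_position.x, last_position.y)

if __name__ == "__main__":
    rospy.init_node("odom_listener")
    rospy.Subscriber("/mobile_base_controller/odom", Odometry, odom_callback)
    rospy.spin()   # keep the node alive, servicing callbacks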







3.3. TIAGo Robot

To best fit our task of person following via a visual object tracking algorithm, we considered several robot platforms, such as TIAGo and TurtleBot. Unlike TIAGo, TurtleBot comes with many APIs, packages and open-source ROS plugins. Nevertheless, we chose the TIAGo platform: although it does not provide as many open-source APIs as TurtleBot, it is a more advanced and modern robot platform whose kinematic structure better suits our goal. Since the aim of our work is to make a mobile robot learn in an end-to-end manner how to actively detect and track a person walking in an indoor space, the height of TIAGo is more suitable, because the camera mounted on the robot's head can capture the full body of the person to follow, thus providing better input data to the reinforcement learning algorithm. Moreover, TIAGo is a service robot specifically designed to work in indoor environments, so our application can easily be deployed on the real platform in the real world.
The main components of the TIAGo robot are described below (see Figure 3).

• The mobile base uses a differential drive system and has a maximum speed of 1 m/s. It is designed for indoor operation. At the base is the laser sensor, whose sensing range varies with the model (iron, steel or titanium) between 5.6 and 25 meters. To detect what lies behind the robot, there are three sonars with a detection range of 1 meter.
• The body is the central part of the TIAGo robot and is made up of the arm and torso. The torso has a prismatic joint that allows the height of the robot to be increased by 35 cm. The arm has 7 degrees of freedom, a length of 87 cm at maximum extension, and a payload capacity of 3 kg.
• The head comprises the neck, whose 2 DoF allow TIAGo to look in any direction. In place of the eyes there is an RGB-D camera that provides color and depth images, making it possible to reconstruct the environment using point clouds. The technology is the same as that used by the popular Kinect cameras; in particular, the robot is equipped with a built-in Asus Xtion camera.

PAL Robotics offers several compatible ROS packages and libraries that allow the TIAGo robot to perform complex perception, navigation, manipulation and human-robot interaction tasks. The platform is also equipped with a Jetson TX2 kit that guarantees power-efficient computing resources, well suited to deep learning applications.

Figure 3: TIAGo robot with structural components highlighted.

3.4. Animated human model: actor

The Gazebo simulator allows defining animated models (called 'actors' in Gazebo), which are useful when one wants entities that follow predefined paths in simulation without being affected by the physics engine. They have a 3D visualization that can be seen by RGB cameras, and 3D meshes that can be detected by GPU-based depth sensors, which makes them suitable for computer vision applications. A closed-loop trajectory is defined for each of the considered train and test cases, and additional plugin files are used to control the animations based on feedback from the environment. Actors extend common models, adding animation capabilities. There are two types of animation, which can be used separately or combined:

• Skeleton animation, which is relative motion between links in one model;
• Motion along a trajectory, which carries all of the actor's links around the world as one group.

Both types of motion can be combined to achieve a skeleton animation that moves in the world. Gazebo supports two skeleton-animation file formats: COLLADA (.dae) and Biovision Hierarchy (.bvh). The actor model defined in the project loads a COLLADA file described within the <skin> tag. Sometimes it is useful to combine different skins with different animations: Gazebo allows one to take the skin from one file and the animation from another, as long as they have compatible skeletons. Scripted trajectories are the high-level animation type for actors and consist of specifying a series of poses to be reached at specific times. Gazebo
takes care of interpolating the motion between them so that the movement is fluid. The trajectory is defined inside the .world file containing all the models that are visualized in the simulation. Inside the