=Paper=
{{Paper
|id=Vol-3118/p08
|storemode=property
|title=Supporting Impaired People with a Following Robotic Assistant by means of End-to-End Visual Target Navigation and Reinforcement Learning Approaches
|pdfUrl=https://ceur-ws.org/Vol-3118/p08.pdf
|volume=Vol-3118
|authors=Nguyen Ngoc Dat,Valerio Ponzi,Samuele Russo,Francesco Vincelli
|dblpUrl=https://dblp.org/rec/conf/icyrime/DatPRV21
}}
==Supporting Impaired People with a Following Robotic Assistant by means of End-to-End Visual Target Navigation and Reinforcement Learning Approaches==
Nguyen Ngoc Dat<sup>1</sup>, Valerio Ponzi<sup>1</sup>, Samuele Russo<sup>2</sup> and Francesco Vincelli<sup>1</sup>

<sup>1</sup> Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Rome, Italy

<sup>2</sup> Department of Psychology, Sapienza University of Rome, via dei Marsi 78, 00185 Rome, Italy

Contact: ponzi@diag.uniroma1.it (V. Ponzi); samuele.russo@uniroma1.it (S. Russo)

===Abstract===
We present an improvement in visual object tracking and navigation for mobile robots, implementing the advantage actor-critic (A2C) reinforcement learning architecture on top of the Gym-Gazebo framework. This work provides an easier way to integrate reinforcement learning algorithms for navigation and object tracking tasks in robotics. We train the convolutional-recurrent model employed for policy estimation in an end-to-end manner. The robot is able to follow a simulated human walking in an indoor environment by using the sequence of images provided by the robot camera. The input of the algorithm is acquired and processed directly in the ROS-Gazebo environment. The policy learned by the robot agent proved to generalize well to an environment with a different size and shape from the training one. Moreover, the policy allows the robot to avoid obstacles while following the tracking target. Thanks to these improvements, the tracking system can be straightforwardly applied to a real-world robot for a person-following task in indoor environments.

===Keywords===
Visual Object Tracking, Human Tracking, Person Following, Human Robot Interaction, ROS, Gazebo, Gym-Gazebo, Visual Navigation, Reinforcement Learning, Advantage Actor Critic

===1. Introduction===
In some categories of subjects defined as "fragile", the presence of an assistant who can guide and help them can be a valid help. In some cases this assistant is not only useful but also necessary, especially with subjects who show spatial orientation problems. The idea of this study concerns accompanying the person with a robot that helps him when it is necessary to regain visuospatial orientation. For example, when the subject walks down the street, the robot recognizes whether he is on the right path and, if not, communicates to the person that he has taken the wrong path and suggests the appropriate one. This functionality is very useful, and sometimes indispensable, when the patient has cognitive problems with implications for executive functions and for visuospatial orientation abilities [1, 2, 3]. Where self-monitoring and visuospatial planning skills are lacking, the use of this robot can help the patient, on the one hand, to maintain greater autonomy and independence and, on the other hand, to reinforce, strengthen and rehabilitate visuospatial skills. In fact, the use of the robot would not be limited to replacing the action that the person has failed to perform: the robot would also help the person reflect on the mistake made, asking questions that help him to reflect and reason. Furthermore, the activation of reasoning can occur even before the person is about to commit a visuospatial error; for example, the robot can prompt the person to reason about planning the path that is shorter or more easily reachable. The robot can also act as a "companion" with which the person talks, and then chooses and decides independently, with the help of the robot's questions, the place where he prefers to go. In fact, in patients with mild cognitive impairment it is easy for the subject to act on impulse, without adequate planning of the path and without having reflected on the objectives for which that path is chosen. From a psychological and neuropsychological point of view, the use of robots plays a fundamental role in supporting, on the one hand, the self-determination and autonomy of the person and, on the other, in slowing down their decline and activating neuropsychological enhancement processes mediated by the robot.
Object detection is the process of detecting an object in the frames of a video sequence, while object tracking is the process of finding the direction of an object as it moves around a scene [4]. The main steps of the tracking process are: (1) detection of moving objects, (2) tracking of the related object from the current frame to the next frame, (3) analysis of the tracked objects to recognize their behaviour. Visual tracking plays an important role in many computer vision applications, which include image and video processing, pattern recognition, information retrieval, automation and control. The tracking procedure appears in many applications such as mobile robotics, solar forecasting, particle tracking in microscopy images, biological applications and surveillance, to cite the most common ones [5, 6]. Much of the existing work on object tracking concerns passive trackers, where it is assumed that the object of interest is always in the image scene and there is no need to handle camera control during tracking. This approach is not suitable for some use cases, e.g. tracking performed by a mobile robot with a mounted camera or by a drone. For such applications, one should instead pursue active tracking, which unifies the two sub-tasks, i.e. object tracking and camera control. In the passive tracker approach it is difficult to jointly tune the pipeline across the two separate sub-tasks, and the tracking task may also require considerable human effort for bounding box labeling. Moreover, the implementation of camera control is non-trivial and can incur many expensive trial-and-error rounds of system tuning in the real world, as shown in [7, 8]. Active object tracking additionally considers camera control compared with traditional object tracking, and not much research has focused on this approach to visual object tracking so far.
===2. Related Works===
Despite the success of traditional trackers based on low-level, hand-crafted features, models based on deep convolutional neural networks (CNNs) have dominated recent visual tracking research. The success of these models largely depends on the capability of the CNN to learn a good feature representation of the tracking target. Unfortunately, in a busy scene with occluding objects this approach can fail to find the long-term temporal correlations that express the target motion across different frames. In this work we explore and investigate a more general strategy and develop a novel visual tracking approach based on reinforcement learning and convolutional-recurrent networks. The main intuition behind this method is that, during the active tracking process, the historical visual semantics and tracking proposals encode pertinent information for future predictions. Such features require continuous and accurate predictions in both the spatial and the temporal domain over a long period of time, thus demanding a novel network architecture design as well as proper training algorithms.

We formulate the visual tracking problem as a sequential decision-making process and explore a novel framework, referred to as Deep RL Tracker (DRLT). The latter processes video frames as a whole and directly outputs the actions that make the camera follow the target in each frame. Our model integrates a convolutional network with a recurrent network (Figure 1) and builds up a spatial-temporal representation of the input frames. It fuses past recurrent states with current visual features to predict the target object's movements along the input sequence of frames over time. We employ an end-to-end algorithm that allows the model to be trained to maximize tracking performance in the long run. This procedure uses backpropagation to train the neural network components and an off-policy actor-critic reinforcement learning algorithm [9] to train the policy network.

Figure 1: Overview of the network architecture. The encoder block contains 4 convolutional layers. The sequence encoder is an LSTM layer, which extracts features over time. The actor-critic networks are fully connected layers.

Recent research in visual object tracking relies on game engines as the simulation environment in which to train the neural network models that are then applied to physical robotic platforms and real-world environments. We note that game engines are not suitable for mobile robot applications such as the person-following task considered in this work: game engines only allow controlling the camera position and orientation, without addressing how to control the robot motion and navigation in response to the tracking outputs, since they do not provide any robot hardware APIs. For this reason, our approach relies on a simulation environment based on the ROS/Gazebo framework for the training process, which provides suitable APIs to handle camera control through navigation of the robotic platform carrying the camera sensor. Our main target is teaching the mobile manipulator TIAGo, from PAL Robotics, to follow a human target walking in an indoor environment, for assistance and health-care tasks. The TIAGo robot and the human tracking target are modeled in the Gazebo 3D simulator. A sequence of images acquired by the robot camera sensor is passed as input to the observation encoder; the sequence encoder then collects and encodes the temporal correlation of the extracted feature representations; finally, the advantage actor-critic (A2C) off-policy RL algorithm is used to optimize the actor and critic networks through the policy gradient and value loss, and the output of the reinforcement learning algorithm is used to sample the new action that the robot has to perform to follow the human trajectory.
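As a concrete reference for the pipeline described above and sketched in Figure 1, the following PyTorch snippet outlines a convolutional-recurrent actor-critic network of this kind. It is a minimal sketch rather than the authors' implementation: the layer sizes, the pooling step and the number of discrete base actions are illustrative assumptions.

<pre>
import torch
import torch.nn as nn

class ConvRecurrentActorCritic(nn.Module):
    """Sketch of the Figure 1 pipeline: conv encoder -> LSTM -> actor/critic heads.
    Layer sizes and the number of base actions are illustrative assumptions."""

    def __init__(self, n_actions=5, hidden_size=256):
        super().__init__()
        # Observation encoder: 4 convolutional layers applied to each camera frame
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            # Pooling to a fixed grid keeps the feature size independent of the input resolution
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
        )
        # Sequence encoder: LSTM fusing past recurrent states with current visual features
        self.lstm = nn.LSTM(input_size=64 * 4 * 4, hidden_size=hidden_size, batch_first=True)
        # Actor-critic heads: fully connected layers for policy logits and state value
        self.actor = nn.Linear(hidden_size, n_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 3, H, W) sequence of camera images
        b, t = frames.shape[:2]
        features = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, hidden = self.lstm(features, hidden)
        return self.actor(out), self.critic(out), hidden
</pre>

At each step the action sent to the robot base would be sampled from a categorical distribution over the actor logits, while the critic output serves as the value baseline during training.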
Target-driven visual navigation is a relatively new task in the field of robotics research, and only recently have end-to-end systems been specifically developed to address it. A possible naive approach could be to use a classic map-based navigation algorithm along with an image or object recognition model. To overcome the limits of such an approach, map-less methods, which try to solve the problems of navigation and target approaching jointly, have been proposed [10, 11]. These systems, like ours, do not build a geometric map of the area; instead, they implicitly acquire the minimum knowledge of the environment necessary for navigation, by directly mapping visual inputs to motion, i.e. pixels to actions. The DRL framework proves very promising for this purpose. In deep reinforcement learning and agent-based models, a reward function is defined based on the robot's perceived state and performed actions, and the robot learns sequential decision making so as to accumulate more reward while in operation. The overall problem is typically formulated as a Markov decision process (MDP) and the optimal action-state rules are learned using dynamic programming techniques. These methods are attractive because they do not require supervision and they imitate the natural human learning experience; however, they require complex and lengthy learning processes.

'''DRL for robotic applications and visual navigation.''' RL has become popular in robotics in recent times. In [12] a solution is proposed to the navigation problem of nonholonomic mobile robots with continuous control based on deep RL. Moreover, training the robot for the motion task in a virtual environment speeds up the learning and generalization process and also avoids the costs and risks of a trial-and-error learning approach in a real-world setup. Equipped with deep ConvNets, RL shows impressive successes on visual tracking tasks, as shown also in [13]. However, these works are distinct from ours, as they do not formulate the tracking procedure in an end-to-end manner and do not consider camera control. A further step towards generalization is taken by [14], which introduces a framework that integrates a deep-neural-network-based object recognition module. With this module, the agent can identify the target object regardless of where the photo of that object was taken. However, it is still trained or fine-tuned in the same environments where it is tested, and therefore it is still not able to generalize to unseen scenarios.
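To make the policy-gradient and value-loss terms used to optimize the actor and critic concrete, the sketch below shows one standard advantage actor-critic update on a rollout collected in simulation, assuming the network from the previous snippet. The discount factor, loss coefficients and rollout handling are illustrative assumptions, not details taken from the paper.

<pre>
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def a2c_update(model, optimizer, frames, actions, rewards,
               gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """One advantage actor-critic update on a single rollout (illustrative values)."""
    logits, values, _ = model(frames)            # shapes: (1, T, A), (1, T, 1)
    logits, values = logits[0], values[0, :, 0]

    # Discounted returns computed backwards over the rollout
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    # Advantage: how much better the taken action was than the critic's estimate
    advantages = returns - values.detach()

    dist = Categorical(logits=logits)
    policy_loss = -(dist.log_prob(actions) * advantages).mean()  # policy gradient term
    value_loss = F.mse_loss(values, returns)                     # critic regression term
    entropy_bonus = dist.entropy().mean()                        # encourages exploration

    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</pre>

Here frames is a (1, T, 3, H, W) tensor of consecutive camera images, actions is the sequence of T discrete actions taken, and rewards contains the corresponding scalar rewards.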
===3. System set-up and simulation environment===
The first step in the development of this work consisted in the setup of a simulation environment in which the TIAGo robot model and a walking human model (actor) can be reproduced. As described in the previous chapter, the task of the project is the development and improvement of a visual object tracking system that allows a human walking in an indoor environment to be actively tracked, in an end-to-end manner, by means of an actor-critic RL algorithm. To avoid potential issues with the different operating systems running on our machines and with the continuous update of dependencies and deprecated packages, which can prove troublesome when developing on a Robot Operating System (ROS) code base, a Docker container has been built to develop, run, manage and sync the modules making up the whole project (refer to Section 3.1). Gazebo was chosen as the simulation environment since it plugs directly into the ROS framework. Robot simulation is an essential tool in every robotics toolbox: a well-designed simulator makes it possible to rapidly test algorithms, design robots, perform regression testing, and train AI systems using realistic scenarios. For the integration of the RL algorithm in the ROS/Gazebo framework we relied on Gym-Gazebo, a toolkit which extends OpenAI Gym for robotics, providing different learning techniques and algorithms to be compared under the same virtual conditions.
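Environments built in the Gym-Gazebo style expose the usual OpenAI Gym interface (reset/step) to the RL code. The sketch below is a simplified, generic person-following environment written against that interface; the class name, observation and action spaces, topic handling and reward shaping are illustrative assumptions and not Gym-Gazebo's actual implementation.

<pre>
import gym
import numpy as np
from gym import spaces

class PersonFollowingEnv(gym.Env):
    """Simplified Gym-style wrapper for a ROS/Gazebo person-following task (illustrative)."""

    def __init__(self, image_shape=(84, 84, 3), n_actions=5):
        super().__init__()
        # Observation: an RGB frame from the robot head camera
        self.observation_space = spaces.Box(0, 255, shape=image_shape, dtype=np.uint8)
        # Actions: discrete base commands (e.g. forward, turn left, turn right, stop, back up)
        self.action_space = spaces.Discrete(n_actions)

    def reset(self):
        # Real system: reset the Gazebo world and the actor trajectory,
        # then return the first camera frame received over ROS.
        return self._get_camera_frame()

    def step(self, action):
        # Real system: publish the velocity command matching `action`, wait for the
        # next frame, and reward the agent when the person stays centered in view
        # at a suitable distance.
        obs = self._get_camera_frame()
        reward, done = self._compute_reward(obs)
        return obs, reward, done, {}

    def _get_camera_frame(self):
        # Placeholder: in the real environment this frame comes from the camera topic.
        return np.zeros(self.observation_space.shape, dtype=np.uint8)

    def _compute_reward(self, obs):
        # Placeholder reward; the actual shaping depends on the tracking output.
        return 0.0, False
</pre>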
===3.1. Docker===
Docker is a software platform that allows one to build, test, and deploy applications quickly. Docker packages software into standardized units called containers that have everything the software needs to run, including libraries, system tools, code, and runtime. We managed to build a complex Docker container, which requires syncing many packages and tools that depend on each other. For instance, the TIAGo robot ROS package (tiago_public_ws) requires ROS Melodic, ROS Melodic requires Python 2.7, while on the RL side Gym-Gazebo requires the PyTorch machine learning framework and PyTorch requires Python 3.6. Thus we need to set up both Python 2 and Python 3 in the same system. Having an NVIDIA GPU available, we leverage the NVIDIA Container Toolkit, which allows users to build and run GPU-accelerated Docker containers.

GPU-enabled applications need access to both kernel-level device drivers and user-level CUDA libraries, and different applications may require different CUDA versions. One way to solve this problem is to install the GPU drivers inside the container and map the physical NVIDIA GPU device of the underlying Docker host (e.g., /dev/nvidia0) into the container. The problem with this approach is that the versions of the driver and libraries inside the container need to match precisely; otherwise, the application will fail. In such a case, users still have to worry about which drivers and libraries are installed on each host computer to ensure compatibility with containerized applications. NVIDIA Docker, instead, provides driver-agnostic CUDA images. This Docker plug-in enables GPU applications running in containers to share graphics acceleration devices on the Docker host without worrying about version mismatches between libraries and device drivers. In particular, we employed the rocker toolkit [15] which, building upon the nvidia-docker2 package, provides an easy way to run the Docker container with a graphical user interface and GPU acceleration. The structure of the Docker container is shown in detail in the diagram of Figure 2.

Figure 2: Scheme of the project framework in the Docker container (an NVIDIA Docker container wrapping ROS Melodic with the TIAGo APIs and Gazebo, and Gym-Gazebo with the Gym environment, PyTorch 1.8.0 on Python 3.6 and the actor-critic RL framework).

===3.2. ROS - Gazebo===
ROS is a collection of libraries, drivers, and tools for the effective development and building of robot systems. It has a Linux-like command tool, an inter-process communication system, and numerous application-related packages. The main features of the ROS infrastructure are:

• Nodes. The executable processes that participate in the communication. They can be programmed in C++ or Python. For this project, Python has been chosen for its speed and simplicity and for its built-in integration with the PyTorch deep learning library.
• Topics. The inter-process communication follows a publish/subscribe model and the communication channels are called topics. Each topic is defined by a specific name and a single type of message that can be posted to it. Nodes can subscribe to topics to receive the information published on them, or publish information for other nodes to receive.
• Callbacks. The interrupt service routines that are executed when a node subscribed to a topic detects that something has been published on that channel. In this routine the data processing is done, such as saving the position of the robot in the callback generated by the topic on which an odometry sensor publishes its readings (see the sketch after this list).
• Launch files. A class of files that run and manage multiple nodes for the robot and its sensors, the simulation environment and the data visualization software, such as Rviz or the Gazebo simulator. They are encoded in XML format.
• URDF (Unified Robot Description Format). Definition files of a robot, in which the links of the robot's kinematic structure are connected to each other by joints. They also define the dynamic, kinematic, visual and collision properties of the robot. Like launch files, they are written in XML.
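As a small illustration of nodes, topics and callbacks, the following rospy node subscribes to an odometry topic and stores the latest robot position in its callback. The topic name and node name are illustrative assumptions; the rospy calls themselves are standard.

<pre>
#!/usr/bin/env python
import rospy
from nav_msgs.msg import Odometry

last_position = None  # latest robot position, saved by the callback

def odom_callback(msg):
    # Callback executed every time a new Odometry message is published on the topic
    global last_position
    last_position = msg.pose.pose.position
    rospy.loginfo("Robot at x=%.2f y=%.2f", last_position.x, last_position.y)

if __name__ == "__main__":
    rospy.init_node("odom_listener")  # register the node with the ROS master
    rospy.Subscriber("/mobile_base_controller/odom", Odometry, odom_callback)
    rospy.spin()  # keep the node alive, serving callbacks
</pre>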
Gazebo is a simulator specifically designed for robotics. Its convenient design makes it possible to quickly test algorithms, robots and AI applications. In Gazebo all the elements present in reality are simulated: the sensors and actuators act according to the environment. In this work, Gazebo has been used for the design of the virtual indoor environment where all the experiments with the TIAGo robot are performed.

===3.3. TIAGo Robot===
To better fit our task of person following via a visual object tracking algorithm, we considered several robot platforms, such as TIAGo and TurtleBot. Differently from TIAGo, TurtleBot comes with many APIs, packages and open-source ROS plugins. We nevertheless chose the TIAGo platform: although it does not provide as many open-source APIs as TurtleBot, it is a more advanced and modern robot platform and its kinematic structure is better suited to our target. Since the aim of our work is to make a mobile robot learn, in an end-to-end manner, how to actively detect and track a person walking in an indoor space, the height of TIAGo is more suitable: the camera mounted on the head of the robot can capture the full body of the person to follow, thus providing better input data to the reinforcement learning algorithm. Moreover, TIAGo is a service robot specifically designed to work in indoor environments, so our application can be easily deployed on the real platform and in the real world. The main components of the TIAGo robot are described below (see Figure 3).

Figure 3: TIAGo robot with structural components highlighted.

• The mobile base uses a differential drive system and has a maximum speed of 1 m/s. It is designed for indoor operation. At the base is the laser sensor, whose sensing range varies with the model (iron, steel or titanium) between 5.6 meters and 25 meters. To detect what lies behind the robot, there are 3 sonars with a detection range of 1 meter.
• The body is the central part of the TIAGo robot and is made up of the arm and the torso. The torso has a prismatic joint that allows the height of the robot to be increased by 35 cm. The arm has 7 degrees of freedom, a length of 87 cm at its maximum extension, and a load capacity of 3 kg.
• The head comprises the neck, with 2 DoF that allow TIAGo to look in any direction. In place of the eyes there is an RGB-D camera that provides color and depth images, making it possible to recreate the environment using point clouds. The technology is the same as that used by the popular Kinect cameras; in particular, the robot is equipped with a built-in Asus Xtion camera.

PAL Robotics offers several compatible ROS packages and libraries that allow the TIAGo robot to perform complex perception, navigation, manipulation and human-robot interaction tasks. The platform is also equipped with a Jetson TX2 kit that guarantees power-efficient computing resources well suited to deep learning applications.

===3.4. Animated human model: actor===
The Gazebo simulator allows defining animated models (called "actors" in Gazebo), which are useful when one wants entities to follow predefined paths in simulation without being affected by the physics engine. They have a 3D visualization that can be seen by RGB cameras, and 3D meshes that can be detected by GPU-based depth sensors, making them suitable for computer vision applications. A closed-loop trajectory is defined for each of the considered train and test cases, and additional plugin files are used to control animations based on feedback from the environment. Actors extend common models, adding animation capabilities. There are two types of animation that can be used separately or combined together:

• Skeleton animation, which is relative motion between links in one model;
• Motion along a trajectory, which carries all of the actor's links around the world as one group.

Both types of motion can be combined to achieve a skeleton animation that moves in the world. Gazebo supports two different skeleton animation file formats: COLLADA (.dae) and Biovision Hierarchy (.bvh). The actor model defined in the project loads a COLLADA file described within the <skin> tag. Sometimes it is useful to combine different skins with different animations: Gazebo allows one to take the skin from one file and the animation from another file, as long as they have compatible skeletons. Scripted trajectories represent the high-level animation type of actors, which consists of specifying a series of poses to be reached at specific times. Gazebo takes care of interpolating the motion between them so that the movement is fluid. The trajectory is defined inside the .world file containing all the models which are visualized in simulation. Inside the