SLAM method: reconstruction and modeling of an environment with moving objects using an RGBD camera

Korney Tertychny, Volgograd State University, Volgograd, Russia, tertychny@volsu.ru
Dmitry Krivolapov, Volgograd State University, Volgograd, Russia, krivolapov@volsu.ru
Sergey Karpov, Volgograd State University, Volgograd, Russia, karpov.sergey@volsu.ru
Alexander Khoperskov, Volgograd State University, Volgograd, Russia, khoperskov@volsu.ru

Abstract

We analyze the high- and low-dynamic environment problems of Simultaneous Localization and Mapping (SLAM) methods, which we use for the automatic construction of 3D models. The constructed models are convenient for developing lifelike VR (virtual reality) systems of real places. Our software performs preliminary processing of the data before transferring it to the main SLAM method. The developed software allows the use of various computer vision algorithms or artificial neural networks to recognize objects that may change their location. The Microsoft Kinect sensor makes it easy to recognize movable objects, such as a human being, during testing.

1 Introduction

The development of navigation systems is one of the most important problems in robotic engineering. To make decisions on further actions, a robot needs information about the environment, the surrounding objects, and its own localization. There are many navigation methods based on odometry [13], inertial navigation, magnetometers, active labels (RFID, GPS) [7], and label-and-map matching. The Simultaneous Localization and Mapping (SLAM) approach is one of the most promising ways of navigation [6, 7]. Modern SLAM solutions provide mapping and localization in an unknown environment [2], and some of them can also be used for updating a map built earlier. The SLAM approach can be applied in mobile autonomous systems such as robots, cars, etc. SLAM is the only approach that works without any external information sources or preparation of the environment (for example, placing labels) [10]. SLAM is a general methodology for solving two problems [4, 9, 18]: 1) environment mapping and 3D model construction; 2) localization using the generated map and trajectory processing.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: Marco Schaerf, Massimo Mecella, Drozdova Viktoria Igorevna, Kalmykov Igor Anatolievich (eds.): Proceedings of REMS 2018 – Russian Federation & Europe Multidisciplinary Symposium on Computer Science and ICT, Stavropol – Dombay, Russia, 15–20 October 2018, published at http://ceur-ws.org

The efficacy of a SLAM solution determines the accuracy of navigation and environment reconstruction [13]. SLAM has been one of the main research directions in computer vision and robotic engineering for the last 25 years. It appears that all animals and human beings use this principle of navigation. However, modern SLAM solutions still have problems [6, 7]. One of them is the processing of a dynamic environment, where objects can move both in sight and out of sight. The problem becomes even more serious because a change in the position of objects may lead to a complete disorientation of the robotic system. Our research is focused on developing an algorithm that removes moving objects in order to increase the accuracy of 3D reconstruction of a real environment with the SLAM approach. There are four types of robots [13]: indoor, outdoor, underwater and airborne robots. Our solution is applicable to all four types, but it is most relevant for indoor and outdoor robots.
We have applied and tested it with indoor robots due to the technical features of our equipment.

Figure 1: Diagram of a common SLAM solution that uses an RGBD sensor

Figure 1 presents the common operation of a SLAM algorithm that uses an RGBD sensor. The output of an RGBD camera is an RGB image and a depthmap; the pixels of the depthmap contain the distance between the sensor and the object. The RGB image and the depthmap are transferred from the RGBD sensor to the SLAM front-end. A feature matching algorithm processes the RGB image, and the selected features are used for localization (see the yellow and green points in the "Feature matching" panel of Figure 1). ORB, SURF and SIFT are examples of such algorithms [6, 15]. Then the image with the selected features and the depthmap are resized, cut and calibrated. Calibration, cutting and resizing are required because the RGB image and the depthmap may have different resolutions and view angles, and the two cameras are located at some distance from each other [3]. Some SLAM solutions combine the RGB image, the depthmap and the feature information into a single structure. The data is then transferred to the SLAM back-end, where the localization of the camera and the features is performed. Afterwards the SLAM solution can build and locate a point cloud in the 3D scene. It can also produce 2D maps, a camera trajectory, and a feature map (features and their locations).

SLAM solutions based on a graph representation are usually subdivided into two parts: the front-end and the back-end. Using the sensor data, the front-end produces a preliminary solution in the form of a feature map. The back-end uses the data from the front-end and a probabilistic approach to produce the solution with maximum probability. As a rule, such an approach is faster and more reliable than filter-based approaches due to its better ability to cope with the nonlinearity of SLAM [4, 13].

In the classical SLAM scheme, when a SLAM algorithm starts working there is no knowledge of the surrounding environment, no preliminary map and no labels [4]. Thus, the solution is obtained only from sensor measurements. Another peculiarity is that the environment in which the robot works must be static: objects seen and memorized by the robot must not change their positions or properties [4]. Lighting conditions should also be taken into account, depending on the equipment used. All of these conditions are hardly achievable in a real environment. We can highlight several important SLAM problems that appear in the real world: a dynamic or changeable environment [9, 18] and the elimination of accumulated errors (see the discussion in [3, 13, 9, 18, 8, 12, 1, 17]). In the current work we focus on the first problem.
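To make the front-end stage of Figure 1 more concrete, the sketch below detects features on the RGB frame and keeps only those with a correctly measured distance, back-projecting them to 3D points. It is a minimal illustration rather than the pipeline of any particular SLAM solution: OpenCV's ORB is used as one of the detectors mentioned above, the depthmap is assumed to be already resized and calibrated to the RGB frame, and the intrinsics, depth scale and feature count are placeholder values.

```python
# Minimal sketch of the front-end stage from Figure 1 (not the exact pipeline
# of any particular SLAM solution): detect ORB features on the RGB frame and
# keep only those with a correctly measured depth, back-projecting them to 3D.
import cv2
import numpy as np

# Placeholder intrinsics of the depth-aligned RGB camera (assumed values,
# not calibration results from this work).
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5
DEPTH_SCALE = 0.001  # this sketch assumes the depthmap stores millimetres

def extract_features(rgb, depth):
    """rgb: HxWx3 uint8 image; depth: HxW uint16 depthmap aligned to the RGB frame."""
    orb = cv2.ORB_create(nfeatures=1000)
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    if descriptors is None:  # no features detected at all
        return np.empty((0, 3)), np.empty((0, 32), np.uint8)

    h, w = depth.shape
    points_3d, kept_descriptors = [], []
    for kp, desc in zip(keypoints, descriptors):
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if not (0 <= u < w and 0 <= v < h):
            continue
        z = depth[v, u] * DEPTH_SCALE
        if z <= 0:  # distance not measured: the feature is discarded
            continue
        # Back-project the pixel to a 3D point in the camera frame.
        x = (u - CX) * z / FX
        y = (v - CY) * z / FY
        points_3d.append((x, y, z))
        kept_descriptors.append(desc)
    return np.asarray(points_3d), np.asarray(kept_descriptors)
```

Only features that obtain a valid distance reach the back-end; as the next section shows, this detail strongly affects how harmful moving objects are.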
2 The dynamic environment problem

The emergence of moving and movable objects in an environment surveyed by a robot is a significant problem. We call an environment high-dynamic if objects move in sight and low-dynamic if objects move out of sight [9, 18]. The presence of such objects may cause errors in the resulting 3D model or a localization failure. Any object in a dynamic environment can thus be classified as moving either in sight or out of sight. Some SLAM solutions, such as DP-SLAM and RTAB-Map, can cope with small out-of-sight changes [14, 5]. DP-SLAM can include overlapping frames and, after some time passes, replace wrong frames with overlapping ones. RTAB-Map overwrites voxels, but the object does not disappear until all of its voxels are overwritten; in practice, some parts of the object are never eliminated, so it looks like a transparent object (see Figure 2). It is important that such objects can contain features that cause localization failures or errors.

Figure 2: Example of a "transparent" object (a human being) that has gone out of sight
Figure 3: Moving object traces in the point cloud

Figure 2 shows a point cloud made from a set of frames, half of which contain a human being who then left the frame. In this case some features are located on the now-absent object. This produces several errors, but the method still works because the greater number of features lie on static objects. The model is also extended with the parts that were invisible before the human being left; the place that was invisible looks like a shadow on the floor. If there are too many features on a defunct or replaced object, the robot cannot recognize that it is the same place, which may cause unexpected behavior [13, 9]. Any dramatic global change may cause such problems as well.

In a high-dynamic environment, sensors provide erroneous data about the environment because of the moving objects. If the moving objects carry no features, modern SLAM solutions have no problems with them; nevertheless, such objects leave traces in the point cloud (see Figure 3). All point cloud images in this article are rotated relative to their RGB and depth analogs for better visibility. In Figure 3 there are many traces and features (yellow dots) left by a moving object, and a significant number of features are not placed on static objects. Figures 4 and 5 show how features are located on an image containing a human being. A lighter area on the RGBD image shows where the distance was correctly measured; only points with a correct distance are placed into the point cloud.

Figure 4: Example of feature matching on the RGB image with the human being
Figure 5: Example of feature matching on the RGBD image (RGB + depthmap) with the human being

On the RGB image, approximately 30% of the features are located on the human being. On the RGBD image the human being already accounts for more than 50% of the features, because all features without a correct distance (not present on the RGBD image) are discarded. A human being is usually well separated from the environment, so the distance to him can be correctly measured; a human being is also a good source of features. When several human beings appear in the frame, incorrect operation of SLAM solutions is practically guaranteed. On the RGB image, the colors and sizes of the circles indicate the reliability level and the location accuracy of the features: the smaller the radius, the more accurately the feature is determined.

It should be mentioned that in the current work we consider a human being as an object that can change its position, since it is easy to locate a human being using Kinect. All the argumentation above remains valid for other movable and moving objects, such as animals, cars, papers, chairs, etc. None of these objects is a reliable source of features.

Solutions for the dynamic environment problem have already been offered. The authors of [9] propose to build static and dynamic maps separately for a high-dynamic environment; localization is performed using the static map. Not real objects but only moving elements of the frame are detected: if a pixel has changed its color, it is considered to belong to a moving object and is removed from the static map. This method does not work in a low-dynamic environment, where objects change their position out of sight.
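As a schematic illustration of this pixel-change criterion (our own sketch, not the implementation of [9]), the code below marks pixels whose intensity changes noticeably between consecutive frames and withholds them when updating the static map; the threshold is an assumed value.

```python
# Schematic illustration of a pixel-change criterion (our interpretation, not
# the implementation of [9]): pixels whose intensity changes noticeably between
# consecutive frames are treated as moving and withheld from the static map.
import cv2
import numpy as np

CHANGE_THRESHOLD = 30  # assumed intensity-difference threshold

def moving_pixel_mask(prev_rgb, curr_rgb):
    """Return a boolean mask that is True where the image changed between frames."""
    prev = cv2.cvtColor(prev_rgb, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(curr_rgb, cv2.COLOR_BGR2GRAY)
    return cv2.absdiff(prev, curr) > CHANGE_THRESHOLD

def update_static_map(static_map, curr_rgb, moving):
    """Copy only the unchanged pixels into the static map; moving pixels are skipped."""
    updated = static_map.copy()
    updated[~moving] = curr_rgb[~moving]
    return updated
```

A criterion of this kind only fires while the object is actually moving in front of the camera, which is exactly why it cannot handle objects that change their position out of sight.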
In Ref. [18] a framework solving the problem of a low-dynamic environment has been proposed. In that work the environment changes between sessions; the algorithm finds outdated frames from the last session and replaces them with fresher ones. However, the algorithm does not treat moving objects within a frame, which leads to instability when a moving object carries too many features. Both of these solutions address the problem in the back-end part.

3 Modified method

We propose a solution that works with both high- and low-dynamic environments in the front-end. It processes the data before transferring it to the main SLAM method; since it only transforms the input data, it can be combined with different back-ends. Our solution removes movable objects from the data stream. Unlike the solution presented in Ref. [9], our algorithm can also detect objects that are not currently moving but are able to move.

In this article we use human beings as the objects to be removed. This type of object has been chosen because it is easy to locate people using the Kinect sensor and the Kinect SDK, which provides a body frame stream. We use the body frame to remove the image of the human being from the data streams: we overwrite the distance as invalid (less than the sensor minimum) in the pixels where a human is located. This is displayed as a red human silhouette in Figure 6; the red color was chosen only to make the image more visible.

Figure 6: The depthmap frame modification

After this processing, our algorithm transfers the RGB and depthmap frames to the feature matching algorithm and then to the SLAM back-end. Examples of solutions that use RGB, depthmap and feature-map input are RTAB-Map (Real-Time Appearance-Based Mapping) and DP-SLAM (Distributed Particle SLAM). After several tests we noticed that some pixels belonging to the removed objects sometimes remain outside the overwritten area; we determined that such a band of 0 to 3 pixels can remain around the overwritten region. Figure 7 shows the sequence of operations of the updated SLAM method. Our algorithm processes data from any RGBD sensor, overwrites unwelcome objects on the depth frames, and transfers the data to the main SLAM solution. It should be mentioned that in the "Object rejection" stage we can remove any unwelcome objects that can be detected by computer vision algorithms or artificial neural networks (a minimal sketch of this stage is given below).

4 Results and discussion

In the current work we have described an algorithm that performs preliminary data processing to remove objects which may cause errors in simultaneous localization and mapping. Examples of such objects are people, cars, animals and other movable or moving objects. We tested the developed software using a Kinect 2 sensor with RTAB-Map as the SLAM solution. We remove a human being, as a sample moving object, from the data provided by the sensor, since the Microsoft Kinect SDK provides simple and convenient access to human contours. We use the output point cloud for our VR (virtual reality) system, which provides virtual tours of reconstructed real places. Figure 9 shows the output point cloud of our hardware and software system. Our algorithm creates a point cloud with no erroneous data caused by moving objects. If some place is hidden by the removed objects, there will be an empty space in the point cloud; after it becomes visible, the system adds the missing information. If the removed objects occupy such a significant part of the field of view that the SLAM solution cannot localize, it may report an error and continue localizing after the object moves away. There is no problem as long as there are enough features outside the moving object.
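Returning to the "Object rejection" stage of Section 3, the following is a minimal sketch of the depthmap rewriting under stated assumptions: the object mask is a generic boolean array (with Kinect it would be derived from the body index frame, which is not shown here), the invalid value marks the distance as "not measured", and OpenCV's dilate is used to grow the mask by the 3-pixel border discussed above.

```python
# Minimal sketch of the "Object rejection" stage (Figure 7): pixels covered by
# the detected object mask are overwritten with an invalid depth value, so the
# SLAM back-end never receives features located on the object. The mask here is
# a generic boolean array; with Kinect it would come from the body index frame.
import cv2
import numpy as np

INVALID_DEPTH = 0   # below the sensor minimum, i.e. "distance not measured"
BORDER_PIXELS = 3   # extra pixels overwritten around the object (see the text)

def reject_objects(depth, object_mask):
    """depth: HxW uint16 depthmap; object_mask: HxW bool, True on unwelcome objects."""
    # Grow the mask by BORDER_PIXELS so the thin band of leftover pixels around
    # the object is overwritten as well.
    size = 2 * BORDER_PIXELS + 1
    kernel = np.ones((size, size), np.uint8)
    grown = cv2.dilate(object_mask.astype(np.uint8), kernel) > 0
    cleaned = depth.copy()
    cleaned[grown] = INVALID_DEPTH   # these pixels will never produce features
    return cleaned
```

Because the back-end uses only features with a correctly measured distance, rewriting the depthmap in this way is enough to keep features off the removed object.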
As before, the yellow and green dots in Figure 9 indicate features; SLAM solutions such as RTAB-Map use features for localization and place memorization. While testing we found that some pixels around the removed object may remain unoverwritten by our algorithm, so we modified the solution to overwrite at least 3 more pixels around the detected objects. There is an empty space that looks like a shadow cast by the human being; the standard SLAM solution does not add points there (see Figure 8). Without moving the camera, the algorithm continues to assume that the point cloud is complete and correct and that the empty place is simply out of view.

Figure 7: The scheme of operation of the modified SLAM method

The key feature of our algorithm is the ability to create a list of unwelcome objects that should not be transmitted to the SLAM input. We can use different computer vision algorithms or artificial neural networks to recognize objects that can change their position, and then such objects can be removed, as has been shown in this article using a human being as an example. Our algorithm is able to remove any recognizable object. In addition to this modular principle, the method can be combined with various solutions that implement the SLAM approach.

Figure 8: Standard SLAM solution output point cloud. The silhouette of a human is clearly seen. Green and yellow points are features used by the method for localization.
Figure 9: Modified SLAM solution output point cloud. There are no features tied to non-stationary objects.

The tests of the developed algorithm give satisfactory results, completely excluding undesirable objects from the input data of the main SLAM algorithm. The output of a SLAM solution without the proposed algorithm is shown in Figure 8: the point cloud contains a large number of features that are not tied to real objects, and such features are not a reliable source of information about the environment. Figure 9 shows the point cloud obtained with the proposed algorithm; it does not contain the erroneous points that otherwise appear due to the presence of a moving object. Tests show that it is enough to remove the object image from the depthmap, because SLAM solutions use only the features with a correctly measured distance for localization and mapping. Furthermore, removing one object such as a human being does not cause a performance decline on a medium-power personal computer.

Acknowledgements

Work has been supported by the Ministry of Education and Science of the Russian Federation (government task No. 2.852.2017/4.6).

References

[1] Bakkay, M.C., Arafan, M., Zagrouba, E.: Dense 3D SLAM in dynamic scenes using Kinect. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9117, pp. 121-129 (2015)
[2] Cao, F., Zhuang, Y., Zhang, H., Wang, W.: Robust Place Recognition and Loop Closing in Laser-Based SLAM for UGVs in Urban Environments. IEEE Sensors Journal 18(10), pp. 4242-4252 (2018) https://doi.org/10.1109/JSEN.2018.2815956
[3] Civera, J., Bueno, D., Davison, A., Montiel, J.: Camera self-calibration for sequential Bayesian structure from motion. IEEE International Conference on Robotics and Automation 2009, pp. 403-408. IEEE (2009)
[4] Cummins, M., Newman, P.: Appearance-only SLAM at Large Scale with FAB-MAP 2.0. The International Journal of Robotics Research 30, pp. 1100-1123 (2010) https://doi.org/10.1177/0278364910385483
[5] DP-SLAM Developers, https://github.com/jordant0/DP-SLAM. Last accessed 27 March 2018
[6] Durrant-Whyte, H., Bailey, T.: Simultaneous localization and mapping: Part I. IEEE Robotics and Automation Magazine 13(2), pp. 99-110 (2006) https://doi.org/10.1109/MRA.2006.1638022
[7] Durrant-Whyte, H., Bailey, T.: Simultaneous localization and mapping: Part II. IEEE Robotics and Automation Magazine 13(3), pp. 108-117 (2006) https://doi.org/10.1109/MRA.2006.1678144
[8] Endres, F., Hess, J., Sturm, J., Cremers, D., Burgard, W.: 3-D Mapping with an RGB-D camera. IEEE Transactions on Robotics 30(1), pp. 177-187 (2014) https://doi.org/10.1109/TRO.2013.2279412
[9] Hahnel, D., Triebel, R., Burgard, W., Thrun, S.: Map building with mobile robots in dynamic environments. IEEE International Conference on Robotics and Automation 2003, pp. 1557-1563. IEEE (2003) https://doi.org/10.1109/ROBOT.2003.1241816
[10] Kim, P., Chen, J., Cho, Y.K.: SLAM-driven robotic mapping and registration of 3D point clouds. Automation in Construction 89, pp. 38-48 (2018) https://doi.org/10.1016/j.autcon.2018.01.009
[11] Lachat, E., Macher, H., Landes, T., Grussenmeyer, P.: Assessment and calibration of a RGB-D camera (Kinect v2 Sensor) towards a potential use for close-range 3D modeling. Remote Sensing 7(10), pp. 13070-13097 (2015) https://doi.org/10.3390/rs71013070
[12] Majdi, A., Bakkay, M., Zagrouba, E.: 3D modeling of indoor environments using Kinect sensor. IEEE 2nd International Conference on Image Information Processing 2013, pp. 67-72. IEEE (2013) https://doi.org/10.1109/ICIIP.2013.6707557
[13] Nister, D., Naroditsky, O., Bergen, J.: Visual Odometry. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2004, vol. 1, pp. 652-659. IEEE (2004) https://doi.org/10.1109/CVPR.2004.1315094
[14] RTAB-Map Developers, http://introlab.github.io/rtabmap/. Last accessed 27 March 2018
[15] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. IEEE International Conference on Computer Vision 2011, pp. 2564-2571. IEEE (2011) https://doi.org/10.1109/ICCV.2011.6126544
[16] Shao, B., Yan, Z.: 3D Indoor Environment Modeling and Detection of Moving Object and Scene Understanding. Transactions on Edutainment XIV, pp. 40-55 (2011)
[17] Walcott-Bryant, A., Kaess, M., Johannsson, H., Leonard, J.J.: Dynamic pose graph SLAM: Long-term mapping in low dynamic environments. IEEE International Conference on Intelligent Robots and Systems 2012, pp. 1871-1878. IEEE (2012) https://doi.org/10.1109/IROS.2012.6385561
[18] Wang, Y., Huang, S., Xiong, R., Wu, J.: A framework for multisession RGBD SLAM in low dynamic workspace environment. CAAI Transactions on Intelligence Technology 1, pp. 90-103 (2016) https://doi.org/10.1016/j.trit.2016.03.009
[19] Zhang, Z.: Microsoft Kinect sensor and its effect. IEEE Multimedia 19(2), pp. 4-10 (2012) https://doi.org/10.1109/MMUL.2012.24