=Paper=
{{Paper
|id=Vol-3248/paper21
|storemode=property
|title=A Robust Visual-Inertial Odometry Based on RANSAC Modeling and Motion Conflict in Dynamic Environments
|pdfUrl=https://ceur-ws.org/Vol-3248/paper21.pdf
|volume=Vol-3248
|authors=Shenglin Zhao,Haoyuan Cai,Yaqian Liu,Chengnie Liao,Chunxiu Liu
|dblpUrl=https://dblp.org/rec/conf/ipin/ZhaoCLLL22
}}
==A Robust Visual-Inertial Odometry Based on RANSAC Modeling and Motion Conflict in Dynamic Environments==
Shenglin Zhao 1,2, Haoyuan Cai 1, Yaqian Liu 1,2, Chengnie Liao 1,2 and Chunxiu Liu 1

1 State Key Laboratory of Transducer Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Haidian District, Beijing, 100190, China
2 University of Chinese Academy of Sciences, Beijing, 100086, China

Abstract

Visual-inertial navigation systems (VINS) face challenges in highly dynamic environments. Current mainstream solutions filter dynamic objects based on the semantics of the object category. Such approaches require semantic classifiers to encompass all possibly-moving object classes, which makes them hard to scale and deploy. This paper proposes a dynamic feature point recognition method that requires no prior training. It uses the conflict between the IMU pre-integration and the visual measurement to determine whether the RANSAC-modeled essential matrix lies in the static world or on a dynamic object, and it filters out dynamic feature points when there is one primary moving object. We add a moving object to the visual field as interference and build an artificial dataset based on the EuRoC dataset. Experiments show that after adding the interference, the error of VINS-Mono increases by about 12 times, while the error of our dynamic-VIO increases by only about 1.4 times. When VINS-Mono diverges due to the interference, dynamic-VIO remains robust. The method handles both objects moving through the environment and partial occlusion of the field of view.

Keywords

visual-inertial odometry, dynamic environment, dynamic points recognition, obstructed view

IPIN 2022 WiP Proceedings, September 5–7, 2022, Beijing, China
EMAIL: shurlinzhao@foxmail.com (A. 1); hycai@mail.ie.ac.cn (A. 2); liuyaqian20@mails.ucas.ac.cn (A. 3); liaochengnie21@mails.ucas.ac.cn (A. 4); cxliu@mail.ie.ac.cn (A. 5)
ORCID: 0000-0001-7949-8232 (A. 1); 0000-0001-8822-2756 (A. 2); 0000-0002-3589-4353 (A. 3); 0000-0002-5960-8710 (A. 4); 0000-0003-3718-9364 (A. 5)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Most typical visual SLAM systems rely on the strong assumption that the scene is rigid [1]. The Visual Inertial Navigation System (VINS), widely applied to autonomous driving and micro aerial vehicles (MAVs) [2, 3, 4], is usually more robust than purely visual SLAM owing to its tightly coupled optimization with IMU participation, but it still depends on the rigidity assumption, and visually dynamic environments remain a significant challenge. The mainstream approach uses semantic segmentation to identify and filter dynamic feature points. However, the semantics of a model trained with supervised learning are limited, while the dynamic interference in real environments is complex and changeable: in addition to rigid dynamic objects, there are flexible dynamic disturbances and irregularly shaped occlusions of the field of view. In such cases, maintaining robustness is the first requirement. This paper presents a robust VIO that requires no training; it identifies and filters dynamic feature points based on RANSAC modeling when the motion estimated from inertial information conflicts with the visual measurement. The method stays robust when there is a single main dynamic object in the field of view.

2. Related Work

Most approaches use supervised deep learning methods to recognize moving objects and filter them.
[5] proposes to learn the rigidity of a scene in a supervised manner from an extensive collection of dynamic scene data and to infer a rigidity mask directly from two sequential images with depth. [6] proposes a generalized Hidden Markov Model (HMM) that formulates the tracking and management of the primary motion and the secondary motion as a single estimation problem, together with an epipolar-constrained deep neural network that generates a per-pixel motion conflict probability map. It can distinguish the motion conflict of each landmark point at the pixel level, so it not only identifies the feature points belonging to the main motion but also estimates the motion of other dynamic objects, realizing the simultaneous tracking of multiple motions.

Among the deep learning methods, semantic segmentation networks (e.g., Mask R-CNN, SegNet) are popular because of their scalability, intuitiveness, and good performance. [7] constructs an SSD object detector that combines prior knowledge to detect dynamic objects at the semantic level in a new detection thread based on a convolutional neural network; it also proposes a missed-detection compensation algorithm based on speed invariance between adjacent frames to improve detection recall. [1] presents RDS-SLAM, a real-time visual dynamic SLAM built on ORB-SLAM3 that adds a semantic thread and a semantic-based optimization thread for robust tracking and mapping in dynamic environments in real time; the parallel threads allow the tracking thread to avoid waiting for semantic information. [8] proposes a dynamic RGB-D SLAM based on semantic information and optical flow (DRSO-SLAM), which uses the Mask R-CNN semantic segmentation network to obtain semantic information in indoor dynamic scenes and uses epipolar geometry to filter out the actual dynamic feature points.

However, methods based on pre-training are difficult to extend and deploy, a problem many researchers are aware of. The Autonomous Systems Lab at ETH Zurich presents a novel end-to-end occupancy-grid-based pipeline that automatically labels a wide variety of arbitrary dynamic objects; it thus generalizes to different environments without expensive manual labeling and avoids assumptions about a predefined set of known objects in the scene [9].

There are also methods that are not based on deep learning. [10] regards structural regularities in the form of planes, such as walls and ground surfaces, as reliably static and presents a robust plane-based monocular VIO (RP-VIO), which improves robustness and accuracy in challenging dynamic environments. [11] adds adaptive multi-resolution range images and uses tightly coupled lidar-inertial odometry to first remove moving objects and then match the lidar scan to the submap. The present paper likewise focuses on making better use of the IMU to make VIO more robust.

Figure 1: The pipeline of the dynamic-VIO algorithm

3. Method

This work is based on VINS-Mono [12]. The main idea is that when the motion perceived by the IMU conflicts with the visual measurement, we choose to believe the IMU measurement. This does not mean using the IMU measurement and its integration directly as the output; rather, the IMU's indication is used to invert the selection of feature points so that more reliable features are provided to the back end. Figure 1 shows the algorithm pipeline: the IMU and the camera each obtain their measured motion (only the rotation is used here) through pre-integration and visual geometry, respectively, and the two are compared. If the difference is slight, the feature tracker publishes feature points normally; if the difference cannot be ignored, it selects the feature points inversely and publishes the outliers. A minimal sketch of this decision is given below.
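The following is a minimal sketch of the decision in Figure 1, not the authors' implementation. It assumes the relative rotations over one image interval, q_imu and q_cam, and the RANSAC inlier/outlier point sets are already available; the 5-degree threshold and the function names are hypothetical placeholders, and the angle of the relative rotation is used here in place of the paper's element-wise quaternion comparison.

```python
import numpy as np

def quat_angle_diff(q_imu, q_cam):
    """Angle (rad) of the relative rotation between two unit quaternions (w, x, y, z)."""
    # For unit quaternions, the relative rotation angle is 2*arccos(|<q1, q2>|).
    dot = abs(float(np.dot(q_imu, q_cam)))
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))

def select_points_to_publish(q_imu, q_cam, inliers, outliers,
                             threshold_rad=np.deg2rad(5.0)):
    """If the visual and inertial rotations agree, trust the RANSAC inliers;
    otherwise RANSAC has likely modeled the dominant moving object,
    so the outliers are the static-world points and are published instead."""
    if quat_angle_diff(q_imu, q_cam) <= threshold_rad:
        return inliers   # essential matrix was modeled on the static world
    return outliers      # essential matrix was modeled on the moving object
```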
3.1. IMU pre-integration

The sampling rate of the camera is lower than that of the IMU, and the IMU integration can only take part in the back-end optimization when a new camera sample arrives. We therefore pre-integrate the IMU measurements over one image sampling period in the sensor coordinate system, which saves a significant amount of computational resources [13]. Another purpose of pre-integration is to evaluate the IMU terms at the camera sampling instants, so the IMU measurements before each such instant are accumulated according to (1), (2), and (3):

$\boldsymbol{\alpha}_{b_k b_{k+1}} = \iint_{t \in [k,\,k+1]} \left(\mathbf{R}^{b_k}_{t}\,\mathbf{a}_{t}\right) \delta t^{2}$,  (1)

$\boldsymbol{\beta}_{b_k b_{k+1}} = \int_{t \in [k,\,k+1]} \left(\mathbf{R}^{b_k}_{t}\,\mathbf{a}_{t}\right) \delta t$,  (2)

$\boldsymbol{\gamma}_{b_k b_{k+1}} = \int_{t \in [k,\,k+1]} \boldsymbol{\gamma}^{b_k}_{t} \otimes \begin{bmatrix} 1 \\ \tfrac{1}{2}\,\boldsymbol{\omega}_{t}\,\delta t \end{bmatrix}$.  (3)

3.2. Timestamp Alignment between Images and IMU

The frequency of the IMU is usually higher than that of the camera, and the sampling timestamps are not perfectly synchronized. As shown in Figure 2, there is always one IMU measurement on each side of the sampling time $t_k$ of an image frame on the time axis, so we interpolate the IMU measurements to align the timestamps. The acceleration $\mathbf{a}$ is interpolated linearly (LERP) as in (4) and (5), and the angular velocity $\boldsymbol{\omega}$ uses spherical linear interpolation (SLERP) as in (6) and (7).

Figure 2: Timestamps of IMU and camera sampling

$\mathbf{a}_{t_k} = (1-\lambda)\,\mathbf{a}_{t_1} + \lambda\,\mathbf{a}_{t_2}$,  (4)

$\lambda = \dfrac{t_k - t_1}{t_2 - t_1}$,  (5)

$\boldsymbol{\omega}_{t_k} = \dfrac{\sin[(1-\lambda)\theta]}{\sin\theta}\,\boldsymbol{\omega}_{t_1} + \dfrac{\sin(\lambda\theta)}{\sin\theta}\,\boldsymbol{\omega}_{t_2}$,  (6)

$\theta = \arccos(\boldsymbol{\omega}_{t_1} \cdot \boldsymbol{\omega}_{t_2})$.  (7)

Considering computational efficiency, LERP can also be used for $\boldsymbol{\omega}$ in the implementation: the gyroscope sampling rate is high (200 Hz), so the angle change between samples is minimal, and, as shown later, the IMU pre-integration is only used to decide which motion is the primary one, so the exact interpolated value is not needed and the difference between LERP and SLERP can be ignored. We then pre-integrate the acceleration and angular velocity at time $t_k$ and obtain the true pre-integration between the two frames. The pre-integrated quaternion q_IMU is used for the later determination.

3.3. Inliers and Outliers by RANSAC

We use feature point detection and matching to associate two images. Once the two images are matched, we compute the essential matrix E and decompose it by SVD. The essential matrix E is solved with Random Sample Consensus (RANSAC), which finds the model that agrees with the most points by repeatedly fitting models to randomly selected subsets of the data that are assumed to be inliers. This yields the inlier points, which fit the final model, and the outlier points, which do not. Both inliers and outliers are essential to our algorithm, as the next section explains. After triangulation, the one valid pair R, t is selected from the four solutions given by the SVD, as sketched below.
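As a rough illustration of this step (not the pipeline actually used in the paper, which builds on VINS-Mono's feature tracker), the essential matrix, the recovered rotation, and the inlier/outlier split can be obtained with OpenCV's RANSAC-based estimator. The matched pixel coordinate arrays and the intrinsic matrix K are assumed inputs.

```python
import cv2
import numpy as np

def essential_with_inliers(pts1, pts2, K):
    """Estimate E by RANSAC, recover (R, t), and split the matches into
    inliers (points agreeing with the dominant model) and outliers.
    pts1, pts2: (N, 2) float arrays of matched pixel coordinates; K: 3x3 intrinsics."""
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # recoverPose performs the cheirality (triangulation) check that selects
    # the one valid (R, t) among the four decompositions of E.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    keep = mask.ravel() == 1
    inliers = (pts1[keep], pts2[keep])
    outliers = (pts1[~keep], pts2[~keep])
    return E, R, t, inliers, outliers
```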
At last, we transform the rotation matrix R into the quaternion q_camera for the later determination.

3.4. Feature Points Publishing

Normally, the quaternions obtained from the camera and from the IMU have similar element values. If so, the RANSAC algorithm has modeled the visually obtained essential matrix on the correct scene, which also means that there is no very prominent dynamic object. If this is not the case, that is, q_IMU and q_camera differ greatly, RANSAC must have built the model on the wrong object, because the IMU measurements are more reliable than the camera measurements under these circumstances. At that point, most of the field of view is occluded by a dynamic object, and the outliers obtained by RANSAC are actually the points located in the static world. Under this circumstance, we publish the outlier points to the back end instead of the inlier points.

Figure 3: A dynamic points recognition example. Red points are recognized as dynamic points, and the blue points lie in the static world.

4. Experiment

We overlaid a leaf image on the original images of the EuRoC dataset [14] and generated a new dataset called EuRoC_mask. The EuRoC ground truth is obtained from a Leica Nova MS50 laser tracker and a Vicon motion capture system, and the length of the test path is about 80 m. The new datasets simulate a leaf gradually moving from the left end of the image to the right and then disappearing past the right edge of the view, as shown in Figure 4. The experiment environment is an Intel Core i5-9400F (6 × 2.90 GHz), Ubuntu 16.04, and ROS Kinetic.

Figure 4: EuRoC dataset masked by a moving leaf

We run VINS-Mono (without loop closure) and the proposed dynamic-VIO (without loop closure) on these datasets, with the number of detected and tracked feature points set to 150. Table 1 shows the root mean square error (RMSE) of the absolute trajectory error (ATE). On the masked datasets, the positioning error of VINS-Mono increases significantly, while dynamic-VIO still keeps the error within a small range.

Table 1: Positioning accuracy of the two algorithms on different datasets, absolute trajectory error RMSE (m)

Dataset               VINS-Mono   Dynamic-VIO
MH01_easy_mask        1.563484    0.215349
MH03_medium_mask      3.482319    0.348051
MH05_difficult_mask   2.339641    0.335812
V1_02_medium_mask     2.03004     0.227272
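For reference, the ATE RMSE metric reported in Table 1 can be computed as in the following generic sketch; this is not the evaluation script used in the paper, and it assumes the estimated and ground-truth positions are already time-associated and expressed in a common, aligned frame.

```python
import numpy as np

def ate_rmse(est_positions, gt_positions):
    """RMSE of the absolute trajectory error over associated, aligned poses.
    Both inputs are (N, 3) arrays of positions in the same reference frame."""
    errors = np.linalg.norm(est_positions - gt_positions, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))
```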
In addition, a typical case is that when VINS-Mono diverges, as shown in Figure 5, dynamic-VIO can still maintain convergence, as shown in Figure 6.

Figure 5: VINS-Mono trajectory on V2_02_mask

Figure 6: The positioning trajectories of VINS-Mono and dynamic-VIO on V2_02 and V2_02_mask

This research also plans to test flexible dynamic objects and static occlusion of the field of view. Theoretically, the method should also handle flexible objects, because the RANSAC modeling process does not depend on object characteristics: as long as the motions sampled by the IMU and by the camera differ, the primary dynamic object can be separated from the static world. For static occlusion of the field of view, the outcome depends on the degree of occlusion. A small occluded region can be regarded as a special case of this method, and its feature points are filtered out in every calculation period. If the occluded region is too large, so that too few feature points are collected for a long time, the method also fails. Moreover, the binary judgment between inertial and visual information means that this method can recognize only the largest dynamic object, which follows from RANSAC selecting the model satisfied by the most feature points. If there are multiple dynamic objects of similar size in the environment, the method will fail; in that case it does no better than semantic segmentation-based approaches.

5. Conclusion

This paper proposed a robust VIO for dynamic environments. Based on the principle of RANSAC modeling and the level of motion conflict, the algorithm flexibly chooses whether to publish the inlier or the outlier points, which avoids the interference of the major dynamic object. Experiments on artificial datasets show that the method stays robust and keeps the trajectory error at the same level, whereas VINS-Mono fails or its positioning error increases significantly. Theoretically, the method applies to rigid and flexible dynamic objects as well as to occlusion of the field of view. It needs no training data and can be conveniently deployed when there is one major dynamic object and high robustness is required.

6. Acknowledgments

This work is supported by the National Key Research and Development Plan (2020YFC2004501 and 2020YFC2004503), the National Natural Science Foundation of China (NSFC) (Nos. 61774157 and 81771388), and the Beijing Natural Science Foundation (No. 4182075).

7. References

[1] Y. Liu and J. Miura, "RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods," IEEE Access, vol. 9, pp. 23772–23785, 2021.
[2] Z. Yu, L. Zhu, and G. Lu, "Tightly-coupled Fusion of VINS and Motion Constraint for Autonomous Vehicle," IEEE Trans. Veh. Technol., pp. 1–1, 2022.
[3] K. J. Wu, C. X. Guo, G. Georgiou, and S. I. Roumeliotis, "VINS on wheels," in 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, May 2017, pp. 5155–5162.
[4] M. A. K. Gomaa, O. De Silva, G. K. I. Mann, and R. G. Gosine, "Observability-Constrained VINS for MAVs Using Interacting Multiple Model Algorithm," IEEE Trans. Aerosp. Electron. Syst., vol. 57, no. 3, pp. 1423–1442, Jun. 2021.
[5] Z. Lv, K. Kim, A. Troccoli, D. Sun, J. M. Rehg, and J. Kautz, "Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation," in Computer Vision – ECCV 2018, Cham, 2018, vol. 11209, pp. 484–501.
[6] B. P. Wisely Babu, Z. Yan, M. Ye, and L. Ren, "On Exploiting Per-Pixel Motion Conflicts to Extract Secondary Motions," in 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, Oct. 2018, pp. 46–56.
[7] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robotics and Autonomous Systems, vol. 117, pp. 1–16, Jul. 2019.
[8] N. Yu, M. Gan, H. Yu, and K. Yang, "DRSO-SLAM: A Dynamic RGB-D SLAM Algorithm for Indoor Dynamic Scenes," in 2021 33rd Chinese Control and Decision Conference (CCDC), Kunming, China, May 2021, pp. 1052–1058.
[9] P. Pfreundschuh, H. F. C. Hendrikx, V. Reijgwart, R. Dube, R. Siegwart, and A. Cramariuc, "Dynamic Object Aware LiDAR SLAM based on Automatic Generation of Training Data," p. 7.
[10] K. Ram, C. Kharyal, S. S. Harithas, and K. Madhava Krishna, "RP-VIO: Robust Plane-based Visual-Inertial Odometry for Dynamic Environments," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, Sep. 2021, pp. 9198–9205.
[11] "RF-LIO: Removal-First Tightly-coupled Lidar Inertial Odometry in High Dynamic Environments," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, Sep. 2021, pp. 4421–4428.
[12] T. Qin, P. Li, and S. Shen, "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator," IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, Aug. 2018.
[13] S. Shen, N. Michael, and V. Kumar, "Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft MAVs," in 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, May 2015, pp. 5303–5310.
[14] M. Burri et al., "The EuRoC micro aerial vehicle datasets," The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, Sep. 2016.