<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Robust Visual-Inertial Odometry Based on RANSAC Modeling and Motion Conflict in Dynamic Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shenglin Zhao</string-name>
          <email>shurlinzhao@foxmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoyuan Cai</string-name>
          <email>hycai@mail.ie.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaqian Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chengnie Liao</string-name>
          <email>liaochengnie21@mails.ucas.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chunxiu Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>State Key Laboratory of Transducer Technology, Aerospace Information Research Institute, Chinese Academy of Sciences</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing, 100086</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Chinese Academy of Sciences</institution>
          ,
          <addr-line>Haidian District, Beijing, 100190</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Visual-inertial navigation systems (VINS) face challenges in highly dynamic environments. Current mainstream solutions filter dynamic objects based on the semantics of the object category. Such approaches require semantic classifiers that cover every possibly-moving object class, which makes them hard to scale and deploy. This paper proposes a dynamic feature point recognition method that requires no prior training. It uses the conflict between the IMU pre-integration and the visual measurement to determine whether the RANSAC-modeled essential matrix lies on the static world or on a dynamic object, which allows dynamic feature points to be filtered out when there is one primary moving object. We add a moving object to the visual field as interference and build an artificial dataset based on the EuRoC dataset. Experiments show that after adding the interference, the error of VINS-Mono increases by about 12 times, while our dynamic-VIO increases by only about 1.4 times. When VINS-Mono diverges due to the interference, dynamic-VIO remains robust. The method handles both dynamic objects moving in the environment and partial occlusion of the field of view.</p>
        <p>Keywords: visual-inertial odometry, dynamic environment, dynamic point recognition, obstructed view.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Most typical visual SLAM systems rely on the strong assumption that the scene is rigid [1]. The Visual-Inertial Navigation System (VINS), widely applied to autonomous driving and MAVs [2, 3, 4], is usually more robust than purely visual SLAM owing to its tightly coupled optimization with IMU participation, but it is still bound to the rigidity assumption. Visually dynamic environments therefore remain a significant challenge.</p>
      <p>The mainstream method uses semantic segmentation to identify and filter dynamic feature points. However, the semantics of a model trained by supervised learning are limited, while the dynamic interference in real environments is complex and changeable: besides rigid dynamic objects, there are non-rigid dynamic interference and visual field occlusions of irregular shape. In such cases, maintaining robustness is the first requirement. This paper presents a non-training-based robust VIO that identifies and filters dynamic feature points based on RANSAC modeling when the motions estimated from inertial and visual information conflict. The method remains robust when there is a single main dynamic object in the field of view.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Most approaches use supervised deep learning to recognize moving objects and filter them out. [5] proposes to learn the rigidity of a scene in a supervised manner from an extensive collection of dynamic scene data and to infer a rigidity mask directly from two sequential images with depth. [6] proposes a generalized Hidden Markov Model (HMM) that formulates the tracking and management of the primary motion and the secondary motion as a single estimation problem, together with an epipolar-constrained deep neural network that generates a per-pixel motion conflict probability map, distinguishing the motion conflict of each feature point at the pixel level. This method can not only identify the feature points belonging to the main motion but also compute the motion of the other dynamic objects, realizing simultaneous tracking of multiple motions.</p>
      <p>Among the deep learning methods, semantic segmentation networks (e.g., Mask R-CNN, SegNet) are popular because of their scalability, intuitiveness, and good performance. [7] constructs an SSD object detector that combines prior knowledge to detect dynamic objects at the semantic level in a new detection thread based on a convolutional neural network, and also proposes a missed-detection compensation algorithm based on speed invariance across adjacent frames to improve detection recall. [1] presents RDS-SLAM, a real-time visual dynamic SLAM built on ORB-SLAM3 that adds a semantic thread and a semantic-based optimization thread for robust tracking and mapping in dynamic environments in real time; the parallel threads free the tracking thread from waiting for semantic information. [8] proposes a dynamic RGB-D SLAM based on semantic information and optical flow (DRSO-SLAM), which uses the Mask R-CNN semantic segmentation network to obtain semantic information in indoor dynamic scenes and uses epipolar geometry to filter out the actual dynamic feature points.</p>
      <p>However, methods based on pre-training are difficult to extend and deploy, a problem many researchers are aware of. The Autonomous Systems Lab at ETH Zurich presents a novel end-to-end occupancy-grid-based pipeline that automatically labels a wide variety of arbitrary dynamic objects; it can thus generalize to different environments without expensive manual labeling and, at the same time, avoids assuming a predefined set of known objects in the scene [9]. There are also methods that do not rely on deep learning. [10] regards structural regularities in the form of planes, such as walls and ground surfaces, as reliably static and presents a robust plane-based monocular VIO (RP-VIO), which improves robustness and accuracy in challenging dynamic environments. [11] builds adaptive multi-resolution range images and uses tightly coupled lidar-inertial odometry to first remove moving objects and then match the lidar scan to the submap. The present paper likewise discusses how to make better use of the IMU to make VIO more robust.</p>
      <p>Figure 1: Algorithm overview. IMU measurements pass through pre-integration and image inputs through feature detection and matching; after timestamp alignment, RANSAC yields the essential matrix E, the inliers, and the outliers; q_IMU and q_camera are compared, and the inliers are published if |q_IMU - q_camera| is below a threshold, the outliers otherwise.</p>
    </sec>
    <sec id="sec-method">
      <title>3. Method</title>
      <p>This work is based on VINS-Mono [12]. The main idea is that when the motion perceived by the IMU conflicts with the visual measurement, we choose to believe the IMU. This does not mean using the IMU measurement as the output; rather, we use the IMU's indication to invert the selection of feature points and thereby provide more reliable features to the back end. Figure 1 shows the algorithm flow. The IMU and the camera obtain their measured motion (rotation is used here) through pre-integration and visual geometry, respectively, and the difference between the two is compared. If the difference is slight, the feature tracker publishes the feature points normally; if the difference cannot be ignored, it selects the feature points inversely and publishes the outliers.</p>
    </sec>
    <sec id="sec-3">
      <title>3.1. IMU Pre-integration</title>
      <p>The sampling rate of the camera is lower than that of the IMU, and the IMU integration can participate in the back-end optimization only when a new camera sample arrives. We therefore pre-integrate the IMU measurements over an image sampling period in the sensor coordinate system, which saves a significant amount of computational resources [13]. Another benefit of pre-integration is that the IMU result is ready at the same time as the camera frame; to this end we buffer the IMU measurements in advance and accumulate them according to (1), (2), and (3) before that time point.</p>
      <p>
        \alpha_{b_{k+1}}^{b_k} = \iint_{t \in [t_k, t_{k+1}]} \mathbf{R}_t^{b_k} \left( \hat{\mathbf{a}}_t - \mathbf{b}_{a_t} \right) dt^2, \quad (1)
      </p>
      <p>
        \beta_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} \mathbf{R}_t^{b_k} \left( \hat{\mathbf{a}}_t - \mathbf{b}_{a_t} \right) dt, \quad (2)
      </p>
      <p>
        \gamma_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} \frac{1}{2} \gamma_t^{b_k} \otimes \begin{bmatrix} 0 \\ \hat{\boldsymbol{\omega}}_t - \mathbf{b}_{\omega_t} \end{bmatrix} dt. \quad (3)
      </p>
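<p>As an illustration of equations (1)-(3), the following is a minimal discrete sketch of IMU pre-integration with simple Euler steps. It is not the VINS-Mono implementation (which also propagates covariance and bias Jacobians), and the function and variable names are assumptions.</p>

```python
# Illustrative sketch (not the authors' code): discrete accumulation of the
# pre-integration terms alpha (position), beta (velocity), gamma (rotation)
# from raw accelerometer/gyroscope samples, given assumed biases b_a, b_w.
import numpy as np

def quat_mul(q, r):
    """Hamilton product of quaternions stored as [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_to_rot(q):
    """Rotation matrix of a unit quaternion [w, x, y, z]."""
    w, x, y, z = q
    return np.array([
        [1-2*(y*y+z*z), 2*(x*y-w*z),   2*(x*z+w*y)],
        [2*(x*y+w*z),   1-2*(x*x+z*z), 2*(y*z-w*x)],
        [2*(x*z-w*y),   2*(y*z+w*x),   1-2*(x*x+y*y)],
    ])

def preintegrate(accels, gyros, dt, b_a, b_w):
    """Euler integration of (1)-(3) over one image sampling period."""
    alpha = np.zeros(3)
    beta = np.zeros(3)
    gamma = np.array([1.0, 0.0, 0.0, 0.0])   # identity quaternion
    for a_hat, w_hat in zip(accels, gyros):
        acc = quat_to_rot(gamma) @ (a_hat - b_a)   # rotate into frame b_k
        alpha += beta * dt + 0.5 * acc * dt * dt
        beta += acc * dt
        # small-angle quaternion increment for the gyro sample
        dq = np.concatenate([[1.0], 0.5 * (w_hat - b_w) * dt])
        gamma = quat_mul(gamma, dq)
        gamma /= np.linalg.norm(gamma)
    return alpha, beta, gamma
```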
    </sec>
    <sec id="sec-4">
      <title>3.2. Timestamp Alignment between Images and IMU Measurements</title>
      <p>The frequency of the IMU is usually higher than that of the camera, and the sampling timestamps are not perfectly synchronized. As shown in Figure 2 (timestamps of IMU and camera sampling), there are always two IMU measurements straddling the sampling time t of an image frame on the time axis, so we interpolate the IMU measurements to align the timestamps. The acceleration a is interpolated linearly (LERP) as in (4) and (5), and the angular velocity ω uses spherical linear interpolation (SLERP) as in (6) and (7). For computational efficiency, the program may also interpolate ω with LERP: the sampling rate of the gyroscope is fast (200 Hz), so the angle change between samples is minimal. As shown later, the pre-integration is used only to decide the primary motion rather than to compute an accurate interpolated value, so the difference between LERP and SLERP can be ignored.</p>
      <p>
        a_t = (1 - \lambda) a_1 + \lambda a_2, \quad (4) \qquad \lambda = \frac{t - t_1}{t_2 - t_1}, \quad (5)
      </p>
      <p>
        q_t = \frac{\sin[(1 - \lambda)\theta]}{\sin\theta}\, q_1 + \frac{\sin(\lambda\theta)}{\sin\theta}\, q_2, \quad (6) \qquad \theta = \arccos(q_1 \cdot q_2). \quad (7)
      </p>
      <p>Then we pre-integrate the interpolated acceleration and angular velocity at the aligned timestamps and obtain the real pre-integration between the two frames. We use the pre-integrated quaternion q_IMU for the further determination.</p>
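<p>The interpolation in (4)-(7) can be sketched as follows. This is an illustrative implementation, not the authors' code, and it adds the standard shorter-arc and near-parallel safeguards for SLERP.</p>

```python
# Illustrative sketch of LERP (4)-(5) and SLERP (6)-(7) for aligning IMU
# samples to an image timestamp t between sample times t1 and t2.
import numpy as np

def lerp(a1, a2, t1, t2, t):
    lam = (t - t1) / (t2 - t1)             # eq. (5)
    return (1.0 - lam) * a1 + lam * a2     # eq. (4)

def slerp(q1, q2, t1, t2, t):
    lam = (t - t1) / (t2 - t1)
    dot = float(np.dot(q1, q2))
    if dot < 0.0:                          # take the shorter arc
        q2, dot = -q2, -dot
    theta = np.arccos(np.clip(dot, -1.0, 1.0))   # eq. (7)
    if theta < 1e-8:                       # nearly parallel: LERP fallback
        q = (1.0 - lam) * q1 + lam * q2
    else:                                  # eq. (6)
        q = (np.sin((1.0 - lam) * theta) * q1
             + np.sin(lam * theta) * q2) / np.sin(theta)
    return q / np.linalg.norm(q)
```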
    </sec>
    <sec id="sec-5">
      <title>3.3. Inliers and Outliers by RANSAC</title>
      <p>We use feature point detection and matching to correlate two images. Once the two images are matched, we solve for the essential matrix E with Random Sample Consensus (RANSAC) and decompose it by SVD. RANSAC repeatedly selects random subsets of the data, assumes the selected subsets are inliers, and computes the model that matches the most points. This yields the inlier points that fit the final model and the outlier points that do not. Both inliers and outliers are essential to our algorithm, as the next section explains. After triangulation, we obtain the unique pair {R, t} among the four solutions from SVD. Finally, we transform the rotation matrix R into the quaternion q_camera for the further determination.</p>
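<p>The inlier/outlier split this section relies on can be illustrated with a generic RANSAC skeleton. A 2-D line model is used here as a stand-in for the essential matrix, since the selection logic (random minimal samples, consensus counting, inlier mask) is the same; the function name, iteration count, and threshold are assumptions.</p>

```python
# Generic RANSAC skeleton (a sketch, not the authors' implementation).
# The returned inlier mask plays the role of the inliers/outliers that
# the publishing decision in Section 3.4 chooses between.
import numpy as np

def ransac_line(points, n_iters=200, threshold=0.05, rng=None):
    """Fit y = m*x + c by RANSAC; return (m, c) and a boolean inlier mask."""
    rng = np.random.default_rng(0) if rng is None else rng
    best_mask, best_model = None, None
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if abs(x2 - x1) < 1e-12:
            continue                      # degenerate minimal sample
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        # residual = vertical distance of every point to the candidate line
        resid = np.abs(points[:, 1] - (m * points[:, 0] + c))
        mask = resid < threshold
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (m, c)
    return best_model, best_mask
```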
    </sec>
    <sec id="sec-6">
      <title>3.4. Feature Points Publishing</title>
      <p>Normally, the quaternions obtained by the camera and the IMU are close in value. If so, the RANSAC algorithm has modeled the visually obtained essential matrix on the correct scene, which also means there are no very prominent dynamic objects.</p>
      <p>If this is not the case, that is, q_IMU and q_camera differ greatly, RANSAC must have built the model on the wrong object, because under normal circumstances the IMU measurements are more reliable than the camera measurements. At this point, most of the field of view is occluded by dynamic objects, and the outliers obtained by RANSAC are actually located in the static world. Under this circumstance, we should publish the outlier points to the back end instead of the inlier points.</p>
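<p>A minimal sketch of this publishing rule follows, with an assumed angle threshold (the paper does not state its value): the relative rotation angle between q_IMU and q_camera decides whether the inliers or the outliers are published.</p>

```python
# Sketch of the publishing decision (threshold value is an assumption):
# compare q_IMU and q_camera by their relative rotation angle.
import numpy as np

def quat_angle(q1, q2):
    """Relative rotation angle (rad) between two unit quaternions [w,x,y,z]."""
    dot = abs(float(np.dot(q1, q2)))        # abs() handles the double cover
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))

def select_features(q_imu, q_cam, inliers, outliers, threshold_rad=0.1):
    """Publish inliers when the motions agree, otherwise the outliers."""
    if quat_angle(q_imu, q_cam) > threshold_rad:
        return outliers   # RANSAC modeled the dynamic object: keep the rest
    return inliers        # RANSAC modeled the static world
```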
    </sec>
    <sec id="sec-7">
      <title>4. Experiment</title>
      <p>We overlaid a leaf image on the original images of the EuRoC dataset [14] to generate a new dataset called EuRoC_mask. The EuRoC ground truth is obtained from a Leica Nova MS50 laser tracker and a Vicon motion capture system. The length of the testing path is about 80 m. The masked datasets simulate a leaf gradually moving from the left end of the image to the right and then disappearing past the right edge of the view, as shown in Figure 4 (EuRoC dataset masked by a moving leaf). The experiment environment is an Intel Core i5-9400F (6 × 2.90 GHz) running Ubuntu 16.04 and ROS Kinetic.</p>
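<p>The masking process can be sketched as compositing a moving patch over each frame; the sizes, step, and alpha handling below are assumptions for illustration, not the tooling actually used to build EuRoC_mask.</p>

```python
# Sketch of compositing a moving occluder (e.g. a leaf) onto dataset frames.
import numpy as np

def overlay_patch(frame, patch, alpha_mask, x_offset):
    """Paste `patch` onto `frame` at column x_offset where alpha_mask > 0."""
    out = frame.copy()
    h, w = patch.shape[:2]
    x0 = max(0, x_offset)
    x1 = min(frame.shape[1], x_offset + w)
    if x0 >= x1:
        return out                       # patch fully outside the view
    p = patch[:, x0 - x_offset: x1 - x_offset]
    m = alpha_mask[:, x0 - x_offset: x1 - x_offset] > 0
    region = out[:h, x0:x1]              # view into `out`
    region[m] = p[m]
    return out

def make_sequence(frames, patch, alpha_mask, step=8, start=-64):
    """Move the patch left-to-right across consecutive frames."""
    return [overlay_patch(f, patch, alpha_mask, start + i * step)
            for i, f in enumerate(frames)]
```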
      <p>We run VINS-Mono (no loop closure) and the dynamic-VIO (no loop closure) proposed in this paper on these datasets, with the number of detected and tracked feature points set to 150. Table 1 shows the Root Mean Square Error (RMSE) of the Absolute Trajectory Error (ATE). When running on the masked datasets, the positioning error of VINS-Mono increases significantly, while dynamic-VIO keeps the error within a small range.</p>
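<p>The RMSE of the ATE reported in Table 1 can be computed as below once the estimated trajectory has been aligned to the ground truth; the alignment step (e.g. a Umeyama fit, as done by standard tools such as evo) is omitted here.</p>

```python
# Sketch of the ATE RMSE metric over already-aligned trajectories.
import numpy as np

def ate_rmse(est, gt):
    """RMSE of per-pose translational error between aligned trajectories."""
    est, gt = np.asarray(est), np.asarray(gt)
    errors = np.linalg.norm(est - gt, axis=1)   # per-pose Euclidean error
    return float(np.sqrt(np.mean(errors ** 2)))
```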
      <p>In addition, a typical case is that when VINS-Mono diverges, as shown in Figure 5 (VINS-Mono trajectory on V2_02_mask), dynamic-VIO still maintains convergence, as shown in Figure 6 (the positioning trajectories of VINS-Mono and dynamic-VIO).</p>
      <p>This research also plans to test non-rigid dynamic objects and static visual field occlusion. Theoretically, the method should also handle non-rigid objects, because the RANSAC modeling process does not depend on object characteristics: as long as the motions sampled by the IMU and the camera differ, that is sufficient to identify the primary dynamic object against the static world. For static visual field occlusion, the outcome depends on the degree of occlusion. A small occlusion can be regarded as a special case of this method, and the occluded area is filtered out in each calculation period. If the occluded part is too large, leaving an insufficient number of feature points for a long time, the method will also fail.</p>
      <p>However, the binary judgment between inertial and visual information means that this method can only recognize the largest dynamic object, which follows from RANSAC selecting the model satisfied by the most feature points. If there are multiple dynamic objects of similar size in the environment, the method will fail; in that case it will not do better than semantic segmentation-based approaches.</p>
    </sec>
    <sec id="sec-8">
      <title>5. Conclusion</title>
      <p>This paper proposed a robust VIO for dynamic environments. Based on the principle of RANSAC modeling and the level of motion conflict, the algorithm flexibly selects inlier or outlier points to publish, which avoids the interference of the major dynamic object. Experiments on artificial datasets show that the method stays robust and maintains the same level of trajectory error where VINS-Mono fails or its positioning error increases significantly. Theoretically, the method should be applicable to rigid and non-rigid dynamic objects as well as visual field occlusion. It requires no data training and can be conveniently deployed when there is one major dynamic object and high robustness is required.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Acknowledgments</title>
      <p>This work is supported by the National Key Research and Development Plan (2020YFC2004501 and 2020YFC2004503), the National Natural Science Foundation of China (NSFC) (Nos. 61774157 and 81771388), and the Beijing Natural Science Foundation (No. 4182075).</p>
    </sec>
    <sec id="sec-10">
      <title>7. References</title>
      <p>[8] N. Yu, M. Gan, H. Yu, and K. Yang, “DRSO-SLAM: A Dynamic RGB-D SLAM Algorithm for Indoor Dynamic Scenes,” in 2021 33rd Chinese Control and Decision Conference (CCDC), Kunming, China, May 2021, pp. 1052-1058.</p>
      <p>[9] P. Pfreundschuh, H. F. C. Hendrikx, V. Reijgwart, R. Dube, R. Siegwart, and A. Cramariuc, “Dynamic Object Aware LiDAR SLAM based on Automatic Generation of Training Data,” p. 7.</p>
      <p>[10] K. Ram, C. Kharyal, S. S. Harithas, and K. Madhava Krishna, “RP-VIO: Robust Plane-based Visual-Inertial Odometry for Dynamic Environments,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, Sep. 2021, pp. 9198-9205.</p>
      <p>[11] “RF-LIO: Removal-First Tightly-coupled Lidar Inertial Odometry in High Dynamic Environments,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, Sep. 2021, pp. 4421-4428.</p>
      <p>[12] T. Qin, P. Li, and S. Shen, “VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator,” IEEE Trans. Robot., vol. 34, no. 4, pp. 1004-1020, Aug. 2018.</p>
      <p>[13] S. Shen, N. Michael, and V. Kumar, “Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft MAVs,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, May 2015, pp. 5303-5310.</p>
      <p>[14] M. Burri et al., “The EuRoC micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157-1163, Sep. 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Miura</surname>
          </string-name>
          , “
          <article-title>RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods</article-title>
          ,” IEEE Access, vol.
          <volume>9</volume>
          , pp.
          <fpage>23772</fpage>
          -
          <lpage>23785</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and G. Lu, “
          <article-title>Tightly-coupled Fusion of VINS and Motion Constraint for Autonomous Vehicle,”</article-title>
          <source>IEEE Trans. Veh. Technol.</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. X.</given-names>
            <surname>Guo</surname>
          </string-name>
          , G. Georgiou, and S. I. Roumeliotis, “VINS on wheels,” in
          <source>2017 IEEE International Conference on Robotics and Automation (ICRA)</source>
          , Singapore, May
          <year>2017</year>
          , pp.
          <fpage>5155</fpage>
          -
          <lpage>5162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. A. K.</given-names>
            <surname>Gomaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>De Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K. I.</given-names>
            <surname>Mann</surname>
          </string-name>
          , and R. G. Gosine, “
          <article-title>Observability-Constrained VINS for MAVs Using Interacting Multiple Model Algorithm</article-title>
          ,”
          <source>IEEE Trans. Aerosp. Electron. Syst.</source>
          , vol.
          <volume>57</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>1423</fpage>
          -
          <lpage>1442</lpage>
          , Jun.
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Troccoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Rehg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kautz</surname>
          </string-name>
          , “
          <article-title>Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation</article-title>
          ,” in Computer Vision - ECCV
          <year>2018</year>
          , Cham,
          <year>2018</year>
          , vol.
          <volume>11209</volume>
          , pp.
          <fpage>484</fpage>
          -
          <lpage>501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Wisely Babu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          , and L. Ren, “
          <article-title>On Exploiting Per-Pixel Motion Conflicts to Extract Secondary Motions</article-title>
          ,” in
          <source>2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR)</source>
          , Munich, Germany, Oct.
          <year>2018</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Rong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Zou</surname>
          </string-name>
          , “
          <article-title>Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment</article-title>
          ,”
          <source>Robotics and Autonomous Systems</source>
          , vol.
          <volume>117</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          , Jul.
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>