Comparative Analysis of Monocular Visual Odometry Methods for Indoor Navigation

E.O. Trubakov¹, O.R. Trubakova¹
trubakoveo@gmail.com | trubakovaor@gmail.com
¹ Bryansk State Technical University, Bryansk, Russia

The monocular visual odometry algorithm involves several basic steps, and for each of them a number of methods exist. The purpose of this work is a practical study of methods for key point detection and optical flow calculation in the problem of determining the proper motion of an unmanned ground vehicle. To study the detection methods, a set of images containing the various distortions typical of shots taken by a mobile robot was prepared. It was found that, among the considered methods, FAST finds the largest number of points in the minimum time; at the same time, image distortions strongly affect its results, which is detrimental to robot positioning. The Shi-Tomasi method was therefore chosen for detecting key points within a short period of time, because its results are less dependent on distortion. To study the later stages of the odometry algorithm, a number of video clips were shot by the mobile robot in a confined space. From these experiments, conclusions were drawn concerning the application of the Lucas-Kanade optical flow method to tracking the identified points across the video sequence. Based on the error of the obtained results, it is concluded that monocular odometry cannot be the main method of positioning an unmanned vehicle in confined spaces, but in combination with data from assistive sensors it is quite suitable for determining the position of a robotic system.

Keywords: positioning, monocular visual odometry, image key points, detector, FAST, Harris, ORB, Shi-Tomasi, optical flow, Lucas-Kanade, Farneback.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

One of the fundamental tasks in the field of mobile robots and unmanned vehicles is the localization of the object and the construction of a map of the surrounding area. There are many approaches to solving this problem using various technical means, for example, laser systems such as LiDAR [3, 7], IMU [12], GPS, and radar [2]. All these systems have their disadvantages (for example, the high cost of LiDAR equipment or the limited precision of GPS data). This article considers visual odometry as a method of positioning a moving object. A noteworthy detail is that the main weakness of this method is its dependence on external factors: for example, the illumination of the room directly affects the precision of the obtained data. Furthermore, for correct matching of objects across successive images, static objects must dominate the scene. In addition, there are other problems, for example, geometric limitations when recovering the exact rotation and movement of the camera from the images alone (without additional information from other sensors it is impossible to determine the displacement scale) [8].

Despite all this, the method suits many mobile systems. The main idea of visual odometry is to analyze the sequence of photos taken by the robot's camera: from the change of object positions between images, the displacement of the robot is determined.

The monocular visual odometry algorithm consists of several steps. After receiving the image, the first step is to find key points in it. A number of methods implement this stage of odometry; this paper analyzes some of the commonest of them: Harris [6], Shi-Tomasi [11], FAST [4] and ORB [10]. The next stage of the algorithm is obtaining the optical flow. Dedicated studies were conducted to choose between the Lucas-Kanade optical flow method [9] and the Farneback dense optical flow method [5] (as one of the most effective and well-proven). The final stage of the algorithm is to determine the motion of the camera from the optical flow data, i.e. to calculate the rotation matrix and the displacement vector. This paper investigates the methods of key point identification and optical flow computation in the problem of proper motion detection.

2. Key points finding

For the detection of object movement it is necessary to identify changes in a sequence of images. To fulfill this requirement it is reasonable to find points in the image that are stably distinguishable from the surrounding points (key points) and to check their displacement in subsequent images.

One of the methods of identification (detection) of key points in the image is the Harris method. It is based on the derivatives of image brightness, which make it possible to track the variation of brightness in all directions. The principle of the method is that for the image $I$ a window $W$ is chosen, whose size depends on the size of the image (the most commonly used window size is 5x5), centered at $(x, y)$ and shifted by $(u, v)$. The sum of squared differences between the initial and the shifted window is

$$E(u,v) = \sum_{(x,y)\in W} w(x,y)\bigl(I(x+u,\,y+v) - I(x,y)\bigr)^2 \approx \sum_{(x,y)\in W} w(x,y)\bigl(I_x(x,y)\,u + I_y(x,y)\,v\bigr)^2 = \begin{pmatrix} u & v \end{pmatrix} M \begin{pmatrix} u \\ v \end{pmatrix},$$

where $w(x,y)$ is a weight function (usually a Gaussian or a binary window) and $M$ is the autocorrelation matrix

$$M = \sum_{(x,y)\in W} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}.$$

A corner point is characterized by a large change of $E(u,v)$ in every direction of $(u,v)$, i.e. by two eigenvalues of $M$ that are large in absolute value; computing the eigenvalues directly is quite time-consuming, therefore a response measure was introduced:

$$R = \det M - k\,(\operatorname{tr} M)^2,$$

where $k$ is an empirical constant ($k \in [0.04, 0.06]$). The value of $R$ is positive for corner points. Points whose value of $R$ is below a prescribed threshold are removed from the set; then the local maxima of $R$ within a neighborhood of a given radius are computed, and the resulting points are taken as corner key points.
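As an illustration, a minimal sketch of this detector is given below, assuming an OpenCV-based implementation (the paper implements its methods in C++ but does not name a library, so the function and parameter choices here are illustrative rather than the authors' code):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: Harris corner detection via the response measure
// R = det(M) - k * (tr M)^2 with the empirical constant k = 0.04.
std::vector<cv::Point> detectHarris(const cv::Mat& gray, double relThreshold) {
    cv::Mat response;
    // blockSize = 5 matches the 5x5 window mentioned above;
    // apertureSize = 3 is the Sobel kernel used for I_x and I_y.
    cv::cornerHarris(gray, response, /*blockSize=*/5, /*apertureSize=*/3, /*k=*/0.04);

    // Keep only points whose response exceeds a threshold relative to
    // the strongest corner; the local-maxima search within a given
    // radius is omitted here for brevity.
    double maxR;
    cv::minMaxLoc(response, nullptr, &maxR);
    std::vector<cv::Point> corners;
    for (int y = 0; y < response.rows; ++y)
        for (int x = 0; x < response.cols; ++x)
            if (response.at<float>(y, x) > relThreshold * maxR)
                corners.emplace_back(x, y);
    return corners;
}
```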
There is another detector that is similar in algorithm to the Harris detector: the Shi-Tomasi corner detector. The difference between the Shi-Tomasi detector and the Harris detector lies in the calculation of the response measure:

$$R = \min(\lambda_1, \lambda_2),$$

where $\lambda_1$ and $\lambda_2$ are the eigenvalues of $M$. This method computes the eigenvalues directly, which makes the search for corners more stable. Essentially, it is only necessary to set a threshold value: if the calculated measure is higher than the threshold, the point is considered a corner, in other words, an interest point.

The described methods identify key points by applying their algorithms directly to the pixels of the initial image. There is an alternative approach, which involves machine-learning algorithms that train a point classifier on a set of images. The FAST method builds decision trees for pixel classification. For each pixel $p$, a circle of radius 3 inscribed in a square area with a side of 7 pixels is considered; the circle passes through 16 pixels. On the basis of this set of points the decision is made about whether the central point is a key point or not. Each pixel $x$ of the circle around the pixel $p$ can be in one of three states relative to $p$:

$$S_{p \to x} = \begin{cases} d, & I_x \le I_p - t \quad \text{(darker)}, \\ s, & I_p - t < I_x < I_p + t \quad \text{(similar)}, \\ b, & I_p + t \le I_x \quad \text{(brighter)}, \end{cases}$$

where $t$ is a brightness threshold. Given $S_{p \to x}$ for each $p \in P$ ($P$ being the set of all pixels of the training image set), for each circle position $x$ the set $P$ is divided into three subsets $P_d$, $P_s$, $P_b$ of points that are darker than, similar to, or brighter than the point $x$ respectively. A decision tree is then built on these subsets, and according to this tree the corners in test images are determined.

The main disadvantage of the FAST method is that the order in which the circle points are examined affects its efficiency. It is also worth taking into account that other key points may occur in the neighborhood of the examined point, in which case the method can give erroneous results.
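The two FAST variants compared later in Section 3.1 (with and without non-maximum suppression) can be sketched as follows, again under the assumption of an OpenCV implementation; the threshold value is illustrative:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: the FAST segment-test detector in the two forms studied below.
// The threshold corresponds to the intensity margin t in S_{p->x}.
void detectFastVariants(const cv::Mat& gray) {
    std::vector<cv::KeyPoint> kptsSuppressed, kptsAll;

    // Variant 1: with non-maximum suppression (fewer, stronger corners).
    cv::FAST(gray, kptsSuppressed, /*threshold=*/20, /*nonmaxSuppression=*/true);

    // Variant 2: without suppression (several times more points).
    cv::FAST(gray, kptsAll, /*threshold=*/20, /*nonmaxSuppression=*/false);
}
```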
Another method of finding key points is the ORB method [10]. Its algorithm is as follows (a sketch of the resulting pipeline is given after the list):

1. Key points are detected by running the tree-based FAST detector on the initial image and on several images of the downscaled image pyramid.
2. The Harris measure is calculated for the detected points; candidates with a low Harris measure value are discarded.
3. The orientation angle of each key point is calculated. For this, the intensity moments of the key point neighborhood are computed:
$$m_{pq} = \sum_{x,y} x^p y^q I(x,y),$$
where $x, y$ are pixel coordinates and $I$ is the brightness. The orientation angle of the key point is then
$$\theta = \operatorname{atan2}(m_{01}, m_{10}).$$
The result is a particular direction for the neighborhood of the key point.
4. Given the orientation angle of the key point, the sequence of points used for binary comparisons in the BRIEF descriptor [1] is rotated by this angle. Mathematically, the new positions of the binary-test points are
$$\begin{pmatrix} x_i' \\ y_i' \end{pmatrix} = R(\theta) \begin{pmatrix} x_i \\ y_i \end{pmatrix}.$$
5. The binary BRIEF descriptor is calculated from the points obtained.
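In an OpenCV-based implementation (assumed here; the paper does not name a library) the whole pipeline above is available as a single detector object, and the 500-point cap mentioned in Section 3.1 corresponds to its nfeatures parameter:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: oriented FAST detection plus rotated-BRIEF description.
void detectOrb(const cv::Mat& gray) {
    // nfeatures = 500 is the default cap; weaker points are discarded
    // by their Harris measure, as in step 2 above.
    cv::Ptr<cv::ORB> orb = cv::ORB::create(/*nfeatures=*/500);

    std::vector<cv::KeyPoint> keypoints;  // each carries its angle theta
    cv::Mat descriptors;                  // binary BRIEF strings
    orb->detectAndCompute(gray, cv::noArray(), keypoints, descriptors);
}
```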
3. Case Studies

All studies were conducted on a robotic device, a four-wheeled unmanned vehicle. The system is built around a Raspberry Pi 3 microcomputer with a four-core BCM2837 processor running at 1200 MHz and 1 GB of SDRAM. The methods were implemented in the C++ programming language under the Ubuntu operating system.

3.1. Analysis of key points detection methods

To study the methods of detecting key points, a number of photos were taken with a wide-angle camera installed on the Waveshare robot. The photos were prepared in three main resolutions: 400x250, 800x600 and 1920x1080. The images were taken in different resolutions to check the dependence of the methods on scaling; this also helps both in determining the best resolution for the assembled device and in choosing the detection method by the criterion of speed.

The photos were then edited to test the methods' susceptibility to various distortions (darkening, rotation, blurring and added noise). A darkened image may occur due to changes in the illumination of the space during the robot's movement. A rotated image can be caused by the movement of the device during shooting when it travels over uneven surfaces. A blurred image may appear because the photo was taken in motion, so the boundaries of objects in it are smeared. Noise can appear due to changes in light, image compression, or natural causes such as a dust cloud in the area. In the studies with distorted images, the identified key points were counted and the percentage of points lost relative to the unmodified images was determined. These experiments revealed the methods least susceptible to image distortion.

The FAST method was chosen as the first method to be studied. It was implemented in two forms: the first with non-maximum suppression, and the second without it. With the first variant of the FAST method, about 1900 key points were found in the image with a resolution of 400x250 pixels within 0.14 seconds. In the photos with 800x600 pixels, about 8800 points were detected within 0.66 seconds. In the image with a resolution of 1920x1080 pixels, about 27400 key points were found within 2.32 seconds. The time taken to detect key points is directly proportional to the number of these points.

The second implementation of the method finds several times more key points, while the time spent on the search increases less than twofold. For example, in the image with a resolution of 400x250 pixels, about 7700 key points were found within 0.3 seconds. That means the first implementation of the FAST method recognized about four times fewer key points than the second one while spending half as much time.

The next implemented detector was the Harris method. In the image with a resolution of 400x250 pixels, about 250 key points were found within 0.11 seconds. Since the Harris detector is a corner detector, it detected key points neither on circular arcs nor on the edges of objects. In the photo with a resolution of 800x600 pixels, about 600 points were detected within 0.56 seconds. In the image with a resolution of 1920x1080 pixels, about 1080 key points were found within 2.25 seconds. It can be noted that, unlike the previous detectors, this method finds fewer key points within a shorter period of time, which speaks in favor of using it further.

In the ORB algorithm the maximum number of key points by default cannot exceed 500; if more are detected, the Harris corner measure is applied to exclude the least significant ones. The algorithm gave the following results: in the image with a resolution of 400x250 pixels, about 470 key points were found within 0.24 seconds; in the image with a resolution of 800x600 pixels, 500 key points were found within 0.83 seconds; in the image with a resolution of 1920x1080 pixels, 500 key points were found within 2.98 seconds. The experiments showed that this detector is not the fastest among those described above.

The Shi-Tomasi detector is based on the Harris detector. In spite of this, the Shi-Tomasi algorithm detected only 25 key points in the images of different resolutions within roughly the same period of time, which is 10 times fewer than the Harris detector found in these images.
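The point counts and timings above could be gathered with a harness of the following shape (a sketch under the assumption of an OpenCV implementation; goodFeaturesToTrack is OpenCV's Shi-Tomasi detector, and the quality and distance parameters are illustrative):

```cpp
#include <opencv2/opencv.hpp>
#include <chrono>
#include <iostream>
#include <vector>

// Sketch: measure detection time and point count for one detector.
void benchmarkShiTomasi(const cv::Mat& gray) {
    std::vector<cv::Point2f> corners;
    auto start = std::chrono::steady_clock::now();

    // goodFeaturesToTrack implements the Shi-Tomasi response
    // R = min(lambda1, lambda2); maxCorners = 0 means "no limit".
    cv::goodFeaturesToTrack(gray, corners, /*maxCorners=*/0,
                            /*qualityLevel=*/0.01, /*minDistance=*/10);

    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    std::cout << corners.size() << " points in " << elapsed.count() << " s\n";
}
```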
The second type of experiments analyzed the behavior of the algorithms under image distortion. For each method, the percentage reduction of points relative to the initial image with a resolution of 1920x1080 pixels was measured. The FAST detectors turned out to be particularly unstable under darkening: these algorithms lose about 65-70% of key points when the image is darkened. With a small rotation of the image (20 degrees) these detectors lose about 48-53% of points. The FAST detectors also proved unstable under blurring; in this case the implementation with non-maximum suppression loses more key points (89%) than the implementation without it (58%). Finally, the FAST method flounders on noisy images and is not invariant to the appearance of noise: these algorithms find 82-88% more key points, which is erroneous.

The Harris detector showed invariance to distortions such as darkening, blurring and noise, but it was unstable under image rotation (it loses about 36% of the points). The same does not apply to the Shi-Tomasi method, which is based on the Harris detector but, unlike it, is invariant to rotation, as is the ORB detector. The experiments showed that the main advantage of the Shi-Tomasi and ORB detectors is invariance with respect to noise, blurring and darkening.

3.2. Analysis of optical flow construction methods

The following type of research was carried out on two groups of video sequences, with spans of 288 frames and 504 frames. Each group was divided into three variants depending on the video resolution: 800x480, 640x360 and 320x240. These studies were conducted to choose the method of optical flow construction for its further use in the robot's visual odometry algorithm. The Lucas-Kanade method and the Farneback dense optical flow method were chosen for the study.

The Lucas-Kanade optical flow method was examined first. When the video resolution is 320x240, the working time of the method is approximately 0.07 seconds per frame. When the video resolution is increased, the operation time increases too: at a resolution of 640x360 a frame is processed within about 0.11 seconds, and at 800x480 the frame processing time is already about 0.16 seconds. The studies showed that the method's speed does not depend on the span of the video sequence, but does depend on its resolution.

The next implemented method was the Farneback dense optical flow method. Like the Lucas-Kanade method, it showed its best working time at the lowest resolution (320x240), which is expected, since at a lower resolution the method deals with a smaller image area. For video with a resolution of 320x240, the processing time was about 0.42 seconds per frame. With a frame resolution of 640x360, the runtime is approximately 1.1 seconds per frame, and at the 800x480 resolution it grows to 1.9 seconds per frame.
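The per-frame costs reported above can be reproduced with the two flow routines of an OpenCV-based implementation (a sketch; the window, pyramid and polynomial parameters are typical defaults rather than the authors' settings):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: sparse Lucas-Kanade vs. dense Farneback flow on one frame pair.
void compareFlow(const cv::Mat& prevGray, const cv::Mat& nextGray,
                 const std::vector<cv::Point2f>& prevPts) {
    // Sparse flow: track only the detected key points (pyramidal LK).
    std::vector<cv::Point2f> nextPts;
    std::vector<uchar> status;  // 1 if the point was tracked successfully
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, nextGray, prevPts, nextPts,
                             status, err, cv::Size(21, 21), /*maxLevel=*/3);

    // Dense flow: a displacement vector for every pixel, hence the
    // much higher per-frame cost measured above.
    cv::Mat flow;
    cv::calcOpticalFlowFarneback(prevGray, nextGray, flow,
                                 /*pyr_scale=*/0.5, /*levels=*/3,
                                 /*winsize=*/15, /*iterations=*/3,
                                 /*poly_n=*/5, /*poly_sigma=*/1.2, /*flags=*/0);
}
```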
3.3. Experimental proof of detection method effect on the determination of the proper motion

The final studies were conducted on two groups of video sequences. In the first type of video the robot moves uniformly forward and then back, with the final point of the movement coinciding with the initial one. The second type of motion is a steady closed circular motion in which the final point of the motion again coincides with the initial point. The trajectory calculated on the basis of visual odometry should run linearly forward and backward in the first case and should describe a closed curve in the second case. Because the initial points coincide with the final ones, there is no need to take into account a displacement scale obtained from other sensors; the data received from the camera are sufficient.

The first method chosen for these studies was Shi-Tomasi. When moving forward by 2.5 meters and then back by 2.5 meters in different situations, the error of reaching the initial point in the motion diagram is about 6.9%. In a circular motion with a radius of 0.4 m at a robot speed of 0.2 m/s, the trajectory calculated on the basis of visual odometry has an error of about 20.7%. The studies demonstrated both the sensitivity of the visual odometry calculation and the magnitude of its errors when the Shi-Tomasi method is used for key point detection.

The next method to be investigated was the Harris method. When moving back and forth, using the same videos as in the Shi-Tomasi experiments, the trajectory calculations based on odometry had an error of about 8.2%. In the study on the second type of video records, i.e. the circular motion, the calculation error was 24.8%. The studies showed that the Harris method is subject to computational errors much like the Shi-Tomasi method.
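The paper does not spell out how the rotation matrix and displacement vector are recovered from the tracked points in these trajectory experiments; a common shape for this step, assuming a calibrated camera and OpenCV's epipolar-geometry routines (the focal length and principal point below are placeholders, not the real calibration), is:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: camera rotation R and translation direction t between two
// frames, estimated from the point sets matched by optical flow.
void estimateMotion(const std::vector<cv::Point2f>& pts1,
                    const std::vector<cv::Point2f>& pts2) {
    double focal = 700.0;          // assumed intrinsics (placeholders)
    cv::Point2d pp(400.0, 240.0);

    cv::Mat inliers;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, focal, pp,
                                     cv::RANSAC, 0.999, 1.0, inliers);

    // t is recovered only up to scale: the monocular scale ambiguity
    // noted in the introduction [8]; here the closed trajectories make
    // an external scale unnecessary.
    cv::Mat R, t;
    cv::recoverPose(E, pts1, pts2, R, t, focal, pp, inliers);
}
```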
4. Results

Comparing the implemented methods of key point detection with regard to speed in relation to the number of detected points, the FAST method turned out to be the fastest detector among all those considered in these studies, since it finds several times more points than the others within a shorter period of time. This method will also be the best solution for tasks where the number of obtained points really matters. However, it is highly susceptible to image distortions.

Both the Harris and Shi-Tomasi detectors exhibited the best results in terms of speed (fig. 1), but the numbers of points they find differ by an order of magnitude (10 times). For a more accurate result when working with detectors it is better to accept a smaller number of key points; therefore, the Harris detector is preferable in terms of operating time relative to the number of points.

Fig. 1. The operating time of the detectors (FAST with and without non-maximum suppression, Harris, ORB, Shi-Tomasi) depending on the image size.

Maximum independence from all types of distortions was demonstrated by the ORB and Shi-Tomasi detectors (fig. 2).

Fig. 2. Percentage reduction of found points under darkening, rotation, blurring and noise.

Thus, on the basis of the conducted experiments it can be concluded that the Harris detector is preferable in terms of speed and is independent of most image distortions, although it is vulnerable to image rotation. The Shi-Tomasi method has the same execution time and is robust to image distortions. However, under similar conditions this method found an order of magnitude fewer key points. This fact can have negative consequences for the operation of the robot positioning algorithm as a whole, because the pictures are taken in motion, and at a high robot speed the key points can go out of frame. That is why the follow-up studies were conducted for both methods, Harris and Shi-Tomasi.

The results of the optical flow research show the time taken by each method, in percentage terms, relative to the video sequence duration. Figure 3 shows that the Lucas-Kanade optical flow method works much faster than the Farneback dense optical flow method: at a resolution of 320x240 the difference reaches 5.8 times, at 640x360 it is 9.8 times, and at 800x480 it is 11 times. Thus, it can be concluded that, in terms of execution speed on a robotic vehicle and regardless of the video resolution, the Lucas-Kanade method of optical flow calculation is much more preferable than the Farneback method.

Fig. 3. Time taken by each method, in percentage terms, relative to the video duration (Lucas-Kanade vs. Farneback).

The final stage of the study showed that the previously selected methods for identifying key points (the Shi-Tomasi method and the Harris method) do not provide 100% accuracy in the calculation of camera displacement. However, both of them can be applied in the monocular visual odometry algorithm, bearing their errors in mind. It is also worth noting that the Shi-Tomasi method has better accuracy than the Harris method.

5. Conclusion

The article analyzes the algorithm of visual odometry using a single camera (monocular visual odometry) for mobile object positioning. Empirical studies have been conducted for various methods implementing the steps of the algorithm.

According to the results of the studies, the Shi-Tomasi method is the most preferable of those considered here for detecting key points for an unmanned ground-based robotic vehicle, because it is the least prone to image distortions such as darkening, rotation, blurring and noise.

To track the identified key points across the video, it is preferable to use the Lucas-Kanade optical flow method; this choice is based on its operating speed compared with the other considered methods. However, the experiments showed that this method cannot process every frame in real time. So, when implementing monocular visual odometry for a moving unmanned vehicle, it is necessary to track key points only on a subset of the frames of the video stream; besides, it is essential to find the travel speed at which the loss of information about the vehicle's movement remains insignificant.

The final studies were carried out to make it possible to use visual odometry for identifying the robot's displacement relative to the initial point. It should be noted that the methods for calculating the rotation and displacement matrices of the camera are quite sensitive to external parameters influencing their calculations and introducing errors. It is also worth mentioning that errors in the calculation of the camera position displacement can be caused by the environment: for example, low illumination of the room can result in a number of key points insufficient for the algorithm to work.

On the basis of the study it can be concluded that monocular visual odometry, having low accuracy in the calculation of motion indicators, cannot be the main method for determining the location of a moving system. However, in conjunction with data obtained from other sensors, it can serve as a method for positioning a robotic ground vehicle. Additional studies involving assistive sensors are needed to confirm this conclusion.

6. References

[1] Calonder, M. BRIEF: Binary Robust Independent Elementary Features / M. Calonder, V. Lepetit, Chr. Strecha, P. Fua // 11th European Conference on Computer Vision (ECCV). – 2010. – pp. 778-792.
[2] Checchin, P. Radar scan matching SLAM using the Fourier-Mellin transform / P. Checchin, Fr. Gerossier, Chr. Blanc // Field and Service Robotics / Springer. – 2010. – pp. 151-161.
[3] Cole, D.M. Using laser range data for 3D SLAM in outdoor environments / D.M. Cole, P.M. Newman // Robotics and Automation, 2006. ICRA 2006. Proceedings 2006 IEEE International Conference on / IEEE. – 2006. – pp. 1556-1563.
[4] Rosten, E. Fusing points and lines for high performance tracking / E. Rosten, T. Drummond // Computer Vision (ICCV), IEEE International Conference on / IEEE. – 2005. – pp. 1508-1515.
[5] Farneback, G. Two-frame motion estimation based on polynomial expansion / G. Farneback // Lecture Notes in Computer Science. – 2003. – pp. 363-370.
[6] Harris, C. A combined corner and edge detector / C. Harris, M. Stephens // Fourth Alvey Vision Conference, Manchester, UK. – 1988. – pp. 147-151.
[7] Hess, W. Real-time loop closure in 2D LIDAR SLAM / W. Hess, D. Kohler, H. Rapp, D. Andor // Robotics and Automation (ICRA), 2016 IEEE International Conference on / IEEE. – 2016. – pp. 1271-1278.
[8] Kitt, B.M. Monocular visual odometry using a planar road model to solve scale ambiguity / B.M. Kitt, J. Rehder, A.D. Chambers, M. Schonbein, H. Lategahn, S. Singh // Proc. European Conference on Mobile Robots. – 2011.
[9] Lucas, B.D. An iterative image registration technique with an application to stereo vision / B.D. Lucas, T. Kanade // Proceedings of the 7th International Joint Conference on Artificial Intelligence. – 1981. – pp. 674-679.
[10] Rublee, E. ORB: an efficient alternative to SIFT or SURF / E. Rublee, V. Rabaud, K. Konolige, G. Bradski // Computer Vision (ICCV), IEEE International Conference on / IEEE. – 2011. – pp. 2564-2571.
[11] Shi, J. Good features to track / J. Shi, C. Tomasi // TR 93-1399, Cornell University. – 1993.
[12] Yi, J. IMU-based localization and slip estimation for skid-steered mobile robots / J. Yi, J. Zhang, D. Song, S. Jayasuriya // Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on / IEEE. – 2007. – pp. 2845-2850.