Detection of Good Matching Areas Using Convolutional Neural Networks in Scene Matching-Based Navigation Systems

Ayham Shahoud 1, Dmitriy Shashev 1 and Stanislav Shidlovskiy 1
1 Tomsk State University, 36 Lenin Ave, Tomsk, 634050, Russia

Abstract
This paper presents a solution for detecting false matchings in scene matching-based aerial navigation systems. A navigation system that uses normalized cross-correlation to match a captured image with a reference image was designed. While traditional methods rely on statistical indicators to detect false matchings, this research relies on deep learning with a Convolutional Neural Network (CNN). A CNN was trained to predict online the probability that a matching result is true or false. The training dataset was constructed based on knowledge of where good matching areas are expected to be. The predicted probabilities were stored as an assistant map so that they can be reused with the same reference map without repeating the classification. The system was implemented and tested in a 3D simulation environment using models of a drone, a camera, and a flight environment. The Robot Operating System (ROS) and the 3D dynamic simulator Gazebo were used for simulation. The results demonstrate the efficiency of the proposed method in excluding false matchings. Using the assistant map without classification resulted in an execution time of 41 ms and an RMS position error of less than 1.2 m.

Keywords
Correlation, CNN, false matching, assistant map, potential fields, ROS, Tensorflow.

1. Introduction
Computer vision techniques are commonly used in drone navigation systems. They are also used for mapping, path tracking, and observation [1]. These systems are light and cheap and do not rely on external infrastructure. They also offer a good way to overcome the loss of Global Positioning System (GPS) signals. Computer vision navigation measurements can be absolute, as in scene matching, or relative, as in visual odometry. The absolute position can be calculated by matching a captured image with a reference map. Image matching can be performed using local features in the images or the cross-correlation function, which is the approach adopted in this research. Cross-correlation has been a well-studied method for scene matching since the last century, and the normalized cross-correlation has been used in many outdoor applications because of its resistance to illumination variation [2].
Correlation-based scene matching raises several problems, such as execution time, accuracy, and false matching. False matchings occur because of noise, areas that produce a flattened correlation surface, or other error sources [3]. In autonomous aerial navigation applications, large errors or error jumps are not acceptable; these errors must be treated or excluded before being used by the autopilot. A false matching may lead to large final errors, mission failures, or harm to people.
This work presents a solution for false matching detection in correlation-based navigation systems. The system was designed and tested using a drone in the 3D environment shown in Figure 1. A CNN was trained to classify online the areas where good and bad matchings are expected to occur. The classification results were saved so that they can be reused with the same reference image without repeating the classification.
The rest of this paper is organized as follows: section 2 covers related studies, section 3 the navigation algorithm, section 4 true and false matchings, section 5 classification using a CNN, section 6 the implementation, and section 7 the results analysis and conclusion.

GraphiCon 2021: 31st International Conference on Computer Graphics and Vision, September 27-30, 2021, Nizhny Novgorod, Russia
EMAIL: ayhams86@gmail.com (A. Shahoud); dshashev@mail.ru (D. Shashev); shidlovskiysv@mail.ru (S. Shidlovskiy)
ORCID: 0000-0002-5620-9984 (A. Shahoud); 0000-0002-9533-4577 (D. Shashev); 0000-0002-7541-9637 (S. Shidlovskiy)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Figure 1: The 3D flight environment in Gazebo (right) and the IRIS drone model with a camera fixed on it (left).

2. Related studies
Scene matching can be established using methods such as local features and the cross-correlation function. Much research has been done to enhance the reliability and accuracy of navigation systems based on cross-correlation. If accurate and reliable matchings are available, they are useful in integrated navigation systems: they can be used to compensate for the errors of other navigation systems such as GPS, inertial systems, or visual odometry.
Reference [3] introduced an implementation of an integrated navigation system using inertial sensors, GPS, and computer vision. The vision navigation system depended on cross-correlation, and false matchings were detected and excluded using statistical analysis: the variance of the output over a sliding window of the last 30 measurements was observed to detect them. Good matchings were expected to occur in areas containing road intersections. Reference [4] presented an image matching system for an autonomous Unmanned Aerial Vehicle (UAV) based on neural networks. A multi-layer neural network was used to detect edges, and the position was calculated using cross-correlation; matching edge-rich areas produced reliable correlation results. The use of mutual information between images to enhance the correlation results is presented in [5, 6]. Map analysis based on image entropy can be used to select the best matching areas, as in [7]. The use of a lightweight asymmetric two-branch CNN to predict the matching probability of two images is presented in [8]: the matching network takes a pair of aerial images and road landmarks as input, and the outputs are two deep feature maps that are fed into a feature matching layer to compute the correlation map. In reference [9], scene matching between satellite images and online drone images was established using a CNN. The contextual information extracted from the scene was used to increase localization accuracy and enable navigation without GPS; a semantic shape matching algorithm was then applied to extract and match meaningful shape information from both images.
Statistical solutions for detecting false matchings may fail during maneuvering or when the speed changes at a high rate. Relying on other sensors to solve the problem is directly affected by their errors and may increase the cost if highly accurate sensors are needed. In this research, an online solution that depends on artificial intelligence to detect good and bad matching areas was studied. Although it increases the execution time, it is more robust.
A CNN was trained to predict the probability that an online captured image has a true (or false) match on a reference map. The results are simply saved as an assistant map to be reused with the same reference map for many purposes in the future. A dataset was constructed and used for training and validation. According to the output of the CNN, the navigation system decides whether the output of the correlation-based system must be accepted or rejected. The resulting assistant map is not only useful for excluding false matchings; it can also be used to analyze the flight environment offline. The assistant map can be considered a meta-map that contains information about each pixel of the reference map. The name "assistant map" was adopted to reflect the map's function in the presented application.

Figure 2: The navigation algorithm. The assistant map values are updated only when no corresponding probability number exists or when a new map is used.

3. Navigation algorithm
A navigation system that depends on cross-correlation was designed. The position of the drone corresponds to the patch that produces the highest correlation value with the reference image. Before calculating the cross-correlation, the captured image must be aligned with the reference image, i.e., rotated and scaled to match the reference image scale and orientation. The alignment is done using the heading angle ψ from the compass and the height h from an altimeter. A detailed explanation of the navigation system implementation is presented in [10]. Let T be the captured image and I the reference image; the normalized cross-correlation is given by

R(x, y) = \frac{\sum_{x', y'} T(x', y') \, I(x + x', y + y')}{\sqrt{\sum_{x', y'} T(x', y')^{2} \cdot \sum_{x', y'} I(x + x', y + y')^{2}}}    (1)

To reduce the matching time, only a cropped window of the map is matched with the captured image. The size of the cropped window changes adaptively with the last movement amplitude and the scaled image size [10]. The cross-correlation process consumes less time than the classification process; that is why classification is done after correlation, so that the classification step can be skipped when a probability number is already available (for more details refer to section 5.4). The navigation algorithm is shown in Figure 2.

4. True matchings and false matchings
A false matching means that, for some reason, the maximum correlation value occurs at the wrong position on the reference image. A false matching between an image and the reference image leads to jumps or large errors in positioning [3]. These errors may have dangerous outcomes in the case of autonomous vehicles: destruction of the vehicle, mission failures, or damage to the surrounding environment might occur. There are many reasons why false matching occurs, such as image noise and areas that produce a flattened correlation surface, for example dense uniform vegetation. The normalized cross-correlation is a mathematical operation that always produces a result: even if the images (matrices) are not identical or even similar, the correlation yields a peak somewhere. The navigation system must therefore decide whether the matching result, and hence the final position value, is accepted or not. Artificial Intelligence (AI) offers an innovative solution based on the knowledge of where good and false matchings are expected to occur.
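As a concrete illustration, the following minimal sketch shows what the correlation step of section 3 produces: equation (1) evaluated over a cropped search window, with the peak location taken as the position candidate that the system must then accept or reject. OpenCV's cv2.matchTemplate with the TM_CCORR_NORMED method is assumed here as a stand-in for the paper's own correlation routine, and the adaptive window sizing of [10] is reduced to a fixed half-size parameter.

```python
# Minimal sketch of the normalized cross-correlation step (equation 1),
# assuming OpenCV's matchTemplate as a stand-in for a custom implementation.
import cv2
import numpy as np

def correlate(captured, reference, center_xy, window_half_size):
    """Match an aligned captured image against a cropped window of the reference map."""
    cx, cy = center_xy
    # Crop a search window around the last known position (simplified, fixed size).
    x0, y0 = max(cx - window_half_size, 0), max(cy - window_half_size, 0)
    x1 = min(cx + window_half_size, reference.shape[1])
    y1 = min(cy + window_half_size, reference.shape[0])
    window = reference[y0:y1, x0:x1]

    # Normalized cross-correlation surface over the window.
    result = cv2.matchTemplate(window, captured, cv2.TM_CCORR_NORMED)

    # The best-matching patch is the global peak of the correlation surface.
    _, max_val, _, max_loc = cv2.minMaxLoc(result)

    # Convert the peak back to reference-map coordinates (patch top-left corner).
    match_x, match_y = x0 + max_loc[0], y0 + max_loc[1]
    return (match_x, match_y), max_val
```

The returned peak value alone does not indicate whether the matching is true; that decision is delegated to the CNN classification described in section 5.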
Figure 3: Examples of good matching areas in the constructed dataset, taken from Google Maps.

Figure 4: Examples of bad matching areas in the constructed dataset, taken from Google Maps.

4.1. Good matching areas
Good matching areas are places where the cross-correlation between a captured image and the reference image is expected to produce a true matching. According to previous studies, these areas have a high probability of occurring at street (path) intersections. In general, they exist in relatively large geometrical intersection areas [8]. Such areas are more stable and robust against noise, weather, and illumination variance than small details in the image. This research focuses on the city environment. Intersections in a city may resemble the shapes ╬, ╠, ╔, or have other shapes with more arms, such as stars and polygons. In real situations, they might be rotated, cropped, or warped under different models, and the CNN must be able to recognize them by using a suitable training dataset. A set of good matching areas is shown in Figure 3. All pictures were taken from Google Maps.

4.2. Bad matching areas
Bad matching areas produce an almost flattened correlation result, i.e., no robust peak. Areas covered with snow, lakes, large asphalt squares, dense green vegetation or grass, and homogeneous roofs might produce false matchings. In real operation the situation is worse because of noise, illumination, weather, and sensor errors. A set of bad matching areas is shown in Figure 4.

5. Good and bad matching areas classification using CNN
5.1. CNN structure
Deep learning based on convolutional neural networks is very efficient in image classification. It is used for many applications, such as object detection, visual path tracking, and many other classification problems. The pre-processing required by a CNN is much lower compared to other classification algorithms. A CNN takes an input image and applies learnable filters to various features in the image; the network learns these filters during training [9, 11]. The structure of the CNN used to classify bad and good matching areas is shown in Figure 5. A CNN consists of convolutional layers, pooling layers, fully connected layers, and activation functions. Feature extraction is done by convolution and non-linear activation functions in the convolution layers.

Figure 5: The designed CNN structure; conv. refers to a convolution layer.

Only the kernel elements (filters) used in the convolution operation are trained during the training phase. In the designed CNN, two convolution layers (64 and 32 kernels respectively) with a kernel size of 3x3 and the non-linear ReLU activation function were used. Down-sampling, which reduces the in-plane dimensionality, is done in the pooling layers. In this research, the features of interest are not small details in the image, so 2x2 average pooling was used after each convolution layer, which helped suppress undesired small details in the images. The outputs of the convolution layers are flattened and connected to a fully connected layer. To prevent overfitting, two dropout layers were used in addition to two fully connected layers. The final output layer uses the Softmax activation function. More details are given in Figure 5.

5.2. Dataset creation
Using a down-looking camera simplified the task of collecting and creating the dataset from Google Maps. Three classes were chosen to represent three types of intersections, and one class to represent bad matching areas.
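The following minimal Keras sketch illustrates the architecture described in section 5.1, with 200x200 grayscale inputs and the four output classes introduced above. The dense-layer widths, dropout rates, optimizer, and loss function are illustrative assumptions and not values reported in the paper.

```python
# Minimal Keras sketch of the CNN from section 5.1: two convolution layers
# (64 and 32 3x3 kernels, ReLU), 2x2 average pooling after each, two fully
# connected layers with dropout, and a softmax output over four classes.
from tensorflow.keras import layers, models

def build_classifier(input_shape=(200, 200, 1), num_classes=4):
    model = models.Sequential([
        layers.Conv2D(64, (3, 3), activation="relu", input_shape=input_shape),
        layers.AveragePooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.AveragePooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),   # layer width is an assumption
        layers.Dropout(0.5),                    # dropout rate is an assumption
        layers.Dense(64, activation="relu"),    # layer width is an assumption
        layers.Dropout(0.5),                    # dropout rate is an assumption
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Loss assumes integer-encoded class labels (0..3).
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```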
Increasing the number of classes that represent good matchings may improve the results, but that requires a large effort for building the dataset and training the CNN. Four classes gave acceptable results in a normal city environment; for cities with very complex geometrical characteristics, increasing the number of classes may be worth considering.
The first class: bad matching areas similar to the images shown in Figure 4 (for more about this class, refer to 5.4).
The second class: good matching areas with intersections similar to ╬, i.e., a node with 4 arms.
The third class: good matching areas with intersections similar to ╠, i.e., a node with 3 arms.
The fourth class: good matching areas with intersections similar to ╔, i.e., a node with 2 arms.
These shapes are idealized for explanation only; real examples from the dataset are shown in Figure 3. A dataset of 3920 images was constructed: 1500 images of real intersections in cities were cropped from Google Maps and then augmented to 3920. The images were converted to grayscale, resized to 200x200 pixels, and stored as a labeled dataset.

5.3. CNN training and validation
Artificial intelligence tools for training CNNs, such as Tensorflow and Keras, have developed enormously; these tools were used in this research to train the CNN. The dataset was divided into a training set and a validation set: 80% of the data was used for training and the rest for validation. The accuracy and loss in training and validation are shown in Table 1. The final results confirm the efficiency of the selected CNN structure and dataset.

Table 1
CNN training results

Parameter      Accuracy    Loss
Training       0.96        0.08
Validation     0.92        0.1

5.4. Navigation decision
In integrated navigation, error compensation does not require continuous scene matching; it is enough to have a sufficient number of reliable matchings, depending on the application. In standalone vision navigation, losing a matching is a problem, but that is outside the scope of this work [8]. After the correlation process, the image is fed into the CNN, as shown in Figure 2. If it is classified as a bad matching area (class number 1), the navigation system will not rely on the vision navigation system. The first class (bad matching areas) was used to reinforce the exclusion of such areas from being used to correct integrated navigation solutions or from being used by the autopilot; these areas must be treated as "obstacles" for the drone. If a matching has a probability greater than 0.5 of belonging to class 1, then regardless of the overall CNN result, the patch is excluded and considered false. If the image is classified as a good matching area (class number 2, 3, or 4), and the probability of belonging to class 1 is less than 0.5, then the navigation system relies on the vision navigation system. The system associates the probability of having a true matching with a predefined patch of the map corresponding to the drone position, as shown in Figure 6. This association is implemented by building a corresponding map that contains a probability number for each pixel of the reference map; each pixel of the constructed map (the assistant map) corresponds to the same pixel of the reference map. In practice, the patch size on the assistant map was fixed to twice the last movement amplitude in pixels (proportional to the drone speed).
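A minimal sketch of this decision and association step is given below. It assumes the CNN returns a four-element softmax vector whose first entry is the bad-matching class (class 1), and that the assistant map is stored as a float array of the same size as the reference map; the patch half-size and the map encoding are simplified for illustration.

```python
# Sketch of the navigation decision (section 5.4): reject the correlation
# result when the bad-matching probability exceeds 0.5, otherwise accept it
# and write the true-matching probability into the assistant map patch.
import numpy as np

BAD_CLASS = 0          # "class 1" in the paper's numbering
BAD_THRESHOLD = 0.5

def accept_matching(probabilities):
    """Return True when the bad-matching probability is below the threshold."""
    return probabilities[BAD_CLASS] < BAD_THRESHOLD

def update_assistant_map(assistant_map, drone_px, probabilities, patch_half):
    """Write the probability of a true matching into the patch around the drone pixel."""
    x, y = drone_px
    p_true = 1.0 - probabilities[BAD_CLASS]
    y0, y1 = max(y - patch_half, 0), min(y + patch_half + 1, assistant_map.shape[0])
    x0, x1 = max(x - patch_half, 0), min(x + patch_half + 1, assistant_map.shape[1])
    assistant_map[y0:y1, x0:x1] = p_true
    return assistant_map
```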
In this work, the average speed of the drone was 10 m/s and the execution time during classification was 140 ms, so the patch size was approximated to 5 pixels (2.65 m) centered on the drone pixel position (for the map specification, refer to 6.1). Overlapping associations may occur continually when the map is new; although this is not harmful (it is just a recalculation based on a new classification result), it is time-consuming. Overlapping can be avoided on new maps by using a well-defined stride in the association, but that was not adopted in this work. There is no need to repeat the CNN classification on the same map in the future: if the same map is used, then after the correlation a check is made to see whether the probability number corresponding to the drone position on the map already exists on the assistant map. If the probability number does not exist, classification is needed. The adopted method of saving results was easy to implement, and the obtained assistant map is easy to analyze and visualize as an image or a matrix. Analyzing the assistant map is an efficient way to identify areas where robust visual navigation results are expected. To make processing and visualization easier, the results can be saved as "zeros" for false matchings and "ones" for true matchings according to a threshold of 0.5.

Figure 6: Saving the classification results. Black pixels on the assistant map correspond to positions at which, if the drone is located there, the camera is expected to capture a false matching; white pixels correspond to the expectation of a true matching; gray refers to non-visited areas.

6. Implementation
6.1. The equipment
A computer with an i5-10300 CPU (2.5 GHz) was used. The chosen camera model has a 60° field of view and an image size of 200x200 pixels. Tensorflow 2.4.1 and Keras 2.4.3 were used to train the CNN. All programs were written in Python 3.7 under Linux. The OpenCV 3.4 library was used for image processing. A reference image with a resolution of 0.53 m/pixel and a size of 900x1000 pixels was used as the reference map; it was taken from a height of 500 m in the environment.

6.2. Simulation environment
The IRIS drone model from Ardupilot was selected; it was equipped with a camera, inertial sensors, a compass, and GPS. A 3D city environment from Gazebo was used as the flight environment for the drone. Software In The Loop (SITL) was used to launch and control the drone trajectory. Subscribers were written for the camera images, the compass, and the navigation solution published by ROS via the MAVROS communication protocol. MAVROS is a middleware protocol that translates the model messages into ROS messages. The navigation solution published by ROS was used as the reference path. The simulation environment was flexible for visualization, changing parameters, and repeating tests at zero cost.

6.3. Experiments
The drone was guided along a path in the 3D environment with an average speed of 10 m/s and a height of 100 m. Three flights were performed on the same path: the first using online classification, the second using the assistant map created during the first flight, and the third using a statistical indicator to detect false matchings. The adopted statistical method is similar to that used in [3]: the variance of the position is observed over a sliding window of 30 measurements, and the matching is considered false if the variance exceeds a predefined threshold.
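A minimal sketch of this statistical indicator is given below for reference. The window size (30 measurements) and threshold (1.4) follow the values reported in the experiments; the exact variance definition (here, the sum of the x and y position variances over the window) is an assumption.

```python
# Sketch of the sliding-window variance indicator used in the comparison flight:
# a matching is flagged as false when the recent position variance exceeds a threshold.
from collections import deque
import numpy as np

class VarianceIndicator:
    def __init__(self, window_size=30, threshold=1.4):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def is_false_matching(self, position_xy):
        """Return True when the variance of the last N position fixes exceeds the threshold."""
        self.window.append(position_xy)
        if len(self.window) < self.window.maxlen:
            return False                      # not enough history yet
        positions = np.asarray(self.window)
        # Combined variance of the x and y coordinates over the window (assumed metric).
        variance = positions.var(axis=0).sum()
        return variance > self.threshold
```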
The chosen environment intentionally contained areas where bad matchings are expected, as shown in Figure 1. The captured images were processed online to calculate the drone position. The position output was saved to a csv file and plotted using Octave. In the following figures, "scene matching" refers to the calculated path and "ref" to the reference path. Figure 7 shows the position results of the first flight, i.e., with online classification. Figure 8 shows the statistical indicator with a threshold of 1.4; variance values above it indicate a false matching.

7. Results analysis and conclusion
7.1. Results analysis
A summary of the results is given in Table 2. With online CNN classification, the execution time was 140 ms while the RMS position error was 1.2 m (disregarding the false matchings). The position error is almost unchanged when using the assistant map, while the execution time drops to 41 ms. Without false matching detection, the RMS position error was 3.1 m and the execution time was 40 ms. The statistical method resulted in an RMS position error of 1.5 m and an execution time of 41 ms.

Table 2
Navigation results

False matching exclusion method      RMS position error (m)   Execution time (ms)   Detected false matchings/overall (%)
Online CNN                           1.2                      140                   98.7
With the assistant map               1.15                     41                    98.7
With statistical indicator           1.5                      41                    112
Without false matching exclusion     3.1                      40                    -

Figure 7: The calculated path in the (x, y) plane with the reference path using online classification (left). The calculated and reference positions along the x-axis and y-axis (right).

Figure 8: The calculated variance with a threshold of 1.4 to indicate false matchings.

The results were expected, since CNN inference consumes considerable time. Using the assistant map proved its efficiency in terms of execution time and accuracy. Excluding false matchings suppressed the position error jumps and enhanced reliability. Compared to the statistical method, the CNN shows an advantage: the percentage of false matchings it detects is more trustworthy. In the statistical method, a jump in the calculated variance corresponding to the false matchings was observed. A latency appears in detecting false matchings because of the observation window size: when bad matchings start or end, the statistical indicator does not respond immediately. The 112% means that extra (not real) false matchings were detected because of this latency, even though a small window size was selected so that variance changes could be sensed rapidly. The RMS position error was expected to be small (using any method) when excluding the false matchings. Losing a matching because it has a high probability of being false has fewer negative consequences than accepting it and using it in navigation or control. In the CNN-based method, in addition to reliable false matching exclusion, the assistant map was built, which offers useful information about the environment.

7.2. Conclusion
This paper presented an implementation of a navigation system based on normalized cross-correlation. A new approach to online detection of good and bad matching areas was introduced and compared to a traditional statistical method. A CNN was trained to learn good and bad matching areas. A dataset of 3920 images was created to train and validate the CNN. Images of geometrical intersections from Google Maps were selected and divided into 4 classes, one class for bad matchings and the others for good matchings.
The CNN classification results were saved as an assistant map with the same size as the reference map. The created assistant map contains the probability numbers, i.e., for each position, the probability that an image captured by the drone at that position has a true or false matching. The probability numbers were used to help the navigation system decide whether the vision navigation output should be accepted or rejected. The final system is more accurate and robust than traditional false matching detection methods, but, as expected, with a larger execution time of 140 ms in online detection mode. In offline detection (with the assistant map), the execution time is the same as in traditional methods.
The stored assistant map can be used for many purposes. It can be used for the same purpose as before, i.e., as a navigation decision assistant, without the need for classification. It can also be used to predict where a visual navigation system will be most effective by visualizing the assistant map before the mission. Analyzing the assistant map is also helpful for choosing the type of vision navigation system: for example, it might turn out that cross-correlation is not efficient and other methods such as local features must be adopted, or that standalone vision navigation is sufficient without integration with other systems. In a standalone vision navigation system, or generally in critical missions, it might be necessary to treat the bad matching areas on the assistant map as obstacles: instead of merely being excluded, they can be avoided by the drone guidance system, for example using the potential fields method. Classification of the best loitering and landing areas is a topic we shall focus on in future work. Employing the assistant map in path planning is also an interesting topic for future work.

8. References
[1] L. M. Belmonte, R. Morales, and A. Fernández-Caballero, "Computer Vision in Autonomous Unmanned Aerial Vehicles—A Systematic Mapping Study," Applied Sciences, vol. 9, 3196, 2019.
[2] J. Matuszewski and W. Grzywacz, "Application of Discrete Cross-Correlation Function for Observational-Comparative Navigation System," Annual of Navigation, vol. 24, 2017, doi: 10.1515/aon-2017-0004.
[3] J. R. G. Braga, H. F. C. Velho, G. Conte, P. Doherty and É. H. Shiguemori, "An image matching system for autonomous UAV navigation based on neural network," 2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), Phuket, Thailand, 2016, pp. 1-6, doi: 10.1109/ICARCV.2016.7838775.
[4] A. Yol, B. Delabarre, A. Dame, J. Dartois and E. Marchand, "Vision-based absolute localization for unmanned aerial vehicles," 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 2014, pp. 3429-3434, doi: 10.1109/IROS.2014.6943040.
[5] O. Tsvetkov and L. V. Tananykina, "A preprocessing method for correlation-extremal systems," Computer Optics, vol. 39, no. 5, pp. 738-743, 2015, doi: 10.18287/0134-2452-2015-39-5-738-743.
[6] X. Zhang, Z. He, Y. Liang and P. Zeng, "Selection Method for Scene Matching Area Based on Information Entropy," 2012 Fifth International Symposium on Computational Intelligence and Design, 2012, pp. 364-368, doi: 10.1109/ISCID.2012.98.
[7] G. Conte and P. Doherty, "An Integrated UAV Navigation System Based on Aerial Image Matching," 2008 IEEE Aerospace Conference, 2008, pp. 1-10, doi: 10.1109/AERO.2008.4526556.
[8] Y. Zhao and T. Wang, "A Lightweight Neural Network Framework for Cross-Domain Road Matching," 2019 Chinese Automation Congress (CAC), 2019, pp. 2973-2978, doi: 10.1109/CAC48633.2019.8996270.
[9] A. Nassar, K. Amer, R. ElHakim and M. ElHelw, "A Deep CNN-Based Framework For Enhanced Aerial Imagery Registration with Applications to UAV Geolocalization," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 1594-159410, doi: 10.1109/CVPRW.2018.00201.
[10] A. Shahoud, D. Shashev and S. Shidlovskiy, "Design of a Navigation System Based on Scene Matching and Software in the Loop Simulation," 2021 International Conference on Information Technology (ICIT), 2021, pp. 412-417, doi: 10.1109/ICIT52682.2021.9491778.
[11] S. Khan, H. Rahmani, S. A. A. Shah, M. Bennamoun, G. Medioni, and S. Dickinson, A Guide to Convolutional Neural Networks for Computer Vision, Morgan & Claypool, 2018.