Technology for Indoor Drone Positioning Based on CNN Detector

V.A. Gorbachev¹, Yu.B. Blokhinov¹, A.D. Nikitin¹, E.E. Andrienko¹
vadim.gorbachev@gosniias.ru | yury.blokhinov@gosniias.ru
¹ FSUE "GosNIIAS", Moscow, Russia

The article presents a technology for drone positioning in a multi-camera system based on a detection algorithm. The paper describes the positioning system and the algorithm for computing the 3D coordinates of a drone from its image positions detected in the frames of stationary video cameras. Positioning enables automatic control of the drone when precise data from satellite navigation systems are unavailable, for example, in closed hangars. The developed technology is used to build a complex for automatic visual inspection of aircraft. Ways of adapting a neural network detection algorithm to the drone detection problem are presented, with the main attention paid to the methods of training data preparation. It is shown that high accuracy can be achieved using synthesized images, without any real data or manual labelling.

Keywords: object detection, neural networks, drones, positioning, indoor navigation, multi-camera system, image synthesis.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Currently, due to the growth of air traffic, timely and high-quality visual inspection of aircraft during maintenance at the airport has become considerably more difficult. The total downtime of aircraft during unscheduled inspections has increased significantly, caused, for example, by the impact of atmospheric electricity on the fuselage surface in flight. External human inspection of hard-to-reach areas of the aircraft, such as the upper fuselage or tail, aimed at identifying the effects of lightning, currently takes considerable time and leads to aircraft downtime or even flight delays. For companies with a fleet of more than 200 aircraft, such as Aeroflot, such an event is not uncommon: according to the company, it occurs about 300-400 times a year, leading to significant time and financial losses.

Large companies such as Airbus, Lufthansa, EasyJet and American Airlines have started applying drones to accelerate the visual inspection of aircraft. However, drones are currently operated in manual mode, which does not allow the full potential of the technology to be realized. According to experts, the use of programmable drones will significantly reduce inspection time and, no less importantly, make the technology itself completely digital.

The article proposes an approach to creating an automated drone control technology based on real-time positioning of the drone with a system of stationary cameras. This technology is necessary to ensure the functioning of the drone control system in enclosed spaces such as aircraft hangars. A special positioning technology is required because the signals of global satellite navigation systems (GPS, GLONASS, etc.) may be partially or completely unavailable inside the hangar where aircraft maintenance is carried out. At the same time, the inertial navigation system of the drone cannot provide sufficient accuracy throughout its flight. Since the drone must fly at a short distance from the aircraft (no more than 1.5 meters), the accuracy of the trajectory is a critical aspect for the safety and applicability of the technology. A visual positioning system is the most suitable option under the described conditions: it provides sufficient accuracy, requires no additional equipment on the drone, and is passive, so it does not emit any radio or other signals except Wi-Fi.

During maintenance, the drone flies over the aircraft along a programmed trajectory and records a high-resolution video of the fuselage and wing surfaces (Fig. 1). Based on the coordinates obtained from the visual positioning system, the onboard drone control system monitors compliance with the chosen trajectory. The decision on the presence of damage to the aircraft skin is made from the results of automatic analysis of the recorded videos. This technology allows the process of visual inspection of aircraft to be fully automated [1]. The paper describes the features of creating such a technology in terms of positioning drones through the use of CNN-based detectors.

Fig. 1. The drone flight over the aircraft during the tests.
2. Review of detection algorithms

The proposed technology is based on an algorithm for detecting objects in images (namely, video frames). The most advanced detection algorithms today are based on deep convolutional neural networks (CNN). Neural network architectures for detection are divided into two main types: single-stage and two-stage. In two-stage approaches, the detection task is split into two steps: finding regions of interest, then classifying the object in each region and predicting the parameters of its bounding box.

The two-stage approach was first introduced in 2014 by Girshick [2]. His work R-CNN (Regions with CNN features) uses the selective search method [3] to find regions of interest in the input image and a CNN-based classifier to classify each region. Fast R-CNN [4] improves R-CNN by extracting the regions of interest from feature maps. Faster R-CNN [5] is a further modification of Fast R-CNN and R-CNN based on the idea of region proposals. The key difference between Faster R-CNN and its predecessors is that the regions are computed not from the original image but from the feature map produced by the convolutional neural network. For this purpose, a module called the Region Proposal Network (RPN) was added. The proposals obtained with its help are passed to two parallel fully connected branches: bounding box prediction (regression) and classification. The outputs of these layers are based on so-called anchor boxes: several frames for each window position, with different sizes and aspect ratios. For each such rectangle, the regression layer produces 4 parameters that adjust the position of the bounding rectangle, and the classification layer produces the probability that the rectangle contains an object and the probabilities that the object in the frame belongs to each of the classes. Cascade R-CNN [6] addresses the problem of increasing the accuracy of bounding box detection by applying a sequence of detectors with increasing thresholds.

In single-stage approaches, there is no separate stage of finding regions of interest: the regression of bounding boxes and the classification of candidates in anchor regions are performed directly. Because of this, these architectures are more computationally efficient than two-stage ones while maintaining a competitive accuracy-performance ratio. SSD (Single Shot Detector) [7] uses a single neural network that performs all the necessary computations and eliminates the need for resource-intensive region proposal methods. SSD places the anchors densely over the input image and uses features from different convolutional layers for the regression and classification of the anchor regions. DSSD [8] adds deconvolution layers inside SSD to interconnect features from the top and bottom layers. YOLO (You Only Look Once) [9] uses a small number of anchor regions (dividing the input image into a rectangular grid) and is based on the VGG-16 neural network. YOLOv2 [10] improves performance through a new bounding box regression scheme and a new backbone network, Darknet-19. YOLOv3 [11] further improves on Darknet-19, offering a deeper neural network with skip connections. The YOLOv2 and YOLOv3 architectures allow the balance between detection accuracy and speed to be adjusted by varying the number of regions and are able to solve the detection problem in real time.

A slightly different approach is used by CenterNet [12], a detection algorithm based on neural network key point detection. It learns to predict the centers of objects and to form a feature map; the parameters of the bounding rectangle are then regressed for the detected centers. CornerNet [13] is another detection algorithm based on key point prediction. Unlike CenterNet, CornerNet detects an object using a pair of corners of its bounding box.
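To make the anchor-based scheme discussed above more concrete, the following is a minimal, generic NumPy sketch of how per-anchor predictions (4 regression offsets, an objectness score and per-class probabilities) could be decoded into boxes. It is an illustration of the general idea rather than the authors' implementation; the (dx, dy, dw, dh) parameterization follows the common Faster R-CNN convention and differs in detail between detectors.

```python
# Generic illustration (not the authors' code) of decoding anchor-based predictions.
import numpy as np

def decode(anchors, deltas, objectness, class_probs, score_thr=0.5):
    """anchors, deltas: (N, 4) as (cx, cy, w, h); objectness: (N,); class_probs: (N, C)."""
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]   # shift the anchor center by dx * w
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(deltas[:, 2])             # rescale width and height
    h = anchors[:, 3] * np.exp(deltas[:, 3])
    scores = objectness[:, None] * class_probs           # per-class confidence
    keep = scores.max(axis=1) > score_thr
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    return boxes[keep], scores[keep].argmax(axis=1), scores[keep].max(axis=1)
```

In practice the result is additionally filtered by non-maximum suppression; that step is omitted here for brevity.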
3. Indoor positioning system

As part of the work, an original scheme for organizing the indoor positioning system was developed. Video cameras (4 or more) are placed in the hangar in a certain way, and their orientation parameters are determined in advance during calibration of the system. The cameras are connected to a server that receives and processes the video data. The UAV itself is treated as the target object, which is detected in the frames of the received video streams by the detection algorithm. The parameters of the algorithm are trained to detect a drone of the selected model; in our case it was a DJI Phantom 3 Advanced. Based on the position of the drone in the frames and the orientation of the cameras, its spatial position is calculated. The computed coordinates of the object are transmitted to its onboard control system via a Wi-Fi channel. The scheme of the proposed navigation system is shown in Fig. 2.

Fig. 2. Indoor cameras-based drone positioning system.

To calculate the three-dimensional coordinates of the object from its positions in the images, a method is used that is a special case of block triangulation by the bundle method [14]. Since the camera orientation parameters are known, only the three unknown 3D coordinates are estimated. The idea of the method is to minimize the deviation of the projection of the computed three-dimensional point from the actually observed position of the object in each image (more precisely, the sum of squared errors over all cameras). The projection equations are

x_i = -f_i \frac{a_1^i (X_{g_i} - X) + b_1^i (Y_{g_i} - Y) + c_1^i (Z_{g_i} - Z)}{a_3^i (X_{g_i} - X) + b_3^i (Y_{g_i} - Y) + c_3^i (Z_{g_i} - Z)},    (1)

y_i = -f_i \frac{a_2^i (X_{g_i} - X) + b_2^i (Y_{g_i} - Y) + c_2^i (Z_{g_i} - Z)}{a_3^i (X_{g_i} - X) + b_3^i (Y_{g_i} - Y) + c_3^i (Z_{g_i} - Z)},    (2)

where (X_{g_i}, Y_{g_i}, Z_{g_i}) is the position of camera i, (X, Y, Z) is the 3D position of the object, (x_i, y_i) is its projection in image i, f_i is the focal length of camera i, and

R_i = \begin{pmatrix} a_1^i & a_2^i & a_3^i \\ b_1^i & b_2^i & b_3^i \\ c_1^i & c_2^i & c_3^i \end{pmatrix}

is the rotation matrix of camera i.

This is a well-known problem that is solved by iterative approximations. Each increment ΔX of the three-dimensional coordinates is found from the solution of the system of equations

A^T A \, \Delta X + A^T B = 0,

where A is the matrix of partial derivatives of the projection equations (1), (2) with respect to the drone coordinates, stacked over all cameras (size 2·(number of cameras) × 3), and B is the discrepancy vector (size 2·(number of cameras)) containing the deviations of the object projections from their observed positions in the images.
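A minimal NumPy sketch of the iterative scheme above is given below. The camera parameters (R_i, camera position, f_i) and the detected image coordinates are assumed to be already available; the Jacobian is formed by numerical differentiation for brevity, and all names are illustrative rather than taken from the authors' software.

```python
# Gauss-Newton triangulation sketch for equations (1)-(2); illustrative names.
import numpy as np

def project(point, cam):
    """Projection equations (1)-(2) for a single camera."""
    d = cam["pos"] - point             # (X_g - X, Y_g - Y, Z_g - Z)
    u = cam["R"].T @ d                 # columns of R give the a_k, b_k, c_k coefficients
    return -cam["f"] * u[:2] / u[2]    # predicted (x_i, y_i)

def triangulate(cams, obs, x0, n_iter=10, tol=1e-6):
    """Refine the 3D point X by minimizing the total reprojection error."""
    X = np.array(x0, dtype=float)
    for _ in range(n_iter):
        n = len(cams)
        A = np.zeros((2 * n, 3))       # Jacobian of (1),(2) w.r.t. (X, Y, Z), all cameras
        B = np.zeros(2 * n)            # discrepancy vector
        for i, (cam, xy) in enumerate(zip(cams, obs)):
            B[2*i:2*i+2] = project(X, cam) - xy
            for j in range(3):         # numerical derivatives for brevity
                e = np.zeros(3); e[j] = 1e-6
                A[2*i:2*i+2, j] = (project(X + e, cam) - project(X - e, cam)) / 2e-6
        dX = np.linalg.solve(A.T @ A, -(A.T @ B))   # A^T A dX + A^T B = 0
        X += dX
        if np.linalg.norm(dX) < tol:
            break
    return X
```

With at least two cameras observing the drone the normal equations are well conditioned, and a few iterations from a rough initial guess are typically sufficient.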
4. Detection algorithm details

The YOLOv2 [10] CNN architecture was used as the detection algorithm. This architecture is slightly inferior to YOLOv3 in accuracy but has a higher computation speed and demonstrates one of the best accuracy-to-performance ratios, which is of key importance for our task. Performance determines the frequency of control signals delivered to the drone, which directly affects the accuracy of control and the maximum safe flight speed.

The network receives a three-channel image as input and outputs a tensor of size X×X×Y, where X is the number of cells along each side of the input image. The length Y of the tensor depends on the number of detected classes and the number of anchor regions per cell. For each anchor region, 5 basic parameters are computed: the coordinates of the upper left corner of the rectangle, its width, its height, and the probability that the rectangle contains any object. In addition, the probability of the object belonging to each of the selected classes is determined. The hyperparameters of the algorithm are the number and sizes of the anchor regions and the size of the input image.

The image size determines the number of cells for which features are computed, since the cell size is fixed and equal to 32×32 pixels. It therefore directly affects both the performance and the accuracy of the network, since the number of cells determines the amount of computation. On the other hand, with more cells each of them contains fewer objects, the features computed in a cell describe each object more accurately, and the prediction becomes more reliable. The plot in Fig. 3 shows the dependence of FPS, precision, recall and bounding box accuracy (by the Intersection over Union metric, IoU) on the image resolution. Despite the slight gain in accuracy at higher resolutions, a resolution of 576×576 (18×18 cells) was chosen to improve performance.

Fig. 3. Dependency of FPS, precision, recall and IoU on image resolution.

To maintain the balance between speed and accuracy, the number of anchors was set to 5, since a larger number of anchor regions decreases performance. K-means clustering of the bounding rectangles in our training data set was used to determine the anchor sizes.
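The following sketch illustrates such anchor clustering. The paper does not state the distance measure, so the sketch assumes the 1 − IoU distance popularized by YOLOv2; the function and variable names are illustrative.

```python
# K-means over training box sizes to pick anchors, assuming a 1 - IoU distance.
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors when both are centered at the origin; shapes (N,2), (k,2)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, n_iter=100, seed=0):
    """boxes: (N, 2) array of (width, height) of the labelled drone boxes."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(n_iter):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)   # nearest anchor by 1 - IoU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors
```

Usage amounts to collecting the widths and heights of all labelled boxes (here in pixels or grid-cell units) and calling kmeans_anchors(train_wh, k=5).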
5. Automated data preparation

Since the work uses learning-based detection algorithms, training data is required to train them. In our case, the data are images with annotations: the type of the object and its coordinates (bounding box) in the image. The CNN detectors used are very flexible but have a very large number of parameters, so a lot of training data is required. To avoid time-consuming manual labelling, automatic image synthesis was used for training and testing the algorithm.

The data were synthesized by rendering an existing three-dimensional model of the drone (Fig. 4). Special 3D environments were not used during data rendering, as their preparation requires additional manual work by designers. Instead, the process was structured as follows. The model of the drone was rendered in a 3D modeling system from different angles on a uniformly colored background. The object was automatically cut out of the image and its mask was built. The image and the mask were then subjected to random transformations: rotation, scaling, displacement, reflection, perspective transformation, blurring, salt-and-pepper noise, and shifts of the color channel values (Fig. 5). After that, the image of the object under its mask was overlaid on arbitrary backgrounds; both random images and images of the test hangar where the subsequent testing was carried out were used as backgrounds. To make such insertions look natural and to keep the network from memorizing overlay artifacts as informative features of the object, local smoothing of the objects with a Gaussian filter of randomized intensity was performed (a compositing sketch is given at the end of this section). In addition, objects from the COIL-100 collection [15] were added to the images as distractors to increase the discriminative ability of the network.

Fig. 4. Rendered 3D drone model and its mask.

Fig. 5. Examples of training samples: one with a real hangar image as the background and one with a random background image.

To prepare the test data and to extend the training base with real images, a real drone flight with video capture was carried out in the test hangar (Fig. 6). To avoid manual annotation of the video files, the following automatic labelling algorithm was used. Optical flow maps were computed for each video frame, and the area with the maximum optical flow magnitude was selected on each map. Since there were normally no other moving objects in the experimental scene, this area was assumed to correspond to the drone. Occasionally, due to foreign moving objects and shadows, as well as inaccuracy of the segmentation, such labelling contained errors. An experimental study was devoted to estimating the influence of the different types of training data on the detection results.
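A minimal OpenCV/NumPy sketch of the compositing step described above is given below. The transformation set is reduced to a similarity transform for brevity (the paper also uses reflection, perspective, noise and color shifts), and the parameter ranges and names are illustrative assumptions, not the authors' exact settings.

```python
# Overlay a rendered drone (with its mask) onto a background and derive the box label.
import cv2
import numpy as np

def composite(drone_bgr, mask, background, rng=np.random.default_rng()):
    h, w = background.shape[:2]
    # Random similarity transform: rotation, scaling and displacement.
    angle, scale = rng.uniform(-180, 180), rng.uniform(0.3, 1.0)
    M = cv2.getRotationMatrix2D((drone_bgr.shape[1] / 2, drone_bgr.shape[0] / 2), angle, scale)
    M[:, 2] += [rng.uniform(0, w * 0.5), rng.uniform(0, h * 0.5)]
    obj = cv2.warpAffine(drone_bgr, M, (w, h))
    m = cv2.warpAffine(mask, M, (w, h)).astype(np.float32) / 255.0
    # Smooth the object border with a Gaussian filter of randomized intensity
    # so that overlay artifacts are not learned as features of the drone.
    m = cv2.GaussianBlur(m, (0, 0), rng.uniform(0.5, 3.0))[..., None]
    out = (m * obj + (1.0 - m) * background).astype(np.uint8)
    # The bounding-box annotation is derived from the warped mask.
    ys, xs = np.nonzero(m[..., 0] > 0.5)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())) if xs.size else None
    return out, box
```

Deriving the box from the warped mask is what makes the pipeline fully automatic: no manual labelling step remains between rendering and training.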
6. Experiments

The accuracy of the detection algorithm was tested in a series of computational experiments on various training collections. We had three main data collections: synthetic, where the images of the drone were obtained by rendering the 3D model and the backgrounds were taken arbitrarily; semi-synthetic, where the backgrounds for rendering were real images of the hangar in which the experiment was carried out; and autolabelled real data, obtained by automated labelling of the drone videos (using optical flow). Incorrectly labelled data was manually deleted. The test data was the part of the real data collection that was not used for training. The obtained precision, recall and IoU for the different training sets are shown in Table 1.

Table 1. Detector testing results.

Train data                          Precision   Recall    IoU
Synthetic                           100%        98.69%    98.65%
Synthetic without distractors       27.53%      97.71%    97.63%
Synthetic without smoothing         18.25%      86.60%    86.41%
Semi-synthetic                      41.13%      92.48%    91.58%
Semi-synthetic without smoothing    45.56%      93.32%    98.64%
Autolabelled real data              33.92%      37.91%    37.89%

According to the results of the experiments, the following conclusions can be drawn. First, it is possible to train the algorithm to high accuracy on fully synthetic data, which was the purpose of the work. Second, smoothing of the objects when overlaying them on the background image plays a crucial role: without smoothing, artifacts at the object boundaries become too important a feature for the neural network, and it overfits to detect only artificial objects. Third, the use of a large number of random backgrounds worked better than the use of a small number of real backgrounds from the test hangar. Even though the background images in the test data were similar to the training ones (but not the same), the network overfits, that is, it has low generalizing ability and does not cope with new scenes. Fourth, the inclusion of random distractor objects in the training images significantly improved the accuracy. Although these objects are not labelled, the network learned to better distinguish the drone from any other objects (see Table 1).

In addition, it was found during the experiments that training the network on the data obtained by the autolabelling method described above gave worse accuracy than training on synthetic data. This is because the optical flow map is blurred, so the resulting bounding box is larger than the real object bounding box (Fig. 6); also, the available real data are not sufficiently diverse.

Fig. 6. Real video frame from the experimental hangar and the corresponding optical flow map.

7. Conclusion

The paper describes an indoor drone positioning technology based on stationary visual sensors and a drone detection algorithm. Given the camera orientations and the detection results, the 3D position is reconstructed using a special algorithm of iterative minimization of the total reprojection error. The ways of adapting the CNN-based detector to the subject area were investigated. Both the automated process of creating training data and the hyperparameter tuning are described. The influence of the data generation methods on the result is studied, in particular the inclusion of distracting objects in the data, artifacts of object overlay, and the use of various background images. The conducted experiments showed that it is possible to train a high-accuracy detector exclusively on automatically synthesized images obtained by rendering a three-dimensional model of the drone, without any real samples.

8. Acknowledgements

This work was supported by the Russian Foundation for Basic Research, project no. 17-08-00191 a.
9. References

[1] Yu.B. Blokhinov, V.A. Gorbachev, A.D. Nikitin, S.V. Skryabin. Technology for Visual Inspection of Aircraft Surfaces Using Programmable Unmanned Aerial Vehicles. Journal of Computer and Systems Sciences International. Received by the editor 28.06.2019.
[2] R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p. 580-587, 2014.
[3] J. R. Uijlings, K. E. Van De Sande, T. Gevers, A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154-171, 2013.
[4] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, p. 1440-1448, 2015.
[5] S. Ren, K. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, p. 91-99, 2015.
[6] Z. Cai, N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p. 6154-6162, 2018.
[7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg. SSD: Single shot multibox detector. In ECCV, p. 21-37, Springer, 2016.
[8] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[9] J. Redmon, S. Divvala, R. Girshick, A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p. 779-788, 2016.
[10] J. Redmon, A. Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p. 7263-7271, 2017.
[11] J. Redmon, A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[12] X. Zhou, D. Wang, P. Krähenbühl. Objects as Points. arXiv preprint arXiv:1904.07850v2, 2019.
[13] H. Law, J. Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, p. 734-750, 2018.
[14] A.P. Mikhailov, A.G. Chibunichev. Photogrammetry. MIIGAiK Publishing, Moscow, 2016, 294 p. (In Russian.)
[15] S. A. Nene, S. K. Nayar, H. Murase. Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96, February 1996.