=Paper=
{{Paper
|id=Vol-2744/paper29
|storemode=property
|title=Dataset Expansion by Generative Adversarial Networks for Detectors Quality Improvement
|pdfUrl=https://ceur-ws.org/Vol-2744/paper29.pdf
|volume=Vol-2744
|authors=Alexander Kostin,Vadim Gorbachev
}}
==Dataset Expansion by Generative Adversarial Networks for Detectors Quality Improvement==
Alexander Kostin, Vadim Gorbachev

Federal State Unitary Enterprise «State Research Institute Of Aviation Systems» (GosNIIAS), Moscow, Russia

{akostin,vadim.gorbachev}@gosniias.ru

Abstract. Modern neural network algorithms for object detection require large labelled datasets for training. In a number of practical applications, the creation and annotation of large data collections requires considerable resources which are not always available. One solution to this problem is the creation of artificial images containing the object of interest. In this work, the use of generative adversarial networks (GANs) for generating images of target objects is proposed. It is demonstrated experimentally that GANs can create new images, on the basis of an initial collection of real images and of background images (not containing objects), which imitate real images accurately enough. This makes it possible to build a new training collection containing a greater variety of training examples, which allows higher precision to be achieved by the detection algorithm. In our setting, GAN training does not require more data than is required for training the detector directly. The proposed method has been tested by training a network for detecting unmanned aerial vehicles (UAVs).

Keywords: Object Detection, GAN, Domain Adaptation, UAV, Drone.

1 Introduction

The majority of modern object detection systems and computer vision algorithms are based on machine learning, primarily neural networks. They have proven their reliability and quality in a wide range of tasks. The main disadvantage of such algorithms is the requirement of large (or even extremely large) annotated training datasets, so the problem of lacking such data is commonly faced in applied tasks: for example, when training a detector for a specific object that is not represented in large public annotated data collections, or when the system must work in specific conditions. The development of a visual indoor UAV positioning system [1] is one such case. In the absence of data from satellite navigation systems, and given the requirement that no additional radio wave sources be present, the use of passive sensors such as video cameras with a detection algorithm is extremely useful.

Fig. 1. An example of an image in which it is necessary to detect a mini UAV.

A massive training set is required to train a highly accurate detection network, while the amount of labelled data is severely limited due to limited resources for markup. The authors had at their disposal about 900 images of the drone taken from only 6 different angles. It will be shown below that this number is not sufficient to train a robust detector. For comparison, the standard data collections for object detection tasks include a huge number of images: the ImageNet dataset [2] contains more than 14 million images, and MS COCO [3] contains 328,000. It is standard practice for tasks with small data to use augmentations, but in our case they were not effective enough. In order to achieve high detector accuracy under a very limited training set, we investigated the possibility of using generative adversarial networks (GANs) to create synthetic training images. These images were created by drawing drones with a neural network in different areas of the background. With this approach it is possible to enrich the dataset with new object-background combinations and automatic annotations, and thereby to raise the detector quality.

Adversarial algorithms are a learning method in which two agents (a generator and a discriminator) with opposite goals and corresponding loss functions are created inside a neural network. The generator tries to draw an artificial image that the discriminator cannot distinguish from a real one, while the discriminator tries to learn to distinguish the imitation from real images. This method makes it possible to create a generative network whose output simulates the distribution of the available data accurately enough.
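For reference, this adversarial game is commonly written as the minimax objective below (the standard formulation from the GAN literature; the paper does not state it explicitly):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$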
Such algorithms show that it is possible to create highly realistic artificial images. For example, it is possible to train a network to generate plausible images from noise [4], or a network that transforms the data domain with or without supervision [5,6].

2 Related work

The problem of data scarcity in neural network training is well known, and there are various approaches to expanding training samples without additional manual markup. The main approach is augmentation of the original collection of labelled images: rotations, reflections, distortions of the color channels, image noise, and so on [7].
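For illustration, such an augmentation stack might look as follows in torchvision (our sketch; the paper does not specify which transforms were tried):

```python
import torchvision.transforms as T

# Illustrative augmentation stack of the kind described above: reflections,
# rotations, color-channel distortions and a blur-like perturbation.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=15),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.GaussianBlur(kernel_size=3),
])
```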
Another approach is so-called "transfer learning", i.e., training the neural network on large available collections of similar data and then fine-tuning it directly on the target data [8]. The approach is generally accepted, but it does not completely solve the problem of lack of target data. One more possible approach is to add unlabelled data to the training sample and then mark it up using the model being trained [9]. The resulting pseudo-labels serve as the markup for the added data at the next epoch of neural network training. However, the quality of the bounding box regression task could not be significantly improved with this approach.

An alternative approach is to create, and use in training, synthetic images obtained by rendering 3D models of objects [10]. With this approach object annotations are generated automatically, and the volume of obtained data is theoretically unlimited. The disadvantage of this method is that the synthesized images do not always simulate real ones accurately enough, and neural networks are highly sensitive to fluctuations in the data distribution. In the article [1] this very approach was used to extend the training dataset and train the detector. The data were synthesized with an existing 3D drone model. The drone model was rendered in a 3D modeling system from different angles. Then random transformations were applied to the image and its mask: rotation, scaling, displacement, reflection, etc. After that, the image of the object was inserted by its mask into an arbitrary background. This approach showed a good result when the drones in the images looked relatively large and contrasting, but it proved to be ineffective when switching to higher resolution images with smaller objects (near-realistic conditions).

In the article [11] a GAN was used for creating additional training samples. The key feature of this approach is that an attempt was made to obtain feedback (in the form of gradients during training) for the generator from the detector. The network was arranged in such a way that the input of the generator was fed with a background image in RGB format and a window representing a rectangle of ones against a background of zeros, which indicated the place where the generator should overlay the detectable object. The entire image was created by the generator. In the application task under consideration such an architecture did not work, as the detectable objects were much smaller than the image itself. In contrast to DetectorGAN, in our approach images are not generated entirely; only a small square containing the object to be detected is generated and then inserted back into the high-resolution image.

3 Detection algorithm

Neural network object detection algorithms can be divided into single-stage [12] and two-stage [13]. Detectors from the first group are faster, but in general they are inferior in accuracy to detectors from the second group. Since the application task under consideration required real-time processing of the image stream from 6 cameras, it was decided to use single-stage detectors. RetinaNet [14] with a PeleeNet [15] backbone was chosen as the detection network. As the specificity of the task consists in detecting small objects, the network was modified accordingly. The initial version of PeleeNet received a 304x304 resolution image as input and detected objects at 5 different scales; the input image was divided into grids of sizes from 1x1 to 19x19 cells. The 19x19 grid was not enough to provide sufficiently dense coverage of the image with anchor boxes. Therefore, in this study the network was modified to accept an image of 608x608 resolution at the input and to split it into a 38x38 grid.
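As a quick illustration of what this modification does to anchor coverage (our sketch, not the paper's code; it assumes the finest pyramid level keeps a stride of 16 pixels, since 304/19 = 608/38 = 16):

```python
import numpy as np

def anchor_centers(img_size: int, grid: int) -> np.ndarray:
    """Centers of the cells when a square image is split into a grid x grid mesh."""
    stride = img_size / grid  # 304/19 = 608/38 = 16 px: the cell size is unchanged
    coords = (np.arange(grid) + 0.5) * stride
    xx, yy = np.meshgrid(coords, coords)
    return np.stack([xx, yy], axis=-1)  # shape (grid, grid, 2)

print(anchor_centers(304, 19).shape[0] ** 2)  # 361 anchor positions (original net)
print(anchor_centers(608, 38).shape[0] ** 2)  # 1444 anchor positions (modified net)
```

Doubling the input resolution while keeping the stride fixed quadruples the number of anchor positions over the same scene, which is what provides denser coverage for small objects.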
4 Data generation algorithm

The main objective of our work was to create an algorithm that draws the target object on a given background image. To solve this problem, the principle of domain transfer was applied: a rendered image of a 3D object model is pasted into some background image, and this image is then processed by a neural network which transforms the synthetic image of the object into a more realistic one. A generative adversarial network (GAN) was used as this transformation network.

Fig. 2. The general scheme of the proposed pipeline. From left to right: background image, background image with pasted render of the 3D model of the object, transformed image, whole image with inserted fragment (fragment is highlighted with a red square).

Although the scheme is quite simple in general, in practice a number of problems arise in its implementation, because of which synthesized data cannot be an effective substitute for real data. The main difficulty with this approach is the sharp boundary (Fig. 3) which appears at the place where the modified fragment is inserted into the original image. Usually, image transformation algorithms change not only the object itself and its domain, but also partially change the background. When a fragment is inserted back into the original image, a sharp non-uniform border appears. Such an artifact may be "learned" by the detection algorithm at the training stage as an important informative feature, which prevents it from working correctly on real data.

Fig. 3. An example of generated images containing a clear boundary at the drone generation site.

In order to get rid of this boundary effect, Attention Guided GAN was taken as the generative model [16]. Its learning algorithm is similar to classical CycleGAN [6], but the principle of image generation is different. The generator receives the input image from the original (synthetic) domain, the encoder extracts latent features, and then the decoder creates a mask and an image, which is pasted into the original image through the mask. This changes only the part of the image that is directly related to the domain of the image. The border of the changed part of the picture turns out to be smooth, and it is possible to insert the fragment back into the picture without the obvious artifacts that could mislead the detector during training. Such a generator produces 9 masks and 9 images. To train it, two sets of images from different domains are required: a set of background images with drone renders pasted into them, and a set of real drone images cropped from the training dataset. This network learns to convert images from one domain to the other, like CycleGAN.

The proposed data extension algorithm consists of 3 stages (see the sketch after this section). The input of the algorithm is a picture and the parameters of a bounding box inside which the target object is located. A square fragment with a fixed height and width is cut out of the picture, centered on the center of the box. The values of these parameters depend on the data and should be large enough that the fragment can hold the object with the largest bounding rectangle in the training dataset; the fragment's side was set to 152 pixels in our experiment. An arbitrary image of the rendered drone is pasted into the cut-out fragment, after which the fragment is fed to the input of the trained generator, which performs the transition from the domain of synthetic data to the domain of realistic images. The converted image is inserted back into the original picture. As ground truth for the obtained data, bounding boxes with centers corresponding to the centers of the squares and with sizes equal to the largest in the marked data are taken.

Fig. 4. Architecture of the algorithm.
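The following is a minimal sketch of this 3-stage step (our Python/NumPy illustration, not the paper's code; the `generator` callable stands in for the trained AGGAN, its interface is our assumption, and boundary clipping is omitted for brevity):

```python
import numpy as np

PATCH = 152  # fragment side used in the paper's experiments (pixels)

def expand_image(img, box, render, generator):
    """Cut a square around the box, paste a render, translate domains, paste back.

    Assumes the fragment lies fully inside the image; `generator` is assumed
    to map a synthetic HxWx3 patch to a realistic one of the same shape.
    """
    x0b, y0b, x1b, y1b = box
    cx, cy = (x0b + x1b) // 2, (y0b + y1b) // 2
    half = PATCH // 2
    x0, y0 = cx - half, cy - half
    # Stage 1: cut out a fixed-size square fragment centered on the box.
    patch = img[y0:y0 + PATCH, x0:x0 + PATCH].copy()
    # Stage 2: paste an arbitrary drone render into the fragment and run the
    # generator to move it from the synthetic to the realistic domain.
    rh, rw = render.shape[:2]
    ry, rx = half - rh // 2, half - rw // 2
    patch[ry:ry + rh, rx:rx + rw] = render
    patch = generator(patch)
    # Stage 3: insert the transformed fragment back into the original image.
    out = img.copy()
    out[y0:y0 + PATCH, x0:x0 + PATCH] = patch
    return out

# Usage (shapes only; an identity function as a stand-in generator):
img = np.zeros((720, 1280, 3), np.uint8)
out = expand_image(img, (600, 300, 680, 360),
                   np.ones((40, 60, 3), np.uint8), lambda p: p)
```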
5 Experiments and results

The experiments were carried out on data consisting of footage of a drone flight in a hangar. The footage was captured with six tripod-mounted cameras at different angles (Fig. 5). To avoid the need for manual annotation of the video files, the following automatic annotation algorithm [1] was used to create the training collection. Optical flow maps were calculated for each video frame, and the area with the maximum magnitude of optical flow was selected on the maps; this area was assumed to correspond to the drone. However, due to the presence of other moving objects, shadows and segmentation inaccuracies, such a markup cannot be considered completely accurate. Images from 4 cameras were taken for training the generator and detector models; images from the other two cameras were taken as the validation and test datasets.

Fig. 5. An example of an image from the available data. The drone is highlighted with a red rectangle.

To establish the effectiveness of the proposed algorithm, three experiments were carried out. The first consisted in training the detector on the raw training set of 900 images, which is the hangar footage from 4 angles (4 training backgrounds). The second experiment repeated the method proposed in the article [1]: the data were extended by placing drone renders in arbitrary places. The third was to train the detector on the same data extended by new pictures created by the trained generator. In the process of data extension, drones were applied to each background on a uniform grid with 30 and 20 pixel steps in the x and y axes respectively. All manipulations were performed with images of the original 1280x720 pixel resolution. The sizes of the datasets in the second and third experiments were equal. Since the source data are very poor in variety of backgrounds (there are only 4 angles of the same hangar in the training set), it was decided to add to the dataset random images that do not contain detectable objects. This solution expands the variability of backgrounds, which increases the discriminative ability of the network and improves the precision of the detector.

5.1 Details of AGGAN training

To train the generator, a square with a side of 152 pixels and its center coinciding with the center of the bounding rectangle was cut out from each image in the training sample. The data obtained by this procedure formed domain A. The training dataset was then parsed in order to find an empty square for each corresponding square in domain A; such empty pictures formed domain B. In this way, pairs of images from different domains were obtained. In order to introduce variability into the generated data, it was decided to put drone renders on the images from domain B. It was assumed that the generator would make the transition between domains by increasing the visual likelihood of the pasted drones. In this case, the data could be expanded by overlaying new renders from different angles. There were a total of 15 drone render images. Attention Guided GAN training was run on the data obtained in this way for 200 epochs. Adam with parameters lr=0.0002, beta1=0.5 and beta2=0.999 was used to optimize the network. After 100 epochs, the learning rate began to decrease linearly to zero.
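A minimal sketch of this optimizer and schedule, assuming PyTorch (the paper does not state the framework used):

```python
import torch

# Placeholder module standing in for the AGGAN generator being trained.
generator = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Constant learning rate for the first 100 epochs, then linear decay
# reaching zero at epoch 200.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda epoch: 1.0 if epoch < 100 else 1.0 - (epoch - 100) / 100.0)

for epoch in range(200):
    ...  # one training epoch over domains A and B
    sched.step()
```

Lowering beta1 from the Adam default of 0.9 to 0.5 is the usual GAN stabilization choice, which matches the parameters quoted above.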
5.2 Detector training details

In all experiments the detector was trained for 100 epochs. Adam with parameters lr=0.001, beta1=0.9, beta2=0.999 was chosen as the optimizer. Since it is known that the image will always contain at most one object (this is a specificity of the applied task), only the prediction with the greatest confidence was taken as the network output.

Fig. 6. Examples of generated drones. The first and third rows show source images with a pasted drone render; the second and fourth rows show the corresponding transformed images.

5.3 Results

The f1 metric curves for all three experiments are shown in Fig. 7, Fig. 8 and Fig. 9, and the results on the test dataset are presented in Table 1. Figure 7 shows the instability of the learning curve on raw real data, which indicates an insufficient amount of training data. The final experimental results (Table 1) prove that the addition of artificial data to the training set is useful, and that the proposed method of image transformation is the most effective. The amount of real images used in each experiment was equal, so artificial expansion of the dataset by our method is a solution to the lack-of-data problem.

Fig. 7. Graph of the f1 metric while training only on real data.

Fig. 8. Graph of the f1 metric while training on data extended with drone renders.

Fig. 9. Graph of the f1 metric while training on data extended with the proposed method of image transformation.

Table 1. Values of metrics on the test sample in various experiments

Dataset            | f1     | recall | precision
Raw data           | 0.9723 | 0.9489 | 1.0
Raw data + renders | 0.9813 | 0.9659 | 1.0
Raw data + GAN     | 0.9909 | 0.9830 | 1.0
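For reference, f1 is the standard harmonic mean of precision and recall (standard definition; the paper does not state the formula explicitly):

$$f_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$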
6 Conclusion

The problem of artificial expansion of the training dataset via GAN for an object detection neural network was solved in this work. The input data of the proposed algorithm are background images containing renders of a 3D model of the target object. The proposed algorithm first pastes the object model render onto the background image, then the neural network performs domain transfer for the local fragment of the image containing the object. It is shown in experiments that such synthesized images can be successfully used for training detectors and significantly improve their quality in comparison with the use of both raw real images only and a mixture of real images with 3D model renders.

References

1. Yu. B. Blokhinov, V. A. Gorbachev, A. D. Nikitin, and S. V. Skryabin, "Technology for the Visual Inspection of Aircraft Surfaces Using Programmable Unmanned Aerial Vehicles," Journal of Computer and Systems Sciences International, vol. 58, no. 6, pp. 960-968, 2019.
2. O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," Sep. 2014. [Online]. Available: http://arxiv.org/abs/1409.0575
3. T.-Y. Lin et al., "Microsoft COCO: Common Objects in Context," May 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
4. A. Brock, J. Donahue, and K. Simonyan, "Large Scale GAN Training for High Fidelity Natural Image Synthesis," Sep. 2018. [Online]. Available: http://arxiv.org/abs/1809.11096
5. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," Nov. 2016. [Online]. Available: http://arxiv.org/abs/1611.07004
6. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks," Mar. 2017. [Online]. Available: http://arxiv.org/abs/1703.10593
7. L. Perez and J. Wang, "The Effectiveness of Data Augmentation in Image Classification using Deep Learning," arXiv:1712.04621, 2017.
8. S. J. Pan and Q. Yang, "A Survey on Transfer Learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010, doi: 10.1109/TKDE.2009.191.
9. S. Kim, J. Choi, T. Kim, and C. Kim, "Self-Training and Adversarial Background Regularization for Unsupervised Domain Adaptive One-Stage Object Detection," Aug. 2019. [Online]. Available: http://arxiv.org/abs/1903.12296
10. J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. T. Birchfield, "Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 1082-10828.
11. L. Liu, M. Muelly, J. Deng, T. Pfister, and L.-J. Li, "Generative Modeling for Small-Data Object Detection," Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.07169
12. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," Jun. 2015. [Online]. Available: http://arxiv.org/abs/1506.02640
13. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Jun. 2015. [Online]. Available: http://arxiv.org/abs/1506.01497
14. T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal Loss for Dense Object Detection," Aug. 2017. [Online]. Available: http://arxiv.org/abs/1708.02002
15. R. J. Wang, X. Li, and C. X. Ling, "Pelee: A Real-Time Object Detection System on Mobile Devices," Apr. 2018. [Online]. Available: http://arxiv.org/abs/1804.06882
16. H. Tang, D. Xu, N. Sebe, and Y. Yan, "Attention-Guided Generative Adversarial Networks for Unsupervised Image-to-Image Translation," Mar. 2019. [Online]. Available: http://arxiv.org/abs/1903.12296