Investigation of the Complex Data Distributions for their Efficient Generation

Philipp Kofman¹, Oles Dobosevych¹, and Rostyslav Hryniv¹,²

¹ Ukrainian Catholic University, Lviv, Ukraine
² University of Rzeszów, Rzeszów, Poland
{kofman, dobosevych, rhryniv}@ucu.edu.ua

Abstract. The active development of image processing methods requires large amounts of correctly labeled data, and the lack of quality data makes many machine learning methods inapplicable. When the possibilities for collecting real data are limited, methods for synthetic data generation are used instead. In practice, the task of high-quality generation of synthetic images can be formulated as the efficient generation of complex data distributions, which is the object of study of this work. Generating high-quality synthetic data is an expensive and complicated process with existing methods. Two main approaches are commonly used: image generation based on rendered 3-D scenes and the use of GANs for simple images. Both have drawbacks, such as a narrow range of applicability and insufficient distribution complexity of the obtained data. When GANs are used to generate complex distributions, we face in practice a visible increase in the complexity of the model architecture and of the training procedure. A deep understanding of the complex distributions of real data can be used to improve the quality of synthetic generation. Minimizing the differences between the real and synthetic data distributions can not only improve the generation process but also yield tools for solving the problem of data scarcity in image processing.

Keywords: statistics · generative adversarial network · deep learning · synthetic data

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of the 1st Masters Symposium on Advances in Data Mining, Machine Learning, and Computer Vision (MS-AMLV 2019), Lviv, Ukraine, November 15-16, 2019, pp. 21–28

1 Introduction

Expanding the capabilities of computer vision and deep learning opens up approaches to many problems that previously remained unsolved. Yet many tasks remain beyond the reach of modern deep learning technologies, even when a large amount of manually annotated data is available. Deep learning models do not understand their input, at least not in the human sense. People understand images based on their experience; machine learning models have no access to such experience and therefore cannot understand input data in this way. By annotating a large number of training examples, we force models to learn a geometric transformation that maps data to human concepts for a specific set of examples, but this transformation is only a simplified outline of the original object model. Deep learning models currently have no mechanism for learning abstractions through the direct definition of an object, and working with thousands, millions, or even billions of training examples solves this problem only partially [1]. Data collection for such tasks is essential but sometimes very difficult, especially in the case of rare classes of objects. We should also note that for such amounts of data, manual annotation is not the best choice, since it requires substantial resources and well-established markup strategies.
One way to solve this problem is to use artificially generated data. However, when using synthetic data, we may face a sharp increase in the complexity of choosing the architecture and training methods for the model. We can assume that, for the model, there is a fundamental difference between real and generated data. This study aims to compare the distributions of real and synthetic data, to study the reasons for the increased complexity of working with synthetic data, and to find ways to eliminate it.

2 Related Work

Research in visual domain adaptation has been conducted since 2010, when the first approaches to the problem were statistical methods. Since 2014, however, neural network methods have gained considerable popularity. The lack of data soon became a related problem, which led to the growth of synthetics generation methods [2-4].

The need for synthetic data arises in many tasks; an outstanding representative of such tasks is autonomous driving. In order to build high-precision classifiers of road markings, signs, cars, and other objects, vast volumes of qualitatively labeled data are needed. To address this, the idea of generating a dataset based on a game world was proposed in 2016: [5] used the GTA5 game world with the purpose of obtaining markup from screenshots of the game. The main idea was to reuse an existing virtual 3-D world. However, the limitations of the game did not allow obtaining the complete markup necessary to solve the autonomous driving problem.

At the same time, in 2016, using the same idea of a virtual world, the SYNTHIA dataset was generated to assist in semantic segmentation and related scene understanding problems in the context of driving scenarios [6]. The authors changed the approach slightly and created their virtual world using the Unity development platform, building virtual cities based on real city prototypes [6]. One of the main advantages was the ability to add natural events such as time of day, rain, snow, and fog. These methods are automatic from the dataset generation point of view but challenging at the design stage of the virtual world.

Another remarkable example of using synthetic data is the task of recognizing the direction of a person's gaze. In 2016, [7] presented a method for generating eyes that takes into account the biological features of the eyeball as well as of the skin around it; this work paid great attention to the light characteristics of the eye surface. It used the same idea of generating 3-D models of objects but took their physical characteristics into account for higher realism.

Often there are also problems in which the original dataset contains data of a different nature, which naturally gives rise to the tasks of domain adaptation [19] and style transfer [20]. In 2018, [20] reviewed methods of transferring style from one image to another using neural networks. The ideas were based on the principle that neural networks highlight style features. The first articles on this topic used features obtained by VGG networks [13], as well as the principles of auto-encoders [17]. Style transfer was carried out via intermediate outputs of neural networks, as well as via various ways of constructing the loss function. In 2017, Judy Hoffman introduced a domain adaptation method called CyCADA [21].
Its essence was a complex architecture consisting of two generators, two discriminators, and four auxiliary decision networks. The method showed good results; however, training required labelled semantic segmentation data [22, 23].

Recently, a large number of approaches, methods, and architectures have been developed to solve this and similar problems. However, analyzing the work in this area, we can say that insufficient attention has been paid to generating synthetic data precisely from the statistical methods point of view.

3 Research Hypothesis and Problem

The main problem considered in this paper is the difficulty of generating high-quality synthetic data for further use in deep learning models for image processing. The central objective is therefore to identify the hidden differences between real and synthetic data that stand in the way of their high-quality generation. We highlight the related objectives:

─ Confirming the hypothesis that there is a statistically significant difference between the distributions of real and synthetic data
─ Building a pipeline for image conversion
─ Selecting a quality criterion for assessing the generated data

The objects of the study are four primary datasets: real photos collected from dash cameras [8], generated pictures of transport routes from SYNTHIA [9], real photos of dogs [10], and images of dogs generated using GANs [11, 12]. We assume that identifying the distinctive features in the distributions of real and synthetic data will help avoid the difficulty of transferring a machine learning model between them.

The formal statement of the problem (a minimal sketch of the first step follows the list):

1. Conversion of images and their transformation into vector space using neural network methods
2. Construction of a hidden space and of two mappings: from images to the hidden space and vice versa
3. Analysis of the distributions in the new hidden space and their investigation using statistical methods
4. Transformation of the data in the hidden space to minimize the differences
5. Mapping the modified synthetic data back into the image space
6. Selection of a formal criterion for assessing the quality of artificially generated data, so that computer vision models whose training dataset contains synthetics show high quality on test and validation samples of real data
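To make the first step concrete, here is a minimal sketch, assuming PyTorch and torchvision are used; the feature extractor planned in Sect. 4.2 (VGG16 with batch normalization, pre-trained on ImageNet) corresponds to torchvision's vgg16_bn model, and all function and variable names are illustrative rather than part of the pipeline itself.

```python
# A minimal sketch of step 1: mapping images to feature vectors with a
# pre-trained VGG16-BN network (torchvision), following the preprocessing
# described in Sect. 4.2 (224x224 images, three color channels).
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# newer torchvision versions use weights="IMAGENET1K_V1" instead of pretrained
vgg = models.vgg16_bn(pretrained=True).eval()

def extract_features(image_paths):
    """Map a list of image files to 4096-dimensional feature vectors."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                         for p in image_paths])
    with torch.no_grad():
        x = vgg.avgpool(vgg.features(batch)).flatten(1)
        for layer in vgg.classifier[:-1]:  # drop the final 1000-way layer
            x = layer(x)
    return x  # shape: (len(image_paths), 4096)
```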
4 Envisioned Approach

4.1 Dataset Collecting

For the experiments, we selected data according to two criteria: the relevance of the task for which the data could be used, and the simplicity of the objects for human perception. The principal requirement was the existence of a pair (real data, synthetic data), since generating large amounts of synthetic data from scratch is a costly and time-consuming process.

By the first criterion, we selected the SYNTHIA dataset¹. SYNTHIA is a dataset generated to aid semantic segmentation and related scene understanding problems in the context of driving scenarios. It consists of a collection of photo-realistic frames rendered from a virtual city and comes with precise pixel-level semantic annotations for 13 classes: miscellaneous, sky, building, road, sidewalk, fence, vegetation, pole, car, sign, pedestrian, cyclist, and lane marking [9]. The self-driving task requires maximum accuracy and therefore large high-quality datasets, which is consistent with the relevance of our work. As the real counterpart, we chose the Berkeley DeepDrive dataset², on which three complex tasks of the CVPR 2018 Autonomous Driving Workshop were based: detection of road objects, segmentation of the drivable region, and adaptation of semantic segmentation domains [8].

By the second criterion, we took datasets of dog images, because dogs are easy for human perception but challenging to formalize for a computer. It follows that their distribution is complex, which is a vital aspect for our study. As the real dataset, we selected the Stanford Dogs Dataset³ [10], which contains images of 120 dog breeds from around the world and was created using images and annotations from ImageNet⁴ for the task of fine-grained image categorization. As its synthetic analogue, we chose images of dogs generated with the GAN method [12] from the Kaggle Generative Dog Images competition⁵ [11], which contains 10 000 examples of synthetically generated dogs without markup.

¹ The SYNTHIA dataset (https://synthia-dataset.net/) is provided by the Computer Vision Center, Barcelona, and may be used for non-commercial purposes only, subject to the CC BY-NC-SA 3.0 license (http://creativecommons.org/licenses/by-nc-sa/3.0/legalcode)
² The Berkeley DeepDrive dataset is freely available for download and use at https://bdd-data.berkeley.edu/
³ The open-source Stanford Dogs Dataset is freely available for download and use at http://vision.stanford.edu/aditya86/StanfordDogs/
⁴ ImageNet data are freely available for non-commercial research and/or educational use at http://image-net.org/download-images
⁵ The open-source Kaggle Generative Dog Images dataset is freely available for download and use at https://www.kaggle.com/c/generative-dog-images/data

4.2 Problem Solution

Before the experiments start, we convert all data to a single format: 224×224 images with three color channels [13].

Our hypothesis is that the distributions of real and synthetic data differ in a statistically significant way. For humans, the difference between synthetic images and real data is intuitive but, like many similar judgments, hard to formalize. Our approach attempts to formalize these differences.

The approach chosen for the first iteration of the experiment uses trained neural networks to extract image information in vector form. We will use the VGG16 network [14] with batch normalization [15], trained on ImageNet, as the feature extractor. Using statistical tests such as Student's t-test [16] and the Kullback-Leibler divergence (relative entropy) [24], we can confirm our assumption about the distinguishability of synthetic and real data with a certain level of confidence. To evaluate how suitable the synthetic data are for modeling, we will use the approach proposed in [25], which is grounded in a particular application of synthetic data generation.

The next step is to train a variational auto-encoder [17] on real and synthetic data, thereby constructing a hidden space and two mappings: from images to the hidden space and vice versa. We will analyze and compare the basic statistical characteristics of real and synthetic data in the hidden space and use simple mathematical operations to bring the statistical characteristics of the synthetic data closer to the real ones (one possible instantiation is sketched below). We then pass the transformed hidden representations of the synthetic data through the decoder and expect to obtain images close to real ones at the output.
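The following is a minimal sketch of the statistical comparison and of the latent alignment step, assuming the extracted features and VAE latent codes are available as NumPy arrays of shape (n_samples, dim); per-dimension t-tests, a histogram-based KL estimate, and moment matching are only one possible instantiation of the operations named above, and all names are illustrative.

```python
# Minimal sketches of the planned statistical comparison (Student's t-test,
# Kullback-Leibler divergence) and of one "simple mathematical operation"
# for aligning synthetic latents with real ones: per-dimension moment matching.
import numpy as np
from scipy import stats

def fraction_significant(x_real, x_synth, alpha=0.01):
    """Share of dimensions where a two-sample t-test rejects equal means."""
    _, p = stats.ttest_ind(x_real, x_synth, axis=0, equal_var=False)
    return float(np.mean(p < alpha))

def kl_divergence_1d(x_real, x_synth, bins=50, eps=1e-10):
    """Histogram-based estimate of KL(real || synthetic) for one dimension."""
    lo = min(x_real.min(), x_synth.min())
    hi = max(x_real.max(), x_synth.max())
    p, _ = np.histogram(x_real, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(x_synth, bins=bins, range=(lo, hi), density=True)
    return float(stats.entropy(p + eps, q + eps))

def match_moments(z_synth, z_real, eps=1e-8):
    """Shift/rescale synthetic latents to match real per-dimension moments."""
    mu_s, sd_s = z_synth.mean(axis=0), z_synth.std(axis=0)
    mu_r, sd_r = z_real.mean(axis=0), z_real.std(axis=0)
    return (z_synth - mu_s) / (sd_s + eps) * sd_r + mu_r

# The aligned latents are then passed through the VAE decoder to obtain
# synthetic images that should lie closer to the real data distribution.
```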
4.3 Hypothesis Verification

Two experiments can serve as verification of our hypothesis. First, we can pass the generated data once more through the trained VGG16 with batch normalization [14]; the measure of quality is then a statistically insignificant difference between the data distributions.

As a second experiment, we pass the transformed synthetic data and the initial real data through a simple neural network that solves a binary classification problem, i.e., determines the nature of the image. After that, we use a validation dataset to predict the binary classification label. If the neural network cannot accurately predict the correct label, the conversion quality of the synthetic data can be considered high. We assume that the network cannot distinguish the class labels if the ROC-AUC [18] value is about 0.5 on the validation dataset, as in the sketch below.
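As one concrete instantiation of this second experiment, the sketch below uses a logistic regression over extracted feature vectors in place of the simple neural network; any small classifier would serve, and all names are illustrative.

```python
# A minimal sketch of the real-vs-synthetic verification test, assuming both
# data sources are already encoded as feature vectors (NumPy arrays).
# A logistic regression stands in for the "simple neural network".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def real_vs_synthetic_auc(feats_real, feats_synth, seed=0):
    """ROC-AUC of a classifier trying to tell real from synthetic features."""
    X = np.vstack([feats_real, feats_synth])
    y = np.concatenate([np.ones(len(feats_real)), np.zeros(len(feats_synth))])
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])

# A value close to 0.5 indicates that the classifier cannot separate the two
# sources, which we treat as evidence of high conversion quality.
```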
5 Research Methodology and Plan

5.1 Dataset Preparation

Data volume is a critical parameter for statistical methods. In our case, we operate on four central datasets; however, for further work it may be necessary to extend them to obtain greater representativeness. We therefore plan to finish the data preparation for the experiments by the middle of October.

5.2 Hypothesis Confirmation

We allocate one month of work to test the central hypothesis about the statistical difference between the distributions of synthetic and real data. Since trained machine learning models can highlight non-representative features, statistical tests may give mixed results in the first approximations. As a result, it may be necessary to adjust the design of the experiment or to move to more stringent statistical tests. Since this is the fundamental hypothesis, we plan to finish this stage by the middle of November.

5.3 Building a Pipeline for Image Conversion

Training deep neural networks is a labor-intensive process, so we allot the time until the middle of December for the second stage of the experiment. An additional complexity of this stage is that difficulties may arise in processing the hidden data representations.

5.4 Result Evaluation

The mechanisms for validation will be partially implemented in the first part of the experiment. We allocate two weeks for training the simple models required for validation and therefore plan to draw conclusions on the described experiments by the end of December.

6 Conclusion

Lack of data is the cornerstone of a large number of computer vision tasks, and synthetic data can be a solution to this problem. The use of classical methods of statistical analysis in conjunction with new neural network methods can give a much deeper understanding of the data and lead to plans for the efficient generation of synthetic data.

References

1. Marcus, G.: Deep learning: a critical appraisal. arXiv preprint, arXiv:1801.00631 (2018)
2. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 213–226. Springer, Berlin, Heidelberg (2010). doi: 10.1007/978-3-642-15561-1_16
3. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1521–1528. IEEE Press, New York (2011). doi: 10.1109/CVPR.2011.5995347
4. Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion: maximizing for domain invariance. arXiv preprint, arXiv:1412.3474 (2014)
5. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: ground truth from computer games. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 102–118. Springer, Cham (2016). doi: 10.1007/978-3-319-46475-6_7
6. Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A.M.: The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3234–3243. IEEE Press, New York (2016). doi: 10.1109/CVPR.2016.352
7. Wood, E., Baltrusaitis, T., Morency, L.-P., Robinson, P., Bulling, A.: Learning an appearance-based gaze estimator from one million synthesised images. In: 9th Biennial ACM Symposium on Eye Tracking Research & Applications, pp. 131–138. ACM (2016). doi: 10.1145/2857491.2857492
8. Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., Darrell, T.: BDD100K: a diverse driving video database with scalable annotation tooling. arXiv preprint, arXiv:1805.04687 (2018)
9. SYNTHIA home page. http://synthia-dataset.net/
10. Stanford Dogs Dataset page. http://vision.stanford.edu/aditya86/StanfordDogs/
11. Kaggle Generative Dog Images competition, BigGAN submission. https://www.kaggle.com/dvorobiev/doggies-biggan-sub-final
12. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc. (2014)
13. VGG16 home page. https://keras.io/applications/#vgg16
14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint, arXiv:1409.1556 (2015)
15. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint, arXiv:1502.03167 (2015)
16. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge (1992)
17. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint, arXiv:1312.6114 (2014)
18. Understanding AUC-ROC curve. https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
19. Su, J.-C., Tsai, Y.-H., Sohn, K., Liu, B., Maji, S., Chandraker, M.: Active adversarial domain adaptation. arXiv preprint, arXiv:1904.07848 (2019)
20. Li, H.: A literature review of neural style transfer. https://www.cs.princeton.edu/LiteratureReview/COSBspr/NealStyleTransfer.pdf
21. Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A.A., Darrell, T.: CyCADA: cycle-consistent adversarial domain adaptation. arXiv preprint, arXiv:1711.03213 (2017)
22. Karacan, L., Akata, Z., Erdem, A., Erdem, E.: Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint, arXiv:1612.00215 (2016)
23. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 640–651 (2017). doi: 10.1109/TPAMI.2016.2572683
24. Kullback, S., Leibler, R.A.: On information and sufficiency. Annals of Mathematical Statistics 22(1), 79–86 (1951). doi: 10.1214/aoms/1177729694
25. Jordon, J., Yoon, J., van der Schaar, M.: Measuring the quality of synthetic data for use in competitions. arXiv preprint, arXiv:1806.11345 (2018)