Variable photorealistic image synthesis for training dataset generation

V.V. Sanzharov 1, V.A. Frolov 2,3, A.G. Voloboy 3
vs@asugubkin.ru | vladimir.frolov@graphics.cs.msu.ru | voloboy@gin.keldysh.ru
1 Gubkin Russian State University of Oil and Gas, Moscow, Russia
2 Moscow State University, Moscow, Russia
3 Keldysh Institute of Applied Mathematics, Moscow, Russia

    Photorealistic rendering systems have recently found new applications in artificial intelligence, specifically in computer vision, for generating image and video datasets. The key difficulty in this application is producing a large number of photorealistic images with high variability of 3d models and their appearance. In this work, we propose an approach that combines existing procedural texture generation techniques with domain randomization to generate a large number of highly varied digital assets during the rendering process. This eliminates the need for a large pre-existing database of digital assets (only a small set of 3d models is required) and produces objects with unique appearance at the rendering stage, reducing the required post-processing of images and the storage requirements. Our approach uses procedural texturing and material substitution to rapidly produce a large number of variations of digital assets. The proposed solution can be used to produce training datasets for artificial intelligence applications and can be combined with most state-of-the-art scene generation methods.
    Keywords: photorealistic rendering, procedural generation, synthetic datasets, computer vision.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

    There are two main challenges in the training of artificial intelligence models: data quantity and data quality. Data quantity concerns the availability of sufficient amounts of training and testing data. Training modern computer vision algorithms requires image datasets of significant volume: tens and hundreds of thousands of images for training on static images and an order of magnitude more for video [1, 2]. By data quantity we also mean how balanced the data is, i.e. whether all the classes which the model must recognize are represented well enough. This can be a significant problem because certain classes can be very rare in data obtained from the real world [3]. Data quality can refer to many different characteristics, but the one that is especially important for images is accurate markup. For example, if a model needs to detect certain objects in an image, then these objects must be accurately annotated in the training data. This is usually done manually or semi-automatically with the help of segmentation tools [4, 5]. Annotating a large image dataset manually is extremely expensive, and manual marking often does not have the necessary accuracy (automated markup options also suffer from insufficient accuracy and have their own disadvantages).
    Using synthetic data (in this case, photorealistic images produced by rendering 3d scenes) can solve these problems. The data quantity problem can be addressed by using algorithms for procedurally setting the optical properties of materials and surfaces (displacement maps); this way it is possible to quickly generate an almost unlimited number of training examples with any distribution of objects (and therefore classes) present in the generated images. In addition, one can create training examples which are scarce or almost non-existent in "real-life" datasets: for example, emergency situations on the road or in production, military operations, or objects that only exist as design projects or prototypes. Second, it is possible to produce pixel-perfect image annotation together with the rendered image.
    But there are also several drawbacks to synthetic data generation. The main problem is producing 3d scene setups suitable for rendering an adequate image dataset. The first part of this problem is to generate a 3d scene layout (a meaningful placement of objects: 3d models and light sources) and to choose which 3d models to include in the generated scene. The second part is setting the optical properties of the material models of the objects in the scene so that they mimic real-life objects. A similar problem arises in digital content creation in the visual effects and video game industries, where several variations of the same digital asset (3d models, material models, textures, etc.) are created by 3d artists using software tools. However, for large dataset generation it is not feasible to produce variations of 3d assets manually.
    There is also a drawback associated with the time and computational resources required to render a dataset of thousands of images.
    In this work, we propose an approach aimed at alleviating the latter problem by using procedural texturing and material substitution to produce a large number of variations from a small set of base digital assets.

2. Related Work

    The drawback associated with computational resources can be addressed by using fast and simple rasterization-based rendering solutions (usually OpenGL-based) [6], possibly in tandem with a global illumination approximation such as ambient occlusion [7].
    Of course, rendering thousands of images or sequences requires significant computational resources. But while in simulators developed for training people a schematic image of a three-dimensional scene is in many cases enough for a person (although real-time visualization must be provided), modern AI systems based on deep neural networks are trained according to different principles. It is important for AI systems to accurately model the data on which training is supposed to be carried out (but there is no real-time requirement). For synthetic data to closely approximate real-life datasets, it should simulate reality, i.e. the image should be photorealistic. Otherwise, there is no guarantee that the AI will "work" on similar examples in reality, since it is currently practically impossible to understand the causes of a failure in a multilayer neural network [8-10]. So photorealistic rendering, which is usually based on the path tracing algorithm or one of its many variants, needs to be used. Works [11-13] demonstrate the advantages of using photorealistic rendering for synthetic dataset generation. Because the computational cost of physically correct rendering is still quite high, while rendering speed and the scaling of the training dataset generation system as a whole are important, solutions relying on photorealistic rendering are at a disadvantage in this regard. This is, however, alleviated by the recent advent of publicly available hardware-accelerated ray tracing, which can provide significant speedups for photorealistic rendering [14], as well as by denoisers [15].
    Several approaches exist to automate the creation of 3d scenes for photorealistic image dataset generation. In [16-19] the authors use Augmented Reality (AR) based techniques to insert synthetic objects into photos. This approach requires a way to choose the positions of the inserted objects: randomly with some distribution, using existing image annotation, or with additional reconstruction tools.
    In [8] the 3d scene is generated by a set of rules which use randomized parameters to select some 3d models from a database and to procedurally generate others. A similar approach, called Domain Randomization (DR), is used in [20-21]. Domain randomization implies selecting parameters (aspects of the domain) which are randomized for each generated sample. Such parameters may include camera position and field of view, number of objects, number of lights, textures used for objects, etc.
    In [22] physical simulation is used to achieve realistic placement of 3d models on a surface.
    There are also works that use a variety of approaches to scene description and generation, such as domain-specific languages [23], scene graphs [24], and stochastic grammars [25].
    Finally, there are solutions that can generate a whole synthetic dataset similar to a specified real-world dataset [26].
    These works mainly focus on composing realistic 3d scenes from existing digital assets: 3d models, textures, materials, etc. While in some cases [8] the digital assets themselves are randomized, this is done in a very limited manner; usually only the base material color is changed. Because of this, such approaches require large databases of digital assets to produce images with high variability of objects.
    One of the methods to further increase the variety and realism of synthesized images and to match them more closely to real-life datasets is domain adaptation [27-28]. However, such techniques require an additional image processing stage which takes significant time and computational power, especially for images of relatively high resolution.
    In this work, we propose an approach based on combining existing procedural texture generation techniques and domain randomization to generate a large number of highly varied digital assets during the rendering process. The proposed solution can be combined with most of the reviewed methods of scene generation.

3. Proposed solution

    The motivation behind our solution is to produce many variants of the same digital asset (in particular, a 3d model with assigned materials and textures) in order to minimize the amount of manual and expensive work done by 3d artists. To achieve this, we propose the following generation pipeline (fig. 1).
                                  Fig. 1. Architecture of the proposed image generation pipeline

    The input scenario specifies settings for the whole pipeline: what kind of scenes are to be generated (which classes of objects to include; lighting type: indoor or outdoor, day or night, etc.), which AOVs (arbitrary output values) should be output by the rendering system (instance segmentation masks, binary object masks, normals, depth, etc.), what image post-processing (if any) is to be done after rendering, and the randomization domain, i.e. which parameters should be randomized and with what distribution (material model parameters, procedural textures and effects, object class distributions, object placement, and so on).
    Cloud storage or a database contains the base digital assets:
1. 3d models with material markup, i.e. information about which parts of the model have or can have different material types.
2. Materials: base material types representing common BRDF blends, such as purely diffuse materials (e.g. brushed wood or rubber), reflective materials (e.g. polished metals), diffuse + reflective materials (e.g. plastics or brushed metals), reflective + refractive materials (e.g. glass), diffuse with two reflective layers (e.g. car paint with a coating), and so on.
3. Textures: a collection of image textures and normal maps to be used in materials.
4. Environment maps: HDR spherical panorama images used for image-based lighting, representing a variety of lighting conditions.
5. Content metadata: information used by the domain randomization tools to select fitting digital assets from the storage according to the input scenario (a sketch of a possible layout follows this list). This includes:
   - the correspondence of classes to 3d models (for example, which 3d models are models of cars, chairs or humans),
   - the correspondence of material classes to materials in the storage (for example, that stained glass and clear non-refractive glass are both of type "glass" and therefore can be assigned to a 3d model part marked as "glass"),
   - the correspondence of textures to material parameters (which textures can be used for which material parameters),
   - information about the HDR images (what lighting conditions a particular image represents), and so on.
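    The exact metadata format is not prescribed by the pipeline; the following is a minimal sketch of what such content metadata could look like as a Python dictionary (all asset names are hypothetical, given for illustration only):

    # Hypothetical content metadata layout; asset names are illustrative.
    CONTENT_METADATA = {
        # which 3d models represent which object class
        "class_to_models": {
            "car": ["sedan_01.obj", "hatchback_02.obj"],
            "chair": ["office_chair.obj", "stool.obj"],
        },
        # which stored materials belong to which material class
        "material_classes": {
            "glass": ["clear_glass", "stained_glass"],
            "metal": ["polished_steel", "brushed_aluminium"],
            "wood": ["oak_diffuse", "pine_diffuse"],
        },
        # which textures may drive which material parameters
        "texture_to_parameters": {
            "rust_mask": ["blend_mask", "normal_map"],
            "wood_albedo": ["diffuse_color"],
        },
        # lighting conditions captured by each HDR environment map
        "hdr_lighting": {
            "studio_01.hdr": {"type": "indoor", "time": "day"},
            "street_05.hdr": {"type": "outdoor", "time": "night"},
        },
    }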
    The domain randomization tools produce scene descriptions from the input scenario. This stage can query the digital asset storage and, using the content metadata, randomly or deterministically (depending on the input scenario) select appropriate digital assets and generate the requested number of scene descriptions. The generated scene description is intended to be consumed by the rendering system directly.
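    To make this step concrete, below is a minimal sketch of how such a tool could sample one scene description; all function and field names are hypothetical, and a real tool would also randomize camera, light and object placement, as described in section 2:

    import random

    def sample_scene(scenario, metadata, rng=None):
        """Sample one scene description from an input scenario.

        `scenario` lists the desired object classes, their allowed
        material classes and the lighting type; `metadata` is the
        content metadata sketched above. Hypothetical field names.
        """
        rng = rng or random.Random()
        objects = []
        for obj_class in scenario["object_classes"]:
            model = rng.choice(metadata["class_to_models"][obj_class])
            # pick an allowed material class, then a concrete material
            mat_class = rng.choice(scenario["allowed_materials"][obj_class])
            material = rng.choice(metadata["material_classes"][mat_class])
            objects.append({
                "model": model,
                "material": material,
                # noise-driven procedural effect with randomized parameters
                "procedural": {
                    "effect": rng.choice(["rust", "dirt", "scratches"]),
                    "frequency": rng.uniform(0.5, 4.0),
                    "amplitude": rng.uniform(0.1, 1.0),
                    "persistence": rng.uniform(0.3, 0.7),
                },
            })
        env_maps = [name for name, tags in metadata["hdr_lighting"].items()
                    if tags["type"] == scenario["lighting"]]
        return {"objects": objects, "env_map": rng.choice(env_maps)}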
    As the photorealistic rendering system in our work we used the open-source Hydra Renderer [29], which uses an .xml scene description. The scene description also specifies which procedural effects should be used and what their input parameters are (if any). Hydra Renderer supports user extensions for procedural textures [30], and the use of this functionality is one of the key elements of our solution.
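    As an illustration, a sampled scene description like the one above can be serialized to XML with the Python standard library. Note that the element and attribute names below are purely illustrative; they do not reproduce the actual Hydra Renderer scene schema:

    import xml.etree.ElementTree as ET

    def scene_to_xml(scene, path):
        """Write a sampled scene dict to an XML file.

        Element names are illustrative only, not the real Hydra schema.
        """
        root = ET.Element("scene")
        ET.SubElement(root, "environment", map=scene["env_map"])
        for i, obj in enumerate(scene["objects"]):
            node = ET.SubElement(root, "instance", id=str(i),
                                 mesh=obj["model"])
            mat = ET.SubElement(node, "material", name=obj["material"])
            proc = obj["procedural"]
            ET.SubElement(mat, "proc_texture", effect=proc["effect"],
                          frequency=str(proc["frequency"]),
                          amplitude=str(proc["amplitude"]),
                          persistence=str(proc["persistence"]))
        ET.ElementTree(root).write(path)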
    There are several properties of procedural textures that make them a vital element of our generation pipeline:
1. Procedural textures can be parametrized with arbitrary values, and therefore it is possible to generate a large number of variations of the same texture.
2. It is possible to apply a texture to geometry without uv-unwrapping if the texture is parametrized by, for example, world-space or object-space position. This relaxes the requirements on 3d models and eliminates the predominantly manual work of uv-mapping them.
3. Finally, procedural textures do not have a fixed resolution (the resulting texture is infinite and has no seams), which makes it possible to produce detailed high-quality materials suitable for application to a variety of 3d models of different scales.
    As a part of this work we developed several procedural textures which allowed us to greatly increase the variation of 3d objects and also to increase the realism of their appearance.
    The goal of the image post-processing tools is to adjust the images output by the rendering system or to produce additional data about these images. The tasks performed at this stage can involve:
1) measuring 2d bounding boxes for objects/instances (see the sketch after this list);
2) applying a variety of image-space effects to further increase the variety of the output images or to better match them to real-life datasets, for example:
   - chromatic aberrations,
   - barrel distortion,
   - blur,
   - transformations and warping of the image, including resampling for the purpose of anti-aliasing,
   - noise,
   - and others;
3) cutting objects out of the rendered image;
4) composing rendered objects with other images (as the Augmented Reality based solutions mentioned earlier do);
5) format conversions;
6) and others.
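    Task 1 reduces to a few lines over the instance segmentation mask that the renderer already outputs as an AOV. A minimal sketch (the mask encoding is our assumption: one integer id per instance, 0 for background):

    import numpy as np

    def instance_bounding_boxes(instance_mask):
        """Compute 2d bounding boxes from an instance segmentation mask.

        `instance_mask` is an (H, W) integer array where 0 is background
        and each positive id marks one object instance (assumed encoding).
        Returns {id: (x_min, y_min, x_max, y_max)} in pixel coordinates.
        """
        boxes = {}
        for inst_id in np.unique(instance_mask):
            if inst_id == 0:
                continue  # skip background
            ys, xs = np.nonzero(instance_mask == inst_id)
            boxes[int(inst_id)] = (int(xs.min()), int(ys.min()),
                                   int(xs.max()), int(ys.max()))
        return boxes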
    It is worth noting that all the listed tasks can be performed using simple python scripts or open-source compositing software like Natron [31, 32] and do not need complex and computation-intensive processing with neural networks.
    In the described image generation pipeline architecture, the domain randomization tools stage can be replaced by any other of the reviewed approaches to scene generation: a scene graph produced by neural-network processing of existing datasets, stochastic grammars, or markup data from an existing dataset for employing augmented reality techniques. It can also be replaced by a custom scene generation solution, for example one that places objects inside an existing scene with respect to its depth buffer.

4. Object appearance variation techniques

    Procedural textures

    As we mentioned before, one of the key parts of our work is the use of procedural textures. In this section we describe the procedural textures developed for use in our generation pipeline.
    The first problem we were trying to solve with procedural textures is providing additional details on rendered 3d models to produce more realistic images, in contrast to the overly crisp and clean look of rendered objects. For this purpose we implemented several procedural textures simulating effects such as dirt, rust and scratches on material textures (fig. 2-5).
Fig. 2. Rust procedural texture variations on different models

Fig. 3. Dirt procedural texture variations on different models

Fig. 4. Scratches procedural texture variations (also affects the normal map)

Fig. 5. Rust and dirt procedural textures applied to road sign models' normal map

    All these textures were parametrized in a way that allows the domain randomization tools to significantly vary the appearance of the texture by passing different (and possibly random) values for these parameters. Since the implementation of these procedural effects is predominantly based on noise functions, most of the parameters correspond to noise parameters such as amplitude, frequency and persistence. Another common parameter is the relative height (or other dimension) of an object up to which the effect reaches. This allows us to dynamically control how far the effect spreads over a 3d model, which is impossible with ordinary textures.
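    As an illustration of this parametrization, here is a minimal sketch of a noise-driven effect mask in the spirit of the textures above. The fractal accumulation of octaves with amplitude, frequency and persistence is the standard fBm construction; the noise3 stand-in below is our assumption, and any smooth 3d noise primitive could take its place:

    import numpy as np

    def noise3(p):
        """Cheap stand-in for a smooth 3d noise primitive in [0, 1]
        (assumption: any value/Perlin noise would do in its place)."""
        x, y, z = p[..., 0], p[..., 1], p[..., 2]
        return 0.5 + 0.5 * np.sin(x * 12.9898 + y * 78.233 + z * 37.719)

    def effect_mask(points, heights, amplitude, frequency, persistence,
                    max_height, octaves=4):
        """Noise-driven blend mask for a dirt/rust-like effect.

        `points` are object-space positions (N, 3), so no uv-unwrap is
        needed; `heights` are relative heights in [0, 1] used to limit
        how far the effect climbs up the model, as described above.
        """
        mask = np.zeros(len(points))
        amp, freq = amplitude, frequency
        for _ in range(octaves):       # fractal (fBm) accumulation
            mask += amp * noise3(points * freq)
            amp *= persistence         # fade higher octaves
            freq *= 2.0                # double the detail each octave
        # cut the effect off above the given relative height
        mask *= np.clip(1.0 - heights / max_height, 0.0, 1.0)
        return np.clip(mask, 0.0, 1.0)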
    The developed procedural textures can affect not only colors or the blending masks between different materials, but also normal maps (fig. 6), and they can be used as displacement maps to slightly deform the object (fig. 7). Changing the geometry in this way also changes its silhouette and therefore the segmentation masks.

Fig. 6. Procedural displacement. Top: without displacement; bottom: with mild displacement, warped regions highlighted

Fig. 7. Material substitution example

    Material substitution

    In addition to procedural textures, another technique for increasing the variability of 3d models implemented in the proposed solution is material substitution. In the proposed data generation pipeline, the digital content storage contains materials, while 3d models are marked up with material types. This allows specifying a collection of materials, whether manually created, pre-generated or imported from one of the existing open libraries. These materials can then be classified into several categories such as "wood", "metal", "car paint", "plastic", and so on. During the scene generation phase, the domain randomization tools can allow 3d models to use random materials within the classes specified as possible for a given model. For example, chairs can have materials from the "wood" or "metal" classes, while "glass" type materials are unlikely to be assigned to a chair model.
    In the reviewed existing solutions this technique is mostly used in a very rudimentary form: only the color is changed, not the material (i.e. the BRDF or BRDF blend) itself.
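    A minimal sketch of such class-constrained material substitution under the assumptions above (material and part names are hypothetical, mirroring the chair example):

    import random

    def substitute_materials(model_parts, material_classes, rng=None):
        """Assign a random concrete material to each marked-up model part.

        `model_parts` maps part name -> material type from the model
        markup; `material_classes` maps a type to the concrete materials
        of that class available in the storage.
        """
        rng = rng or random.Random()
        return {part: rng.choice(material_classes[mat_type])
                for part, mat_type in model_parts.items()}

    # usage on the chair example from the text (names are hypothetical)
    classes = {"wood": ["oak_diffuse", "pine_diffuse"],
               "metal": ["polished_steel", "brushed_aluminium"]}
    print(substitute_materials({"seat": "wood", "legs": "metal"}, classes))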
5. Results and conclusion

    The proposed image dataset generation pipeline architecture can create many variations of the same 3d model using procedural texturing and material substitution techniques. Other 3d scene parameters which are commonly varied in existing solutions, such as lighting (HDR spherical maps for image-based lighting, point and area lights), can also be utilized in the proposed pipeline. This allows using a much smaller pre-existing digital asset collection while producing output images of high variability.
    Let us consider the case of a simple scene with only one 3d model in it. In existing solutions the appearance of a single 3d model is most commonly varied by a single parameter, the base color. The proposed solution additionally allows specifying several procedural textures, each with at least 3 parameters (related to noise functions), which significantly alter the appearance of objects (fig. 2-5). So for a single 3d model with a single material, each procedural texture introduces another 3 dimensions of appearance variability on top of the single "color" dimension of existing solutions. This allows our solution to produce exponentially more variations of a single 3d model with a single material. With material substitution we can additionally alter the material types suitable for the particular 3d model.
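    As a rough illustration of this growth (the discretization below is our assumption, not a measurement): if every parameter is discretized into 10 visually distinguishable levels, a color-only approach yields about 10 variants of a model, while k procedural textures with 3 noise parameters each yield on the order of 10^(3k) combinations, i.e. already a million distinct appearances for k = 2, before material substitution is even applied.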
    Existing works on synthetic image dataset generation are mostly concerned with scene generation and rely on having large collections of digital assets from which to construct these scenes. While there exist large 3d model collections such as ShapeNet [33], their content is usually of poor quality compared to assets created by experienced 3d artists. With the proposed approach of using procedural textures and material substitution, it is possible to produce many variations from a small set of high-quality models that portray real-life objects more accurately.
    At the same time, the proposed solution does not exclude lower-quality models and is able to deal with 3d models without texture coordinates thanks to the procedural texturing techniques used.
    Finally, the proposed image generation pipeline can be integrated with any of the reviewed solutions for scene generation.
6. References

[1] Karpathy, Andrej, et al. Large-scale video classification with convolutional neural networks // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[2] Wu, Zuxuan, et al. Deep learning for video classification and captioning // Frontiers of Multimedia Research. 2017. P. 3-29.
[3] Faizov B.V., Shakhuro V.I., Sanzharov V.V., Konushin A.S. Classification of rare traffic signs // Computer Optics, Vol. 44, No. 2, 2020 (in Russian).
[4] Moehrmann, Julia, and Gunther Heidemann. Efficient annotation of image data sets for computer vision applications // Proceedings of the 1st International Workshop on Visual Interfaces for Ground Truth Collection in Computer Vision Applications. 2012.
[5] Gao, Chao, Dongguo Zhou, and Yongcai Guo. Automatic iterative algorithm for image segmentation using a modified pulse-coupled neural network // Neurocomputing 119 (2013): 332-338.
[6] Su, Hao, et al. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3d model views // Proceedings of the IEEE International Conference on Computer Vision. 2015.
[7] Kirsanov, Pavel, et al. DISCOMAN: Dataset of Indoor Scenes for Odometry, Mapping And Navigation // arXiv preprint arXiv:1909.12146 (2019).
[8] Nguyen, Anh, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[9] Zhang, Chiyuan, et al. Understanding deep learning requires rethinking generalization // arXiv preprint arXiv:1611.03530 (2016).
[10] Montavon, Grégoire, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks // Digital Signal Processing 73 (2018): 1-15.
[11] Movshovitz-Attias, Yair, Takeo Kanade, and Yaser Sheikh. How useful is photo-realistic rendering for visual learning? // European Conference on Computer Vision. Springer, Cham, 2016.
[12] Tsirikoglou, Apostolia, et al. Procedural modeling and physically based rendering for synthetic data generation in automotive applications // arXiv preprint arXiv:1710.06270 (2017).
[13] Zhang, Yinda, et al. Physically-based rendering for indoor scene understanding using convolutional neural networks // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[14] Sanzharov V., Gorbonosov A., Frolov V., Voloboy A. Examination of the Nvidia RTX // CEUR Workshop Proceedings, vol. 2485 (2019), p. 7-12.
[15] Ershov S.V., Zhdanov D.D., Voloboy A.G., Galaktionov V.A. Two denoising algorithms for bi-directional Monte Carlo ray tracing // Mathematica Montisnigri, Vol. XLIII, 2018, p. 78-100. https://lppm3.ru/files/journal/XLIII/MathMontXLIII-Ershov.pdf
[16] Alhaija, Hassan Abu, et al. Augmented reality meets computer vision: Efficient data generation for urban driving scenes // International Journal of Computer Vision 126.9 (2018): 961-972.
[17] Dosovitskiy, Alexey, et al. FlowNet: Learning optical flow with convolutional networks // Proceedings of the IEEE International Conference on Computer Vision. 2015.
[18] Varol, Gul, et al. Learning from synthetic humans // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[19] Chen, Wenzheng, et al. Synthesizing training images for boosting human 3d pose estimation // 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016.
[20] Tobin, Josh, et al. Domain randomization for transferring deep neural networks from simulation to the real world // 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.
[21] Prakash, Aayush, et al. Structured domain randomization: Bridging the reality gap by context-aware synthetic data // 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.
[22] Mitash, Chaitanya, Kostas E. Bekris, and Abdeslam Boularias. A self-supervised learning system for object detection using physics simulation and multi-view pose estimation // 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.
[23] Fremont, Daniel J., et al. Scenic: a language for scenario specification and scene generation // Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. 2019.
[24] Armeni, Iro, et al. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera // Proceedings of the IEEE International Conference on Computer Vision. 2019.
[25] Jiang, Chenfanfu, et al. Configurable 3d scene synthesis and 2d image rendering with per-pixel ground truth using stochastic grammars // International Journal of Computer Vision 126.9 (2018): 920-941.
[26] Kar, Amlan, et al. Meta-Sim: Learning to generate synthetic datasets // Proceedings of the IEEE International Conference on Computer Vision. 2019.
[27] Hoffman, Judy, et al. CyCADA: Cycle-consistent adversarial domain adaptation // arXiv preprint arXiv:1711.03213 (2017).
[28] French, Geoffrey, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation // arXiv preprint arXiv:1706.05208 (2017).
[29] Ray Tracing Systems, Keldysh Institute of Applied Mathematics, Moscow State University. Hydra Renderer. Open source rendering system, 2019. https://github.com/Ray-Tracing-Systems/HydraAPI
[30] Sanzharov V.V., Frolov V.F. Level of Detail for Precomputed Procedural Textures // Programming and Computer Software, 2019, V. 45, Issue 4, pp. 187-195. DOI: 10.1134/S0361768819040078
[31] Natron, Open Source Compositing Software For VFX and Motion Graphics. https://natrongithub.github.io/
[32] Bondarev A.E. On visualization problems in a generalized computational experiment // Scientific Visualization 11.2 (2019): 156-162. DOI: 10.26583/sv.11.2.12. http://www.sv-journal.org/2019-2/12/
[33] Chang, Angel X., et al. ShapeNet: An information-rich 3d model repository // arXiv preprint arXiv:1512.03012 (2015).

About the authors

    Vadim Sanzharov, senior lecturer, Gubkin Russian State University of Oil and Gas. E-mail: vs@asugubkin.ru.
    Vladimir Frolov, PhD, senior researcher at Keldysh Institute of Applied Mathematics RAS, researcher at Moscow State University.
    Alexey Voloboy, D.Sc., leading researcher at Keldysh Institute of Applied Mathematics RAS.