<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Dataset Expansion by Generative Adversarial Networks for Detectors Quality Improvement</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Kostin</string-name>
          <email>akostin@gosniias.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vadim Gorbachev</string-name>
          <email>vadim.gorbachev@gosniias.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Federal State Unitary Enterprise «State Research Institute Of Aviation Systems» (GosNIIAS)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Modern neural network algorithms for object detection require large labelled datasets for training. In many practical applications, creating and annotating large data collections requires considerable resources that are not always available. One solution to this problem is the creation of artificial images containing the object of interest. In this work, the use of generative adversarial networks (GANs) for generating images of target objects is proposed. It is demonstrated experimentally that GANs can create new images, based on an initial collection of real images and a set of background images (not containing objects), which simulate real images accurately enough. This makes it possible to create a new training collection with a greater variety of training examples, which allows the detection algorithm to achieve higher precision. In our setting, GAN training does not require more data than direct detector training does. The proposed method has been tested by training a network for detecting unmanned aerial vehicles (UAVs).</p>
      </abstract>
      <kwd-group>
        <kwd>Object Detection</kwd>
        <kwd>GAN</kwd>
        <kwd>Domain Adaptation</kwd>
        <kwd>UAV</kwd>
        <kwd>Drone</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The majority of modern object detection systems and computer vision algorithms are
based on machine learning, primarily neural networks. They have proven their
reliability and quality in a wide range of tasks. The main disadvantage of such algorithms is
their requirement for large (or even very large) annotated training datasets. As a result,
a lack of such data is a common problem in applied tasks, for example, when
training a detector for a specific object that is not represented in large public annotated
data collections, or when the system must work in specific conditions.</p>
      <p>
        The development of a visual indoor UAV positioning system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is one such case. In the absence of data from satellite navigation systems, and given
the requirement to avoid additional radio wave sources, passive sensors such as
video cameras combined with a detection algorithm are extremely useful.
A massive training set is required to train a highly accurate detection network, while
the amount of labelled data is severely limited by the resources available for annotation. The
authors had at their disposal about 900 images of the drone taken from only 6
different angles. It will be shown below that this number is not sufficient to train
a robust detector. For comparison, the standard data collections for object detection
tasks include a huge number of images: for example, ImageNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] contains
more than 14 million images, and MS COCO [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] contains 328,000.
      </p>
<p>For tasks with little data it is standard practice to use augmentations, but in our case
they were not effective enough. In order to achieve high detector accuracy
with a very limited training set, we investigated the use of
generative adversarial networks (GANs) to create synthetic training
images. These images were created by drawing drones with a neural network in different areas of
the background. This approach enriches the dataset with
new object-background combinations, provides automatic annotations, and raises the
detector quality.</p>
      <p>
        Adversarial algorithms are a learning method in which two agents (a generator and a
discriminator) with opposing goals and corresponding loss functions are trained
jointly. The generator tries to draw an artificial image that the
discriminator cannot distinguish from a real one, while the discriminator tries to learn to
distinguish the imitation from real images. This method makes it possible to create a
generative network whose output simulates the distribution of the available data
accurately enough. Such algorithms show that it is possible to create highly realistic
artificial images. For example, it is possible to train a network to generate plausible
images from noise [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or a network that transforms the data domain with or without
supervision [
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
        ].
      </p>
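<p>The competition between the two agents can be written as the classical GAN minimax objective, where G is the generator, D is the discriminator, x are samples of real data and z is the noise input:</p>

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```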
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        The problem of a lack of data for neural network training is well known. There are various
approaches to expanding training samples without additional manual annotation. The main
approach is augmentation of the original collection of labelled images. Augmentation
consists of rotations, reflections, distortions of colour channels, image noise and so on
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
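<p>Two of the classical augmentations just listed can be sketched on an image stored as a nested list of pixel values; real pipelines use libraries such as torchvision or albumentations, but the underlying index transforms are the same:</p>

```python
def hflip(img):
    """Mirror each row (horizontal reflection)."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]

flipped = hflip(img)   # [[2, 1], [4, 3]]
rotated = rot90(img)   # [[3, 1], [4, 2]]
```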
      <p>
        Another approach is so-called transfer learning, i.e., training the neural
network on large available collections of similar data and then fine-tuning it directly on
the target data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The approach is generally accepted, but it does not completely solve the problem
of a lack of target data.
      </p>
      <p>
        One more possible approach is to add unlabelled data to the training sample and then
annotate it using the model being trained [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This yields pseudo-labels, which serve as the
annotation for the added data in the next epoch of training. However, the quality
of the bounding box regression task could not be significantly improved with this
approach.
      </p>
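<p>The pseudo-labelling loop described above can be sketched as follows; the confidence threshold and the prediction format are illustrative assumptions, not details from [9]:</p>

```python
# Hypothetical sketch of pseudo-labelling: predictions on unlabelled
# images are kept as training labels only when the model is confident.

def make_pseudo_labels(predictions, threshold=0.9):
    """predictions: list of (image_id, box, confidence) tuples.
    Returns the subset confident enough to be reused as labels."""
    return [(img, box) for img, box, conf in predictions if conf >= threshold]

preds = [
    ("img_001", (10, 20, 50, 60), 0.97),    # kept
    ("img_002", (5, 5, 40, 40), 0.55),      # discarded: low confidence
    ("img_003", (100, 80, 150, 140), 0.92), # kept
]
pseudo = make_pseudo_labels(preds)
# pseudo now holds labels for img_001 and img_003 only
```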
      <p>
        An alternative approach is to create and use in training synthetic images obtained by
rendering 3D models of objects [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. With this approach, object annotations are
generated automatically, and the volume of data obtained is theoretically unlimited. The
disadvantage of this method is that the synthesized images do not always simulate real
ones accurately enough, and neural networks are highly sensitive to fluctuations in the data
distribution. In the article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] this very approach was used to extend the training dataset
and train the detector. The data were synthesized with an existing 3D drone model.
The drone model was rendered in a 3D modelling system from different angles. Then
random transformations were applied to the image and its mask: rotation, scaling,
displacement, reflection, etc. After that, the image of the object, by its mask, was inserted
into an arbitrary background. This approach showed a good result when the drones in
the images were relatively large and contrasting, but it proved ineffective when switching
to higher-resolution images with smaller objects (near-realistic conditions).
      </p>
      <p>
        In the article [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] a GAN was used to create additional training samples. The key
feature of this approach is that an attempt was made to obtain feedback (in the form of
gradients during training) for the generator from the detector. The network was
arranged in such a way that the input of the generator was fed with a background image
in RGB format and a window: a rectangle of ones against a background
of zeros, which indicated the place where the generator should overlay the detectable
object. The entire image was created by the generator. In the application task under
consideration such an architecture did not work, as the detectable objects were much
smaller than the image itself. In contrast to DetectorGAN, in our approach images are
not generated entirely; only a small square containing the object to be detected is
generated, which is then inserted back into the high-resolution image.
      </p>
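<p>The window input described above (a rectangle of ones on a field of zeros marking the insertion site) can be built as follows; the dimensions are illustrative:</p>

```python
def window_mask(height, width, box):
    """Binary mask: ones inside box = (x0, y0, x1, y1), zeros elsewhere."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(width)]
            for y in range(height)]

mask = window_mask(4, 6, (1, 1, 3, 3))
# [[0, 0, 0, 0, 0, 0],
#  [0, 1, 1, 0, 0, 0],
#  [0, 1, 1, 0, 0, 0],
#  [0, 0, 0, 0, 0, 0]]
```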
    </sec>
    <sec id="sec-3">
      <title>Detection algorithm</title>
      <p>
        Neural network object detection algorithms can be divided into
single-stage [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and two-stage [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Detectors from the first group are faster, but in general
they are inferior in accuracy to detectors from the second group. Since the application
task under consideration required real-time processing of the image
stream from 6 cameras, it was decided to use a single-stage detector. RetinaNet [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] with
a PeleeNet [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] backbone was chosen as the detection network. As the specificity of the
task consists in detecting small objects, the network was modified accordingly.
The initial version of PeleeNet received a 304x304 resolution image as input and
detected objects at 5 different scales: the input image was divided into grids of sizes
from 1x1 to 19x19 cells. The 19x19 grid was not enough to provide sufficiently
dense coverage of the image with anchor boxes. Therefore, in this study the network
was modified to accept an image of 608x608 resolution at the input
and split it into a 38x38 grid.
      </p>
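<p>The resolution change keeps the per-cell pixel stride constant while quadrupling the number of anchor positions on the finest grid; a quick check of the arithmetic implied above:</p>

```python
def cell_stride(input_size, grid_cells):
    """Pixel stride of one grid cell for a square input."""
    return input_size // grid_cells

# Original PeleeNet configuration: 304x304 input, finest grid 19x19.
# Modified configuration: 608x608 input, finest grid 38x38.
assert cell_stride(304, 19) == 16   # 16-pixel cells
assert cell_stride(608, 38) == 16   # same stride, 4x as many cells
```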
    </sec>
    <sec id="sec-4">
      <title>Data generation algorithm</title>
<p>The main objective of our work was to create an algorithm that draws the target object on a
given background image. To solve this problem, the principle of domain transfer was
applied. It is based on a rendered image of a 3D object model which is pasted into some
background image. This image is processed by a neural network, which transforms the
synthetic image of the object into a more realistic one. A generative adversarial network
(GAN) was used as the transformation network.
Although in general the scheme is quite simple, in practice a number of
problems arise in its implementation, because of which synthesized data may fail to be an
effective substitute for real data. The main difficulty with this approach is a sharp
boundary (Fig. 3), which appears where the modified fragment is inserted
into the original image. Usually, image transformation algorithms change not only the
object itself and its domain, but also partially change the background. When a fragment
is inserted back into the original image, a sharp non-uniform border appears. Such an
artifact may be "learned" by the detection algorithm at the training stage as an important
informative feature, which prevents it from working correctly on real data.</p>
      <p>
        In order to get rid of this boundary effect, Attention Guided GAN was taken as the
generative model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Its learning algorithm is similar to the classical CycleGAN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], but the
principle of image generation is different. The generator receives an input image from
the original (synthetic) domain, the encoder extracts latent features, and then the
decoder creates a mask and an image, which is pasted into the original image through the
mask. This changes only the part of the image that is directly related to the domain of
the image. The border of the changed part of the picture turns out to be smooth, and it
is possible to insert the fragment back into the picture without obvious artifacts that could
mislead the detector during training. Such a generator produces 9 masks and 9 images. To train
it, two sets of images from different domains are required: a set of background images
with drone renders pasted into them, and a set of real drone images cropped from the
training dataset. The network learns to convert images from one domain to
another, like CycleGAN. The proposed data extension algorithm consists of 3 stages:
      </p>
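<p>The mask-based pasting step can be sketched per pixel; the single-mask form below is a simplification of the 9-mask AGGAN output:</p>

```python
def composite(original, generated, mask):
    """out = mask * generated + (1 - mask) * original, per pixel.
    All arguments are equally sized 2D lists; mask values lie in [0, 1]."""
    return [[mask[y][x] * generated[y][x] + (1 - mask[y][x]) * original[y][x]
             for x in range(len(original[0]))]
            for y in range(len(original))]

orig = [[10, 10], [10, 10]]
gen  = [[99, 99], [99, 99]]
msk  = [[1.0, 0.0], [0.5, 0.0]]
out = composite(orig, gen, msk)
# [[99.0, 10.0], [54.5, 10.0]] - only masked pixels change
```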
<p>The input of the algorithm is a picture and the parameters of a bounding box
inside which the target object is located. A square fragment is cut out of the picture,
centered at the center of the box, with fixed height and width. The values of
these parameters depend on the data and should be large enough that the resulting
fragment can hold the object with the largest bounding rectangle in the training dataset. The
fragment's side was set to 152 pixels in our experiment.</p>
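<p>The fragment-cropping step can be sketched as follows; clamping the square to stay inside the image is our assumption, not a detail stated in the text:</p>

```python
def square_crop(box, side=152, img_w=1280, img_h=720):
    """Return (x0, y0, x1, y1) of a square fragment of the given side,
    centered on the bounding box and clamped to stay inside the image."""
    bx0, by0, bx1, by1 = box
    cx, cy = (bx0 + bx1) // 2, (by0 + by1) // 2
    x0 = min(max(cx - side // 2, 0), img_w - side)
    y0 = min(max(cy - side // 2, 0), img_h - side)
    return x0, y0, x0 + side, y0 + side

# A box near the image corner still yields a full 152x152 fragment:
corner = square_crop((10, 10, 40, 40))   # (0, 0, 152, 152)
```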
<p>An arbitrary image of a rendered drone is pasted into the cut-out fragment, which is
then fed to the input of the trained generator, which performs the transition from
the domain of synthetic data to the domain of realistic images.</p>
      <p>The converted image is inserted back into the original picture. As the ground-truth
bounding box for the obtained data, a box is taken with its center at the center of the
square fragment and its size equal to the largest bounding box in the annotated data.</p>
      <p>
        The experiments were carried out on data consisting of footage of a drone flight in
a hangar. The footage was captured with six tripod-mounted cameras at different
angles (Fig. 5). To avoid manual annotation of the video files, the algorithm of
automatic annotation from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was used to create the training collection. Optical
flow maps were calculated for each video frame, and the area with the maximum
magnitude of optical flow was selected on each map. This area was assumed to
correspond to the drone. However, due to the presence of other moving objects,
shadows and segmentation inaccuracies, such annotation cannot be considered completely
accurate. Images from 4 cameras were taken for training the generator and detector
models; images from the other two cameras were used as the validation and test datasets.
      </p>
      <p>
        To establish the effectiveness of the proposed algorithm, three experiments were carried
out. The first consisted in training the detector on the raw training set of 900 images,
which is the hangar footage from 4 angles (4 training backgrounds). The second
experiment repeated the method proposed in the article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: the data were
extended by placing drone renders in arbitrary places. The third was to train the
detector on the same data, extended with new pictures created by the trained generator. In the
process of data extension, drones were applied to each background on a uniform grid
with 30 and 20 pixel steps along the x and y axes respectively. All manipulations were
performed on images of the original 1280x720 resolution. The sizes of the datasets in
the second and third experiments were equal. Since the source data are very poor in
background variety (there are only 4 angles of the same hangar in the training set),
it was decided to add to the dataset random images that do not contain detectable
objects. This expands the variability of backgrounds, which increases the
discriminatory ability of the network and improves the precision of the detector.
      </p>
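<p>The uniform placement grid can be enumerated as follows; keeping the 152-pixel fragment fully inside the 1280x720 frame is our assumption:</p>

```python
def placement_grid(img_w=1280, img_h=720, side=152, step_x=30, step_y=20):
    """Top-left corners of all fragment placements on a uniform grid."""
    xs = range(0, img_w - side + 1, step_x)
    ys = range(0, img_h - side + 1, step_y)
    return [(x, y) for y in ys for x in xs]

positions = placement_grid()
# 38 columns x 29 rows of candidate placements per background
```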
      <sec id="sec-4-1">
        <title>Details of AGGAN training</title>
<p>To train the generator, a square with a side of 152 pixels and its center coinciding
with the center of the bounding rectangle was cut out from each image in the training
sample. The data obtained by this procedure formed domain A. The training dataset was
parsed to find an empty (drone-free) square for each corresponding square in domain A; these
empty pictures formed domain B. In this way, pairs of images from different domains
were obtained. In order to introduce variability into the generated data, it was decided
to paste drone renders onto the images from domain B. It was assumed that the generator
would make the transition between domains by increasing the visual plausibility of the
pasted drones. In this case, the data could be expanded by overlaying new renders from
different angles. There were a total of 15 drone render images. Attention Guided
GAN training was run on the data obtained in this way for 200 epochs. Adam with
parameters lr=0.0002, beta1=0.5 and beta2=0.999 was used to optimize the network.
After 100 epochs, the learning rate decreased linearly to zero.</p>
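<p>The learning-rate schedule above (constant for 100 epochs, then linear decay to zero at epoch 200) corresponds to a multiplicative factor on the base rate of 0.0002:</p>

```python
def lr_factor(epoch, total=200, decay_start=100):
    """Multiplier applied to the base learning rate: 1.0 until
    decay_start, then linearly down to 0.0 at epoch `total`."""
    if epoch < decay_start:
        return 1.0
    return 1.0 - (epoch - decay_start) / (total - decay_start)

# lr_factor(0) == 1.0, lr_factor(150) == 0.5, lr_factor(200) == 0.0
```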
      </sec>
      <sec id="sec-4-2">
        <title>Detector training details</title>
        <p>In all experiments the detector was trained for 100 epochs. Adam with
parameters lr=0.001, beta1=0.9, beta2=0.999 was chosen as the optimizer. Since it is known that the
image will always contain no more than one object (this is the specificity of the applied
task), only the prediction with the greatest confidence was taken as the network output.</p>
        <p>The f1 metric curves for all three experiments are shown in Fig. 7, Fig. 8 and Fig. 9,
and the results on the test dataset are presented in Table 1. Figure 7 shows the
instability of the learning curve on raw real data, which indicates an insufficient amount of
training data. The final experimental results (Table 1) prove that the addition of artificial
data to the training set is useful, and that the proposed method of image transformation is
the most effective. The amount of real images used in each experiment was equal, so
artificial expansion of the dataset by our method is a solution to the lack-of-data problem.</p>
        <p>In this work, the problem of artificial expansion of the training dataset via GAN for an
object detection neural network was solved. The input data of the proposed algorithm are
background images containing renders of a 3D model of the target object. The proposed algorithm
first pastes the object model render onto the background image, then the neural network
performs domain transfer for the local fragment of the image containing the object. Experiments
show that such synthesized images can be successfully used for training detectors and allow
a significant quality improvement in comparison with using both only raw real images and a
mixture of real images with 3D model renders.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>1. Blokhinov, Yu. B.; Gorbachev, V. A.; Nikitin, A. D.; Skryabin, S. V. "Technology for the Visual Inspection of Aircraft Surfaces Using Programmable Unmanned Aerial Vehicles," Journal of Computer and Systems Sciences International, vol. 58, no. 6, pp. 960-968, 2019.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>2. O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," Sep. 2014. [Online]. Available: http://arxiv.org/abs/1409.0575.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>3. T.-Y. Lin et al., "Microsoft COCO: Common Objects in Context," May 2014. [Online]. Available: http://arxiv.org/abs/1405.0312.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>4. A. Brock, J. Donahue, and K. Simonyan, "Large Scale GAN Training for High Fidelity Natural Image Synthesis," Sep. 2018. [Online]. Available: http://arxiv.org/abs/1809.11096.</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>5. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," Nov. 2016. [Online]. Available: http://arxiv.org/abs/1611.07004.</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>6. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks," Mar. 2017. [Online]. Available: http://arxiv.org/abs/1703.10593.</mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>7. L. Perez and J. Wang, "The Effectiveness of Data Augmentation in Image Classification using Deep Learning," arXiv:1712.04621, 2017.</mixed-citation>
      </ref>
      <ref id="ref8">
<mixed-citation>8. S. J. Pan and Q. Yang, "A Survey on Transfer Learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010, doi: 10.1109/TKDE.2009.191.</mixed-citation>
      </ref>
      <ref id="ref9">
<mixed-citation>9. S. Kim, J. Choi, T. Kim, and C. Kim, "Self-Training and Adversarial Background Regularization for Unsupervised Domain Adaptive One-Stage Object Detection," Aug. 2019. [Online]. Available: http://arxiv.org/abs/1903.12296.</mixed-citation>
      </ref>
      <ref id="ref10">
<mixed-citation>10. J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. T. Birchfield, "Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 1082-10828.</mixed-citation>
      </ref>
      <ref id="ref11">
<mixed-citation>11. L. Liu, M. Muelly, J. Deng, T. Pfister, and L.-J. Li, "Generative Modeling for Small-Data Object Detection," Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.07169.</mixed-citation>
      </ref>
      <ref id="ref12">
<mixed-citation>12. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," Jun. 2015. [Online]. Available: http://arxiv.org/abs/1506.02640.</mixed-citation>
      </ref>
      <ref id="ref13">
<mixed-citation>13. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Jun. 2015. [Online]. Available: http://arxiv.org/abs/1506.01497.</mixed-citation>
      </ref>
      <ref id="ref14">
<mixed-citation>14. T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal Loss for Dense Object Detection," Aug. 2017. [Online]. Available: http://arxiv.org/abs/1708.02002.</mixed-citation>
      </ref>
      <ref id="ref15">
<mixed-citation>15. R. J. Wang, X. Li, and C. X. Ling, "Pelee: A Real-Time Object Detection System on Mobile Devices," Apr. 2018. [Online]. Available: http://arxiv.org/abs/1804.06882.</mixed-citation>
      </ref>
      <ref id="ref16">
<mixed-citation>16. H. Tang, D. Xu, N. Sebe, and Y. Yan, "Attention-Guided Generative Adversarial Networks for Unsupervised Image-to-Image Translation," Mar. 2019. [Online]. Available: http://arxiv.org/abs/1903.12296.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>