<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Polina Katkova</string-name>
          <email>Lin997@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Yakimov</string-name>
          <email>yakimov@ssau.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>305</fpage>
      <lpage>308</lpage>
      <abstract>
        <p>Computer Vision technology is developing rapidly. The need for 3D reconstruction methods grows along with the number of deployed Computer Vision systems. The greatest need is for methods that take a single image as input data. This article provides an overview of existing 3D reconstruction methods and describes a planned implementation, which consists of a platform and a single-image 3D reconstruction algorithm. It also presents a Telegram bot that allows anyone to test PIFu, and an overview of Mask R-CNN, which will be used later in this work.</p>
      </abstract>
      <kwd-group>
        <kwd>3D Reconstruction</kwd>
        <kwd>3D Human body recovery algorithms</kwd>
        <kwd>PIFu algorithm</kwd>
        <kwd>Telegram bot</kwd>
        <kwd>Segmentation methods</kwd>
        <kwd>Mask R-CNN</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Computer Vision (CV) systems are widely used today. Most of them have only one camera, so they cannot capture a set of images from different angles. The ability to create 3D content from a single image is therefore becoming highly relevant. Progress in deep learning, neural networks and segmentation algorithms simplifies the process of 3D reconstruction and thus helps to advance areas such as CV and immersive technologies.</p>
      <p>3D reconstruction methods are also used in the following areas: medicine (e.g. computed tomography), Computer Vision (e.g. scene reconstruction, which can be used to calculate a movement trajectory), microscopy, cinematography, animation, video tracking (e.g. biometric person identification), retail (e.g. online product demonstration in 3D), immersive technologies and others.</p>
      <p>This article contains an overview of frameworks for popular 3D human reconstruction methods, most of which were released in the past few years. Three types of methods are considered: parametric methods, methods for recovering human shape and pose, and human body recovery methods. These methods combine techniques such as Convolutional Neural Networks, semantic segmentation and marching cubes.</p>
      <p>
        When the input data consists of multiple images taken from different viewing angles (an example of capturing an image set from different points of view is illustrated in Figure 1), the result of 3D reconstruction is almost unambiguous. Some years ago Autodesk released a product named ReCap, which can reconstruct a 3D model from an image set [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In practice, however, there is a greater need for methods that reconstruct a 3D model from a single image, because such methods have wider practical use.
      </p>
      <p>
        The main problem of reconstructing a model from a single image is the ambiguity of the shape of the back side, which is not visible in the picture [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Texturing poses a similar problem: the visible part of the texture can be partially copied from the image, but the texture of the reverse side has to be calculated by an algorithm.
      </p>
    </sec>
    <sec id="sec-2">
      <title>II. THE OVERVIEW OF EXISTING METHODS</title>
      <sec id="sec-2-1">
        <title>A. Algorithms for recovering human shape and pose</title>
        <p>The End-to-end Recovery of Human Shape and Pose algorithm (HMR) was released in June 2019. It recovers a human model from a single image. Unlike other methods, End-to-end Recovery can determine the location of key joints even if the person in the photo is turned away.</p>
        <p>The input data is an RGB image. First, the image passes through a convolutional encoder; the result is then sent to the 3D regression module, which iteratively minimizes the loss on the 3D model. Finally, the result passes through the discriminator module, which determines whether the resulting 3D model belongs to a person or not. The scheme of the End-to-end Recovery of Human Shape and Pose algorithm is shown in Figure 2.</p>
        <p>A number of experimental studies of this method have been conducted. The comparison of 3D reconstruction losses for different methods is illustrated in Figure 3.</p>
        <p>The PIFu algorithm consists of a convolutional encoder and a continuous function. An overview of the PIFu framework is illustrated in Figure 6.</p>
        <p>The comparison of the execution time of HMR with that of other methods is illustrated in Figure 4.</p>
        <p>Figures 3 and 4 show that End-to-end Recovery achieves the best results in comparison with the other methods.</p>
      </sec>
      <sec id="sec-2-2">
        <title>A. Algorithms for recovering human shape and pose</title>
        <p>The previous algorithm was able to recover the human shape, but not the shape of the clothes.</p>
      </sec>
      <sec id="sec-2-3">
        <p>The SiCloPe: Silhouette-Based Clothed People method was released in August 2019 and can reconstruct both human shapes and clothes. After the 3D model reconstruction process, SiCloPe recreates the model texture.</p>
        <p>The algorithm consists of the following steps. First, it extracts the 2D human silhouette and creates a 3D map of the model joint locations. Second, it generates new 2D silhouettes of the model from the 3D joint location map. Third, SiCloPe reconstructs the 3D model from the set of 2D silhouettes produced in the second step. If the 2D silhouettes are built incorrectly, the mesh used for reconstruction will not match the actual model either. To keep the reconstruction mesh correct, SiCloPe uses a deep surface recognition algorithm that includes “greedy sampling”. The last step of the algorithm is texturing the reconstructed model.</p>
        <p>The scheme of the SiCloPe algorithm is illustrated in Figure 5.</p>
        <p>
          The PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization method [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] was released in November 2019. It reconstructs 3D models from one image or a set of images. A distinctive feature of PIFu is high-quality texture reconstruction, even for the parts of the object that are not visible in the picture.
        </p>
        <p>The algorithm is able to reconstruct even complicated figures, including crumpled clothes, high heels or complex hairstyles.</p>
      </sec>
      <sec id="sec-2-4">
        <title>B. The parameterized algorithms of human recovery</title>
        <p>
          The skinned multi-person linear model (SMPL) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is one of the most popular parameterized algorithms for human body recovery. SMPL was released in 2015 and is still used in other 3D reconstruction works, either as part of the implementation or as a baseline for comparison.
        </p>
        <p>SMPL has been trained on thousands of 3D models of human bodies with different shapes and figures. The recovered 3D model carries a map of weights for each point of the body model, so the joints look realistic when the model changes its pose.</p>
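        <p>The role of the per-point weights can be illustrated with a minimal blend-skinning sketch. It is a simplification and not the SMPL implementation: it works in 2D, treats each joint rotation independently, and the names <monospace>rotate_about</monospace> and <monospace>blend_skinning</monospace>, as well as the toy joints and weights, are assumptions made for this illustration.</p>
        <preformat><![CDATA[
```python
import math


def rotate_about(point, pivot, angle):
    """Rotate a 2D point around a pivot by `angle` radians."""
    px, py = point[0] - pivot[0], point[1] - pivot[1]
    c, s = math.cos(angle), math.sin(angle)
    return (pivot[0] + c * px - s * py, pivot[1] + s * px + c * py)


def blend_skinning(vertex, weights, joints, angles):
    """Blend the positions one vertex would take under each joint's rotation.

    weights[j] is the influence of joint j on this vertex; the weights sum to 1.
    """
    x = y = 0.0
    for w, pivot, angle in zip(weights, joints, angles):
        tx, ty = rotate_about(vertex, pivot, angle)
        x += w * tx
        y += w * ty
    return (x, y)


# Toy "arm": joint 0 is the shoulder (kept still), joint 1 is the elbow (bent by 90 degrees).
joints = [(0.0, 0.0), (1.0, 0.0)]
angles = [0.0, math.pi / 2]

# A point on the forearm follows the elbow completely.
forearm = blend_skinning((2.0, 0.0), [0.0, 1.0], joints, angles)
# A point close to the elbow is shared between both joints, which smooths the bend.
near_elbow = blend_skinning((1.2, 0.0), [0.5, 0.5], joints, angles)
```
]]></preformat>
        <p>In this toy setup the point near the elbow moves only part of the way, which is the effect the weight map is meant to produce around joints.</p>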
        <p>3D models recovered with the SMPL algorithm can be imported into programs such as Autodesk Maya or Unity, where they can then be animated.</p>
        <p>The SMPL model is illustrated in Figure 7.</p>
        <p>
          Another popular parametric algorithm is Shape
Completion and Animation of People (SCAPE) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. SCAPE
was published in 2005 in the ACM Transactions on
Graphics journal.
        </p>
        <p>SCAPE makes it possible to combine a single scan of a person with a sequence of motion markers. As a result, the algorithm returns an animation produced by combining a body shape with a pose.</p>
        <p>
          The algorithm consists of three parts: pose deformation, body shape deformation and animation via motion capture data. The body shape can be deformed by changing a template shape with four parameters: height, weight, muscularity and gender. An overview of the deformation parameters is illustrated in Figure 8. If the body scan is missing a part of the surface, SCAPE can complete the shape using the Correlated Correspondence (CC) algorithm [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The pose can be deformed via the CC algorithm as well.
        </p>
        <p>The authors created two data sets: a pose data set consisting of 70 poses and a shape data set consisting of 45 different body shapes. The SCAPE algorithm can also be applied to shapes other than the human body.</p>
      </sec>
      <sec id="sec-2-5">
        <title>C. The overview results</title>
        <p>The long-term goal of this work is to create a virtual fitting room. It is proposed to use the PIFu algorithm for this purpose, because PIFu has an open repository and is simple to install and run. The implementation of PIFu takes an RGB image of a human body and a mask that marks the human in the image. It is proposed to study the Mask R-CNN segmentation method and integrate it in a future implementation, so that a single image alone can serve as the input data for the PIFu algorithm. This method is reviewed in Section IV.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>III. EXPERIMENTAL RESEARCH</title>
      <p>The PIFu algorithm was tested on different data during the experimental studies. The input data consists of a photo and a mask for this image, created in Photoshop, which separates the object from the background. The images are in PNG format and have a resolution of 720x1080 pixels. The input data and the results of each of the four experiments are illustrated in Figures 9-12. Figure 12 shows that the result of the fourth experiment is not very precise and has a high loss.</p>
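      <p>To make the role of the mask concrete, the sketch below shows how a binary mask separates the person from the background of an RGB image. It is only an illustration under simple assumptions (images as nested Python lists, a mask of 0/255 values) and not the preprocessing code of PIFu; the function name <monospace>apply_mask</monospace> is hypothetical.</p>
      <preformat><![CDATA[
```python
def apply_mask(image, mask, background=(0, 0, 0)):
    """Keep pixels where the mask is non-zero and replace the rest with `background`.

    image: rows of (r, g, b) tuples; mask: rows of 0/255 values of the same size.
    """
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same size")
    return [
        [pixel if m else background for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Tiny 2x2 example: the mask keeps the top-left and bottom-right pixels.
image = [[(200, 10, 10), (20, 220, 20)],
         [(30, 30, 230), (240, 240, 40)]]
mask = [[255, 0],
        [0, 255]]
masked = apply_mask(image, mask)
```
]]></preformat>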
      <p>Fig. 9. Results of the first experiment: (a) input RGB image, (b) mask for the input image, (c) front view of the resulting 3D object, (d) view of the resulting 3D object from the back side.</p>
      <p>
        Fig. 10. Results of the second experiment: (a) input RGB image, (b) mask for the input image, (c) front view of the resulting 3D object, (d) view of the resulting 3D object from the back side [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Fig. 11. Results of the third experiment: (a) input RGB image, (b) mask for the input image, (c) front view of the resulting 3D object, (d) view of the resulting 3D object from the back side.</p>
      <p>Fig. 12. Results of the fourth experiment: (a) input RGB image, (b) mask for the input image, (c) front view of the resulting 3D object, (d) view of the resulting 3D object from the back side.</p>
      <p>The run time of the PIFu algorithm was 8.92 seconds in the first experiment, 10.47 seconds in the second, 7.19 seconds in the third and 7.34 seconds in the fourth.</p>
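      <p>The four reported run times can be summarized with a few lines of Python. The snippet only restates the numbers given above; the variable names are chosen for this illustration.</p>
      <preformat><![CDATA[
```python
from statistics import mean

# Run times in seconds, as reported for the four experiments.
run_times = {"first": 8.92, "second": 10.47, "third": 7.19, "fourth": 7.34}

average = mean(run_times.values())
slowest = max(run_times, key=run_times.get)
fastest = min(run_times, key=run_times.get)

print(f"mean run time: {average:.2f} s")
print(f"slowest: {slowest}, fastest: {fastest}")
```
]]></preformat>
      <p>The mean over the four experiments is about 8.48 seconds, with the second experiment being the slowest and the third the fastest.</p>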
      <p>More result images are available in the following GitHub repository: https://github.com/thePolly/PIFu. This repository also contains the code for the Telegram bot and the PIFu algorithm.</p>
    </sec>
    <sec id="sec-4">
      <title>IV. IMPLEMENTATION</title>
      <p>This work proposes implementing an algorithm that recovers a 3D human model from a single image. Such an algorithm can later be used to build a virtual fitting room. The methods reviewed above can serve as examples for the implementation.</p>
      <p>The method will consist of two convolutional encoders, one for 3D model reconstruction and one for texture reconstruction.</p>
      <p>
        For the implementation, it is planned to create a dataset that includes a set of 2D images of people and a set of 3D models corresponding to these images. A Microsoft Kinect 2.0 camera and a ZED 2K stereo camera will be used to create the 3D objects. The Microsoft Kinect camera uses an infrared laser to determine the depth of the image matrix. The optimal distance between objects and the Kinect camera is between one and four meters [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Unlike Kinect, the ZED camera has no infrared sensor. ZED uses methods that include artificial intelligence to determine the image depth [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>A. Telegram bot</title>
        <p>Anyone can use the Telegram bot as a platform for 3D reconstruction. Currently, the bot accepts a single image and a mask of this image as input data and returns a file with the resulting 3D model. The “/help” command provides more details. The name of the bot is @human_body_recnstruction_bot.</p>
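        <p>Because the bot needs two uploads (the image and its mask) before reconstruction can start, some per-chat bookkeeping is required. The sketch below shows one possible way to pair the two files. It is an assumption made for illustration, not the code of the bot described here: the class <monospace>PendingInputs</monospace>, its method and the file identifiers are hypothetical, and no Telegram library is involved.</p>
        <preformat><![CDATA[
```python
class PendingInputs:
    """Collects the image and the mask sent by each chat until both are present."""

    def __init__(self):
        self._by_chat = {}

    def add(self, chat_id, kind, file_id):
        """Store one upload; return the complete pair once both parts have arrived."""
        if kind not in ("image", "mask"):
            raise ValueError("kind must be 'image' or 'mask'")
        entry = self._by_chat.setdefault(chat_id, {})
        entry[kind] = file_id
        if len(entry) == 2:
            # Both files are available: hand them over and forget the chat state.
            return self._by_chat.pop(chat_id)
        return None


pending = PendingInputs()
first = pending.add(chat_id=42, kind="image", file_id="photo-1")
second = pending.add(chat_id=42, kind="mask", file_id="mask-1")
```
]]></preformat>
        <p>Here the first call returns nothing because the mask is still missing, and the second call returns both file identifiers, at which point reconstruction could be started.</p>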
      </sec>
      <sec id="sec-4-2">
        <title>B. Image segmentation</title>
        <p>To make the system easier to use, a segmentation method has to be implemented. It will allow users to upload only one RGB image without any mask.</p>
        <p>
          The Mask R-CNN segmentation method was released in 2018 by Facebook AI Research [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The framework can detect multiple objects of different types in an image.
        </p>
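        <p>The Mask R-CNN framework is based on Faster R-CNN. Faster R-CNN has two outputs, a class label and a bounding-box offset, and Mask R-CNN adds a third output, which predicts the segmentation mask for each detected object. The loss for Mask R-CNN is therefore defined as the sum of the losses for each output:</p>
        <disp-formula><tex-math><![CDATA[L = L_{cls} + L_{box} + L_{mask},]]></tex-math></disp-formula>
        <p>where <inline-formula><tex-math><![CDATA[L_{cls}]]></tex-math></inline-formula> is the classification loss, <inline-formula><tex-math><![CDATA[L_{box}]]></tex-math></inline-formula> is the bounding-box loss and <inline-formula><tex-math><![CDATA[L_{mask}]]></tex-math></inline-formula> is the mask definition loss.</p>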
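        <p>The summed loss described above can be sketched numerically. The example is illustrative only: the classification and bounding-box values are made-up numbers, and the mask term is computed as an average per-pixel binary cross-entropy, which is how the Mask R-CNN paper defines it but is not spelled out in this article. The names <monospace>mask_bce</monospace> and <monospace>total_loss</monospace> are assumptions.</p>
        <preformat><![CDATA[
```python
import math


def mask_bce(predicted, target):
    """Average binary cross-entropy between predicted probabilities and a 0/1 mask."""
    eps = 1e-12  # avoids log(0) for saturated predictions
    terms = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        for p, t in zip(predicted, target)
    ]
    return sum(terms) / len(terms)


def total_loss(l_cls, l_box, l_mask):
    """Total loss as the plain sum of the three per-output losses."""
    return l_cls + l_box + l_mask


# Four mask pixels: predicted foreground probabilities and the ground-truth mask.
l_mask = mask_bce([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
# Hypothetical classification and bounding-box loss values.
loss = total_loss(l_cls=0.30, l_box=0.20, l_mask=l_mask)
```
]]></preformat>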
        <p>The framework allows choosing the specific classes to detect. For the online fitting room, the class list should contain only the class for human bodies. Examples of Mask R-CNN detection are illustrated in Figure 13.</p>
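        <p>Restricting the output to the human class can be sketched as a simple filter over detection results. The detection format used here (a list of dictionaries with <monospace>label</monospace>, <monospace>score</monospace> and <monospace>box</monospace> keys) and the function <monospace>keep_people</monospace> are hypothetical and do not correspond to the API of any particular Mask R-CNN implementation.</p>
        <preformat><![CDATA[
```python
def keep_people(detections, min_score=0.5):
    """Keep only 'person' detections above a confidence threshold, best first."""
    people = [
        d for d in detections
        if d["label"] == "person" and d["score"] >= min_score
    ]
    return sorted(people, key=lambda d: d["score"], reverse=True)


# Made-up detections for one image.
detections = [
    {"label": "person", "score": 0.98, "box": (34, 20, 210, 460)},
    {"label": "chair", "score": 0.91, "box": (250, 300, 380, 470)},
    {"label": "person", "score": 0.32, "box": (400, 80, 450, 200)},
]
selected = keep_people(detections)
```
]]></preformat>
        <p>Only the confident person detection remains; its mask could then be passed on together with the original image.</p>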
      </sec>
    </sec>
    <sec id="sec-5">
      <title>V. CONCLUSION</title>
      <p>The following types of 3D reconstruction methods have been reviewed: recovery of human shape and pose, human model recovery and parameterized human recovery. Most of these methods can accept a single image as input data.</p>
      <p>The overview describes the most popular 3D human reconstruction methods and, for each of them, the techniques used in its implementation. This paper may help in the design phase of method development: it can be used to understand which type of 3D reconstruction method should be implemented for a given task and which technologies the method should include. Thus, to implement a virtual fitting room, it is appropriate to use a parametric method or a method for recovering human shape and pose, because the resulting object will then contain no clothing items.</p>
      <p>In conclusion, a Telegram bot that allows testing the PIFu algorithm has been created, and the proposed implementation has been outlined. To create a virtual fitting room, a dataset first has to be created, a method for recovering 3D human pose and shape has to be implemented, and the reviewed Mask R-CNN has to be integrated as well.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Evseev</surname>
          </string-name>
          , “
          <article-title>Implementation and research of models and algorithms of 3D reconstruction of cloud points defined by a sequence of parallel sections</article-title>
          ,”
          Ph.D. thesis,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          “
          <article-title>Autodesk ReCap</article-title>
          ,” Autodesk Knowledge Network,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Choi</surname>
          </string-name>
          , “
          <article-title>Understanding indoor scenes using 3D geometric phrases</article-title>
          ,”
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <source>Photogrammetry</source>
          ,
          <year>2019</year>
          [Online]. URL: https://imgur.com/gallery/yuEncdf/comment.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kanazawa</surname>
          </string-name>
          , “
          <article-title>End-to-end recovery of human shape and pose</article-title>
          ,”
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>7122</fpage>
          -
          <lpage>7131</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>SiCloPe: Silhouette-Based Clothed People</article-title>
          , arXiv preprint: 1901.00049v2.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sh.</given-names>
            <surname>Saito</surname>
          </string-name>
          ,
          <article-title>"Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization,"</article-title>
          <source>Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Loper</surname>
          </string-name>
          , “
          <article-title>SMPL: A skinned multi-person linear model</article-title>,” <source>ACM Transactions on Graphics (TOG)</source>
          , vol.
          <volume>34</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>248</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <article-title>"SCAPE: shape completion and animation of people,"</article-title>
          <source>ACM SIGGRAPH</source>
          , pp.
          <fpage>408</fpage>
          -
          <lpage>416</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <article-title>"The correlated correspondence algorithm for unsupervised registration of nonrigid surfaces,"</article-title>
          <source>Advances in neural information processing systems</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <source>Leonardo DiCaprio Club</source>
          ,
          <year>2020</year>
          [Online]. URL: https://ru.fanpop.com/clubs/leonardodicaprio/images/10841990/title/leonardo-dicaprio-photo.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <source>How Stuff Works</source>
          .
          <article-title>How Microsoft Kinect Works</article-title>
          ,
          <year>2020</year>
          [Online]. URL: https://electronics.howstuffworks.com/microsoft-kinect1.htm.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <source>ZED 2</source>
          . Stereolabs,
          <year>2020</year>
          [Online]. URL: https://www.stereolabs.com/zed-2.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>"Mask R-CNN,"</article-title>
          <source>Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>