3D Human Reconstruction using single 2D Image

Polina Katkova, Samara National Research University, Samara, Russia, Lin997@yandex.ru
Pavel Yakimov, Samara National Research University, Samara, Russia, yakimov@ssau.ru

Abstract—Computer Vision technology is developing rapidly nowadays. The need for 3D reconstruction methods grows along with the number of Computer Vision system implementations. The highest demand is for methods that use a single image as input data. This article provides an overview of existing methods for 3D reconstruction and an explanation of the planned implementation, which consists of a platform and a 3D reconstruction algorithm that uses a single image. The article also describes the implementation of a Telegram bot that allows anyone to test PIFu, and an overview of Mask R-CNN, which will be used later in this work.

Keywords—3D Reconstruction, 3D Human body recovery algorithms, PIFu algorithm, Telegram bot, Segmentation methods, Mask R-CNN

I. INTRODUCTION

At the moment, using a 3D model instead of a real physical object is often an important requirement. Digital copies provide more variety and flexibility to users than physical objects. Besides that, using a digital model can save a lot of time, because it can be used regardless of the location of the model prototype, and because parameter calculation can be several orders faster than with a real model [1].

Computer Vision (CV) systems are widely used nowadays. Most of them have only one camera, so they are not able to capture a set of images from different angles. Thus, the possibility to create 3D content from a single image is becoming highly relevant. Progress in methods such as deep learning, neural networks and segmentation algorithms helps to simplify the process of 3D reconstruction and will therefore help to develop different areas, such as CV or immersive technologies.

The range of 3D reconstruction applications also includes the following areas: medicine (e.g. computed tomography), Computer Vision (e.g. scene reconstruction, which can be used for calculating a trajectory of movement), microscopy, cinematography, animation, video tracking (e.g. biometric person identification), retail (e.g. online product demonstration in 3D), immersive technologies, et cetera.

The article contains an overview of frameworks for popular 3D human reconstruction methods. Most of these methods have been released in the past few years. The following three types of methods were considered: parametric methods, methods for recovering human shape and pose, and human body recovery methods. These methods use a combination of techniques such as Convolutional Neural Networks, semantic segmentation, marching cubes, et cetera.

In the case where the input data consists of multiple images taken from different angles of view (an example of the process of capturing an image set from different points of view is illustrated in Figure 1), the result of 3D reconstruction is almost unambiguous. Some years ago the company Autodesk released a product named ReCap, which is able to reconstruct a 3D model from an image set [2]. However, in reality there is a higher need for 3D reconstruction methods that work from a single image, because they have higher practical use.

Fig. 1. Example of a studio which allows capturing a set of images from different angles [4].

The problem of model reconstruction from a single image is the ambiguity of the back-side shape definition (the side which is not visible in the picture) [3]. There is a similar problem with texturing: the visible part of the texture can be partially copied, but the reverse side has to be calculated by an algorithm, which has to be implemented.

II. THE OVERVIEW OF EXISTING METHODS

A. Algorithms for recovering human shape and pose

The algorithm End-to-end Recovery of Human Shape and Pose (HMR) was released in 2018 [5]. It allows recovering a human model from a single image. Unlike other methods, End-to-end Recovery can determine the location of key joints even if the person in the photo is turned away.

Fig. 2. Scheme of the End-to-end Recovery of Human Shape and Pose algorithm [5].

The input data is an RGB image.
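The stages of such a pipeline, an RGB image encoded into a feature vector, 3D parameters regressed iteratively, and the result validated by a discriminator, can be sketched as follows. This is a simplified mock, not the authors' implementation: the function names, shapes and update rule are hypothetical stand-ins.

```python
import numpy as np

def encode(image):
    """Stand-in for the convolutional encoder: reduce the image to a
    fixed-length feature vector (here, simple channel statistics)."""
    return np.array([image.mean(), image.std(), image.max()])

def regress_3d(features, n_iters=3):
    """Iterative 3D regression: start from a neutral parameter vector and
    refine it a fixed number of times (iterative error feedback)."""
    params = np.zeros(5)                         # stand-in for pose/shape params
    for _ in range(n_iters):
        correction = 0.1 * (features.sum() - params)  # mock update rule
        params = params + correction
    return params

def discriminate(params):
    """Stand-in discriminator: accept parameter vectors whose magnitude
    lies within a plausible range."""
    return bool(np.all(np.abs(params) < 10.0))

image = np.random.default_rng(0).random((224, 224, 3))  # mock input RGB image
theta = regress_3d(encode(image))
print(discriminate(theta))
```

In the published method [5], the regressor predicts SMPL pose and shape parameters together with the camera, and the discriminator is trained adversarially so that implausible body configurations are rejected.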
Firstly, the image passes through a convolutional encoder, and the result is then sent to the 3D regression module, which iteratively minimizes the loss on the 3D model. Lastly, the result passes through the discriminator module, which determines whether the resulting 3D model belongs to a person or not. The scheme of the End-to-end Recovery of Human Shape and Pose algorithm is shown in Figure 2.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

A number of experimental studies have been conducted on this method. The comparison of 3D reconstruction losses for different methods is illustrated in Figure 3.

Fig. 3. Comparison of HMR with other methods by the criterion of 3D reconstruction loss [5].

The comparison of HMR with other methods by execution time is illustrated below, in Figure 4.

Fig. 4. Comparison of HMR with other methods by the criterion of time needed for 3D reconstruction [5].

Figures 3 and 4 show that End-to-end Recovery has the best results in comparison with the other methods.

The previous algorithm was able to recover the human shape, but not the shape of the clothes. The method SiCloPe: Silhouette-Based Clothed People was released in August 2019, and it is able to reconstruct human shapes together with clothes [6]. After the 3D model reconstruction process, SiCloPe recreates the model texture.

The algorithm consists of the following steps: firstly, it detects 2D human silhouettes and creates a map of 3D model joint locations; secondly, the method generates new 2D silhouettes of the model via the 3D joint location map; after that, SiCloPe reconstructs the 3D model using the set of 2D silhouettes from the second step. If the 2D silhouettes are built incorrectly, the mesh used for reconstruction will also fail to match the actual model. SiCloPe therefore uses a deep surface recognition algorithm which includes "greedy sampling"; using this algorithm guarantees that the reconstruction mesh will be correct. The last step of the algorithm is texturing the reconstructed model.

The scheme of the SiCloPe algorithm is illustrated below, in Figure 5.

Fig. 5. Overview of the SiCloPe: Silhouette-Based Clothed People framework [6].

The method PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization [7] was released in November 2019. This method allows reconstructing 3D models from one image or from a set of images. The distinctive feature of PIFu is high-quality texture reconstruction even on the parts of the object that are invisible in the picture. The algorithm is able to reconstruct even complicated figures, including crumpled clothes, high heels or complex hairstyles.

The PIFu algorithm consists of a convolutional encoder and a continuous implicit function. The overview of the PIFu framework is illustrated in Figure 6.

Fig. 6. The overview of PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization [7].

B. The parameterized algorithms of human recovery

The algorithm named Skinned Multi-Person Linear model (SMPL) [8] is one of the most popular parameterized algorithms for human body recovery. SMPL was released in 2015 and is still being used in other 3D reconstruction works, as part of an implementation or for comparison.

SMPL has been trained on several thousand 3D models of human bodies with different forms and figures. The recovered 3D model carries a map with weight data at each point of the body model, so the joints can look realistic when the model changes its pose. 3D models recovered via the SMPL algorithm can be used in programs such as Autodesk Maya or Unity, where they can later be animated.

The SMPL model is illustrated below, in Figure 7.

Fig. 7. SMPL model: (a) – human model with a weight grid, (b) – parametrized human model, (c) – human body model with blended shape applied, (d) – human body model in a pose [8].

Fig. 10. Results of the second experiment: (a) – input RGB image, (b) – mask for the input image, (c) – front view of the resulting 3D object, (d) – view of the resulting 3D object from the back side [11].

Another popular parametric algorithm is Shape Completion and Animation of People (SCAPE) [9]. SCAPE was published in 2005 in the ACM Transactions on Graphics journal. SCAPE allows combining a single scan of a person with a sequence of motion markers. As a result, the algorithm returns an animation made by mixing a body shape with a pose.

The algorithm consists of three parts: pose deformation, body shape deformation, and animation via motion capture data. The body shape can be deformed by changing a template shape with four possible parameters: height, weight, muscularity and gender. The overview of the deformation parameters is illustrated in Figure 8. In case the body scan is missing a part of the surface, SCAPE can complete the shape using the Correlated Correspondence (CC) algorithm [10]. The pose can be deformed via the CC algorithm as well.

The authors have created two data sets: the pose data set consists of 70 poses, and the shape data set consists of 45 different body shapes. The SCAPE algorithm can also be applied to shapes other than human.
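The core parametric idea behind SMPL-like models described above, a fixed template mesh deformed by a linear combination of learned shape directions, can be sketched as follows. This is a toy illustration with random stand-in arrays, not the released SMPL model: the real model has 6890 vertices and learned blend shapes, and additionally applies pose blend shapes and linear blend skinning, which are omitted here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_vertices, n_betas = 100, 10                   # toy sizes; real SMPL: 6890 vertices, 10 betas

template = rng.standard_normal((n_vertices, 3))             # mean body shape T
shape_dirs = rng.standard_normal((n_vertices, 3, n_betas))  # shape blend-shape directions S

def shaped_mesh(betas):
    """Apply shape parameters: V = T + S @ betas (per-vertex 3D offsets)."""
    return template + shape_dirs @ betas

# Zero betas reproduce the template (the "mean" body).
assert np.allclose(shaped_mesh(np.zeros(n_betas)), template)

betas = np.zeros(n_betas)
betas[0] = 2.0                                  # move along the first shape direction
v = shaped_mesh(betas)
print(v.shape)  # (100, 3)
```

The appeal of this formulation is that a whole space of body shapes is controlled by a handful of coefficients, which is exactly what makes parametric models convenient to animate in tools such as Maya or Unity.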
Fig. 11. Results of the third experiment: (a) – input RGB image, (b) – mask for the input image, (c) – front view of the resulting 3D object, (d) – view of the resulting 3D object from the back side.

Fig. 8. The four parameters for shape deformation in the SCAPE model.

C. The overview results

The future purpose of the current work is creating a virtual fitting room. It is proposed to use the PIFu algorithm for this purpose. The reasons for this choice are the open repository and the simple installation and running of PIFu. The implementation of PIFu uses an RGB image of a human body and a mask which allows detecting the human in the image. It is proposed to research the segmentation method Mask R-CNN and implement it for the future realization, so that a single image can be used as the input data for the PIFu algorithm. This method is overviewed in the next chapter.

III. EXPERIMENTAL RESEARCH

The PIFu algorithm has been tested on different data during the experimental studies. The input data consists of a photo and a mask created for this image in Photoshop (to separate the object from the background). The images have PNG format and a resolution of 720x1080 pixels. The input data and the results of each of the four experiments are illustrated in Figures 9–12. Figure 12 demonstrates that the result of the fourth experiment is not very precise and has a high loss.

Fig. 9. Results of the first experiment: (a) – input RGB image, (b) – mask for the input image, (c) – front view of the resulting 3D object, (d) – view of the resulting 3D object from the back side.

Fig. 12. Results of the fourth experiment: (a) – input RGB image, (b) – mask for the input image, (c) – front view of the resulting 3D object, (d) – view of the resulting 3D object from the back side.

The run time of the PIFu algorithm in the first experiment equals 8.92 seconds, in the second – 10.47 seconds, in the third – 7.19 seconds, and in the fourth – 7.34 seconds.

More result images are available in the following GitHub repository: https://github.com/thePolly/PIFu. This repository contains the code for the Telegram bot as well as the PIFu algorithm.

IV. IMPLEMENTATION

A. Proposed implementation

It is proposed that in this work a 3D human model recovery algorithm working from a single image has to be implemented. This algorithm can later be used for the implementation of a virtual fitting room. The methods overviewed earlier can be used as examples for the implementation. The method will consist of two convolutional encoders, one for the 3D model and one for texture reconstruction.

It is planned that for the realization a dataset will be created which includes a set of 2D images of people and a set of 3D models for these images. A Microsoft Kinect 2.0 camera and a ZED 2K stereo camera will be used for creating the 3D objects. The Microsoft Kinect camera uses an infrared laser for determining the depth of the image matrix. The optimal distance between objects and the Kinect camera is between one and four meters [12]. Unlike Kinect, the ZED camera has no infrared sensor; ZED uses methods which include artificial intelligence for determining the image depth [13].

B. Telegram bot

Anyone can use the Telegram bot as a platform for 3D reconstruction. Currently, the bot accepts a single image and a mask of this image as input data and returns a file with the resulting 3D model. For more details, the "/help" command can be used. The name of the bot is @human_body_recnstruction_bot.

C. Image segmentation

To make the process of using Computer Vision easier, a segmentation method has to be implemented. It will allow users to upload only one RGB image, without any mask.

The Mask R-CNN segmentation method was released in 2017 by Facebook AI Research [14]. The framework allows detecting multiple objects of different types in an image.

The Mask R-CNN framework is based on Faster R-CNN. Faster R-CNN has two outputs, a class label and a bounding-box offset; Mask R-CNN has an additional third output, a mask, which predicts the layout of the segmentation mask for a detected object. So, the loss for Mask R-CNN is defined as the sum of the losses for each output:

L = L_cls + L_box + L_mask,   (1)

where L_cls is the classification loss, L_box is the bounding-box loss and L_mask is the mask definition loss.

The framework allows choosing specific classes to detect. For the online fitting room implementation, the class list should contain only the class for human bodies. Examples of Mask R-CNN detection are illustrated in Figure 13.

Fig. 13. Mask R-CNN results on the COCO test set [14].

V. CONCLUSION

Thus, the following types of 3D reconstruction methods have been overviewed: recovery of human shape and pose, human model recovery, and parametrized human recovery. Most of these methods can accept a single image as input data.

The overview contains a description of the most popular 3D human reconstruction methods. Each overview describes methods which have been used in the implementation process. This paper may help in the design phase of method development. It can be used to understand which type of 3D reconstruction method has to be implemented depending on the task, and which technologies this method should include. Thus, to implement a virtual fitting room, it is appropriate to use a parametric method or a method for recovering human shape and pose, because then the resulting object will contain no clothing items.

In conclusion, the Telegram bot which allows testing the PIFu algorithm has been created, and the proposed realization has been set out. So, to create a virtual fitting room, firstly a dataset has to be compiled, then a method for 3D human pose and shape recovery has to be implemented, and the overviewed Mask R-CNN has to be implemented as well.

REFERENCES

[1] O.V. Evseev, "Implementation and research of models and algorithms of 3D reconstruction of cloud points defined by a sequence of parallel sections," Ph.D. thesis, 2016.
[2] "Autodesk ReCap," Autodesk Knowledge Network, 2019.
[3] W. Choi, "Understanding indoor scenes using 3D geometric phrases," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 33-40, 2013.
[4] Photogrammetry, 2019 [Online]. URL: https://imgur.com/gallery/yuEncdf/comment.
[5] A. Kanazawa, "End-to-end recovery of human shape and pose," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7122-7131, 2018.
[6] "SiCloPe: Silhouette-Based Clothed People," arXiv preprint arXiv:1901.00049v2.
[7] Sh. Saito, "PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization," Proceedings of the IEEE International Conference on Computer Vision, 2019.
[8] M. Loper, "SMPL: A skinned multi-person linear model," ACM Transactions on Graphics (TOG), vol. 34, no. 6, p. 248, 2015.
[9] D. Anguelov, "SCAPE: Shape completion and animation of people," ACM SIGGRAPH, pp. 408-416, 2005.
[10] D. Anguelov, "The correlated correspondence algorithm for unsupervised registration of nonrigid surfaces," Advances in Neural Information Processing Systems, 2005.
[11] Leonardo DiCaprio Club, 2020 [Online]. URL: https://ru.fanpop.com/clubs/leonardo-dicaprio/images/10841990/title/leonardo-dicaprio-photo.
[12] How Stuff Works. How Microsoft Kinect Works, 2020 [Online]. URL: https://electronics.howstuffworks.com/microsoft-kinect1.htm.
[13] ZED 2. Stereolabs, 2020 [Online]. URL: https://www.stereolabs.com/zed-2.
[14] K. He, "Mask R-CNN," Proceedings of the IEEE International Conference on Computer Vision, 2017.

Data Science
VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)