3D Human Reconstruction using single 2D Image

Polina Katkova, Samara National Research University, Samara, Russia, Lin997@yandex.ru
Pavel Yakimov, Samara National Research University, Samara, Russia, yakimov@ssau.ru

Abstract—Computer Vision technology is developing rapidly nowadays. The need for 3D reconstruction methods grows along with the number of Computer Vision system implementations. The highest demand is for methods that use a single image as input data. This article provides an overview of existing methods for 3D reconstruction and an explanation of the planned implementation, which consists of a platform and a 3D reconstruction algorithm that uses a single image. The article also describes the implementation of a Telegram bot that allows anyone to test PIFu, and an overview of Mask R-CNN, which will be used later in this work.

Keywords—3D Reconstruction, 3D Human body recovery algorithms, PIFu algorithm, Telegram bot, Segmentation methods, Mask R-CNN

I. INTRODUCTION

At the moment, using a 3D model instead of a real physical object is often an important requirement. Digital copies provide more variety and flexibility to users than physical objects. Besides that, using a digital model can save a lot of time, because it can be used regardless of the location of the model prototype, and because parameter calculation can be several orders faster than with a real model [1].

Computer Vision (CV) systems are widely used nowadays. Most of them have only one camera, so they are not able to capture a set of images from different angles. Thus, the possibility to create 3D content from a single image is becoming highly relevant. Progress in methods such as deep learning, neural networks and segmentation algorithms helps to simplify the process of 3D reconstruction and will therefore help to develop different areas, such as CV or immersive technologies.

The range of 3D reconstruction applications also includes the following areas: medicine (e.g. computed tomography), Computer Vision (e.g. scene reconstruction, which can be used for calculating a trajectory of movement), microscopy, cinematography, animation, video tracking (e.g. biometric person identification), retail (e.g. online product demonstration in 3D), immersive technologies, et cetera.

The article contains an overview of frameworks for popular 3D human reconstruction methods. Most of these methods have been released in the past few years. The following three types of methods were considered: parametric methods, methods for recovering human shape and pose, and human body recovery methods. These methods use a combination of techniques such as Convolutional Neural Networks, semantic segmentation, marching cubes, et cetera.

In the case where the input data consists of multiple images taken from different angles of view (an example of the process of capturing an image set from different points of view is illustrated in Figure 1), the result of 3D reconstruction is almost unambiguous. Some years ago the company Autodesk released a product named ReCap, which is able to reconstruct a 3D model from an image set [2]. However, in reality there is a higher need for 3D reconstruction methods that work from a single image, because they have higher practical use.

Fig. 1. Example of a studio which allows capturing a set of images from different angles [4].

The problem of model reconstruction from a single image is the ambiguity of the back-side shape definition (the side which is not visible in the picture) [3]. There is a similar problem with texturing: the visible part of the texture can be partially copied, but the reverse side has to be calculated by an algorithm, which has to be implemented.

II. THE OVERVIEW OF EXISTING METHODS

A. Algorithms for recovering human shape and pose

The algorithm End-to-end Recovery of Human Shape and Pose (HMR) was released in 2018 [5]. It allows recovering a human model from a single image. Unlike other methods, End-to-end Recovery can determine the location of key joints even if the person in the photo is turned away.

Fig. 2. Scheme of the End-to-end Recovery of Human Shape and Pose algorithm [5].

The input data is an RGB image.
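The stages of such a pipeline, an RGB image encoded into a feature vector, 3D parameters regressed iteratively, and the result validated by a discriminator, can be sketched as follows. This is a simplified mock, not the authors' implementation: the function names, shapes and update rule are hypothetical stand-ins.

```python
import numpy as np

def encode(image):
    """Stand-in for the convolutional encoder: reduce the image to a
    fixed-length feature vector (here, simple channel statistics)."""
    return np.array([image.mean(), image.std(), image.max()])

def regress_3d(features, n_iters=3):
    """Iterative 3D regression: start from a neutral parameter vector and
    refine it a fixed number of times (iterative error feedback)."""
    params = np.zeros(5)                         # stand-in for pose/shape params
    for _ in range(n_iters):
        correction = 0.1 * (features.sum() - params)  # mock update rule
        params = params + correction
    return params

def discriminate(params):
    """Stand-in discriminator: accept parameter vectors whose magnitude
    lies within a plausible range."""
    return bool(np.all(np.abs(params) < 10.0))

image = np.random.default_rng(0).random((224, 224, 3))  # mock input RGB image
theta = regress_3d(encode(image))
print(discriminate(theta))
```

In the published method [5], the regressor predicts SMPL pose and shape parameters together with the camera, and the discriminator is trained adversarially so that implausible body configurations are rejected.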
Firstly, the image passes through a convolutional encoder, and the result is then sent to the 3D regression module, which iteratively minimizes the loss on the 3D model. Lastly, the result passes through the discriminator module, which determines whether the resulting 3D model belongs to a person or not. The scheme of the End-to-end Recovery of Human Shape and Pose algorithm is shown in Figure 2.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

A number of experimental studies have been conducted on this method. The comparison of 3D reconstruction losses for different methods is illustrated in Figure 3.

Fig. 3. Comparison of HMR with other methods by the criterion of 3D reconstruction loss [5].

The comparison of HMR with other methods by execution time is illustrated below, in Figure 4.

Fig. 4. Comparison of HMR with other methods by the criterion of time needed for 3D reconstruction [5].

Figures 3 and 4 show that End-to-end Recovery has the best results in comparison with the other methods.

The previous algorithm was able to recover the human shape, but not the shape of the clothes. The method SiCloPe: Silhouette-Based Clothed People was released in August 2019, and it is able to reconstruct human shapes together with clothes [6]. After the 3D model reconstruction process, SiCloPe recreates the model texture.

The algorithm consists of the following steps: firstly, it detects 2D human silhouettes and creates a map of 3D model joint locations; secondly, the method generates new 2D silhouettes of the model via the 3D joint location map; after that, SiCloPe reconstructs the 3D model using the set of 2D silhouettes from the second step. If the 2D silhouettes are built incorrectly, the mesh used for reconstruction will also fail to match the actual model. SiCloPe therefore uses a deep surface recognition algorithm which includes "greedy sampling"; using this algorithm guarantees that the reconstruction mesh will be correct. The last step of the algorithm is texturing the reconstructed model.

The scheme of the SiCloPe algorithm is illustrated below, in Figure 5.

Fig. 5. Overview of the SiCloPe: Silhouette-Based Clothed People framework [6].

The method PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization [7] was released in November 2019. This method allows reconstructing 3D models from one image or from a set of images. The distinctive feature of PIFu is high-quality texture reconstruction even on the parts of the object that are invisible in the picture. The algorithm is able to reconstruct even complicated figures, including crumpled clothes, high heels or complex hairstyles.

The PIFu algorithm consists of a convolutional encoder and a continuous implicit function. The overview of the PIFu framework is illustrated in Figure 6.

Fig. 6. The overview of PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization [7].

B. The parameterized algorithms of human recovery

The algorithm named Skinned Multi-Person Linear model (SMPL) [8] is one of the most popular parameterized algorithms for human body recovery. SMPL was released in 2015 and is still being used in other 3D reconstruction works, as part of an implementation or for comparison.

SMPL has been trained on several thousand 3D models of human bodies with different forms and figures. The recovered 3D model carries a map with weight data at each point of the body model, so the joints can look realistic when the model changes its pose. 3D models recovered via the SMPL algorithm can be used in programs such as Autodesk Maya or Unity, where they can later be animated.

The SMPL model is illustrated below, in Figure 7.

Fig. 7. SMPL model: (a) – human model with a weight grid, (b) – parametrized human model, (c) – human body model with blended shape applied, (d) – human body model in a pose [8].

Fig. 10. Results of the second experiment: (a) – input RGB image, (b) – mask for the input image, (c) – front view of the resulting 3D object, (d) – view of the resulting 3D object from the back side [11].

Another popular parametric algorithm is Shape Completion and Animation of People (SCAPE) [9]. SCAPE was published in 2005 in the ACM Transactions on Graphics journal. SCAPE allows combining a single scan of a person with a sequence of motion markers. As a result, the algorithm returns an animation made by mixing a body shape with a pose.

The algorithm consists of three parts: pose deformation, body shape deformation, and animation via motion capture data. The body shape can be deformed by changing a template shape with four possible parameters: height, weight, muscularity and gender. The overview of the deformation parameters is illustrated in Figure 8. In case the body scan is missing a part of the surface, SCAPE can complete the shape using the Correlated Correspondence (CC) algorithm [10]. The pose can be deformed via the CC algorithm as well.

The authors have created two data sets: the pose data set consists of 70 poses, and the shape data set consists of 45 different body shapes. The SCAPE algorithm can also be applied to shapes other than human.
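The core parametric idea behind SMPL-like models described above, a fixed template mesh deformed by a linear combination of learned shape directions, can be sketched as follows. This is a toy illustration with random stand-in arrays, not the released SMPL model: the real model has 6890 vertices and learned blend shapes, and additionally applies pose blend shapes and linear blend skinning, which are omitted here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_vertices, n_betas = 100, 10                   # toy sizes; real SMPL: 6890 vertices, 10 betas

template = rng.standard_normal((n_vertices, 3))             # mean body shape T
shape_dirs = rng.standard_normal((n_vertices, 3, n_betas))  # shape blend-shape directions S

def shaped_mesh(betas):
    """Apply shape parameters: V = T + S @ betas (per-vertex 3D offsets)."""
    return template + shape_dirs @ betas

# Zero betas reproduce the template (the "mean" body).
assert np.allclose(shaped_mesh(np.zeros(n_betas)), template)

betas = np.zeros(n_betas)
betas[0] = 2.0                                  # move along the first shape direction
v = shaped_mesh(betas)
print(v.shape)  # (100, 3)
```

The appeal of this formulation is that a whole space of body shapes is controlled by a handful of coefficients, which is exactly what makes parametric models convenient to animate in tools such as Maya or Unity.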
Fig. 11. Results of the third experiment: (a) – input RGB image, (b) – mask for the input image, (c) – front view of the resulting 3D object, (d) – view of the resulting 3D object from the back side.

Fig. 8. The four parameters for shape deformation in the SCAPE model.

C. The overview results

The future purpose of the current work is creating a virtual fitting room. It is proposed to use the PIFu algorithm for this purpose. The reasons for this choice are the open repository and the simple installation and running of PIFu. The implementation of PIFu uses an RGB image of a human body and a mask which allows detecting the human in the image. It is proposed to research the segmentation method Mask R-CNN and implement it for the future realization, so that a single image can be used as the input data for the PIFu algorithm. This method is overviewed in the next chapter.

III. EXPERIMENTAL RESEARCH

The PIFu algorithm has been tested on different data during the experimental studies. The input data consists of a photo and a mask created for this image in Photoshop (to separate the object from the background). The images have PNG format and a resolution of 720x1080 pixels. The input data and the results of each of the four experiments are illustrated in Figures 9–12. Figure 12 demonstrates that the result of the fourth experiment is not very precise and has a high loss.

Fig. 9. Results of the first experiment: (a) – input RGB image, (b) – mask for the input image, (c) – front view of the resulting 3D object, (d) – view of the resulting 3D object from the back side.

Fig. 12. Results of the fourth experiment: (a) – input RGB image, (b) – mask for the input image, (c) – front view of the resulting 3D object, (d) – view of the resulting 3D object from the back side.

The run time of the PIFu algorithm in the first experiment equals 8.92 seconds, in the second – 10.47 seconds, in the third – 7.19 seconds, and in the fourth – 7.34 seconds.

More result images are available in the following GitHub repository: https://github.com/thePolly/PIFu. This repository contains the code for the Telegram bot as well as the PIFu algorithm.

IV. IMPLEMENTATION

A. Proposed implementation

It is proposed that in this work a 3D human model recovery algorithm working from a single image has to be implemented. This algorithm can later be used for the implementation of a virtual fitting room. The methods overviewed earlier can be used as examples for the implementation. The method will consist of two convolutional encoders, one for the 3D model and one for texture reconstruction.

It is planned that for the realization a dataset will be created which includes a set of 2D images of people and a set of 3D models for these images. A Microsoft Kinect 2.0 camera and a ZED 2K stereo camera will be used for creating the 3D objects. The Microsoft Kinect camera uses an infrared laser for determining the depth of the image matrix. The optimal distance between objects and the Kinect camera is between one and four meters [12]. Unlike Kinect, the ZED camera has no infrared sensor; ZED uses methods which include artificial intelligence for determining the image depth [13].

B. Telegram bot

Anyone can use the Telegram bot as a platform for 3D reconstruction. Currently, the bot accepts a single image and a mask of this image as input data and returns a file with the resulting 3D model. For more details, the "/help" command can be used. The name of the bot is @human_body_recnstruction_bot.

C. Image segmentation

To make the process of using Computer Vision easier, a segmentation method has to be implemented. It will allow users to upload only one RGB image, without any mask.

The Mask R-CNN segmentation method was released in 2017 by Facebook AI Research [14]. The framework allows detecting multiple objects of different types in an image.

The Mask R-CNN framework is based on Faster R-CNN. Faster R-CNN has two outputs, a class label and a bounding-box offset; Mask R-CNN has an additional third output, a mask, which predicts the layout of the segmentation mask for a detected object. So, the loss for Mask R-CNN is defined as the sum of the losses for each output:

L = L_cls + L_box + L_mask,   (1)

where L_cls is the classification loss, L_box is the bounding-box loss and L_mask is the mask definition loss.

The framework allows choosing specific classes to detect. For the online fitting room implementation, the class list should contain only the class for human bodies. Examples of Mask R-CNN detection are illustrated in Figure 13.

Fig. 13. Mask R-CNN results on the COCO test set [14].

V. CONCLUSION

Thus, the following types of 3D reconstruction methods have been overviewed: recovery of human shape and pose, human model recovery, and parametrized human recovery. Most of these methods can accept a single image as input data.

The overview contains a description of the most popular 3D human reconstruction methods. Each overview describes methods which have been used in the implementation process. This paper may help in the design phase of method development. It can be used to understand which type of 3D reconstruction method has to be implemented depending on the task, and which technologies this method should include. Thus, to implement a virtual fitting room, it is appropriate to use a parametric method or a method for recovering human shape and pose, because then the resulting object will contain no clothing items.

In conclusion, the Telegram bot which allows testing the PIFu algorithm has been created, and the proposed realization has been set out. So, to create a virtual fitting room, firstly a dataset has to be compiled, then a method for 3D human pose and shape recovery has to be implemented, and the overviewed Mask R-CNN has to be implemented as well.

REFERENCES

[1] O.V. Evseev, "Implementation and research of models and algorithms of 3D reconstruction of cloud points defined by a sequence of parallel sections," Ph.D. thesis, 2016.
[2] "Autodesk ReCap," Autodesk Knowledge Network, 2019.
[3] W. Choi, "Understanding indoor scenes using 3D geometric phrases," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 33-40, 2013.
[4] Photogrammetry, 2019 [Online]. URL: https://imgur.com/gallery/yuEncdf/comment.
[5] A. Kanazawa, "End-to-end recovery of human shape and pose," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7122-7131, 2018.
[6] "SiCloPe: Silhouette-Based Clothed People," arXiv preprint arXiv:1901.00049v2.
[7] Sh. Saito, "PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization," Proceedings of the IEEE International Conference on Computer Vision, 2019.
[8] M. Loper, "SMPL: A skinned multi-person linear model," ACM Transactions on Graphics (TOG), vol. 34, no. 6, p. 248, 2015.
[9] D. Anguelov, "SCAPE: Shape completion and animation of people," ACM SIGGRAPH, pp. 408-416, 2005.
[10] D. Anguelov, "The correlated correspondence algorithm for unsupervised registration of nonrigid surfaces," Advances in Neural Information Processing Systems, 2005.
[11] Leonardo DiCaprio Club, 2020 [Online]. URL: https://ru.fanpop.com/clubs/leonardo-dicaprio/images/10841990/title/leonardo-dicaprio-photo.
[12] How Stuff Works. How Microsoft Kinect Works, 2020 [Online]. URL: https://electronics.howstuffworks.com/microsoft-kinect1.htm.
[13] ZED 2. Stereolabs, 2020 [Online]. URL: https://www.stereolabs.com/zed-2.
[14] K. He, "Mask R-CNN," Proceedings of the IEEE International Conference on Computer Vision, 2017.

Data Science
VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)