A Machine Learning-Based Framework for Real-Time 3D Reconstruction and Space Utilization in Built Environments

Abhishek Mukhopadhyay1,*, Samarth Patel1, Priyavrat Sharma1 and Pradipta Biswas1
1 Indian Institute of Science, Bangalore, India

Abstract
The process of 3D reconstruction involves transforming 2D images or data into a three-dimensional representation of an object, model, or environment. While supervised 3D reconstruction has made significant strides using deep neural networks, it is often time-consuming due to extensive image stitching and the requirement for specialized imaging sensors such as 360-degree or depth cameras. This paper introduces a machine learning-based 3D reconstruction framework aimed at making informed decisions regarding space utilization and asset management within any built environment. The proposed system comprises three key components: (I) object detection on 2D frames to identify target objects, (II) calculation of their pose using image processing techniques, and (III) utilization of an artificial neural network to map real and virtual environments. The evaluation showed that YOLOv7 detects the objects of interest with an F1 score of 0.70. Pose estimation analysis indicated that the proposed algorithm could estimate object orientation with an average error of 8.03°. The mapping algorithm exhibited high-quality performance, achieving a coefficient of determination of R² = 0.97. Ultimately, all this information is transmitted and visualized in the reconstructed virtual model, enabling remote monitoring and simulation.

Keywords
3D reconstruction, Digital twin, Object detection, Pose estimation, Virtual reality, Space utilization

Joint Proceedings of the ACM IUI Workshops 2024, March 18-21, 2024, Greenville, South Carolina, USA
* Corresponding author.
Email: abhishekmukh@iisc.ac.in (A. Mukhopadhyay); samarthpatel@iisc.ac.in (S. Patel); priyavrats@iisc.ac.in (P. Sharma); pradipta@iisc.ac.in (P. Biswas)
ORCID: 0000-0002-4341-0523 (A. Mukhopadhyay); 0009-0006-0705-3465 (S. Patel); 0009-0002-5063-141X (P. Sharma); 0000-0003-3054-6699 (P. Biswas)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction
Reconstructing a real-world scenario in three dimensions (3D) entails creating a 3D model from 2D images, point clouds, silhouettes, and similar data sources [1]. The process aims to generate a virtual representation applicable in visualization, animation, simulation, and analysis across fields like computer vision, robotics, and virtual reality. In computer vision research, significant attention has been given to 3D reconstruction, with a focus on areas such as structure from motion [2] or multi-view stereo [3]. These methods rely on multiple images to establish accurate correspondences or ensure comprehensive coverage, but they can be time-consuming due to the extensive image stitching and the requirement for specialized imaging sensors like 360-degree or depth cameras. Reconstructing built environments, such as car or ship interiors, is comparatively more straightforward, given the availability of standard computer-aided design (CAD) models.
Previous works [4, 5, 6] have developed virtual models by leveraging architectural drawings and integrating 3D furniture models. However, this process is time-consuming, as it relies on the modeling, physics simulation, and rendering capabilities of commercial game engines. The advent of deep neural networks has facilitated the incorporation of real-world objects into virtual spaces by learning from large datasets. CAD models of real-world objects assist in learning the mapping from images to virtual models, streamlining the process. Once a virtual reality (VR) model is established, it can be connected to real-time imaging and environmental sensors, enabling the estimation of room occupancy and power consumption within built environments [5]. In this context, we propose a digital twin (DT) of a built environment through an interactive and immersive VR experience. A digital twin represents an object or system virtually throughout its lifecycle, updated with real-time data and employing simulation, machine learning, and reasoning to facilitate decision-making [7].

This paper aims to develop an efficient VR-based DT of built environments in real time, expediting the existing 3D reconstruction process. The system allows standard virtual walkthroughs and provides real-time room occupancy estimates, energy assessments, and the potential for expansion into asset tracking and maintenance. The detection and localization of objects in the real world involved deploying a pre-trained YOLOv7 model, which underwent transfer learning on a custom dataset. Image processing techniques were employed to estimate the pose of real-world objects and map them into a virtual environment. An extensive comparison study among machine learning models was conducted to determine an accurate mapping technique. Mapping two-dimensional coordinates onto the virtual camera feed establishes a connection between the real and virtual worlds, enabling the real-time simulation of object movements in physical space. The contributions of this work are as follows.

• Proposed a new way of reconstructing real-world space in a virtual environment in real time.
• Validated the mapping between the real and virtual environments by comparing several machine learning models.

The paper is organized as follows. Section 2 reviews the literature on different methods for object detection systems. Section 3 explains the training and evaluation of the object detection models, the comparison study between the machine learning models used for mapping, and the pose calculation technique in detail. Experiments and results are discussed in Section 4, followed by discussion and conclusion in Sections 5 and 6, respectively.

2. Related Work
In this section, previous works on mapping, pose estimation, and applications of object detection in digital twins are summarized.

2.1. Digital Twin
Research on automated construction of digital twins and the inclusion of secondary objects has employed a blend of simple image processing techniques and sophisticated deep learning. Commercial software such as EdgeWise, Pointfuse, PointCab, and Leica Cyclone Model leverages shape detection and fitting algorithms, with academic literature also using machine learning for secondary object detection [8, 9, 10, 11].
Traditional computer vision techniques and CNN-based methods, such as the DeepLab architecture, have also been used to classify object classes and generate walls, cable trays, and ventilation ducts [12, 13, 14, 15]. A noteworthy study utilized laser scanning, object detection, and OCR to enrich a digital twin with secondary objects and semantic information [16]. Despite their potential, deep learning approaches often require extensive labelled training data, a challenge that synthetic data generation techniques may alleviate [6, 16, 17, 18].

Perhaps the most closely related work to ours is by Zhou et al. [19], who used computer vision to update a BIM-based digital twin of a building in real time. Their approach incorporated YOLOv5 for object detection and utilized the Total3DUnderstanding method [20] for estimating object pose. This work claimed to achieve successful object capture, including desks, chairs, plants, and computer monitors, by transforming object coordinates from the image to the physical world and then to the digital twin. However, the paper did not report the accuracy of the object detection models, which is a crucial component of the pipeline; its focus was primarily on orientation correction. In contrast, our approach targets a larger set of object categories and includes a comprehensive report on the accuracy of the individual components: object detection, the mapping algorithm, and pose estimation. This individual accuracy analysis allows for a better understanding of any accuracy limitations in the deployment process.

2.2. Mapping Between Real and Virtual Space
The coordinates of the detected objects have to be mapped into the virtual 3D space of Unity. Sun et al. [21] focused on mapping virtual space onto real space using planar mapping, exploring the methods and algorithms employed to achieve accurate and efficient mapping. Ren et al. [22] employed a spatial affine transformation to map virtual objects onto 2D images, aiming to integrate virtual objects into real-world scenes seamlessly. Huang et al. [23] proposed an algorithm specifically designed for mapping real space to a virtual globe space, addressing the need for robust and efficient mapping techniques applicable to diverse real-world environments. Schwarz [24] discussed the usage of LIDAR systems by participants in a DARPA-sponsored event to generate a digital view of the surrounding terrain for autonomous vehicles, emphasizing the crucial role of LIDAR technology in facilitating accurate perception and navigation in autonomous systems. Mandli Communications [25] employed LIDAR sensors to generate point clouds and subsequently create digital terrain models, showcasing the practical application of LIDAR for terrain modelling purposes.

2.3. Pose Estimation
The classical approach traditionally treated pose estimation as a nonlinear least-squares problem, employing nonlinear optimization algorithms to solve it [26, 27, 28]. Patil et al. [29] investigated and compared various vision-based, hybrid, and deep learning-based approaches for pose estimation from monocular vision. Similarly, Lan et al. [30] provided an overview of different deep learning-based techniques for human pose estimation. Another notable contribution by Collet et al.
[31] is the MOPED (Multiple Object Pose Estimation and Detection) framework, designed to deliver robust performance in complex scenes and low latency for real-time applications. Viksten et al. [32] developed a system that leverages algorithmic multi-cue integration (AMC) and temporal multi-cue integration (TMC) to increase pose estimation performance.

2.4. Deep Learning in Digital Twin
Deep learning is a subset of a larger family of machine learning approaches; it can take data and automatically perform tasks like classification, regression, clustering, and pattern recognition. Deep learning is very effective in the detection of complicated structures and objects [33]. Ogunseiju et al. [34] investigated the effectiveness of a variety of deep CNNs for recognizing construction worker activities from images of time-series signals, using large datasets for training and testing the deep learning algorithms. Deep learning has proved to be very good at object detection [35], and digital twins involving humans have been proposed by many researchers in construction to prevent casualties and improve the ergonomics of workers; Boton [36] explored introducing a temporal dimension into the 3D simulation of construction activities.

Summarizing the literature survey, this paper aims to address several key research gaps in the field of digital twin construction. Certain limitations are inherent in past work, including the high cost associated with depth cameras and the inability to capture specular and transparent objects. In addition, the manual intervention required to verify detected objects in the point clouds results in a time-consuming process [16]. Furthermore, while deep learning techniques have been applied to detect building elements, the literature lacks evidence of their utilization for identifying room furniture and other entities, which assists in understanding space utilization. Moreover, there has been limited exploration in the architecture, engineering, and construction (AEC) domain concerning the mapping of secondary objects. Zhou et al. [19] utilize a 3D estimation network for extracting the pose of each object and a camera-BIM location transformation algorithm for mapping the coordinates of the detected objects in BIM. In contrast, our approach utilizes image processing steps and the minimum area rectangle method for pose estimation, and a machine learning-based algorithm for mapping. By addressing these research gaps, this paper enhances the efficiency, accuracy, and comprehensiveness of digital twin creation and updating in real time using computer vision techniques.

3. Proposed Approach
The proposed 3D reconstruction process begins with the detection of objects in the real-world environment. While numerous pre-trained models are available for this purpose, our study relies on the findings of previous work [37, 38], which demonstrate that YOLO performs better in terms of the trade-off between latency and accuracy. YOLOv7 was chosen over the other YOLO variants following a comparative analysis. The detected objects are further classified into three categories: movable, partially movable, and immovable. Movable objects encompass persons and chairs; the partially movable objects are keyboards, laptops, TVs/monitors, and mice; the remaining objects are deemed immovable. This distinction is crucial to maintain real-time accuracy and resource efficiency: movable and partially movable objects require frequent updates due to their volatile positions, while immovable objects, remaining static most of the time, necessitate fewer updates. Accordingly, movable objects are updated every frame, partially movable objects are updated every 5 minutes, and immovable objects are refreshed every 24 hours. This classification of objects into groups helps the proposed system reconstruct the real world in real time (a minimal sketch of this update policy appears below).

Figure 1: Flow diagram of the proposed system
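As an illustration only, the following Python sketch shows how such a tiered refresh policy could be scheduled. The category assignments and intervals follow the description above, but the helper names (maybe_update and the detection dictionaries it consumes) are hypothetical and not part of the original system.

```python
import time

# Illustrative refresh intervals (seconds) for the three object categories
# described above; helper names below are placeholders, not the paper's code.
REFRESH_INTERVAL = {
    "movable": 0.0,            # every frame (persons, chairs)
    "partially_movable": 300,  # every 5 minutes (keyboard, laptop, TV/monitor, mouse)
    "immovable": 86400,        # every 24 hours (desk, refrigerator, couch, ...)
}

CATEGORY_OF = {
    "person": "movable", "chair": "movable",
    "keyboard": "partially_movable", "laptop": "partially_movable",
    "tv/monitor": "partially_movable", "mouse": "partially_movable",
    "desk": "immovable", "refrigerator": "immovable", "couch": "immovable",
}

last_update = {}  # class label -> timestamp of the last push to the virtual scene

def maybe_update(detections, now=None):
    """Return only those detections whose category is due for a refresh."""
    now = time.time() if now is None else now
    due = []
    for det in detections:  # det = {"label": ..., "bbox": ...}
        category = CATEGORY_OF.get(det["label"], "immovable")
        if now - last_update.get(det["label"], 0.0) >= REFRESH_INTERVAL[category]:
            due.append(det)
            last_update[det["label"]] = now
    return due  # these would then be handed to the mapping stage
```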
We also compare linear regression with degrees 1, 2, and 3, support vector regression (SVR) with radial basis function (RBF) and polynomial kernels [39], and a vanilla neural network to map between real-world and virtual objects. The pose estimation of objects is performed using a combination of different image processing techniques. The position and orientation of the objects are communicated to the game engine using socket programming for real-time interaction and synchronization. Figure 1 represents the working of the proposed system.

3.1. Object Detection
In this section, the dataset preparation strategy is analysed in detail, followed by a comparison study between the object detection models considered in this study.

Dataset Preparation: The dataset encompasses various day-to-day office utilities, including chairs, keyboards, laptops, TVs/monitors, refrigerators, desks, mice, persons, and couches. The dataset used in this study was derived from two sources. The first source involved filtering the MS COCO dataset [40] to extract the above-mentioned class labels. The second source was a crowd-sourced dataset collected for this research. The images obtained from the user-generated dataset were annotated with the nine class labels. To annotate the images, the Computer Vision Annotation Tool (CVAT) was employed, allowing for manual annotation through visual inspection. The dataset was then divided into train, validation, and test subsets, with each class label having the entries specified in Table 1. The annotations were compiled into an XML file format using the CVAT tool. Subsequently, the XML format was converted to the YOLO format, which was utilized for training the object detection model.

Table 1
Instances of class labels in train, validation, and test data

Class          Train    Validation    Test
Chair          12322    7729          1813
Keyboard       1002     585           175
Laptop         1520     1092          231
TV/Monitor     2128     1384          479
Refrigerator   824      629           126
Desk           590      96            43
Mouse          876      464           125
Person         81153    53753         10995
Couch          1741     1192          261

Comparison between Object Detection Models: We compared YOLO models (v4 to v7) to choose the best model. Using transfer learning, we leveraged the pretrained weights of these models on a large-scale dataset and fine-tuned each model on the bespoke dataset, resulting in improved object detection capabilities within our digital twin environment. Model training was conducted on an NVIDIA RTX 3090 Ti GPU, and performance was assessed using three evaluation metrics: mean Intersection over Union (mIoU), precision, and F1 score; a minimal sketch of these metrics appears below.
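For reference, the sketch below illustrates how the evaluation metrics named above are commonly computed: IoU between an axis-aligned predicted box and a ground-truth box, and precision/recall/F1 from matched-detection counts. The function names and the example numbers are illustrative assumptions, not the paper's actual evaluation code or data.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from matched-detection counts; a detection
    typically counts as a true positive when its IoU with a ground-truth box
    exceeds a threshold such as 0.5."""
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1

# Example with made-up numbers (not the paper's measurements)
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))   # ~0.47
print(precision_recall_f1(tp=70, fp=27, fn=33))  # precision ~0.72, F1 ~0.70
```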
3.2. Mapping Techniques
In the next step, the detected objects were used to map corresponding objects in a virtual environment created using Unity 3D. The mapping focused on three out of six degrees of freedom (DOFs), specifically translation along the X and Y axes and rotation around the yaw. This selection of three DOFs was specific to this problem, given that all objects are situated on a 2D plane, such as the floor or a tabletop. We mapped the center of the base from the 2D frame to the appropriate plane in the virtual environment. The extracted center coordinates were utilized as inputs for various regression algorithms, including linear regression with degrees 1, 2, and 3, support vector regression (SVR) with radial basis function (RBF) and polynomial kernels [39], and a vanilla ANN. Linear regression with degree 1 fits a straight-line relationship between the features and the target. When degree 2 is used, quadratic terms are introduced to capture curved patterns; when degree 3 is employed, cubic terms are added, enabling the model to fit S-shaped patterns and capture more complex relationships. SVR with an RBF kernel is a powerful non-linear regression approach: using a Gaussian-like function, it transforms the input space to capture complex connections and trends. Similarly, polynomial relationships can be captured using SVR with a polynomial kernel, which converts the data into a higher-dimensional space and uses polynomial functions to assess similarity.

The architecture of the ANN consists of four Multi-Layer Perceptron (MLP) blocks, with three of them responsible for the mapping process in conjunction with the input (Figure 2). We propose a hybrid structure using a weighted summation of the outputs of these MLP blocks. The structure of the first three blocks is 2-4-2 (input layer-hidden layer-output layer), and for the fourth block (the weight block) it is 2-8-4 (input layer-hidden layer-output layer). The activations used in the first three blocks are linear, sigmoid, and tanh, respectively, and the weight block uses linear activation. The model was trained using the Adam optimizer with mean squared error as the loss function. The fourth block produces weights for all four outputs, scaling each output according to its specific degree of non-linearity. Feature fusion is performed as described in Equation (1):

O(x, y) = α · B1(x, y) + β · B2(x, y) + δ · B3(x, y) + γ · I(x, y)    (1)

where α, β, δ, γ denote the weight parameters, B1, B2, and B3 indicate the three blocks connected in parallel, and O(x, y) is the weighted sum of the parallel outputs and the input.

Each algorithm was trained using a dataset consisting of frame coordinates and corresponding virtual-world coordinates. Frame coordinates are generated from the bounding box locations on the screen produced by the object detection model. Correspondingly, we placed objects at the same locations in the virtual environment and recorded the virtual-world coordinates. After training, the algorithms could predict the coordinates of objects within the virtual environment from the coordinates generated by the object detection model. The predicted coordinates were then transmitted to the virtual environment using socket programming. This communication facilitated the dynamic mapping of the detected objects within the virtual environment, resulting in an immersive and interactive user experience. A minimal sketch of the fusion network described above appears below.
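As a concrete illustration, the following PyTorch sketch reproduces the weighted-fusion structure of Equation (1) under the layer sizes and activations stated above. The paper does not specify the framework or training details beyond the Adam optimizer and mean squared error, so the code below (the class name FusionMapper, the learning rate, and the synthetic coordinate pairs) is an assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionMapper(nn.Module):
    """Sketch of the weighted-fusion mapping network of Equation (1):
    three parallel 2-4-2 MLP blocks (linear / sigmoid / tanh hidden
    activations) plus a 2-8-4 weight block producing the scalars
    alpha, beta, delta, gamma used to fuse the block outputs and the input."""

    def __init__(self):
        super().__init__()
        self.b1 = nn.Sequential(nn.Linear(2, 4), nn.Identity(), nn.Linear(4, 2))
        self.b2 = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid(), nn.Linear(4, 2))
        self.b3 = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 2))
        self.weight_block = nn.Sequential(nn.Linear(2, 8), nn.Linear(8, 4))

    def forward(self, xy):                        # xy: (batch, 2) frame coordinates
        w = self.weight_block(xy)                 # (batch, 4) -> alpha, beta, delta, gamma
        a, b, d, g = w[:, 0:1], w[:, 1:2], w[:, 2:3], w[:, 3:4]
        return a * self.b1(xy) + b * self.b2(xy) + d * self.b3(xy) + g * xy

# Illustrative training loop on synthetic pairs of frame and virtual coordinates.
model = FusionMapper()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
frame_xy = torch.rand(256, 2)                     # stand-in for detected 2D centres
virtual_xy = frame_xy * 3.0 + 0.5                 # stand-in for measured Unity coordinates
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(frame_xy), virtual_xy)
    loss.backward()
    optimizer.step()
```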
Figure 2: The proposed neural network architecture for mapping objects between the real and virtual worlds. Here B1, B2, and B3 indicate the three MLP blocks, whereas W indicates the weight block that derives the fusion weights.

3.3. Pose Correction
A series of image processing steps was followed to estimate the pose of each object in real-world space during digital twin reconstruction. First, the region of interest (ROI) was obtained from the coordinates generated by the object detection model. Subsequently, image sharpening was applied to enhance the visibility of the objects in the images. Next, segmentation was performed to isolate the required objects from the background. The resulting segmented images were then converted into a binary grayscale format, simplifying the subsequent image analysis. To further refine the binary images, a morphological process was applied to fill in small patches present in them. Following this, contour fitting was performed on the binary images to accurately obtain the shapes and boundaries of the objects. Finally, the minimum area rectangle method was utilized to determine the precise positions and orientations of the objects in the images. This comprehensive approach facilitated the acquisition of more accurate representations of the objects' poses in the digital twin environment. By implementing these image processing steps, the objects were successfully detected and mapped in the digital twin, enabling the creation of an accurate representation of the physical world. The proposed pose detection algorithm is described in Figure 3.

Figure 3: Image processing steps involved in pose estimation

3.4. Placing in 3D Virtual Environment
The virtual environment reconstruction involves the transmission of pose and location data for target objects from the computer vision algorithms to a virtual environment. Prior to receiving the data stream, a virtual representation of the real-world space is constructed using a game engine, incorporating precise real-world measurements. The virtual environment also captures and represents the movement of individual people. This is achieved by mapping people's movement in each frame and depicting their actions, such as sitting at specific workstations or moving around. Walking animations are employed to visualize the movement of people.

4. Experimental Evaluation
In this section, we describe the results of the individual components of the proposed system in detail.

4.1. Object Detection Accuracy
The accuracy of the object detection models was evaluated using several metrics, which measure each model's performance in identifying and localizing objects in the test data. Table 2 illustrates the performance of the models on the test data. YOLOv5 showed the highest IoU score, while YOLOv7 exhibited consistent performance across classes, with the highest F1 score among the models. Moreover, YOLOv7 demonstrated a processing speed of 30.23 frames per second.

Table 2
Comparing performance between object detection models

Models    Average IoU    Precision    F1 Score
YOLOv4    0.28           0.73         0.41
YOLOv5    0.59           0.63         0.63
YOLOv6    0.52           0.56         0.56
YOLOv7    0.56           0.72         0.70

4.2. Mapping Accuracy
The accuracy of the mapping algorithms is measured using the Euclidean distance between the ground-truth and predicted virtual-world positions. Figure 4 summarizes the comparison between the machine learning models discussed in Section 3.2 and deployed for mapping. It may be noted that the neural network model achieved the highest accuracy, with an error of 80 centimeters (a short sketch of how these error metrics can be computed follows below).
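As an illustration, the sketch below shows one way the reported mapping metrics could be computed from paired ground-truth and predicted virtual-world positions, assuming NumPy and scikit-learn are available; the helper name and the example coordinates are hypothetical, not the paper's data.

```python
import numpy as np
from sklearn.metrics import r2_score

def mapping_error(ground_truth, predicted):
    """Mean Euclidean distance (in virtual-world units) between paired
    ground-truth and predicted positions, plus the coefficient of
    determination over both coordinate axes."""
    ground_truth = np.asarray(ground_truth, dtype=float)  # shape (n, 2): x, y
    predicted = np.asarray(predicted, dtype=float)
    distances = np.linalg.norm(ground_truth - predicted, axis=1)
    return distances.mean(), r2_score(ground_truth, predicted)

# Illustrative values only (not the paper's measurements)
gt   = [[1.0, 2.0], [3.5, 0.5], [2.2, 4.1]]
pred = [[1.1, 2.3], [3.3, 0.4], [2.0, 4.4]]
mean_err, r2 = mapping_error(gt, pred)
print(f"mean error = {mean_err:.2f} units, R^2 = {r2:.2f}")
```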
Figure 5 shows the correlation graph, with a coefficient of determination of R² = 0.97, between the actual distances measured in the virtual environment and those predicted by the proposed neural network model.

Figure 4: Comparison of mapping accuracy between different models
Figure 5: Scatter plot of actual distances and distances predicted by the neural network model in the virtual environment

4.3. Pose Accuracy
We report pose accuracy by comparing the actual orientation of an object with the orientation measured by the algorithm. We marked a Cartesian coordinate system on a table, with the angles annotated, and then placed a keyboard and a TV on the table. We noted their actual angles with respect to the reference line and the orientations generated by the algorithm. Figure 6 shows the correlation between actual and predicted orientation. It may be noted that the coefficient of determination was observed to be R² = 0.99, while the average error was found to be 8.03°. The mapping between real-world objects and corresponding virtual objects is exemplified in Figure 7, providing a concrete illustration. In this scenario, the focus was only on the yaw angle for positioning the movable objects (keyboards, laptops, TVs/monitors, and desks) on a level 2D plane. Considering that these objects rest on a flat plane, any rotation around the pitch or roll angles in a 3D scene was deemed unnecessary or not applicable for the specific context being addressed. A minimal sketch of the orientation-estimation step appears below.

Figure 6: Correlation graph between the actual angle and the angle estimated by the pose estimation algorithm
Figure 7: Example mapping between real-world objects and their corresponding virtual objects
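For reference, the following is a minimal OpenCV sketch of the orientation-estimation step described in Section 3.3 (ROI extraction, sharpening, segmentation, morphology, contour fitting, minimum-area rectangle). The kernel sizes, the choice of Otsu thresholding, and the function name are illustrative assumptions rather than the authors' exact parameters.

```python
import cv2
import numpy as np

def estimate_yaw(frame, bbox):
    """Approximate the in-plane orientation of an object inside a detection
    box using the minimum-area-rectangle step of Section 3.3.
    bbox = (x1, y1, x2, y2) in pixel coordinates."""
    x1, y1, x2, y2 = bbox
    roi = frame[y1:y2, x1:x2]

    # Sharpen to make object edges more prominent (kernel is illustrative).
    sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    roi = cv2.filter2D(roi, -1, sharpen)

    # Segment the object from the background; Otsu thresholding is one option.
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Morphological closing fills small holes in the binary mask.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    # Fit contours and take the largest one as the object outline.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)

    # Minimum-area rectangle gives centre, size and in-plane rotation (degrees).
    (cx, cy), (w, h), angle = cv2.minAreaRect(largest)
    return (x1 + cx, y1 + cy), angle
```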
5. General Discussion
This paper presents a data-driven approach for reconstructing a digital twin of an office space environment. Previous work [5] proposed a VR digital twin of a physical space for measuring room occupancy and social distancing. However, the main problem with that approach lies in replicating the physical space: it is a time-consuming process, as it requires the modeling, physics simulation, and rendering capabilities of Unity 3D. The proposed system, in contrast, offers a different and beneficial approach to constructing VR digital twins compared to conventional methods. By incorporating machine learning techniques, the development of digital twins becomes more scalable and efficient. Machine learning enables accurate object detection, orientation estimation, and 2D-to-3D mapping, resulting in high-quality virtual workspaces. This approach addresses challenges associated with asset placement, enhances space planning and visualization, and promotes energy efficiency and sustainability within workspaces. In this section, we summarize the key points discussed in Section 1.

Reconstruction Techniques: Various techniques are available to reconstruct a space in 3D from 2D images. Cutting-edge automated image orientation techniques, such as Structure from Motion, and dense image matching methods, such as Multiple View Stereo, are widely utilized for deriving 3D information from 2D images and can yield 3D outcomes such as point clouds or meshes with diverse levels of geometric accuracy and visual fidelity. These techniques require many images from various directions and take ample time to reconstruct an object for a given instant, making it harder to reconstruct a real-time scenario of the real world. This paper uses a novel data-driven approach, which utilizes only one 2D image of the real world and reconstructs it in virtual reality. This reconstruction happens in two stages: the first stage detects objects in the real world from 2D images and calculates their pose using various image processing steps, and the second stage maps the objects from the real world to the virtual world. We utilized YOLOv7 as our object detection model; it performed well, with an F1 score of 0.70. This is beneficial for reconstruction purposes, as it can detect small objects with high accuracy, e.g., a mouse with a mAP of 0.81. The proposed pose detection algorithm showed its efficiency with an average error of 8.03°. This high accuracy of pose estimation helps detect orientation in the real world so that it can be reflected in the corresponding virtual environment. Thus, the proposed system will be impactful in creating a digital twin. A supplementary video (https://youtu.be/advtKAQ02Nk) shows the working of the proposed system as well as how accurately the real world is depicted in the virtual environment.

Mapping Techniques: There is a plethora of mapping techniques available for reconstructing a real space. One widely used technique is COLMAP [2], an end-to-end image-based 3D reconstruction pipeline. It employs Multi-View Stereo (MVS) to compute depth and/or normal information for every pixel in an image, using the output of Structure-from-Motion (SfM) [2, 3]. By fusing the depth and normal maps of multiple images in 3D, a dense point cloud of the scene is generated. However, this technique requires many images from different viewpoints with high visual overlap, making it slower and more time-consuming when creating a representation of a real-world scenario at a specific moment in time. In this paper, after comparing various classical machine learning techniques, it was determined that the neural network was the most suitable for the task, as it is designed to handle the different degrees of non-linearity at different points. The neural network achieved an average error of only 80 cm when mapping objects into the virtual world, 43.66% lower than that of the second-best model, which was linear regression. The proposed system holds potential for application on workshop floors, facilitating remote monitoring, asset tracking, and various other functions.

The proposed approach employs machine learning techniques to reconstruct digital twins of physical spaces efficiently, streamlining the process and enhancing accuracy in object detection, orientation estimation, and mapping. Addressing limitations in standard 3D reconstruction, such as missing CAD/BIM data of objects, extensive object volumes, or wide capture areas, the approach simplifies real-time reconstruction using a single 2D image. This streamlined process demonstrates promising potential for swift and precise translations from the real world to the virtual realm, contrasting with time-consuming traditional methods.

6. Conclusion
This study endeavors to devise a cost-effective solution for constructing a virtual model of a built environment.
The suggested system serves as a VR-based digital replica of an office, integrating real-time monitoring of human occupancy and asset utilization. The system's accuracy hinges on the performance of the object detection model, which achieved a high level of precision, with an F1 score of 0.70. The pose estimation algorithm effectively corrects the orientation of movable objects, such as keyboards, monitors, and desks, exhibiting a high correlation (R² = 0.99). The proposed neural network model successfully maps objects from the 2D image plane to a 3D plane in the virtual environment, demonstrating a correlation of R² = 0.97. Real-time mapping of human positions and precise estimation of asset poses offer numerous advantages, enabling the floor management team to conduct thorough remote walkthroughs and gain insights into room occupancy and office asset utilization. This information empowers informed decisions on sustainable space usage and asset management. However, challenges remain in mapping objects from the 2D plane to the 3D plane. Future work will focus on implementing 3D object detection, promising accurate positioning and orientation of real-world objects. The proposed framework underwent testing in various office spaces, with all data transmitted to the digital twin of the real-world space (please refer to https://youtu.be/advtKAQ02Nk).

References
[1] Paperswithcode, 3D reconstruction, https://paperswithcode.com/task/3d-reconstruction/, 2023. Accessed 18 April 2023.
[2] J. L. Schonberger, J.-M. Frahm, Structure-from-motion revisited, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4104–4113.
[3] J. L. Schönberger, E. Zheng, J.-M. Frahm, M. Pollefeys, Pixelwise view selection for unstructured multi-view stereo, in: Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, Springer, 2016, pp. 501–518.
[4] A. Mukhopadhyay, G. R. Reddy, S. Ghosh, M. LRD, P. Biswas, Validating social distancing through deep learning and VR-based digital twins, in: Proceedings of the 27th ACM Symposium on Virtual Reality Software and Technology, 2021, pp. 1–2.
[5] A. Mukhopadhyay, G. R. Reddy, K. S. Saluja, S. Ghosh, A. Peña-Rios, G. Gopal, P. Biswas, Virtual-reality-based digital twin of office spaces with social distance measurement feature, Virtual Reality & Intelligent Hardware 4 (2022) 55–75.
[6] A. Mukhopadhyay, G. Rajshekar Reddy, I. Mukherjee, G. Kumar Gopal, A. Pena-Rios, P. Biswas, Generating synthetic data for deep learning using VR digital twin, in: Proceedings of the 2021 5th International Conference on Cloud and Big Data Computing, 2021, pp. 52–56.
[7] IBM, What is a digital twin?, https://www.ibm.com/topics/what-is-a-digital-twin/, 2023. Accessed 18 April 2023.
[8] U. Krispel, H. L. Evers, M. Tamke, R. Viehauser, D. W. Fellner, Automatic texture and orthophoto generation from registered panoramic views, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 40 (2015) 131–137.
[9] L. Díaz-Vilariño, H. González-Jorge, J. Martínez-Sánchez, H. Lorenzo, Automatic lidar-based lighting inventory in buildings, Measurement 73 (2015) 544–550.
[10] T. Czerniawski, M. Nahangi, C. Haas, S. Walbridge, Pipe spool recognition in cluttered point clouds using a curvature-based shape descriptor, Automation in Construction 71 (2016) 346–358.
[11] P. Kim, J. Chen, Y. K.
Cho, Building element recognition with thermal-mapped point clouds, in: 34th International Symposium on Automation and Robotics in Construction (ISARC 2017), 2017.
[12] I. Anagnostopoulos, V. Pătrăucean, I. Brilakis, P. Vela, Detection of walls, floors, and ceilings in point cloud data, in: Construction Research Congress 2016, 2016, pp. 2302–2311.
[13] J. Han, M. Rong, H. Jiang, H. Liu, S. Shen, Vectorized indoor surface reconstruction from 3D point cloud with multistep 2D optimization, ISPRS Journal of Photogrammetry and Remote Sensing 177 (2021) 57–74.
[14] J. Guo, Q. Wang, J.-H. Park, Geometric quality inspection of prefabricated MEP modules with 3D laser scanning, Automation in Construction 111 (2020) 103053.
[15] T. Czerniawski, F. Leite, Automated segmentation of RGB-D images into a comprehensive set of building components using deep learning, Advanced Engineering Informatics 45 (2020) 101131.
[16] V. Drobnyi, Z. Hu, Y. Fathy, I. Brilakis, Construction and maintenance of building geometric digital twins: state of the art review, Sensors 23 (2023) 4382.
[17] S. I. Nikolenko, Synthetic-to-real domain adaptation and refinement, in: Synthetic Data for Deep Learning, Springer, 2021, pp. 235–268.
[18] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, S. Birchfield, Training deep networks with synthetic data: Bridging the reality gap by domain randomization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 969–977.
[19] X. Zhou, K. Sun, J. Wang, J. Zhao, C. Feng, Y. Yang, W. Zhou, Computer vision enabled building digital twin using building information model, IEEE Transactions on Industrial Informatics 19 (2022) 2684–2692.
[20] Y. Nie, X. Han, S. Guo, Y. Zheng, J. Chang, J. J. Zhang, Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 55–64.
[21] Q. Sun, L.-Y. Wei, A. Kaufman, Mapping virtual and physical reality, ACM Transactions on Graphics (TOG) 35 (2016) 1–12.
[22] F. Ren, X. Wu, Outdoor augmented reality spatial information representation, Appl. Math 7 (2013) 505–509.
[23] W. Huang, J. Chen, A multi-scale VR navigation method for VR globes, International Journal of Digital Earth 12 (2019) 228–249.
[24] B. Schwarz, Mapping the world in 3D, Nature Photonics 4 (2010) 429–430.
[25] Mandli Communications, Maverick, https://www.mandli.com/maverick-by-mandli-communications/, 2023. Accessed 18 February 2021.
[26] G. H. Rosenfield, The problem of exterior orientation in photogrammetry, Photogrammetric Engineering 25 (1959).
[27] E. Thompson, The projective theory of relative orientation, Photogrammetria 23 (1968) 67–75.
[28] R. M. Haralick, L. G. Shapiro, Computer and Robot Vision, volume 1, Addison-Wesley, Reading, MA, 1992.
[29] A. V. Patil, P. Rabha, A survey on joint object detection and pose estimation using monocular vision, arXiv preprint arXiv:1811.10216 (2018).
[30] G. Lan, Y. Wu, F. Hu, Q. Hao, Vision-based human pose estimation via deep learning: a survey, IEEE Transactions on Human-Machine Systems 53 (2022) 253–268.
[31] A. Collet, M. Martinez, S. S. Srinivasa, The MOPED framework: Object recognition and pose estimation for manipulation, The International Journal of Robotics Research 30 (2011) 1284–1306.
[32] F. Viksten, R. Soderberg, K. Nordberg, C.
Perwass, Increasing pose estimation performance using multi-cue integration, in: Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA 2006), IEEE, 2006, pp. 3760–3767.
[33] J. Lee, M. Azamfar, J. Singh, S. Siahpour, Integration of digital twin and deep learning in cyber-physical systems: towards smart manufacturing, IET Collaborative Intelligent Manufacturing 2 (2020) 34–36.
[34] O. R. Ogunseiju, J. Olayiwola, A. A. Akanmu, C. Nnaji, Recognition of workers' actions from time-series signal images using deep convolutional neural network, Smart and Sustainable Built Environment 11 (2022) 812–831.
[35] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[36] C. Boton, Supporting constructability analysis meetings with immersive virtual reality-based collaborative BIM 4D simulation, Automation in Construction 96 (2018) 1–15.
[37] A. Mukhopadhyay, I. Mukherjee, P. Biswas, Comparing CNNs for non-conventional traffic participants, in: Proceedings of the 11th International Conference on Automotive User Interfaces and Interactive Vehicular Applications: Adjunct Proceedings, 2019, pp. 171–175.
[38] M. Carranza-García, J. Torres-Mateo, P. Lara-Benítez, J. García-Gutiérrez, On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data, Remote Sensing 13 (2020) 89.
[39] A. J. Smola, B. Schölkopf, A tutorial on support vector regression, Statistics and Computing 14 (2004) 199–222.
[40] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: common objects in context, in: D. J. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 – 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, Springer, 2014, pp. 740–755. doi:10.1007/978-3-319-10602-1_48.