<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Machine Learning-Based Framework for Real-Time 3D Reconstruction and Space Utilization in Built Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abhishek Mukhopadhyay</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samarth Patel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Priyavrat Sharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pradipta Biswas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Science</institution>
          ,
          <addr-line>Bangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>21</lpage>
      <abstract>
<p>The process of 3D reconstruction involves transforming 2D images or data into a three-dimensional representation of an object, model, or environment. While supervised 3D reconstruction has made significant strides using deep neural networks, it is often time-consuming due to extensive image stitching and the requirement for specialized imaging sensors such as 360-degree or depth cameras. This paper introduces a machine learning-based 3D reconstruction framework aimed at making informed decisions regarding space utilization and asset management within any built environment. The proposed system comprises three key components: (I) object detection on 2D frames to identify target objects, (II) calculation of their pose using image processing techniques, and (III) utilization of an artificial neural network to map real and virtual environments. The evaluation using YOLOv7 demonstrated an F1 score of 0.70 in detecting objects of interest. Pose estimation analysis indicated that the proposed algorithm could estimate object orientation with an error rate of 8.03°. The mapping algorithm exhibited high-quality performance, achieving a coefficient of determination of R² = 0.97. Ultimately, all this information is transmitted and visualized in the reconstructed virtual model, enabling remote monitoring and simulation.</p>
      </abstract>
      <kwd-group>
<kwd>Soft continuum manipulator</kwd>
        <kwd>Soft snake robot</kwd>
        <kwd>Multi-modal interaction</kwd>
        <kwd>Hand gesture</kwd>
        <kwd>Eye tracker</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The process of reconstructing a real-world scenario in three dimensions (3D) entails creating a
3D model from 2D images, point clouds, silhouettes, and similar data sources [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The process
aims to generate a virtual representation applicable in visualization, animation, simulation,
and analysis across fields like computer vision, robotics, and virtual reality. In the realm of
computer vision research, significant attention has been given to 3D reconstruction, with a
focus on areas such as structure from motion [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or multi-view stereo [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These methods rely
on multiple images to establish accurate correspondences or ensure comprehensive coverage,
but they can be time-consuming due to the extensive image stitching and the requirement
for specialized imaging sensors like 360-degree or depth cameras. Reconstructing built
environments, such as car or ship interiors, is comparatively more straightforward, given the
availability of standard Computer-aided design (CAD). Previous works [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] have developed
virtual models by leveraging architectural drawings and integrating 3D furniture models.
However, this process is time-consuming, involving modeling, physics simulation, and rendering
capabilities of commercial game engines. The advent of deep neural networks has facilitated
the incorporation of real-world objects into virtual spaces by learning from large datasets. CAD
models of real-world objects assist in learning the mapping from images to virtual models,
streamlining the process. Once a virtual reality (VR) model is established, it can be connected to
real-time imaging and environmental sensors, enabling the estimation of room occupancy and
power consumption within built environments [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this context, we propose a digital twin
(DT) of a built environment through an interactive and immersive VR experience. A digital twin
represents an object or system virtually throughout its lifecycle, updated with real-time data and
employing simulation, machine learning, and reasoning to facilitate decision-making [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This
paper aims to develop an efficient VR-based DT for built environments in real-time, expediting
the existing 3D reconstruction process. This system allows standard virtual walkthroughs
and provides real-time room occupancy estimates, energy assessments, and the potential for
expansion into asset tracking and maintenance. The detection and localization of objects in
the real world involved deploying a pre-trained YOLOv7 model, which underwent transfer
learning on a custom dataset. Image processing techniques were employed to estimate the pose
of real-world objects and map them into a virtual environment. An extensive comparison study
among machine learning models was conducted to determine an accurate mapping technique.
Mapping two-dimensional coordinates onto the virtual camera feed establishes a connection
between the real and virtual worlds, enabling the real-time simulation of object movements in
physical space. The contributions of this work are as follows.
      </p>
<p>• Proposed a new way of reconstructing real-world space into a virtual environment in
real time.
• Validated the mapping between real and virtual environments by comparing several
machine learning models.</p>
<p>The paper is organized as follows. Section 2 reviews the literature on different methods
for object detection systems. Section 3 explains the training and evaluation of object detection
models, comparison studies between different machine learning models used for mapping, and
the pose calculation technique in detail. Experiments and results are discussed in detail in Section
4, followed by the discussion and conclusion in Sections 5 and 6, respectively.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section, the previous works on mapping, pose estimation and applications of object
detection on digital twin are summarized in detail.</p>
      <sec id="sec-2-1">
        <title>2.1. Digital Twin</title>
        <p>
          Research on automated construction of digital twins and the inclusion of secondary objects
has employed a blend of simple image processing techniques and sophisticated deep learning.
Commercial software such as EdgeWise, Pointfuse, Point Cab, and Leica Cyclone Model leverage
shape detection and fitting algorithms, with academic literature also using machine learning for
secondary object detection [
          <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">8, 9, 10, 11</xref>
          ]. Traditional computer vision techniques and CNN-based
methods, such as the DeepLab architecture, have also been used to classify object classes and
generate walls, cable trays, and ventilation ducts [
          <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
          ]. A noteworthy study utilized
laser scanning, object detection, and OCR to enrich a digital twin with secondary objects and
semantic information [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Despite their potential, deep learning approaches often require
extensive labelled training data, a challenge that synthetic data generation techniques may
alleviate [
          <xref ref-type="bibr" rid="ref16 ref17 ref18 ref6">6, 16, 17, 18</xref>
          ]. Perhaps, the most closely related work to ours is by Zhou et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ],
who used computer vision to update a BIM-based digital twin of a building in real-time. Their
approach incorporated YOLOv5 for object detection and utilized the Total3DUnderstanding
method [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] for estimating object pose. This work claimed to achieve successful object capture,
including desks, chairs, plants, and computer monitors, by transforming object coordinates
from the image to the physical world and then to the digital twin. However, the paper did not
provide a report on the accuracy of the object detection models, which is a crucial component
of the pipeline. Instead, their focus was primarily on orientation correction. In contrast, our
approach targets more significant object categories and includes a comprehensive report on the
accuracy of individual components: object detection, mapping algorithm, and pose estimation.
This individual accuracy analysis allows for a better understanding of any accuracy limitations
in the deployment process.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Mapping Between Real and Virtual Space</title>
        <p>
          The coordinates of the different detected objects have to be mapped into the virtual 3D
space of Unity. Sun et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] focused on mapping virtual space onto real space using planar
mapping, exploring the methods and algorithms employed to achieve accurate and efficient
mapping. Ren et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] employed spatial affine transformation to map virtual objects onto 2D
images, aiming to integrate virtual objects into real-world scenes seamlessly. Huang et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]
proposed an algorithm specifically designed for mapping real space to a virtual globe space,
addressing the need for robust and efficient mapping techniques applicable to diverse real-world
environments. Schwarz et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] discussed the usage of LIDAR systems by participants in a
DARPA-sponsored event to generate a digital view of the surrounding terrain in autonomous
vehicles, emphasizing the crucial role of LIDAR technology in facilitating accurate perception
and navigation in autonomous systems. Mandli Communications [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] employed LIDAR sensors
to generate point clouds and subsequently create digital terrain models, showcasing the practical
application of LIDAR for terrain modelling purposes.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Pose Estimation</title>
        <p>
          The classical approach for estimating pose traditionally treated it as a nonlinear least-squares
problem, employing nonlinear optimization algorithms for solving it [
          <xref ref-type="bibr" rid="ref26 ref27 ref28">26, 27, 28</xref>
          ]. Patil et al. [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]
investigated and compared various vision-based, hybrid, and deep learning-based approaches
for pose estimation from monocular vision. Similarly, Lan et al. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] provided an overview
of different deep learning-based techniques for human pose estimation. Another notable
contribution by Collet et al. [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] is the MOPED (Multiple Object Pose Estimation and Detection)
framework, designed to deliver robust performance in complex scenes and low latency for
real-time applications. Vikstenen et al. [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] developed a system that leverages algorithmic
multi-cue integration (AMC) and temporal multi-cue integration (TMC) to increase the pose
estimation performance.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Deep Learning in Digital Twin</title>
        <p>
          Deep learning is a subset of a larger family of ML approaches; it can take data and automatically
perform tasks such as classification, regression, clustering, and pattern recognition. Deep learning
is very effective at detecting complicated structures and objects [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. In another research,
Ogunseiju et al. [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] investigated the effectiveness of a variety of deep CNNs for recognizing
construction worker activities from images of signals from time-series data using large datasets
for training and testing DL algorithms. Deep learning has proved to be very good in object
detection [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], and the use of DTs involving humans has been proposed by many researchers
in construction activities to prevent casualties and improve worker ergonomics. Boton et
al. [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] explored introducing a temporal dimension into the 3D simulation of construction
activities. Summarizing the literature survey, this paper aims to address several key research
gaps in the field of digital twin construction. Certain limitations are inherent in past work,
including the high cost associated with depth cameras and the inability to capture specular and
transparent objects. In addition, the manual intervention required to verify detected objects in
the point clouds results in a time-consuming process [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Furthermore, while deep learning
techniques have been applied to detect building elements, the literature lacks evidence of their
utilization for identifying room furniture and other entities, which assists in understanding
space utilization. Moreover, there has been limited exploration in the architecture, engineering,
and construction (AEC) domain concerning the mapping of secondary objects. Zhou et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]
utilize a 3D estimation network for extracting the pose of each object and a camera-BIM location
transformation algorithm for mapping the coordinates of the detected objects in BIM. In contrast,
our approach utilizes image processing steps and the minimal area rectangle method for pose
estimation, and a Machine learning-based algorithm for mapping. By addressing these research
gaps, this paper enhances the efficiency, accuracy, and comprehensiveness of digital twin
creation and updating in real-time using computer vision techniques.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>
        The proposed 3D reconstruction process begins with the detection of objects in the real-world
environment. While numerous pre-trained models are available for this purpose, our study
relies on the findings of previous work [
        <xref ref-type="bibr" rid="ref37 ref38">37, 38</xref>
        ] which demonstrate that YOLO performs better
in terms of the trade-off between latency and accuracy. YOLOv7 was chosen over various YOLO
model variants following a comparative analysis. The detected objects are further classified
into three categories: movable, partially movable, and immovable. Movable objects encompass
persons and chairs; the partially movable objects are keyboards, laptops, TVs/monitors, and mice;
the remaining objects are deemed immovable. This distinction is crucial to maintaining real-time
accuracy and resource efficiency, as movable and partially movable objects require frequent
updates due to their volatile positions, while immovable objects, remaining static most of the
time, necessitate fewer updates. Accordingly, the movable objects undergo more frequent
updates, occurring every frame, while partially movable objects are updated every 5 minutes.
The immovable objects are updated less frequently, with a refresh rate of every 24 hours. The
classification of the objects into groups helps the proposed system to reconstruct real-world in
real time. We also compare linear regression with degrees 1, 2, and 3, support vector regression
(SVR) with radial basis function (RBF) and polynomial kernels [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ], and a vanilla neural network
to map between real-world and virtual objects. The pose estimation of objects is performed
using a combination of different image processing techniques. The position and orientation
of the objects are communicated to the game engine using socket programming for real-time
interaction and synchronization. Figure 1 represents the working of the proposed system.
      </p>
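<p>The update policy described above (every frame for movable objects, every 5 minutes for partially movable objects, and every 24 hours for immovable objects) can be sketched as a simple scheduler. The intervals restate the text; the function and constant names are illustrative, not from the paper.</p>

```python
# Update intervals in seconds per object category, as described in the text.
UPDATE_INTERVALS = {
    "movable": 0.0,             # every frame
    "partially_movable": 300,   # every 5 minutes
    "immovable": 86400,         # every 24 hours
}

def due_for_update(category, last_update, now):
    """Return True if an object of the given category should be refreshed."""
    return (now - last_update) >= UPDATE_INTERVALS[category]
```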
      <sec id="sec-3-1">
        <title>3.1. Object Detection</title>
        <p>In this Section, a detailed analysis of the dataset preparation strategy is discussed followed by
comparison study between object detection models in study.</p>
        <p>
          Dataset Preparation: The dataset encompasses various day-to-day office utilities, including
chairs, keyboards, laptops, TV/monitors, refrigerators, desks, mice, persons, and couches. The dataset
used in this study was derived from two sources. The first source involved filtering the MS
COCO dataset [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ] to extract the above-mentioned class labels. The second source was a
crowdsourced dataset collected for this research. The images obtained from the user-generated dataset
were annotated with nine class labels. To annotate the images, the Computer Vision Annotation
Tool (CVAT) was employed, allowing for manual annotation through visual inspection. The
dataset was then divided into train, validation, and test subsets, with each class label having
specified entries in each subset, as shown in Table 1. The annotations were compiled into an
XML file format using the CVAT tool. Subsequently, the XML file format was converted to the
YOLO file format, which was utilized for training the object detection model.
        </p>
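<p>The CVAT-XML-to-YOLO conversion mentioned above is essentially a coordinate normalization: YOLO expects one line per box containing a class index and the box center, width, and height, each normalized by the image size, whereas CVAT exports pixel corners. A minimal sketch of that conversion for a single box (the function name is illustrative):</p>

```python
def cvat_box_to_yolo(class_id, xtl, ytl, xbr, ybr, img_w, img_h):
    """Convert a CVAT corner-style box to a YOLO annotation line.

    CVAT stores top-left / bottom-right pixel corners; YOLO wants
    normalized center-x, center-y, width, height in [0, 1].
    """
    cx = (xtl + xbr) / 2.0 / img_w
    cy = (ytl + ybr) / 2.0 / img_h
    w = (xbr - xtl) / img_w
    h = (ybr - ytl) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```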
        <p>Comparison between Object Detection Models: We compared YOLO models (v4 to v7)
to choose the best model. By using transfer learning, we leveraged the pretrained weights of
these models on a large-scale dataset and fine-tuned them on a bespoke dataset, resulting in
improved object detection capabilities within our digital twin environment. Model training was
conducted on an NVIDIA 3090 Ti GPU, and performance was assessed using three evaluation
metrics, including mean Intersection Over Union (mIOU), precision, and F1 score.</p>
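<p>Of the three metrics, IoU is the one most specific to detection: each predicted box is scored by the ratio of its overlap with the matching ground-truth box to their union area. A minimal IoU computation for axis-aligned boxes in (x1, y1, x2, y2) form:</p>

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```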
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Mapping Techniques</title>
        <p>
          In the next step, the detected objects were used to map corresponding objects in a virtual
environment created using Unity 3D. The mapping focused on three out of six degrees of
freedom (DOFs), specifically translation along the X and Y-axis, and rotation around the yaw.
This selection of three DOFs was specific to this problem, given that all objects are situated
within a 2D plane, such as the floor or a tabletop. We mapped the center of the base from the 2D frame
to the appropriate plane in the virtual environment. The extracted center coordinates were utilized as
inputs for various regression algorithms, including linear regression with degrees 1, 2, and 3,
support vector regression (SVR) with radial basis function (RBF) and polynomial kernels [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ],
and a vanilla ANN. Linear regression with degree 1 fits a straight-line relationship between the
features and the target. When degree 2 is used, quadratic terms are introduced to capture curved
patterns. Additionally, when degree 3 is employed, cubic terms are added, enabling the model to
fit S-shaped patterns and capture more complex relationships. SVR with an RBF kernel is a powerful
non-linear regression approach: using a Gaussian-like function, it modifies the
input space to capture complex connections and trends. Similarly, polynomial relationships can
be captured using SVR with a polynomial kernel, which converts data into a higher-dimensional
space and uses polynomial functions to assess similarity. The architecture of the ANN consists
of four Multi-Layer Perceptron (MLP) blocks, with three of them responsible for the mapping
process in conjunction with the input (Figure 2). We propose a hybrid structure by using a
weighted summation of the output of these MLP blocks. The structure for the first three blocks
is 2-4-2 (input layer-hidden layer-output layer) and for the fourth block (weight block) it is 2-8-4
(input layer-hidden layer-output layer). The activation used in the first three blocks are linear,
sigmoid, and tanh respectively and for the weight block linear activation is used. The model
was trained using Adam optimizer and mean square error as loss function. The fourth block
incorporates weights for all four outputs, scaling the output based on the specific degree of
non-linearity. In the feature fusion part, we undertake feature fusion as described in Equation 1.
F(x, y) = α · f1(x, y) + β · f2(x, y) + γ · f3(x, y) + δ · I(x, y)
(1)
where α, β, γ, and δ denote the weight parameters, f1, f2, and f3 indicate the three different blocks
connected in parallel, I(x, y) is the input, and F(x, y) is the weighted sum of the parallel outputs and the input.
Each algorithm was trained using a dataset consisting of frame coordinates and corresponding
virtual world coordinates. Frame coordinates are generated based on the bounding box location
on the screen by the object detection model. Correspondingly, in the virtual environment, we
placed objects in the same locations and generated the virtual world coordinates. After the training
process, the algorithms could predict the coordinates of objects within the virtual environment
based on the object detection model generated coordinates. The predicted coordinates were then
transmitted to virtual environment using socket programming. This communication facilitated
the dynamic mapping of the detected objects within the virtual environment, resulting in an
immersive and interactive user experience.
        </p>
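<p>The hybrid architecture above can be sketched as a NumPy forward pass: three parallel 2-4-2 blocks with linear, sigmoid, and tanh hidden activations, plus a 2-8-4 weight block whose four outputs play the roles of α, β, γ, and δ in Equation 1. The weights here are random and untrained; this only illustrates the wiring, not the trained model.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: illustrative, untrained weights

def mlp_forward(x, w1, b1, w2, b2, act):
    """One hidden-layer MLP: input -> act(hidden) -> linear output."""
    return act(x @ w1 + b1) @ w2 + b2

linear = lambda z: z
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def init_block(n_in, n_hidden, n_out):
    return (rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(size=(n_hidden, n_out)), np.zeros(n_out))

# Three parallel 2-4-2 mapping blocks (linear, sigmoid, tanh hidden activations).
blocks = [(init_block(2, 4, 2), act) for act in (linear, sigmoid, np.tanh)]
# 2-8-4 weight block: produces the four fusion weights (alpha..delta).
weight_block = init_block(2, 8, 4)

def hybrid_forward(xy):
    """Weighted fusion of the three block outputs and the input (Equation 1)."""
    outs = [mlp_forward(xy, *params, act) for params, act in blocks]
    w = mlp_forward(xy, *weight_block, linear)  # alpha, beta, gamma, delta
    return w[0] * outs[0] + w[1] * outs[1] + w[2] * outs[2] + w[3] * xy
```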
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Pose Correction</title>
<p>A series of image processing steps was followed for estimating the pose of the object in
real-world space in the process of digital twin reconstruction. First, the region of interest (ROI) was
obtained from the coordinates generated by the object detection model. Subsequently, image
sharpening was applied to enhance the visibility of the objects in the images. Next, segmentation
was performed to isolate the required objects from the background. The resulting segmented
images were then converted into a binary grayscale format, simplifying the subsequent image
analysis process. To further refine the binary images, a morphological process was applied to
fill in small patches present in the binary images. Following this, contour fitting was performed
on the binary images to accurately obtain the shapes and boundaries of the objects. Finally, the
minimum area rectangle method was utilized to determine the precise positions and orientations
of the objects in the images. This comprehensive approach facilitated the acquisition of more
accurate representations of the objects’ poses in the digital twin environment. By implementing
these image processing steps, the objects were successfully detected and mapped in the digital
twin, enabling the creation of an accurate representation of the physical world. The proposed
pose detection algorithm is described in Figure 3.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Placing in 3D Virtual Environment</title>
        <p>The virtual environment reconstruction involves the transmission of pose and location data for
target objects from computer vision algorithms to a virtual environment. Prior to receiving the
data stream, a virtual representation of the real-world space is constructed using a game engine,
incorporating precise real-world measurements. The virtual environment also captures and
represents the movement of individual people within the virtual environment. This is achieved
by mapping people’s movement in each frame and depicting their actions, such as sitting at
specific workstations or engaging in movement. Walking animations are employed to visualize
the movement of people.</p>
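<p>The pose and location stream to the game engine uses socket programming, as noted in Section 3. A minimal sender sketch is shown below; the JSON-per-line message format and field names are assumptions for illustration, not the paper's actual protocol.</p>

```python
import json
import socket

def send_pose(host, port, obj_id, x, y, yaw):
    """Send one object's mapped position and yaw to the game engine over TCP.

    The JSON-per-line message format is an assumption for illustration.
    """
    msg = json.dumps({"id": obj_id, "x": x, "y": y, "yaw": yaw}) + "\n"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(msg.encode("utf-8"))
```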
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>In this section, we have described the results of individual components of the proposed system
in detail.</p>
      <sec id="sec-4-1">
        <title>4.1. Object Detection Accuracy</title>
        <p>The accuracy of the object detection models was evaluated using various metrics. These metrics
were employed to measure the model’s performance in identifying and localizing objects in the
provided test data. Table 2 illustrates the performance of the models concerning the test data.
YOLOv5 showcased the highest IoU score, while YOLOv7 exhibited consistent performance
across classes, boasting the highest F1 score among the models. Moreover, YOLOv7 demonstrated
a processing speed of 30.23 frames per second.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Mapping Accuracy</title>
        <p>The accuracy of the mapping algorithms is measured using the Euclidean distance between
the ground truth and predicted virtual-world positions. Figure 4 summarizes the comparison
between the machine learning models as discussed in Section 3.2, deployed for mapping. It may
be noted that the neural network model achieved the highest accuracy, with an error of 80 centimeters.
Figure 5 shows the correlation graph, with a coefficient of determination of R² = 0.97, between the
actual distance measured in the virtual environment and that predicted by the proposed neural network
model.</p>
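<p>The metric above is the mean Euclidean distance between matched ground-truth and predicted virtual-world positions; as a sketch:</p>

```python
import numpy as np

def mean_euclidean_error(truth, pred):
    """Mean Euclidean distance between matched ground-truth and predicted
    points. Both inputs have shape (n_points, n_dims)."""
    truth, pred = np.asarray(truth, float), np.asarray(pred, float)
    return float(np.mean(np.linalg.norm(truth - pred, axis=1)))
```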
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Pose Accuracy</title>
        <p>We reported pose accuracy by analyzing the actual orientation of the object and corresponding
orientation measured by the algorithm. We marked a Cartesian coordinate system on a table by
marking the angles. Then we placed the keyboard and TV on the table. We noted their actual
angle with reference line and the orientation generated by the algorithm. Figure 6 shows the
correlation between actual and predicted orientation. It may be noted that the coefficient of
determination was observed to be R² = 0.99, while the average error was found to be 8.03°. The
mapping between real-world objects and corresponding virtual objects is exemplified in Figure 7,
providing a concrete illustration. In this scenario, the focus was only on the yaw angle regarding
the positioning of movable objects (keyboards, laptops, TVs/monitors, and desks) on a level
2D plane. Considering the nature of these objects placed on a flat plane, any rotation around
the pitch or roll angles in a 3D scene was deemed unnecessary or not applicable for the specific
context being addressed.</p>
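<p>The two figures reported above (the coefficient of determination and the mean angular error) can both be computed from the paired actual and predicted orientations; a sketch:</p>

```python
def orientation_metrics(actual, predicted):
    """Return (R^2, mean absolute angular error in degrees) for paired angles."""
    n = len(actual)
    mean_a = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return r2, mae
```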
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. General Discussion</title>
      <p>
        This paper presents a data-driven approach for reconstructing a digital twin of an office space
environment. The previous work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a VR digital twin of physical space for measuring
room occupancy and social distancing measurement. However, the main problem with this
approach lies in replicating physical space. It is a time-consuming process as it requires modeling,
physics simulation, and rendering capabilities of Unity 3D. On the other hand, this proposed
system offers a different and beneficial approach to constructing VR digital twins compared
to conventional methods. By incorporating machine learning techniques, the development of
digital twins becomes more scalable and efficient. Machine learning enables accurate object
detection, orientation estimation, and 2D-to-3D mapping, resulting in high-quality virtual
workspaces. This approach addresses challenges associated with asset placement, enhances
space planning and visualization, and promotes energy efficiency and sustainability within
workspaces. In this section, we provide a summary of all the key points discussed in Section 1.
      </p>
<p>Reconstruction Techniques: There are various techniques available for 3D reconstruction of a
space from 2D images. Cutting-edge automated image orientation techniques, such as Structure
from Motion, and dense image matching methods like Multiple View Stereo, which are widely
utilized for deriving 3D information from 2D images, can yield 3D outcomes, such as point
clouds or meshes, exhibiting diverse levels of geometric accuracy and visual fidelity. This
technique requires many images from various directions and takes considerable time to reconstruct an
object for a given instant, making it harder to reconstruct a real-time scenario of the
real world. This paper uses a novel data-driven approach, which utilizes only one 2D image of
the real world and reconstructs it in virtual reality. This reconstruction happens in two stages.
The first stage is detecting objects from the real world using 2D images and calculating their
pose using various image processing steps, and the second stage is mapping the objects from
the real world to the virtual world. We utilized the YOLOv7 model as our object detection model
to detect objects from the real world. The model performed well, with an F1 score
of 0.70. This is beneficial for reconstruction purposes as it can detect small objects with high
accuracy, e.g., a mouse with a mAP of 0.81. The proposed pose detection algorithm showed
its efficiency with an error rate of 8.03°. This high accuracy of pose estimation can be helpful
for detecting orientation in the real world and reflecting it in the corresponding virtual environment.
Thus, the proposed system will be impactful in creating a digital twin. A supplementary video
(https://youtu.be/advtKAQ02Nk) shows the working of the proposed system as well as how
accurately real world is depicted in the virtual environment.</p>
      <p>
        Mapping Techniques: There is a plethora of mapping techniques available for reconstructing
a real space. One widely used technique is COLMAP [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], an end-to-end image-based 3D
reconstruction pipeline. It employs Multi-View Stereo (MVS) to compute depth and/or normal
information for every pixel in an image, using the output of Structure-from-Motion (SfM) [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
By fusing the depth and normal maps of multiple images in 3D, a dense point cloud of the scene
is generated. However, this technique requires many images from different viewpoints with high
visual overlap, making it slower and more time-consuming when creating a representation of
a real-world scenario at a specific moment in time. In this paper, after comparing various classical
machine learning techniques, neural networks were found to be the most suitable for the task,
as they are designed to handle the different degrees of non-linearity at different points.
The neural network achieved an average error of only 80 cm when mapping objects
within the virtual world, 43.66% lower than that of the second-best
model, linear regression. The proposed system holds potential for application on
workshop floors, facilitating remote monitoring, asset tracking, and various other functions.
      </p>
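      <p>As a sketch of why a non-linear model can beat linear regression on this mapping task, the snippet below trains a tiny one-hidden-layer network in plain NumPy to regress 3D virtual-world coordinates from 2D image-plane points. The calibration pairs are invented synthetic data, not the paper's measured office data, and the architecture is an assumption for illustration rather than the authors' actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training pairs: 2D image-plane points -> 3D virtual-world points.
# The ground-truth map is a mildly non-linear warp, standing in for the
# real-to-virtual correspondences collected in an office.
X = rng.uniform(0, 1, size=(256, 2))
Y = np.column_stack([
    2.0 * X[:, 0] + 0.3 * X[:, 1] ** 2,
    1.5 * X[:, 1] - 0.2 * X[:, 0] * X[:, 1],
    0.5 + 0.1 * X[:, 0],
])

# One-hidden-layer tanh MLP trained by full-batch gradient descent on MSE.
W1 = rng.normal(0, 0.5, (2, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 3)); b2 = np.zeros(3)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

_, P0 = forward(X)
loss0 = np.mean((P0 - Y) ** 2)   # error before training

lr = 0.05
for _ in range(2000):
    H, P = forward(X)
    G = 2.0 * (P - Y) / len(X)        # dLoss/dP (mean over samples)
    gW2, gb2 = H.T @ G, G.sum(0)
    GH = (G @ W2.T) * (1 - H ** 2)    # backprop through tanh
    gW1, gb1 = X.T @ GH, GH.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, P1 = forward(X)
loss1 = np.mean((P1 - Y) ** 2)   # error after training
```

On such smooth but non-linear correspondences the hidden layer absorbs the curvature that a purely linear fit leaves as residual error, which mirrors the 43.66% improvement over linear regression reported above.</p>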
      <p>This proposed approach employs machine learning techniques to reconstruct digital twins of
physical spaces efficiently, streamlining the process and enhancing accuracy in object detection,
orientation estimation, and mapping. Addressing limitations of standard 3D reconstruction, such
as missing CAD/BIM data for objects, large object volumes, or wide capture areas, the
approach simplifies real-time reconstruction using a single 2D image. This streamlined process
demonstrates promising potential for swift and precise translation from the real world to the
virtual realm, in contrast with time-consuming traditional methods.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study endeavors to devise a cost-effective solution for constructing a virtual model of a built
environment. The suggested system serves as a VR-based digital replica of an office, integrating
real-time monitoring of human occupancy and asset utilization. The system’s accuracy hinges
on the performance of the object detection model, which achieved an F1 score of 0.70. The pose
estimation algorithm accurately tracks the movement of movable objects, such as keyboards,
monitors, and desks, exhibiting a high correlation (R2 = 0.99). The proposed neural network
model successfully maps objects from the 2D image plane to a 3D plane in a virtual environment,
demonstrating a correlation of R2 = 0.97. Real-time mapping of human positions and precise
estimation of asset poses offer numerous advantages, enabling the floor management team to
conduct thorough remote walkthroughs and gain insights into room occupancy and office asset
utilization. This information empowers informed decisions on sustainable space usage and asset
management. However, challenges remain in mapping objects from the 2D plane to the 3D
plane. Future work will focus on implementing 3D object detection, promising accurate
positioning and orientation of real-world objects. The proposed framework underwent testing
in various office spaces, with all data transmitted to the digital twin of the real-world space
(please refer to https://youtu.be/advtKAQ02Nk).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Paperswithcode</surname>
          </string-name>
          , 3d reconstruction, https://paperswithcode.com/task/3d-reconstruction/,
          <year>2023</year>
          . Accessed 18 April 2023.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Schönberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Frahm</surname>
          </string-name>
          ,
          <article-title>Structure-from-motion revisited</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>4104</fpage>
          -
          <lpage>4113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Schönberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Frahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pollefeys</surname>
          </string-name>
          ,
          <article-title>Pixelwise view selection for unstructured multi-view stereo</article-title>
          , in:
          <source>Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>501</fpage>
          -
          <lpage>518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukhopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. LRD</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <article-title>Validating social distancing through deep learning and vr-based digital twins</article-title>
          ,
          <source>in: Proceedings of the 27th ACM symposium on virtual reality software and technology</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukhopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Saluja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peña-Rios</surname>
          </string-name>
          , G. Gopal,
          <string-name>
            <given-names>P.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <article-title>Virtual-reality-based digital twin of ofice spaces with social distance measurement feature</article-title>
          ,
          <source>Virtual Reality &amp; Intelligent Hardware</source>
          <volume>4</volume>
          (
          <year>2022</year>
          )
          <fpage>55</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukhopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Rajshekar</given-names>
            <surname>Reddy</surname>
          </string-name>
          , I. Mukherjee,
          <string-name>
            <given-names>G. Kumar</given-names>
            <surname>Gopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pena-Rios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <article-title>Generating synthetic data for deep learning using vr digital twin</article-title>
          ,
          <source>in: Proceedings of the 2021 5th International Conference on Cloud and Big Data Computing</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          IBM,
          <article-title>What is a digital twin?</article-title>
          , https://www.ibm.com/topics/what-is-a-digital-twin/,
          <year>2023</year>
          . Accessed 18 April 2023.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>U.</given-names>
            <surname>Krispel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Evers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tamke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Viehauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Fellner</surname>
          </string-name>
          ,
          <article-title>Automatic texture and orthophoto generation from registered panoramic views</article-title>
          ,
          <source>The international archives of the photogrammetry, remote sensing and spatial information sciences 40</source>
          (
          <year>2015</year>
          )
          <fpage>131</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Díaz-Vilariño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>González-Jorge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martínez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lorenzo</surname>
          </string-name>
          ,
          <article-title>Automatic lidar-based lighting inventory in buildings</article-title>
          ,
          <source>Measurement</source>
          <volume>73</volume>
          (
          <year>2015</year>
          )
          <fpage>544</fpage>
          -
          <lpage>550</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Czerniawski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nahangi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walbridge</surname>
          </string-name>
          ,
          <article-title>Pipe spool recognition in cluttered point clouds using a curvature-based shape descriptor</article-title>
          ,
          <source>Automation in Construction</source>
          <volume>71</volume>
          (
          <year>2016</year>
          )
          <fpage>346</fpage>
          -
          <lpage>358</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Building element recognition with thermal-mapped point clouds</article-title>
          ,
          <source>in: 34th International Symposium on Automation and Robotics in Construction (ISARC 2017)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pătrăucean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Brilakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vela</surname>
          </string-name>
          ,
          <article-title>Detection of walls, floors, and ceilings in point cloud data</article-title>
          ,
          <source>in: Construction Research Congress</source>
          <year>2016</year>
          ,
          <year>2016</year>
          , pp.
          <fpage>2302</fpage>
          -
          <lpage>2311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Vectorized indoor surface reconstruction from 3d point cloud with multistep 2d optimization</article-title>
          ,
          <source>ISPRS Journal of Photogrammetry and Remote Sensing</source>
          <volume>177</volume>
          (
          <year>2021</year>
          )
          <fpage>57</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Geometric quality inspection of prefabricated mep modules with 3d laser scanning</article-title>
          ,
          <source>Automation in Construction</source>
          <volume>111</volume>
          (
          <year>2020</year>
          )
          <fpage>103053</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Czerniawski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Leite</surname>
          </string-name>
          ,
          <article-title>Automated segmentation of rgb-d images into a comprehensive set of building components using deep learning</article-title>
          ,
          <source>Advanced Engineering Informatics</source>
          <volume>45</volume>
          (
          <year>2020</year>
          )
          <fpage>101131</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Drobnyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Brilakis</surname>
          </string-name>
          ,
          <article-title>Construction and maintenance of building geometric digital twins: state of the art review</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>4382</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. I.</given-names>
            <surname>Nikolenko</surname>
          </string-name>
          ,
          <article-title>Synthetic-to-real domain adaptation and refinement</article-title>
          , in:
          <source>Synthetic data for deep learning</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>235</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tremblay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prakash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Acuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brophy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jampani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Anil</surname>
          </string-name>
          , T. To, E. Cameracci,
          <string-name>
            <given-names>S.</given-names>
            <surname>Boochoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Birchfield</surname>
          </string-name>
          ,
          <article-title>Training deep networks with synthetic data: Bridging the reality gap by domain randomization</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>969</fpage>
          -
          <lpage>977</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Computer vision enabled building digital twin using building information model</article-title>
          ,
          <source>IEEE Transactions on Industrial Informatics</source>
          <volume>19</volume>
          (
          <year>2022</year>
          )
          <fpage>2684</fpage>
          -
          <lpage>2692</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Total3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kaufman</surname>
          </string-name>
          ,
          <article-title>Mapping virtual and physical reality</article-title>
          ,
          <source>ACM Transactions on Graphics (TOG) 35</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Outdoor augmented reality spatial information representation</article-title>
          ,
          <source>Appl. Math</source>
          <volume>7</volume>
          (
          <year>2013</year>
          )
          <fpage>505</fpage>
          -
          <lpage>509</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A multi-scale vr navigation method for vr globes</article-title>
          ,
          <source>International journal of digital earth 12</source>
          (
          <year>2019</year>
          )
          <fpage>228</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <article-title>Mapping the world in 3d</article-title>
          ,
          <source>Nature Photonics</source>
          <volume>4</volume>
          (
          <year>2010</year>
          )
          <fpage>429</fpage>
          -
          <lpage>430</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          Mandli Communications, Maverick, https://www.mandli.com/maverick-by-mandli-communications/,
          <year>2023</year>
          . Accessed 18 February 2021.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Rosenfield</surname>
          </string-name>
          ,
          <article-title>The problem of exterior orientation in photogrammetry</article-title>
          ,
          <source>Photogrammetric Engineering</source>
          <volume>25</volume>
          (
          <year>1959</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>E.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <article-title>The projective theory of relative orientation</article-title>
          ,
          <source>Photogrammetria</source>
          <volume>23</volume>
          (
          <year>1968</year>
          )
          <fpage>67</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Haralick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Shapiro</surname>
          </string-name>
          ,
          <source>Computer and robot vision</source>
          , volume
          <volume>1</volume>
          , Addison-Wesley, Reading, MA,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rabha</surname>
          </string-name>
          ,
          <article-title>A survey on joint object detection and pose estimation using monocular vision</article-title>
          , arXiv preprint arXiv:1811.10216 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <article-title>Vision-based human pose estimation via deep learning: a survey</article-title>
          ,
          <source>IEEE Transactions on Human-Machine Systems</source>
          <volume>53</volume>
          (
          <year>2022</year>
          )
          <fpage>253</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Collet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Srinivasa</surname>
          </string-name>
          ,
          <article-title>The moped framework: Object recognition and pose estimation for manipulation</article-title>
          ,
          <source>The international journal of robotics research 30</source>
          (
          <year>2011</year>
          )
          <fpage>1284</fpage>
          -
          <lpage>1306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>F.</given-names>
            <surname>Viksten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soderberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nordberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Perwass</surname>
          </string-name>
          ,
          <article-title>Increasing pose estimation performance using multi-cue integration</article-title>
          ,
          <source>in: Proceedings 2006 IEEE International Conference on Robotics and Automation (ICRA 2006)</source>
          , IEEE,
          <year>2006</year>
          , pp.
          <fpage>3760</fpage>
          -
          <lpage>3767</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Azamfar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Siahpour</surname>
          </string-name>
          ,
          <article-title>Integration of digital twin and deep learning in cyber-physical systems: towards smart manufacturing</article-title>
          ,
          <source>IET Collaborative Intelligent Manufacturing</source>
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <fpage>34</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>O. R.</given-names>
            <surname>Ogunseiju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Olayiwola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Akanmu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nnaji</surname>
          </string-name>
          ,
          <article-title>Recognition of workers' actions from time-series signal images using deep convolutional neural network</article-title>
          ,
          <source>Smart and Sustainable Built Environment</source>
          <volume>11</volume>
          (
          <year>2022</year>
          )
          <fpage>812</fpage>
          -
          <lpage>831</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          ,
          <source>Nature</source>
          <volume>521</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>C.</given-names>
            <surname>Boton</surname>
          </string-name>
          ,
          <article-title>Supporting constructability analysis meetings with immersive virtual reality-based collaborative BIM 4D simulation</article-title>
          ,
          <source>Automation in Construction</source>
          <volume>96</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukhopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <article-title>Comparing CNNs for non-conventional traffic participants</article-title>
          ,
          <source>in: Proceedings of the 11th International Conference on Automotive User Interfaces and Interactive Vehicular Applications: Adjunct Proceedings</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>M.</given-names>
            <surname>Carranza-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Torres-Mateo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lara-Benítez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>García-Gutiérrez</surname>
          </string-name>
          ,
          <article-title>On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data</article-title>
          ,
          <source>Remote Sensing</source>
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>89</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <article-title>A tutorial on support vector regression</article-title>
          ,
          <source>Statistics and Computing</source>
          <volume>14</volume>
          (
          <year>2004</year>
          )
          <fpage>199</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft COCO: common objects in context</article-title>
          , in:
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Fleet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pajdla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          (Eds.),
          <source>Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V</source>
          , volume
          <volume>8693</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2014</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          . URL: https://doi.org/10.1007/978-3-319-10602-1_48. doi:10.1007/978-3-319-10602-1_48.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>