<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>JOURNAL OF CIVIL ENGINEERING AND MANAGEMENT</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3846/jcem.2020.12611</article-id>
      <title-group>
        <article-title>Integrating Building Information Modeling and Panoramic Structure-from-Motion for Accurate Camera Pose Estimation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Max Jwo Lem Lee</string-name>
          <email>maxjl.lee@connect.polyu.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weisong Wen</string-name>
          <email>welson.wen@polyu.edu.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephen Ling Ming Au</string-name>
          <email>stephenau@mtech.com.hk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University (PolyU)</institution>
          ,
          <country country="HK">Hong Kong</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MTECH Engineering Co., Ltd</institution>
          ,
          <country country="HK">Hong Kong</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>26</volume>
      <fpage>513</fpage>
      <lpage>523</lpage>
      <abstract>
        <p>In this study, we present a novel approach for combining Building Information Modeling (BIM) and panoramic photogrammetry-based Structure-from-Motion (SfM) to achieve accurate camera pose estimation in architectural scenes. The fusion of BIM and SfM information addresses the limitations of the individual methods: the former offers global positioning but suffers from suboptimal accuracy, while the latter provides accurate relative positioning but lacks scale and global positioning. Our method consists of four key steps: (1) computationally efficient global positioning of panorama images in the BIM model using indoor semantic skymasks to generate probability distributions, (2) relative positioning estimation from the panoramic SfM process, (3) rough alignment of the SfM reconstruction with the BIM positioning using generalized Procrustes analysis (GPA), and (4) refinement of the camera pose using non-linear least-squares optimization. We evaluate the performance of our proposed method on a real-world dataset of panoramic images capturing architectural scenes and compare the refined camera poses with ground truth. The results demonstrate a camera positioning error of less than 0.6 meters, an improvement over using BIM or panoramic SfM individually. This research highlights the potential benefits of fusing SfM and BIM modalities, paving the way for more accurate and efficient camera pose estimation pipelines in the architecture, engineering, and construction (AEC) domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Panorama</kwd>
        <kwd>BIM</kwd>
        <kwd>Localization</kwd>
        <kwd>SfM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The burgeoning interest in construction automation, virtual reality-enabled scene navigation,
sophisticated facility management, and as-built documentation has catalyzed the development of
a myriad of techniques in computer vision, photogrammetry, and Building Information Modeling
(BIM) [1-3]. Visual documentation is critical in construction projects for monitoring site conditions
and progress. While images alone may not suffice, BIM can provide a detailed digital
representation of the building or structure, allowing construction teams to identify discrepancies
between planned designs and actual construction [4]. Integrating images and BIM models enables
effective communication and collaboration between stakeholders, improving decision-making and
problem-solving. Recent research has shown that such integration can enhance the accuracy and
completeness of construction documentation [5]. Precise camera pose estimation constitutes a
pivotal aspect of these applications, as it substantially influences the integration quality between
captured images and BIM models [6]. However, current approaches do not adequately address the
challenges in automating camera pose estimation for panoramic images, leaving a research gap in
achieving successful BIM model integration.</p>
      <p>Panoramic imaging has witnessed a surge in popularity in recent years, attributed to its capacity
to encapsulate an extensive field of view within a single image, rendering it particularly apt for
architectural scenarios. Nonetheless, automating camera pose estimation for panoramic images
engenders distinct challenges that must be addressed to ensure successful BIM model integration.
Structure-from-Motion (SfM) represents a widely employed photogrammetric technique for
deducing camera poses and reconstructing 3D scenes from an unordered set of panorama images.
Despite progress in panoramic SfM, issues pertaining to scale ambiguity and restrictions on
relative positioning may impede the applicability of this method.</p>
      <p>ORCID: 0000-0002-5524-6724 (M. J. L. Lee); 0000-0003-4158-0913 (W. Wen). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>Several studies have explored the integration of BIM and photogrammetric techniques for 3D
reconstruction and as-built documentation. For example, [7] proposed a framework for
integrating BIM and laser scanning data to generate accurate and detailed as-built models.
Similarly, [8] developed a method that combines BIM and photogrammetry to create a 3D model
of a building interior for virtual reality applications. However, these studies do not specifically
address the challenges of automating camera pose estimation for panoramic images in the context
of BIM integration.</p>
      <p>Concurrently, BIM has emerged as an indispensable tool within the architecture, engineering, and
construction (AEC) fields, providing an exhaustive digital representation of a building's physical
and functional characteristics [9]. BIM models offer a global coordinate system, along with
invaluable geometric and semantic information to direct and constrain the camera pose estimation
procedure. However, the limited level of detail in BIM models renders attaining high-precision
positioning both arduous and computationally demanding. Previous studies on BIM-based visual
positioning systems (VPS) have exclusively explored virtual environments, which do not
accurately represent real-world conditions, as they lack scene changes due to dynamic objects
[10].</p>
      <p>In this paper, we propose a method that leverages static objects exclusively for BIM-based
positioning, incorporating the layout and static semantic objects such as doors and windows.
Additionally, we introduce a novel approach for positioning panoramic images by employing
geotagged indoor semantic skymasks, building upon our previous research on skymask
matching-aided positioning in urban canyons [11]. These indoor semantic skymasks offer a detailed
representation of a 3D location's spatial layout within an indoor environment, encompassing both
geometric and static semantic information. This approach enhances the computational efficiency
of candidate-based matching as opposed to traditional image-based comparisons [10, 11].
Following this, we utilize panoramic photogrammetry-based SfM to achieve accurate relative
positioning. Our method aims to overcome the limitations inherent in each individual technique
by harnessing the complementary information provided by both BIM and SfM. We contend that
the integration of panoramic BIM and SfM will lead to more precise and reliable camera pose
estimation, benefiting a wide array of applications within the AEC domain, including but not
limited to:
• Construction automation: Accurate camera pose estimation facilitates streamlined
construction processes by enabling precise alignment of the physical building with its digital
representation.
• Virtual reality-enabled navigation: Realistic virtual tours of architectural spaces can be
generated by integrating accurate camera pose estimation with immersive virtual reality
experiences.
• Facility management: Maintaining and updating building documentation becomes more
efficient and accurate through precise camera pose estimation, allowing facility managers to
better track building conditions and plan renovations.
• As-built documentation: By combining accurate camera pose estimation with BIM
models, architects and engineers can create more reliable as-built documentation for future
reference and regulatory compliance.</p>
      <p>The integration of BIM and panoramic SfM for accurate camera pose estimation, as proposed in
this research, offers several advantages over conventional methods that employ
either panoramic BIM or SfM alone:
• Enhanced Accuracy: By combining information derived from BIM global positioning
and panoramic SfM relative positioning, our method overcomes the limitations
inherent to each individual technique and achieves more precise camera pose estimation.
• Scale Recovery: By integrating the BIM model, which carries accurate scale information, our
approach recovers the appropriate scale of the SfM reconstruction.
• Improved Computational Speed: Our method extracts geometric and semantic
information to generate pre-computed indoor semantic skymasks for positioning,
significantly reducing computation time compared to visual positioning systems (VPS) that
depend on whole-image comparisons.</p>
      <p>However, it is important to note that this approach has some potential weaknesses:
• Dependence on static objects: The proposed method leverages static objects such as
doors and windows for camera pose estimation, which may not be sufficient in all scenarios.
• Limited level of detail: The level of detail in BIM models may not always be sufficient for
high-precision positioning, which could limit the effectiveness of the proposed method.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The proposed integration of BIM and SfM for accurate camera pose estimation</title>
      <p>As shown in Fig. 1, the proposed algorithm can be divided into two main stages: an offline process and an
online process. During the offline stage, the Building Information Modeling (BIM) model is
processed to extract two-dimensional (2D) semantics of the surroundings
within the panorama frame at each candidate position, as detailed in Sub-Section 3.1. The
extracted 2D semantics are subsequently used to create an indoor semantic skymask, which stores
the elevation profile of the upper and lower boundaries of walls at each azimuth, alongside binary
indicators denoting the presence or absence of doors and windows at each azimuth (Sub-Section
3.3). These indoor semantic skymasks are stored in a database for use during the online
stage.</p>
      <p>In the online stage, panoramic images are processed using deep learning models to extract layout
and semantics (Sub-Section 3.2), which are then employed to generate query indoor semantic
skymasks (Sub-Section 3.3). The proposed method comprises two main components: 1) Indoor
semantic skymask matching, wherein each query indoor semantic skymask is matched with
candidate indoor semantic skymasks to produce a probability distribution and its corresponding
expected global position (Sub-Section 3.4); and 2) SfM is applied to generate unscaled relative
positioning estimates (Sub-Section 3.5). The estimated unscaled relative 2D positions serve as
input for the Generalized Procrustes Analysis (GPA), which approximately aligns them with the
expected global positions of the query images (Sub-Section 3.6). Finally, the positions are refined
through non-linear least squares optimization, using their respective probability distributions
(Sub-Section 3.7).</p>
      <p>The paper is organized as follows: Section 3 introduces the methodology for integrating BIM and
SfM for accurate camera pose estimation. Section 4 presents the experiment setup and results.
Section 5 presents the conclusion, and Section 6 outlines future work.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Distribute Candidate Positions and Generate 2D Semantics</title>
        <p>The initial step in our method entails obtaining the BIM model of the indoor environment. In this
study, we made use of the BIM model of an office, which was provided in the Industry Foundation
Classes (IFC) open format. A BIM model encompasses all information related to the
building, including its physical attributes. For instance, a door within a BIM model would already
be labeled "door" in the IFC format, which can then be utilized to generate semantics. This study
employs the ADE20k classes to categorize objects in the BIM model [8]. Owing to the Level of
Detail 2 (LOD2) nature of our BIM model, we focused on the semantics of "ceiling," "wall," "floor,"
"door," and "window" while excluding all dynamic objects. This exclusion is crucial, as
dynamic objects can be relocated in the real environment, and their presence may lead to
erroneous camera pose estimation.</p>
        <p>Firstly, candidate positions are spread across the BIM model with 1-meter separation at a height of 1.8 m,
as shown in Fig. 2. We assume that the query images are also captured at a height of 1.8 m. The following are
defined:
<disp-formula><label>(1)</label><tex-math>p_c = [x_c, y_c], \qquad \mathcal{P} = \{p_0 \cdots p_N\}, \qquad I_c = F(u, v)</tex-math></disp-formula>
where <inline-formula><tex-math>p</tex-math></inline-formula> is a two-dimensional position, and the subscript <inline-formula><tex-math>c</tex-math></inline-formula> is the index of <inline-formula><tex-math>p</tex-math></inline-formula> over all the
candidate positions inside the BIM model. Position <inline-formula><tex-math>p_c</tex-math></inline-formula> is extracted from database <inline-formula><tex-math>\mathcal{P}</tex-math></inline-formula>, where <inline-formula><tex-math>p_c \in \mathcal{P}</tex-math></inline-formula>.
<inline-formula><tex-math>F</tex-math></inline-formula> is the function that assigns each pixel <inline-formula><tex-math>(u, v)</tex-math></inline-formula> an indexed number to represent a class. A
segmented image for a candidate position is denoted as <inline-formula><tex-math>I_c</tex-math></inline-formula>, as shown in Fig. 3.</p>
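        <p>As a rough illustration of the candidate distribution step, the following Python sketch lays candidates on a regular 1-meter grid at a height of 1.8 m. It assumes a rectangular bounding box given in BIM coordinates; the function name candidate_grid and its parameters are hypothetical, and a practical version would additionally discard candidates that fall inside walls or outside the rooms of the BIM model.</p>
        <preformat>
```python
def candidate_grid(x_min, y_min, x_max, y_max, spacing=1.0, height=1.8):
    """Return candidate camera positions (x, y, z) on a regular grid.

    Assumption: the area of interest is an axis-aligned rectangle.
    """
    nx = int((x_max - x_min) // spacing) + 1   # grid points along x
    ny = int((y_max - y_min) // spacing) + 1   # grid points along y
    candidates = []
    for i in range(nx):
        for j in range(ny):
            candidates.append((x_min + i * spacing, y_min + j * spacing, height))
    return candidates


# Example: a 3 m by 2 m rectangle gives 4 x 3 = 12 candidates.
grid = candidate_grid(0.0, 0.0, 3.0, 2.0)
```
        </preformat>
        <p>Each returned tuple would then be used as the viewpoint from which a segmented panorama of the BIM model is rendered.</p>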
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Layout and semantic estimation</title>
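        <p>Given a query panorama image, we perform layout estimation and semantic segmentation. This
can be achieved using LGT-net [12] and Segformer [13], respectively.</p>
        <p>First, we employ the LGT-net model to estimate the layout of the given panoramic image. LGT-net
is a state-of-the-art deep learning model specifically designed for the task of layout estimation. The
output of the LGT-net model is a set of elevation angles that describe the estimated upper and
lower layout:
<disp-formula><label>(2)</label><tex-math>U_k[\alpha], \qquad L_k[\alpha]</tex-math></disp-formula>
where the subscript <inline-formula><tex-math>k</tex-math></inline-formula> represents the index of the query images and <inline-formula><tex-math>\alpha</tex-math></inline-formula> represents the azimuth angle
(0 to 359 degrees). <inline-formula><tex-math>U_k[\alpha]</tex-math></inline-formula> and <inline-formula><tex-math>L_k[\alpha]</tex-math></inline-formula> represent the elevation angle (0 to 180 degrees) at
each azimuth angle for the upper and lower layout, respectively, as shown in Fig. 4.</p>
        <p>The query image is also semantically segmented:
<disp-formula><label>(3)</label><tex-math>I_k = S(Q_k)</tex-math></disp-formula>
where <inline-formula><tex-math>S</tex-math></inline-formula> is the semantic segmentation, <inline-formula><tex-math>Q_k</tex-math></inline-formula> is the query panorama, and <inline-formula><tex-math>I_k</tex-math></inline-formula> is the segmented query image of
index <inline-formula><tex-math>k</tex-math></inline-formula>, as shown in Fig. 5.</p>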
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Indoor Semantic Skymask Generation</title>
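        <p>The indoor semantic skymask is a 360x4 matrix that serves as a detailed representation of the
spatial layout around a 3D location in an indoor environment. It encodes the elevation angle (0 to 180 degrees)
for each azimuth angle (0 to 359 degrees). The matrix contains:
1. <inline-formula><tex-math>U[\alpha]</tex-math></inline-formula> - Elevation angles of the upper layout at each azimuth (column 1)
2. <inline-formula><tex-math>L[\alpha]</tex-math></inline-formula> - Elevation angles of the lower layout at each azimuth (column 2)
3. <inline-formula><tex-math>D[\alpha]</tex-math></inline-formula> - Binary indicators for the presence or absence of doors at each azimuth (column 3)
4. <inline-formula><tex-math>W[\alpha]</tex-math></inline-formula> - Binary indicators for the presence or absence of windows at each azimuth (column 4)</p>
        <p>A segmented candidate image <inline-formula><tex-math>I_c</tex-math></inline-formula> can be converted to the elevation angles of the upper layout
<inline-formula><tex-math>U_c[\alpha]</tex-math></inline-formula> and lower layout <inline-formula><tex-math>L_c[\alpha]</tex-math></inline-formula> by masking the ceiling and floor labels and extracting the
boundaries at each azimuth angle, as shown in Fig. 6.
The binary indicators for the “door” <inline-formula><tex-math>D_c[\alpha]</tex-math></inline-formula> and “window” <inline-formula><tex-math>W_c[\alpha]</tex-math></inline-formula> can be extracted by observing whether
the respective labels are present at the azimuth angle.
The layouts and binary indicators are extracted and visualized on an indoor semantic skymask as
shown in Fig. 7.</p>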
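        <p>The conversion from a segmented panorama to a skymask can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the label image is assumed to have one row per elevation step (top to bottom) and one column per azimuth step, and the class ids and the function name skymask_from_labels are hypothetical.</p>
        <preformat>
```python
# Hypothetical class ids for the five static classes.
CEILING, WALL, FLOOR, DOOR, WINDOW = 0, 1, 2, 3, 4


def skymask_from_labels(labels):
    """Build one [upper, lower, door, window] row per azimuth column.

    labels[row][col] holds the class id of a pixel; rows run from the
    top of the panorama (elevation 0) to the bottom.
    """
    n_rows = len(labels)
    n_cols = len(labels[0])
    skymask = []
    for col in range(n_cols):
        column = [labels[row][col] for row in range(n_rows)]
        # Upper layout: first row from the top that is not ceiling.
        upper = next((r for r, c in enumerate(column) if c != CEILING), n_rows)
        # Lower layout: first row from the top that is floor.
        lower = next((r for r, c in enumerate(column) if c == FLOOR), n_rows)
        door = 1 if DOOR in column else 0
        window = 1 if WINDOW in column else 0
        skymask.append([upper, lower, door, window])
    return skymask


# Tiny 6-row by 3-column example (a real skymask would have 360 columns).
demo_labels = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 4, 1],
    [1, 1, 3],
    [2, 1, 3],
    [2, 2, 2],
]
demo_skymask = skymask_from_labels(demo_labels)
```
        </preformat>
        <p>In this toy example the second column contains a window and the third a door, so only those columns carry a non-zero binary indicator.</p>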
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Indoor Semantic Skymask Matching</title>
        <p>To find the similarity between two indoor semantic skymasks, we perform the following steps:</p>
      </sec>
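      <sec id="sec-3-5">
        <title>3.4.1. Calculate the normalized circular cross-correlation of the layouts</title>
        <p>For a pair of layouts, such as the upper layout of a query and candidate image, we calculate their
normalized circular cross-correlation <inline-formula><tex-math>R^{U}_{k,c}[m]</tex-math></inline-formula> as follows:
<disp-formula><label>(4)</label><tex-math>R^{U}_{k,c}[m] = \frac{\sum_{\alpha=0}^{M-1} U_k[\alpha] \cdot U_c[(\alpha + m) \bmod M]}{\sqrt{\left(\sum_{\alpha=0}^{M-1} U_k[\alpha]^2\right) \cdot \left(\sum_{\alpha=0}^{M-1} U_c[\alpha]^2\right)}}</tex-math></disp-formula>
Here, <inline-formula><tex-math>U_c[(\alpha + m) \bmod M]</tex-math></inline-formula> is the <inline-formula><tex-math>m</tex-math></inline-formula>-point circularly shifted version of the upper layout boundary
<inline-formula><tex-math>U_c[\alpha]</tex-math></inline-formula>, where <inline-formula><tex-math>M</tex-math></inline-formula> is 360. The cross-correlation measures the similarity between
the elevation profiles as a function of the circular shift <inline-formula><tex-math>m</tex-math></inline-formula> in the azimuth domain. The same computation applied to the lower layouts gives <inline-formula><tex-math>R^{L}_{k,c}[m]</tex-math></inline-formula>.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.4.2. Calculate the weighted average normalized cross-correlation</title>
        <p>We compute the weighted average normalized cross-correlation <inline-formula><tex-math>R_{k,c}[m]</tex-math></inline-formula> using the results from
the upper and lower layouts:
<disp-formula><label>(5)</label><tex-math>R_{k,c}[m] = 0.5 \cdot R^{U}_{k,c}[m] + 0.5 \cdot R^{L}_{k,c}[m]</tex-math></disp-formula>
The maximum value of <inline-formula><tex-math>R_{k,c}[m]</tex-math></inline-formula> signifies the highest similarity between the two boundaries, and
the corresponding shift <inline-formula><tex-math>m_{k,c}</tex-math></inline-formula> provides the optimal alignment for query <inline-formula><tex-math>k</tex-math></inline-formula> with respect to candidate
<inline-formula><tex-math>c</tex-math></inline-formula>.</p>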
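        <p>The layout correlation and the search for the best azimuth shift can be sketched in plain Python as below. The function names are hypothetical, the brute-force loop over all shifts is chosen for clarity rather than speed, and the sketch assumes that neither profile is all zeros (otherwise the normalization would divide by zero).</p>
        <preformat>
```python
import math


def circular_ncc(query, candidate):
    """Normalized circular cross-correlation for every shift m."""
    size = len(query)
    norm = math.sqrt(sum(q * q for q in query) * sum(c * c for c in candidate))
    return [
        sum(query[a] * candidate[(a + m) % size] for a in range(size)) / norm
        for m in range(size)
    ]


def layout_similarity(upper_q, lower_q, upper_c, lower_c):
    """Equal-weight average of the upper and lower correlations.

    Returns (best score, best shift).
    """
    r_upper = circular_ncc(upper_q, upper_c)
    r_lower = circular_ncc(lower_q, lower_c)
    combined = [0.5 * u + 0.5 * l for u, l in zip(r_upper, r_lower)]
    best_shift = max(range(len(combined)), key=combined.__getitem__)
    return combined[best_shift], best_shift


# Toy profiles with 4 azimuth bins: the query is the candidate rotated by one bin.
upper_c, lower_c = [10, 20, 30, 40], [100, 110, 120, 130]
upper_q, lower_q = [20, 30, 40, 10], [110, 120, 130, 100]
score, shift = layout_similarity(upper_q, lower_q, upper_c, lower_c)
```
        </preformat>
        <p>Because the toy query is an exact rotation of the candidate, the best shift is one bin and the score reaches its maximum of 1. With 360 azimuth bins the same correlation could be computed faster with a Fourier transform, which is a design choice left open here.</p>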
      </sec>
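      <sec id="sec-3-7">
        <title>3.4.3. Compare the door and window semantics</title>
        <p>For each query <inline-formula><tex-math>k</tex-math></inline-formula> with the corresponding shift <inline-formula><tex-math>m_{k,c}</tex-math></inline-formula>, we compare the door and window semantics
using binary comparison. We align the candidate semantics with the query semantics using the
shift:
<disp-formula><label>(6)</label><tex-math>\tilde{D}_{k,c}[\alpha] = D_c[(\alpha + m_{k,c}) \bmod M], \qquad \tilde{W}_{k,c}[\alpha] = W_c[(\alpha + m_{k,c}) \bmod M]</tex-math></disp-formula>
where <inline-formula><tex-math>M</tex-math></inline-formula> is 360. Next, we perform binary comparison:
<disp-formula><label>(7)</label><tex-math>S^{D}_{k,c}[\alpha] = D_k[\alpha] \cdot \tilde{D}_{k,c}[\alpha], \qquad S^{W}_{k,c}[\alpha] = W_k[\alpha] \cdot \tilde{W}_{k,c}[\alpha], \qquad \alpha \in [0, M-1]</tex-math></disp-formula></p>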
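        <p>The door and window products equal 1 when both semantics are present at the same azimuth angle and 0
otherwise. <inline-formula><tex-math>S^{D}_{k,c}[\alpha]</tex-math></inline-formula> and <inline-formula><tex-math>S^{W}_{k,c}[\alpha]</tex-math></inline-formula> represent the score of the door and window semantics,
respectively.</p>
      </sec>
      <sec id="sec-3-9">
        <title>3.4.4. Calculate the semantics similarity score</title>
        <p>We determine the overall similarity score by summing the door and window similarity values
and normalizing by the total number of azimuth angles <inline-formula><tex-math>M</tex-math></inline-formula>:
<disp-formula><tex-math>S^{\mathrm{sem}}_{k,c} = \frac{1}{M} \sum_{\alpha=0}^{M-1} \left( S^{D}_{k,c}[\alpha] + S^{W}_{k,c}[\alpha] \right)</tex-math></disp-formula>
The overall semantic similarity score <inline-formula><tex-math>S^{\mathrm{sem}}_{k,c}</tex-math></inline-formula> ranges from 0 to 1, with 1 representing the
highest similarity between the query and candidate indoor semantic skymasks.</p>
      </sec>
      <sec id="sec-3-8">
        <title>3.4.5. Combine the layout boundary and semantic similarity scores for candidate selection</title>
        <p>In order to determine the most likely candidate location for each query image <inline-formula><tex-math>k</tex-math></inline-formula> and candidate
<inline-formula><tex-math>c</tex-math></inline-formula>, we combine the maximum layout similarity, denoted as <inline-formula><tex-math>\max(R_{k,c}[m])</tex-math></inline-formula>, and the overall
semantic similarity score, represented by <inline-formula><tex-math>S^{\mathrm{sem}}_{k,c}</tex-math></inline-formula>. The combination is achieved by assigning
weights to each component, as shown in the following equation:
<disp-formula><label>(8)</label><tex-math>S_{k,c} = \left(0.3 \cdot \max(R_{k,c}[m])\right) + \left(0.7 \cdot S^{\mathrm{sem}}_{k,c}\right)</tex-math></disp-formula>
The weights for combining the layout and semantic similarity scores were chosen as those
yielding the best positioning performance on a set of 10 images, with respect to
their ground truth locations. The combined similarity score <inline-formula><tex-math>S_{k,c}</tex-math></inline-formula> for each candidate <inline-formula><tex-math>c</tex-math></inline-formula> and the
corresponding shift <inline-formula><tex-math>m_{k,c}</tex-math></inline-formula> is used to distribute a likelihood heatmap <inline-formula><tex-math>H_k</tex-math></inline-formula>, as illustrated in Figure 8.</p>
        <p>Figure 8: ... skymask to candidate semantic skymask (top), probability distribution of a query image (right).</p>
        <p>The heatmap <inline-formula><tex-math>H_k</tex-math></inline-formula> represents the likelihood that a query image <inline-formula><tex-math>k</tex-math></inline-formula> is located at the candidate <inline-formula><tex-math>c</tex-math></inline-formula>
location.</p>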
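        <p>The semantic comparison and the weighted combination of Sections 3.4.3 to 3.4.5 can be sketched as follows. The function names are hypothetical; the normalization by the number of azimuth bins follows the description in the text, and the default weights of 0.3 and 0.7 are the values reported above.</p>
        <preformat>
```python
def semantic_similarity(door_q, window_q, door_c, window_c, shift):
    """Sum of door and window agreements after shifting the candidate,
    normalized by the number of azimuth bins."""
    size = len(door_q)
    total = 0
    for a in range(size):
        shifted = (a + shift) % size          # aligned candidate azimuth
        total += door_q[a] * door_c[shifted]
        total += window_q[a] * window_c[shifted]
    return total / size


def combined_similarity(layout_score, semantic_score,
                        w_layout=0.3, w_semantic=0.7):
    """Weighted combination of layout and semantic similarity."""
    return w_layout * layout_score + w_semantic * semantic_score


# Toy example with 4 azimuth bins and a best layout shift of 1.
sem = semantic_similarity([1, 0, 0, 0], [0, 0, 1, 1],
                          [0, 1, 0, 0], [1, 0, 0, 1], shift=1)
likelihood = combined_similarity(1.0, sem)
```
        </preformat>
        <p>Evaluating combined_similarity for every candidate of one query image yields the values that populate the likelihood heatmap for that image.</p>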
      </sec>
      <sec id="sec-3-10">
        <title>3.4.6. Probability distribution</title>
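        <p>To calculate the probability distribution, we first compute the sum of likelihood values for each
query image <inline-formula><tex-math>k</tex-math></inline-formula> across all candidate positions:
<disp-formula><label>(9)</label><tex-math>T_k = \sum_{s=1}^{N} H_{k,s}</tex-math></disp-formula>
where <inline-formula><tex-math>s</tex-math></inline-formula> is the index of the candidates and <inline-formula><tex-math>N</tex-math></inline-formula> is the total number of candidates. Next, we derive the
probability at each candidate position <inline-formula><tex-math>c</tex-math></inline-formula> in the probability distribution <inline-formula><tex-math>P_k</tex-math></inline-formula>:
<disp-formula><label>(10)</label><tex-math>P_{k,c} = \frac{H_{k,c}}{T_k}</tex-math></disp-formula>
The resulting 2D array <inline-formula><tex-math>P_k</tex-math></inline-formula> represents the probability distribution corresponding to the likelihood
heatmap <inline-formula><tex-math>H_k</tex-math></inline-formula>. Each value <inline-formula><tex-math>P_{k,c}</tex-math></inline-formula> indicates the probability of query <inline-formula><tex-math>k</tex-math></inline-formula> occurring at position <inline-formula><tex-math>c</tex-math></inline-formula>.</p>
        <p>The maximum probability <inline-formula><tex-math>\max(P_k)</tex-math></inline-formula> corresponds to the expected global position <inline-formula><tex-math>c</tex-math></inline-formula> of query image
<inline-formula><tex-math>k</tex-math></inline-formula>. We denote this as <inline-formula><tex-math>p^{\mathrm{bim}}_k</tex-math></inline-formula>.</p>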
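        <p>The normalization of a likelihood heatmap into a probability distribution, and the selection of the expected global position, amount to a few lines of Python. In this sketch the heatmap is flattened into a list with one likelihood per candidate; the function names are hypothetical and the likelihoods are assumed to be non-negative with a positive sum.</p>
        <preformat>
```python
def probability_distribution(likelihoods):
    """Divide each likelihood by the sum over all candidates."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


def expected_position(candidates, probabilities):
    """Return the candidate position with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return candidates[best]


# Toy example with three candidates along a corridor.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
probs = probability_distribution([1.0, 3.0, 4.0])
best_position = expected_position(positions, probs)
```
        </preformat>
        <p>Here the third candidate receives half of the total probability mass and is therefore returned as the expected position.</p>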
      </sec>
      <sec id="sec-3-11">
        <title>3.5. Panorama SfM</title>
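        <p>In this study, we employed OpenMVG (Open Multiple View Geometry) software for panorama
Structure from Motion (SfM) to estimate the relative camera pose of each image [14]. OpenMVG
estimates the position of each camera used to capture the input images relative to a local
coordinate system with ambiguous scaling:
<disp-formula><label>(11)</label><tex-math>p^{\mathrm{sfm}}_k = [x_k, y_k]</tex-math></disp-formula>
<disp-formula><label>(12)</label><tex-math>\mathcal{P}^{\mathrm{sfm}} = \{p^{\mathrm{sfm}}_1 \cdots p^{\mathrm{sfm}}_K\}</tex-math></disp-formula>
where <inline-formula><tex-math>p^{\mathrm{sfm}}</tex-math></inline-formula> is a two-dimensional SfM estimated position, and the subscript <inline-formula><tex-math>k</tex-math></inline-formula> is the query index of
<inline-formula><tex-math>p^{\mathrm{sfm}}</tex-math></inline-formula>. Position <inline-formula><tex-math>p^{\mathrm{sfm}}_k</tex-math></inline-formula> is extracted from database <inline-formula><tex-math>\mathcal{P}^{\mathrm{sfm}}</tex-math></inline-formula>, where <inline-formula><tex-math>p^{\mathrm{sfm}}_k \in \mathcal{P}^{\mathrm{sfm}}</tex-math></inline-formula>.</p>
      </sec>
      <sec id="sec-3-12">
        <title>3.6. Generalized Procrustes Analysis</title>
        <p>The key part of the GPA algorithm is the computation of the optimal similarity transformation.
Let <inline-formula><tex-math>\mathcal{P}^{\mathrm{bim}}</tex-math></inline-formula> denote the set of expected global positions and <inline-formula><tex-math>\mathcal{P}^{\mathrm{sfm}}</tex-math></inline-formula> denote the set of positions to be aligned. The optimal
rotation matrix <inline-formula><tex-math>R</tex-math></inline-formula>, scaling factor <inline-formula><tex-math>s</tex-math></inline-formula>, and translation vector <inline-formula><tex-math>t</tex-math></inline-formula> are computed, giving
<disp-formula><label>(13)</label><tex-math>p^{\mathrm{gpa}}_k = s \cdot R \cdot p^{\mathrm{sfm}}_k + t</tex-math></disp-formula>
<disp-formula><label>(14)</label><tex-math>\mathcal{P}^{\mathrm{gpa}} = \{p^{\mathrm{gpa}}_1 \cdots p^{\mathrm{gpa}}_K\}</tex-math></disp-formula>
where each aligned position <inline-formula><tex-math>p^{\mathrm{gpa}}_k = [x_k, y_k]</tex-math></inline-formula> is a two-dimensional position expressed in the global frame.</p>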
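        <p>For two 2D point sets with known correspondences, a single similarity alignment of this kind has a closed-form least-squares solution. The sketch below is a simplified, single-pass stand-in for the iterative GPA procedure, not the implementation used in this study: it treats each position as a complex number, so that scale and rotation become one complex factor. The function names are hypothetical, and reflections are not handled.</p>
        <preformat>
```python
import cmath


def fit_similarity(source, target):
    """Least-squares scale, rotation (radians) and translation that map
    the source points onto the target points; both are lists of (x, y)."""
    z = [complex(x, y) for x, y in source]
    w = [complex(x, y) for x, y in target]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    num = sum((wk - w_mean) * (zk - z_mean).conjugate() for zk, wk in zip(z, w))
    den = sum(abs(zk - z_mean) ** 2 for zk in z)
    a = num / den                    # a = scale * exp(i * theta)
    b = w_mean - a * z_mean          # translation as a complex number
    return abs(a), cmath.phase(a), (b.real, b.imag)


def apply_similarity(points, scale, theta, translation):
    """Apply p' = s * R(theta) * p + t to a list of (x, y) points."""
    a = cmath.rect(scale, theta)
    b = complex(*translation)
    aligned = []
    for x, y in points:
        p = a * complex(x, y) + b
        aligned.append((p.real, p.imag))
    return aligned


# Toy example: the target is the source scaled by 2, rotated by 90 degrees
# and shifted by (3, 4).
sfm_positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
bim_positions = [(3.0, 4.0), (3.0, 6.0), (1.0, 4.0)]
scale, theta, shift_xy = fit_similarity(sfm_positions, bim_positions)
aligned_positions = apply_similarity(sfm_positions, scale, theta, shift_xy)
```
        </preformat>
        <p>In this toy case the recovered scale is close to 2 and the rotation close to 90 degrees, which illustrates how the BIM positions fix the otherwise ambiguous SfM scale.</p>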
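        <p>The GPA algorithm iteratively aligns <inline-formula><tex-math>\mathcal{P}^{\mathrm{sfm}}</tex-math></inline-formula> to the global reference frame by computing the optimal
similarity transformations and updating the reference shape until convergence. The transformed
positions are denoted as <inline-formula><tex-math>\mathcal{P}^{\mathrm{gpa}}</tex-math></inline-formula>.</p>
      </sec>
      <sec id="sec-3-15">
        <title>3.7. Non-linear least squares optimization</title>
        <p>Let us denote the set of query image positions from the GPA algorithm as <inline-formula><tex-math>p^{\mathrm{gpa}}_k</tex-math></inline-formula>, where <inline-formula><tex-math>k = 1, 2, \ldots, K</tex-math></inline-formula>,
and each query image position <inline-formula><tex-math>p^{\mathrm{gpa}}_k</tex-math></inline-formula> has an associated spatial probability distribution <inline-formula><tex-math>P_k</tex-math></inline-formula>.</p>
        <p>Additionally, we know the relative positions between the landmarks from SfM as <inline-formula><tex-math>r_{ij}</tex-math></inline-formula>, where <inline-formula><tex-math>r_{ij} = p^{\mathrm{gpa}}_i - p^{\mathrm{gpa}}_j</tex-math></inline-formula>.
Let the transformation be represented by a rotation matrix <inline-formula><tex-math>R</tex-math></inline-formula>, a scaling factor <inline-formula><tex-math>s</tex-math></inline-formula>, and a translation
vector <inline-formula><tex-math>t</tex-math></inline-formula>. The transformed query image positions can be written as <inline-formula><tex-math>\hat{p}_i</tex-math></inline-formula>, where <inline-formula><tex-math>\hat{p}_i = s \cdot R \cdot p^{\mathrm{gpa}}_i + t</tex-math></inline-formula>.
The non-linear least squares optimization problem can be formulated as:
<disp-formula><tex-math>\min_{R, s, t} \sum_{i=1}^{K} \sum_{j=1, j \ne i}^{K} f_{ij}(\hat{p}_i, \hat{p}_j, r_{ij})</tex-math></disp-formula>
where <inline-formula><tex-math>f_{ij}(\hat{p}_i, \hat{p}_j, r_{ij})</tex-math></inline-formula> is a function that computes the squared difference between the
transformed positions <inline-formula><tex-math>\hat{p}_i</tex-math></inline-formula> and <inline-formula><tex-math>\hat{p}_j</tex-math></inline-formula> and their expected relative positions <inline-formula><tex-math>r_{ij}</tex-math></inline-formula>, weighted by the spatial
probability distribution:
<disp-formula><tex-math>f_{ij}(\hat{p}_i, \hat{p}_j, r_{ij}) = P_i(\hat{p}_i) \cdot P_j(\hat{p}_j) \cdot \lVert \hat{p}_i - \hat{p}_j - r_{ij} \rVert^2</tex-math></disp-formula>
The optimization problem is solved using the Levenberg-Marquardt algorithm, and the final
positions are denoted as <inline-formula><tex-math>\hat{\mathcal{P}} = \{\hat{p}_1 \cdots \hat{p}_K\}</tex-math></inline-formula>.</p>
      </sec>
      <sec id="sec-3-13">
        <title>4. Experiment setup and results</title>
      </sec>
      <sec id="sec-3-14">
        <title>4.1. Dataset and Preprocessing</title>
        <p>In this section, we describe the dataset used to evaluate our proposed method for indoor camera
pose estimation. The dataset consisted of 14 panoramic images and the corresponding BIM model
of an office. The images were captured using an Insta360 ONE X2 camera mounted on a tripod
and covered a spatial extent of approximately 120 m<sup>2</sup>. The ground truth location of the images
was established by aligning them with the BIM model via visual overlap. The BIM models were
generated from architectural plans and as-built drawings, providing Level of Detail 2 (LOD2)
geometric and semantic information about the built environment.</p>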
        <p>To evaluate the performance of our proposed method, we compared it against two other
methods: ground truth (aligned via visual overlap) and indoor semantic skymask matching. The
ground truth method served as a reference for assessing the accuracy of the other methods, while
the indoor semantic skymask matching method represented a state-of-the-art approach for
indoor camera pose estimation. Our proposed method combined indoor semantic skymask
matching with panoramic photogrammetry-based Structure-from-Motion (SfM) techniques to
achieve more accurate and reliable camera pose estimation in architectural scenes. By comparing
the performance of these methods, we aimed to demonstrate the effectiveness of our proposed
approach and its potential for various applications, such as virtual reality simulations, robotics,
and 3D modeling.</p>
      </sec>
      <sec id="sec-3-16">
        <title>4.2. Results and Analysis</title>
        <p>The performance metrics presented in Table I and the accompanying statements provide
important information about the accuracy of camera pose estimation using different methods in
architectural scenes. The exact position and orientation are critical for various applications, such
as virtual reality simulations, robotics, 3D modeling, and building inspection. In addition, for
building inspection and review applications, it is important to ensure that the captured images
cover the entire building area with sufficient overlap for accurate reconstruction. The precision
requirements may vary depending on the specific application, but in general, the camera pose
estimation must be accurate enough to enable reliable localization and navigation in the
environment.</p>
        <p>In our study, we evaluated the performance of two methods for camera pose estimation: indoor
semantic skymask matching and BIM and SfM integration. The results showed that the indoor
semantic skymask matching method had an average positioning error of 1.28 meters to the
ground truth, which indicates relatively poor accuracy and might not be sufficient for some
applications. In contrast, the BIM and SfM integration method achieved a mean distance error of
0.59 meters with a heading error of 5.32°, which represents a significant improvement over the
indoor semantic skymask matching method. The standard deviation of the errors was also lower
for the BIM and SfM integration method, indicating more consistent performance across different
scenes.</p>
        <p>Table I. Methods compared: indoor semantic skymask matching, and the proposed BIM and SfM integration.</p>
        <p>Overall, these results demonstrate that fusing BIM and SfM information can lead to more accurate
and reliable camera pose estimation in architectural scenes. The precision requirements may
vary depending on the specific application, but our approach has shown promising results and
could be further optimized for specific use cases.</p>
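        <p>Error statistics of the kind reported above can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, the positions are 2D coordinates in meters, headings are in degrees, and the sample values in the example are made up for demonstration rather than taken from the experiment.</p>
        <preformat>
```python
import math
import statistics


def positioning_errors(estimated, ground_truth):
    """Euclidean distance between each estimated and true position."""
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]


def heading_error_deg(estimated_deg, truth_deg):
    """Smallest absolute angular difference in degrees (0 to 180)."""
    diff = (estimated_deg - truth_deg) % 360.0
    return min(diff, 360.0 - diff)


# Made-up sample values, for demonstration only.
errors = positioning_errors([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
mean_error = statistics.mean(errors)
spread = statistics.pstdev(errors)
wrap_error = heading_error_deg(350.0, 10.0)
```
        </preformat>
        <p>The modulo step keeps heading errors meaningful across the 0/360 degree wrap, so that headings of 350 and 10 degrees differ by 20 degrees rather than 340.</p>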
      </sec>
      <sec id="sec-3-17">
        <title>5. Conclusion</title>
        <p>In this paper, we propose a novel approach for accurate camera pose estimation in architectural
scenes by leveraging the complementary strengths of Building Information Modeling (BIM) and
panoramic photogrammetry-based Structure-from-Motion (SfM). Our method fuses BIM's global
positioning capabilities with SfM's precise relative positioning to overcome the limitations
associated with each technique when used in isolation. Specifically, our proposed approach
utilizes global positioning information from the BIM model to guide the camera pose estimation
process, providing a global coordinate system and geometric information to constrain the relative
positioning of panoramic images. This information is critical for achieving accurate and precise
camera pose estimation and improving the integration quality between captured images and BIM
models. By combining the benefits of both BIM and panoramic SfM, we aim to overcome the
limitations inherent in each individual technique and achieve superior performance in camera
pose estimation for architectural scenes.</p>
        <p>The proposed pipeline consists of four key steps: global positioning using indoor
semantic skymask matching, relative positioning estimation from panoramic SfM, rough
alignment using generalized Procrustes analysis, and refinement through non-linear
least-squares optimization. In the experiment performed in an office, the mean positioning
error was 0.59 m and the heading error was 5.32°. The contributions of the proposed
method are:
• The formulation of positioning as an indoor semantic skymask problem enables us to apply
an existing wide variety of advanced matching metrics to this problem.
• Detection and exclusion of dynamic objects to prevent false measurements.
• By integrating the BIM model, replete with accurate scale information, our approach effectively
recovers the appropriate scale of the SfM reconstruction.</p>
        <p>Considering the preliminary results presented in this paper, we believe the proposed method can
provide accurate positioning and heading estimation to support various indoor applications.
Furthermore, this research has significant implications for the AEC domain, as the enhanced
accuracy and efficiency offered by our method can lead to considerable advancements in various
applications, such as augmented reality, facility management, and construction monitoring.</p>
      </sec>
      <sec id="sec-3-18">
        <title>6. Future Works</title>
        <p>Several potential future developments on the proposed method are suggested.
• Simultaneous differential rendering and Factor Graph Optimization: The proposed
method uses SfM to estimate the relative position of the images and semantic skymask
matching to estimate the global position. Factor graph optimization can then be applied to
optimize the pose of the images simultaneously to maximize the overall indoor semantic
matching score for all the images. Future work will explore the use of more advanced
optimization techniques, such as bundle adjustment, to further improve accuracy.
• Fault detection: A simulation platform will be created to allow the injection of faults, such as
occlusions, lighting changes, and object movement, and to evaluate the relationship between
fault semantic indoor matching estimation and the accuracy of positioning. The proposed
method will be evaluated under different fault scenarios to assess its robustness and
effectiveness in real-world construction projects.
• Dynamic update: Dynamic updates will be introduced to automatically update the BIM
model when changes are detected in the images, improving its real-time performance. Future
work will explore the use of machine learning techniques, such as deep learning, to enable
more accurate and efficient dynamic updates, especially when dealing with complex and
dynamic scenes.</p>
      </sec>
      <sec id="sec-3-19">
        <title>7. References</title>
        <p>[1] A. H. H. A. Ahmed, "Survey on indoor positioning applications based on different
technologies," presented at the 12th International Conference on Mathematics, Actuarial
Science, Computer Science and Statistics (MACS), Karachi, Pakistan, 2018.</p>
        <p>[2] C. Marouane, M. M. Feld, and M. Werner, "Visual positioning systems - An extension to
MoVIPS," presented at the 2014 International Conference on Indoor Positioning and
Indoor Navigation (IPIN), Busan, Republic of Korea, 2014.</p>
        <p>[3] M. Shu, G. Chen, Z. Zhang, and L. Xu, "Accurate Indoor 3D Location Based on MEMS/Vision
by Using A Smartphone," presented at the 2022 IEEE 12th International Conference on
Indoor Positioning and Indoor Navigation (IPIN), Beijing, People's Republic of China, 2022.</p>
        <p>[4] J. Xue, X. Hou, and Y. Zeng, "Review of image-based 3D reconstruction of building for
automated construction progress monitoring," Applied Sciences, vol. 11, no. 17, p. 7840,
2021.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>