
Ital-IA

3D reconstruction methods in industrial settings: a comparative study for COLMAP, NeRF and 3D Gaussian Splatting

Zeno Sambugaro

Lorenzo Orlandi

Nicola Conci

DISI, University of Trento, via Sommarive 5, Povo, 38123, Italy

2024

4 29 30

3D rendering techniques have undergone a rapid evolution with the emergence of novel and advanced methodologies, redefining the boundaries of realism and computational efficiency. This study explores recent advancements in the field, comparing established approaches like photogrammetry with software such as COLMAP against the new frontiers opened by emerging view synthesis approaches like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting. In this paper, we present a comprehensive comparison of the described methods tailored for industrial applications, where the data acquisition is generally conducted by human operators employing handheld devices.

Keywords: Photogrammetry, NeRF, Gaussian Splatting, 3D Reconstruction

1. Introduction

[Figure 1: Overview of the compared reconstruction pipelines. Labels recovered from the diagram: Geo-referenced image acquisition; Image position refinement; Colmap; Sparse photogrammetry; Dense point cloud; nerfstudio; 3DGS; Neural radiance field; Gaussian scene representation; Point cloud generation; Mesh generation; Georeferencing; Comparison of the methods.]

3D reconstruction is crucial in fields like construction, excavation, and worksite management. Employing multi-view reconstruction techniques, scenes are captured from various angles using 2D images. This enables detailed monitoring of project progress and provides the ability to virtually navigate sites, both during and after completion, utilizing geo-referencing and virtual reality. Among the most common photogrammetric solutions for 3D reconstruction, we focus on COLMAP [3] for its open-access policy and continual improvements. COLMAP converts 2D images into comprehensive 3D models, including point clouds and textured meshes, enabling advanced spatial analyses. However, photogrammetric reconstruction encounters several challenges, particularly when dealing with objects characterized by complex optical properties such as high absorbency, reflectivity, or scattering. These methods can also suffer from variations in lighting conditions, including shadows, glare, or inconsistent illumination, as well as from surfaces with uniform or repetitive textures and complex shapes or geometries.

NeRF-based technologies offer cutting-edge solutions to overcome limitations in scene representation by modeling the scene with particles characterized by density and color. This study compares two radiance-field-based techniques, Nerfacto (a variation of InstantNGP [4] in nerfstudio [5]) and SuGaR [6] (a variation of 3D Gaussian Splatting [2]), against traditional photogrammetry methods.

2. Background

Neural Radiance Fields. Neural Radiance Fields (NeRF) have emerged as a significant advancement in the field of 3D scene reconstruction. The scene is represented with a 5D function that relates each spatial point $(x, y, z)$ to the radiance emitted in any direction, defined by azimuthal and polar angles $(\theta, \phi)$. The output consists of a volume density $\sigma$ and an RGB color $\mathbf{c}$ that varies with the viewing direction. This relation is formulated through a Multi-Layer Perceptron (MLP) $F_\Theta$, expressed as:

$$F_\Theta : (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma) \tag{1}$$

where $\mathbf{x} = (x, y, z)$ denotes the coordinates within the scene, and $\mathbf{d}$, parameterized by $(\theta, \phi)$, is the 3D Cartesian unit vector indicating the viewing direction. The color $\mathbf{c} = (r, g, b)$ changes with the viewing angle, while the volume density $\sigma$ remains invariant. The usage of neural volume rendering pipelines, over traditional point clouds or meshes, enables the modeling of variations in color and illumination.

InstantNGP [4], short for Instant Neural Graphics Primitives, is a variant that enhances NeRF's framework to expedite scene reconstruction significantly. By refining the neural network's architecture and computations, InstantNGP reaches high-quality results more quickly, positioning it as a viable option for real-time applications.

Nerfstudio introduces a platform, built around the Nerfacto model, to streamline NeRF-based model creation and manipulation. Nerfacto integrates insights from very recent research, including Mip-NeRF 360 [8], InstantNGP [4], and Ref-NeRF [7], focusing on optimizing camera views and sampling processes.

3D Gaussian Splatting for Real-Time Radiance Field Rendering. 3D Gaussian Splatting [2], a novel approach to scene representation, contrasts with neural fields by optimizing an explicit point-based scene model. Each point in this representation is associated with various attributes: a position $\mathbf{p} \in \mathbb{R}^3$, an opacity $o \in [0, 1]$, third-degree spherical harmonics (SH) coefficients $\mathbf{k} \in \mathbb{R}^{16}$, a 3D scale $\mathbf{s} \in \mathbb{R}^3$, and a 3D rotation $R \in SO(3)$ represented by a 4D quaternion $\mathbf{q} \in \mathbb{R}^4$. Rendering to the image plane involves accumulating the color from correctly sorted points using the equation:

$$\mathbf{c} = \sum_{i=1}^{N} \mathbf{c}_i \alpha_i T_i, \quad \text{where} \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) \tag{2}$$

with $\mathbf{c}_i$ determined by the SH coefficients $\mathbf{k}$ and $\alpha_i$ calculated from the projected 2D Gaussian with covariance $\Sigma' = J W \Sigma W^T J^T$, incorporating the per-point opacity $o$, the viewing transformation $W$, and the Jacobian $J$ of the affine approximation of the projective transformation. The 3D covariance matrix $\Sigma$ is kept positive semi-definite by building it from the scale matrix $S = \mathrm{diag}(s_1, s_2, s_3)$ and the rotation $R$, following $\Sigma = R S S^T R^T$.

Building upon the principles of 3D Gaussian Splatting, Surface Gaussian Approximation for Rendering (SuGaR) [6] leverages Gaussian functions to model object surfaces within a scene, achieving precision in handling occlusions and detailed surface texturing through Gaussian "splats" projected onto a volume grid. Each splat influences the volume's density and color, based on its spatial location and Gaussian distribution, described mathematically as:

$$G(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \exp\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) \tag{3}$$

3. Methodology

This study aims to evaluate the effectiveness and potential benefits of Neural Radiance Fields (NeRF) against traditional image-based reconstruction techniques, particularly photogrammetry, in the context of augmented/virtual reality applications. Our focus is on challenging outdoor scenarios. We include excavation sites and playground objects, which are characterized by unbounded environments and non-Lambertian surfaces. To facilitate a direct comparison, the same dataset of images, captured with geo-referencing, is used across all reconstruction methods. This standardized approach ensures that differences in reconstruction quality and efficiency can be attributed solely to the methodologies rather than to a bad alignment.

3.1. Dataset acquisition

The datasets are collected using a system comprised of two devices: a smartphone and an RTK-GNSS receiver, spatially calibrated as described in [9]. These devices ensure highly accurate pose information for all the collected scenes. The study aims to analyze industrial applications; therefore, as scenarios, we have selected some simple playground objects mixed with real excavation scenarios, where the reconstruction is more challenging. Our datasets consist of 7 playground scenarios and 3 excavation scenarios.

Acquisition Process. The dataset has been acquired following the standard procedure that an operator would follow when working in a given site. The trajectory reflects a rotation around the object, maintaining the capture at eye level. During acquisition, the frame rate is set at 5 frames per second with a resolution of 1280 x 720. For each acquired image, the error of the geopose data is always less than 3 cm in translation and less than 1 degree in rotation. We maintain a uniform velocity during acquisition, so that the number of images for each scenario depends on the length of the trajectory. The playground dataset comprises approximately 200 images, while the excavation dataset contains around 500 images, which reflects longer trajectories.

3.2. Methodologies Employed

Three distinct reconstruction methodologies were applied to the captured datasets; an overview is shown in Figure 1:

1. Photogrammetry: The classical photogrammetric procedure involves estimating camera orientation parameters for sparse point cloud construction and generating a dense point cloud; mesh creation and texture extraction complete the reconstruction process. For this purpose we used COLMAP, with all phases conducted in high-quality mode to ensure maximum detail and accuracy.

2. NeRF-Based Reconstruction: The training of a Neural Radiance Field reconstruction requires known camera poses as input. We use nerfstudio [5], and in particular "nerfacto", a model strongly based on InstantNGP [4], chosen for its fast training and inference. We then extract the dense point clouds and textured mesh through nerfstudio's API; in particular, for mesh extraction we used Poisson reconstruction.

3. Gaussian Splatting (SuGaR): Similarly to NeRF, this method requires known camera poses as input. This explicit model is then trained to approximate the radiance field of the scene. The training of SuGaR involves more than one step: it starts with 7k iterations of standard 3D Gaussian Splatting, followed by 7k iterations of SuGaR fine-tuning to extract a more precise geometry.

The acquisition of our dataset incorporated geo-referencing, so as to simplify the alignment process for the reconstructions. The only exception is NeRF: as an implicit framework, this model normalizes its coordinates between -1 and 1. This requires an additional calibration step that incorporates the scale and translation derived from the geo-referenced input to ensure accurate alignment. For the dataset to be used in training, we first need to estimate the camera parameters from the input images. This estimation is necessary because the neural network requires knowledge of both the camera positions and the corresponding images to accurately generate the scene representation. To achieve this, we used COLMAP, a software package known for its application of Structure from Motion (SfM) techniques [3], which estimate three-dimensional structures from two-dimensional image sequences.

To facilitate comparison, given that outputs from photogrammetry are not directly comparable with those from neural fields or Gaussian splatting, we incorporate an additional conversion phase. Nerfstudio provides functionality to convert NeRF outputs into point clouds and meshes. The point clouds are easily exported since the neural representation can be inspected at any 3D point. For the meshes, this conversion employs the marching cubes algorithm and the Poisson surface reconstruction method. In the SuGaR framework the mesh extraction phase is also done through marching cubes or Poisson surface reconstruction. In this case the reconstruction is enhanced thanks to the precise estimation of the normals of the sampled points. To obtain an accuracy metric we derive a cloud-to-cloud comparison using the CloudCompare software.

3.3. Comparative Analysis Framework

The comparative analysis between these methods focuses on the following key metrics: (i) Accuracy and Detail Resolution, to evaluate the fidelity of the reconstructed models to the original scenes, and (ii) Processing Time, to assess the efficiency of each methodology in terms of computational resources and time required for reconstruction. To compare the level of fidelity of the reconstructed models we propose using the point clouds generated by the studied methods. In this way we can obtain a quantitative metric. More specifically, we measure the cloud-to-cloud deviation of the methods based on radiance fields with respect to the reconstruction obtained with classical photogrammetry. This measure is an absolute value, which does not tell which method is performing better; it only informs about the deviation from one reconstruction to the other. Therefore, we also show the rendering results in Figure 3, in order to assess the performance in graphical terms. In addition to this quantitative result we also propose a qualitative comparison of the resulting meshes, comparing the proposed methodologies in Figure 2.

[Figures 2 and 3: panel labels recovered from the original layout: Colmap, nerfstudio, SuGaR; Rendering results (NeRF, SuGaR); Cloud to Cloud distance (NeRF, SuGaR), with a color scale from 0 cm to 10 cm.]

4. Discussion

We show a comparison of NeRF-based techniques against traditional photogrammetry using COLMAP. All models are trained on an NVIDIA RTX 3090 GPU. The assessment focuses on their effectiveness in view synthesis and 3D reconstruction, particularly in expansive, unbounded environments. The results of our analysis highlight that the three methodologies produce high-quality point clouds, with very close results especially in the fine structures of the 3D scene, as illustrated in Figure 2. Notably, NeRF's output shows a denser point cloud around high-frequency scene features but has gaps in smoother regions. The radiance field rendering results show that the quality of the reconstructed views is very high, and it is very difficult to say whether nerfstudio or SuGaR presents the best result. However, the comparison illustrated in Figure 2 highlights a failure case of nerfstudio, with a red area within the scene's object of interest indicating a high cloud-to-cloud distance. This issue not only produces a discrepancy in the point cloud representation but also results in blurring within the targeted region of the neural reconstruction.

Considering the extensive usage of meshes in VR and AR applications, for their simplicity and low memory footprint, we present a comparison of the meshes produced with the three methodologies. In Figure 2 we show the obtained meshes, together with a detail of the reconstruction in the region of the 3D scene with finer details. As depicted in Figure 2, there is a noticeable variance in detail and texture among the outputs. The COLMAP mesh, while being consistent, falls short in representing thin structures. In contrast, the NeRF mesh shows greater detail but presents some holes. The SuGaR mesh stands out for its superior detail, accurately capturing complex structures where others falter, thanks to its precise normal calculations.

Another point to consider is the difference in accuracy between the two scenarios we have examined. The playground scene is easier and, in fact, yields better results than the excavation case. The complexity of the excavation scenario reduces the reconstruction performance, especially with the SuGaR and NeRF methods. Figure 3 shows many artifacts on the road surface, visible in the cloud-to-cloud distance analysis, especially in the case of SuGaR, and there are also many holes, especially at the bottom of the excavation.

Finally, we analyze the processing time for each method. Regarding this aspect, there is no difference between SuGaR and COLMAP. Instead, the best performance is observed with the InstantNGP-based Nerfacto model, which takes about a quarter of the time compared to the other methods. Additional materials regarding our analysis can be accessed through this link (footnote 1).

5. Conclusions

In this paper we provide a comparative analysis of reconstruction methods based on neural radiance fields and classical photogrammetry for unbounded scenarios. We show results in playgrounds and excavation sites, to assess the performance in easy and complex scenarios. In our set-up, photogrammetry has provided superior reliability in complex scenes, especially on the excavation sites, and also better results in modeling completely flat areas, where the NeRF methods present some artifacts. Although training/reconstruction times are generally not the main concern in the reconstruction of working areas, some applications might benefit from fast reconstruction times. In this respect nerfstudio provided the best reconstruction speed, requiring just 15 minutes for the training of a scene. An important aspect that needs to be analyzed is the reliance of the current rendering pipelines for virtual and augmented reality on mesh representations. This favors classical photogrammetry, since its final goal is to obtain a mesh representation. In contrast, neural rendering technologies focus primarily on view synthesis, offering an alternative that eliminates the need for mesh generation. SuGaR and, more generally, 3D Gaussian Splatting techniques produce an explicit representation that allows for the splatting of Gaussians in the same way traditional methods splat triangles. This feature enables SuGaR to render the scene in real time, making it possible to use it in existing pipelines. In the future, we see 3D Gaussian Splatting as a potential replacement for mesh representations, especially in scenarios requiring the realistic reconstruction of complex environments.

This research is supported by the project DIMOTY, funded by the Autonomous Province of Trento under the LP6/99 framework.

References

[1] Mildenhall, Ben, et al. "NeRF: Representing scenes as neural radiance fields for view synthesis." Communications of the ACM 65.1 (2021): 99-106.

[2] Kerbl, Bernhard, et al. "3D Gaussian splatting for real-time radiance field rendering." ACM Transactions on Graphics 42.4 (2023): 1-14.

[3] Schonberger et al. "Structure-from-motion revisited." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[4] Müller et al. "Instant neural graphics primitives with a multiresolution hash encoding." ACM Transactions on Graphics (TOG) 41.4 (2022): 1-15.

[5] Tancik et al. "Nerfstudio: A Modular Framework for Neural Radiance Field Development." ACM SIGGRAPH 2023.

[6] Chen et al. "SuGaR: Pre-training 3D Visual Representations for Robotics." arXiv preprint arXiv:2404.01491, 2024.

[7] Verbin et al. "Ref-NeRF: Structured view-dependent appearance for neural radiance fields." 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Barron et al. "Mip-NeRF 360: Unbounded anti-aliased neural radiance fields." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5470-5479, 2022.

[9] Lorenzo, Kevin, et al. "Spatial-Temporal Calibration for Outdoor Location-Based Augmented Reality." IEEE Sensors Journal (2024), accepted for publication.
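As a supplementary illustration of two computations described in the text, the following minimal Python sketch shows (a) the front-to-back accumulation of color over depth-sorted points from Equation (2), and (b) a cloud-to-cloud deviation of the kind used as the accuracy metric in Section 3.3. It is a sketch under stated assumptions, not the implementation used in the study: the function names `composite_color` and `cloud_to_cloud_distances` are hypothetical, the per-point alpha values are taken as given rather than derived from projected Gaussians, and the nearest-neighbor search is brute force, whereas CloudCompare relies on its own accelerated data structures and optional local surface models.

```python
import math


def composite_color(colors, alphas):
    """Accumulate color over depth-sorted points, as in Equation (2).

    colors: list of (r, g, b) tuples, nearest point first (assumed sorted).
    alphas: list of per-point alpha values in [0, 1], in the same order.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # T_1 = 1, the empty product
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance  # alpha_i * T_i
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha  # T_{i+1} = T_i * (1 - alpha_i)
    return tuple(pixel)


def cloud_to_cloud_distances(compared, reference):
    """For each point of `compared`, distance to its nearest point in `reference`.

    Brute-force O(N * M) search, only suitable for small illustrative clouds.
    The result is unsigned: it measures deviation, not which cloud is better.
    """
    return [min(math.dist(p, q) for q in reference) for p in compared]


if __name__ == "__main__":
    # A half-transparent red point in front of an opaque blue point.
    print(composite_color([(1, 0, 0), (0, 0, 1)], [0.5, 1.0]))  # (0.5, 0.0, 0.5)

    # Hypothetical toy clouds in meters: a reference and a compared cloud.
    photogrammetry = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
    radiance_field = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    distances = cloud_to_cloud_distances(radiance_field, photogrammetry)
    print(distances)  # [1.0, 0.0]
    print(sum(distances) / len(distances))  # mean deviation: 0.5
```

Because the distance is computed from the compared cloud to the reference cloud only, it is not symmetric; swapping the two arguments generally gives different values, which is consistent with the remark in Section 3.3 that the measure reports a deviation between reconstructions rather than a ranking.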