<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Large Scale Point Cloud Semantic Segmentation for Indoor Digital Twin Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johann Nikolai Hark</string-name>
          <email>j.hark@dcaiti.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernd Schaeufele</string-name>
          <email>bernd.schaeufele@fokus.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilja Radusch</string-name>
          <email>ilja.radusch@dcaiti.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Daimler Center for Automotive Information Technology Innovations (DCAITI)</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer Institute for Open Communication Systems (FOKUS)</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Positioning and Indoor Navigation</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automated processing of large amounts of sensor data poses a significant challenge in many fields, particularly with LiDAR point clouds. One field of application for LiDAR is the creation of high-definition (HD) maps, which are utilized in various domains, such as mobile indoor navigation. For visually impaired individuals, indoor navigation is crucial for enhancing their quality of life and independence. In this work, we focus on two advanced methods for semantic segmentation of point clouds generated using LiDAR and 360° camera sensors. These methods generate digital twins to create accurate and detailed representations of physical environments. The digital twins can be used to produce HD maps for indoor navigation. The first method involves converting the point cloud into a graph structure known as a superpoint graph (SPG). The second method, RandLA-Net, is based on efficient random sampling of points within the point cloud. Both methods are evaluated with our dataset, achieving an overall accuracy of up to 89%. This reproduced performance is consistent with their results on the public point cloud benchmark from Stanford University, demonstrating the efficiency of state-of-the-art semantic segmentation methods.</p>
      </abstract>
      <kwd-group>
        <kwd>indoor mapping</kwd>
        <kwd>SLAM</kwd>
        <kwd>semantic segmentation</kwd>
        <kwd>digital twin</kwd>
        <kwd>point clouds</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Accurate HD maps are essential for indoor navigation for visually impaired people [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], but also the foundation
for other applications, such as automated driving [
        <xref ref-type="bibr" rid="ref2">2, 3</xref>
        ]. Creating digital maps manually via CAD
and laser measurements is a laborious task. However, advancements in 3D acquisition technologies
such as LiDAR, TOF, and RGB-D cameras have made these sensors more accessible and affordable.
SLAM techniques are used to convert 3D sensor data into point clouds (PCs), which form the basis of
automated generation of digital twins. State-of-the-art SLAM methods can process large amounts of
data in real-time and map extensive environments efficiently.
      </p>
      <p>We use two indoor sensor platforms with modular hardware: two LiDAR sensors, a 360° camera, and
an industrial-grade IMU. Our backpack supports 3D SLAM with six degrees of freedom (DOF), and the
trolley is designed for 2D SLAM with three DOF, reducing complexity and potential mapping errors.
We conduct mapping experiments in various indoor environments at the Fraunhofer Institute for Open
Communication Systems (FOKUS) and the Daimler Center for Automotive Information Technology
Innovations (DCAITI). The sensor data is processed with the Google Cartographer SLAM algorithm
[4], requiring time synchronization and sensor calibration. The recordings contain over 40 minutes of
sensor data from various physical environments, generating PCs with more than 673 million points.
The data acquisition pipeline generates models of indoor environments with pointwise RGB coloring.</p>
      <p>Subsequently, the PCs are processed for obtaining semantic labels to create a digital twin. For this,
we apply semantic segmentation, i.e., classify each point into a predefined class. For the supervised
semantic segmentation, the PCs are manually annotated in 11 semantic classes for training. Finally, we
evaluate two state-of-the-art semantic segmentation methods, achieving significant results. In particular,
RandLA-Net shows an overall accuracy of 89%, demonstrating its effectiveness on the generated PCs.</p>
      <p>The paper is structured as follows. Existing approaches for semantic segmentation of large PC data
are discussed in section 2. In section 3, the hardware platform and the data recording process are
described. Subsequently, in section 4, the method for automated digital twin generation is explained.
The results are evaluated in section 5, before a conclusion and an outlook are given in section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Creating digital twins of indoor environments is costly and time-consuming, especially when updating
existing plans. Mobile systems are more flexible and efficient than static tripod solutions such as
the Matterport scanner, simplifying mapping by eliminating occlusion studies and multiple measurement
positions. Current mobile indoor mapping systems, categorized by their physical configurations
(handheld, backpack, and trolley) [5], typically use LiDAR or RGB-D cameras. Handheld devices, like the
ZEB-HORIZON and ZEB-REVO-RT [6], prioritize a lightweight design to prevent operator fatigue and
often incorporate 2D LiDAR sensors. Backpack systems, although heavier, distribute weight across
the user’s body, allowing more robust sensors, typically LiDAR for its longer range. Examples include
Google’s Cartographer backpack [4] and NavVis VLX [7]. Trolley systems, the least weight-constrained,
integrate high-quality sensors like terrestrial LiDAR sensors for superior mapping accuracy.</p>
      <p>SLAM algorithms rely on available sensors, with LiDAR being common. Scan-to-scan matching
with algorithms like ICP [8] matches consecutive LiDAR scans to compute relative pose changes but
accumulates errors over time. Scan-to-map matching mitigates this error accumulation and is robust
and efficient, but it requires good initial pose estimates [4]. The LOAM algorithm [9] uses LiDAR points,
optionally combined with IMU data, to achieve low drift and low computational complexity. LOAM
splits SLAM into high-frequency, low-fidelity odometry (scan-to-scan) and slower, high-accuracy PC
registration (scan-to-map). VLOAM [10] extends LOAM by adding high-frequency monocular camera
data for robustness to heavy motion and visual feature scarcity.</p>
      <p>Particle-filter methods minimize local error accumulation but can be resource-intensive, though
smaller dimensional feature representations can help [11]. Loop closure, essential for global SLAM, uses
histogram-based matching [12], feature detection [13], and graph-based methods [14]. Optimization
techniques minimize errors from these constraints [15], and recent approaches incorporate semantic
segmentation [16]. Extensive reviews of SLAM algorithms are available [17].</p>
      <p>Initial success in 2D image segmentation was achieved using advanced neural network architectures
such as [18, 19, 20]. However, these architectures cannot be directly applied to PCs, whose irregular,
non-uniform nature and varying density complicate the use of standard convolutional neural networks.</p>
      <p>Thorough reviews on PC segmentation are given by Grilli et al. [21] and Lu et al. [22], categorizing
methods into projection-based, discretization-based, point-based, and hybrid approaches. Projection and
discretization convert PCs into regular representations like multi-view or volumetric forms, enabling
the use of 2D architectures [23]. Discretization methods like SEGCloud [24] preserve neighborhood
structures but can introduce artifacts and information loss.</p>
      <p>Point-based methods, such as PointNet [25], process PCs directly, and some of them, e.g.,
RandLA-Net [26], capture local geometries. Point convolution methods like KPConv [27] offer efficient solutions. RNN-based
methods, such as RSNet, capture context features. Graph-based methods, like the approach by Landrieu
and Simonovsky [28], use graph neural networks to detect geometric structures.</p>
      <p>In [29], an indoor semantic segmentation approach based on the FCGF architecture [30] is shown,
which allows using only partially labeled PC datasets. By registering PCs through overlapping regions,
labels can be transferred. UnScene3D [31] is an unsupervised method for 3D indoor LiDAR segmentation.
It applies instance segmentation with a 3D Transformer architecture [32], but it works on RGB-D sensor
data. The authors in [33] present Swin3D, a pretrained Transformer backbone for 3D indoor
semantic segmentation, which is an extension of the Swin Transformer architecture [34]. The method
is evaluated with real world data, but the model is trained on the synthetic Structured3D dataset [35].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data Acquisition</title>
      <p>One main design goal of the indoor mapping platforms is high-quality, comparable sensor data on
different platforms. Our design is compact, lightweight, yet stiff for mounting on a backpack or a trolley.
It features optimized sensor mounting positions to avoid interference and aid calibration. Mounting the
camera directly under the main LiDAR ensures an unobstructed horizontal field of view. The platform,
made of lightweight aluminum with 3D-printed parts, holds two LiDARs, an industrial-grade IMU, and
a dual-lens fish-eye camera. The sensors and sampling rates are listed in Table 1.</p>
      <p>The different sensor setups are shown in Fig. 1. The backpack platform (Fig. 1a) is lightweight yet
stiff, allowing the operator to move freely while mapping larger environments. The sensor platform
(Fig. 1b) can be adjusted for different heights of the operator. The trolley (Fig. 1c) reduces DOF in
motion to simplify the underlying SLAM problem. Built from aluminum extrusions, it features a sensor
platform and a table top for a notebook. Wheel-encoders are installed for additional mapping accuracy,
as shown in Fig. 2b, making the trolley a differential drive robot.</p>
      <p>Camera and LiDAR calibration is essential [36], but calibrating multiple cameras and
LiDAR sensors is complex and time-consuming. As all sensors needing calibration are mounted on the
same platform, calibration is required only once. The camera software already provides the intrinsic
parameters, so only extrinsic calibration is required. The sensor platform is mounted on the trolley and placed in a static
environment with various test objects (Fig. 2a), including chess boards and colored objects at different
distances and angles. Single shots are recorded with the trolley from different locations.</p>
      <p>Recordings are used for manual calibration. LiDAR-to-LiDAR calibration aligns points from
overlapping areas of both LiDARs. The camera-LiDAR calibration involves only two parameters, making
manual calibration sufficient. The process starts with initial extrinsic parameters (translation and
rotation) from the CAD model. LiDAR points are colored by projecting coordinates from the LiDAR
frame to the image frame, assigning corresponding pixel colors to each point.</p>
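      <p>The coloring step can be illustrated with a minimal sketch. It assumes that the 360° images are available in an equirectangular layout, which is an assumption of this example and not stated above. The function names project_to_equirect and colorize are hypothetical, and occlusion handling is deliberately omitted.</p>
      <preformat><![CDATA[
```python
import math


def project_to_equirect(point, rotation, translation, width, height):
    """Project one LiDAR point to (u, v) pixel coordinates.

    rotation (3x3 nested list) and translation (length 3) are the extrinsic
    parameters mapping LiDAR coordinates into the camera frame. The
    equirectangular image model is an assumption of this sketch.
    """
    # Rigid transform into the camera frame: p_cam = R * p_lidar + t
    p_cam = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    dist = math.sqrt(sum(c * c for c in p_cam))
    if dist == 0.0:
        return None  # point coincides with the camera center
    azimuth = math.atan2(p_cam[1], p_cam[0])  # in [-pi, pi]
    elevation = math.asin(max(-1.0, min(1.0, p_cam[2] / dist)))  # in [-pi/2, pi/2]
    # Azimuth maps to the image column, elevation to the image row.
    u = int((azimuth + math.pi) / (2.0 * math.pi) * width) % width
    v = min(int((math.pi / 2.0 - elevation) / math.pi * height), height - 1)
    return u, v


def colorize(points, image, rotation, translation):
    """Return one color (or None) per point; image is indexed as image[v][u]."""
    height, width = len(image), len(image[0])
    colors = []
    for point in points:
        pixel = project_to_equirect(point, rotation, translation, width, height)
        if pixel is None:
            colors.append(None)
        else:
            u, v = pixel
            colors.append(image[v][u])
    return colors
```
]]></preformat>
      <p>Because this sketch assigns a color to every point regardless of visibility, points hidden from the camera would receive the color of the occluding object.</p>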
    </sec>
    <sec id="sec-4">
      <title>4. Digital Twin Generation</title>
      <p>We use Google Cartographer (GC), a real-time 2D and 3D SLAM system for multiple sensor
configurations and various platforms [4]. For 3D SLAM, GC is extended using 3D probabilistic grid maps. GC
contains local and global SLAM algorithms, without particle filters for better performance on modest
hardware. For platforms with more DOF, like the backpack, an IMU is used to project LiDAR scans
into the horizontal plane, before matching each scan to a submap using non-linear optimization. The
algorithm is modified to handle PCs with the additional RGB information of each point.</p>
      <p>The PCs are manually divided into smaller tiles using CloudCompare [37]. This tiling process is
essential for effectively partitioning the PCs into training, validation, and test splits, as well as for
breaking down the labeling task into smaller steps to make it more efficient. Afterwards, the tiles are
categorized by the corresponding rooms and corridors. Each tile name is extended with a
prefix that identifies the hardware platform used for the recording. To ensure consistency, the tiles
are selected to be approximately identical across both platforms. This practice allows for performing
segmentation experiments that can exclude specific areas of both platforms during training time. This
tiling process results in a total of 99 tiles, where 49 tiles are from the backpack and 50 from the trolley.</p>
      <p>To be utilized for semantic segmentation, the PCs are labeled in a manual process. For our work, 11
semantic classes are distinguished that comprise static building elements, such as ceiling, floor, wall,
beam, column, window, and door, as well as items and furniture that are often found and that can be
moved, such as table, chair, sofa, and bookcase. It is important to note that these classes are more
fine-grained than those found in many existing indoor semantic segmentation datasets [38, 39].</p>
      <p>Additionally, while the classes are similar to those in [40], the class board, from the S3DIS dataset,
has been omitted for two reasons. First, the typical measurement error of LiDARs is very close
to the thickness of the boards (typically 2-3 cm). Second, excluding this class accelerates the manual
labeling task. Nevertheless, the data used for our work include classes that are hard to perceive, such
as closed white doors located in corridors with white walls. These door points are similar to wall points
in two different ways: they share a similar geometric shape, and the points contain similar
color information. These two factors complicate the annotation process.</p>
      <p>To divide the segmentation data into training, validation and testing splits, our work uses a
methodology similar to the Cityscapes benchmark [41]. For size reasons, we could not apply a more uniform
distribution, e.g., as shown in the PASCAL VOC 2012 benchmark [42] (33:33:33) or SemanticKITTI [43] (50:50 for
test and train). Hence, a division of 70:15:15 is used for training, validation, and testing. Moreover, tiles
are assigned to the same split if they belong to the same region within the PCs across both platforms.
This makes it possible to exclude certain regions during training.</p>
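      <p>A region-grouped split of this kind could be sketched as follows. The function split_by_region, the tile names, and the region identifiers are hypothetical. The sketch applies the 70:15:15 ratio to the number of regions, not to the number of points, so the resulting point shares will differ from the nominal ratio.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict


def split_by_region(tiles, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign tiles to train/val/test so that a region is never split.

    tiles: list of (tile_name, region_id) pairs; tiles of both platforms
    that cover the same region share one region_id (assumed bookkeeping).
    Returns a dict mapping tile_name to "train", "val", or "test".
    """
    by_region = defaultdict(list)
    for name, region in tiles:
        by_region[region].append(name)

    # Shuffle regions deterministically, then cut the list by the ratios.
    regions = sorted(by_region)
    random.Random(seed).shuffle(regions)
    n_train = round(ratios[0] * len(regions))
    n_val = round(ratios[1] * len(regions))

    assignment = {}
    for index, region in enumerate(regions):
        if index < n_train:
            split = "train"
        elif index < n_train + n_val:
            split = "val"
        else:
            split = "test"
        # Every tile of the region, from either platform, gets the same split.
        for name in by_region[region]:
            assignment[name] = split
    return assignment
```
]]></preformat>
      <p>Grouping by region before assigning splits ensures that a backpack tile and a trolley tile of the same room never end up on different sides of the train/test boundary.</p>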
      <p>For the generation of the digital twin, we use two different methods, superpoint graph (SPG) [28] and
RandLA-Net [26], as they provide very good accuracy for indoor scenes [44]. SPG is a deep learning
framework for large scale PC semantic segmentation, representing PCs as graphs, interconnecting
simple shapes called superpoints. The SPG classifies object parts as a whole, using superedges for
contextual relationships, which are useful in supervised learning. The SPG has far fewer nodes than the
PC has points, enabling efficient long-range interaction modeling.</p>
      <p>Fig. 2: (a) Calibration with colored objects and chessboard. (b) Wheel-encoder installation for trolley odometry.</p>
      <p>The SPG is computed in an unsupervised handcrafted way. Superpoints, assumed to be semantically
homogeneous, are assigned a ground truth label based on the majority label, which can cause
inaccuracies. The framework includes partitioning the input PC, constructing the SPG, superpoint embedding
using PointNets [25], and contextual segmentation using graph convolutions, with the latter two steps
trainable in a supervised end-to-end way (Fig. 3). To improve partitioning, Landrieu and Simonovsky
propose a supervised, graph-based approach for oversegmentation of PCs [45].</p>
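      <p>The majority-vote labeling of superpoints, and the inaccuracy it can introduce, can be illustrated with a small sketch. The function names superpoint_labels and label_purity are hypothetical and are not taken from the SPG implementation.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def superpoint_labels(point_labels, superpoint_ids, ignore_label=None):
    """Give each superpoint the most frequent label among its points."""
    votes = {}
    for label, sp in zip(point_labels, superpoint_ids):
        counter = votes.setdefault(sp, Counter())
        if label != ignore_label:
            counter[label] += 1
    # A superpoint without any valid vote keeps the ignore label.
    return {
        sp: (counter.most_common(1)[0][0] if counter else ignore_label)
        for sp, counter in votes.items()
    }


def label_purity(point_labels, superpoint_ids, sp_labels):
    """Fraction of points whose own label equals their superpoint's label."""
    matches = sum(
        1 for label, sp in zip(point_labels, superpoint_ids)
        if sp_labels[sp] == label
    )
    return matches / len(point_labels)
```
]]></preformat>
      <p>If a partition merges a few door points into a superpoint dominated by wall points, those door points inherit the wall label. The purity value therefore gives an upper bound on the pointwise accuracy that any classifier working on this partition can reach.</p>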
      <p>A second method for the digital twin generation is RandLA-Net [26], an efficient neural network
architecture based on random sampling, which is chosen for its computational and memory efficiency.
It applies a local feature aggregation module preserving key features and geometric details. RandLA-Net
processes one million points in a single pass and is up to 200 times faster than previous methods. It
shows high performance on benchmarks like S3DIS [40], Semantic3D [46], and SemanticKITTI [43].</p>
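      <p>The random sampling idea can be sketched in a few lines. The decimation ratio of 4, the number of levels, and the function names are assumptions of this example. The local feature aggregation module, which compensates for points dropped by the sampling, is not modeled here.</p>
      <preformat><![CDATA[
```python
import random


def random_downsample(points, ratio, rng):
    """Keep a uniformly random subset containing 1/ratio of the points."""
    keep = max(1, len(points) // ratio)
    return rng.sample(points, keep)


def sampling_pyramid(points, ratio=4, levels=4, seed=0):
    """Return the input plus progressively smaller random subsets of it."""
    rng = random.Random(seed)
    pyramid = [points]
    for _ in range(levels):
        # Each level is drawn from the previous one, so levels are nested.
        pyramid.append(random_downsample(pyramid[-1], ratio, rng))
    return pyramid
```
]]></preformat>
      <p>Selecting a random subset needs no distance computations between points, which is why this kind of sampling scales well to large PCs.</p>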
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>To compare backpack and trolley, data from the same environments is recorded to create maps (Fig. 4).
In the first area (corridor on the second floor of FOKUS, Fig. 4a and Fig. 4d), the backpack records for
419 seconds, yielding over 130 million points, while the trolley records for 363 seconds, yielding over 90
million points. In the second area (entire second floor, Fig. 4b and Fig. 4e), the backpack records for
341 seconds, producing over 103 million points, and the trolley for 414 seconds, resulting in 94 million
points. In the third area (DCAITI, Fig. 4c and Fig. 4f), the backpack records for 355 seconds, yielding
over 101 million points, and the trolley for 524 seconds, yielding over 155 million points.</p>
      <p>Backpack maps show no significant qualitative loss, demonstrating the robustness of GC for the
backpack. Despite the trolley’s additional odometry sensor, backpack maps are comparable, indicating
no significant quality gain for simpler 2D SLAM problems. However, the odometry sensor may be
more important for larger areas with more loop closures. The trolley’s advantage is better real-time
performance due to simpler 2D scan-matching. However, for offline processing, the backpack’s mobility
is superior, as shown in Fig. 4c and Fig. 4f, where the trolley struggles with a small bump.</p>
      <p>The trolley has a blind horizontal layer due to the fixed horizontally mounted LiDAR, resulting in
sparser ceiling coverage and occlusions. The coloring of the PCs is shown in Fig. 5. The backpack
has better overall coverage, as seen in Fig. 5b. The visualizations demonstrate the successful fusion of
LiDAR and cameras. The systematic measurement errors and different sampling rates are accounted
for. For example, the car in Fig. 5c is recognizable, and the fire extinguisher is correctly colored.</p>
      <p>PCs from both platforms show consistent coloring, due to the same sensor setup. The coloring
depends on the automatic white balancing of the images, leading to variable brightness and shading
in areas with changing lighting conditions. This variability is evident in the coloring of the walls.
Additionally, the positional offset between camera and LiDAR can create blind spots for the camera.
Since the LiDAR sensor is mounted on top of the camera, some points acquired by the LiDAR may be
occluded in the images. One coloring error due to occlusions is shown in Fig. 5a, where green points on
the floor and the left-hand side desk are incorrectly colored by images of the green locker.</p>
      <p>Fig. 3: (a) RGB. (b) Features. (c) Partition. (d) Graph. (e) Prediction.</p>
      <p>Three training experiments for semantic segmentation are conducted for both SPG and RandLA-Net:
all available training data (All), only backpack training data (BP), and only trolley training data (trolley).
The overall separation yields 67.3% of the points for training, 13.4% for validation, and 19.2% for testing.
Each data point has RGB and intensity values, but only RGB is used in the experiments. Intensity varies
with the laser beam’s angle of incidence and the object, so it is inconsistent even within the same object. The void
class unlabeled is ignored in training and testing. All semantic segmentation analyses are performed on
an Intel Core i7-9700K CPU @ 3.60GHz with 64GiB of memory and an NVIDIA GeForce RTX 2080 Ti.
Fig. 6 shows a qualitative assessment of ground truth annotations for all three areas.</p>
      <p>Our evaluation compares the overall accuracy (OA), unweighted mean accuracy (mAcc), and
unweighted mean intersection over union (mIoU). These metrics are evaluated for all 11 classes. The
results from these quantitative measurements are illustrated in Fig. 7. In the All experiment, SPG
achieves an overall accuracy (OA) of 68% with an mIoU of only 27%. There is a significant performance
variation observed in the smaller experiments (backpack and trolley). The lowest performance occurs in
the backpack experiment, resulting in an OA of 50% and an mIoU of 12%. In contrast, SPG performs best
in the trolley experiment, achieving an OA of 84% and an mIoU of 43%. Notably, the SPG framework
shows particularly poor performance in the backpack experiment, with certain per-class IoUs (such as
for ceiling, chair, table, and bookcase) approaching zero. This indicates potential issues or errors during
the training or testing phases of the framework.</p>
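      <p>For reference, the three metrics can be computed from a confusion matrix as in the following sketch. The function name segmentation_metrics is hypothetical. The sketch assumes that points of the void class have already been removed and that classes without any ground truth points are skipped in the means.</p>
      <preformat><![CDATA[
```python
def _mean(values):
    """Arithmetic mean; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def segmentation_metrics(confusion):
    """Compute OA, mAcc, and mIoU from a square confusion matrix.

    confusion[i][j] is the number of points with ground truth class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    accuracies, ious = [], []
    for i in range(n):
        tp = confusion[i][i]
        gt = sum(confusion[i])                         # points of class i
        pred = sum(confusion[j][i] for j in range(n))  # points predicted as i
        union = gt + pred - tp
        if gt > 0:
            accuracies.append(tp / gt)
        if union > 0:
            ious.append(tp / union)

    return {
        "OA": correct / total if total else 0.0,
        "mAcc": _mean(accuracies),  # unweighted mean over classes
        "mIoU": _mean(ious),        # unweighted mean over classes
    }
```
]]></preformat>
      <p>Because OA is dominated by frequent classes while mAcc and mIoU weight every class equally, a model can reach a high OA and still have a low mIoU when rare classes are segmented poorly.</p>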
      <p>When comparing both methods, RandLA-Net consistently outperforms SPG across all experiments.
Even in the most successful SPG experiment (trolley), RandLA-Net achieves an OA
that is 6 percentage points higher, with an even larger difference for mIoU. Additionally, RandLA-Net demonstrates
significantly faster computation times compared to the SPG framework, aligning with the results
presented in [26]. However, for the practical application of automatically generating semantically
segmented PCs, SPG performs reasonably fast on the hardware used.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presents two state-of-the-art deep neural networks for semantic segmentation to create
digital twins for different indoor environments. A novel modular sensor platform with two LiDARs, an
industrial-grade IMU, and a dual-lens fish-eye camera produces high-quality PCs, enabling LiDAR
measurements to be colored from camera images. Two experimental platforms are designed: a lightweight
backpack for non-planar environments and a trolley for planar movements. Six PCs covering three
areas are produced, totaling over 41 minutes of raw sensor data and 673 million points. The analysis shows
that the backpack is superior for this task due to its mobility. Two deep neural network architectures
are trained and evaluated. The SPG framework reveals issues with backpack PCs but provides valuable
insights. RandLA-Net outperforms the SPG framework, achieving an OA of 89.7% and an mIoU of
70.74% on trolley data, demonstrating the effectiveness of deep neural networks for digital twin creation.</p>
      <p>
        A primary issue in supervised machine learning is the lack of training data. Future work could extend
the training dataset using the developed sensor platform. Current settings from Google Cartographer
can be applied, and label classes could be extended to improve coverage. Analyzing unlabeled points
and creating new classes could enhance data coverage. Currently, the indoor digital twin exists as a 3D
environment that can be accessed with a browser. A next step could be the derivation of a 2D map, as it
is used for routing in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] by 3D-to-2D projection.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o mini in order to: Grammar and spelling
check, Paraphrase and reword. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.
[3] B. Henke, J. N. Hark, D. Becker, O. Sawade, I. Radusch, Map Switching Monte Carlo LiDAR
Localization for Automated Driving in Parking Garages, in: Intelligent Vehicles Symposium, IEEE,
2019.
[4] W. Hess, D. Kohler, H. Rapp, D. Andor, Real-time loop closure in 2D LiDAR SLAM, in: 2016 IEEE</p>
      <p>International Conference on Robotics and Automation (ICRA), IEEE, 2016, pp. 1271–1278.
[5] R. Otero, S. Lagüela, I. Garrido, P. Arias, Mobile indoor mapping technologies: A review,
Automation in Construction 120 (2020).
[6] M. Bosse, R. Zlot, P. Flick, Zebedee: Design of a Spring-Mounted 3-D Range Sensor with Application
to Mobile Mapping, IEEE Transactions on Robotics 28 (2012) 1104–1119.
[7] A. Nüchter, M. Bleier, J. Schauer, P. Janotta, Improving Google’s Cartographer 3D mapping by
continuous-time slam, The International Archives of Photogrammetry, Remote Sensing and Spatial
Information Sciences 42 (2017) 543.
[8] D. Chetverikov, D. Svirko, D. Stepanov, P. Krsek, The trimmed iterative closest point algorithm,
in: 2002 International Conference on Pattern Recognition, volume 3, IEEE, 2002, pp. 545–548.
[9] J. Zhang, S. Singh, LOAM: Lidar Odometry and Mapping in Real-time, in: Robotics: Science and</p>
      <p>Systems, volume 2, 2014, p. 9.
[10] J. Zhang, S. Singh, Visual-lidar odometry and mapping: Low-drift, robust, and fast, in: 2015 IEEE</p>
      <p>International Conference on Robotics and Automation (ICRA), IEEE, 2015, pp. 2174–2181.
[11] G. D. Tipaldi, M. Braun, K. O. Arras, FLIRT: Interest Regions for 2D Range Data with Applications
to Robot Navigation, in: Experimental Robotics, Springer, 2014, pp. 695–710.
[12] M. Himstedt, J. Frost, S. Hellbach, H.-J. Böhme, E. Maehle, Large scale place recognition in 2D
LiDAR scans using geometrical landmark relations, in: 2014 IEEE/RSJ International Conference
on Intelligent Robots and Systems, IEEE, 2014, pp. 5030–5035.
[13] K. Granström, T. B. Schön, J. I. Nieto, F. T. Ramos, Learning to close loops from range data, The</p>
      <p>International Journal of Robotics Research 30 (2011) 1728–1754.
[14] G. Grisetti, R. Kümmerle, C. Stachniss, W. Burgard, A tutorial on graph-based SLAM, IEEE</p>
      <p>Intelligent Transportation Systems Magazine 2 (2010) 31–43.
[15] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, W. Burgard, G²o: A general framework for
graph optimization, in: IEEE International Conference on Robotics and Automation, 2011.
[16] X. Chen, A. Milioto, E. Palazzolo, P. Giguère, J. Behley, C. Stachniss, SuMa++: Eficient LiDAR-based</p>
      <p>Semantic SLAM, in: International Conf. on Intelligent Robots and Systems, IEEE, 2019.
[17] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, J. J. Leonard, Past,
present, and future of simultaneous localization and mapping: Toward the robust-perception age,
IEEE Transactions on Robotics 32 (2016) 1309–1332.
[18] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schrof, H. Adam, Encoder-decoder with atrous separable
convolution for semantic image segmentation, in: European conference on computer vision, 2018.
[19] J. Yuan, Z. Deng, S. Wang, Z. Luo, Multi receptive field network for semantic segmentation, in:</p>
      <p>IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2020, pp. 1883–1892.
[20] B. Zoph, G. Ghiasi, T.-Y. Lin, Y. Cui, H. Liu, E. D. Cubuk, Q. V. Le, Rethinking pre-training and
self-training, Advances in Neural Information Processing Systems 2020-December (2020).
[21] E. Grilli, F. Menna, F. Remondino, A review of point clouds segmentation and classification
algorithms, The International Archives of Photogrammetry, Remote Sensing and Spatial Information
Sciences 42 (2017) 339.
[22] H. Lu, H. Shi, Deep Learning for 3D Point Cloud Understanding: A Survey, arXiv preprint
arXiv:2009.08920 (2021).
[23] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, M. Felsberg, Deep projective 3D
semantic segmentation, in: International Conference on Computer Analysis of Images and
Patterns, Springer, 2017, pp. 95–107.
[24] L. Tchapmi, C. Choy, I. Armeni, J. Gwak, S. Savarese, Segcloud: Semantic segmentation of 3d point
clouds, in: International Conference on 3D Vision (3DV), IEEE, 2017, pp. 537–547.
[25] C. R. Qi, H. Su, K. Mo, L. J. Guibas, PointNet: Deep Learning on Point Sets for 3D Classification
and Segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[26] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, A. Markham, RandLA-Net: Eficient
semantic segmentation of large-scale point clouds, in: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2020, pp. 11108–11117.
[27] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, L. J. Guibas, KPConv: Flexible
and Deformable Convolution for Point Clouds, in: International Conference on Computer Vision,
IEEE, 2019.
[28] L. Landrieu, M. Simonovsky, Large-scale point cloud semantic segmentation with superpoint
graphs, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4558–4567.
[29] Z. Xu, X. Huang, B. Yuan, Y. Wang, Q. Zhang, W. Li, X. Gao, Retrieval-and-alignment based
large-scale indoor point cloud semantic segmentation, Science China Information Sciences 67
(2024).
[30] C. Choy, J. Park, V. Koltun, Fully convolutional geometric features, in: Proceedings of the IEEE/CVF
international conference on computer vision, 2019, pp. 8958–8966.
[31] D. Rozenberszki, O. Litany, A. Dai, UnScene3D: Unsupervised 3D Instance Segmentation for</p>
      <p>Indoor Scenes, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[32] J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, B. Leibe, Mask3D: Mask Transformer
for 3D Semantic Instance Segmentation, in: IEEE International Conference on Robotics and
Automation, 2023.
[33] Y.-Q. Yang, Y.-X. Guo, J.-Y. Xiong, Y. Liu, H. Pan, P.-S. Wang, X. Tong, B. Guo, Swin3D: A pretrained
transformer backbone for 3D indoor scene understanding, Computational Visual Media 11 (2025).
[34] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision
transformer using shifted windows, in: 2021 IEEE/CVF International Conference on Computer
Vision (ICCV), 2021.
[35] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, Z. Zhou, Structured3D: A Large Photo-Realistic Dataset
for Structured 3D Modeling, in: Computer Vision – ECCV 2020, Springer, 2020.
[36] A. Geiger, F. Moosmann, Ömer Car, B. Schuster, Automatic camera and range sensor calibration
using a single shot, in: IEEE International Conference on Robotics and Automation, 2012.
[37] D. Girardeau-Montaut, CloudCompare, https://www.danielgm.net/cc, 2016. Retrieved from
Cloud</p>
      <p>Compare.
[38] J. Xiao, A. Owens, A. Torralba, SUN3D: A Database of Big Spaces Reconstructed using SfM and
Object Labels, in: Proceedings of the IEEE International Conference on Computer Vision, 2013.</p>
      <p>
[39] N. Silberman, R. Fergus, Indoor scene segmentation using a structured light sensor, in: 2011 IEEE
International Conference on Computer Vision Workshops (ICCV Workshops), IEEE, 2011.</p>
      <p>
[40] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, S. Savarese, 3D Semantic Parsing
of Large-Scale Indoor Spaces, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2016.
[41] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth,
B. Schiele, The Cityscapes dataset for semantic urban scene understanding, in: Proceedings of the
IEEE conference on computer vision and pattern recognition, 2016.
[42] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The PASCAL visual
object classes challenge: A retrospective, International Journal of Computer Vision 111 (2015)
98–136.
[43] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, J. Gall, SemanticKITTI: A
Dataset for Semantic Scene Understanding of LiDAR Sequences, in: IEEE International Conference
on Computer Vision, 2019, pp. 9297–9307.
[44] Y. Sun, X. Zhang, Y. Miao, A review of point cloud segmentation for understanding 3D indoor
scenes, Visual Intelligence 2 (2024) 14.
[45] L. Landrieu, M. Boussaha, Point cloud oversegmentation with graph-structured deep metric
learning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[46] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, M. Pollefeys, Semantic3D.net: A
new Large-scale Point Cloud Classification Benchmark, in: ISPRS Annals of the Photogrammetry,
Remote Sensing and Spatial Information Sciences, volume IV-1-W1, 2017.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wortmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schäufele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Klipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Radusch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Blaß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>Enhanced accessibility for mobile indoor navigation</article-title>
          ,
          in:
          <source>14th International Conference on Indoor Positioning and Indoor Navigation (IPIN)</source>
          , IEEE,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Massow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kwella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pfeifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Häusler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pontow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Radusch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dölitzscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haueis</surname>
          </string-name>
          ,
          <article-title>Deriving HD maps for highly automated driving from vehicular probe data</article-title>
          ,
          in:
          <source>IEEE 19th International Conference on Intelligent Transportation Systems</source>
          , IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>