
Point cloud change detection in indoor environments

Tomoya Matsubara

tomoya.matsubara@hvrl.ics.keio.ac.jp

Hideo Saito

Keio University, Yokohama, Japan

3D change detection plays a crucial role in a wide range of applications, including disaster management, as well as in robotics for search, rescue, security, and surveillance purposes. Although previous works exist, most of them are limited to detecting a few specific targets or are restricted to 2D images. Additionally, some assume prior knowledge of the positions of the objects of interest. This paper presents a novel change detection algorithm that combines panoptic segmentation and k-NN, enabling the detection of changes without relying on positional information about the objects of interest. Experimental evaluations on indoor point clouds demonstrate the algorithm's capability to detect the removal of densely and closely placed objects, an aspect overlooked by previous approaches due to their inherent limitations. Despite variations in settings and datasets, our algorithm achieves a recall improvement of 0.06 for the removed class over existing related works.

metaverse, point cloud, change detection, panoptic segmentation, k-Nearest Neighbor

1. Introduction

Point clouds, which accurately capture the 3D geometry of scenes, have extensive applications in scene understanding and robotics [1], encompassing tasks such as 3D shape classification [2, 3], 3D object detection [4, 5, 6, 7], and point cloud segmentation [8]. In the realm of the metaverse, point clouds frequently serve as representations of scenes within virtual worlds.

The metaverse is built upon immersive user experiences, necessitating an interaction layer that effectively bridges the physical and virtual worlds [9]. Digital twins [10] serve as a critical component within this layer, facilitating the transmission and synchronization of data and information between the virtual and physical worlds. However, the constant scanning of the entire scene to update the digital twin is impractical due to the vast amount of data involved. Thus, the selective updating of the digital twin in areas where changes have occurred becomes paramount, highlighting the essential role of change detection.

2D change detection techniques, which primarily focus on comparing two input images, have been proposed primarily for remote sensing applications [11, 12]. However, these approaches are constrained by the requirement of image alignment between the two inputs. In contrast, 3D change detection has garnered attention for its ability to overcome this limitation. Although some research has been conducted on 3D change detection in domains such as disaster management [13], security patrols [14], and [...] algorithm that combines 2D and 3D nearest points. However, [...]

APMAR’23: The 15th Asia-Pacific Workshop on Mixed and Augmented

2. Related Work

3D change detection has been studied by various methods; however, these methods often suffer from limitations in their applicability. Some approaches necessitate prior knowledge of the object positions of interest [16], while others are confined to detecting a limited number of targets [17, 7]. Additionally, certain techniques are only suitable for change detection in 2D images [18]. The process of collecting positional information can be labor-intensive, particularly in the case of 3D data, as it involves annotation. The range and diversity of detection targets directly impact the algorithm’s applicability.

S. Nikoohemat et al. [17] concentrate on changes in [...]; their algorithm’s applicability is limited as it only considers vertical planes as potential objects of interest that may undergo changes. M. Voelse et al. [7] introduce change detection as a means to distinguish between static and temporal objects when creating updated 3D models of environments. The algorithm initiates by segmenting point clouds using region growing, with a set distance threshold of 10 cm. Additionally, segments with a height below 20 cm are excluded (i.e., discarded) to mitigate misclassifications. These defined thresholds make it challenging to apply the algorithm in environments where objects are closely positioned, such as indoor scenarios.

T. Ku et al. [16] propose three distinct algorithms, namely PoChaDeHH, HGI-CD, and SiamGCN, for change detection on a street-scene dataset consisting of point clouds. The dataset encompasses various street furniture objects, including road signs, advertisements, statues, and garbage bins, with the positions of each object of interest provided, facilitating the extraction of these objects from the point cloud. PoChaDeHH initially eliminates outliers and noisy objects from the extracted point cloud, then employs clustering techniques to separate the remaining objects. The change is estimated based on the mean distance between points in the registered point clouds. HGI-CD utilizes statistical techniques to remove outliers, constructs color and geometric change graphs using the k-NN algorithm, and estimates change using Siamese graph convolutional networks (GCNs) with Fast Point Feature Histograms (FPFH) [19] as the node features. SiamGCN also employs GCNs with graphs constructed through k-NN, but does not include a cleaning step for the extracted point cloud.

As a change detection algorithm that works without prior information of the positions of objects of interest, K. Sakurada et al. [18] propose two deep learning models, namely CSCDNet and SSCDNet. CSCDNet, a Siamese network based on ResNet-18 [20], estimates the probability mask of change from two input images. SSCDNet, on the other hand, is a U-Net-based network that utilizes the input images and the output of CSCDNet to predict semantic change labels for each pixel. While the networks can be trained with semantic labels from non-aligned images, two aligned images are required during inference. If the input images are not aligned during inference, the models would erroneously detect changes because many pixels in the input images represent different objects that did not undergo change but appear different. Similarly, W. G. C. Bandara and V. M. Patel [11] propose ChangeFormer for 2D change detection, which has demonstrated state-of-the-art performance on LEVIR-CD [21] and DSIFN-CD [22]. However, it faces the same challenge mentioned above, where the alignment of input images during inference is crucial to avoid misinterpretation of changes.

3. Methods

Let $P_t$ and $P_{t'}$ denote a pair of already registered point clouds captured at different times $t$ and $t'$ (where $t \neq t'$), respectively, of the same scene. Given that an object instance exists in $P_t$ but not in $P_{t'}$, we define this scenario as the instance being removed when $t < t'$, or added when $t > t'$. Therefore, the addition and removal of an object instance can be treated equivalently by interchanging the time values $t$ and $t'$. Accordingly, we formulate the task of change detection as a binary classification problem, distinguishing between no change and removed, with a particular focus on identifying the specific object instance that underwent the change in the point cloud $P_t$.

First, 3D point clouds of the scene are reconstructed from captured images. Subsequently, panoptic segmentation is applied to assign an object instance label to each pixel in the images, thereby associating them with the corresponding points in the point clouds. In the next step, partial point clouds containing the same object instance are extracted from $P_t$, while their bounding volumes are utilized to extract the corresponding point clouds from $P_{t'}$. Finally, the pair of extracted point clouds undergo classification using the k-NN algorithm, determining whether there is no change or the instance has been removed. An overview of the proposed change detection algorithm is depicted in Figure 1. The details are described in this section.

3.1. Point Cloud Reconstruction

The proposed algorithm for change detection relies on the comparison of two point clouds. These point clouds are reconstructed using RGB-D images captured by an iPhone running ARKit, along with the corresponding confidence maps and camera parameters (i.e., intrinsic and extrinsic).

The confidence maps, provided by ARKit, are 2D arrays with the same dimensions as the depth map. They take three values: low, medium, and high, which indicate the accuracy of the depth values. To optimize computational efficiency and enhance performance, pixels with low or medium confidence values are excluded and not utilized in the reconstruction process.

By utilizing the camera parameters and depth map, the position of each pixel in the 3D world coordinate system can be uniquely computed. The RGB images are employed to extract color information, while the instance labels are obtained through segmentation, as discussed in Subsection 3.2.

3.2. Panoptic Segmentation

To perform panoptic segmentation on the RGB images, we employ Detectron2 [23], a pre-trained deep learning model that assigns an instance label to each pixel. The instance label takes the form of object_id (e.g., keyboard_10), where object is a character string indicating the type of object represented by the pixel, and id denotes the instance’s unique identifier.

Figure 1 (overview of the proposed change detection algorithm): $P_t$: point cloud at $t$; $P_{t'}$: point cloud at $t'$; bounding volume; Keyboard 1 at $t$; Keyboard 1 at $t'$; fewer than $n_{change}$ points? (Yes: removed; No: k-NN, yielding no change or removed).

It is important to note that the uniqueness of instance labels extends not only within each image but also across all images used for point cloud reconstruction. Consequently, different instance labels in two separate images may correspond to the same underlying object (as illustrated in Figure 2). Conversely, distinct instance labels within a single image necessarily represent different objects. At this stage, we establish an unpaired set, $U$, comprising pairs of instance labels that must represent distinct instances.

Figure 2: Instance labels obtained from panoptic segmentation. chair_1 (resp. chair_2) of the left image is the same instance as chair_4 (resp. chair_3) in the right image.

By projecting pixels with instance labels onto the 3D space, each point within the reconstructed point cloud is assigned an instance label. Consequently, each point of the point cloud is represented as a 7-dimensional vector, comprising the 3D position, RGB color, and instance label.

To consolidate different instance labels corresponding to the same object, we employ a k-NN-based algorithm. Initially, for each point, we compute the $k_{merge}$ nearest points and their corresponding distances. Subsequently, instance labels are merged using the algorithm depicted in Figure 3. This algorithm verifies whether a point $p_1$ and its neighboring point $p_2$ possess the same object label and if their distance falls below the predefined threshold $d_{merge}$. Moreover, it examines the set $U$ to determine if $p_1$ and $p_2$ should be considered as the same instance.

It is important to note that the condition $(p_1, p_2) \notin U$ alone is insufficient; $p_1$ may share the same instance label with other points $p$, where $(p, p_2) \in U$, and likewise for $(p_1, p')$, where $p'$ represents a point within the group sharing the instance label with $p_2$. If all three conditions are satisfied, the instance labels are merged. The Union-Find data structure is employed in the algorithm to manage and merge instance labels.

Figure 3 (label merge algorithm):
Data: $P$: point cloud; $N$: $k_{merge}$ nearest neighbors; $D$: $k_{merge}$ nearest neighbors’ distances; $U$: unpaired set; $d_{merge}$: distance threshold; $UF$: Union-Find instance equipped with a method that merges instance labels and a method that returns the group members of the given point.
Result: Instance labels of the same instance are merged into one single label.
    for $p_1 \in P$ do
        ...
        for $p_2 \in N[\ldots]$ do
            ...

3.3. Point Cloud Extraction

Following the reconstruction and panoptic segmentation steps, the point cloud $P_t$ is partitioned into partial point clouds based on their instance labels. This division ensures that each resulting partial point cloud contains precisely one instance label, and no other partial point clouds share the same label. However, due to imperfections in the panoptic segmentation performed by Detectron2, erroneous instance labels may occasionally arise. These incorrectly labeled partial point clouds often consist of only a few points, as they do not correspond to any actual instances in the scene. To address this issue, we introduce a threshold $n_{discard}$ and discard any partial point clouds containing fewer than $n_{discard}$ points.

Let $P_t^{(l)}$ denote the partial point cloud associated with the instance label $l$ obtained from $P_t$. In contrast to the extraction of partial point clouds from $P_t$, the extraction of corresponding partial point clouds from $P_{t'}$ is based on the bounding volume of $P_t^{(l)}$. Specifically, if $P_t^{(l)}$ resides within the range $[x_{min}, x_{max}] \times [y_{min}, y_{max}] \times [z_{min}, z_{max}]$, then the corresponding point cloud $P_{t'}^{(l)}$ consists of points from $P_{t'}$ that also fall within the same range $[x_{min}, x_{max}] \times [y_{min}, y_{max}] \times [z_{min}, z_{max}]$. It is important to note that $P_{t'}^{(l)}$ may contain multiple instance labels, unlike $P_t^{(l)}$, which only has a single instance label associated with it.
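The extraction step of Subsection 3.3 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the array layout (one row per point, integer instance labels), and the inclusive bounds are hypothetical choices and are not taken from the authors' implementation.

```python
import numpy as np


def extract_partial_clouds(points_t, labels_t, n_discard):
    """Split P_t by instance label and drop partial clouds with fewer
    than n_discard points (assumed layout: points_t is (n, 3))."""
    partial = {}
    for label in np.unique(labels_t):
        cloud = points_t[labels_t == label]
        if len(cloud) >= n_discard:
            partial[int(label)] = cloud
    return partial


def crop_to_bounding_volume(partial_t, points_t_prime):
    """Return the points of P_t' that fall inside the axis-aligned
    bounding volume of one partial point cloud of P_t."""
    lower = partial_t.min(axis=0)  # (x_min, y_min, z_min)
    upper = partial_t.max(axis=0)  # (x_max, y_max, z_max)
    inside = np.all((points_t_prime >= lower) & (points_t_prime <= upper), axis=1)
    return points_t_prime[inside]
```

Because the crop uses only the bounding volume, the returned points may carry several instance labels, which matches the remark above about $P_{t'}^{(l)}$.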

3.4. Change Detection

Our proposed algorithm focuses on detecting changes between two point clouds, $P_t^{(l)}$ and $P_{t'}^{(l)}$. Initially, the algorithm counts the number of points in $P_{t'}^{(l)}$, denoted as $n$, and promptly identifies the change as removed if $n$ is less than a predefined threshold $n_{change}$. This decision is based on the observation that instances are less likely to be represented by a small number of points, indicating the absence of the instance in $P_{t'}^{(l)}$.

In cases where $n$ is not less than $n_{change}$, the algorithm proceeds with the change classification process, as outlined in Figure 4. This process involves counting the [...]

Figure 4 (change classification process):
Data: $P_1$, $P_2$: point clouds to compare; $N$: $k_{change}$ nearest neighbors; $D$: $k_{change}$ nearest neighbors’ distances; $r_{change}$: ratio threshold.
Result: Detection result: no change or removed.
    n_match ← 0
    for p_1 ∈ P_1 do
        l_1 ← p_1.label
        for p_2 ∈ N[p_1] do
            l_2 ← p_2.label
            if l_1 = l_2 then
                n_match ← n_match + 1
            end
        end
    end
    if n_match / (k_change ∗ P_1.length) < r_change then
        return removed
    else
        return no change
    end
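The decision rule of Subsection 3.4 can be illustrated with a brute-force k-NN in NumPy. This is a minimal sketch under several assumptions: the function name and argument layout are hypothetical, labels are encoded as integers, neighbors of each point of $P_t^{(l)}$ are searched in $P_{t'}^{(l)}$, and the match ratio is normalized by the number of neighbors actually available.

```python
import numpy as np


def classify_change(cloud_t, labels_t, cloud_t_prime, labels_t_prime,
                    n_change, k_change, r_change):
    """Return "removed" or "no change" for one pair of extracted clouds."""
    # Early exit: too few points remain inside the bounding volume.
    if len(cloud_t_prime) < n_change:
        return "removed"

    # Brute-force k-NN from every point of cloud_t into cloud_t_prime.
    k = min(k_change, len(cloud_t_prime))
    dists = np.linalg.norm(cloud_t[:, None, :] - cloud_t_prime[None, :, :], axis=2)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Count neighbor pairs whose labels match, then compare the ratio.
    n_match = int(np.sum(labels_t_prime[neighbors] == labels_t[:, None]))
    ratio = n_match / (k * len(cloud_t))
    return "removed" if ratio < r_change else "no change"
```

For larger clouds, a spatial index such as a k-d tree would replace the dense distance matrix; the brute-force form is kept here only to stay dependency-light.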

4. Experiments

4.1. Dataset

We utilized an iPhone equipped with ARKit to capture a dataset¹ comprising RGB-D images, confidence maps, and camera parameters within a room containing various objects, including chairs, computers, books, cell phones, and keyboards. The dataset consists of a collection of frames captured at two different time instances, $t$ and $t'$. The specific number of frames captured at each time instance is presented in Table 1. It is important to highlight that individual partial point clouds within the dataset do not necessarily correspond to distinct object instances; some partial point clouds may represent different parts of the same object instance.

¹ Our source code is available on GitHub: https://github.com/Tomoya-Matsubara/RGB-D-Scan-with-ARKit
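Given frames of this kind (depth map, confidence map, intrinsics, and camera pose), the reconstruction of Subsection 3.1 amounts to a confidence-filtered pinhole back-projection. The sketch below is illustrative only: the function name is hypothetical, the confidence encoding (0 = low, 1 = medium, 2 = high) and the 4x4 camera-to-world pose are assumptions, and the axis conventions of the actual capture pipeline may differ.

```python
import numpy as np

HIGH_CONFIDENCE = 2  # assumed encoding: 0 = low, 1 = medium, 2 = high


def backproject(depth, confidence, intrinsics, cam_to_world):
    """Back-project high-confidence depth pixels to world coordinates.

    depth, confidence: (H, W) arrays; intrinsics: 3x3 pinhole matrix;
    cam_to_world: 4x4 pose (assumed convention). Returns the (n, 3)
    points and the (row, col) pixel indices used to look up color
    and instance labels.
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]

    # Keep only pixels whose depth is marked as high confidence.
    v, u = np.nonzero(confidence == HIGH_CONFIDENCE)
    z = depth[v, u]

    # Pinhole model: pixel (u, v) with depth z -> camera coordinates.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    cam = np.stack([x, y, z, np.ones_like(z)], axis=1)

    # Homogeneous transform into the world coordinate system.
    world = cam @ cam_to_world.T
    return world[:, :3], (v, u)
```

The returned pixel indices allow each 3D point to be paired with its RGB color and instance label, giving the 7-dimensional point representation.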

4.2. Implementation Details

Table 2 provides an overview of the parameter settings employed in our implementation². During the point cloud reconstruction phase, we applied random sampling and selected $n_{sample}$ sample pixels from each frame to optimize processing time for subsequent operations.

Although Detectron2 offers support for various object instances, we focused on extracting point clouds associated with specific labels, namely book, bottle, cup, chair, keyboard, laptop, and cell phone.

Figure 5: Instance labels of tv and chair before and after merging. Each instance label is colored uniquely. In the left images (a) and (c), each instance (tv and chair) has many colors, which shows many labels are assigned to the same instance before merging. In the right images (b) and (d), in contrast, each of them has only a few colors, which indicates those labels are merged correctly. Panels: (c) Chair before merging (d) Chair after merging.

4.3. Annotation

The proposed algorithm extracted 88 partial point clouds from the dataset. Table 3 shows the details of the extracted point clouds.

We performed manual annotation to assign labels (i.e., no change and removed) to the extracted partial point clouds. During the annotation process, we carefully examined the origin of each partial point cloud in $P_t$, determining the corresponding object instance it belonged to, and verified the presence of the same object instance in $P_{t'}$. The presence of the object instance indicated no change, while its absence indicated that the object instance had been removed.

² Our implementation is available on GitHub: https://github.com/Tomoya-Matsubara/point-cloud-change-detection

5. Results

5.1. Label Merge

Figure 5 demonstrates the successful merging of instance labels belonging to the object categories tv and chair. It can be observed that not only the labels of planar tv instances but also those of chair instances with more complex shapes were effectively merged. This can be attributed to the fact that the merge operation solely relied on distance information, without making any assumptions about the shapes of the instances.

Although some instances still retained multiple labels, the overall number of instance labels was significantly reduced by approximately 40% (from 4,534 to 2,729). After discarding partial point clouds with a point count below the threshold $n_{discard}$, the remaining labels were further reduced to a final count of 88.
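The merge step evaluated here can be sketched as follows. The sketch is a simplified reading of Subsection 3.2, not the authors' code: the class and function names are hypothetical, Union-Find is kept over label strings rather than points, the unpaired set is assumed to hold `frozenset` pairs of labels, and neighbors are found by brute force.

```python
import numpy as np


class UnionFind:
    """Minimal Union-Find over instance labels."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, item):
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]  # path halving
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def members(self, item):
        root = self.find(item)
        return [other for other in self.parent if self.find(other) == root]


def merge_labels(points, labels, unpaired, k_merge, d_merge):
    """Merge 'object_id' labels of nearby points of the same object type,
    unless any pair across the two label groups is in the unpaired set."""
    uf = UnionFind(set(labels))
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Column 0 of the sorted indices is the point itself (distance 0).
    neighbors = np.argsort(dists, axis=1)[:, 1:k_merge + 1]
    for i, row in enumerate(neighbors):
        for j in row:
            a, b = labels[i], labels[j]
            same_object = a.rsplit("_", 1)[0] == b.rsplit("_", 1)[0]
            if not same_object or dists[i, j] >= d_merge:
                continue
            # Group-level check: no member pair may be marked as distinct.
            conflict = any(frozenset((x, y)) in unpaired
                           for x in uf.members(a) for y in uf.members(b))
            if not conflict:
                uf.union(a, b)
    return [uf.find(label) for label in labels]
```

The group-level check is what prevents chair_2 from joining a group that already contains chair_1 through an intermediate label such as chair_4, which is why testing only the direct pair is not enough.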

5.2. Change Detection

The result of the change detection is shown in Figure 6. As the absence of false negatives (bottom left corner of the matrix) indicates, the proposed algorithm did not miss any removed instance.

Figure 6 (confusion matrix): vertical axis: Ground Truth (0, 1); horizontal axis: Prediction; color scale from 0 to 40.

However, there were 17 cases of false positives, where the algorithm incorrectly predicted that an object instance was removed when it was actually present. This can be attributed to the limited number of points in the partial point clouds extracted from $P_{t'}$. Figure 7 provides a visual representation of the change detection results, with $P_t$ and $P_{t'}$ captured from the same angle. False positive cases are highlighted in green, particularly noticeable in the central chair in Figure 7 (a). Upon closer examination of the same chair in Figure 7 (b) and (c), it becomes evident that $P_t$ contains a substantial number of points accurately representing the chair’s shape. Conversely, $P_{t'}$ only consists of a few points, failing to capture the chair adequately. Consequently, due to the algorithm’s tendency to predict removal in cases with limited point coverage, these false positives were triggered.

Since there is no false negative, as explained above, no red-colored point can be seen in Figure 7 (a).
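Treating removed as the positive class, the two counts reported above (no false negatives and 17 false positives) already fix the form of the per-class recalls. The true-positive and true-negative counts $TP$ and $TN$ are left symbolic here, since they are read from Figure 6 and are not stated in the text:

```latex
\[
\mathrm{recall}_{\text{removed}} = \frac{TP}{TP + FN} = \frac{TP}{TP + 0} = 1,
\qquad
\mathrm{recall}_{\text{no change}} = \frac{TN}{TN + FP} = \frac{TN}{TN + 17}.
\]
```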

Figure 8 provides a visualization from a different angle, showcasing objects on a table, such as a laptop and a smartphone. These objects are typically ignored as change detection targets in previous works due to their close proximity to each other, often just a few centimeters apart. However, in the proposed algorithm, we successfully detected the removal of such objects by leveraging pixel-level object segmentation rather than relying solely on distance-based criteria. This approach allowed us to accurately identify and classify the removal of objects, even in challenging scenarios where objects are spatially close to each other.

5.3. Comparison with 2D Change Detection

Figure 9 presents the change detection results obtained using the pre-trained ChangeFormer [11] model. This figure showcases the same scene as shown in Figure 8, with a focus on the successful detection of the laptop removal. Notably, Figure 9 (a) and (b) exhibit misalignment because they were recorded by humans, as opposed to robots whose movements can be pre-defined and controlled.

Figure 7 panels: (a) Change detection result (b) $P_t$: Point cloud at $t$ (c) $P_{t'}$: Point cloud at $t'$

In Figure 9 (c), the ChangeFormer model trained on DSIFN-CD detects changes in the top left corner, although no actual changes occurred in that region. Additionally, while the pixels at the center seemingly detect the removal of the laptop, the detected region is significantly smaller than the actual change. On the other hand, Figure 9 (e) and (f) depict images captured manually from the two aligned point clouds. In this particular case, Figure 9 (g) successfully detects the removal.

Figure 9 panels: (a) Image from $P_t$ (b) Image from $P_{t'}$ (c) ChangeFormer (DSIFN) (d) ChangeFormer (LEVIR) (e) Point Cloud captured at $t$ (f) Point Cloud captured at $t'$ (g) ChangeFormer (DSIFN) (h) ChangeFormer (LEVIR)

It is worth noting that these results are not surprising, considering that ChangeFormer was not specifically designed or trained to detect changes in unaligned images.

However, in metaverse applications, it is expected that both robots and humans contribute to data collection (e.g., image capture) for immediate updates to the virtual world. Consequently, captured images are not always perfectly aligned, and cases resembling Figure 9 (e) and (f) are less likely to arise, especially when the detection target instance is not pre-determined. From this perspective, our proposed algorithm demonstrates its ability to detect changes in object instances, even when captured from different angles or under misalignment conditions.

5.4. Comparison with Related Work of 3D Change Detection

For reference, we compared our change detection results with the performance of the PoChaDeHH, HGI-CD, and SiamGCN [16] algorithms on their street scene dataset, as presented in Table 4. It should be noted that direct comparison between our algorithm and the reference algorithms is challenging due to the following reasons:

• Classification Differences: The reference algorithms are designed for five-class classification, including categories such as no change, removed, added, change, and color change. In contrast, our algorithm focuses on detecting the removal of object instances.
• Dataset Variation: The reference algorithms utilize a different dataset consisting of street scenes, which may introduce variations in terms of scene composition, object types, and background elements.
• Known Object Positions: The positions of the objects of interest are provided in the reference algorithms, whereas our algorithm operates without this prior knowledge.

Despite these differences, our algorithm demonstrates superior performance in terms of recall for the removed class compared to the reference algorithms. Even when considering the added class as equivalent to the removed class, our algorithm still exhibits a slight performance advantage of 0.06. However, the recall for the no change class is comparatively lower than that of PoChaDeHH and HGI-CD; according to [16], these algorithms tend to predict no change, but this specialization comes at the expense of generalization performance for other classes. Conversely, SiamGCN, which showcases the best generalization performance among the reference algorithms, exhibits a recall rate similar to ours.

6. Conclusion

In this study, we have presented a change detection algorithm that relies on panoptic segmentation and k-NN, operating without the need for positional information about the object of interest.

Our label merge algorithm effectively combines different instance labels that may correspond to the same object instance, resulting in a reduced number of labels. We have demonstrated its success in merging labels for instances with complex shapes, such as chairs.

Through experiments conducted on an indoor point cloud dataset, our change detection algorithm has proven its ability to detect the removal of closely situated objects. Unlike 2D change detection techniques, our algorithm surpasses the limitations of capturing changes from a single angle and showcases its capability to detect changes in objects captured from different angles. Furthermore, our algorithm has been compared with a state-of-the-art algorithm, revealing its competitive performance in terms of recall, particularly for the removed class.

In future research, we propose exploring techniques to assess the quality of input images. Blurred images caused by camera shake can adversely impact the segmentation performance, and addressing this issue would enhance the overall accuracy of our algorithm. Additionally, as multiple frames may capture the same scene with minimal differences, removing duplicates could be considered to reduce the number of frames for processing, ultimately improving computational efficiency.

Robots and Systems, 2007, pp. 3429–3435. doi:10.1109/IROS.2007.4399381.

[15] U. Katsura, K. Matsumoto, A. Kawamura, T. Ishigami, T. Okada, R. Kurazume, Spatial change detection using voxel classification by normal distributions transform, in: 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 2953–2959. doi:10.1109/ICRA.2019.8794173.

[16] T. Ku, S. Galanakis, B. Boom, R. C. Veltkamp, D. Bangera, S. Gangisetty, N. Stagakis, G. Arvanitis, K. Moustakas, SHREC 2021: 3D point cloud change detection for street scenes, Computers & Graphics 99 (2021) 192–200. URL: https://www.sciencedirect.com/science/article/pii/S0097849321001369. doi:https://doi.org/10.1016/j.cag.2021.07.004.

[17] S. Nikoohemat, M. Koeva, S. Oude Elberink, C. Lemmen, Change detection from point clouds to support indoor 3D cadastre, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 42 (2018) 451–457.

[18] K. Sakurada, M. Shibuya, W. Wang, Weakly supervised silhouette-based semantic scene change detection, in: 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2020, pp. 6861–6867.

[19] R. B. Rusu, N. Blodow, M. Beetz, Fast point feature histograms (FPFH) for 3D registration, in: 2009 IEEE International Conference on Robotics and Automation, IEEE, 2009, pp. 3212–3217.

[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[21] H. Chen, Z. Shi, A spatial-temporal attention-based method and a new dataset for remote sensing image change detection, Remote Sensing 12 (2020) 1662.

[22] C. Zhang, P. Yue, D. Tapete, L. Jiang, B. Shangguan, L. Huang, G. Liu, A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images, ISPRS Journal of Photogrammetry and Remote Sensing 166 (2020) 183–200.

[23] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, Detectron2, https://github.com/facebookresearch/detectron2, 2019.