=Paper=
{{Paper
|id=Vol-3349/paper8
|storemode=property
|title=Real3D-Aug: Point Cloud Augmentation by Placing Real Objects with Occlusion Handling for 3D Detection and Segmentation
|pdfUrl=https://ceur-ws.org/Vol-3349/paper8.pdf
|volume=Vol-3349
|authors=Petr Sebek,Simon Pokorny,Patrik Vacek,Tomas Svoboda
|dblpUrl=https://dblp.org/rec/conf/cvww/SebekPVS23
}}
==Real3D-Aug: Point Cloud Augmentation by Placing Real Objects with Occlusion Handling for 3D Detection and Segmentation==
Real3D-Aug: Point Cloud Augmentation by Placing Real Objects with Occlusion Handling for 3D Detection and Segmentation Petr Šebek1,† , Šimon Pokorný1,† , Patrik Vacek1,* and Tomáš Svoboda1 1 Vision for Robotics and Autonomous Systems, Dept. of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague Abstract Object detection and semantic segmentation with the 3D LiDAR point cloud data require expensive annotation. We propose a data augmentation method that takes advantage of already annotated data multiple times. We propose an augmentation framework that reuses real data, automatically finds suitable placements in the scene to be augmented, and handles occlusions explicitly. Due to the usage of the real data, the scan points of newly inserted objects in augmentation sustain the physical characteristics of the LiDAR, such as intensity and raydrop. The pipeline proves competitive in training top-performing models for 3D object detection and semantic segmentation. The new augmentation provides a significant performance gain in rare and essential classes, notably 6.65% average precision gain for “Hard” pedestrian class in KITTI object detection or 2.14 mean IoU gain in the SemanticKITTI segmentation challenge over the state of the art. Keywords LiDAR, pointclouds, augmentation, semantic segmentation, object detection 1. Introduction not enough annotated data to train large neural networks. Data augmentation is a way to effectively Accurate detection and scene segmentation are decrease the need for more annotated data by integral to any autonomous robotic pipeline. enriching the training set with computed variations Perception and understanding are possible thanks to of the data. This type of augmentation is usually various sensors, such as RGB cameras, radars, and achieved with geometrical transformations, such LiDARs. These sensors produce structural data as translation, rotation, and rescale applied to the and must be interpreted for the proper function already labeled samples [4, 5, 6, 7]. of critical safety systems. We focus on LiDARs. In general, 3D point cloud augmentations [4, Recently, the most promising way to process LiDAR 8] have been much less researched than image data is to train deep neural networks [1, 2, 3] with augmentation techniques [5, 7, 9, 10]. For example, full supervision, which requires a large amount of the aforementioned 3D point cloud augmentations annotated data. only enrich the geometrical features of the training The manual annotation process is very time samples but do not create new scenarios with the and resource-consuming. For example, to perform previously unseen layout of objects. The lack of semantic segmentation on LiDAR point clouds, one modeling a realistic class population of the scenes is needs to accurately label all the points in the scene still a bottleneck of augmentation techniques. This as a specific object class. As a result, there is problem can be addressed by augmentation that uses simulated virtual data and scene configurations. 26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, However, the effect of such data on training is Feb. 15-17, 2023 low due to nonrealistic physical and visual features * † Corresponding author. compared to real data. These authors contributed equally. We focus on improving the learning of 3D $ sebekpe1@fel.cvut.cz (P. Šebek); pokorsi1@fel.cvut.cz perception networks by enhancing LiDAR data ( Pokorný); vacekpa2@fel.cvut.cz (P. 
Vacek); svobodat@fel.cvut.cz (T. Svoboda) in autonomous driving scenarios with data http://cmp.felk.cvut.cz/~vacekpa2/ (P. Vacek); augmentation. Depth information allows for per- http://cmp.felk.cvut.cz/~svoboda/ (T. Svoboda) object manipulation when augmenting the point 0000-0001-8587-5364 (P. Šebek); 0000-0002-7812-5634 clouds [8]. We take advantage of the spatial position ( Pokorný); 0000-0003-3001-1483 (P. Vacek); of annotated objects and place them in different 0000-0002-7184-1785 (T. Svoboda) © 2023 Copyright for this paper by its authors. Use permitted under scenes while handling occlusions and class-specific Creative Commons License Attribution 4.0 International (CC BY 4.0). inhabitancy, see Figure 1. CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 1 Petr Šebek et al. CEUR Workshop Proceedings 1–10 improvement, especially in rarely represented classes. The codes for our method are publicly available1 . 2. Related Work 2.1. Data Augmentation One of the first approaches to augmenting LiDAR data was GT-Aug, which was published within the 3D detection model SECOND [11]. GT-Aug adds samples from the ground-truth database, which is precomputed before the training phase. The samples are randomly selected and inserted into the scene as is. If a collision occurs, the added object is simply removed. The visibility and occlusion handling of added scan points or the inserting strategy is not taken into account. Global data augmentations (Gl-Aug) [4] such as rotation, flip, and scale are commonly used in 3D point-cloud neural networks. These augmentations provide a different geometrical perspective, which supports the neural network with more diversity of training Figure 1: We show examples of our augmentation method in 3D object detection and semantic segmentation. First, samples. An attempt to automate the augmentation we insert objects one by one and then simulate their strategy was proposed in [12], which narrows the visibility to model realistic occlusions. Note the details of search space based on previous training iterations. the scene (circled) and the detection of occluded orange The state-of-the-art LiDAR-Aug [8] enriches the points. After removal, we see the final augmented version training data to improve the performance of the 3D of the point cloud in the last row detectors. Additional objects are rendered on the basis of CAD models. Simulations of intensity and raydrops are not discussed in the article. LiDAR- Our method segments the road and sidewalks for Aug [8] also simulates occlusion between additional class-specific insertion. Next, the method exploits objects and the rest of the scene, unlike GT- the bounding boxes of objects to avoid collisions. Aug [11]. Recent method [13], similar to our one, Compared to state-of-the-art LiDAR-Aug [8], which also focuses on inserting objects into point clouds. is suitable only for object detection, our bounding The main difference between the methods is in the box generation allows augmenting the semantic real visibility simulation. Approach [13] upsamples segmentation datasets and simulates realistic the number of points in the sample, which are then occlusions throughout the spherical projection. The projected into a range image, where visible points inserted augmentations come from the same dataset are selected and then sparsed. 
From our point and are placed at the same distance, ensuring of view, this approach does not consider possible natural reflection values and point distribution, raydrop on objects located between the ego and the including ray dropouts. We evaluate the proposed inserted sample. It can cause parts of the inserted method on tasks of 3D object detection and sample to be falsely visible because some LiDAR semantic segmentation. Our contribution is twofold: beams could drop out from the obstacle and create holes in the range image. • We present a new augmentation framework suitable for both 3D object detection and 2.2. Data Simulators semantic segmentation. • We propose a novel way to model occlusions The recent progress in computer vision brought large and physically consistent insertion of objects neural networks with a large number of learnable for augmentation. parameters, often unable to reach a saturation point with the size of current training sets. These We demonstrate the usefulness of our method on autonomous driving benchmarks and show 1 https://github.com/ctu-vras/pcl-augmentation 2 Petr Šebek et al. CEUR Workshop Proceedings 1–10 models require training on a very large number between additional objects and objects that are in of annotated examples. Commonly used solutions the original point cloud. We analyze overlapping include synthetically generated data [14] or using bounding boxes. Therefore, we need to create game simulators such as Grand Theft Auto V, bounding boxes for semantic datasets that come which was used to generate images for the semantic without object boxes (in Subsection 3.2). More segmentation of ground truth [15]. Some simulators details on placing additional objects are given in built on Unreal Engine, for example, Carla [16] are Subsection 3.3. Lastly, the method handles realistic also used in autonomous driving research. However, occlusions between objects (in Subsection 3.4). The the gap between real and synthetic data remains a overview of the proposed method is visualized in great challenge [14]. One of the approaches to deal Figure 2. with the difference and portability to the real world is [17, 18], which can produce more realistic LiDAR 3.1. Road Estimation data from simulation by learning GAN models. To place the new objects, we need to know where they realistically appear in the scene. This 2.3. 3D Perception Tasks information may be given by HD maps [26, 27] if Learning in the LiDAR point cloud domain poses included in datasets; however, KITTI dataset [28] challenges, such as low point density in regions at does not provide them. We estimate valid roads the far end of the FOV, the unordered structure of and sidewalk areas for both tasks according to the the data, and sparsity due to the sensor resolution. pipeline described in Figure 3. First, we pseudo- Three common approaches to aggregation and label 3D points by Cylinder3D [2], a state-of-the-art learning the LiDAR features are voxel-based models semantic segmentation neural network, which was [19, 11], re-projection of data into 2D structure pre-trained on the SemanticKITTI dataset [29]. The [20, 21], and point cloud-based models [2, 3]. To resulting predictions are then projected onto the 2D show the ability to generalize, we evaluate our LiDAR (𝑥, 𝑦) ground plane, discretized with a cell proposed method based on different model feature size resolution of 1 × 1 meter. 
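As a concrete illustration of this projection step, the following is a minimal sketch (not the authors' released code) that collapses per-point semantic predictions into a 1 × 1 m bird's-eye-view grid and derives road and pedestrian placement maps with morphological operations; the label ids, the grid extent, and the cross-shaped structuring element used in place of the disk seeds are assumptions.

```python
import numpy as np
from scipy import ndimage

# Illustrative label ids (assumption; the real datasets define their own mapping).
ROAD, SIDEWALK = 1, 2

def bev_placement_maps(points, labels, cell=1.0, extent=50.0):
    """Project labeled LiDAR points to a 2D (x, y) grid and estimate
    road (vehicle/cyclist) and pedestrian placement areas."""
    n = int(2 * extent / cell)
    ix = np.clip(((points[:, 0] + extent) / cell).astype(int), 0, n - 1)
    iy = np.clip(((points[:, 1] + extent) / cell).astype(int), 0, n - 1)

    road = np.zeros((n, n), dtype=bool)
    road[ix[labels == ROAD], iy[labels == ROAD]] = True

    # Close small holes in the road projection (stand-in for the disk seed).
    cross = ndimage.generate_binary_structure(2, 1)
    road_closed = ndimage.binary_closing(road, structure=cross, iterations=3)

    # Pedestrian area: a dilated band around the road border (sidewalk proxy).
    border = road_closed ^ ndimage.binary_erosion(road_closed, iterations=2)
    pedestrian = ndimage.binary_dilation(border, iterations=2) & ~road_closed

    return road_closed, pedestrian
```

In the paper the structuring elements are disks of dimension three (road closing) and two (sidewalk dilation); the repeated cross-shaped iterations above only approximate that choice.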
Then we divide the extractors and on two tasks of 3D object detection space in the scene for the road (cyclist placement) and semantic segmentation. and the sidewalk (pedestrian placement) as follows: One of the key aspects of our approach is placing Road: To obtain a continuous road area, a the object in a realistic position by estimating morphological closing is used on the projection. We the road for vehicle and cyclist insertions and the use a disk seed with a dimension of three. sidewalk for pedestrian insertion. Recent research Pedestrian area: The estimate is based on the has shown, that a fast, fully convolutional neural assumption that pedestrians are supposed to walk network can predict the road from the bird’s eye along the road border. Cells closer than two pixels view projection of the scene [22]. However, this from the border of the road estimate are processed method does not handle occlusions, i.e. it does and subsequently dilated. We use a disk seed with not predict the road behind obstacles, e.g. vehicles. a dimension of two. Non-learnable methods proposed in [23, 24] can SemanticKITTI contains poses of each point separate ground from non-ground points, which cloud in sequence. Therefore, road and sidewalk can be further improved by utilizing the Jump- labels can be transformed into a global coordinate Convolution-Process [25]. All these methods (and system and accumulated in space. The accumulated other established types like RANSAC, PCA, and sequence of road and sidewalk labels leads to a height thresholding) filter out all ground points more accurate estimation of the placement areas regardless of class road or sidewalk. In our setup, in the 2D LiDAR (𝑥, 𝑦) ground plane projection. we need to distinguish them, so we rely on the Accumulating multiple scans in one frame densifies segmentation network learned from the dataset. the LiDAR point cloud and naturally reduces the need for morphological operations. 3. Method 3.2. Creating of Bounding Boxes Our augmentation method places additional objects For a collision-free placement of objects, the into an already captured point cloud. The objects bounding boxes are required. The bounding box must be placed in adequate locations; therefore, is parameterized by the center coordinates (𝑥, 𝑦, 𝑧), the road and pedestrian area must be estimated size dimensions (𝑙, 𝑤, ℎ), and heading angle (yaw). (in Subsection 3.1). The method avoids collisions For object detection in the KITTI dataset, the 3 Petr Šebek et al. CEUR Workshop Proceedings 1–10 Figure 2: Overview of the proposed pipeline. We process the data in order to estimate all possible placements, all bounding boxes in the scene, and augmenting objects from different frames. The possible placement of augmenting objects is a conjunction of the same depth as the cut-out object (yellow circle) and a suitable area from the map of possible insertions (green). Occlusion handling is performed in spherical projection. The result is re-projected to the scene to the 3D augmented point cloud. bounding boxes are already provided as ground- KITTI data set. truth labels. However, the SemanticKITTI dataset For bicycle, motorcycle, motorcyclist, and truck contains only the semantic label of the class together objects in the SemanticKITTI dataset, we do not with the instance of the object (each object in have corresponding statistics for bounding box one frame has a different instance). We mitigate dimensions since they are not present in KITTI. 
the absence of the bounding boxes by separating Therefore, the limits were hand-crafted from the individual objects from the scene based on an first 100 generated samples from SemanticKitti. We instance and estimate bounding boxes, see Figure 4. also used the first decile, but with a 10% margin of In case of the absence of instance labels, we would safety. cluster the semantic segmentation points to get the instances via density-based clustering. In the case 3.3. Placing of Objects of close-by segmentation, more than one instance can be inserted without damaging the consistency Placing one or multiple objects requires knowing of our approach. the bounding box dimensions and yaw angles. Only Modeling the bounding boxes is divided into three points within the bounding boxes are used to steps: augment different frames of the dataset. For the Wrapping: Object-labeled 3D Lidar points are semantic segmentation datasets (task), these points projected to the ground plane. The 2D projected are further filtered to have an appropriate label. In points are wrapped in a convex hull. the case of the object detection datasets, points that Smallest area: Assume the convex hull consists of are pseudo-labeled as the road or sidewalk classes 𝑛 points. We construct 𝑛 − 1 rectangles so that two are removed to ensure that the cutout point cloud neighboring points on the convex hull compose one contains only the object points. side of the rectangle. The remaining sides of the To maintain the most realistic augmentation, our rectangle are added to achieve the smallest area. method places the object at the same distance with Refinement: Too few points may represent some the same observation angle. It can be achieved by objects. They are scanned at a great distance or are rotating its point cloud by the vertical z-axis of significantly occluded by closer objects. Bounding the frame origin. This way, realistic object point boxes may also be distorted by occlusions. We density and LiDAR intensity are maintained due analyze the heights, widths, and lengths of the to the preserved range between the sensor and the bounding boxes in the KITTI dataset for classes object. It also keeps the same observation angle. “Car”, “Pedestrian”, and “Cyclists”, which we use Then, we consider the collision-free location of the in Semantic KITTI. We obtain the distributions insertion: for each class and parameter. For each random Location: Objects must be fully located on variable, we calculate the lowest decile. The lowest the appropriate surface. We place vehicles and decile values are the minimum threshold values cyclists on the streets and pedestrians on sidewalks. of the bounding box. The maximal values of Thought pedestrians can move on the streets as well, bounding boxes are set as the maximal values for we do not observe this occurance in the evaluation the corresponding dimension that occurred in the datasets and therefore do not consider it during 4 Petr Šebek et al. CEUR Workshop Proceedings 1–10 Figure 4: Creation of the bounding box in Bird’s Eye View around the car. First, a convex hull is constructed around points; then we fit a bounding box to estimate position x, y, dimensions length, width, height, and orientation yaw. The z is estimated as if the object touches the road without intersecting it. box belonging to the object is cut from the scene and placed in the augmented frame on the road level. 
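The "Wrapping" and "Smallest area" steps of Subsection 3.2 can be sketched as follows: project the instance to the ground plane, take the convex hull, and keep the hull-edge-aligned rectangle with the smallest area. This is an illustrative reimplementation rather than the published code; the refinement against the KITTI size statistics is omitted and the z coordinate is simplified to the lowest instance point.

```python
import numpy as np
from scipy.spatial import ConvexHull

def fit_bev_bbox(points):
    """Fit an oriented box (x, y, z, l, w, yaw) to one object instance,
    following the convex-hull / smallest-area-rectangle idea of Sec. 3.2."""
    xy = points[:, :2]
    hull = xy[ConvexHull(xy).vertices]            # hull vertices in order

    best = None
    for a, b in zip(hull, np.roll(hull, -1, axis=0)):
        yaw = np.arctan2(*(b - a)[::-1])          # edge direction as candidate heading
        rot = np.array([[np.cos(-yaw), -np.sin(-yaw)],
                        [np.sin(-yaw),  np.cos(-yaw)]])
        local = hull @ rot.T                      # rotate hull into the edge frame
        lo, hi = local.min(0), local.max(0)
        area = np.prod(hi - lo)
        if best is None or area < best[0]:
            center = ((lo + hi) / 2.0) @ rot      # rectangle center back in sensor frame
            best = (area, center, hi - lo, yaw)

    _, (cx, cy), (l, w), yaw = best
    z = points[:, 2].min()                        # object assumed to touch the ground
    return cx, cy, z, l, w, yaw
```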
For the insertion of vehicles and cyclists, the bounding box must not contain any point other than road; same for pedestrians and the pedestrian area. Then, we check whether the inserted bounding box overlaps with each of the original boxes from the augmented scene and skip insertion when it does. 3.4. Occlusion Handling By inserting objects into the scene, we model consistent occlusions in the point cloud from newly added points. We consider the occlusion of a newly inserted object by original points closer to the LiDAR sensor, as well as the occlusions caused by the inserted object itself. Data projection: The occlusion handling uses a spherical projection, similarly to [20], to solve realistic visibility after the additional object is placed. The spherical projection stores the minimal distance between the sensor and the points projected Figure 3: Rich map generating. Road maps are created from points’ positions and labels. Semantic datasets to the corresponding pixel. To correct the holes in already contain labels for each road point, in the case of the object, the projection is morphologically closed the detection dataset, labels are pseudo-labeled by neural by a rectangular seed of dimension 5 × 3 (5 rows and network [2]. We then project segmented points into a 2D three columns). The pixels closed by the seed are bird’s eye view and acquire road and sidewalk maps by assigned the depth computed from the neighboring morphological operations on the 2D projection, namely pixels as an average of the depths in that seed area. closing for the road and dilation of road boundary for the Morphological closing is computed separately for sidewalk–pedestrian area. the scene and object. insertion. For each appropriate position, the z coordinate of the object is adjusted to ensure that the object touches the surface according to the road prediction level. Collision avoidance: At first, the sole bounding 5 Petr Šebek et al. CEUR Workshop Proceedings 1–10 Algorithm 1 Occlusion handling scenes in both sets. The evaluation was carried out Input: Scene point-cloud 𝒫, Scene projection, Object point-cloud, on a validation set, where the labels are available, Object projection Output: success, Scene point-cloud as was done in [8, 28]. For object detection, we 1: point_counter ← 0 consider all possible classes, i.e., cars, pedestrians, 2: success ← False 3: for each pixel in object’s spherical projection do and cyclists. 4: if distance of object is smaller then in scene then A metric for conducting an evaluation is the 5: Remove scene points in pixel (they are occluded) 6: Add points projected to object s. p. pixel to scene standard average precision (AP) of 11 uniformly 7: point_counter ← point_counter + nbr of added sampled recall values. We use the IoU threshold points 8: end if 50%; true positive predictions are considered 9: end for bounding boxes with ground-truth overlaps greater 10: if point_counter > minimal point for class then 11: success ← True than 50% for pedestrians and cyclists. For 12: end if cars, the 70% threshold was used. We denote 13: return success, Scene AP for “Pedestrian” as APPed 50(%), APCyc 50(%) for “Cyclist” and APCar 70(%) for “Cars”. The Removing occluded points: The algorithm goes difficulties of the predictions are divided based through every pixel in the spherical projection. on the sizes of the bounding box, occlusion, and Every pixel contains information about the distance truncation into “Easy”, “Moderate”, and “Hard”, of the point. 
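Before continuing with the occlusion test, a minimal sketch of the placement rule of Subsection 3.3: the cut-out object is rotated about the sensor's vertical axis, which keeps its range (and therefore point density and intensity) unchanged, and the candidate is rejected whenever its box collides with a box already present in the scene. The helper names and the circumscribed-circle overlap test are simplifying assumptions, not the paper's exact collision check.

```python
import numpy as np

def rotate_about_z(points, angle):
    """Rotate an object's points about the sensor's vertical axis;
    the range to the sensor, and thus density and intensity, is preserved."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = points.copy()
    out[:, :3] = points[:, :3] @ rot.T
    return out

def boxes_overlap_bev(box_a, box_b):
    """Conservative bird's-eye-view overlap test between boxes (cx, cy, l, w, yaw)
    using circumscribed circles -- an over-approximation of a yaw-aware test."""
    ra = 0.5 * np.hypot(box_a[2], box_a[3])
    rb = 0.5 * np.hypot(box_b[2], box_b[3])
    return np.hypot(box_a[0] - box_b[0], box_a[1] - box_b[1]) < ra + rb

def try_place(obj_points, obj_box, scene_boxes, angle):
    """Rotate the cut-out object to a candidate azimuth and accept it
    only if its box collides with no box already in the scene."""
    moved = rotate_about_z(obj_points, angle)
    cx, cy = rotate_about_z(np.array([[obj_box[0], obj_box[1], 0.0]]), angle)[0, :2]
    new_box = (cx, cy, obj_box[2], obj_box[3], obj_box[4] + angle)
    if any(boxes_overlap_bev(new_box, b) for b in scene_boxes):
        return None                               # skip insertion on collision
    return moved, new_box
```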
All scene points more distant than as required by the [28] benchmark. the inserted point are removed since they would be Semantic segmentation: We use the naturally occluded by the added object. as they SemanticKITTI [29] benchmark. The dataset is an are occluded by the placed object. Consequently, extension of the original KITTI [28] benchmark all object points, which were projected in the with dense point-wise annotations provided for each same pixel, are added to the scene point cloud. 360∘ field-of-view frame. The dataset generally The algorithm also returns boolean values, which offers 23,201 3D scans for training and 20,351 for represent if the number of added sample points testing. The training data set was divided into exceeds the threshold for a given class. We used training and validation parts with 19 annotated this to prevent super hard cases, with only, e.g., classes. three visible points from the object. A pseudocode Standard IoU = TP/(TP + FP + FN), the of the algorithm is shown in Algorithm 1. intersection over union, was used for comparison. Performance is evaluated for each class, as well as the average (mIoU) for all classes. 4. Experiments In this section, we show the experimental evaluation 4.2. 3D Perception Models of our method on KITTI and SemanticKITTI We tested the augmented data on two 3D object datasets with comparison to other types of data detection models, each based on a different type augmentation such as Global Augmentation [4], of feature extractor backbone. PV-RCNN [31] is Ground Truth insertion[11] and LiDAR-Aug [8]. We a 3D object detection model that combines a 3D experiment with two neural networks for each task. voxel convolutional neural network with a pointnet- based set abstraction approach [32]. The second 4.1. Datasets and Perception Tasks is PointPillar [1], which encodes the point cloud in vertical pillars. The pillars are later transformed 3D object detection: We use the KITTI 3D object into 3D pseudo-image features. detection benchmark. The data set consists of 7,481 For segmentation task, we use Cylinder3D [2] and training scenes and 7,518 testing scenes with three SPVNAS [3] multiclass detector. Cylinder3D [2] object classes: “car”, “pedestrian”, and “cyclist”. is the top-performing architecture on the Semantic The test labels are not accessible, and access to KITTI dataset with public codes. SPVNAS [3] the test server is limited. Therefore, we followed achieves significant computation reduction due to the methodology proposed by [8] and divided the the sparse Point-Voxel convolution and holds the training data set into training and validation parts, fourth place on the competitive SemanticKITTI where the training set contains 3,712 and the leaderboard right behind Cylinder3D [2]. validation 3,769 LiDAR samples [30]. The split of Each neural network was set to the default the dataset into training and validation was made parameters proposed by the authors of the consistent with the standard KITTI format, i.e., architectures, with its performance reported on with regard to avoiding having similar frames and KITTI 3D benchmark and SemanticKITTI. We 6 Petr Šebek et al. CEUR Workshop Proceedings 1–10 Table 1 Semantic segmentation on SemanticKITTI. Comparison of our method with the global augmentation baseline. Both methods are evaluated using SPVNAS [3] and Cylinder3D [2] architectures. The reported results are averaged over five runs for SPVNAS, and only one run was performed for Cylinder3D due to the large training time. 
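A compact sketch of the occlusion handling of Algorithm 1, under an assumed range-image resolution and vertical field of view: both the scene and the inserted object are projected spherically, the nearer surface wins each pixel, and the insertion is accepted only if enough object points remain visible. The per-class minimum-point threshold is illustrative, and the morphological closing of holes in the object projection is omitted here.

```python
import numpy as np

H, W = 64, 2048                       # assumed range-image resolution
FOV_UP, FOV_DOWN = 3.0, -25.0         # assumed vertical field of view (degrees)

def spherical_pixels(points):
    """Map 3D points to (row, col) pixels and ranges of a spherical projection."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)
    pitch = np.degrees(np.arcsin(z / np.maximum(rng, 1e-6)))
    col = ((np.pi - yaw) / (2 * np.pi) * W).astype(int) % W
    row = ((FOV_UP - pitch) / (FOV_UP - FOV_DOWN) * H).astype(int).clip(0, H - 1)
    return row, col, rng

def insert_with_occlusion(scene, obj, min_points=10):
    """Return the augmented scene, or None if too few object points stay visible."""
    s_row, s_col, s_rng = spherical_pixels(scene)
    o_row, o_col, o_rng = spherical_pixels(obj)

    # Nearest object range per pixel (the inserted object's visible surface).
    obj_depth = np.full((H, W), np.inf)
    np.minimum.at(obj_depth, (o_row, o_col), o_rng)

    # Scene points that fall behind the inserted object are removed.
    keep_scene = s_rng < obj_depth[s_row, s_col]

    # Object points survive only where no closer scene point exists.
    nearest_scene = np.full((H, W), np.inf)
    np.minimum.at(nearest_scene, (s_row, s_col), s_rng)
    keep_obj = o_rng <= nearest_scene[o_row, o_col]

    if keep_obj.sum() < min_points:
        return None                   # too few visible points for this class
    return np.concatenate([scene[keep_scene], obj[keep_obj]], axis=0)
```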
The augmented categories are denoted by * for SPVNAS and by ** for Cylinder3D. We observe a performance gain in each of them except for one: trucks. Improvement is especially notable in the motorcyclist class, which contains only a few training examples in the dataset with only global augmentations. motorcyclist */ ** motorcycle */ ** bicyclist */ ** other-ground other-vehicle bicycle */ ** person */ ** truck */ ** traffic-sign vegetation sidewalk building parking terrain car ** trunk fence road pole mIoU SPVNAS w/o Obj-Aug 60.62 95.47 29.64 58.16 64.22 47.69 66.24 79.14 0.04 93.06 48.52 80.20 1.72 89.75 58.67 87.88 67.07 73.40 63.51 47.34 SPVNAS w Real3D-Aug 62.76 95.93 44.13 73.41 49.24 48.43 70.34 85.45 12.01 92.84 45.66 79.66 2.91 89.36 56.96 89.18 67.61 76.72 63.73 48.88 Cylinder3D w/o Obj-Aug 58.83 95.63 42.67 59.37 33.28 41.03 67.15 78.83 0.00 92.48 42.24 78.49 0.02 89.86 57.32 87.43 67.23 73.70 65.03 45.93 Cylinder3D w Real3D-Aug 63.00 96.27 50.47 71.29 64.28 50.20 69.78 88.84 12.66 93.37 35.43 79.81 0.00 90.60 59.86 87.42 59.02 73.71 64.83 49.24 trained each neural network three times for object All methods were trained with global augmentations detection and five times for semantic segmentation. [4] if not stated otherwise. Average performance was considered as the final In Table 2 we show the results of LiDAR-Aug score of the method. with PV-RCNN. The numbers are taken from the original paper due to the unpublished codes 4.3. Augmentations and the lack of technical details about their CAD model and ray-drop characteristic. In the original All augmentations were trained with the same article, LiDAR-Aug was trained under unknown hyperparameters to ensure a fair comparison hyperparameters and was not applied to the cyclist between methods. The approach of GT-Aug was category. Our method surpasses the LiDAR-Aug in performed with information of the precomputed the pedestrian class by a large margin despite all the planes, which is an approximation of the ground difficulties. Both GT-Aug and Real3D-Aug achieve from the KITTI dataset. This step should ensure significant performance improvement. Real3D-Aug that the inserted objects lie on the ground. For our achieves a significant improvement with PV-RCNN proposed augmentation method, we add objects in the pedestrian class, where we achieve 15.4%, with a zero-occlusion KITTI label only (Easy). 10.96%, and 7.87% improvement in Easy, Moderate, Some cases are naturally transformed into other and Hard difficulty, and GT-Aug achieves 7.52%, difficulties (Moderate and Hard) by newly created 3.74%, and 0.48% improvement compared to the occlusions. model without (w/o) any object augmentation. Our For global augmentation of the scenes, we method also slightly improves the performance on used uniformly distributed scaling of the scene in the car, but Lidar-Aug and GT-Aug overcome the the range [0.95, 1.05], rotation around the z-axis method. (vertical axis) in the range [−45∘ , 45∘ ] and random flipping over the x-axis from the point cloud as in Table 2 [4, 8]. Object detection results with PV-RCNN. Our method The maximum number of added objects in achieves the best results in the categories “pedestrian” semantic segmentation was set to 10 per scene, and “easy cyclists”. (mc) abbreviates multiclass and the object class is selected randomly (uniform AP 70(%) AP 50(%) AP 50(%) Car Ped Cyc Method Easy Mod Hard Easy Mod Hard Easy Mod Hard distribution) each time of the insertion. 
w/o Object-Aug 87.77 78.12 76.88 65.92 59.14 54.51 76.80 59.36 56.61 GT-Aug [11] 89.17 81.92 78.78 65.69 59.33 54.78 88.30 72.55 67.79 LiDAR-Aug [8] 90.18 84.23 78.95 65.05 58.90 55.52 N/A N/A N/A Real3D-Aug (mc) 88.70 78.63 78.09 73.57 66.55 62.17 92.69 65.06 63.43 4.4. Evaluation We compare our method (Real3D-Aug) with copy- In Table 1 we show the results for SPVNAS [3] and-paste augmentation (GT-Aug) [11] and with and Cylinder3D [2] architecture. In the semantic state-of-the-art LiDAR-Aug augmentation [8]. In segmentation task, we increased the mean IoU for the Real3D-Aug multiclass (mc), we added 4.7 both networks. pedestrians and 6.7 cyclists on average per scene. We are not comparing with GT-Aug [11] and 7 Petr Šebek et al. CEUR Workshop Proceedings 1–10 LiDAR-Aug [8] in the semantic segmentation augmentation technique for 3D detection and task. The methods above were not designed semantic segmentation tasks. Our method improves for segmentation, whereas our method allows for performance on important and rarely occurring augmenting both tasks. classes, e.g. pedestrian, cyclist, motorcyclist, In the semantic segmentation task for SPVNAS, and others. Our method is self-contained and we achieve an increase of 2.14 in mean IoU compared requires only 3D data. All augmentations can be to the common augmentation technique [4], see preprocessed, so it does not increase the training Table 1. We observe an increased IoU of all classes time. One way to further improve the method added, except for the truck category. With the is to incorporate a more informative selection of Cylinder3D network, the increment can be seen placements based on the uncertainty of the detection in the IoU of all added classes. Our method also model. increases the performance on not augmented classes since we add more negative examples to other similar classes. Acknowledgments This work was supported in part by OP VVV MEYS 4.5. Ablation Study of Object Detection funded project CZ.02.1.01/0.0/0.0/16_019/0000765 In Tables 3 and 4 we show the influence of adding “Research Center for Informatics”, and by Grant a single object to the scene in comparison to Agency of the CTU Prague under Project GT-Aug. Each configuration is named after the SGS22/111/OHK3/2T/13. Authors want to thank added class, and the lower index indicates the colleagues from Valeo R&D for discussions and average number of objects added per scene. We Valeo company for a support. can see that, in the case of PointPillar, adding only one class decreases performance in the other References classes. We suspect that it is caused by similarities between classes. For example, pedestrians and [1] J. Tu, P. Wang, F. Liu, PP-RCNN: Point- bicycles are simultaneously present in the class Pillars Feature Set Abstraction for 3D Real- “cyclist”. Therefore, it is beneficial to add both time Object Detection, in: IEEE International classes simultaneously. In the case of PV-RCNN, Joint Conference on Neural Networks (IJCNN), the addition of one class improves the performance 2021. of both. [2] X. Zhu, H. Zhou, T. Wang, F. Hong, Y. Ma, W. Li, H. Li, D. Lin, Cylindrical Table 3 and Asymmetrical 3D Convolution Networks Real3D-Aug Object detection results with PointPillar for LiDAR Segmentation, in: IEEE/CVF architecture based on number of inserted classes. Conference on Computer Vision and Pattern APPed 50(%) APCyc 50(%) Recognition (CVPR), 2021. Augmentation Easy Mod Hard Easy Mod Hard GT-Aug 54.52 49.04 45.49 77.64 61.30 58.15 [3] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. 
Lin, Real3D-Aug (Ped1 ) 55.72 51.30 47.47 46.33 33.84 32.47 H. Wang, S. Han, Searching Efficient Real3D-Aug (Cyc1 ) 46.87 44.17 41.77 72.65 52.71 49.04 3D Architectures with Sparse Point-Voxel Real3D-Aug (mc) 55.50 52.00 49.03 76.82 52.74 50.18 Convolution, in: European Conference on Computer Vision (ECCV), 2020. [4] M. Hahner, D. Dai, A. Liniger, L. V. Gool, Table 4 Quantifying Data Augmentation for LiDAR Real3D-Aug object detection results with PV-RCNN based 3D Object Detection, arXiv:2004.01643 based on the number of inserted classes. (2020). APPed 50(%) APCyc 50(%) Augmentation Easy Mod Hard Easy Mod Hard [5] X. Xu, Z. Chen, F. Yin, CutResize: Improved GT-Aug 65.69 59.33 54.78 88.30 72.55 67.79 data augmentation method for RGB-D Object Real3D-Aug (Ped1 ) 70.96 66.63 61.14 78.97 63.47 57.31 Recognition, IEEE Robotics and Automation Real3D-Aug (Cyc1 ) 65.63 59.14 57.47 82.79 63.69 62.39 Real3D-Aug (mc) 73.57 66.55 62.17 92.69 65.06 63.43 Letters (RA-L) (2022). [6] J. Yang, S. Shi, Z. Wang, H. Li, X. Qi, ST3D: Self-Training for Unsupervised Domain Adaptation on 3D Object Detection, in: 5. Conclusion IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. We propose an object-centered point cloud 8 Petr Šebek et al. CEUR Workshop Proceedings 1–10 [7] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, [19] Y. Zhou, O. Tuzel, VoxelNet: End-to-End A Simple Framework for Contrastive Learning Learning for Point Cloud Based 3D Object of Visual Representations, in: International Detection, in: IEEE/CVF Conference on Conference on Machine Learning (ICML), Computer Vision and Pattern Recognition 2020. (CVPR), 2018. [8] J. Fang, X. Zuo, D. Zhou, S. Jin, S. Wang, [20] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, L. Zhang, LiDAR-Aug: A General Rendering- K. Keutzer, M. Tomizuka, Squeezesegv3: based Augmentation Framework for 3D Object Spatially-adaptive convolution for efficient Detection, in: IEEE/CVF Conference on point-cloud segmentation, in: European Computer Vision and Pattern Recognition Conference on Computer Vision (ECCV), (CVPR), 2021. 2020. [9] N. Cauli, D. Reforgiato Recupero, Survey on [21] A. Milioto, I. Vizzo, J. Behley, C. Stachniss, Videos Data Augmentation for Deep Learning Rangenet ++: Fast and accurate lidar Models, Future Internet (2022). semantic segmentation, IEEE/RSJ [10] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, International Conference on Intelligent K. Chen, P. Zhang, B. Wu, Z. Kira, P. Vajda, Robots and Systems (IROS) (2019). Unbiased Teacher for Semi-Supervised Object [22] L. Caltagirone, S. Scheidegger, L. Svensson, Detection, in: International Conference on M. Wahde, Fast LIDAR-based road detection Learning Representations (ICLR), 2021. using fully convolutional neural networks, in: [11] Y. Yan, Y. Mao, B. Li, Second: Sparsely IEEE Intelligent Vehicles Symposium (IV), Embedded Convolutional Detection, Sensors 2017. (2018). [23] P. Chu, S. Cho, S. Fong, K. Cho, Enhanced [12] S. Cheng, Z. Leng, E. D. Cubuk, B. Zoph, ground segmentation method for Lidar point C. Bai, J. Ngiam, Y. Song, B. Caine, clouds in human-centric autonomous robot V. Vasudevan, C. Li, Q. V. Le, J. Shlens, systems, Human-centric Computing and D. Anguelov, Improving 3D Object Information Sciences (HCIS) (2019). Detection through Progressive Population [24] I. Bogoslavskyi, C. Stachniss, Efficient Based Augmentation, in: European Online Segmentation for Sparse 3D Laser Conference on Computer Vision (ECCV), Scans, Photogrammetrie, Fernerkundung, 2020. Geoinformation (PFG) (2016). [13] Y. Ren, S. 
Zhao, L. Bingbing, Object Insertion [25] Z. Shen, H. Liang, L. Lin, Z. Wang, W. Huang, Based Data Augmentation for Semantic J. Yu, Fast Ground Segmentation for Segmentation, in: International Conference 3D LiDAR Point Cloud Based on Jump- on Robotics and Automation (ICRA), 2022. Convolution-Process, Remote Sensing (2021). [14] P. Vacek, O. Jašek, K. Zimmermann, [26] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, T. Svoboda, Learning to Predict Lidar V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, Intensities, IEEE Transactions on Intelligent G. Baldan, O. Beijbom, nuScenes: A Transportation Systems (T-ITS) (2021). multimodal dataset for autonomous driving, in: [15] S. R. Richter, V. Vineet, S. Roth, V. Koltun, IEEE/CVF Conference on Computer Vision Playing for Data: Ground Truth from and Pattern Recognition (CVPR), 2020. Computer Games, in: European Conference [27] M.-F. Chang, J. Lambert, P. Sangkloy, on Computer Vision (ECCV), 2016. J. Singh, S. Bak, A. Hartnett, D. Wang, [16] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, P. Carr, S. Lucey, D. Ramanan, J. Hays, V. Koltun, CARLA: An Open Urban Driving Argoverse: 3D Tracking and Forecasting With Simulator, in: Conference on Robot Learning Rich Maps, in: IEEE/CVF Conference on (CoRL), 2017. Computer Vision and Pattern Recognition [17] A. E. Sallab, I. Sobh, M. Zahran, N. Essam, (CVPR), 2019. LiDAR Sensor modeling and Data [28] A. Geiger, P. Lenz, R. Urtasun, Are we ready augmentation with GANs for Autonomous for Autonomous Driving? The KITTI Vision driving, arXiv:1905.07290 (2019). Benchmark Suite, in: IEEE/CVF Conference [18] A. E. Sallab, I. Sobh, M. Zahran, M. Shawky, on Computer Vision and Pattern Recognition Unsupervised Neural Sensor Models for (CVPR), 2012. Synthetic LiDAR Data Augmentation, [29] J. Behley, M. Garbade, A. Milioto, Advances in Neural Information Processing J. Quenzel, S. Behnke, C. Stachniss, J. Gall, Systems (NIPS) (2019). SemanticKITTI: A Dataset for Semantic 9 Petr Šebek et al. CEUR Workshop Proceedings 1–10 Scene Understanding of LiDAR Sequences, in: IEEE/CVF International Conference on Computer Vision (ICCV), 2019. [30] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, R. Urtasun, 3D Object Proposals using Stereo Imagery for Accurate Object Class Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (2017). [31] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, H. Li, PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [32] C. R. Qi, L. Yi, H. Su, L. J. Guibas, Pointnet++: Deep hierarchical feature learning on point sets in a metric space, in: Advances in Neural Information Processing Systems (NIPS), 2017. 10