Real3D-Aug: Point Cloud Augmentation by Placing Real
Objects with Occlusion Handling for 3D Detection and
Segmentation
Petr Šebek1,†, Šimon Pokorný1,†, Patrik Vacek1,* and Tomáš Svoboda1
1 Vision for Robotics and Autonomous Systems, Dept. of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague


Abstract
Object detection and semantic segmentation with 3D LiDAR point cloud data require expensive annotation. We propose a data augmentation method that takes advantage of already annotated data multiple times. Our augmentation framework reuses real data, automatically finds suitable placements in the scene to be augmented, and handles occlusions explicitly. Because real data are used, the scan points of newly inserted objects preserve the physical characteristics of the LiDAR, such as intensity and raydrop. The pipeline proves competitive in training top-performing models for 3D object detection and semantic segmentation. The new augmentation provides a significant performance gain in rare and essential classes, notably a 6.65% average precision gain for the "Hard" pedestrian class in KITTI object detection and a 2.14 mean IoU gain in the SemanticKITTI segmentation challenge over the state of the art.

Keywords
LiDAR, point clouds, augmentation, semantic segmentation, object detection



26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023
* Corresponding author.
† These authors contributed equally.
sebekpe1@fel.cvut.cz (P. Šebek); pokorsi1@fel.cvut.cz (Š. Pokorný); vacekpa2@fel.cvut.cz (P. Vacek); svobodat@fel.cvut.cz (T. Svoboda)
http://cmp.felk.cvut.cz/~vacekpa2/ (P. Vacek); http://cmp.felk.cvut.cz/~svoboda/ (T. Svoboda)
ORCID: 0000-0001-8587-5364 (P. Šebek); 0000-0002-7812-5634 (Š. Pokorný); 0000-0003-3001-1483 (P. Vacek); 0000-0002-7184-1785 (T. Svoboda)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).


1. Introduction

Accurate detection and scene segmentation are integral to any autonomous robotic pipeline. Perception and understanding are possible thanks to various sensors, such as RGB cameras, radars, and LiDARs. These sensors produce structured data that must be interpreted correctly for safety-critical systems to function properly. We focus on LiDARs. Recently, the most promising way to process LiDAR data has been to train deep neural networks [1, 2, 3] with full supervision, which requires a large amount of annotated data.

The manual annotation process is very time- and resource-consuming. For example, to perform semantic segmentation on LiDAR point clouds, one needs to accurately label every point in the scene with a specific object class. As a result, there is not enough annotated data to train large neural networks. Data augmentation is a way to effectively decrease the need for more annotated data by enriching the training set with computed variations of the data. This type of augmentation is usually achieved with geometrical transformations, such as translation, rotation, and rescaling, applied to already labeled samples [4, 5, 6, 7].

In general, 3D point cloud augmentations [4, 8] have been much less researched than image augmentation techniques [5, 7, 9, 10]. The aforementioned 3D point cloud augmentations only enrich the geometrical features of the training samples but do not create new scenarios with previously unseen layouts of objects. The lack of modeling a realistic class population of the scenes is still a bottleneck of augmentation techniques. This problem can be addressed by augmentation that uses simulated virtual data and scene configurations. However, the effect of such data on training is limited due to nonrealistic physical and visual features compared to real data.

We focus on improving the learning of 3D perception networks by enhancing LiDAR data in autonomous driving scenarios with data augmentation. Depth information allows for per-object manipulation when augmenting the point clouds [8]. We take advantage of the spatial position of annotated objects and place them in different scenes while handling occlusions and class-specific inhabitancy, see Figure 1.







Figure 1: Examples of our augmentation method in 3D object detection and semantic segmentation. First, we insert objects one by one and then simulate their visibility to model realistic occlusions. Note the details of the scene (circled) and the detection of occluded orange points. After removal of the occluded points, we see the final augmented version of the point cloud in the last row.

Our method segments the road and sidewalks for class-specific insertion. Next, the method exploits the bounding boxes of objects to avoid collisions. Compared to the state-of-the-art LiDAR-Aug [8], which is suitable only for object detection, our bounding-box generation allows augmenting semantic segmentation datasets as well and simulates realistic occlusions through a spherical projection. The inserted augmentations come from the same dataset and are placed at the same distance, ensuring natural reflection values and point distribution, including ray dropouts. We evaluate the proposed method on the tasks of 3D object detection and semantic segmentation. Our contribution is twofold:

    • We present a new augmentation framework suitable for both 3D object detection and semantic segmentation.
    • We propose a novel way to model occlusions and physically consistent insertion of objects for augmentation.

We demonstrate the usefulness of our method on autonomous driving benchmarks and show improvement, especially in rarely represented classes. The code for our method is publicly available at https://github.com/ctu-vras/pcl-augmentation.


2. Related Work

2.1. Data Augmentation

One of the first approaches to augmenting LiDAR data was GT-Aug, published as part of the 3D detection model SECOND [11]. GT-Aug adds samples from a ground-truth database that is precomputed before the training phase. The samples are randomly selected and inserted into the scene as is. If a collision occurs, the added object is simply removed. Visibility and occlusion handling of the added scan points, as well as the insertion strategy, are not taken into account. Global data augmentations (Gl-Aug) [4] such as rotation, flip, and scale are commonly used in 3D point-cloud neural networks. These augmentations provide a different geometrical perspective, which supplies the neural network with more diverse training samples. An attempt to automate the augmentation strategy was proposed in [12], which narrows the search space based on previous training iterations. The state-of-the-art LiDAR-Aug [8] enriches the training data to improve the performance of 3D detectors. Additional objects are rendered on the basis of CAD models. Simulation of intensity and raydrop is not discussed in the article. LiDAR-Aug [8] also simulates occlusion between the additional objects and the rest of the scene, unlike GT-Aug [11]. A recent method [13], similar to ours, also focuses on inserting objects into point clouds. The main difference between the methods lies in the visibility simulation. The approach of [13] upsamples the points of the inserted sample, projects them into a range image where visible points are selected, and then sparsifies them again. From our point of view, this approach does not consider possible raydrop on objects located between the ego vehicle and the inserted sample. It can cause parts of the inserted sample to be falsely visible, because some LiDAR beams could drop out on the obstacle and create holes in the range image.

2.2. Data Simulators

The recent progress in computer vision brought large neural networks with a large number of learnable parameters, often unable to reach a saturation point with the size of current training sets. Such models require training on a very large number of annotated examples.







Commonly used solutions include synthetically generated data [14] or game simulators such as Grand Theft Auto V, which was used to generate images with semantic segmentation ground truth [15]. Some simulators built on the Unreal Engine, for example CARLA [16], are also used in autonomous driving research. However, the gap between real and synthetic data remains a great challenge [14]. One of the approaches to deal with this difference and improve portability to the real world is [17, 18], which produces more realistic LiDAR data from simulation by learning GAN models.

2.3. 3D Perception Tasks

Learning in the LiDAR point cloud domain poses challenges, such as low point density in regions at the far end of the field of view, the unordered structure of the data, and sparsity due to the sensor resolution. Three common approaches to aggregating and learning LiDAR features are voxel-based models [19, 11], re-projection of the data into a 2D structure [20, 21], and point-based models [2, 3]. To show the ability to generalize, we evaluate our proposed method with different feature extractors and on the two tasks of 3D object detection and semantic segmentation.

One of the key aspects of our approach is placing the object in a realistic position by estimating the road for vehicle and cyclist insertions and the sidewalk for pedestrian insertions. Recent research has shown that a fast, fully convolutional neural network can predict the road from a bird's eye view projection of the scene [22]. However, this method does not handle occlusions, i.e., it does not predict the road behind obstacles such as vehicles. Non-learnable methods proposed in [23, 24] can separate ground from non-ground points, which can be further improved by utilizing the Jump-Convolution-Process [25]. All these methods (and other established types like RANSAC, PCA, and height thresholding) filter out all ground points regardless of whether they belong to the road or the sidewalk. In our setup, we need to distinguish them, so we rely on a segmentation network learned from the dataset.

3. Method

Our augmentation method places additional objects into an already captured point cloud. The objects must be placed in adequate locations; therefore, the road and pedestrian areas must be estimated (Subsection 3.1). The method avoids collisions between additional objects and objects already present in the original point cloud by analyzing overlapping bounding boxes. Therefore, we need to create bounding boxes for semantic datasets that come without object boxes (Subsection 3.2). Details on placing additional objects are given in Subsection 3.3. Lastly, the method handles realistic occlusions between objects (Subsection 3.4). An overview of the proposed method is visualized in Figure 2.

3.1. Road Estimation

To place new objects, we need to know where they could realistically appear in the scene. This information may be given by HD maps [26, 27] if included in the dataset; however, the KITTI dataset [28] does not provide them. We estimate valid road and sidewalk areas for both tasks according to the pipeline described in Figure 3. First, we pseudo-label 3D points with Cylinder3D [2], a state-of-the-art semantic segmentation neural network pre-trained on the SemanticKITTI dataset [29]. The resulting predictions are then projected onto the 2D LiDAR (x, y) ground plane, discretized with a cell resolution of 1 × 1 meter. Then we divide the space in the scene into the road (vehicle and cyclist placement) and the sidewalk (pedestrian placement) as follows:

Road: To obtain a continuous road area, a morphological closing is applied to the projection. We use a disk-shaped structuring element of size three.

Pedestrian area: The estimate is based on the assumption that pedestrians mostly walk along the road border. Cells closer than two pixels to the border of the road estimate are selected and subsequently dilated. We use a disk-shaped structuring element of size two.

SemanticKITTI contains the pose of each point cloud in a sequence. Therefore, road and sidewalk labels can be transformed into a global coordinate system and accumulated in space. The accumulated sequence of road and sidewalk labels leads to a more accurate estimate of the placement areas in the 2D LiDAR (x, y) ground plane projection. Accumulating multiple scans in one frame densifies the LiDAR point cloud and naturally reduces the need for morphological operations.
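As an illustration of this step, a minimal sketch using SciPy's morphology routines is given below. It is not the released implementation: the helper placement_maps and its arguments are our own naming, it assumes the pseudo-labeled road points have already been rasterized into a binary bird's-eye-view grid with 1 × 1 m cells, and the handling of the two-pixel border rule is one possible reading of the text.

```python
import numpy as np
from scipy import ndimage


def placement_maps(road_mask: np.ndarray, closing_size: int = 3,
                   border_px: int = 2, dilation_size: int = 2):
    """Sketch of the road/sidewalk placement maps from a binary BEV road grid.

    road_mask: HxW boolean grid, True where pseudo-labeled road points fall
    (1 x 1 m cells). The "dimension" of the disk from the paper is interpreted
    here as the disk radius.
    """
    def disk(radius: int) -> np.ndarray:
        y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
        return x**2 + y**2 <= radius**2

    # Road: morphological closing fills small gaps caused by occlusions/sparsity.
    road = ndimage.binary_closing(road_mask, structure=disk(closing_size))

    # Pedestrian area: a thin band along the road border, subsequently dilated.
    eroded = ndimage.binary_erosion(road, structure=disk(border_px))
    border = road & ~eroded
    pedestrian = ndimage.binary_dilation(border, structure=disk(dilation_size))
    pedestrian &= ~road          # keep only cells outside the road itself

    return road, pedestrian
```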







Figure 2: Overview of the proposed pipeline. We process the data to estimate all possible placements, all bounding boxes in the scene, and the augmenting objects taken from different frames. The possible placement of an augmenting object is the conjunction of the same depth as the cut-out object (yellow circle) and a suitable area from the map of possible insertions (green). Occlusion handling is performed in spherical projection. The result is re-projected back into the scene, producing the 3D augmented point cloud.

3.2. Creating Bounding Boxes

For a collision-free placement of objects, bounding boxes are required. A bounding box is parameterized by the center coordinates (x, y, z), size dimensions (l, w, h), and heading angle (yaw). For object detection on the KITTI dataset, the bounding boxes are already provided as ground-truth labels. However, the SemanticKITTI dataset contains only the semantic label of the class together with the instance of the object (each object in one frame has a different instance). We mitigate the absence of the bounding boxes by separating individual objects from the scene based on the instance and estimating their bounding boxes, see Figure 4. In the absence of instance labels, we would cluster the semantically segmented points into instances via density-based clustering. In the case of close-by segments, more than one instance can be inserted without damaging the consistency of our approach.

Modeling the bounding boxes is divided into three steps:

Wrapping: Object-labeled 3D LiDAR points are projected onto the ground plane. The 2D projected points are wrapped in a convex hull.

Smallest area: Assume the convex hull consists of n points. We construct n − 1 rectangles so that two neighboring points on the convex hull compose one side of each rectangle. The remaining sides of the rectangle are added so as to achieve the smallest area.

Refinement: Some objects may be represented by too few points, either because they are scanned at a great distance or because they are significantly occluded by closer objects. Bounding boxes may also be distorted by occlusions. We therefore analyze the heights, widths, and lengths of the bounding boxes in the KITTI dataset for the classes "Car", "Pedestrian", and "Cyclist", which we also use in SemanticKITTI. We obtain the distributions for each class and parameter, and for each random variable we calculate the lowest decile. The lowest-decile values serve as the minimum thresholds of the bounding box dimensions. The maximal values are set to the maximal values of the corresponding dimension that occurred in the KITTI dataset.

For bicycle, motorcycle, motorcyclist, and truck objects in the SemanticKITTI dataset, we do not have corresponding bounding box statistics since these classes are not present in KITTI. Therefore, the limits were hand-crafted from the first 100 generated samples from SemanticKITTI. We again used the first decile, but with a 10% margin of safety.
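The wrapping and smallest-area steps amount to fitting a minimum-area rectangle aligned with one of the convex-hull edges. The sketch below is our own illustration of that idea, not the paper's code; fit_bev_box is a hypothetical helper, and the z extent and the decile-based refinement are omitted.

```python
import numpy as np
from scipy.spatial import ConvexHull


def fit_bev_box(points_xy: np.ndarray):
    """Minimum-area rectangle around 2D object points (bird's eye view).

    points_xy: (N, 2) object points projected to the ground plane (assumes at
    least three non-collinear points). Returns (cx, cy, length, width, yaw) of
    the tightest rectangle with one side aligned to a convex-hull edge.
    """
    hull = points_xy[ConvexHull(points_xy).vertices]       # hull vertices, CCW
    edges = np.diff(np.vstack([hull, hull[:1]]), axis=0)   # consecutive hull edges
    angles = np.arctan2(edges[:, 1], edges[:, 0])

    best = None
    for ang in angles:
        # Rotate points so that the candidate edge becomes axis-aligned.
        c, s = np.cos(-ang), np.sin(-ang)
        rot = points_xy @ np.array([[c, -s], [s, c]]).T
        mins, maxs = rot.min(axis=0), rot.max(axis=0)
        area = np.prod(maxs - mins)
        if best is None or area < best[0]:
            center_rot = (mins + maxs) / 2.0
            cb, sb = np.cos(ang), np.sin(ang)
            center = np.array([[cb, -sb], [sb, cb]]) @ center_rot  # back to原 frame
            best = (area, center, maxs - mins, ang)

    _, center, dims, ang = best
    if dims[0] >= dims[1]:
        length, width, yaw = dims[0], dims[1], ang
    else:
        length, width, yaw = dims[1], dims[0], ang + np.pi / 2.0
    return center[0], center[1], length, width, yaw
```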








Figure 3: Rich map generation. Road maps are created from the points' positions and labels. Semantic datasets already contain labels for each road point; in the case of the detection dataset, labels are pseudo-labeled by a neural network [2]. We then project the segmented points into a 2D bird's eye view and acquire road and sidewalk maps by morphological operations on the 2D projection, namely closing for the road and dilation of the road boundary for the sidewalk (pedestrian area).

Figure 4: Creation of the bounding box in bird's eye view around a car. First, a convex hull is constructed around the points; then we fit a bounding box to estimate the position x, y, the dimensions length, width, height, and the orientation yaw. The z coordinate is estimated so that the object touches the road without intersecting it.

3.3. Placing of Objects

Placing one or multiple objects requires knowing the bounding box dimensions and yaw angles. Only points within the bounding boxes are used to augment different frames of the dataset. For the semantic segmentation task, these points are further filtered to have an appropriate label. In the case of the object detection datasets, points that are pseudo-labeled as the road or sidewalk classes are removed to ensure that the cut-out point cloud contains only the object points.

To keep the augmentation as realistic as possible, our method places the object at the same distance and with the same observation angle. This is achieved by rotating its point cloud around the vertical z-axis of the frame origin. This way, realistic object point density and LiDAR intensity are maintained, since the range between the sensor and the object is preserved, and the observation angle stays the same. Then, we look for a collision-free location for the insertion:

Location: Objects must be fully located on the appropriate surface. We place vehicles and cyclists on the road and pedestrians on sidewalks. Though pedestrians can move on the road as well, we do not observe this occurrence in the evaluation datasets and therefore do not consider it during insertion. For each appropriate position, the z coordinate of the object is adjusted to ensure that the object touches the surface according to the predicted road level.

Collision avoidance: First, the bounding box belonging to the object is cut from the source scene and placed in the augmented frame at road level. For the insertion of vehicles and cyclists, the bounding box must not contain any point other than road; the same holds for pedestrians and the pedestrian area. Then, we check whether the inserted bounding box overlaps with any of the original boxes in the augmented scene and skip the insertion if it does.
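In code, moving a cut-out object to a new position at the same range reduces to a rotation of its points and box about the sensor's vertical axis. The following is a minimal sketch under our own naming (rotate_object_about_z), with the ground-level z adjustment and collision checks left out.

```python
import numpy as np


def rotate_object_about_z(points: np.ndarray, box: np.ndarray,
                          target_azimuth: float):
    """Move a cut-out object to a new azimuth while preserving its range.

    points: (N, 4) x, y, z, intensity of the object points.
    box:    (7,) cx, cy, cz, l, w, h, yaw of its bounding box.
    target_azimuth: desired azimuth (rad) of the box center in the LiDAR frame.

    Rotating about the sensor's z-axis keeps the sensor-to-object distance and
    the observation angle, so point density and intensity remain plausible.
    """
    cx, cy = box[0], box[1]
    delta = target_azimuth - np.arctan2(cy, cx)      # how far to rotate

    c, s = np.cos(delta), np.sin(delta)
    rot = np.array([[c, -s], [s, c]])

    out = points.copy()
    out[:, :2] = points[:, :2] @ rot.T               # rotate x, y of every point

    new_box = box.copy()
    new_box[:2] = rot @ box[:2]                      # rotate the box center
    new_box[6] = box[6] + delta                      # and its heading angle
    return out, new_box
```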


3.4. Occlusion Handling

When inserting objects into the scene, we model consistent occlusions in the point cloud caused by the newly added points. We consider both the occlusion of a newly inserted object by original points closer to the LiDAR sensor and the occlusions caused by the inserted object itself.

Data projection: The occlusion handling uses a spherical projection, similarly to [20], to resolve realistic visibility after the additional object is placed. The spherical projection stores the minimal distance between the sensor and the points projected to the corresponding pixel. To correct holes in the object, the projection is morphologically closed with a rectangular structuring element of size 5 × 3 (5 rows and 3 columns). The pixels filled by the closing are assigned a depth computed from the neighboring pixels as the average of the depths in that neighborhood. Morphological closing is computed separately for the scene and the object.
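A range-image projection of this kind can be sketched as follows. This is our simplified illustration rather than the paper's implementation; the image resolution and vertical field of view are assumptions that would have to be matched to the sensor (a 64-beam HDL-64E for KITTI), and the hole-closing step is not included.

```python
import numpy as np


def spherical_projection(points: np.ndarray, h: int = 64, w: int = 2048,
                         fov_up_deg: float = 3.0, fov_down_deg: float = -25.0):
    """Project points into a range image, keeping the closest point per pixel.

    points: (N, 3+) x, y, z (plus optional features) in the sensor frame.
    Returns (depth_image, index_image); index_image[v, u] holds the index of
    the closest point mapped to that pixel, or -1 where the pixel is empty.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)

    yaw = np.arctan2(y, x)                            # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(depth, 1e-8))    # elevation

    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = ((0.5 * (1.0 - yaw / np.pi)) * w).astype(np.int32) % w
    v = (fov_up - pitch) / (fov_up - fov_down) * h
    v = np.clip(v, 0, h - 1).astype(np.int32)

    depth_image = np.full((h, w), np.inf, dtype=np.float32)
    index_image = np.full((h, w), -1, dtype=np.int64)

    # Write points from far to near, so the nearest point per pixel wins.
    order = np.argsort(-depth)
    depth_image[v[order], u[order]] = depth[order]
    index_image[v[order], u[order]] = order
    return depth_image, index_image
```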






Algorithm 1 Occlusion handling
Input: scene point cloud P, scene spherical projection, object point cloud, object spherical projection
Output: success, scene point cloud
 1: point_counter ← 0
 2: success ← False
 3: for each pixel in the object's spherical projection do
 4:     if the object's distance is smaller than the scene's distance in that pixel then
 5:         Remove the scene points in the pixel (they are occluded)
 6:         Add the points projected to this object pixel to the scene
 7:         point_counter ← point_counter + number of added points
 8:     end if
 9: end for
10: if point_counter > minimal number of points for the class then
11:     success ← True
12: end if
13: return success, scene

Removing occluded points: The algorithm goes through every pixel of the spherical projection. Every pixel stores the distance of its point. All scene points more distant than the inserted point are removed, since they would naturally be occluded by the added object. Consequently, all object points that were projected into the same pixel are added to the scene point cloud. The algorithm also returns a boolean value indicating whether the number of added sample points exceeds a class-specific threshold. We use this to prevent extremely hard cases with only, e.g., three visible points of the object. Pseudocode is given in Algorithm 1.
                                                                        the average (mIoU) for all classes.
4. Experiments

In this section, we show the experimental evaluation of our method on the KITTI and SemanticKITTI datasets, with a comparison to other types of data augmentation, namely global augmentation [4], ground-truth insertion [11], and LiDAR-Aug [8]. We experiment with two neural networks for each task.

4.1. Datasets and Perception Tasks

3D object detection: We use the KITTI 3D object detection benchmark. The dataset consists of 7,481 training scenes and 7,518 testing scenes with three object classes: "car", "pedestrian", and "cyclist". The test labels are not accessible, and access to the test server is limited. Therefore, we followed the methodology proposed by [8] and divided the training dataset into training and validation parts, where the training set contains 3,712 and the validation set 3,769 LiDAR samples [30]. The split was made consistent with the standard KITTI protocol, i.e., avoiding similar frames and scenes appearing in both sets. The evaluation was carried out on the validation set, where the labels are available, as was done in [8, 28]. For object detection, we consider all available classes, i.e., cars, pedestrians, and cyclists.

The evaluation metric is the standard average precision (AP) computed from 11 uniformly sampled recall values. Predictions are counted as true positives if their overlap with a ground-truth box exceeds an IoU threshold of 50% for pedestrians and cyclists and 70% for cars. We denote the AP for "Pedestrian" as APPed 50(%), for "Cyclist" as APCyc 50(%), and for "Car" as APCar 70(%). The predictions are divided by difficulty into "Easy", "Moderate", and "Hard" based on bounding box size, occlusion, and truncation, as required by the KITTI benchmark [28].
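For reference, the 11-point interpolated average precision used by the KITTI protocol can be written as follows (standard definition, restated here for clarity):

```latex
\mathrm{AP} = \frac{1}{11} \sum_{r \in \{0,\, 0.1,\, \dots,\, 1\}} \max_{\tilde{r} \ge r} p(\tilde{r}),
```

where p(r̃) denotes the precision at recall r̃.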
Semantic segmentation: We use the SemanticKITTI [29] benchmark. The dataset is an extension of the original KITTI [28] benchmark with dense point-wise annotations provided for each 360° field-of-view frame. The dataset offers 23,201 3D scans for training and 20,351 for testing, with 19 annotated classes. The training dataset was divided into training and validation parts.

The standard intersection over union, IoU = TP/(TP + FP + FN), is used for comparison. Performance is evaluated for each class, as well as averaged over all classes (mIoU).

4.2. 3D Perception Models

We tested the augmented data on two 3D object detection models, each based on a different type of feature extractor backbone. PV-RCNN [31] is a 3D object detection model that combines a 3D voxel convolutional neural network with a PointNet-based set abstraction approach [32]. The second is PointPillar [1], which encodes the point cloud in vertical pillars; the pillars are later transformed into pseudo-image features.

For the segmentation task, we use the Cylinder3D [2] and SPVNAS [3] multiclass segmentation networks. Cylinder3D [2] is the top-performing architecture with public code on the SemanticKITTI benchmark. SPVNAS [3] achieves a significant computation reduction due to sparse point-voxel convolution and holds fourth place on the competitive SemanticKITTI leaderboard, right behind Cylinder3D [2].






Table 1
Semantic segmentation on SemanticKITTI. Comparison of our method with the global augmentation baseline. Both methods are evaluated with the SPVNAS [3] and Cylinder3D [2] architectures. The reported results are averaged over five runs for SPVNAS; only one run was performed for Cylinder3D due to the long training time. The augmented categories are denoted by * for SPVNAS and by ** for Cylinder3D. We observe a performance gain in each of them except for one: truck. The improvement is especially notable in the motorcyclist class, which contains only a few training examples in the dataset when only global augmentations are used.

Class IoU (%)         SPVNAS          SPVNAS           Cylinder3D      Cylinder3D
                      w/o Obj-Aug     w Real3D-Aug     w/o Obj-Aug     w Real3D-Aug
mIoU                  60.62           62.76            58.83           63.00
car **                95.47           95.93            95.63           96.27
bicycle */**          29.64           44.13            42.67           50.47
motorcycle */**       58.16           73.41            59.37           71.29
truck */**            64.22           49.24            33.28           64.28
other-vehicle         47.69           48.43            41.03           50.20
person */**           66.24           70.34            67.15           69.78
bicyclist */**        79.14           85.45            78.83           88.84
motorcyclist */**     0.04            12.01            0.00            12.66
road                  93.06           92.84            92.48           93.37
parking               48.52           45.66            42.24           35.43
sidewalk              80.20           79.66            78.49           79.81
other-ground          1.72            2.91             0.02            0.00
building              89.75           89.36            89.86           90.60
fence                 58.67           56.96            57.32           59.86
vegetation            87.88           89.18            87.43           87.42
trunk                 67.07           67.61            67.23           59.02
terrain               73.40           76.72            73.70           73.71
pole                  63.51           63.73            65.03           64.83
traffic-sign          47.34           48.88            45.93           49.24




Each neural network was set to the default parameters proposed by the authors of the respective architecture, with which its performance is reported on the KITTI 3D benchmark and SemanticKITTI. We trained each neural network three times for object detection and five times for semantic segmentation, and the average performance was taken as the final score of the method.

4.3. Augmentations

All augmentations were trained with the same hyperparameters to ensure a fair comparison between methods. GT-Aug was performed with the information from the precomputed planes, which approximate the ground in the KITTI dataset. This step should ensure that the inserted objects lie on the ground. For our proposed augmentation method, we add only objects with a zero-occlusion KITTI label (Easy). Some of them are naturally transformed into other difficulties (Moderate and Hard) by the newly created occlusions.

For global augmentation of the scenes, we used uniformly distributed scaling of the scene in the range [0.95, 1.05], rotation around the z-axis (vertical axis) in the range [−45°, 45°], and random flipping of the point cloud over the x-axis, as in [4, 8].

The maximum number of added objects in semantic segmentation was set to 10 per scene, and the object class is selected randomly (uniform distribution) for each insertion.
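These global augmentations amount to a random similarity transform of the whole scene. A minimal sketch with the ranges quoted above is given below; the helper global_augment and the box convention (cx, cy, cz, l, w, h, yaw) are our own assumptions, not the training code used in the experiments.

```python
import numpy as np


def global_augment(points: np.ndarray, boxes: np.ndarray, rng=np.random):
    """Random global scale, z-rotation, and x-axis flip of a LiDAR scene.

    points: (N, 4) x, y, z, intensity.  boxes: (M, 7) cx, cy, cz, l, w, h, yaw.
    Ranges follow the values quoted in the text: scale in [0.95, 1.05],
    rotation in [-45, 45] degrees, flip over the x-axis with probability 0.5.
    """
    scale = rng.uniform(0.95, 1.05)
    angle = rng.uniform(-np.pi / 4.0, np.pi / 4.0)
    flip = rng.random() < 0.5

    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    pts = points.copy()
    pts[:, :3] = scale * (pts[:, :3] @ rot.T)

    bxs = boxes.copy()
    bxs[:, :3] = scale * (bxs[:, :3] @ rot.T)
    bxs[:, 3:6] *= scale
    bxs[:, 6] += angle

    if flip:                      # mirror over the x-axis (negate y)
        pts[:, 1] *= -1.0
        bxs[:, 1] *= -1.0
        bxs[:, 6] *= -1.0
    return pts, bxs
```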
4.4. Evaluation

We compare our method (Real3D-Aug) with the copy-and-paste augmentation GT-Aug [11] and with the state-of-the-art LiDAR-Aug augmentation [8]. In the Real3D-Aug multiclass setting (mc), we added 4.7 pedestrians and 6.7 cyclists per scene on average. All methods were trained with global augmentations [4] unless stated otherwise.

In Table 2 we show the results of LiDAR-Aug with PV-RCNN. The numbers are taken from the original paper due to the unpublished code and the lack of technical details about their CAD models and ray-drop characteristics. In the original article, LiDAR-Aug was trained under unknown hyperparameters and was not applied to the cyclist category. Our method surpasses LiDAR-Aug in the pedestrian class by a large margin across all difficulties. Both GT-Aug and Real3D-Aug achieve significant performance improvements. Real3D-Aug achieves a significant improvement with PV-RCNN in the pedestrian class, where we achieve 15.4%, 10.96%, and 7.87% improvement in the Easy, Moderate, and Hard difficulty, whereas GT-Aug achieves 7.52%, 3.74%, and 0.48% improvement compared to the model without (w/o) any object augmentation. Our method also slightly improves the performance on the car class, but LiDAR-Aug and GT-Aug surpass it there.

Table 2
Object detection results with PV-RCNN. Our method achieves the best results in the "pedestrian" category and for "easy" cyclists. (mc) abbreviates multiclass.

                     APCar 70(%)               APPed 50(%)               APCyc 50(%)
 Method              Easy    Mod     Hard      Easy    Mod     Hard      Easy    Mod     Hard
 w/o Object-Aug      87.77   78.12   76.88     65.92   59.14   54.51     76.80   59.36   56.61
 GT-Aug [11]         89.17   81.92   78.78     65.69   59.33   54.78     88.30   72.55   67.79
 LiDAR-Aug [8]       90.18   84.23   78.95     65.05   58.90   55.52     N/A     N/A     N/A
 Real3D-Aug (mc)     88.70   78.63   78.09     73.57   66.55   62.17     92.69   65.06   63.43

In Table 1 we show the results for the SPVNAS [3] and Cylinder3D [2] architectures. In the semantic segmentation task, we increased the mean IoU for both networks.






We do not compare with GT-Aug [11] and LiDAR-Aug [8] on the semantic segmentation task, since these methods were not designed for segmentation, whereas our method can augment both tasks.

In the semantic segmentation task with SPVNAS, we achieve an increase of 2.14 in mean IoU compared to the common augmentation technique [4], see Table 1. We observe an increased IoU for all added classes except for the truck category. With the Cylinder3D network, the increase can be seen in the IoU of all added classes. Our method also increases the performance on classes that were not augmented, since we add more negative examples for other, similar classes.

4.5. Ablation Study of Object Detection

In Tables 3 and 4 we show the influence of adding a single object class to the scene in comparison to GT-Aug. Each configuration is named after the added class, and the lower index indicates the average number of objects added per scene. We can see that, in the case of PointPillar, adding only one class decreases the performance on the other class. We suspect that this is caused by similarities between classes; for example, pedestrians and bicycles are simultaneously present in the class "cyclist". Therefore, it is beneficial to add both classes simultaneously. In the case of PV-RCNN, adding one class improves the performance of both.

Table 3
Real3D-Aug object detection results with the PointPillar architecture, based on the number of inserted classes.

                              APPed 50(%)               APCyc 50(%)
 Augmentation          Easy      Mod     Hard    Easy      Mod     Hard
 GT-Aug                54.52     49.04   45.49   77.64     61.30   58.15
 Real3D-Aug (Ped1)     55.72     51.30   47.47   46.33     33.84   32.47
 Real3D-Aug (Cyc1)     46.87     44.17   41.77   72.65     52.71   49.04
 Real3D-Aug (mc)       55.50     52.00   49.03   76.82     52.74   50.18

Table 4
Real3D-Aug object detection results with PV-RCNN, based on the number of inserted classes.

                              APPed 50(%)               APCyc 50(%)
 Augmentation          Easy      Mod     Hard    Easy      Mod     Hard
 GT-Aug                65.69     59.33   54.78   88.30     72.55   67.79
 Real3D-Aug (Ped1)     70.96     66.63   61.14   78.97     63.47   57.31
 Real3D-Aug (Cyc1)     65.63     59.14   57.47   82.79     63.69   62.39
 Real3D-Aug (mc)       73.57     66.55   62.17   92.69     65.06   63.43

5. Conclusion

We propose Real3D-Aug, an object-centered point cloud augmentation technique for 3D detection and semantic segmentation tasks. Our method improves performance on important and rarely occurring classes, e.g., pedestrian, cyclist, and motorcyclist, among others. The method is self-contained and requires only 3D data. All augmentations can be precomputed, so training time does not increase. One way to further improve the method is to incorporate a more informative selection of placements based on the uncertainty of the detection model.

Acknowledgments

This work was supported in part by the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics" and by the Grant Agency of the CTU in Prague under project SGS22/111/OHK3/2T/13. The authors want to thank colleagues from Valeo R&D for discussions and the Valeo company for its support.

References

 [1] J. Tu, P. Wang, F. Liu, PP-RCNN: Point-Pillars Feature Set Abstraction for 3D Real-time Object Detection, in: IEEE International Joint Conference on Neural Networks (IJCNN), 2021.
 [2] X. Zhu, H. Zhou, T. Wang, F. Hong, Y. Ma, W. Li, H. Li, D. Lin, Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
 [3] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, S. Han, Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution, in: European Conference on Computer Vision (ECCV), 2020.
 [4] M. Hahner, D. Dai, A. Liniger, L. V. Gool, Quantifying Data Augmentation for LiDAR based 3D Object Detection, arXiv:2004.01643 (2020).
 [5] X. Xu, Z. Chen, F. Yin, CutResize: Improved Data Augmentation Method for RGB-D Object Recognition, IEEE Robotics and Automation Letters (RA-L) (2022).
 [6] J. Yang, S. Shi, Z. Wang, H. Li, X. Qi, ST3D: Self-Training for Unsupervised Domain Adaptation on 3D Object Detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.






 [7] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A Simple Framework for Contrastive Learning of Visual Representations, in: International Conference on Machine Learning (ICML), 2020.
 [8] J. Fang, X. Zuo, D. Zhou, S. Jin, S. Wang, L. Zhang, LiDAR-Aug: A General Rendering-based Augmentation Framework for 3D Object Detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
 [9] N. Cauli, D. Reforgiato Recupero, Survey on Videos Data Augmentation for Deep Learning Models, Future Internet (2022).
[10] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, P. Vajda, Unbiased Teacher for Semi-Supervised Object Detection, in: International Conference on Learning Representations (ICLR), 2021.
[11] Y. Yan, Y. Mao, B. Li, SECOND: Sparsely Embedded Convolutional Detection, Sensors (2018).
[12] S. Cheng, Z. Leng, E. D. Cubuk, B. Zoph, C. Bai, J. Ngiam, Y. Song, B. Caine, V. Vasudevan, C. Li, Q. V. Le, J. Shlens, D. Anguelov, Improving 3D Object Detection through Progressive Population Based Augmentation, in: European Conference on Computer Vision (ECCV), 2020.
[13] Y. Ren, S. Zhao, L. Bingbing, Object Insertion Based Data Augmentation for Semantic Segmentation, in: International Conference on Robotics and Automation (ICRA), 2022.
[14] P. Vacek, O. Jašek, K. Zimmermann, T. Svoboda, Learning to Predict Lidar Intensities, IEEE Transactions on Intelligent Transportation Systems (T-ITS) (2021).
[15] S. R. Richter, V. Vineet, S. Roth, V. Koltun, Playing for Data: Ground Truth from Computer Games, in: European Conference on Computer Vision (ECCV), 2016.
[16] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, CARLA: An Open Urban Driving Simulator, in: Conference on Robot Learning (CoRL), 2017.
[17] A. E. Sallab, I. Sobh, M. Zahran, N. Essam, LiDAR Sensor Modeling and Data Augmentation with GANs for Autonomous Driving, arXiv:1905.07290 (2019).
[18] A. E. Sallab, I. Sobh, M. Zahran, M. Shawky, Unsupervised Neural Sensor Models for Synthetic LiDAR Data Augmentation, Advances in Neural Information Processing Systems (NIPS) (2019).
[19] Y. Zhou, O. Tuzel, VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[20] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, M. Tomizuka, SqueezeSegV3: Spatially-Adaptive Convolution for Efficient Point-Cloud Segmentation, in: European Conference on Computer Vision (ECCV), 2020.
[21] A. Milioto, I. Vizzo, J. Behley, C. Stachniss, RangeNet++: Fast and Accurate LiDAR Semantic Segmentation, in: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.
[22] L. Caltagirone, S. Scheidegger, L. Svensson, M. Wahde, Fast LIDAR-based Road Detection Using Fully Convolutional Neural Networks, in: IEEE Intelligent Vehicles Symposium (IV), 2017.
[23] P. Chu, S. Cho, S. Fong, K. Cho, Enhanced Ground Segmentation Method for Lidar Point Clouds in Human-centric Autonomous Robot Systems, Human-centric Computing and Information Sciences (HCIS) (2019).
[24] I. Bogoslavskyi, C. Stachniss, Efficient Online Segmentation for Sparse 3D Laser Scans, Photogrammetrie, Fernerkundung, Geoinformation (PFG) (2016).
[25] Z. Shen, H. Liang, L. Lin, Z. Wang, W. Huang, J. Yu, Fast Ground Segmentation for 3D LiDAR Point Cloud Based on Jump-Convolution-Process, Remote Sensing (2021).
[26] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, O. Beijbom, nuScenes: A Multimodal Dataset for Autonomous Driving, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[27] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, J. Hays, Argoverse: 3D Tracking and Forecasting With Rich Maps, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[28] A. Geiger, P. Lenz, R. Urtasun, Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[29] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, J. Gall, SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences, in: IEEE/CVF International Conference on Computer Vision (ICCV), 2019.






[30] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, R. Urtasun, 3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[31] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, H. Li, PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[32] C. R. Qi, L. Yi, H. Su, L. J. Guibas, PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, in: Advances in Neural Information Processing Systems (NIPS), 2017.



