3D Object Detection Algorithm Based on Point Cloud Multi-View Fusion

Yuan Liu1, Zhanlei Fang1, Yanqiang Li1,2,*, Kang Wang1, Yong Wang1, Chao Zhang1
1 Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
2 School of Control Sciences and Engineering, Shandong University, Jinan, China
* Corresponding author: Liyq@sdas.org (Yanqiang Li)

Abstract
3D object detection algorithms based on a single point cloud view have inherent limitations and cannot meet the requirements of complex scenes such as autonomous driving. Moreover, most existing point cloud multi-view fusion algorithms consider only two views and fuse them in a simple, inefficient way. To coordinate the multiple view representations of point clouds, make full use of the advantages of different views, and alleviate their respective shortcomings in the 3D object detection task, we propose a multi-view fusion detection algorithm named PVR-SSD (Point-Voxel-Range Single Stage object Detector). PVR-SSD takes a point-based, anchor-free center assignment algorithm as its backbone and performs point cloud multi-view fusion in two parts. In the downsampling part, a point-range segmentation network is used for selective downsampling to increase the proportion of foreground points, especially points on small objects. In the feature fusion part, a point-voxel-range feature fusion module is designed, and an attention mechanism is introduced to adaptively aggregate multi-view features with the point-based view as an intermediate carrier. Finally, comprehensive evaluations on the highly competitive KITTI dataset demonstrate the effectiveness of the proposed algorithm.

Keywords
3D Object Detection; Deep Learning; Multi-View Fusion; Attention Mechanism

1. Introduction
3D object detection is receiving increasing attention due to its wide range of applications, such as intelligent robotics and autonomous driving. Currently, the most widely used 3D sensor is LiDAR. The raw data obtained by LiDAR is a 3D point cloud, a collection of points with spatial location information. With some preprocessing, it can be converted into other common view representations, such as the voxel view and the range view. Point-based algorithms extract features directly from the original point cloud and therefore retain accurate location information, but the disorder of the point cloud makes neighborhood search inefficient and computationally expensive. The starting point of both voxel-based and range-based algorithms is to regularize the irregular point cloud. Voxel-based algorithms rasterize the point cloud in 3D space and then extract features with 3D sparse convolutions. The voxel-based view effectively preserves physical dimension information, but voxelization inevitably causes information loss, which reduces the fine-grained localization accuracy of the algorithm. Compared with the voxel-based view, the range-based view has a more compact representation without quantization error, which helps alleviate the sparsity of point clouds. However, the dimensional compression caused by the 2D projection inevitably distorts the geometric structure and loses spatial information[1]. 3D object detection algorithms based on a single view therefore suffer from problems of varying degrees, and it is difficult for them to balance detection accuracy and speed.
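As a concrete illustration of these three representations, the following minimal NumPy sketch (sensor parameters, grid sizes, and value ranges are assumptions for illustration, not the settings used later in this paper) converts a toy point cloud into a voxel occupancy grid and a spherical range image:

```python
import numpy as np

# Toy point cloud: N x 4 array of (x, y, z, intensity); values here are illustrative only.
points = np.random.uniform(low=[1.0, -20.0, -2.0, 0.0], high=[40.0, 20.0, 1.0, 1.0], size=(1000, 4))

# Voxel view: quantize (x, y, z) into a fixed grid (assumed 0.2 m voxels over an assumed range).
voxel_size = np.array([0.2, 0.2, 0.2])
pc_min = np.array([0.0, -20.0, -2.0])
grid_dims = np.array([200, 200, 15])                      # assumed grid resolution
idx = np.floor((points[:, :3] - pc_min) / voxel_size).astype(int)
inside = np.all((idx >= 0) & (idx < grid_dims), axis=1)
occupancy = np.zeros(grid_dims, dtype=np.float32)
occupancy[tuple(idx[inside].T)] = 1.0                     # quantization: many points share one voxel

# Range view: spherical projection onto an H x W image that stores depth per pixel.
H, W = 64, 512                                            # assumed beam / azimuth resolution
fov_up, fov_down = np.radians(3.0), np.radians(-25.0)     # assumed vertical field of view
depth = np.linalg.norm(points[:, :3], axis=1)
phi = np.arctan2(points[:, 1], points[:, 0])              # azimuth angle
theta = np.arcsin(points[:, 2] / np.maximum(depth, 1e-6)) # inclination angle
row = ((fov_up - theta) / (fov_up - fov_down) * (H - 1)).round().astype(int)
col = ((phi + np.pi) / (2 * np.pi) * (W - 1)).round().astype(int)
valid = (row >= 0) & (row < H)
range_image = np.zeros((H, W), dtype=np.float32)
range_image[row[valid], col[valid]] = depth[valid]        # later points overwrite earlier ones
```

The overwrite in the last line is where the projection loses information: several 3D points can compete for one range-image pixel, just as several points can fall into one voxel.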
Using complementary information to preserve strengths and reduce weaknesses by combining different views is an intuitive solution. Initially, researchers hoped to improve the performance of 3D object detection algorithms by fusing projected views from different perspectives. This achieved a certain effect but still failed to break through the performance limitation of 2D space. The fusion scheme based on point-voxel integration is currently the most widely used. However, these schemes perform only single-level feature interaction or result-level fusion, do not make full use of the latent relationships between multi-view features, and cannot measure the importance of the respective features of different views. There is thus still much room for improvement in multi-view fusion for 3D object detection[2].

To sum up, how to effectively coordinate multiple view representations of point clouds, make full use of the advantages of different views, and introduce more effective views to deal with different scenarios is an open problem. Therefore, this paper proposes a single-stage anchor-free 3D object detection algorithm, PVR-SSD, that fuses different view features of point clouds. The main contributions of this paper are as follows:
· We propose a point cloud segmentation sampling strategy that introduces range-based view features to achieve selective sampling of the point cloud and increase the proportion of foreground points.
· We design a multi-view feature fusion module with a self-attention mechanism that adaptively fuses point-based, voxel-based, and range-based view features, effectively improving the accuracy of 3D object detection.
· We conduct extensive experiments on the KITTI dataset to evaluate the effectiveness and efficiency of the proposed method.

Figure 1 PVR-SSD Framework Diagram

2. Our Framework
2.1. Overview
The whole framework of PVR-SSD is shown in Figure 1. It can be divided into four parts, with information interaction across parts. The first three parts are the feature extraction networks for the different views; the fourth part is the deep fusion of the three view features and the generation of the final 3D bounding boxes. The point-based feature extraction network is the backbone of the algorithm. After the original point cloud is input into the network, farthest point sampling (FPS) is used for preliminary preprocessing to reduce the data scale, and the SA module[3] is used for feature extraction. The range-based branch adopts a lightweight U-Net to extract multi-level semantic features, which are then fed into the segmentation sampling layer of the point branch[4]. The segmentation sampling layer aggregates point-based and range-based features and uses a point-range foreground segmentation network for selective downsampling to obtain candidate center points. According to a preset confidence threshold, a certain number of original points are retained and sent to the voxel-based branch of the third part. After voxelization, they are fed into a 3D sparse convolutional network[5] to extract multi-scale voxel features, which are then compressed along the height dimension to obtain BEV features. The features extracted by the three branches, together with the candidate center points, are sent to the PVR feature fusion module to obtain the candidate point set with the best feature combination. The candidate point set is input into the centroid prediction layer, which merges contextual cues around the bounding box to obtain the predicted centroid point set. Finally, the predicted centroid points and the aggregated features are input into the region proposal generation layer, which predicts the class and regresses the 3D bounding box.
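The data flow of Figure 1 can be summarized by the structural sketch below; every component name is a placeholder standing in for the corresponding network described above, not the actual implementation.

```python
def pvr_ssd_forward(points,
                    point_branch, range_branch, voxel_branch,      # stand-ins for the three view extractors
                    seg_sampler, pvr_fusion, centroid_head, rpn):  # stand-ins for the fusion module and heads
    """Structural sketch of the PVR-SSD data flow in Figure 1 (all components are placeholders)."""
    # Part 1: point branch -- FPS preprocessing followed by SA feature extraction.
    sampled_points, point_features = point_branch(points)
    # Part 2: range branch -- lightweight U-Net features on the projected range image.
    range_features = range_branch(points)
    # Point-range segmentation sampling -- keep candidate center points above a confidence threshold.
    candidates, candidate_features = seg_sampler(sampled_points, point_features, range_features)
    # Part 3: voxel branch -- 3D sparse convolution on the retained points, height-compressed to BEV.
    voxel_features, bev_features = voxel_branch(candidates)
    # Part 4: PVR feature fusion, centroid prediction, and proposal generation.
    fused_features = pvr_fusion(candidate_features, voxel_features, bev_features, range_features)
    centroids, context_features = centroid_head(candidates, fused_features)
    return rpn(centroids, context_features)                # predicted classes and 3D boxes
```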
2.2. Point-Range Segmentation Sampling
Common point cloud downsampling algorithms include random sampling, FPS, class-aware sampling, and centroid-aware sampling. The first two are based on traditional geometric reasoning and provide good scene coverage, but their computational cost is high, and they treat all points equally, which cannot reflect the importance of foreground points. Class-aware sampling focuses on sampling more foreground points: based on the point-based view, it introduces a separate training branch to learn the latent semantics of each point, enabling selective downsampling. Centroid-aware sampling builds on class-aware sampling by weighting with a centrality prior, aiming to obtain points closer to the center of each instance[2].

Class-aware sampling and centroid-aware sampling achieve good foreground point preservation, but they exploit the features of the point-based view only, and the sampling network tends to keep more points on large objects while ignoring small objects. Range-based view features are dense, well suited to large-scale outdoor scenes, and cheap to compute. To explore the advantages of downsampling with range-based view features, we tested downsampling with the point-based view, the range-based view, and the fused point-range view, respectively. The results on the KITTI validation set are shown in Table 1 and Table 2; to better compare the methods, data augmentation is applied to the KITTI validation set. The tables show that the point-range method achieves an instance recall similar to the point-based method while sampling a higher proportion of pedestrian and cyclist foreground points, indicating that more small-object points are sampled while instance recall is preserved.

Table 1 Instance recall rate (number of instances covered by sampling points / number of all instances)

Views          1024 points                 512 points
               Car     Ped.    Cyc.        Car     Ped.    Cyc.
Point          98.1%   99.3%   97.5%       98.1%   99.0%   97.2%
Range          97.5%   99.1%   97.2%       97.9%   98.3%   95.6%
Point+Range    98.5%   99.4%   97.6%       98.2%   99.3%   97.9%

Table 2 Proportion of foreground points (proportion of each category among the sampled foreground points)

Views          1024 points                 512 points
               Car     Ped.    Cyc.        Car     Ped.    Cyc.
Point          62.1%   25.5%   12.3%       62.1%   25.5%   12.3%
Point+Range    58.7%   27.7%   13.5%       58.7%   27.7%   13.5%

Combining the above analysis, we propose to use a point-range segmentation network for downsampling.
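Conceptually, the selective downsampling reduces to scoring every point with the fused point-range features (described next) and keeping only the highest-scoring candidates. A minimal sketch, with random scores standing in for the output of a hypothetical segmentation head:

```python
import numpy as np

def segmentation_sampling(points, foreground_scores, num_keep=512, score_threshold=0.3):
    """Keep at most num_keep points, preferring those with high predicted foreground scores.

    points:            (N, 3) point coordinates
    foreground_scores: (N,) per-point foreground probabilities from the segmentation head
    """
    order = np.argsort(-foreground_scores)                          # highest score first
    keep = order[:num_keep]
    keep = keep[foreground_scores[keep] >= score_threshold]         # optional confidence threshold
    return points[keep], keep

# Toy usage: random coordinates and scores stand in for real network outputs.
xyz = np.random.randn(4096, 3).astype(np.float32)
scores = np.random.rand(4096).astype(np.float32)
kept_xyz, kept_indices = segmentation_sampling(xyz, scores)
```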
As shown in Figure 1, after the point-based and range-based features are extracted, feature propagation and fusion between the two views are realized by establishing a point-range view mapping relationship. The mapping from the point-based view to the range-based view (P2R) adopts a spherical projection based on scan expansion[6], which avoids data occlusion and yields a smoother range image. The resulting range image can be described by Eq. (1):

$\mathrm{rangeimage}[\theta, \phi] = \mathrm{depth}$    (1)

where $\theta$ is the inclination angle and $\phi$ is the azimuth angle. The mapping from the range-based view back to the point-based view (R2P) uses bilinear interpolation. To keep the sampling network lightweight, the point-range feature fusion adopts concatenation followed by an MLP. The segmentation sampling strategy uses the fused features to predict the semantic category of each point through a two-layer scale-invariant MLP. The point classification loss adopts the cross-entropy loss. To obtain points closer to the instance center, we weight the loss function according to a centrality prior. The weight is defined as

$Weight = \sqrt[3]{\dfrac{\min(f,b)}{\max(f,b)} \times \dfrac{\min(l,r)}{\max(l,r)} \times \dfrac{\min(u,d)}{\max(u,d)}}$    (2)

where $(f, b, l, r, u, d)$ denote the distances from a foreground point to the front, back, left, right, top, and bottom faces of its bounding box, respectively. Points inside a bounding box are thus assigned different weights according to their spatial location; multiplying the weight with the foreground loss term gives higher importance to points closer to the center. The weighted cross-entropy loss is defined as

$L_{seg} = -\sum_{c=1}^{C}\big(Weight \cdot s_c \log(\hat{s}_c) + (1 - s_c)\log(1 - \hat{s}_c)\big)$    (3)

where $C$ is the number of classes, $s$ is the one-hot semantic label, and $\hat{s}$ is the predicted class probability.

2.3. Multi-View Fusion
The voxel-based view is inferior to the original point-based view in detection accuracy and to the range-based view in detection efficiency, but it offers the best overall detection performance, and the performance of voxel-based algorithms depends heavily on the voxel resolution. To exploit the advantages of the voxel-based view while suppressing its disadvantages, as shown in part 3 of Figure 1, we apply a voxel feature extraction network to the segmented foreground points and then compensate for the sparsity of the voxel features by introducing range-based features.

Figure 2 Point-Voxel-Range Feature Fusion Module

The purpose of multi-view feature fusion is to find the best combination of features from different views, aggregate key information, and eliminate redundant information. The feature aggregation methods commonly used in 3D object detection are feature addition and feature concatenation. Such equally weighted fusion cannot measure the importance of the respective features of different views and inevitably introduces many useless features. Therefore, we design a multi-view feature fusion module for the different views of the same point cloud. The details are shown in Figure 2, where Add denotes feature addition and Multiplication denotes element-wise multiplication.

2.3.1. Attentional Feature Fusion
BEV features and range-based features take the form of 2D pseudo-images, while point-based features and multi-scale voxel features are in 3D form. Therefore, the module first fuses the features pairwise according to their form. The multi-scale voxel features are distributed in 3D space, so they are gathered directly onto the sampled points through the SA module.
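A brute-force NumPy stand-in for this gathering step is shown below; it uses a plain radius query and max-pooling, omitting the per-point MLPs and optimized ball-query kernels of the real SA module.

```python
import numpy as np

def gather_voxel_features(sample_xyz, voxel_xyz, voxel_features, radius=1.0):
    """For each sampled point, max-pool the features of all voxel centers within `radius`.

    sample_xyz:     (M, 3) coordinates of the candidate points
    voxel_xyz:      (V, 3) coordinates of non-empty voxel centers
    voxel_features: (V, C) features produced by the sparse-convolution backbone
    """
    M, C = sample_xyz.shape[0], voxel_features.shape[1]
    gathered = np.zeros((M, C), dtype=voxel_features.dtype)
    for i in range(M):
        dist = np.linalg.norm(voxel_xyz - sample_xyz[i], axis=1)   # distance to every voxel center
        in_ball = dist < radius
        if in_ball.any():
            gathered[i] = voxel_features[in_ball].max(axis=0)      # max-pool inside the ball
    return gathered

# Toy usage with random tensors standing in for real backbone outputs.
samples = np.random.randn(256, 3).astype(np.float32)
centers = np.random.randn(2048, 3).astype(np.float32)
features = np.random.randn(2048, 64).astype(np.float32)
point_wise_voxel_features = gather_voxel_features(samples, centers, features)
```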
The BEV features extracted by the deep network focus more on global information and are relatively sparse due to the limited number of input points, whereas range-based features contain multi-level semantic information and are relatively dense. To better fuse these two semantically and scale-inconsistent feature types, we introduce the attentional feature fusion module AFF[7]. Its structure is shown in Figure 3.

Figure 3 AFF

It can be expressed as

$Z = M(X \oplus Y) \otimes X + (1 - M(X \oplus Y)) \otimes Y$    (4)

where $M(\cdot)$ denotes the network inside the red box in Figure 3 and the dashed arrow denotes the $1 - M(\cdot)$ operation. The network uses two branches with different scales to extract channel attention weights: one branch uses global average pooling to extract the attention of global features, and the other directly uses point-wise convolution to extract the channel attention of local features. To keep the module as lightweight as possible, point-wise convolution is chosen as the local channel context aggregator.

2.3.2. Adaptive Fusion
Finally, the features converge into two branches with the point view as the carrier. To eliminate the interference of redundant information and measure the importance of the different features, the module maps the features of each branch to obtain an initial weight vector per branch. The initial weights of all branches are then combined and normalized into probability weights through a softmax, and the features of the different branches are weighted by these probabilities and summed to obtain the final fused feature:

$F_f = \sum_{i}^{n} \mathrm{split}\big[\mathrm{softmax}(\mathrm{sigmoid}(w_i * F_i))\big]_i \cdot F_i$    (5)

where $F_i$ denotes the feature of branch $i$, $w_i$ denotes the convolution kernel used to estimate the initial weight vector of that branch, $\cdot$ denotes element-wise multiplication, and $n$ is the number of fusion branches, which equals the number of weight vectors.

3. Experiment
3.1. Experimental Setup
The KITTI dataset is one of the most popular benchmarks for 3D detection in autonomous driving. We divide the KITTI training data into a training set of 3712 frames and a validation set of 3769 frames. The IoU thresholds are set to 0.7 for cars and 0.5 for pedestrians and cyclists. PVR-SSD is trained end to end, using Adam with a one-cycle learning rate schedule for optimization. For the KITTI dataset, we train the entire network with batch size 16 and learning rate 0.01 for 80 epochs on two GTX 1080 Ti GPUs, which takes around 6 hours.
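For reference, the setup above can be collected into a single configuration sketch; the key names are hypothetical, while the values are those stated in this subsection.

```python
# Hypothetical configuration dictionary mirroring the experimental setup described above.
PVR_SSD_KITTI_CONFIG = {
    "dataset": {
        "name": "KITTI",
        "train_frames": 3712,
        "val_frames": 3769,
        "iou_thresholds": {"Car": 0.7, "Pedestrian": 0.5, "Cyclist": 0.5},
    },
    "training": {
        "optimizer": "Adam",
        "lr_schedule": "one-cycle",
        "learning_rate": 0.01,
        "batch_size": 16,
        "epochs": 80,
        "hardware": "2 x GTX 1080 Ti",
    },
}
```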
3.2. Experimental Results and Analysis
In the KITTI benchmark, detection targets are divided into three difficulty levels: easy, moderate, and hard. We compare the performance and efficiency of PVR-SSD with existing algorithms, grouped by view representation, in Table 3. The table shows that: 1) in terms of detection accuracy, PVR-SSD outperforms the existing methods in most categories; 2) in terms of detection efficiency, PVR-SSD is faster than the other multi-view algorithms; 3) PVR-SSD brings a particularly large improvement on small objects. This is because the range-based view branch allows the sampling network to effectively preserve foreground points and increase the proportion of small-object points, while multi-view feature fusion provides multi-dimensional, deep feature representations and removes redundant information, enabling comprehensive and accurate perception of objects of different scales and categories. Despite using multi-view features, PVR-SSD still achieves very competitive runtime latency thanks to its anchor-free structure and efficient implementation.

Table 3 Detection performance comparison on the KITTI dataset

View        Method        Reference     GPU           3D Car (IoU=0.7)        3D Ped. (IoU=0.5)       3D Cyc. (IoU=0.5)       Speed (s)
                                                       Easy   Mod.   Hard     Easy   Mod.   Hard      Easy   Mod.   Hard
Point       PointRCNN     CVPR 2019     TITAN XP      86.96  75.64  70.70    47.98  39.37  36.01     74.96  58.82  52.53     0.1
Point       3DSSD         CVPR 2020     TITAN V       88.36  79.57  74.55    54.64  44.27  40.23     82.48  64.10  56.90     0.04
Point       IA-SSD        CVPR 2022     RTX 2080Ti    88.34  80.13  75.04    46.51  39.03  35.60     78.35  61.94  55.70     0.013
Voxel       SECOND        Sensors 2018  GTX 1080Ti    84.65  75.96  68.71    45.31  35.52  33.14     75.83  60.82  53.67     0.04
Voxel       PointPillars  CVPR 2019     GTX 1080Ti    82.58  74.31  68.99    51.45  41.92  38.89     77.10  58.65  51.92     0.016
Range       RangeRCNN     CVPR 2021     Tesla V100    88.47  81.33  77.09    -      -      -         -      -      -         0.06
Multi-View  STD           ICCV 2019     TITAN V       87.95  79.71  75.09    53.29  42.47  38.35     78.69  61.59  55.30     0.08
Multi-View  PVRCNN        CVPR 2020     GTX 1080Ti    90.25  81.43  76.82    52.17  43.29  40.29     78.60  63.71  57.65     0.08
Multi-View  HVPR          CVPR 2021     GTX 2080Ti    86.38  77.92  73.04    53.47  43.96  40.64     -      -      -         -
-           Ours          -             GTX 1080Ti    91.32  81.36  77.16    56.69  45.53  42.26     84.61  65.76  59.39     0.05
-           Improvement   -             -             +1.07  -0.07  +0.07    +2.05  +1.26  +1.62     +2.13  +1.66  +1.74     +0.037

3.3. Ablation Studies
3.3.1. Downsampling Ablation Experiment
To further verify the effectiveness of the proposed point-range segmentation downsampling, we replace it with D-FPS+F-FPS and with centroid-aware sampling. As shown in Table 4, the proposed sampling method achieves the best detection performance in all three categories. For small objects such as pedestrians and cyclists in particular, detection performance improves significantly after adding the range-based view features. This shows that the information provided by range-based view features can enlarge the receptive field, effectively preserve foreground information during downsampling, and increase the proportion of small-object points, thereby achieving better detection performance.

Table 4 Ablation studies of PVR-SSD with different sampling strategies

Method             3D Car (IoU=0.7)   3D Ped. (IoU=0.5)   3D Cyc. (IoU=0.5)
D-FPS + F-FPS      79.37              40.27               61.30
Centroid-aware     80.42              41.02               62.84
Point + Range      81.36              45.53               65.76

3.3.2. Fusion Ablation Experiment
To verify the effectiveness of the feature fusion approach in Figure 2, we study the effect of changing the fusion style on the multi-view features, as shown in Table 5. The proposed fusion yields considerable improvements over both alternatives, and plain feature addition does not improve detection accuracy, indicating that redundant features are harmful to the network and confirming the necessity of the feature fusion strategy in this paper.

Table 5 Ablation studies of PVR-SSD with different fusion strategies

Method   3D Car (IoU=0.7)   3D Ped. (IoU=0.5)   3D Cyc. (IoU=0.5)
Add      79.73              38.63               58.94
Cat      80.46              41.43               63.52
Ours     81.36              45.53               65.76
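To make the comparison in Table 5 concrete, the sketch below contrasts plain addition and concatenation with the adaptive weighting of Eq. (5); fixed random projections stand in for the learned kernels w_i, purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(branch_features, weight_kernels):
    """Eq. (5): per-branch initial weights -> softmax across branches -> weighted sum.

    branch_features: list of n arrays of shape (M, C), one per fusion branch
    weight_kernels:  list of n arrays of shape (C, C), stand-ins for the learned kernels w_i
    """
    initial = np.stack([sigmoid(f @ w) for f, w in zip(branch_features, weight_kernels)])  # (n, M, C)
    probs = softmax(initial, axis=0)                       # probability weights across the n branches
    return sum(p * f for p, f in zip(probs, branch_features))

# Toy comparison with two branches (M = 4 points, C = 8 channels).
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
added = f1 + f2                                            # "Add" baseline in Table 5
concatenated = np.concatenate([f1, f2], axis=1)            # "Cat" baseline in Table 5
fused = adaptive_fusion([f1, f2], [rng.normal(size=(8, 8)), rng.normal(size=(8, 8))])
```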
4. Conclusion
This paper proposes a point-voxel-range view fusion detection algorithm, PVR-SSD. The algorithm introduces a segmentation sampling strategy and an adaptive multi-view fusion method, and extensive experiments on the open KITTI dataset show good detection results. There is still room for further research. Beyond the point, voxel, and range views, point clouds admit other view representations, and real scenes can be captured not only with point clouds but also with other data sources such as images, which can likewise be regarded as samples of the physical world. In the future, more in-depth research can be conducted from the perspective of other views or multi-source data.

This work was supported by the National Natural Science Foundation of China (52072212) and the Natural Science Foundation of Shandong Province (ZR2021MF103).

5. References
[1] Yang Z, Sun Y, Liu S, Jia J. 3DSSD: Point-based 3D single stage object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 11040-11048.
[2] Zhang Y, Hu Q, Xu G, et al. Not all points are equal: Learning highly efficient point-based detectors for 3D LiDAR point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022.
[3] Qi CR, Yi L, Su H, Guibas LJ. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems. 2017;30.
[4] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2015. p. 234-241. Springer, Cham.
[5] Yan Y, Mao Y, Li B. SECOND: Sparsely embedded convolutional detection. Sensors. 2018;18(10):3337.
[6] Fan L, Xiong X, Wang F, Wang N, Zhang Z. RangeDet: In defense of range view for LiDAR-based 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021. p. 2918-2927.
[7] Dai Y, Gieseke F, Oehmcke S, Wu Y, Barnard K. Attentional feature fusion. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2021. p. 3560-3569.