<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>3D Object Detection Algorithm Based on Point Cloud Multi-View Fusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuan Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhanlei Fang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanqiang Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kang Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chao Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences)</institution>
          ,
          <addr-line>Jinan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Control Sciences and Engineering, Shandong University</institution>
          ,
          <addr-line>Jinan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>73</fpage>
      <lpage>80</lpage>
      <abstract>
        <p>3D object detection algorithms based on a single point cloud view have inherent limitations and cannot meet the requirements of complex scenes such as autonomous driving. Moreover, most existing point cloud multi-view fusion algorithms focus on only two views, and their fusion methods are simple and inefficient. In order to coordinate the multiple view representations of point clouds, make full use of the advantages of different views, and alleviate their respective shortcomings in the 3D object detection task, we propose a multi-view fusion detection algorithm, namely PVR-SSD (Point-Voxel-Range Single Stage object Detector). PVR-SSD takes a point-based anchor-free center assignment algorithm as its backbone and performs point cloud multi-view fusion in two parts. In the downsampling part, a point-range segmentation network is used for selective downsampling to increase the proportion of foreground points, especially small-object points. In the feature fusion part, a point-voxel-range feature fusion module is designed, and an attention mechanism is introduced to adaptively aggregate multi-view features with the point-based view as an intermediate carrier. Finally, comprehensive evaluations on the highly competitive KITTI dataset demonstrate the effectiveness of the proposed algorithm.</p>
      </abstract>
      <kwd-group>
        <kwd>3D Object Detection</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Multi-View Fusion</kwd>
        <kwd>Attention Mechanism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>3D object detection is receiving increasing attention due to its wide range of applications, such as intelligent robotics and autonomous driving. Currently, the most widely used sensor for acquiring 3D data is LiDAR. The raw data obtained by LiDAR are represented as a 3D point cloud, a collection of points with spatial location information. With some preprocessing, the point cloud can be converted into other common view representations, such as the voxel view and the range view.</p>
      <p>
        The point-based algorithm directly extracts features from the original point cloud and can thus retain accurate location information, but the disorder of the point cloud makes neighborhood search inefficient and computationally expensive. The starting point of both voxel-based and range-based algorithms is to regularize the irregular point cloud. The voxel-based algorithm rasterizes the point cloud in 3D space and then extracts features through 3D sparse convolution. The voxel-based view can effectively preserve physical dimension information, but voxelization inevitably introduces information loss, which reduces the fine-grained localization accuracy of the algorithm. Compared with the voxel-based view, the range-based view has a more compact representation and no quantization error, which helps alleviate the sparsity problem of point clouds. However, the dimensional compression caused by the 2D projection inevitably brings distortion of the geometric structure and the loss of spatial information[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The 3D object detection algorithm based on a single view suffers from these problems to different degrees, and it is difficult to achieve a balance between detection accuracy and speed. Combining different views to exploit their complementary information, preserving strengths and offsetting weaknesses, is an intuitive solution. Initially, researchers hoped to improve the performance of 3D object detection algorithms by fusing projected views from different perspectives. A certain effect was achieved, but such methods still fail to break through the performance limitations of 2D space. The fusion scheme based on point-voxel integration is currently the most widely used. However, these schemes simply perform single-level feature interaction or result-level fusion, do not make full use of the potential relationships between multi-view features, and cannot measure the importance of the respective features of different views. There is still considerable room for improvement in 3D object detection algorithms based on multi-view fusion[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In summary, how to effectively coordinate the multiple view representations of point clouds, make full use of the advantages of different views, and introduce more effective views to handle different scenarios is a problem that remains to be studied. Therefore, this paper proposes PVR-SSD, a single-stage anchor-free 3D object detection algorithm that fuses different view features of point clouds.</p>
      <p>The main contributions of this paper are as follows:</p>
      <p>·We propose a point cloud segmentation sampling strategy that introduces range-based view features to achieve selective sampling of point clouds and increase the proportion of foreground points.</p>
      <p>·We design a multi-view feature fusion module with a self-attention mechanism, which adaptively fuses point-based, voxel-based, and range-based view features and effectively improves the accuracy of 3D object detection.</p>
      <p>·We conduct extensive experiments on the KITTI dataset to evaluate the effectiveness and efficiency of the proposed method.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Our Framework</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Overview</title>
      <p>
        The whole framework of PVR-SSD is shown in Figure 1. It can be divided into four parts, with information interaction among them. The first three parts are the feature extraction networks for the different views, and the fourth part is the deep fusion of the three view features and the subsequent generation of 3D bounding boxes. The point-based feature extraction network is the backbone network of the algorithm. After the original point cloud is input into the network, farthest point sampling (FPS) is used for preliminary preprocessing of the point cloud to reduce the data scale, and the SA module[
        <xref ref-type="bibr" rid="ref3">3</xref>
] is used for feature extraction. The range-based view branch adopts a lightweight U-Net[4] to extract multi-level semantic features, which are then fed into the segmentation sampling layer of the point branch. The segmentation sampling layer aggregates point-based and range-based features and uses a point-range foreground point segmentation network for selective downsampling to obtain candidate center points. Then, according to a preset confidence threshold, a certain number of original points are retained and sent to the voxel-based branch of the third part. After voxelization, they are input into a 3D sparse convolutional network[5] to extract multi-scale voxel features, which are then compressed along the height dimension to obtain BEV features. The features extracted by the three branches and the candidate center points are sent to the PVR feature fusion module to obtain the candidate point set with the best feature combination. The candidate point set is input into the centroid prediction layer, and context clues around the bounding box are merged to obtain the predicted centroid point set. Finally, the predicted centroid points and aggregated features are input into the region proposal generation layer, which predicts the class and regresses a 3D bounding box.
      </p>
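      <p>To make the pipeline above concrete, the following is a minimal PyTorch-style sketch of how the four parts could be wired together. The sub-module names and the confidence threshold value are illustrative assumptions, not the authors' implementation.</p>
      <preformat>
# Hypothetical sketch of the PVR-SSD forward pass described above; all
# sub-module names and the 0.3 threshold are illustrative assumptions.
import torch.nn as nn


class PVRSSD(nn.Module):
    def __init__(self, point_branch, range_unet, voxel_branch,
                 seg_sampler, pvr_fusion, centroid_head, rpn_head):
        super().__init__()
        self.point_branch = point_branch    # FPS + SA feature extraction (part 1)
        self.range_unet = range_unet        # lightweight U-Net on the range image (part 2)
        self.voxel_branch = voxel_branch    # 3D sparse conv + BEV compression (part 3)
        self.seg_sampler = seg_sampler      # point-range segmentation sampling
        self.pvr_fusion = pvr_fusion        # PVR multi-view feature fusion (part 4)
        self.centroid_head = centroid_head  # centroid prediction layer
        self.rpn_head = rpn_head            # class prediction + 3D box regression

    def forward(self, points):
        # Point branch: FPS preprocessing and SA feature extraction
        pts, point_feats = self.point_branch(points)
        # Range branch: multi-level semantic features from the range image
        range_feats = self.range_unet(points)
        # Segmentation sampling: keep candidate centers with high foreground scores
        centers, fg_scores = self.seg_sampler(pts, point_feats, range_feats)
        keep = fg_scores > 0.3              # preset confidence threshold (assumed value)
        # Voxel branch: voxelize the retained points, extract voxel and BEV features
        voxel_feats, bev_feats = self.voxel_branch(points[keep])
        # Fuse the three views on the candidate centers, then predict centroids and boxes
        fused = self.pvr_fusion(centers, point_feats, voxel_feats, bev_feats, range_feats)
        centroids = self.centroid_head(centers, fused)
        return self.rpn_head(centroids, fused)
      </preformat>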
    </sec>
    <sec id="sec-4">
      <title>2.2. Point-Range Segmentation Sampling</title>
      <p>
        Common point cloud downsampling algorithms include random sampling, FPS, class-aware sampling, and centroid-aware sampling. The first two are based on traditional geometric reasoning and provide good scene coverage, but their computational cost is high and they treat all points equally, so they cannot reflect the importance of foreground points. Class-aware sampling focuses on sampling more foreground points: based on the point-based view, it introduces a separate training branch to learn the latent semantics of each point, enabling selective downsampling. Centroid-aware sampling applies centrality-based weighting on top of class-aware sampling, aiming to obtain points closer to the center of each instance[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Class-aware sampling and centroid-aware sampling achieve good results in foreground point preservation, but they exploit only point-based view features, and the trained sampling network tends to retain more points on large objects while ignoring small objects. Range-based view features are dense, suitable for processing large-scale outdoor scenes, and have low computational complexity. To explore the advantage of downsampling with range-based view features, we tested the point-based view, the range-based view, and the fused point-range view separately. The results on the KITTI validation set are shown in Table 1 and Table 2. To better compare the performance of the methods, data augmentation is applied to the KITTI validation set. As the tables show, the point-range method achieves an instance recall similar to the point-based method, but the proportions of foreground points for pedestrians and cyclists are higher, indicating that more small-object points are sampled while instance recall is maintained.</p>
      <p>Based on the above analysis, we propose to use a point-range segmentation network for downsampling. As shown in Figure 1, after the point-based and range-based features are extracted separately, feature propagation and fusion between the two views are realized by establishing a point-range view mapping. The mapping from the point-based view to the range-based view (P2R) adopts a spherical projection method based on scan expansion[6], which avoids the data occlusion problem and yields a smoother range-based view. The resulting range image can be described by Eq. (1).</p>
      <p>range_image[θ, ϕ] = depth    (1)
where θ represents the inclination angle and ϕ represents the azimuth angle. The mapping from the range-based view to the point-based view (R2P) adopts bilinear interpolation. To avoid making the sampling network too bloated, the point-range feature fusion uses concatenation followed by an MLP.</p>
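      <p>As an illustration of Eq. (1), the following sketch builds a range image by a naive spherical projection. Note that the paper adopts a scan-expansion-based projection[6] to avoid occlusion, so this simplified version, and its assumed image size and vertical field of view (roughly those of the KITTI HDL-64E sensor), differ from the actual implementation.</p>
      <preformat>
# Simplified point-to-range (P2R) projection implementing Eq. (1); the image
# resolution and the field-of-view limits are assumptions, not the paper's values.
import numpy as np


def points_to_range_image(points, h=64, w=2048, fov_up_deg=2.0, fov_down_deg=-24.8):
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)        # per-point range
    theta = np.arcsin(z / np.clip(depth, 1e-6, None))    # inclination angle
    phi = np.arctan2(y, x)                               # azimuth angle
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    # Normalize the angles to integer pixel coordinates
    row = (1.0 - (theta - fov_down) / (fov_up - fov_down)) * h
    col = 0.5 * (1.0 - phi / np.pi) * w
    row = np.clip(np.floor(row), 0, h - 1).astype(np.int64)
    col = np.clip(np.floor(col), 0, w - 1).astype(np.int64)
    range_image = np.zeros((h, w), dtype=np.float32)
    range_image[row, col] = depth                        # range_image[theta, phi] = depth
    return range_image, row, col                         # (row, col) can also serve R2P lookups
      </preformat>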
      <p>The segmentation sampling strategy uses the fused features above to predict the semantic category of each point through a two-layer scale-invariant MLP. The point classification loss adopts the cross-entropy loss. To obtain points closer to the instance center, we weight the loss function according to the principle of centrality. The weight function is defined as follows:</p>
      <p>Weight = ( min(f, b)/max(f, b) × min(l, r)/max(l, r) × min(u, d)/max(u, d) )^(1/3)    (2)
where (f, b, l, r, u, d) denote the distances from a foreground point to the front, rear, left, right, upper, and lower faces of its bounding box, respectively. Points within the bounding box are thus assigned different weights according to their spatial location; multiplying this weight with the foreground loss term assigns higher probability to points closer to the instance center. The weighted cross-entropy loss is defined as follows:</p>
      <p>L_seg = − Σ_{c=1}^{C} ( Weight · s log(ŝ) + (1 − s) log(1 − ŝ) )    (3)
where C represents the number of classes, s is the one-hot semantic label, and ŝ represents the predicted probability of the semantic class.</p>
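      <p>A minimal PyTorch sketch of the centrality weight in Eq. (2) and the weighted cross-entropy loss in Eq. (3) is given below; the tensor shapes and the mean reduction over points are assumptions.</p>
      <preformat>
# Illustrative implementation of Eqs. (2) and (3); shapes and reduction are assumed.
import torch


def centrality_weight(f, b, l, r, u, d):
    # (f, b, l, r, u, d): distances from each foreground point to the six box faces, shape (N,)
    w = (torch.minimum(f, b) / torch.maximum(f, b)) * \
        (torch.minimum(l, r) / torch.maximum(l, r)) * \
        (torch.minimum(u, d) / torch.maximum(u, d))
    return w.clamp(min=1e-6).pow(1.0 / 3.0)              # cube root, Eq. (2)


def segmentation_loss(pred, target, weight):
    # pred: (N, C) predicted class probabilities, target: (N, C) one-hot labels, weight: (N,)
    pred = pred.clamp(1e-6, 1.0 - 1e-6)
    per_class = weight.unsqueeze(-1) * target * torch.log(pred) \
                + (1.0 - target) * torch.log(1.0 - pred)
    return -per_class.sum(dim=-1).mean()                 # sum over C classes, Eq. (3)
      </preformat>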
    </sec>
    <sec id="sec-5">
      <title>2.3. Multi-View Fusion</title>
      <p>The voxel-based view is inferior to the original point-based view in detection accuracy and to the range-based view in detection efficiency, but it offers the best overall detection performance. The performance of voxel-based algorithms depends heavily on the voxel resolution. To exploit the advantages of the voxel-based view while suppressing its disadvantages, as shown in part 3 of Figure 1, we apply a voxel feature extraction network to the segmented foreground points and then compensate for the sparsity of the voxel features by introducing range-based features.</p>
      <p>The purpose of multi-view feature fusion is to optimize the combination of features from different views, aggregate key information, and eliminate redundant information. Currently, the feature aggregation methods commonly used in 3D object detection are feature addition and feature concatenation. Such equally weighted fusion cannot measure the importance of the respective features of different views and inevitably introduces many invalid features. Therefore, we design a multi-view feature fusion module for the different views of the same point cloud. The details of the feature fusion are shown in Figure 2, where Add represents feature addition and Multiplication represents element-wise multiplication.</p>
    </sec>
    <sec id="sec-6">
      <title>2.3.1. Attentional Feature Fusion</title>
      <p>BEV features and range-based features take the form of 2D pseudo-images, while point-based features and multi-scale voxel features are in 3D form. Therefore, the module first fuses them pairwise according to feature form. The multi-scale voxel features are distributed in 3D space, so they are gathered directly onto the sampling points through the SA module. The BEV features extracted by the deep network pay more attention to global information and are relatively sparse owing to the number of input points, whereas range-based features contain multi-level semantic information and are relatively dense. To better fuse these two features, which are inconsistent in semantics and scale, we introduce the attentional feature fusion module AFF[7]. The network structure is shown in Figure 3.</p>
      <p>It can be expressed by the following formula:</p>
      <p>Z = M(X ⊕ Y) ⊗ X + (1 − M(X ⊕ Y)) ⊗ Y    (4)
where M(·) represents the network inside the red box in Figure 3, and the dashed arrow represents the 1 − M(X ⊕ Y) operation. The network uses two branches at different scales to extract channel attention weights: one branch uses global average pooling to extract the attention of global features, and the other directly uses point-wise convolution to extract the channel attention of local features. To keep the module as lightweight as possible, point-wise convolution is chosen as the local channel context aggregator.</p>
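      <p>For clarity, a compact PyTorch sketch of the AFF block in Eq. (4) follows; the channel count and reduction ratio are assumed hyper-parameters rather than the paper's settings.</p>
      <preformat>
# Minimal attentional feature fusion (AFF) block implementing Eq. (4);
# channel count and reduction ratio are assumptions.
import torch.nn as nn


class AFF(nn.Module):
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        mid = channels // reduction
        # Local branch: point-wise (1x1) convolutions as the local channel context aggregator
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        # Global branch: global average pooling followed by point-wise convolutions
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, y):
        s = x + y                                                  # initial integration X (+) Y
        m = self.sigmoid(self.local_att(s) + self.global_att(s))  # M(X (+) Y)
        return m * x + (1.0 - m) * y                               # Eq. (4)
      </preformat>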
    </sec>
    <sec id="sec-7">
      <title>2.3.2. Adaptive Fusion</title>
      <p>Finally, the features converge into two branches with the point-based view as the carrier. To eliminate the interference of redundant information and measure the importance of different features, the module maps the features of the two branches separately to obtain an initial weight vector for each branch. The weights of the branches are then voted and added, and the superimposed result is converted into probability weights through softmax. The features of the different branches are weighted accordingly and then superimposed to obtain the final fused feature.</p>
      <p>It can be expressed as the following formula:</p>
      <p>F_f = Σ_{i}^{n} split[ softmax( Σ_{i}^{n} sigmoid(w_i ∗ F_i) ) ]_i ⋅ F_i    (5)
where F_i represents the feature of branch i, w_i represents the convolution kernel used to estimate the initial weight vector of each branch, ⋅ represents element-wise multiplication, and n represents the number of fusion branches, which corresponds to the number of weight vectors.</p>
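      <p>The following sketch shows one plausible reading of Eq. (5): each branch feature is mapped by a point-wise convolution to estimate its initial weight vector, the stacked weights are normalized across branches by softmax, and the branch features are weighted and summed. The use of a 1x1 convolution as the mapping w_i is an assumption.</p>
      <preformat>
# Illustrative adaptive fusion of n branch features, one plausible reading of Eq. (5);
# the point-wise convolution used as the mapping w_i is an assumption.
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    def __init__(self, channels, n_branches=2):
        super().__init__()
        # One mapping w_i per branch to estimate its initial weight vector
        self.mappings = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=1) for _ in range(n_branches)]
        )

    def forward(self, feats):
        # feats: list of n branch features carried on the point view, each of shape (B, C, N)
        votes = torch.stack(
            [torch.sigmoid(m(f)) for m, f in zip(self.mappings, feats)], dim=0
        )                                      # (n, B, C, N) initial weight vectors
        probs = torch.softmax(votes, dim=0)    # convert to probability weights across branches
        return sum(p * f for p, f in zip(probs, feats))   # weight and superimpose, Eq. (5)
      </preformat>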
    </sec>
    <sec id="sec-8">
      <title>3. Experiment</title>
    </sec>
    <sec id="sec-9">
      <title>3.1. Experimental Setup</title>
      <p>The KITTI dataset is one of the most popular datasets for 3D detection in autonomous driving. We divide the KITTI training set into a training set and a validation set, which contain 3712 and 3769 frames of data, respectively. The IoU thresholds for the different object classes are set to 0.7 for cars and 0.5 for pedestrians and cyclists. Our PVR-SSD is trained in an end-to-end manner, using the Adam optimizer with a one-cycle learning rate strategy. For the KITTI dataset, we train the entire network with a batch size of 16 and a learning rate of 0.01 for 80 epochs on 2 GTX 1080 Ti 10 GB GPUs, which takes around 6 hours.</p>
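      <p>For reference, the stated training setup could be summarized in a configuration like the sketch below; the key names are illustrative, and only values stated above are included.</p>
      <preformat>
# Hypothetical training configuration mirroring the setup described above;
# key names are illustrative, values are those stated in the text.
TRAIN_CFG = {
    "optimizer": "adam",
    "lr_schedule": "one_cycle",   # the one-cycle learning rate strategy mentioned above
    "learning_rate": 0.01,
    "batch_size": 16,
    "epochs": 80,
    "gpus": 2,                    # 2 x GTX 1080 Ti (10 GB each), about 6 hours in total
    "iou_threshold": {"Car": 0.7, "Pedestrian": 0.5, "Cyclist": 0.5},
    "split_frames": {"train": 3712, "val": 3769},
}
      </preformat>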
    </sec>
    <sec id="sec-10">
      <title>3.2. Experimental Results and Analysis</title>
      <p>In the KITTI benchmark, detection objects are divided into three difficulty levels: easy, moderate, and hard. We compare the performance and efficiency of PVR-SSD with existing algorithms grouped by view representation category, as recorded in Table 3.</p>
      <p>As shown in the table: 1) in terms of detection accuracy, PVR-SSD outperforms the existing methods in most categories; 2) in terms of detection efficiency, PVR-SSD is faster than other multi-view algorithms; 3) PVR-SSD brings a larger improvement for small-object detection.</p>
      <p>This is because, with the introduction of the range-based view branch, the sampling network can effectively preserve foreground points and increase the proportion of small-object points. In addition, multi-view feature fusion provides multi-dimensional, deep feature representations and eliminates the interference of redundant information, enabling comprehensive and accurate perception of objects of different scales and categories. Despite using multi-view features, our PVR-SSD still shows very competitive running latency thanks to its anchor-free structure and efficient implementation.</p>
    </sec>
    <sec id="sec-11">
      <title>3.3. Ablation Studies</title>
    </sec>
    <sec id="sec-12">
      <title>3.3.1. Downsampling Ablation Experiment</title>
      <p>To further verify the effectiveness of the proposed point-range segmentation downsampling method, we replace it with D-FPS+F-FPS and with centroid-aware sampling. As shown in Table 4, the proposed sampling method achieves the best detection performance in all three categories. Especially for small objects such as pedestrians and cyclists, detection performance is significantly improved after adding the range-based view features. This shows that the information provided by range-based view features can expand the receptive field, effectively preserve foreground information during downsampling, and increase the proportion of small-object points, thereby achieving better detection performance.</p>
    </sec>
    <sec id="sec-13">
      <title>3.3.2. Fusion Ablation Experiment</title>
      <p>To verify the effectiveness of the feature fusion approach in Figure 2, we investigate the effect of changing the fusion style of the multi-view features, as shown in Table 5. The proposed fusion brings considerable improvements over both alternatives. In particular, simple feature addition does not improve detection accuracy, indicating that redundant features are harmful to the network and confirming the necessity of the feature fusion strategy proposed in this paper.</p>
    </sec>
    <sec id="sec-14">
      <title>Conclusion</title>
      <p>This paper proposes a point-voxel-range view fusion detection algorithm, namely PVR-SSD. The algorithm introduces a segmentation sampling strategy and an adaptive multi-view fusion method, and extensive experiments on the open-source KITTI dataset show good detection results. Nevertheless, there is still room for further research. Beyond the point, voxel, and range views, point clouds admit other view representations, and real scenes can be captured not only with point clouds: other data sources such as images can also be regarded as samples of the real physical world. In the future, more in-depth research can be conducted from the perspective of other views or multi-source data.</p>
      <p>This work was supported by the National Natural Science Foundation of China (52072212),
Natural Science Foundation of Shandong Province (ZR2021MF103).
[4] Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image
segmentation. InInternational Conference on Medical image computing and computer-assisted
intervention 2015 Oct 5 (pp. 234-241). Springer, Cham.
[5] Yan Y, Mao Y, Li B. Second: Sparsely embedded convolutional detection. Sensors. 2018 Oct
6;18(10):3337.
[6] Fan L, Xiong X, Wang F, Wang N, Zhang Z. Rangedet: In defense of range view for lidar-based
3d object detection. InProceedings of the IEEE/CVF International Conference on Computer
Vision 2021 (pp. 2918-2927).
[7] Dai Y, Gieseke F, Oehmcke S, Wu Y, Barnard K. Attentional feature fusion. InProceedings of
the IEEE/CVF Winter Conference on Applications of Computer Vision 2021 (pp. 3560-3569).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2020</year>
          .
          <article-title>3DSSD: Point-based 3D single stage object detector</article-title>
          .
          <source>In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          (pp.
          <fpage>11040</fpage>
          -
          <lpage>11048</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Zhang</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            <given-names>Q</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            <given-names>G</given-names>
          </string-name>
          , et al.
          <article-title>Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds</article-title>
          .
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Qi</surname>
            <given-names>CR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yi</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guibas</surname>
            <given-names>LJ</given-names>
          </string-name>
          .
          <article-title>Pointnet++: Deep hierarchical feature learning on point sets in a metric space</article-title>
          .
          <source>Advances in neural information processing systems</source>
          .
          <year>2017</year>
          ;
          <volume>30</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Ronneberger, O., Fischer, P. and Brox, T., 2015.
          <article-title>U-Net: Convolutional networks for biomedical image segmentation</article-title>
          .
          <source>In International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          (pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          ). Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Yan, Y., Mao, Y. and Li, B., 2018.
          <article-title>SECOND: Sparsely embedded convolutional detection</article-title>
          .
          <source>Sensors</source>
          , 18(10),
          <fpage>3337</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Fan, L., Xiong, X., Wang, F., Wang, N. and Zhang, Z., 2021.
          <article-title>RangeDet: In defense of range view for LiDAR-based 3D object detection</article-title>
          .
          <source>In Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (pp.
          <fpage>2918</fpage>
          -
          <lpage>2927</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y. and Barnard, K., 2021.
          <article-title>Attentional feature fusion</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          (pp.
          <fpage>3560</fpage>
          -
          <lpage>3569</lpage>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>