Visible Region Enhancement Network for Occluded Pedestrian Detection

Fangwei Sun1, Caidong Yang1, Chengyang Li1,2, Heng Zhou1,3, Ziwei Du1, Yongqiang Xie1,*, and Zhongbo Li1,*

1 Institution of Systems Engineering, Academy of Military Sciences, Beijing, 100141, China
2 School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
3 School of Electronic Engineering, Xidian University, Xi'an, Shaanxi, 710071, China

ICBASE2022@3rd International Conference on Big Data & Artificial Intelligence & Software Engineering, October 21-23, 2022, Guangzhou, China
*Corresponding author's e-mail: xyq_ams@outlook.com (Yongqiang Xie); lzb_ams@outlook.com (Zhongbo Li)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
Occlusion is a major challenge in pedestrian detection. In this paper, we propose a new network module named Visible Region Enhancement Network (VREN), which consists of a spatial attention network and a channel attention network. Given feature maps, our module infers attention maps along two dimensions, spatial and channel. In particular, unlike previous attention mechanisms, the two kinds of attention in VREN are acquired in an interrelated rather than independent way. Based on the attention maps, VREN enhances effective features in each dimension while reducing interference noise. Because VREN works in the feature extraction stage, it can be integrated into any Convolutional Neural Network (CNN) architecture and is end-to-end trainable along with base CNNs. We validate VREN through extensive experiments on the CrowdHuman dataset. Our experiments show that VREN effectively improves detection performance over the Faster R-CNN baseline.

Keywords
Pedestrian Detection; Occlusion Detection; Spatial Attention; Channel Attention

1. Introduction

Pedestrian detection, as a branch of object detection, is an important task in computer vision. It is widely used in many fields, such as autonomous driving, object tracking, and video surveillance. In recent years, with the development of deep learning, especially CNNs, the performance of pedestrian detection has improved rapidly. According to how proposals are generated, CNN-based frameworks can be roughly divided into two types: one-stage detectors[1][2][3][4], which do not use an independent network to generate proposals, and two-stage detectors[5][6][7][8][9][10], which generate proposals with an independent network. In general, the one-stage detector has a faster detection speed but lower accuracy, while the two-stage detector has higher accuracy but a slower detection speed. These advanced detectors have greatly promoted pedestrian detection research and achieved major breakthroughs.

However, in the real world, pedestrians commonly occlude each other or are occluded by other objects, so that the body is only partially visible. The difficulties of occluded object detection are as follows: (a) Because of the influence of datasets and the complexity of occlusion, Fawzi and Frossard[11] showed that CNN-based detectors are not robust to occlusion. (b) Occlusion interferes with feature extraction, and two objects occluding each other are likely to have very similar characteristics, so the detector cannot distinguish them accurately. (c) Under occlusion, the prediction boxes of different objects may overlap heavily, so the non-maximum suppression (NMS) algorithm may regard them as predictions of a single object, and this false suppression leads to missed detections. From the above analysis, occlusion remains a big challenge in detecting pedestrians.
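Difficulty (c) can be illustrated with the standard greedy NMS procedure. The following snippet (our illustration with made-up coordinates, not part of the proposed method) uses torchvision's NMS to show how the boxes of two heavily overlapped pedestrians with similar scores collapse into a single detection once their IoU exceeds the suppression threshold.

```python
import torch
from torchvision.ops import nms

# Two pedestrians standing close together: their predicted boxes overlap heavily (IoU ~ 0.58).
boxes = torch.tensor([[100., 50., 180., 300.],    # pedestrian A
                      [120., 55., 200., 305.]])   # pedestrian B, largely occluded by A
scores = torch.tensor([0.92, 0.88])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0]) -- the box for pedestrian B is falsely suppressed, causing a missed detection
```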
To handle occlusion, an effective solution is to use an attention mechanism. Attention mechanisms not only tell us where to focus but also improve the representation of target feature information. In this paper, we propose a new network module named "Visible Region Enhancement Network". Since CNNs extract features by blending spatial and channel information together, our module emphasizes meaningful features along both the spatial and channel dimensions. In addition, the acquisition of the two kinds of attention is closely related. As a result, our module efficiently helps feature information transfer within the network by learning which information to enhance or suppress. Fig. 1(a) shows the results predicted by the Faster R-CNN[7] baseline: the detector fails to predict instances heavily overlapped with others. Fig. 1(b) shows the prediction results of our method. In particular, our method also improves localization accuracy.

Figure 1. Human detection in crowds. (a) Results predicted by the Faster R-CNN[7] baseline; the red box indicates a missed detection. (b) Results of our method; all instances are correctly predicted.

On the CrowdHuman dataset[12], plugging our module into the baseline network improves accuracy, demonstrating the efficacy of VREN. Since our module is designed to work in the feature extraction stage, in theory both one-stage and two-stage models can adopt VREN in most cases.

Contribution. Our main contribution is three-fold:
1. We propose an effective attention module (VREN) that can be integrated into any CNN architecture.
2. Compared with existing attention mechanisms, VREN combines spatial attention and channel attention and enhances the correlation between the two kinds of attention.
3. We evaluate the effectiveness of VREN through a large number of ablation experiments.

2. Related Work

As mentioned in the introduction, occlusion interferes with feature extraction, so the feature map cannot effectively guide the classifier to make a correct judgment on the predicted box. Therefore, for detection in occlusion scenes, feature information should be distinguished. To this end, the attention mechanism can re-weight features along the spatial and channel dimensions.

Occluded Pedestrian Detection. Several studies have been proposed to handle occlusion in pedestrian detection. A common strategy is the part-based approach, in which a set of part detectors is learned, each designed to handle a specific occlusion pattern. Some of these part-based methods, such as [13][14], divide pedestrians into different parts and then train several detectors to detect each part. As a part of the whole, the component detector can effectively use the structural information of the visible part when dealing with the occlusion problem. However, training each component detector separately increases the computing resources consumed linearly with the number of defined component detectors.
In addition, some part-based methods, such as [15][16], integrate structural information of objects into a network and exploit visible-body information to learn specific occlusion modes. Different from these methods, we propose a module that uses the attention mechanism to adjust the weight of the input feature map and uses the effective information to detect pedestrians.

Attention Mechanism. The attention method consists of spatial attention and channel attention; specifically, spatial attention helps us focus on where features are meaningful, and channel attention helps us focus on what features are meaningful. Squeeze-and-Excitation Networks (SENet)[17] demonstrated the effectiveness of attention mechanisms, which are now widely used in many computer vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. SENet[17] improves detection performance at a very low cost with MaxPool and AveragePool operations, but it ignores the importance of spatial information. Therefore, the Bottleneck Attention Module (BAM)[18], the Dual Attention Network (DANet)[19], and the Convolutional Block Attention Module (CBAM)[20] were proposed to obtain attention maps by combining spatial and channel attention. Motivated by CBAM, a new second-order pooling method based on Global Second-order Pooling (GSoP) was proposed in [21] to extract richer feature information. Subsequently, [22] introduced a dynamic selection attention mechanism named Selective Kernel Networks (SKNet), which allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. ResNeSt[23] proposes a similar Split-Attention block that applies channel-wise attention across different network branches to capture cross-feature interactions and learn diverse representations. To reduce model complexity and improve detection efficiency, GCNet[24] introduces a simple spatial attention module from which a long-range channel dependency is developed. ECANet[25] employs a one-dimensional convolution layer to reduce the redundancy of fully connected layers. FcaNet[26] proposes a novel multi-spectral channel attention that realizes the pre-processing of the channel attention mechanism in the frequency domain. On the basis of SENet[17], EPSANet[27] groups the feature map to obtain a split attention block. To effectively combine the two types of attention mechanisms and reduce computational overhead, SA-Net[28] first groups channel dimensions into multiple sub-features before processing them in parallel.

For occluded object detection, we propose a visible region enhancement network that combines spatial and channel attention; unlike the above-mentioned methods, the acquisition of the two kinds of attention is interrelated.

3. Method

In this section, we introduce VREN, which consists of spatial and channel attention. Given a feature map $F \in \mathbb{R}^{C \times H \times W}$ as input, VREN sequentially infers a 2D spatial attention map $A_s \in \mathbb{R}^{1 \times H \times W}$ and a 1D channel attention map $A_c \in \mathbb{R}^{C \times 1 \times 1}$; in particular, the acquisition of $A_c$ is affected by $A_s$. The overall framework is shown in Fig. 2. The overall process of VREN can be summarized as:

$F' = F \times A_s$
$A_c = F \otimes A_s$
$F'' = F' \times A_c$

$F''$ is the final refined feature map produced as output. The following describes the details of VREN and its attention modules.

Figure 2. The architecture of VREN.
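To make the overall process concrete, the following sketch restates the three equations above as a forward pass in PyTorch (the framework used in our experiments). The sub-modules `spatial_attention` and `channel_attention` are placeholders for the networks described in the next paragraphs; everything beyond the three equations is an assumption of this sketch rather than part of the formulation.

```python
import torch
import torch.nn as nn

class VREN(nn.Module):
    """Sketch of the overall VREN process: F' = F x A_s, A_c = FC(F (x) A_s), F'' = F' x A_c."""

    def __init__(self, spatial_attention: nn.Module, channel_attention: nn.Module):
        super().__init__()
        self.spatial_attention = spatial_attention   # expected to produce A_s with shape (N, 1, H, W)
        self.channel_attention = channel_attention   # expected to produce A_c with shape (N, C, 1, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        a_s = self.spatial_attention(f)          # 'where': visible regions at the spatial level
        f_prime = f * a_s                        # F' = F x A_s (broadcast over channels)
        a_c = self.channel_attention(f, a_s)     # 'what': channel attention guided by A_s
        return f_prime * a_c                     # F'' = F' x A_c (broadcast over space)
```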
Visible Region Enhancement Network. As mentioned earlier, we design VREN to take into account the incompleteness of object information under occlusion, since the missing information reduces the overall confidence of the object. Therefore, VREN first obtains spatial attention to determine where the object is visible at the spatial level, and then obtains channel attention by convolving the feature map with the spatial attention to determine what features are visible. Finally, we obtain the refined feature map after the feature map passes sequentially through the spatial attention and the channel attention. The refined feature map enhances the information of the object's visible region and suppresses irrelevant information. The overall framework of VREN is shown in Figure 2.

Spatial Attention. Spatial attention focuses on 'where' the features of a given input image are visible. Our method produces a spatial attention mask through three consecutive convolution operations. For aggregating attention feature information, Woo et al.[20] use both max-pooling and average-pooling operations; this is very simple and has been shown to be effective in highlighting informative regions[29]. To improve the learning ability of spatial attention and the nonlinear expression ability of VREN, we instead use three convolution operations to obtain the spatial attention mask. Specifically, the first two convolutional layers progressively reduce the channel dimension to $\mathbb{R}^{1 \times H \times W}$ as a preliminary mask, and the last convolutional layer adjusts the mask with very few parameters to form the final spatial attention mask. To reduce the complexity of the model, we set the size of the first convolution kernel to 1 × 1, and the second and the third to 3 × 3. In short, the spatial attention is computed as:

$A_s(F) = f^{3 \times 3}(f^{3 \times 3}(f^{1 \times 1}(F)))$

where $f^{1 \times 1}$ denotes a convolution operation with a 1 × 1 filter and $f^{3 \times 3}$ denotes a convolution operation with a 3 × 3 filter. Figure 3 depicts the computation process of spatial attention.

Figure 3. Computation process of spatial attention.

Channel Attention. Different channels represent different filters; channel attention focuses on 'what' features of a given input image are visible. Our method produces a channel attention map by convolving the feature map $F$ with the spatial attention mask. For aggregating spatial feature information, common operations use max-pooling and average-pooling for dimensionality reduction; Hu et al.[17] use them to design a simple attention module that effectively obtains channel information. However, we consider that a channel filter should play its specific role only where the object's characteristic information is visible. In other words, 'where' should guide the generation of 'what'. We first aggregate the spatial information of the feature map by convolving the spatial attention mask $A_s(F) \in \mathbb{R}^{1 \times H \times W}$ with the feature map $F \in \mathbb{R}^{C \times H \times W}$, generating a spatial context descriptor $A_{sc} \in \mathbb{R}^{C \times 1 \times 1}$. The descriptor is then forwarded to a network composed of fully connected (FC) layers with three hidden layers to produce the channel attention map $A_c \in \mathbb{R}^{C \times 1 \times 1}$. In short, the channel attention is computed as:

$A_c(F) = \sigma(\mathrm{FC}(\mathrm{FC}(\mathrm{FC}(\mathrm{FC}(F \otimes A_s(F)))))) = \sigma(W_3 W_2 W_1 W_0 A_{sc})$

where $\sigma$ denotes the sigmoid function and $W_0$, $W_1$, $W_2$, and $W_3$ are the weights of the four FC layers. Figure 4 depicts the computation process of channel attention.

Figure 4. Computation process of channel attention.
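For reference, one possible PyTorch realization of the two attention branches is sketched below. Only the kernel sizes (1 × 1, 3 × 3, 3 × 3), the single-channel spatial mask, the four FC layers (three hidden layers), and the final sigmoid are specified above; the reduction ratio, the ReLU nonlinearities between the FC layers, and the implementation of the $F \otimes A_s$ descriptor as spatially weighted pooling are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """A_s(F) = f3x3(f3x3(f1x1(F))): three convolutions producing a one-channel mask."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 1)               # intermediate width (assumed, not given above)
        self.convs = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),      # 1x1: reduce the channel dimension
            nn.Conv2d(mid, 1, kernel_size=3, padding=1),  # 3x3: collapse to a one-channel preliminary mask
            nn.Conv2d(1, 1, kernel_size=3, padding=1),    # 3x3: adjust the mask with very few parameters
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.convs(f)                              # (N, 1, H, W); no normalization is specified above

class ChannelAttention(nn.Module):
    """A_c = sigmoid(FC(FC(FC(FC(F (x) A_s))))): channel attention guided by the spatial mask."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 1)               # hidden width of the FC layers (assumed)
        self.fc = nn.Sequential(
            nn.Linear(channels, mid), nn.ReLU(inplace=True),
            nn.Linear(mid, mid), nn.ReLU(inplace=True),
            nn.Linear(mid, mid), nn.ReLU(inplace=True),
            nn.Linear(mid, channels),                     # four FC layers in total, i.e. three hidden layers
        )

    def forward(self, f: torch.Tensor, a_s: torch.Tensor) -> torch.Tensor:
        # F (x) A_s: spatially weighted pooling, one value per channel (our reading of the descriptor)
        descriptor = (f * a_s).sum(dim=(2, 3))            # (N, C)
        a_c = torch.sigmoid(self.fc(descriptor))          # (N, C)
        return a_c.view(f.size(0), -1, 1, 1)              # (N, C, 1, 1) for broadcasting over H and W
```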
4. Experiments

In this section, we evaluate VREN on a standard benchmark for detection in crowds, the CrowdHuman dataset[12]. To enable fair comparisons, we first reproduce Faster R-CNN in the PyTorch framework and set it as our baseline. We then perform extensive experiments to thoroughly evaluate the effectiveness of our module.

4.1. Datasets and Evaluation Metrics

Datasets. The quality of the dataset greatly affects the performance and generalization ability of the detector, so we choose the CrowdHuman dataset[12] as our test data to simulate occlusion situations. CrowdHuman contains 15000 training images, 4370 validation images, and 5000 test images. In particular, each image contains 22.64 pedestrians on average, and 2.4 pedestrians per image have an occlusion rate exceeding 0.5[30]. We use the full-body benchmark of [12] to evaluate our model, and the results are reported on the validation set.

Evaluation Metrics. To better reflect the advantages of the proposed method, we use two metrics for comparison: AP and MR⁻²[31].
• AP, short for average precision, is the most popular metric for object detection. AP reflects both the precision and recall of the detection results. The larger the AP, the better the performance of the detector.
• MR⁻²[31], short for the log-average Miss Rate on False Positives Per Image (FPPI) over the range [10⁻², 10⁰], is a common metric in pedestrian detection. MR⁻² reflects the false positives of the detection results. The smaller the MR⁻², the better the performance of the detector.

4.2. Implementation Details

We use the open-source implementation of Faster R-CNN[7] for our experiments. The models are trained on 2 NVIDIA Tesla V100 GPUs with a batch size of 8 per GPU for 90 epochs. We use the SGD optimizer with a momentum of 0.9 and weight decay. The learning rate is initially set to 0.01 and is decreased by a factor of 10 at the 72nd and 81st epochs, respectively.
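For reproducibility, the schedule above roughly corresponds to the following PyTorch setup. The weight-decay value and the placeholder model and training loop are assumptions of this sketch; only the optimizer type, momentum, initial learning rate, decay epochs, and epoch count are taken from the description above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3)  # placeholder; substitute the Faster R-CNN detector with VREN plugged in

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            weight_decay=1e-4)  # weight-decay value assumed; only its use is stated above
# Drop the learning rate by a factor of 10 at epochs 72 and 81 of the 90-epoch schedule.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[72, 81], gamma=0.1)

for epoch in range(90):
    # train_one_epoch(model, optimizer, train_loader)  # hypothetical per-epoch training step
    scheduler.step()
```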
4.3. Ablation Study

We perform ablation experiments on the proposed module to evaluate the effectiveness of its parts, namely spatial attention and channel attention. The baseline is Faster R-CNN using ResNet-50 for feature extraction. The best performance is achieved only when both spatial attention and channel attention act in the visible region enhancement network. Table 1 shows the specific results of our experiments. Our method consistently improves detection performance, by 3.5% in AP and 7.2% in MR⁻² compared to the baseline network Faster R-CNN[7]. To improve test efficiency, we only add each attention to the last feature map.

Table 1. Ablation experiments on CrowdHuman.
Spatial attention | Channel attention | AP(%) | MR⁻²(%)
                  |                   | 84.61 | 47.34
√                 |                   | 86.84 | 43.10
                  | √                 | 85.29 | 46.36
√                 | √                 | 87.10 | 42.61

4.4. Comparisons with Other Attention Mechanisms

To our knowledge, very few previous works on attention mechanisms for crowded detection report their results. For comparison, we reproduce several attention algorithms. All methods use Faster R-CNN[7] as the base detector with the same implementation details. Table 2 lists the comparison results. Our method achieves the best results. The reason is that VREN guides the generation of channel attention through spatial attention: spatial attention first filters out interference information at the spatial level, so that channel attention, learned through the FC layers, can focus more accurately on the selection of feature patterns.

Table 2. Comparison experiments on CrowdHuman.
Method       | AP(%) | MR⁻²(%)
baseline     | 84.61 | 47.34
+SENet       | 86.99 | 44.13
+BAM         | 87.39 | 41.60
+CBAM        | 87.49 | 41.28
+SKNet       | 87.05 | 43.76
+EPSANet     | 86.38 | 44.58
+SANet       | 87.48 | 42.55
+VREN (ours) | 88.12 | 40.12

To better show the effect of our method, we visually compare the results of three algorithms: the baseline, CBAM[20], and our method. CBAM[20] is chosen because it achieves the best result apart from our method. Figure 5 shows the visual comparison.

Figure 5. Visual comparison. The first row shows the results of the baseline, the second row the results of CBAM[20], and the last row the results of our method. Red boxes mark missed detections.

5. Conclusion

In this paper, we have proposed the Visible Region Enhancement Network (VREN), a novel method to improve the representation power for occluded pedestrian detection. The method makes use of the concept of attention, designing new spatial attention and channel attention. Our approach is not only effective but also easy to combine with most existing state-of-the-art detection frameworks.

6. References

[1] Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector[C]//LNCS 9905: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 8-16, 2016. Cham: Springer, 2016: 21-37.
[2] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 779-788.
[3] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 2999-3007.
[4] Fu C Y, Liu W, Ranga A, et al. DSSD: Deconvolutional single shot detector[J]. arXiv preprint arXiv:1701.06659, 2017.
[5] He K M, Zhang X Y, Ren S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904-1916.
[6] Girshick R. Fast R-CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 13-16, 2015. Washington: IEEE Computer Society, 2015: 1440-1448.
[7] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems 28, Dec 7-12, 2015. Red Hook: Curran Associates, 2015: 91-99.
[8] Dai J, Li Y, He K, et al. R-FCN: Object detection via region-based fully convolutional networks[C]//Advances in Neural Information Processing Systems 29, Barcelona, Dec 5-10, 2016. Red Hook: Curran Associates, 2016: 379-387.
[9] He K M, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 2980-2988.
[10] Cai Z, Vasconcelos N. Cascade R-CNN: High quality object detection and instance segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(5): 1483-1498.
[11] Fawzi A, Frossard P. Measuring the effect of nuisance variables on classifiers[C]//Proceedings of the British Machine Vision Conference (BMVC). 2016.
[12] Shao S, Zhao Z, Li B, et al. CrowdHuman: A benchmark for detecting human in a crowd[J]. arXiv preprint arXiv:1805.00123, 2018.
[13] Tian Y, Luo P, Wang X, et al. Deep learning strong parts for pedestrian detection[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 1904-1912.
[14] Zhou C, Yuan J. Multi-label learning of part detectors for occluded pedestrian detection[J]. Pattern Recognition, 2019, 86: 99-111.
[15] Zhang S, Wen L, Bian X, et al. Occlusion-aware R-CNN: Detecting pedestrians in a crowd[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 637-653.
[16] Xie J, Pang Y, Cholakkal H, et al. PSC-Net: Learning part spatial co-occurrence for occluded pedestrian detection[J]. Science China Information Sciences, 2021, 64(2): 1-13.
[17] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7132-7141.
[18] Park J, Woo S, Lee J Y, et al. BAM: Bottleneck attention module[J]. arXiv preprint arXiv:1807.06514, 2018.
[19] Fu J, Liu J, Tian H, et al. Dual attention network for scene segmentation[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 3146-3154.
[20] Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision. 2018: 3-19.
[21] Gao Z, Xie J, Wang Q, et al. Global second-order pooling convolutional networks[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 3024-3033.
[22] Li X, Wang W, Hu X, et al. Selective kernel networks[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 510-519.
[23] Zhang H, Wu C, Zhang Z, et al. ResNeSt: Split-attention networks[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 2736-2746.
[24] Cao Y, Xu J, Lin S, et al. GCNet: Non-local networks meet squeeze-excitation networks and beyond[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops. 2019.
[25] Wang Q, Wu B, Zhu P, et al. ECA-Net: Efficient channel attention for deep convolutional neural networks[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA. 2020.
[26] Qin Z, Zhang P, Wu F, et al. FcaNet: Frequency channel attention networks[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 783-792.
[27] Zhang H, Zu K, Lu J, et al. EPSANet: An efficient pyramid split attention block on convolutional neural network[J]. arXiv preprint arXiv:2105.14447, 2021.
[28] Zhang Q L, Yang Y B. SA-Net: Shuffle attention for deep convolutional neural networks[C]//ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 2235-2239.
[29] Zagoruyko S, Komodakis N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer[J]. arXiv preprint arXiv:1612.03928, 2016.
[30] Chu X, Zheng A, Zhang X, et al. Detection in crowded scenes: One proposal, multiple predictions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 12214-12223.
[31] Dollár P, Wojek C, Schiele B, et al. Pedestrian detection: An evaluation of the state of the art[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 34(4): 743-761.