<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visible Region Enhancement Network for Occluded Pedestrian Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fangwei Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caidong Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chengyang Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heng Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ziwei Du</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yongqiang Xie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhongbo Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institution of Systems Engineering, Academy of Military Sciences</institution>
          ,
          <addr-line>Beijing, 100141</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Electronic Engineering, Xidian University</institution>
          ,
          <addr-line>Xi'an, Shaanxi, 710071</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Electronics Engineering and Computer Science, Peking University</institution>
          ,
          <addr-line>Beijing 100871</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>59</fpage>
      <lpage>66</lpage>
      <abstract>
        <p>Occlusion is a major challenge in pedestrian detection. In this paper, we propose a new network module named Visible Region Enhancement Network (VREN), which consists of a spatial attention network and a channel attention network. Given feature maps, our module infers attention maps along two dimensions, spatial and channel. In particular, unlike previous attention mechanisms, the two kinds of attention in VREN are acquired interdependently rather than independently. Based on the attention maps, VREN enhances the effective features along different dimensions while reducing interfering noise. Because VREN works in the feature extraction stage, it can be integrated into any Convolutional Neural Network (CNN) architecture and is end-to-end trainable along with the base CNN. We validate VREN through extensive experiments on the CrowdHuman dataset. Our experiments show that VREN effectively improves detection performance compared to the Faster R-CNN baseline.</p>
      </abstract>
      <kwd-group>
        <kwd>Pedestrian Detection</kwd>
        <kwd>Occlusion Detection</kwd>
        <kwd>Spatial Attention</kwd>
        <kwd>Channel Attention</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Pedestrian detection, a branch of object detection, is an important task in computer vision. It is
widely used in fields such as autonomous driving, object tracking, and video surveillance. In recent
years, with the development of deep learning, especially CNNs, the performance of pedestrian detection
has improved rapidly. According to how proposals are generated, CNN-based detectors can be roughly
divided into two types: one-stage detectors [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which do not use an independent network to generate proposals, and two-stage detectors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][
        <xref ref-type="bibr" rid="ref9">9</xref>
        ][
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which use an independent
network to generate proposals. By comparison, one-stage detectors are faster but less accurate,
while two-stage detectors are more accurate but slower. These detectors have greatly advanced
research on pedestrian detection.
      </p>
      <p>
        However, in the real world, pedestrians very commonly occlude each other or are occluded
by other objects, so that the body is not fully visible. The difficulties of occluded object detection
are as follows: (a) Fawzi and Frossard [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] showed that CNN-based detectors are not robust to occlusion, owing to the influence of the
datasets and the complexity of occlusion. (b) Occlusion interferes with feature extraction, and two
objects that occlude each other are likely to have very similar features, so the detector cannot
distinguish them accurately. (c) During occlusion, the prediction boxes of different objects may overlap
heavily, so the non-maximum suppression (NMS) algorithm may regard them as predictions of a single
object, and this false suppression leads to missed detections. From the above analysis, occlusion
remains a major challenge in pedestrian detection.
      </p>
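      <p>The false suppression described in (c) can be illustrated with a minimal sketch of greedy NMS. The box coordinates, scores, and the IoU threshold of 0.5 are hypothetical values chosen for illustration and do not come from our experiments.</p>
      <preformat>
```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Drop a box if it overlaps an already kept box too strongly.
        if not any(iou(boxes[i], boxes[k]) > iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical scene: boxes 0 and 1 are two different pedestrians standing
# close together, box 2 is an isolated pedestrian.
boxes = [(10, 10, 60, 160), (25, 10, 75, 160), (200, 20, 250, 170)]
scores = [0.95, 0.90, 0.80]
kept = greedy_nms(boxes, scores)
print(kept)
```
      </preformat>
      <p>In this toy scene the second pedestrian overlaps the first with an IoU of about 0.54, so it is discarded even though it is a distinct person. Raising the threshold keeps it but also admits more duplicate boxes.</p>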
      <p>
        To handle occlusion, an effective solution is to use an attention mechanism. Attention mechanisms not
only tell us where to focus but also improve the representation of target features. In this
paper, we propose a new network module named “Visible Region Enhancement Network”. Since CNNs
extract features by blending cross-spatial and channel information together, our module
emphasizes meaningful features along the spatial and channel dimensions. In addition, the acquisition
of the two kinds of attention is closely related. As a result, our module helps feature information
flow through the network by learning which information to enhance or suppress. Fig. 1 (a) shows the
results predicted by the Faster R-CNN [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] baseline: the detector fails to predict instances that heavily overlap
with others. Fig. 1 (b) shows the predictions of our method. In particular, our method also
improves localization accuracy.
      </p>
      <p>Figure 1: Detection results of (a) the baseline and (b) our method.</p>
      <p>
        On the CrowdHuman dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], plugging our module into the baseline network improves accuracy,
which demonstrates the efficacy of VREN. Since our module is designed to work in the
feature extraction stage, in theory it can be added to both one-stage and two-stage models in
most cases.
      </p>
      <p>Contribution. Our main contribution is three-fold:
1. We propose an effective attention module (VREN), which can be integrated with any CNN
architecture.
2. Compared with existing attention mechanisms, VREN combines spatial attention and channel
attention and enhances the correlation between the two kinds of attention.
3. We evaluate the effectiveness of VREN through a large number of ablation experiments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>As mentioned in the introduction, occlusion interferes with feature extraction, so the feature
map cannot effectively guide the classifier to make a correct judgment on the predicted box.
Therefore, to detect objects in occluded scenes, informative features must be distinguished from
interfering ones. To this end, the attention mechanism re-weights the features along the spatial
dimension and the channel dimension.</p>
      <p>
        Occluded Pedestrian Detection. Several methods have been proposed to handle occlusion in
pedestrian detection. A common strategy is a part-based approach, in which a set of part detectors is
learned, each designed to handle a specific occlusion pattern. Some of these part-based methods, such
as [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ][
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], divide pedestrians into different parts and then train several detectors to detect each part.
      </p>
      <p>
        As a part of the whole, a component detector can effectively use the structural information of the
visible part when dealing with occlusion. However, training each component detector
separately makes the computing cost grow linearly with the number of component
detectors. In addition, some part-based methods, such as [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ][
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], integrate structural information of objects
into a network and exploit visible body information to learn specific occlusion modes. Different from
these methods, we propose a module that uses the attention mechanism to adjust the weight of the input
feature map and uses effective information to detect pedestrians.
      </p>
      <p>
        Attention Mechanism. Attention methods comprise spatial attention and channel attention:
spatial attention helps us focus on where features are meaningful, and channel attention
helps us focus on what features are meaningful. Since Squeeze-and-Excitation Networks (SENet) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
demonstrated the effectiveness of attention mechanisms, they have been widely used in many
computer vision tasks such as image classification, object detection, instance segmentation, and semantic
segmentation. SENet [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] improves performance at a very low cost using global average pooling, but it ignores
the importance of spatial information. Therefore, the Bottleneck
Attention Module (BAM) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], Double Attention Networks(DANet)[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and Convolutional Block
Attention Module(CBAM)[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] are proposed to obtain the attention map by combining the spatial and channel
attention. Motivated by CBAM, to extract richer feature information, a new second-order pooling
method was proposed in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] based on Global Second-order Pooling(GSoP). Subsequently, [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
introduces a dynamic selection attention mechanism named Selective Kernel Networks(SKNet), which
allows each neuron to adaptively adjust its receptive field size based on multiple scales of input
information. The ResNeSt[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] proposes a similar Split-Attention block that applies channel-wise attention
to different network branches to leverage their success in capturing cross-feature interactions and
learning diverse representations. To reduce model complexity and improve detection efficiency, GCNet[
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]
introduces a simple spatial attention module and thus a long-range channel dependency is developed.
The ECANet[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] employs the one-dimensional convolution layer to reduce the redundancy of fully
connected layers. The FcaNet[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] proposes a novel multi-spectral channel attention that realizes the
pre-processing of channel attention mechanism in the frequency domain. On the basis of SENet[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
EPSANet[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] groups the feature map to obtain a split attention block. To effectively combine two types
of attention mechanisms and reduce the computational overhead, SA-Net[
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] first groups channel
dimensions into multiple sub-features before processing them in parallel.
      </p>
      <p>For occluded object detection, we propose a visible region enhancement network that combines
spatial and channel attention. Unlike the above-mentioned methods, it acquires the two kinds of
attention interdependently.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>In this section, we introduce the VREN, which consists of spatial and channel attention. Given a
feature map <inline-formula><tex-math>F \in \mathbb{R}^{C \times H \times W}</tex-math></inline-formula> as input, VREN sequentially infers a 2D spatial attention
<inline-formula><tex-math>M_s \in \mathbb{R}^{1 \times H \times W}</tex-math></inline-formula> and a 1D channel attention
<inline-formula><tex-math>M_c \in \mathbb{R}^{C \times 1 \times 1}</tex-math></inline-formula>; in particular, the acquisition of
<inline-formula><tex-math>M_c</tex-math></inline-formula> is affected by <inline-formula><tex-math>M_s</tex-math></inline-formula>. The overall
framework is shown in Fig. 2. The overall process of VREN can be summarized as:
        <disp-formula><tex-math>F' = M_s \times F, \qquad F'' = M_c \times F'</tex-math></disp-formula>
where <inline-formula><tex-math>\times</tex-math></inline-formula> denotes element-wise multiplication and
<inline-formula><tex-math>F''</tex-math></inline-formula> is the final refined feature map as output. The following describes the details of VREN and the
attention module.</p>
      <p>Visible Region Enhancement Network. As mentioned earlier, we design VREN to account for the
incompleteness of object information under occlusion, since the missing information
reduces the overall confidence of the object. Therefore, VREN first obtains spatial attention to
determine where the object is visible at the spatial level, and then obtains channel attention by convolving
the feature map with the spatial attention to determine what features are visible. Finally, we obtain the
refined feature map by passing the feature map sequentially through spatial attention and channel
attention. In the refined feature map, the information of the object’s visible region is enhanced and
irrelevant information is suppressed.</p>
      <p>
        Spatial Attention. Spatial attention focuses on ‘where’ the features of a given input image are visible.
Our method produces a spatial attention mask through three consecutive convolution operations. To
aggregate attention feature information, Woo et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] use both max-pooling and average-pooling
operations, which are very simple and have been shown to be effective in highlighting informative
regions [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. To improve the learning ability of spatial attention and the nonlinear expression
ability of VREN, we use three convolution operations to obtain the spatial attention mask. Specifically, the
first two convolutional layers successively reduce the channel dimension to obtain a preliminary
mask in <inline-formula><tex-math>\mathbb{R}^{1 \times H \times W}</tex-math></inline-formula>, and the last convolutional layer adjusts the mask with very few parameters to give the final spatial
attention mask. To reduce the complexity of the model, we set the kernel size of the first convolution to
1 × 1, and that of the second and the third to 3 × 3. In short, spatial attention is computed as:
        <disp-formula><tex-math>M_s(F) = f^{3 \times 3}\left(f^{3 \times 3}\left(f^{1 \times 1}(F)\right)\right)</tex-math></disp-formula>
where <inline-formula><tex-math>f^{1 \times 1}</tex-math></inline-formula> denotes a convolution operation with a filter of size 1 × 1, and
<inline-formula><tex-math>f^{3 \times 3}</tex-math></inline-formula> denotes a convolution operation with a filter of size 3 × 3. Figure 3 depicts the
computation process of spatial attention.
      </p>
      <p>
        Channel Attention. We obtain channel attention by convolving the feature map
<inline-formula><tex-math>F</tex-math></inline-formula> with the spatial attention mask. To aggregate spatial feature information, common
operations are max-pooling and average-pooling for dimensionality reduction. Hu et al. use them to
design a simple attention module that obtains channel information effectively. However, we consider that
a channel filter should play a specific role only if the object's characteristic information is visible. In
other words, ‘where’ should guide the generation of ‘what’. We first aggregate the spatial information of a
feature map by convolving the spatial attention mask <inline-formula><tex-math>M_s</tex-math></inline-formula> with the feature map
<inline-formula><tex-math>F \in \mathbb{R}^{C \times H \times W}</tex-math></inline-formula>, generating a spatial context descriptor
<inline-formula><tex-math>F_s \in \mathbb{R}^{C \times 1 \times 1}</tex-math></inline-formula>. The descriptor is then forwarded to a
network to produce the channel attention map <inline-formula><tex-math>M_c</tex-math></inline-formula>; the network is composed of fully connected
(FC) layers with three hidden layers. In short, the channel attention is computed as:
        <disp-formula><tex-math>M_c(F) = \sigma\left(\mathrm{FC}(F_s)\right)</tex-math></disp-formula>
where <inline-formula><tex-math>\sigma</tex-math></inline-formula> denotes the sigmoid function, and
<inline-formula><tex-math>W_0</tex-math></inline-formula>, <inline-formula><tex-math>W_1</tex-math></inline-formula>,
<inline-formula><tex-math>W_2</tex-math></inline-formula>, and <inline-formula><tex-math>W_3</tex-math></inline-formula> are the weights of the fully connected layers.
      </p>
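      <p>The idea that ‘where’ guides ‘what’ can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not our implementation: the spatial mask is given by hand rather than produced by the three convolutions, "convolving" the mask with a channel is read as a mask-weighted sum over that channel, and the hidden widths, the ReLU activations, the random weights, and all function names are hypothetical.</p>
      <preformat>
```python
import math
import random


def spatial_descriptor(feature, mask):
    """Pool each channel with the spatial mask as weights (C x H x W to C)."""
    return [
        sum(m * x for mask_row, row in zip(mask, channel)
            for m, x in zip(mask_row, row))
        for channel in feature
    ]


def fully_connected(weights, vector, relu=True):
    """Multiply a weight matrix (list of rows) by a vector, optional ReLU."""
    out = [sum(w * v for w, v in zip(row, vector)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out


def channel_attention(feature, mask, layers):
    """Sigmoid of an MLP applied to the mask-weighted channel descriptor."""
    hidden = spatial_descriptor(feature, mask)
    for weights in layers[:-1]:
        hidden = fully_connected(weights, hidden)  # ReLU is an assumption
    logits = fully_connected(layers[-1], hidden, relu=False)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def refine(feature, mask, attention):
    """Scale every value by its spatial weight and its channel weight."""
    return [
        [[a * m * x for m, x in zip(mask_row, row)]
         for mask_row, row in zip(mask, channel)]
        for a, channel in zip(attention, feature)
    ]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
C, H, W = 4, 3, 3
feature = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
# Hypothetical mask: only the left column of the object is visible.
mask = [[1.0, 0.0, 0.0] for _ in range(H)]
# Three hidden layers (width 8 is an assumption) plus an output layer.
layers = [random_matrix(8, C, rng), random_matrix(8, 8, rng),
          random_matrix(8, 8, rng), random_matrix(C, 8, rng)]

attention = channel_attention(feature, mask, layers)
refined = refine(feature, mask, attention)
print(attention)
```
      </preformat>
      <p>Because the descriptor is pooled through the mask, the occluded columns contribute nothing to the channel weights, which is the sense in which spatial attention conditions channel attention in this sketch.</p>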
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>
        In this section, we evaluate VREN on a standard object detection benchmark, the CrowdHuman dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. To allow better comparisons, we first reproduce Faster R-CNN in the
PyTorch framework and set it as our baseline. We then perform extensive experiments to thoroughly
evaluate the effectiveness of our module.
      </p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Datasets and Evaluation Metrics</title>
      <p>
        Datasets. The quality of the dataset greatly affects the performance and generalization ability of
the detector, so we choose the CrowdHuman dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] as our test data to simulate occlusion situations.
CrowdHuman contains 15000 training images, 4370 validation images, and 5000 test images.
Each image contains 22.64 pedestrians on average, of which about 2.4 have an occlusion rate
above 0.5 [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. We use the full-body benchmark in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] to evaluate our model, and the results
are evaluated on the validation set.
      </p>
      <p>
        Evaluation Metrics. To better reflect the advantages of the proposed method, we use two
metrics for comparison: AP and MR<sup>-2</sup> [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
      <list list-type="bullet">
        <list-item>
          <p>AP, short for average precision, is the most popular metric for object detection. AP
reflects both the precision and recall of detection results. The larger the AP, the better the
performance of the detector.</p>
        </list-item>
        <list-item>
          <p>
            MR<sup>-2</sup> [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ], short for log-average Miss Rate on False Positives Per Image (FPPI) in
[10<sup>-2</sup>, 10<sup>0</sup>], is a common metric used in pedestrian detection. MR<sup>-2</sup> reflects the false positives of detection
results. The smaller the MR<sup>-2</sup> [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ], the better the performance of the detector.
          </p>
        </list-item>
      </list>
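      <p>As a rough illustration of the second metric, the sketch below computes a log-average miss rate from a miss-rate/FPPI curve. It assumes the commonly used convention of nine reference points spaced evenly in log space over the FPPI range above; the function name and the example curve are hypothetical, and the evaluation toolkit used in our experiments may differ in details.</p>
      <preformat>
```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Log-average miss rate over FPPI in [1e-2, 1e0].

    `fppi` must be sorted in increasing order, with `miss_rate[i]` the miss
    rate measured at `fppi[i]`.
    """
    # Reference points spaced evenly in log space between 1e-2 and 1e0.
    refs = [10.0 ** (-2.0 + 2.0 * i / (num_points - 1))
            for i in range(num_points)]
    samples = []
    for ref in refs:
        # Miss rate at the largest measured FPPI not exceeding the reference;
        # if the curve starts later, count the reference as a full miss.
        below = [mr for f, mr in zip(fppi, miss_rate) if not f > ref]
        samples.append(below[-1] if below else 1.0)
    # Geometric mean, with a floor to keep the logarithm finite.
    log_sum = sum(math.log(max(s, 1e-10)) for s in samples)
    return math.exp(log_sum / len(samples))


# Hypothetical curve: the miss rate falls as more false positives are allowed.
fppi = [0.005, 0.05, 0.5]
miss_rate = [0.8, 0.6, 0.4]
print(log_average_miss_rate(fppi, miss_rate))
```
      </preformat>
      <p>Averaging in log space gives the low-FPPI end of the curve as much weight as the high-FPPI end.</p>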
    </sec>
    <sec id="sec-6">
      <title>4.2. Implementation Details</title>
      <p>
        We use the open-source implementation of Faster R-CNN[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for experiments. The models are
trained for 90 epochs on 2 NVIDIA Tesla V100 GPUs with a batch size of 8 per GPU. We use the
SGD optimizer with a momentum of 0.9, the weight decay of 10 . The learning rate is initially set to
0.01 and is decreased by a factor of 10 at the 72nd and the 81st epochs, respectively.
      </p>
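      <p>The step schedule above can be written as a small function. This is a sketch that assumes epochs are counted from 0 and that each decrease takes effect from the milestone epoch onward; the function and parameter names are hypothetical.</p>
      <preformat>
```python
def learning_rate(epoch, base_lr=0.01, milestones=(72, 81), factor=10.0):
    """Step schedule: divide the rate by `factor` at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** drops


# Print the rate just before and at each milestone, and at the last epoch.
for epoch in (0, 71, 72, 80, 81, 89):
    print(epoch, learning_rate(epoch))
```
      </preformat>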
    </sec>
    <sec id="sec-7">
      <title>4.3. Ablation Study.</title>
      <p>
        We perform ablation experiments on the proposed module to evaluate the effectiveness of its
parts, namely spatial attention and channel attention. The baseline is Faster R-CNN using ResNet-50
for feature extraction. Table 1 shows the specific performance in our experiments. The best
performance is achieved only when both spatial attention and channel attention are used in the visible
region enhancement network: our method improves detection
performance by 3.5% in AP and 7.2% in MR<sup>-2</sup> [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] compared to the baseline network Faster R-CNN [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To
improve test efficiency, we add each attention only to the last feature map.
      </p>
    </sec>
    <sec id="sec-8">
      <title>4.4. Comparisons with Other Attention Mechanism</title>
      <p>
        To our knowledge, very few previous works on attention mechanisms report results on crowded
detection. For comparison, we reproduce several attention algorithms. All methods use Faster R-CNN [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
as the base detector with the same implementation details. Table 2 lists the comparison results.
Our method achieves the best results. The reason is that VREN guides the generation of channel
attention through spatial attention. Spatial attention first filters out interfering information at
the spatial level, so that channel attention can focus more accurately on selecting feature patterns
through the learning of the FC layers.
      </p>
      <p>
        To better show the effect of our method, we visually compare the results of three methods:
the baseline, CBAM [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and our method. The reason for choosing CBAM[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] is that it achieves the
best results apart from our method. Figure 5 shows the results of the visual comparison.
      </p>
    </sec>
    <sec id="sec-9">
      <title>5. Conclusion</title>
      <p>In this paper, we have proposed the Visible Region Enhancement Network (VREN), a novel method to
improve the representation power for occluded pedestrian detection. The method builds on the
concept of attention and designs new spatial attention and channel attention. Our approach is not only
effective but also easy to combine with most existing state-of-the-art detection frameworks.</p>
    </sec>
    <sec id="sec-10">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>LIU</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>ANGUELOV</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>ERHAN D</surname>
          </string-name>
          , et al.
          <article-title>SSD: single shot MultiBox detector</article-title>
          [C]//LNCS 9905:
          <source>Proceedings of the 14th European Conference on Computer Vision</source>
          , Amsterdam, Oct 8-
          <issue>16</issue>
          ,
          <year>2016</year>
          . Cham: Springer,
          <year>2016</year>
          :
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>REDMON</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DIVVALA</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>GIRSHICK</surname>
            <given-names>R</given-names>
          </string-name>
          , et al.
          <article-title>You only look once: unified, real-time object detection[C]//</article-title>
          <source>Proceedings of the 2016 IEEE Conference on Computer Vision</source>
          and Pattern Recognition,
          <source>Las Vegas, Jun 27-30</source>
          ,
          <year>2016</year>
          . Washington: IEEE Computer Society,
          <year>2016</year>
          :
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>LIN</surname>
            <given-names>T Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>GOYAL</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>GIRSHICK</surname>
            <given-names>R</given-names>
          </string-name>
          , et al.
          <article-title>Focal loss for dense object detection[C]//</article-title>
          <source>Proceedings of the 2017 IEEE International Conference on Computer Vision</source>
          , Venice, Oct 22-
          <issue>29</issue>
          ,
          <year>2017</year>
          . Washington: IEEE Computer Society,
          <year>2017</year>
          :
          <fpage>2999</fpage>
          -
          <lpage>3007</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Fu</surname>
            <given-names>C Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranga</surname>
            <given-names>A</given-names>
          </string-name>
          , et al.
          <article-title>Dssd: Deconvolutional single shot detector[J]</article-title>
          .
          <source>arXiv preprint arXiv:1701.06659</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
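          <string-name>
            <surname>HE</surname>
            <given-names>K M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>ZHANG</surname>
            <given-names>X Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>REN</surname>
            <given-names>S Q</given-names>
          </string-name>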
          , et al.
          <article-title>Spatial pyramid pooling in deep convolutional networks for visual recognition[J]</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <year>2015</year>
          ,
          <volume>37</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1904</fpage>
          -
          <lpage>1916</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>GIRSHICK</surname>
            <given-names>R</given-names>
          </string-name>
          .
          <article-title>Fast R-CNN</article-title>
          [C]//Proceedings of the 2015
          <source>IEEE International Conference on Computer Vision</source>
          , Santiago,
          <source>Dec 13-16</source>
          ,
          <year>2015</year>
          . Washington: IEEE Computer Society,
          <year>2015</year>
          :
          <fpage>1440</fpage>
          -
          <lpage>1448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>REN</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>HE</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>GIRSHICK</surname>
            <given-names>R</given-names>
          </string-name>
          , et al.
          <article-title>Faster R-CNN: towards real-time object detection with region proposal networks</article-title>
          [C]//
          <source>Advances in Neural Information Processing Systems 28</source>
          , Dec 7-12
          ,
          <year>2015</year>
          . Red Hook: Curran Associates,
          <year>2015</year>
          :
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>DAI</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LI</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>HE</surname>
            <given-names>K</given-names>
          </string-name>
          , et al.
          <article-title>R-FCN: object detection via region based fully convolutional networks</article-title>
          [C]//
          <source>Advances in Neural Information Processing Systems</source>
          <volume>29</volume>
          , Barcelona
          , Dec 5-
          <issue>10</issue>
          ,
          <year>2016</year>
          . Red Hook: Curran Associates,
          <year>2016</year>
          :
          <fpage>379</fpage>
          -
          <lpage>387</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>HE</surname>
            <given-names>K M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>GKIOXARI</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DOLLÁR</surname>
            <given-names>P</given-names>
          </string-name>
          , et al.
          <article-title>Mask R-CNN</article-title>
          [C]//Proceedings of the 2017
          <source>IEEE International Conference on Computer Vision</source>
          , Venice, Oct 22-
          <issue>29</issue>
          ,
          <year>2017</year>
          . Washington: IEEE Computer Society,
          <year>2017</year>
          :
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Cai</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasconcelos</surname>
            <given-names>N</given-names>
          </string-name>
          .
          <article-title>Cascade R-CNN: high quality object detection and instance segmentation[J]</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          ,
          <year>2019</year>
          ,
          <volume>43</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1483</fpage>
          -
          <lpage>1498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Fawzi</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frossard</surname>
            <given-names>P</given-names>
          </string-name>
          .
          <article-title>Measuring the effect of nuisance variables on classifiers</article-title>
          [C]//
          <source>British Machine Vision Conference (BMVC)</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Shao</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>B</given-names>
          </string-name>
          , et al.
          <article-title>CrowdHuman: A benchmark for detecting human in a crowd</article-title>
          [J]
          .
          <source>arXiv preprint arXiv:1805.00123</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Tian</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          , et al.
          <article-title>Deep learning strong parts for pedestrian detection</article-title>
          [C]//
          <source>Proceedings of the IEEE international conference on computer vision</source>
          .
          <year>2015</year>
          :
          <fpage>1904</fpage>
          -
          <lpage>1912</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Zhou</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Multi-label learning of part detectors for occluded pedestrian detection</article-title>
          [J].
          <source>Pattern Recognition</source>
          ,
          <year>2019</year>
          ,
          <volume>86</volume>
          :
          <fpage>99</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Zhang</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bian</surname>
            <given-names>X</given-names>
          </string-name>
          , et al.
          <article-title>Occlusion-aware R-CNN: detecting pedestrians in a crowd</article-title>
          [C]//
          <source>Proceedings of the European Conference on Computer Vision (ECCV)</source>
          .
          <year>2018</year>
          :
          <fpage>637</fpage>
          -
          <lpage>653</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Xie</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pang</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cholakkal</surname>
            <given-names>H</given-names>
          </string-name>
          , et al.
          <article-title>PSC-Net: learning part spatial co-occurrence for occluded pedestrian detection</article-title>
          [J].
          <source>Science China Information Sciences</source>
          ,
          <year>2021</year>
          ,
          <volume>64</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Hu</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            <given-names>G</given-names>
          </string-name>
          .
          <article-title>Squeeze-and-excitation networks</article-title>
          [C]//
          <source>Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2018</year>
          :
          <fpage>7132</fpage>
          -
          <lpage>7141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Park</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Woo</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            <given-names>J Y</given-names>
          </string-name>
          , et al.
          <article-title>BAM: Bottleneck attention module</article-title>
          [J].
          <source>arXiv preprint arXiv:1807.06514</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Fu</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            <given-names>H</given-names>
          </string-name>
          , et al.
          <article-title>Dual attention network for scene segmentation</article-title>
          [C]//
          <source>Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2019</year>
          :
          <fpage>3146</fpage>
          -
          <lpage>3154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Woo</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            <given-names>J Y</given-names>
          </string-name>
          , et al.
          <article-title>CBAM: Convolutional block attention module</article-title>
          [C]//
          <source>Proceedings of the European Conference on Computer Vision</source>
          .
          <year>2018</year>
          :
          <fpage>3</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Gao</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>Q</given-names>
          </string-name>
          , et al.
          <article-title>Global second-order pooling convolutional networks</article-title>
          [C]//
          <source>Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2019</year>
          :
          <fpage>3024</fpage>
          -
          <lpage>3033</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Li</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            <given-names>X</given-names>
          </string-name>
          , et al.
          <article-title>Selective kernel networks</article-title>
          [C]//
          <source>Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2019</year>
          :
          <fpage>510</fpage>
          -
          <lpage>519</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Zhang</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Z</given-names>
          </string-name>
          , et al.
          <article-title>ResNeSt: Split-attention networks</article-title>
          [C]//
          <source>Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2022</year>
          :
          <fpage>2736</fpage>
          -
          <lpage>2746</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Cao</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>S</given-names>
          </string-name>
          , et al.
          <article-title>GCNet: Non-local networks meet squeeze-excitation networks and beyond</article-title>
          [C]//
          <source>Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops</source>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Wang</surname>
            <given-names>Q</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            <given-names>P</given-names>
          </string-name>
          , et al.
          <article-title>ECA-Net: Efficient channel attention for deep convolutional neural networks</article-title>
          [C]//
          <source>Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          , Seattle, WA, USA. IEEE,
          <year>2020</year>
          :
          <fpage>13</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Qin</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            <given-names>F</given-names>
          </string-name>
          , et al.
          <article-title>FcaNet: Frequency channel attention networks</article-title>
          [C]//
          <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          .
          <year>2021</year>
          :
          <fpage>783</fpage>
          -
          <lpage>792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Zhang</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zu</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            <given-names>J</given-names>
          </string-name>
          , et al.
          <article-title>EPSANet: An efficient pyramid split attention block on convolutional neural network</article-title>
          [J]
          .
          <source>arXiv preprint arXiv:2105.14447</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Zhang</surname>
            <given-names>Q L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            <given-names>Y B</given-names>
          </string-name>
          .
          <article-title>SA-Net: Shuffle attention for deep convolutional neural networks</article-title>
          [C]//
          <source>ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE,
          <year>2021</year>
          :
          <fpage>2235</fpage>
          -
          <lpage>2239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Zagoruyko</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Komodakis</surname>
            <given-names>N</given-names>
          </string-name>
          .
          <article-title>Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer</article-title>
          [J].
          <source>arXiv preprint arXiv:1612.03928</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Chu</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>X</given-names>
          </string-name>
          , et al.
          <article-title>Detection in crowded scenes: One proposal, multiple predictions</article-title>
          [C]//
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2020</year>
          :
          <fpage>12214</fpage>
          -
          <lpage>12223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Dollár</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wojek</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schiele</surname>
            <given-names>B</given-names>
          </string-name>
          , et al.
          <article-title>Pedestrian detection: An evaluation of the state of the art</article-title>
          [J]
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          ,
          <year>2011</year>
          ,
          <volume>34</volume>
          (
          <issue>4</issue>
          ):
          <fpage>743</fpage>
          -
          <lpage>761</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>