<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>YOLO-DSRF: An Improved Small-Scale Pedestrian Detection Algorithm Based on YOLOv4</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Runjie Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuguang Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yipeng Duan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Shi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Supercomputing Center in Zhengzhou, Zhengzhou University</institution>
          ,
          <addr-line>Zhengzhou 450001, Henan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>307</fpage>
      <lpage>313</lpage>
      <abstract>
        <p>Although existing object detection algorithms achieve good results, detecting small-scale pedestrians accurately and in real time remains a challenge. To address the complex structure, large parameter count, and high missed-detection rate for small targets in existing pedestrian detection algorithms, this paper proposes the YOLO-DSRF pedestrian detection algorithm. Building on YOLOv4, depthwise separable convolution is first introduced to significantly reduce the model's parameter count and computation; a channel attention mechanism is added to the network to strengthen the influence of important channel features; a feature fusion module is designed in the backbone network to merge deep and shallow features and effectively extract target semantic and location information; and a receptive field module is introduced in the detection head to simulate the human receptive field and enhance feature extraction for small targets. Trained and verified on the Caltech dataset, the proposed algorithm reduces the parameter count by 65.2% and increases running speed on the GPU by 20% compared with the original algorithm, while the AP is roughly the same. The proposed algorithm thus effectively reduces model complexity while maintaining accuracy, thereby increasing running speed.</p>
      </abstract>
      <kwd-group>
        <kwd>Pedestrian Detection</kwd>
        <kwd>YOLOv4</kwd>
        <kwd>Depthwise Separable Convolution</kwd>
        <kwd>SE</kwd>
        <kwd>Receptive Field Block</kwd>
        <kwd>Feature Fusion Module</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.Introduction</title>
      <p>
        Object detection has long been a hot research direction in computer vision; its task is to accurately identify and locate all objects in an image or video. Pedestrian detection is an important branch of object detection, widely used in re-identification and intelligent monitoring[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Current mainstream pedestrian detection algorithms are based on deep learning and fall into one-stage and two-stage algorithms. Two-stage algorithms are based on candidate region extraction; the main ones include RCNN[5], Faster-RCNN[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and Mask-RCNN[
        <xref ref-type="bibr" rid="ref5">7</xref>
        ]. Although their accuracy is high, they run slowly. One-stage algorithms have a simple structure and detect objects directly from the image, so they are computationally efficient and fast; the main ones include the SSD series[8] and the YOLO series[9]. However, the accuracy of one-stage detectors is lower. Because small-scale pedestrians have low resolution and little feature information, the surrounding environment and image noise easily cause false and missed detections. How to detect small-scale pedestrians accurately and quickly has therefore become a hot research problem in the field of pedestrian detection.
      </p>
      <p>The shallow features of a pedestrian detection network have a low downsampling factor and a small receptive field, and thus pay more attention to small-scale pedestrian targets; however, their weak ability to represent semantic information easily leads to false detections. Deep features have large downsampling factors and large receptive fields, which easily causes many small-scale pedestrian targets to be missed. To solve these problems, some scholars proposed FPN (Feature Pyramid Networks)[10], which fuses deep and shallow features to improve detection, but this kind of algorithm is slow and cannot detect in real time. YOLOv3[11], proposed by Redmon et al., uses the DarkNet-53[12] residual network for feature extraction and combines it with an FPN to significantly improve small-target detection performance. Bochkovskiy et al. improved on YOLOv3 and proposed YOLOv4[13], which uses CSPDarkNet[14] as the backbone network and SPP+PAN to fuse feature maps of different sizes, greatly improving detection accuracy; however, its complex structure and large number of parameters make it difficult to deploy on mobile terminals or embedded devices.</p>
      <p>
        To reduce the complexity of the YOLOv4 model and improve its running speed, Wang et al.[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] replaced the YOLOv4 feature extraction network with the lightweight MobileNet and alleviated the common difficulty of small-target detection by adding a shallow detection head. Li et al.[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] constructed a feature extraction network based on ShuffleNet and a channel attention mechanism, improving speed while maintaining accuracy.
      </p>
      <p>
        To make YOLO more effective at detecting small targets, Gao et al.[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduced a threshold attention module (TAM) and embedded CAM into BiFPN as the feature pyramid network. Wei[
        <xref ref-type="bibr" rid="ref9">17</xref>
        ] introduced ASFF on the basis of FPN. Unlike the layer-by-layer fusion of PANet, ASFF adaptively learns weight coefficients for all feature maps participating in the fusion, multiplies each layer's feature map by its corresponding weight, and then fuses them, improving the network's detection accuracy for small targets.
      </p>
      <p>This paper proposes the YOLO-DSRF (Depthwise Separable convolution &amp; SE attention module &amp; Receptive Field module &amp; Feature Fusion module) algorithm by improving YOLOv4. First, replacing traditional convolution with depthwise separable convolution reduces the parameter count and improves running speed. Second, to fully extract the features of small objects, the SE attention mechanism is introduced to suppress the influence of noise and enhance the learning of important channel features. Third, an FFM module is designed to fuse deep and shallow features and strengthen the representation of geometric and semantic information in shallow features. Fourth, a receptive field module is added to the detection head for small-scale targets to enhance feature extraction for them. Finally, the effectiveness of the proposed algorithm is verified on the Caltech dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>2.Improved Algorithm</title>
      <p>The proposed YOLO-DSRF algorithm improves on YOLOv4; the improved parts are the dashed boxes in Figure 1. Depthwise separable convolution replaces the 3*3 convolution kernels in the original network, and an SE attention mechanism, an RFB module, and a top-down feature fusion module are added.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1.Depthwise Separable Convolution</title>
      <p>
        Depthwise separable convolution is a lightweight method for embedded devices proposed by Howard et al. in MobileNet[
        <xref ref-type="bibr" rid="ref7">15</xref>
        ] in 2017. As shown in Figure 2, it consists of a depthwise convolution followed by a pointwise convolution. For an input feature map of size H*W*C, the depthwise convolution applies C kernels of size n*n*1, one to each input channel, to output a feature map; because these per-channel operations are independent, they do not exploit the feature information of different channels at the same spatial position. The pointwise convolution therefore applies C' kernels of size 1*1*C to the depthwise output, producing an output feature map of size H'*W'*C'. The parameter count and computation of a depthwise separable convolution are about 1/n² of those of an ordinary convolution. To improve the running speed of the network and meet real-time requirements, the 3*3 convolution kernels in YOLOv4 are replaced with depthwise separable convolutions, whose parameter count and computation are about 1/9 of ordinary convolution. After the replacement, the model's parameters and computation are greatly reduced, which facilitates deployment on embedded devices.
      </p>
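As a concrete check of the ratio stated above, the sketch below (illustrative, not from the paper's code) counts the parameters of a standard convolution versus a depthwise separable one; for an n*n kernel the exact ratio works out to 1/n² + 1/C', which is approximately 1/9 for a 3*3 kernel.

```python
def conv_params(n, c_in, c_out):
    """Parameters of a standard n*n convolution (bias omitted)."""
    return n * n * c_in * c_out

def depthwise_separable_params(n, c_in, c_out):
    """Depthwise (C kernels of size n*n*1) plus pointwise (C' kernels of size 1*1*C)."""
    depthwise = n * n * c_in   # one n*n*1 kernel per input channel
    pointwise = c_in * c_out   # C' kernels of size 1*1*C
    return depthwise + pointwise

# Example: a 3*3 convolution mapping 256 channels to 256 channels.
standard = conv_params(3, 256, 256)                      # 589,824
separable = depthwise_separable_params(3, 256, 256)      # 2,304 + 65,536 = 67,840
ratio = separable / standard                             # = 1/9 + 1/256, roughly 0.115
```

Applied across a whole backbone, this per-layer reduction is what drives the large drop in total parameters reported in Section 3.4.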
    </sec>
    <sec id="sec-4">
      <title>2.2.SE Attention Mechanism</title>
      <p>
        Attending to important parts while ignoring irrelevant ones is the essence of the attention mechanism. SENet[
        <xref ref-type="bibr" rid="ref8">16</xref>
        ] adopts a squeeze-and-excitation module to collect global information, capture the relationships between channels, and improve representation ability. As shown in Figure 3, the SE attention mechanism consists of a squeeze module and an excitation module. The squeeze module uses global average pooling to collect global spatial information; the excitation module uses fully connected layers and nonlinearities to capture the channel relationships. The output attention vector is multiplied by each channel of the input feature for scaling, so SE suppresses noise while strengthening important channels. Since small-scale pedestrians contain few pixels, the backbone network can extract only limited features; to encourage the network to learn the more important features and improve detection accuracy, this paper adds SE to the Res unit of YOLOv4 to form the Res unit-SE module in Figure 1, enhancing the backbone's ability to extract important features. SE is also introduced after the convolution in the dashed box in Figure 1, which enhances the learning of the important channel features of P4 before fusion with the P3 and P5 feature maps; introducing it here better exploits the effect of the SE module.
      </p>
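The squeeze-excitation-scale pipeline described above can be sketched in plain Python. This is a toy sketch with a hypothetical two-channel input and hand-picked weights, not the paper's implementation; real SE blocks learn the two fully connected layers during training.

```python
import math

def se_attention(feature_maps, w1, w2):
    """Squeeze-and-excitation over a list of flattened channels.

    feature_maps: list of C channels, each a flat list of values.
    w1: weights of the first FC (reduction) layer, shape [C_r][C].
    w2: weights of the second FC (expansion) layer, shape [C][C_r].
    """
    # Squeeze: global average pooling, one scalar per channel.
    s = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(s[i] * row[i] for i in range(len(s)))) for row in w1]
    scores = [sum(hidden[i] * row[i] for i in range(len(hidden))) for row in w2]
    attn = [1.0 / (1.0 + math.exp(-x)) for x in scores]
    # Scale: weight every value of a channel by its attention score.
    return [[v * a for v in ch] for ch, a in zip(feature_maps, attn)]

# Toy input: two channels; hand-picked weights with reduction ratio 2.
out = se_attention([[1.0, 3.0], [2.0, 2.0]],
                   w1=[[0.5, 0.5]],
                   w2=[[1.0], [-1.0]])
# Channel 0 is scaled by sigmoid(2) ~ 0.88, channel 1 by sigmoid(-2) ~ 0.12,
# i.e. the block amplifies one channel and suppresses the other.
```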
    </sec>
    <sec id="sec-5">
      <title>2.3.Receptive Field Module</title>
      <p>The receptive field module was proposed by Liu et al. [19] in RFBNet. Inspired by the receptive field structure in the human visual system, it simulates the population receptive fields of the human superficial retinal image. As shown in Figure 4, its construction is similar to the Inception structure: a multi-branch convolution module that captures multi-scale information. In addition, dilated convolution is introduced to expand the sampling range and extract finer target features. Since small-scale pedestrians offer few features, this paper introduces RFB into YOLOv4 to enhance the network's feature representation for small objects and improve its detection of them.</p>
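The way dilated convolution expands the sampling range can be quantified with the standard formula for the spatial extent of a dilated kernel, k + (k-1)(d-1). The branch configuration below is illustrative of the RFB idea (small kernels paired with growing dilation rates), not the exact setup used in the paper.

```python
def dilated_kernel_extent(k, d):
    """Effective spatial extent of a k*k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

# RFB-style branches pair a small kernel with increasing dilation rates,
# so each branch samples the input over a different receptive-field size.
branches = [(3, 1), (3, 3), (3, 5)]          # (kernel size, dilation rate)
extents = [dilated_kernel_extent(k, d) for k, d in branches]   # [3, 7, 11]
```

Concatenating such branches lets the module see both fine local detail and wider context at the same parameter cost as a few 3*3 convolutions, which is why it helps on low-resolution small targets.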
    </sec>
    <sec id="sec-6">
      <title>2.4.Feature Fusion Module</title>
      <p>The shallow features extracted by the network have low downsampling factors and high resolution, suiting them to detecting small-scale pedestrian targets: they have strong geometric representation ability but weak semantic representation ability. The deep features have large downsampling factors and low resolution, suiting them to detecting large-scale pedestrian targets: they have strong semantic representation ability but weak geometric representation ability. To enhance the network's feature representation and achieve accurate pedestrian detection, this paper designs a top-down FFM, shown in Figure 5(a), and a bottom-up FFM, shown in Figure 5(b), which strengthen the fusion of deep and shallow features in the backbone network and enhance the semantic representation of shallow features, improving the detection of small-scale pedestrian targets. The modules use P3, P4, and P5 in Figure 1 as input. As Figure 5 shows, the top-down FFM first upsamples P5, stacks it with P4, and then performs convolution to integrate the features, outputting P4' to replace P4 in the original network; the result of upsampling P4' is stacked with P3 and convolved to output P3', which replaces P3 in the original network. The bottom-up FFM downsamples the P3 feature map with a stride-2 convolution, stacks it with P4, and convolves to output P4'; the result of downsampling P4' is stacked with P5 and convolved to output P5'. Both operations strengthen the fusion of shallow and deep features in the backbone network and enhance the geometric and semantic representation ability of the features, thereby improving detection accuracy.</p>
      <p>Fig 5 Feature fusion module</p>
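One step of the top-down path (upsample the deep map, stack it with the shallow one at matched resolution) can be sketched as follows. The 1*1 and 2*2 maps are toy inputs, and the integrating convolution that would follow the stack is left out; this is a sketch of the data flow, not the paper's implementation.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def top_down_fuse(deep, shallow):
    """Upsample the deep map to the shallow resolution and stack the two
    along the channel axis; a convolution would then integrate them."""
    up = upsample2x(deep)
    assert len(up) == len(shallow) and len(up[0]) == len(shallow[0])
    return [up, shallow]  # two-channel stack, standing in for concatenation

p5 = [[1.0]]                         # toy 1x1 deep map
p4 = [[0.0, 0.0], [0.0, 0.0]]        # toy 2x2 shallow map
fused = top_down_fuse(p5, p4)        # stack of the upsampled P5 and P4
```

The bottom-up FFM mirrors this with a stride-2 downsampling in place of the upsampling.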
    </sec>
    <sec id="sec-7">
      <title>3.Experiment</title>
    </sec>
    <sec id="sec-8">
      <title>3.1.Dataset</title>
      <p>
        The experiments use the Caltech[
        <xref ref-type="bibr" rid="ref10">18</xref>
        ] pedestrian dataset, collected from conventional street-view roads, which contains 350,000 pedestrian annotation boxes and 2,300 different pedestrians and is well suited to small-scale pedestrian detection. After filtering, the dataset is divided proportionally into training, validation, and test sets.
      </p>
    </sec>
    <sec id="sec-9">
      <title>3.2.Experimental Environment And Parameter Configuration</title>
      <p>The experimental environment is a Linux operating system with a Tesla V100S graphics card, CUDA 11.0, the PyTorch 1.8 deep learning framework, and Python 3.6.8. The experiments use the Adam optimizer with a batch size of 8 and a cosine-annealed learning rate with an initial learning rate of 0.001.</p>
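The cosine-annealed schedule with the stated initial rate of 0.001 follows the usual closed form below. The total step count and minimum rate are illustrative assumptions; the paper only gives the initial rate.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=0.001, lr_min=0.0):
    """Cosine-annealed learning rate: starts at lr_max, decays to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

# Assumed 100-step schedule: 0.001 at the start, half-way down at the midpoint,
# lr_min at the end of training.
start = cosine_annealing_lr(0, 100)     # 0.001
mid = cosine_annealing_lr(50, 100)      # 0.0005
end = cosine_annealing_lr(100, 100)     # 0.0
```

In a PyTorch setup like the one described, the equivalent behaviour is typically obtained with `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around the Adam optimizer.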
    </sec>
    <sec id="sec-10">
      <title>3.3.Measurement indicators</title>
      <p>This paper uses five indicators to evaluate network performance: P (Precision), R (Recall), AP (Average Precision), PARAM (Parameters), and FPS (Frames Per Second). The calculation formulas of P, R, and AP are shown in formulas (1)-(3).</p>
      <p>P = TP / (TP + FP) * 100% （1）
R = TP / (TP + FN) * 100% （2）
AP = ∫₀¹ P(R) dR （3）</p>
      <p>TP (True Positive) is the number of targets correctly detected by the model, FP (False Positive) is the number of falsely detected targets, and FN (False Negative) is the number of missed targets. AP is the area under the P-R curve and measures the detection capability of the network; the parameter count measures the complexity of the network; FPS, the number of images processed per second, measures the running speed of the network.</p>
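Formulas (1)-(3) translate directly into code. The trapezoidal integration and the sample counts below are illustrative; evaluation toolkits typically also interpolate the precision curve before integrating.

```python
def precision(tp, fp):
    """Formula (1): share of detections that are correct, in percent."""
    return tp / (tp + fp) * 100

def recall(tp, fn):
    """Formula (2): share of ground-truth targets that are found, in percent."""
    return tp / (tp + fn) * 100

def average_precision(recalls, precisions):
    """Formula (3): area under the P-R curve, here via the trapezoidal rule.
    recalls must be increasing in [0, 1]; precisions are the matching P values."""
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return area

# Illustrative counts: 80 correct detections, 20 false alarms, 10 misses.
p = precision(80, 20)    # 80.0 (%)
r = recall(80, 10)       # ~88.9 (%)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])   # 0.8
```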
    </sec>
    <sec id="sec-11">
      <title>3.4.Results and Analysis</title>
      <p>To verify the effectiveness of the proposed algorithm, an ablation experiment is designed. First, YOLO-D replaces the 3*3 convolutions in the network with depthwise separable convolutions; the SE attention mechanism, RFB, and the two FFMs designed in this paper are then introduced in turn. The results are shown in Table 1. Introducing depthwise separable convolution greatly reduces the network's parameter count, to only 27.9% of YOLOv4, while AP decreases by 2.5% and Recall by nearly 4%. Introducing the SE attention mechanism increases the parameter count by 14.11M, AP by 0.93%, and Recall by 1.22%, which shows that the SE attention mechanism can enhance feature extraction. Introducing RFB into the network detection head increases the parameter count by only 0.35M while raising Recall by about 2% and AP by 1.1%, which shows that RFB can more effectively extract the features of small targets. YOLO-D-FFM1 introduces the bottom-up feature fusion module designed in this paper: parameters increase by 12.6M, AP by 1.52%, and Recall by 2.27%. YOLO-D-FFM2 introduces the top-down feature fusion module: parameters increase by 5.04M, AP by 1.47%, and Recall by 2.28%. This shows that the two modules can improve the representation of semantic and geometric information by fusing deep and shallow features in the backbone network. In addition, based on YOLO-D-SE, introducing RFB, the bottom-up FFM, and the top-down FFM respectively increases AP by 0.89%, 0.95%, and 1.3% and Recall by 0.89%, 0.95%, and 1.3% compared with YOLO-D-SE. Finally, compared with YOLO-D, the proposed YOLO-DSRF improves AP, Recall, and Precision by 2.35%, 3.52%, and 1.23% respectively; compared with YOLOv4, the difference in AP is only 0.15%, the difference in Recall is only 0.33%, and Precision increases by 0.07%. The parameter count is 19.49M higher than that of YOLO-D but only 35.8% of YOLOv4, and the model runs 20% faster on the GPU.</p>
    </sec>
    <sec id="sec-12">
      <title>4.Conclusion</title>
      <p>In view of the complex structure of the YOLOv4 algorithm, its poor real-time performance on mobile devices, and its insufficient extraction of small-target pedestrian features, this paper replaces ordinary convolution with depthwise separable convolution and introduces the SE attention mechanism, RFB, and the FFMs designed in this paper. The resulting YOLO-DSRF algorithm greatly reduces the parameter count and computation and enhances the network's feature extraction ability, especially for small targets. However, the low resolution of the feature maps extracted by the network still limits the feature analysis of small objects. In future work, we will study the construction of a high-resolution lightweight network and optimize the feature fusion method to further improve the detection of small-target pedestrians.</p>
    </sec>
    <sec id="sec-13">
      <title>5.References</title>
      <p>[6]. REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis &amp; Machine Intelligence.</p>
      <p>[8]. LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]//European Conference on Computer Vision. Cham: Springer, 2016: 21-37.</p>
      <p>[9]. REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 779-788.</p>
      <p>[10]. LIN T Y, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2117-2125.</p>
      <p>[11]. REDMON J, FARHADI A. YOLOv3: an incremental improvement[J]. arXiv: 1804.02767, 2018.</p>
      <p>[12]. KIM K J, KIM P K, CHUNG Y S, et al. Performance enhancement of YOLOv3 by adding prediction layers with spatial pyramid pooling for vehicle detection[C]//2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2018: 1-6.</p>
      <p>[13]. BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: optimal speed and accuracy of object detection[J]. arXiv: 2004.10934, 2020.</p>
      <p>[19]. LIU S, HUANG D. Receptive field block net for accurate and fast object detection[C]//European Conference on Computer Vision, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]. Li
          <string-name>
            <given-names>J</given-names>
            ,
            <surname>Liang</surname>
          </string-name>
          <string-name>
            <given-names>X</given-names>
            ,
            <surname>Shen S M</surname>
          </string-name>
          , et al.
          <article-title>Scale-aware fast R-CNN for pedestrian detection[J]</article-title>
          .
          <source>IEEE transactions on Multimedia</source>
          ,
          <year>2017</year>
          ,
          <volume>20</volume>
          (
          <issue>4</issue>
          ):
          <fpage>985</fpage>
          -
          <lpage>996</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2].
          <string-name>
            <surname>Wang</surname>
            <given-names>H</given-names>
          </string-name>
          , Zang W.
          <source>Research On Object Detection Method In Driving Scenario Based On Improved YOLOv4[C]//2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC)</source>
          . IEEE,
          <year>2022</year>
          ,
          <volume>6</volume>
          :
          <fpage>1751</fpage>
          -
          <lpage>1754</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]. Li
          <string-name>
            <given-names>Y</given-names>
            ,
            <surname>Lv</surname>
          </string-name>
          <string-name>
            <surname>C</surname>
          </string-name>
          .
          <article-title>Ss-yolo: An object detection algorithm based on YOLOv3</article-title>
          and shufflenet[C]//
          <source>2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC)</source>
          . IEEE,
          <year>2020</year>
          ,
          <volume>1</volume>
          :
          <fpage>769</fpage>
          -
          <lpage>772</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]. Gao
          <string-name>
            <given-names>Y</given-names>
            ,
            <surname>Wu</surname>
          </string-name>
          <string-name>
            <given-names>Z</given-names>
            ,
            <surname>Ren</surname>
          </string-name>
          <string-name>
            <surname>M</surname>
          </string-name>
          , et al.
          <article-title>Improved YOLOv4 Based on Attention Mechanism for Ship Detection in SAR Images[J]</article-title>
          .
          <source>IEEE Access</source>
          ,
          <year>2022</year>
          ,
          <volume>10</volume>
          :
          <fpage>23785</fpage>
          -
          <lpage>23797</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]. HE K, GKIOXARI G, DOLLAR P, et al. Mask R-CNN[C]//IEEE International Conference on Computer Vision. [5]. GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//
          <source>Proceedings of the IEEE Conference on Computer Vision</source>
          and Pattern Recognition,
          <year>2014</year>
          :
          <fpage>580</fpage>
          -
          <lpage>587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]. REN S, HE K, GIRSHICK R, et al.
          <article-title>Faster R-CNN: towards real-time object detection with region proposal networks</article-title>
          [J]. IEEE Transactions on Pattern Analysis &amp; Machine Intelligence.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [15].HOWARD
          <string-name>
            <given-names>A G</given-names>
            ,
            <surname>ZHU M L</surname>
            , CHEN
          </string-name>
          <string-name>
            <surname>B</surname>
          </string-name>
          , et al.
          <article-title>MobileNets: efficient convolutional neural networks for mobile vision applications</article-title>
          [J].
          <source>Computer Vision</source>
          and Pattern Recognition arXiv,
          <source>Preprint arXiv: 1704. 04861</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [16]. HU J, SHEN L, SUN G.
          <article-title>Squeeze-and-excitation networks</article-title>
          [C]//Proceedings of the
          <source>2018 IEEE Conference on Computer Vision</source>
          and Pattern Recognition,
          <source>Salt Lake City, Jun 18-22</source>
          ,
          <year>2018</year>
          . Washington: IEEE Computer Society,
          <year>2018</year>
          :
          <fpage>7132</fpage>
          -
          <lpage>7141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [17].Wei Hongyu.
          <article-title>Aircraft target detection method in remote sensing image based on YOLOv4 antiocclusion [D]</article-title>
          . China University of Mining and Technology,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [18]. ESS A, MULLER T, GRABNER H, et al.
          <article-title>Segmentation-based urban traffic scene understanding</article-title>
          [C].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>