<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Traffic Sign Detection Based on the Fusion of YOLOR and CBAM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qiang Luo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenbin Zheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Software Engineering, Chengdu University of Information Technology</institution>
          ,
          <addr-line>Chengdu 610225, Sichuan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>V.C. &amp; V.R. Key Lab of Sichuan Province, Sichuan Normal University</institution>
          ,
<addr-line>Chengdu 610068, Sichuan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>115</fpage>
      <lpage>119</lpage>
      <abstract>
<p>In the field of traffic sign recognition, traffic signs usually occupy very small areas of the input image. Generally, multi-layer residual networks based on Convolutional Neural Networks (CNN) are used to extract feature information from these small objects, which often leads to feature misalignment during feature aggregation. Moreover, most CNN-based algorithms make use of only explicit knowledge, not implicit knowledge. In this paper, a novel method (named YOLOR-A) that combines YOLOR with CBAM is proposed. The CBAM attention mechanism module is integrated to focus on the important objects. The method adds implicit knowledge into the model, which realizes a translation mapping of the feature kernel space and solves the problem of feature misalignment in traffic sign detection. The experimental results show that the proposed method achieves 94.7% mAP at 57 FPS on the TT100k dataset, satisfying real-time detection requirements and outperforming state-of-the-art methods.</p>
      </abstract>
      <kwd-group>
<kwd>Traffic sign detection</kwd>
        <kwd>Implicit knowledge</kwd>
        <kwd>Attention mechanism</kwd>
        <kwd>Feature alignment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Driver assistance systems and autonomous vehicles have been widely used[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As a sub-module, the
traffic sign detection system plays an important role in improving driving safety. For the task of traffic
sign detection, traffic signs usually only occupy a small proportion of the input image, while extracting
high-dimensional features requires multi-level down-sampling, which leads to the loss of characteristic
information of small traffic signs[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Although the residual structure can alleviate the information loss
in the down-sampling process, the residual information fusion process[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is an indiscriminate
combination of context information, which often leads to misalignment in the feature aggregation
process[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The use of implicit knowledge offers a good solution to this problem. In deep learning,
implicit knowledge refers to the observation-independent knowledge implicit in the model, which can
help the model to utilize feature information more effectively. Wang et al.[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] integrated implicit and
explicit knowledge into a unified matrix factorization framework for customer volume prediction.
Belzen et al.[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used the implicit knowledge in neural networks to assist in the analysis of protein
sensitivity features, achieving functional dissection of proteins.
      </p>
      <p>
        This paper proposes a novel method (named YOLOR-A) that combines YOLOR[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (You Only
Learn One Representation) and CBAM[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (Convolutional Block Attention Module). The CBAM attention
mechanism is used to focus on the important traffic sign region, and the implicit knowledge is integrated
to solve the misalignment problem.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. YOLOR-A for Traffic Sign Detection</title>
      <p>
        The YOLOR-A model is composed of a backbone feature extraction network, a neck network, and a
recognition head. The backbone uses a network architecture based on CSPDarknet53[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]; the core of the neck is the structure of Feature Pyramid Networks and Path Aggregation Networks (PAN[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]); and the head uses the structure of the YOLO[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] detector. The Align feature alignment module is added to the neck, and the
Pre prediction refinement module is added to the head. The YOLOR-A model framework is shown in
Figure 1. The CBAM attention module is added after the neck network to refine the small object features
and improve the recognition accuracy.
      </p>
      <p>[Figure 1. The YOLOR-A framework: a Focus layer and stacked CSP blocks form the Backbone; the Neck applies CSP down-/up-sampling with Align modules; CBAM and Pre modules feed the detection outputs P3-P6.]</p>
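      <p>To make the data flow concrete, the following is a schematic sketch of the YOLOR-A forward pass in PyTorch-style Python; the callables backbone, neck, align, cbam, pre, and heads are hypothetical placeholders for the blocks in Figure 1, not the authors' code.</p>
      <preformat>
# Schematic YOLOR-A forward pass (illustrative sketch only).
def yolor_a_forward(image, backbone, neck, align, cbam, pre, heads):
    # 1. CSPDarknet53 backbone extracts multi-scale features (P3..P6).
    feats = backbone(image)
    # 2. FPN + PAN neck aggregates context across scales.
    feats = neck(feats)
    # 3. Align modules add implicit knowledge to re-align the aggregated features.
    feats = [align[i](f) for i, f in enumerate(feats)]
    # 4. CBAM refines each scale with channel and spatial attention.
    feats = [cbam[i](f) for i, f in enumerate(feats)]
    # 5. Pre modules refine the representation before the YOLO detection heads.
    return [heads[i](pre[i](f)) for i, f in enumerate(feats)]
      </preformat>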
    </sec>
    <sec id="sec-3">
      <title>2.1. Implicit knowledge learning module</title>
      <p>The implicit knowledge in a neural network generally comes from the deep layer of the network,
which is the knowledge implicit in the model and not affected by the input value. Therefore, the implicit
knowledge representation is independent of concrete input values, which can be regarded as a set of
constant tensors Z = (z1, z2, …, zk). Before the introduction of implicit knowledge, the mapping
relationship between objects and features can be abstracted as a point-to-point mapping relationship, as
shown in Figure 2. When a CNN-based residual network extracts feature information, this simple
correspondence is prone to misalignment in the feature aggregation stage.</p>
      <p>As shown in Figure 3, after the introduction of implicit knowledge, the implicit knowledge is added
to the output features of the neck network of the model, and the features can be aligned to the
network output through translation transformation, which solves the problem of misalignment in the
feature aggregation process. By adding implicit knowledge to the prediction head module and
multiplying it with the input features, the point-to-point mapping relationship in the original network
can be transformed into a mapping of feature points to range intervals, so that different categories can
achieve finer feature mapping, which facilitates the model to distinguish different categories and thus
improve the classification accuracy.</p>
      <p>[Figure 2. The point-to-point mapping: an input x passes through f_θ to produce f_θ(x), which is matched to the target.]</p>
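      <p>As a concrete illustration, the following is a minimal PyTorch sketch of the implicit addition and multiplication described above, in the spirit of YOLOR[<xref ref-type="bibr" rid="ref7">7</xref>]; the class and parameter names are illustrative, not the authors' implementation.</p>
      <preformat>
import torch
import torch.nn as nn

class ImplicitAdd(nn.Module):
    # A learnable constant tensor z, independent of the input, added to the
    # neck outputs: a translation of the feature space used for alignment.
    def __init__(self, channels):
        super().__init__()
        self.z = nn.Parameter(torch.zeros(1, channels, 1, 1))
        nn.init.normal_(self.z, mean=0.0, std=0.02)

    def forward(self, x):
        return x + self.z  # shift every feature map by the learned constant

class ImplicitMul(nn.Module):
    # A learnable constant tensor z multiplied with the head inputs, turning
    # the point-to-point mapping into a finer point-to-interval mapping.
    def __init__(self, channels):
        super().__init__()
        self.z = nn.Parameter(torch.ones(1, channels, 1, 1))
        nn.init.normal_(self.z, mean=1.0, std=0.02)

    def forward(self, x):
        return x * self.z  # rescale features by the learned constant
      </preformat>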
    </sec>
    <sec id="sec-4">
      <title>2.2. CBAM attention module</title>
      <p>
        CBAM[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a simple but effective attention module. In the traffic sign dataset, most of each image consists
of irrelevant background information. Using CBAM can help the model extract effective feature
information and focus on the important areas of the traffic signs.
      </p>
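      <p>For reference, a compact PyTorch sketch of CBAM's two stages follows: channel attention from pooled descriptors passed through a shared MLP, then spatial attention from a 7x7 convolution over pooled channel maps, following the design in [<xref ref-type="bibr" rid="ref8">8</xref>]. This is a generic re-implementation, not the authors' code.</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to both average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(avg + mx)  # per-channel weights in (0, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # average over channels
        mx, _ = x.max(dim=1, keepdim=True)  # max over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)      # refine channels first,
        return x * self.sa(x)   # then spatial locations
      </preformat>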
    </sec>
    <sec id="sec-5">
      <title>3. Experiment</title>
    </sec>
    <sec id="sec-6">
      <title>3.1. Datasets and Evaluation metrics</title>
      <p>
        TT-100k[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]: The TT-100k dataset contains 16,811 images of 2048 x 2048 pixels collected from
Chinese street scenes, covering a total of 234 types of traffic signs. However, the number of instances
per category varies greatly, so this paper selects the 45 most frequent categories for research.
      </p>
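      <p>As an illustrative preprocessing step, the following Python sketch counts per-category instances and keeps the 45 most frequent ones; the annotations.json layout is assumed from the public TT-100k release and is not described in this paper.</p>
      <preformat>
import json
from collections import Counter

# Count instances per traffic sign category in TT-100k and keep the 45
# most frequent ones (annotation layout assumed from the public release).
with open("annotations.json") as f:
    anno = json.load(f)

counts = Counter(
    obj["category"]
    for img in anno["imgs"].values()
    for obj in img.get("objects", [])
)
top45 = [cat for cat, _ in counts.most_common(45)]
print(top45)
      </preformat>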
      <p>
        Detection accuracy is evaluated using the Mean Average Precision (mAP[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), and detection speed is evaluated with Frames Per Second (FPS).
      </p>
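      <p>For reference, AP is the area under the precision-recall curve of one class, and mAP averages AP over the N classes (the standard definition behind the metric in [<xref ref-type="bibr" rid="ref13">13</xref>]):</p>
      <disp-formula>
        <tex-math>\mathrm{AP} = \int_0^1 p(r)\,\mathrm{d}r, \qquad \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i</tex-math>
      </disp-formula>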
    </sec>
    <sec id="sec-7">
      <title>3.2. Results and Analysis</title>
      <p>The experimental platform is the Ubuntu 20.04.1 operating system with the PyTorch 1.7.1 deep
learning framework; the hardware configuration is an NVIDIA GeForce RTX 3090 GPU with 24 GB of
video memory. The code is written in Python 3.7 and run on the PyCharm platform.</p>
      <p>
        This paper compares the classic two-stage object detection algorithms Faster R-CNN and Cascade R-CNN[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; the single-stage algorithms SSD512[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and YOLOv5s; and the recently advanced algorithms
TPH-YOLOv5[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and Scaled-YOLOv4[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] in the field of object detection. The results
on the test dataset are shown in Table 1.
      </p>
      <table-wrap id="table1">
        <label>Table 1</label>
        <caption>
          <p>Detection results (mAP, %) on the TT100k test set by object size (S, M, L, and ALL). The proposed YOLOR-A is in the last column.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Size</th>
              <th/>
              <th/>
              <th/>
              <th/>
              <th>YOLOR-A</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>S</td><td>87.6</td><td>89.7</td><td>90.2</td><td>91.6</td><td>91.8</td></tr>
            <tr><td>M</td><td>91.7</td><td>92.0</td><td>92.2</td><td>94.3</td><td>95.5</td></tr>
            <tr><td>L</td><td>88.3</td><td>89.5</td><td>88.1</td><td>96.7</td><td>97.2</td></tr>
            <tr><td>ALL</td><td>88.1</td><td>91.2</td><td>91.8</td><td>93.8</td><td>94.7</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>[Figure 4. Detection examples, panels (a)-(d).]</p>
      <p>Some detection examples are shown in Figure 4. The YOLOR-A algorithm has the best detection
effectiveness compared with TPH-YOLOv5, Scaled-YOLOv4, and YOLOv5s, and its detected traffic
signs have the highest confidence, especially for small objects.</p>
      <p>Based on the heat map visualization experiments shown in Figure 5, we can conclude that the
problem of feature misalignment is solved by the inclusion of implicit knowledge.</p>
    </sec>
    <sec id="sec-8">
      <title>4. Conclusion</title>
      <p>In this paper, a traffic sign object detection algorithm based on the fusion of YOLOR and CBAM is
proposed. This method makes use of the implicit knowledge in a neural network to overcome the
feature misalignment problem, and incorporates the CBAM attention mechanism so that the object
detector can focus on the important feature areas of the traffic signs. The experimental results show that
the proposed algorithm obtains better performance than other competitive algorithms.</p>
    </sec>
    <sec id="sec-9">
      <title>5. Acknowledgements</title>
      <p>This work is supported by the Natural Science Foundation of Sichuan, China (No. 2022NSFSC0571)
and the Sichuan Science and Technology Program (No. 2018JY0273, No. 2019YJ0532). This work is
supported by funding of V.C. &amp; V.R. Key Lab of Sichuan Province (No. SCVCVR2020.05VS). This
work is also supported by the China Scholarship Council (No. 201908510026).</p>
    </sec>
    <sec id="sec-10">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . Gao,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>Real-time small traffic sign detection with revised faster-RCNN, Multimedia Tools</article-title>
          and Applications,
          <volume>78</volume>
          (
          <year>2019</year>
          )
          <fpage>13263</fpage>
          -
          <lpage>13278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Group multi-scale attention pyramid network for traffic sign detection</article-title>
          ,
          <source>Neurocomputing</source>
          ,
          <volume>452</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <source>Pedestrian Reidentification Algorithm Based on Local Feature Fusion Mechanism, Journal of Electrical Computer Engineering</source>
          ,
          <year>2022</year>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.C.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.G.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W.Y. Liu,
          <string-name>
            <given-names>T.S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          , AlignSeg:
          <string-name>
            <surname>Feature-Aligned Segmentation</surname>
            <given-names>Networks</given-names>
          </string-name>
          ,
          <source>Ieee Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>44</volume>
          (
          <year>2022</year>
          )
          <fpage>550</fpage>
          -
          <lpage>557</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>Coupling Implicit and Explicit Knowledge for Customer Volume Prediction</article-title>
          ,
          <source>The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17</source>
          )
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>J.U.</surname>
          </string-name>
          zu Belzen, T. Burgel,
          <string-name>
            <given-names>S.</given-names>
            <surname>Holderbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gandor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mathony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pfuderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Platz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Przybilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schwendemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Heid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.D.</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jendrusch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Waldhauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Niopek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eils</surname>
          </string-name>
          ,
          <article-title>Leveraging implicit knowledge in neural networks for functional dissection and engineering of proteins</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          ,
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>225</fpage>
          -
          <lpage>235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-H.</given-names>
            <surname>Yeh</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-Y.M. Liao</surname>
          </string-name>
          ,
          <article-title>You only learn one representation: Unified network for multiple tasks</article-title>
          ,
          <source>arXiv preprint arXiv:2105.04206</source>
          , (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Woo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          , J.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.S.</given-names>
            <surname>Kweon</surname>
          </string-name>
          , Cbam:
          <article-title>Convolutional block attention module</article-title>
          ,
          <source>Proceedings of the European conference on computer vision</source>
          (ECCV)
          <year>2018</year>
          ), pp.
          <fpage>3</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <article-title>YOLOv4: Optimal Speed and Accuracy of Object Detection</article-title>
          , arXiv:
          <year>2004</year>
          .
          <volume>10934</volume>
          , (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Path aggregation network for instance segmentation</article-title>
          ,
          <source>Proceedings of the IEEE conference on computer vision and pattern recognition2018)</source>
          , pp.
          <fpage>8759</fpage>
          -
          <lpage>8768</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Farhadi,</surname>
          </string-name>
          <article-title>Yolov3: An incremental improvement</article-title>
          , arXiv:
          <year>1804</year>
          .
          <volume>02767</volume>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Traffic-Sign Detection and Classification in the Wild, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</article-title>
          <year>2016</year>
          ), pp.
          <fpage>2110</fpage>
          -
          <lpage>2118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          <string-name>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft coco: Common objects in context</article-title>
          ,
          <source>European conference on computer vision</source>
          ,
          <source>(Springer2014)</source>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasconcelos</surname>
          </string-name>
          ,
          <string-name>
            <surname>Cascade</surname>
          </string-name>
          r-cnn:
          <article-title>Delving into high quality object detection</article-title>
          ,
          <source>Proceedings of the IEEE conference on computer vision and pattern recognition2018)</source>
          , pp.
          <fpage>6154</fpage>
          -
          <lpage>6162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , Ssd:
          <article-title>Single shot multibox detector</article-title>
          ,
          <source>European conference on computer vision</source>
          ,
          <source>(Springer2016)</source>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , TPH-YOLOv5:
          <article-title>Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios</article-title>
          ,
          <source>Proceedings of the IEEE/CVF International Conference on Computer Vision2021)</source>
          , pp.
          <fpage>2778</fpage>
          -
          <lpage>2788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>C.-Y. Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bochkovskiy</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-Y.M. Liao</surname>
          </string-name>
          , Scaled-yolov4:
          <article-title>Scaling cross stage partial network</article-title>
          ,
          <source>Proceedings of the IEEE/cvf conference on computer vision and pattern recognition2021)</source>
          , pp.
          <fpage>13029</fpage>
          -
          <lpage>13038</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>