<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Attention Enhancement of YOLO for Vehicle Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Caixiao Ouyang</string-name>
          <email>ouyangcaixiao@163.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hu Jiwei</string-name>
          <email>hujiwei@fiberhome.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Youyuan She</string-name>
          <email>yongyuanshe@163.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chunzhi Wang</string-name>
          <email>chunzhiwang@hbut.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hubei University of Technology</institution>
          ,
          <addr-line>Wuhan 430068</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Wuhan Fiberhome Technical Services Co., Ltd.</institution>
          ,
          <addr-line>Wuhan 430205</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Wuhan Vocational College of Sortware and Engineering</institution>
          ,
          <addr-line>Wuhan 430205</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Vehicle detection and recognition is an important research. An attention and feature fusion target detection algorithm based on the improved YOLOv4 algorithm is proposed to achieve a more effective screening of vehicle targets in traffic scenes. Considering the cost deployment problem of traffic recognition algorithms, this paper uses YOLOv4 as the base architecture, firstly, the lightweight DenseNet is used as the backbone feature extraction network; secondly, effective channel attention (ECA) and Adaptive Spatial Feature Fusion (ASFF) are used to enhance the PANet structure with attention-guided fusion; in addition, the weight ratio of the loss function is optimized and the mosaic method is used for training enhancement.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Target detection</kwd>
        <kwd>vehicle detection</kwd>
        <kwd>YOLOv4</kwd>
        <kwd>feature fusion</kwd>
        <kwd>attention mechanism</kwd>
        <kwd>lightweighting 1</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        YOLOv1 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] achieves real-time performance of 155 fps. The algorithm divides the network
into multiple grids, and each grid is responsible for predicting only the location and class of
targets whose centers fall on that grid. This was followed by the SSD [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and YOLOv2 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
both of which improved detection accuracy and speed. However, the accuracy of these
algorithms is still relatively limited, especially for small targets. YOLOv3 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] uses an
Anchorbased approach that allows targets at different scales to be preassigned a close detection frame
form, although YOLOv3 uses MSE as the border regression loss function, which makes
YOLOv3's localization of targets not precise. RetinaNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] analyzes the category imbalance
problem existing in the first stage of network training and proposes Focal loss that can
automatically adjust the weights according to the Loss size, making the training more focused
on difficult samples. Yolov4 introduces the SPP module [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Mish [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] activation function, etc.,
to improve the performance of the network.
      </p>
      <p>
        With the development of deep learning algorithms, multi-target and multi-scale detection
in complex environments, severe partial occlusion of vehicles, and high requirements for
computing hardware are in the focus of research [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        FPN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a network for solving multi-scale detection problems. It uses a pyramid structure
to make features flow between vertical and horizontal and propagates semantic information
between multiple layers to build multi-scale features. However, FPN does not handle the
difference of information at different levels reasonably, and the operation of fused features is
obtained by summing the higher-level features with the next level directly after sampling,
which limits the self-learning of features. Therefore, recently appeared works to optimize and
improve FPN. For example, PANet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] adds an extra top-down path to the original structure
and adopts a channel superposition when fusing features, which both uses new feature
information and ensures the preservation of original features. In addition, the attention
mechanism (AM) is gradually becoming a popular method to improve detection performance.
Various attention modules, used as a plug-and-play component, bring good performance
improvements at an acceptable model complication. They select from the channels or spatial
dimensions of the model and filter out the feature information that is more interesting and
better matches the detection target.
      </p>
      <p>This paper proposes a vehicle detection algorithm based on feature fusion and attention
enhancement, which can purposefully alleviate the problems of missed detection, false
detection, and accuracy degradation caused by detection scale or occlusion while reducing the
complexity of the model. The main work of this paper is as follows:</p>
      <p>
        1. DenseNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with lower complexity is used as the backbone network of the
detection model.
      </p>
      <p>
        2. Introducing effective channel attention (ECA) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] attention network, filling in the
structure between the backbone and neck layer to achieve a smooth transition of features and
selection of channel information.
      </p>
      <p>
        3. Improving the network structure of the feature pyramid, adding Adaptive Spatial
Feature Fusion (ASFF) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] fusion module based on PANet.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Libertinus fonts for Linux Related Materials</title>
      </sec>
      <sec id="sec-2-2">
        <title>2.1.1. One-stage Target Detection</title>
        <p>
          The YOLO series algorithm innovates on the detection principle of the Faster Region-based
CNN (R-CNN) series by abandoning the RPN approach and using regression to obtain the
coordinate information of the bbox. YOLOv1 is an one-stage target detection algorithm. This
algorithm was quickly deployed in many real-world projects due to the dramatic increase in
detection speed. Many one-stage target detection algorithms have emerged since then [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>YOLOv4 consists of the CSPDarknet53 backbone network, SPPNet, PANet feature fusion
network, and the YOLO-Head detection head module, that is used in YOLOv3. It is shown in
Figure 1.</p>
        <p>CSPDarknet53 is an improvement on Darknet53, which uses the CSPNet structure and
applies a more extensive residual structure to reduce the information loss during training and
further enhance the learning ability of the network. The activation function Leaky ReLU is
replaced by the Mish function, whose upward unbounded property avoids model saturation
due to numerical capping. In addition, its micro design for negative values brings better
gradient flow. The Mish smoothed activation function ensures better accuracy and
generalization.</p>
        <p>Between the backbone network and the detection head is the Neck layer, which is
composed of the SPP (Spatial Pyramid Pooling) module and the PANet module. The output of
the backbone network is adjusted by the convolutional layer and used as the input of the SPP
module. The SPP outputs the input data after doing maximum pooling and data stacking at
different scales, and is adjusted by the convolutional layer and used as the input of the PANet
network together with the two intermediate layers of the backbone network. PANet does
further fusion of three sets of feature maps at different scales by some convolution,
upsampling, downsampling and data stacking to enhance the perceptual field of feature maps
at different scales and output three layers of data information.</p>
        <p>The YOLO-Head in the detection layer receives the input from PANet and performs the
final prediction process. The YOLO-Head with three a priori frames each will predict three
feature maps with scales of 13X13, 26X26, and 52X52, respectively, and based on the a priori
frame analysis information, the preliminary prediction frame will be output after
nonmaximum suppression.</p>
        <p>In this paper, we improve the training and inference speed of the one-stage detection
algorithm by modifying the backbone network of the model, based on the YOLOv4 algorithm,
and improve the model structure using the AM and feature fusion module to enhance the
detection performance of the algorithm.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.1.2. Feature Parymid Network (FPN)</title>
        <p>Feature Pyramid Representation (FPN) addresses the challenge of scale variation in target
detection. Its structural layer design allows the model to better utilize the feature information
extracted from the backbone network.</p>
        <p>Initial target detection, either one-stage or two-stage, is usually performed with an external
detection head after the feature map is output at the last layer of the last stage of Backbone.
This approach is called the single-stage object detection algorithm. However, in this
algorithm, the scale of the last output feature map of the backbone is too different from the
input image, which is easy to cause information loss, especially the detection capability of
small targets is insufficient. Subsequent studies found that the single-stage target detection
algorithm cannot effectively transfer the information of various scales in the original image.
Therefore, later target detection algorithms gradually developed into a feature pyramid
network (FPN) using multi-scale, multi-stage feature maps to enhance the characterization
ability of the model.</p>
        <p>The FPN evolved through continuous iterations and can be divided into four models, as
shown in Figure 2.</p>
        <p>1. A typical representative of fusion-free and at the same time utilizing multi-scale
features is the SSD algorithm, which directly predicts objects of different sizes from the
feature maps outputted by different stages.</p>
        <p>
          2. There are many classical models that use algorithms with top-down fusion approach,
such as Faster RCNN, Mask RCNN [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], Yolov3, RetinaNet, etc. They use the same kind of
FPN models, and the difference is that feature maps of different scales are involved in feature
fusion.
        </p>
        <p>3. PANet proposes a top-down model followed by an additional bottom-up secondary
fusion, which can be called a bidirectional fusion structure. YOLOv4 uses a fine-tuned version
of PANet, which makes feature fusion not additive, but feature stacking.</p>
        <p>
          4. The proposed PANet proved the effectiveness of bidirectional fusion, introduced more
complex bidirectional fusion structures, such as NAS-FPN [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and BiFPN [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>Various FPNs are designed to maximize the utilization of the multi-scale feature maps from
backbone, and its optimization leads to significant improvement of object detection.
Therefore, the algorithms in this paper in concert with the fusion of PANet and ASFF to
enhance the reuse and extraction of feature maps and avoid the loss of effective information
[18, 19].</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.1.3. Attention Mechanisms</title>
        <p>The AM focuses on local information while suppressing distracting information [20]. From a
mathematical point of view, AMs provide a weight-based model to perform operations. The
process of extracting image features from feature maps in a NN is seen to vary in the degree
to which different feature maps provide overall information [21]. The AM uses the network
layer to calculate the weight values corresponding to the relevant feature maps, and then
applies these weights to the feature maps, so that the feature maps with a large role in
extracting information become somewhat more influential on the overall [22]. The AMs can
currently be classified into following types: channel AMs, spatial AMs, and mixed spatial and
channel AMs.
2.1.3.1. Spatial AM
Not all regions in an image are equally important, only the task-relevant regions are
important. The spatial attention model is to find the most important parts of the network for
processing.</p>
        <p>The Spatial Transformer Network (STN) [23] is a spatial-based Attention by learning the
shape change of the input so as to accomplish preprocessing operations suitable for a specific
task. The ST module consists of the Localisation net, the Grid generator and Sample. The
Localisation net determines the parameter θ of the input required transformation. The Grid
generator finds the mapping T(θ) of the output to the input features by θ and the defined
transformation. The Sample combines the location mapping and transformation parameters to
select the input features and combine them with bilinear interpolation for the output.
2.1.3.2. Channel AM
For a set of images processed by the CNN, its effective information can be extracted from two
dimensions. One dimension is the scale of the image in space, that is, the length and width.
The other dimension is the channel information. Therefore, Attention based on channel
orientation is also common.</p>
        <p>SENet (Sequeeze and Excitation Net) [24] is a channel type Attention model, which
automatically enhances or suppresses channels after model learning by modeling the
importance of each feature channel. It divides a bypass branch after the normal convolution
operation, and this branch is compressed and fully connected to obtain a set of weight values.
By applying this set of weights to each of the original feature channels, the importance of the
different channels can be learned.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.1.3.3. Fusion of spatial and channel AMs</title>
        <p>CBAM (Convolutional Block Attention Module) [25] is a representative network that
combines spatial and channel AMs. It uses a channel-then-space approach for collocation, so
that the model models the important information of channel and spatial locations separately.</p>
        <p>Besides these, there are many other AMs related to research [26, 27]</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.2. The Proposed Method</title>
        <p>Figure 3 shows the architecture of the proposed algorithm. It takes one-stage target detection
algorithm YOLOv4 as the reference architecture and divides the algorithm framework into
four parts: data pre-processing and input, backbone network, FPN structure and prediction
network. The pre-processed images are sent to the backbone network, which adopts a
lightweight DenseNet structure consisting of different numbers of Dense Blocks and
Transition Layers. Depending on the number of sub-module overlays, the backbone network
extracts the feature information at different scales and passes it into the FPN network. Before
this information is passed into SPPNet and PANet, the feature information will be further
filtered and refined by three ECA attention modules. Then the information output from the
bidirectional fusion-type network PANet is fed into the complex fusion network ASFF, which
makes the feature map information at different scales form the interaction. Finally, the
information extracted from the ASFF network is fed into the YOLO detection head, and the
prediction results of the image are obtained after the information decoding and other
operations. Next, the backbone network, FPN structure and loss function of the algorithm in
this paper are described in more detail, respectively.</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.2.1. Lightweighting Of The Backbone</title>
        <p>The lightweight network DenseNet is integrated and bridged with the original YOLOv4 to
achieve faster, more accurate, and less computationally intensive target detection results.
Specifically, the backbone network is replaced with DenseNet-121, and the rest of the
architecture is optimized on the basis of YOLOv4.</p>
        <p>As another type of CNN with deeper layers, it has the following advantages:
1. Fewer number of parameters compared to ResNet.
2. More emphasis and encouragement on feature reuse.
3. The network is easier to train and has some regularization effect.</p>
        <p>4. The problems of gradient vanishing and model degradation are alleviated.</p>
        <p>DenseNet is mainly composed of Dense Blocks and Transition Layers.</p>
        <p>The dense block is composed of several bottle necks. Each block uses the same number of
output channels, and then uses a loop to connect the input and output of each block in the
channel dimension. The structure of the bolt neck is shown in the upper part of Figure 4.</p>
        <p>BN-ReLU is placed before the convolution module for processing. Each Bottle Neck
contains two convolutions, the first one is a 1*1 convolution, which has 4k output channels.
Here, k is a feature map growth factor, which is the number of feature maps contributed by
each Bottle Neck. The second 3*3 convolution has k output channels. Finally, the input of the
module and the output of the 3*3 convolution are concat stacked to obtain the overall number
of output channels of the module as C`+k.</p>
        <p>The Dense Block structure is shown in the middle part of Fig. 4. It consists of several Bottle
Necks. The number of input channels of the whole Dense Block is C0. Since the output of
Bottle Neck stacks, the output and input of the final convolutional structure in its interior, the
number of feature channels will be increased by k for each Bottle Neck that passes through it.
Therefore, the number of final output feature maps of a Dense Block composed of n Bottle
Neck is C0+nk. The input of each Bottle Neck is a stack of all the outputs of its preceding
layers.</p>
        <p>The Transition Layer controls the model complexity. Its structure is shown in the bottom
of Fig. 4. Since the number of channels increases with each Dense Block connec-tion, its
overuse will result in an overly complex model. Therefore, the Transition Layer first reduces
the number of channels by a 1×1 convolution layer, and then to compress the height and
width of the feature map, an average pooling layer with stride=2 is used for downsampling,
which further reduces the model complexity.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.2.2. Citation of Attentional Mechanisms</title>
        <p>Among the types of attention modules, channel AMs have great potential in im-proving the
performance of deep CNNs. However, there are a large number of AMs developing more
complex attention modules to achieve better performance, which will inevitably increase the
complexity of the model. To strike a balance between model complexity and performance, this
paper refers to an effective channel attention module (ECA) that contains only a small number
of parameters while delivering significant performance gains.</p>
        <p>SE-Net is the basis of ECA-Net optimization and its structure is shown in Figure 5(a).
Global average pooling is first performed separately for each input channel, followed by two
fully connected layers using different activation functions. This computational process causes
the channel features to be mapped from high to low and then to high dimensions. This
dimensionality reduction operation reduces the complexity of the model, but it also cause the
loss of critical information.</p>
        <p>ECA-Net empirically shows, by observing SE-Net and improving it, that avoiding
dimensionality reduction is important for learning channel attention and that proper
crosschannel interaction can increase model complexity only slightly while maintaining
performance. Its structural design is shown in Figure 5(b).</p>
        <p>On the left is the feature of the original input image, which is first subjected to global
average pooling (GAP) [28] to obtain a 1×1×C feature map, on which ECA obtains the local
cross-channel interaction by fast one-dimensional convolution of size K, where the parameter
K can be generated by an adaptive function based on the size of the input channel C, which
represents the local coverage of the cross-channel interaction. After that, a Sigmoid function is
used to generate the weight share of each channel, and then the original input features are
combined with the channel weights to obtain the features with channel attention. The
network constructed with this module makes it easier to extract discriminative features of
images based on channel dimensionality.</p>
        <p>To avoid the consumption of large computational resources due to manual adjustment, the
size of the parameter k can be generated adaptively by a function with the convolution kernel
k calculated as:
(1)
where |t|odd denotes the odd number of t-nearest neighbors, γ is set to 2, and b is 1. From (1),
it is clear that the communication range of the high-dimensional channel is longer, while the
communication range of the low-dimensional channel is relatively contracted.</p>
        <p>In this paper, three ECA layers are inserted at the connection between Backbone and Neck
of the model to avoid dimensionality reduction while better bridging the two components,
making the feature transfer of the model more efficient and preventing the disappearance of
feature information. At the same time, the ECA layer allows the model to focus on more
critical features and suppress unnecessary features, which improves the detection accuracy.</p>
      </sec>
      <sec id="sec-2-9">
        <title>2.2.3. Spatially Adaptive Fusion Of Feature Layers</title>
        <p>In general, the lower level features of the network contain more location information and the
higher level features contain more semantic information. The PANet structure is used in
YOLOV4 to further fuse and output the higher and lower level features. After downsampling,
the network does bidirectional propagation and then upsampling, and fuses the information
from the same level downsampling by lateral connection, and then sends the feature
information of different scales to different detectors.</p>
        <p>However, the PANet connection simply stacks the top-down and bottom-up layers of
information together, and there is a lack of communication between the layers to transfer the
information. To more fully utilize the semantic information of the high-level features and the
fine-grained features of the underlying features, this paper introduces a new feature fusion
method, Adaptive Spatial Feature Fusion (ASFF), in the proposed algorithm.</p>
        <p>ASFF can enhance the extraction capability of PANet and can fuse the information of
multiple feature layers simultaneously. Its idea is to adaptively adjust the spatial weights of
each scale features in fusion by learning. Its underlying structure is shown in Figure 6.</p>
        <p>Figure 7 shows the operation of layers in ASFF. First, X1, X2 and X3 are derived from the
feature information at different scales of level1, level2 and level3 output in PANet,
respectively.</p>
        <p>The ASFF-3 is an example of a convolution with the kernel of 3*3, the step size of 2, and a
padding of 1. The X2 is scaled down to the same value as X3 with equal number of channels,
and is denoted as level_1_resized. The number of channels and dimensionality of
level_1_resized, level_2_resized and X3 are the same. Finally, level_1_resized, level_2_resized,
and X3 are multiplied by α, β, and γ, respectively, and the values are summed, and the number
of channels is adjusted by a final convolutional layer to obtain a new feature layer with
multilayer perceptual field fusion. The formula is expressed as follows:
(2)
where yijl represents the new feature map of a layer obtained by ASFF, αijl, βijl, and γijl
represent the weight parameters learned through the three feature layers, and αijl+βijl+γijl=1 is
guaranteed by the Softmax function.</p>
        <p>where yijl represents the new feature map of a layer obtained by ASFF, αijl, βijl, and γijl
represent the weight parameters learned through the three feature layers, and αijl+βijl+γijl=1 is
guaranteed by the Softmax function.</p>
      </sec>
      <sec id="sec-2-10">
        <title>2.2.4. Design of the loss function</title>
        <p>The loss function contains three components: confidence error Lconf classification error Lcls,
and regression frame prediction error Lloc [29]. CIoU loss was used in the regression frame
prediction error. CIoU is based on IoU, GIoU, and DIoU, and the CIOU takes into account
three geometric factors, which are overlap area, centroid distance, and aspect ratio [30].
(4)
(5)
where S² is the number of grids, B is the number of prediction frames in each grid,
,
are the indicated values of the prediction frames containing and not containing the target,
is the confidence true value, C is the prediction confidence,
is the penalty weight factor,
is the actual probability that the target in the cell belongs to category c, P(c) is the
probability that the prediction is of category c, wgt, hgt are the width and height of the true
frame, respectively, IoU(X , Y) is the intersection ratio of the predicted frame X to the real
frame Y, ρ2(Xctr, Yctr) is the Euclidean distance between the center point of the predicted
frame and the real frame, m is the diagonal distance of the minimum closed region containing
both the predicted and real frames, u is the balance adjustment parameter, and v is the
parameter measuring the consistency of the aspect ratio.</p>
        <p>To balance the loss sensitivity of different detection scales, in this paper, the three
prediction heads in the network structure are multiplied with different weights when
calculating the total loss. The weights assigned to Yolo Head1, Yolo Head2, and Yolo Head3
are 0.4, 1.0, and 4.0, respectively [31].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion</title>
      <p>This paper focuses on the One-Stage target detection method which has higher requirements
for detection speed and deployment cost. It helps cameras in traffic scenes to recognize vehicle
information and perform vehicle model discrimination. A lightweight target detection
algorithm based on attention and feature augmentation is proposed to address the problem of
the demand for vehicle detection in smart city construction. The complexity of the algorithm
is strictly controlled. The proposed algorithm uses YOLOv4 as the base architecture: (i)
significantly reduces the number of model parameters by replacing the DenseNet, which has
excellent performance, as the backbone feature extraction network;v(ii) reconstructs the
existing FPN network module, uses the ECA attention structure for the transition and transfer
of feature information between Backbone and Neck, as well as adds the information
crossfusion function before the final detection layer of the network of the ASFF structure; (iii)
while optimizing in terms of the loss function and image preprocessing.
[18] Y. Gong, X. Yu, Y. Ding, X. Peng, J. Zhao and Z. Han, "Effective Fusion Factor in FPN for
Tiny Object Detection," 2021 IEEE Winter Conference on Applications of Computer
Vision (WACV), 2021, pp. 1159-1167, doi: 10.1109/WACV48630.2021.00120.
[19] Xiong, G., Przystupa, K., Teng, Y., et al. (2021). Online measurement error detection for
the electronic transformer in a smart grid. Energies, 14(12), 3551.
[20] Jiang, K., Zhang, C., Wei, B., Li, Z., &amp; Kochan, O. (2024). Fault diagnosis of RV reducer
based on denoising time–frequency attention neural network. Expert Systems with
Applications, 238, 121762.
[21] Xu, X., Przystupa, K., &amp; Kochan, O. (2023). Social Recommendation Algorithm Based on</p>
      <p>Self-Supervised Hypergraph Attention. Electronics, 12(4), 906.
[22] Y. Wang and A. Zell, "Yolo+FPN: 2D and 3D Fused Object Detection With an RGB-D
Camera," 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp.
4657-4664, doi: 10.1109/ICPR48806.2021.9413066.
[23] Jaderberg M, Simonyan K, Zisserman A. Spatial transformer networks[J]. Advances in
neural information processing systems, 2015, 28.
[24] J. Hu, L. Shen and G. Sun, "Squeeze-and-Excitation Networks," 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132-7141, doi:
10.1109/CVPR.2018.00745.
[25] Woo S, Park J, Lee J Y, et al. Cbam: Convolutional block attention
module[C]//Proceedings of the European conference on computer vision (ECCV). 2018:
319.
[26] F. Wang et al., "Residual Attention Network for Image Classification," 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6450-6458,
doi: 10.1109/CVPR.2017.683.
[27] Deng J, Cheng L, Wang Z. Attention-based BiLSTM fused CNN with gating mechanism
model for Chinese long text classification[J]. Computer Speech &amp; Language, 2021, 68:
101182.
[28] Lin M , Chen Q , Yan S . Network In Network[J]. Computer Science, 2013.
[29] N. K. Kim and H. K. Kim, "Polyphonic Sound Event Detection Based on Residual
Convolutional Recurrent Neural Network With Semi-Supervised Loss Function," in IEEE
Access, vol. 9, pp. 7564-7575, 2021, doi: 10.1109/ACCESS.2020.3048675.
[30] Zheng Z, Wang P, Liu W, et al. Distance-IoU loss: Faster and better learning for bounding
box regression[C]//Proceedings of the AAAI conference on artificial intelligence. 2020,
34(07): 12993-13000.
[31] Jiang Z, Fan Y. Singularity intensity function analysis of autoregressive spectrum and its
application in weak target detection under sea clutter background[J]. Radio Science, 2020,
55(10): 1-8.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W</given-names>
            <surname>J. Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>"You Only Look Once: Unified, RealTime Object Detection,"</article-title>
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          , doi: 10.1109/CVPR.
          <year>2016</year>
          .
          <volume>91</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Liu</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anguelov</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            <given-names>D</given-names>
          </string-name>
          , et al,
          <article-title>"SSD: Single Shot MultiBox Detector," 2016 Computer Vision and Pattern Recognition (ECCV</article-title>
          ),
          <year>2016</year>
          , doi: 10.1007/978-3-
          <fpage>319</fpage>
          -46448-
          <issue>0</issue>
          _
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>"YOLO9000: Better, Faster, Stronger,"</article-title>
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6517</fpage>
          -
          <lpage>6525</lpage>
          , doi: 10.1109/CVPR.
          <year>2017</year>
          .
          <volume>690</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Przystupa</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kochan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <source>Examination of Abnormal Behavior Detection Based on Improved YOLOv3. Electronics</source>
          <year>2021</year>
          ,
          <volume>10</volume>
          ,
          <fpage>197</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. -Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>"Focal Loss for Dense Object Detection,"</article-title>
          <source>in IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>318</fpage>
          -
          <issue>327</issue>
          , 1 Feb.
          <year>2020</year>
          , doi: 10.1109/TPAMI.
          <year>2018</year>
          .
          <volume>2858826</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>"Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,"</article-title>
          <source>in IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>37</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>1904</fpage>
          -
          <lpage>1916</lpage>
          , 1 Sept.
          <year>2015</year>
          , doi: 10.1109/TPAMI.
          <year>2015</year>
          .
          <volume>2389824</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Misra</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Mish</surname>
          </string-name>
          :
          <article-title>A self regularized non-monotonic neural activation function[J]</article-title>
          .
          <source>arXiv preprint arXiv:1908.08681</source>
          ,
          <year>2019</year>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):
          <fpage>10</fpage>
          .
          <fpage>48550</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ailing</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <article-title>"Fine-grained vehicle recognition method based on improved ResNet,"</article-title>
          <source>2020 2nd International Conference on Information Technology and Computer Application (ITCA)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>592</lpage>
          , doi: 10.1109/ITCA52113.
          <year>2020</year>
          .
          <volume>00129</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T. -Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <article-title>"Feature Pyramid Networks for Object Detection,"</article-title>
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>936</fpage>
          -
          <lpage>944</lpage>
          , doi: 10.1109/CVPR.
          <year>2017</year>
          .
          <volume>106</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>"Path Aggregation Network for Instance Segmentation,"</article-title>
          <source>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>8759</fpage>
          -
          <lpage>8768</lpage>
          , doi: 10.1109/CVPR.
          <year>2018</year>
          .
          <volume>00913</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van Der</given-names>
            <surname>Maaten</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>"Densely Connected Convolutional Networks,"</article-title>
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2261</fpage>
          -
          <lpage>2269</lpage>
          , doi: 10.1109/CVPR.
          <year>2017</year>
          .
          <volume>243</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zuo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>"ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks,"</article-title>
          <source>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>11531</fpage>
          -
          <lpage>11539</lpage>
          , doi: 10.1109/CVPR42600.
          <year>2020</year>
          .
          <volume>01155</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Liu</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>Y</given-names>
          </string-name>
          .
          <article-title>Learning spatial fusion for single-shot object detection[J]</article-title>
          .
          <source>arXiv preprint arXiv:1911.09516</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          et al.,
          <article-title>"Dynamic Head: Unifying Object Detection Heads with Attentions,"</article-title>
          <source>2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>7369</fpage>
          -
          <lpage>7378</lpage>
          , doi: 10.1109/CVPR46437.
          <year>2021</year>
          .
          <volume>00729</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>He</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gkioxari</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollár</surname>
            <given-names>P</given-names>
          </string-name>
          , et al.
          <string-name>
            <surname>Mask</surname>
          </string-name>
          r-cnn[C]//Proceedings of the IEEE international conference on
          <source>computer vision</source>
          .
          <year>2017</year>
          :
          <fpage>2961</fpage>
          -
          <lpage>2969</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghiasi</surname>
          </string-name>
          , T. -
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>"NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection,"</article-title>
          <source>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>7029</fpage>
          -
          <lpage>7038</lpage>
          , doi: 10.1109/CVPR.
          <year>2019</year>
          .
          <volume>00720</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>"EfficientDet: Scalable and Efficient Object Detection,"</article-title>
          <source>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>10778</fpage>
          -
          <lpage>10787</lpage>
          , doi: 10.1109/CVPR42600.
          <year>2020</year>
          .
          <volume>01079</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>