<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Attention Enhancement of YOLO for Vehicle Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Caixiao</forename><surname>Ouyang</surname></persName>
							<email>ouyangcaixiao@163.com</email>
							<affiliation key="aff0">
<orgName type="department">Wuhan Vocational College of Software and Engineering</orgName>
								<address>
									<postCode>430205</postCode>
									<settlement>Wuhan</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hu</forename><surname>Jiwei</surname></persName>
							<email>hujiwei@fiberhome.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Wuhan Fiberhome Technical Services Co., Ltd</orgName>
								<address>
									<postCode>430205</postCode>
									<settlement>Wuhan</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Youyuan</forename><surname>She</surname></persName>
							<email>yongyuanshe@163.com</email>
							<affiliation key="aff0">
<orgName type="department">Wuhan Vocational College of Software and Engineering</orgName>
								<address>
									<postCode>430205</postCode>
									<settlement>Wuhan</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chunzhi</forename><surname>Wang</surname></persName>
							<email>chunzhiwang@hbut.edu.cn</email>
							<affiliation key="aff2">
								<orgName type="institution">Hubei University of Technology</orgName>
								<address>
									<postCode>430068</postCode>
									<settlement>Wuhan</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Attention Enhancement of YOLO for Vehicle Detection</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">89E37A82B2AF0F7DEDB85543E7ACBEC7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T20:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Target detection</term>
					<term>vehicle detection</term>
					<term>YOLOv4</term>
					<term>feature fusion</term>
					<term>attention mechanism</term>
<term>lightweighting</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Vehicle detection and recognition is an important research topic. An attention and feature-fusion target detection algorithm based on an improved YOLOv4 is proposed to screen vehicle targets in traffic scenes more effectively. Considering the deployment cost of traffic recognition algorithms, this paper uses YOLOv4 as the base architecture: first, the lightweight DenseNet is used as the backbone feature extraction network; second, effective channel attention (ECA) and Adaptive Spatial Feature Fusion (ASFF) are used to enhance the PANet structure with attention-guided fusion; in addition, the weight ratio of the loss function is optimized and the mosaic method is used for training augmentation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>YOLOv1 <ref type="bibr" target="#b0">[1]</ref> achieves real-time performance of 155 fps. The algorithm divides the input into multiple grids, and each grid is responsible for predicting only the location and class of targets whose centers fall on that grid. It was followed by SSD <ref type="bibr" target="#b1">[2]</ref> and YOLOv2 <ref type="bibr" target="#b2">[3]</ref>, both of which improved detection accuracy and speed. However, the accuracy of these algorithms is still relatively limited, especially for small targets. YOLOv3 <ref type="bibr" target="#b3">[4]</ref> uses an Anchor-based approach that pre-assigns closely matching detection-box shapes to targets at different scales, although YOLOv3 uses MSE as the bounding-box regression loss function, which makes its localization of targets imprecise. RetinaNet <ref type="bibr" target="#b4">[5]</ref> analyzes the category imbalance problem in one-stage detector training and proposes Focal Loss, which automatically adjusts sample weights according to loss magnitude, focusing training on difficult samples. YOLOv4 introduces the SPP module <ref type="bibr" target="#b5">[6]</ref>, the Mish <ref type="bibr" target="#b6">[7]</ref> activation function, and other improvements to enhance network performance.</p><p>With the development of deep learning algorithms, multi-target and multi-scale detection in complex environments, severe partial occlusion of vehicles, and high requirements on computing hardware are the focus of current research <ref type="bibr" target="#b7">[8]</ref>.</p><p>FPN <ref type="bibr" target="#b8">[9]</ref> is a network for solving multi-scale detection problems. It uses a pyramid structure to let features flow both vertically and horizontally, propagating semantic information across layers to build multi-scale features. 
However, FPN does not handle the differences in information at different levels reasonably: fused features are obtained by directly summing the upsampled higher-level features with the next level, which limits the self-learning of features. Therefore, recent works have optimized and improved FPN. For example, PANet <ref type="bibr" target="#b9">[10]</ref> adds an extra bottom-up path to the original structure and adopts channel concatenation when fusing features, which both exploits new feature information and preserves the original features. In addition, the attention mechanism (AM) is gradually becoming a popular method to improve detection performance. Various attention modules, used as plug-and-play components, bring good performance improvements at an acceptable increase in model complexity. They select from the channel or spatial dimensions of the model and filter out the feature information that better matches the detection target.</p><p>This paper proposes a vehicle detection algorithm based on feature fusion and attention enhancement, which alleviates the missed detections, false detections, and accuracy degradation caused by detection scale or occlusion while reducing the complexity of the model. The main work of this paper is as follows:</p><p>1. DenseNet <ref type="bibr" target="#b10">[11]</ref>, with lower complexity, is used as the backbone network of the detection model.</p><p>2. The effective channel attention (ECA) <ref type="bibr" target="#b11">[12]</ref> network is inserted between the backbone and the neck layer to achieve a smooth transition of features and selection of channel information.</p><p>3. The network structure of the feature pyramid is improved by adding the Adaptive Spatial Feature Fusion (ASFF) <ref type="bibr" target="#b12">[13]</ref> module on top of PANet.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Materials and Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Related Materials</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1.">One-stage Target Detection</head><p>The YOLO series of algorithms innovates on the detection principle of the Faster Region-based CNN (R-CNN) series by abandoning the RPN approach and using regression to obtain the coordinate information of the bounding box. YOLOv1 is a one-stage target detection algorithm. It was quickly deployed in many real-world projects due to its dramatic increase in detection speed. Many one-stage target detection algorithms have emerged since then <ref type="bibr" target="#b13">[14]</ref>.</p><p>YOLOv4 consists of the CSPDarknet53 backbone network, SPPNet, the PANet feature fusion network, and the YOLO-Head detection module also used in YOLOv3. It is shown in Figure <ref type="figure" target="#fig_0">1</ref>. CSPDarknet53 is an improvement on Darknet53: it uses the CSPNet structure and applies a more extensive residual structure to reduce information loss during training and further enhance the learning ability of the network. The Leaky ReLU activation function is replaced by the Mish function, whose unbounded upper range avoids model saturation due to numerical capping; in addition, its slight allowance for negative values brings better gradient flow. The smooth Mish activation function ensures better accuracy and generalization.</p><p>Between the backbone network and the detection head is the neck layer, composed of the SPP (Spatial Pyramid Pooling) module and the PANet module. The output of the backbone network is adjusted by a convolutional layer and used as the input of the SPP module. The SPP module applies maximum pooling at several scales to its input and stacks the results; the output is adjusted by a convolutional layer and fed into the PANet network together with two intermediate layers of the backbone network. 
PANet further fuses three sets of feature maps at different scales through convolution, upsampling, downsampling, and data stacking, enhancing the receptive field of the feature maps at each scale, and outputs three layers of data information.</p><p>The YOLO-Head in the detection layer receives the input from PANet and performs the final prediction. The three YOLO-Heads, with three anchor (a priori) boxes each, predict three feature maps with scales of 13×13, 26×26, and 52×52, respectively; based on the decoded anchor information, the final prediction boxes are output after non-maximum suppression.</p><p>In this paper, we improve the training and inference speed of the one-stage detection algorithm by modifying the backbone network of the model, based on the YOLOv4 algorithm, and improve the model structure with the AM and a feature fusion module to enhance the detection performance of the algorithm.</p></div>
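The non-maximum suppression step mentioned above can be sketched as follows. This is a minimal, illustrative greedy NMS, not the paper's code; the box format [x1, y1, x2, y2] and the function names are assumptions:

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as [x1, y1, x2, y2].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, drop every remaining box
    # whose overlap with it exceeds iou_thresh, and repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

For example, of two heavily overlapping boxes only the higher-scoring one survives, while a distant box is kept.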
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2.">Feature Pyramid Network (FPN)</head><p>The Feature Pyramid Network (FPN) addresses the challenge of scale variation in target detection. Its layered structural design allows the model to better utilize the feature information extracted from the backbone network. Early target detection, either one-stage or two-stage, was usually performed with an external detection head attached to the feature map output by the last layer of the last stage of the backbone, i.e., on a single-scale feature map. In this approach, however, the scale of the backbone's final feature map differs too much from the input image, which easily causes information loss; in particular, the detection capability for small targets is insufficient. Subsequent studies found that single-scale detection cannot effectively transfer the information of the various scales present in the original image. Therefore, later target detection algorithms gradually developed feature pyramid networks (FPN) using multi-scale, multi-stage feature maps to enhance the representation ability of the model.</p><p>The FPN evolved through continuous iterations and can be divided into four modes, as shown in Figure <ref type="figure" target="#fig_1">2</ref>. 1. A typical representative of using multi-scale features without fusion is the SSD algorithm, which directly predicts objects of different sizes from the feature maps output by different stages.</p><p>2. Many classical models use a top-down fusion approach, such as Faster R-CNN, Mask R-CNN <ref type="bibr" target="#b14">[15]</ref>, YOLOv3, and RetinaNet. They share the same kind of FPN model, differing in which feature-map scales participate in feature fusion.</p><p>3. 
PANet proposes a top-down model followed by an additional bottom-up secondary fusion, which can be called a bidirectional fusion structure. YOLOv4 uses a fine-tuned version of PANet, in which feature fusion is performed not by addition but by feature stacking.</p><p>4. After PANet proved the effectiveness of bidirectional fusion, more complex bidirectional fusion structures were introduced, such as NAS-FPN <ref type="bibr" target="#b15">[16]</ref> and BiFPN <ref type="bibr" target="#b16">[17]</ref>.</p><p>The various FPNs are designed to maximize the utilization of the multi-scale feature maps from the backbone, and their optimization leads to significant improvements in object detection. Therefore, the algorithm in this paper combines PANet with ASFF fusion to enhance the reuse and extraction of feature maps and avoid the loss of effective information <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref>.</p></div>
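The classic top-down fusion mode (the second type above) can be sketched numerically. This is a minimal sketch, assuming channel counts have already been unified by 1×1 convolutions and using nearest-neighbour upsampling in place of a learned layer:

```python
import numpy as np

def topdown_fpn(c3, c4, c5):
    # Top-down FPN fusion by summation, as in [9]: each higher-level map
    # is 2x-upsampled (nearest neighbour via repeat) and added to the
    # map one level below. Inputs are (H, W, C) feature maps whose
    # spatial sizes double from c5 to c4 to c3.
    p5 = c5
    p4 = c4 + p5.repeat(2, axis=0).repeat(2, axis=1)
    p3 = c3 + p4.repeat(2, axis=0).repeat(2, axis=1)
    return p3, p4, p5
```

With 13×13, 26×26, and 52×52 maps (as in YOLO), the fused outputs keep those three scales, which is exactly what makes summation-based fusion cheap but also what limits per-position self-learning compared with ASFF.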
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3.">Attention Mechanisms</head><p>The AM focuses on local information while suppressing distracting information <ref type="bibr" target="#b19">[20]</ref>. From a mathematical point of view, AMs provide a weight-based model for performing operations. In the process of extracting image features from feature maps in a neural network, different feature maps contribute to the overall information to varying degrees <ref type="bibr" target="#b20">[21]</ref>. The AM uses network layers to calculate the weight values corresponding to the relevant feature maps and then applies these weights to the feature maps, so that the feature maps that play a larger role in extracting information gain more influence on the overall result <ref type="bibr" target="#b21">[22]</ref>. AMs can currently be classified into the following types: channel AMs, spatial AMs, and mixed spatial-channel AMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3.1.">Spatial AM</head><p>Not all regions in an image are equally important; only the task-relevant regions matter. The spatial attention model finds the parts of the input most important for the network to process.</p><p>The Spatial Transformer Network (STN) <ref type="bibr" target="#b22">[23]</ref> is a spatial attention mechanism that learns a transformation of the input, accomplishing preprocessing operations suitable for a specific task. The ST module consists of the Localisation Net, the Grid Generator, and the Sampler. The Localisation Net determines the parameters θ of the transformation required for the input. The Grid Generator derives the mapping T(θ) from the output to the input features using θ and the defined transformation. The Sampler combines the location mapping and transformation parameters to select the input features and produces the output using bilinear interpolation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3.2.">Channel AM</head><p>For a set of images processed by a CNN, effective information can be extracted along two dimensions. One is the spatial scale of the image, that is, its length and width. The other is the channel dimension. Therefore, attention based on channel orientation is also common.</p><p>SENet (Squeeze-and-Excitation Net) <ref type="bibr" target="#b23">[24]</ref> is a channel-type attention model, which automatically enhances or suppresses channels after learning by modeling the importance of each feature channel. It adds a bypass branch after the normal convolution operation; this branch is compressed and passed through fully connected layers to obtain a set of weight values. By applying this set of weights to each of the original feature channels, the importance of the different channels can be learned.</p></div>
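The squeeze-excitation-scale pipeline described above can be sketched as follows. This is an illustrative, minimal version with plain matrices standing in for the two FC layers (the weight arguments `w1`, `w2` are assumptions, not SENet's trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    # Squeeze-and-Excitation on a feature map x of shape (H, W, C):
    #   squeeze:    global average pooling over spatial dims -> (C,)
    #   excitation: two FC layers (ReLU, then sigmoid) -> per-channel
    #               weights in (0, 1)
    #   scale:      reweight each channel of x by its learned importance.
    z = x.mean(axis=(0, 1))                  # squeeze
    s = sigmoid(np.maximum(z @ w1, 0) @ w2)  # excitation
    return x * s                             # channel-wise re-scaling
```

With all-zero weights the sigmoid outputs 0.5 everywhere, so every channel is simply halved; training moves these weights so that informative channels are boosted and uninformative ones suppressed.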
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3.3.">Fusion of spatial and channel AMs</head><p>CBAM (Convolutional Block Attention Module) <ref type="bibr" target="#b24">[25]</ref> is a representative network that combines spatial and channel AMs. It applies channel attention first and spatial attention second, so that the model captures the important information of channel and spatial locations separately.</p><p>Besides these, there is much other research related to AMs <ref type="bibr" target="#b25">[26,</ref><ref type="bibr" target="#b26">27]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">The Proposed Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1.">Lightweighting Of The Backbone</head><p>The lightweight network DenseNet is integrated and bridged with the original YOLOv4 to achieve faster, more accurate, and less computationally intensive target detection. Specifically, the backbone network is replaced with DenseNet-121, and the rest of the architecture is optimized on the basis of YOLOv4.</p><p>As a deeper type of CNN, DenseNet has the following advantages:</p><p>1. Fewer parameters compared to ResNet.  DenseNet is mainly composed of Dense Blocks and Transition Layers. A Dense Block is composed of several bottlenecks. Each bottleneck uses the same number of output channels, and the input and output of each bottleneck are connected in the channel dimension. The structure of the bottleneck is shown in the upper part of Figure <ref type="figure" target="#fig_5">4</ref>.</p><p>BN-ReLU is placed before each convolution module. Each bottleneck contains two convolutions: the first is a 1×1 convolution with 4k output channels, where k is the feature-map growth rate, i.e., the number of feature maps contributed by each bottleneck. The second, 3×3 convolution has k output channels. Finally, the input of the module and the output of the 3×3 convolution are stacked by concatenation, so the overall number of output channels of the module is C′+k.</p><p>The Dense Block structure is shown in the middle part of Fig. <ref type="figure" target="#fig_5">4</ref>. It consists of several bottlenecks. The number of input channels of the whole Dense Block is C0. Since each bottleneck concatenates its input with the output of its final convolution, the number of feature channels increases by k with each bottleneck passed. Therefore, the number of final output feature maps of a Dense Block composed of n bottlenecks is C0+nk. 
The input of each bottleneck is thus a stack of all the outputs of its preceding layers.</p><p>The Transition Layer controls the model complexity. Its structure is shown at the bottom of Fig. <ref type="figure" target="#fig_5">4</ref>. Since the number of channels increases with each Dense Block connection, overuse would result in an overly complex model. Therefore, the Transition Layer first reduces the number of channels with a 1×1 convolution layer; then, to compress the height and width of the feature map, an average pooling layer with stride 2 is used for downsampling, which further reduces the model complexity.</p></div>
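The channel bookkeeping above (C0 + nk per Dense Block, channel compression in each Transition Layer) can be checked with two one-line helpers. This is a sketch under the standard DenseNet-121 configuration (growth rate k = 32, blocks of 6/12/24/16 bottlenecks, compression factor 0.5), which is an assumption taken from the published DenseNet design rather than from this paper:

```python
def dense_block_channels(c_in, n, k):
    # A Dense Block of n bottlenecks, each contributing k feature maps
    # (the growth rate), outputs c_in + n*k channels.
    return c_in + n * k

def transition_channels(c_in, theta=0.5):
    # The Transition Layer's 1x1 convolution compresses the channel
    # count by a factor theta (0.5 in DenseNet-121) before the
    # stride-2 average pooling halves height and width.
    return int(c_in * theta)
```

Starting from a 64-channel stem, the first block gives 64 + 6·32 = 256 channels, compressed to 128 by the transition; chaining the four blocks this way reproduces DenseNet-121's stage widths.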
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2.">Citation of Attentional Mechanisms</head><p>Among the types of attention modules, channel AMs have great potential for improving the performance of deep CNNs. However, many works develop ever more complex attention modules to achieve better performance, which inevitably increases the complexity of the model. To strike a balance between model complexity and performance, this paper adopts the effective channel attention module (ECA), which contains only a small number of parameters while delivering significant performance gains. ECA-Net is an optimization of SE-Net, whose structure is shown in Figure <ref type="figure" target="#fig_6">5(a)</ref>. In SE-Net, global average pooling is first performed separately for each input channel, followed by two fully connected layers using different activation functions. This computation maps the channel features from high to low and then back to high dimensionality. The dimensionality reduction lowers the complexity of the model, but it also causes the loss of critical information.</p><p>By observing and improving on SE-Net, ECA-Net empirically shows that avoiding dimensionality reduction is important for learning channel attention, and that proper cross-channel interaction can maintain performance while increasing model complexity only slightly. Its structural design is shown in Figure <ref type="figure" target="#fig_6">5(b)</ref>.</p><p>On the left are the features of the original input image, which are first subjected to global average pooling (GAP) <ref type="bibr" target="#b28">[28]</ref> to obtain a 1×1×C feature map; on this, ECA obtains local cross-channel interaction through a fast one-dimensional convolution of size K, where the parameter K is generated by an adaptive function of the input channel count C and represents the local coverage of the cross-channel interaction. 
After that, a Sigmoid function generates the weight share of each channel, and the original input features are then combined with the channel weights to obtain features with channel attention. A network constructed with this module can more easily extract discriminative image features along the channel dimension.</p><p>To avoid the cost of manual tuning, the kernel size k is generated adaptively as $k = \psi(C) = \left|\frac{\log_2(C)}{\gamma} + \frac{b}{\gamma}\right|_{odd}$ (1), where $|t|_{odd}$ denotes the odd number nearest to t, γ is set to 2, and b to 1. From (1), it is clear that high-dimensional channels have a longer interaction range, while low-dimensional channels have a relatively contracted one.</p><p>In this paper, three ECA layers are inserted at the connections between the Backbone and Neck of the model to avoid dimensionality reduction while better bridging the two components, making the feature transfer of the model more efficient and preventing the disappearance of feature information. At the same time, the ECA layers allow the model to focus on more critical features and suppress unnecessary ones, which improves detection accuracy.</p></div>
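The adaptive kernel-size rule of Eq. (1) is easy to compute directly. A minimal sketch, following the nearest-odd rounding used in the ECA-Net reference implementation:

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    # Eq. (1): k = |log2(C)/gamma + b/gamma|_odd, i.e. the adaptive
    # 1-D convolution size grows logarithmically with the channel
    # count C and is rounded to the nearest odd number.
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1
```

For example, a 64-channel layer gets a kernel of 3 while a 256- or 512-channel layer gets 5, matching the observation that wider layers receive a longer cross-channel interaction range.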
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.3.">Spatially Adaptive Fusion Of Feature Layers</head><p>In general, the lower-level features of the network contain more location information and the higher-level features contain more semantic information. The PANet structure is used in YOLOv4 to further fuse and output the higher- and lower-level features. The network propagates bidirectionally, downsampling and then upsampling, fuses the information from same-level downsampling via lateral connections, and then sends feature information of different scales to different detectors.</p><p>However, the PANet connection simply stacks the top-down and bottom-up layers of information together, and there is a lack of communication between the layers to transfer information. To more fully utilize the semantic information of the high-level features and the fine-grained information of the low-level features, this paper introduces a new feature fusion method, Adaptive Spatial Feature Fusion (ASFF), into the proposed algorithm.</p><p>ASFF can enhance the extraction capability of PANet and can fuse the information of multiple feature layers simultaneously. Its idea is to adaptively adjust, by learning, the spatial weights of each scale's features during fusion. Its underlying structure is shown in Figure <ref type="figure" target="#fig_7">6</ref>.</p><p>Figure <ref type="figure" target="#fig_8">7</ref> shows the operation of the layers in ASFF. First, X1, X2, and X3 are the feature maps at the different scales of level 1, level 2, and level 3 output by PANet, respectively. Taking ASFF-3 as an example, X1 and X2 are rescaled (e.g., by a 3×3 convolution with stride 2 and padding 1) to the same resolution and channel count as X3, and are denoted level_1_resized and level_2_resized. The number of channels and the dimensions of level_1_resized, level_2_resized, and X3 are then the same. 
Finally, level_1_resized, level_2_resized, and X3 are multiplied by α, β, and γ, respectively, the results are summed, and the number of channels is adjusted by a final convolutional layer to obtain a new feature layer with multilayer receptive-field fusion. The formula is expressed as follows: $y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \rightarrow l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \rightarrow l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \rightarrow l}$ (2), where $y_{ij}^{l}$ represents the new feature map of layer l obtained by ASFF, $x_{ij}^{n \rightarrow l}$ denotes the feature vector from level n resized to level l, $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ represent the weight parameters learned for the three feature layers, and $\alpha_{ij}^{l}+\beta_{ij}^{l}+\gamma_{ij}^{l}=1$ is guaranteed by the Softmax function. </p></div>
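The weighted fusion of Eq. (2) can be sketched numerically. This is a minimal illustration, not the paper's implementation: the weight logits would normally come from 1×1 convolutions on the three resized maps, whereas here they are passed in directly as an array:

```python
import numpy as np

def softmax(w, axis=0):
    e = np.exp(w - w.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asff_fuse(x1, x2, x3, logits):
    # Adaptively fuse three same-shape feature maps (H, W, C) with
    # per-position weights alpha, beta, gamma as in Eq. (2).
    # logits has shape (3, H, W, 1); the softmax over the first axis
    # guarantees alpha + beta + gamma = 1 at every spatial position.
    w = softmax(logits, axis=0)
    return w[0] * x1 + w[1] * x2 + w[2] * x3
```

With equal logits each map contributes one third, so fusing maps of constant values 1, 2, and 3 yields a map of 2s; training skews the logits so each position draws mostly on the most informative scale.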
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.4.">Design of the loss function</head><p>The loss function contains three components: the confidence error Lconf, the classification error Lcls, and the regression-box prediction error Lloc <ref type="bibr" target="#b29">[29]</ref>. The CIoU loss is used as the regression-box prediction error. CIoU builds on IoU, GIoU, and DIoU, and takes into account three geometric factors: overlap area, centroid distance, and aspect ratio <ref type="bibr" target="#b30">[30]</ref>.</p><p>$L_{conf} = -\sum_{i=0}^{S^2}\sum_{j=0}^{B} I_{ij}^{obj}\left[\hat{C}_i \ln C_i + (1-\hat{C}_i)\ln(1-C_i)\right] - \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B} I_{ij}^{noobj}\left[\hat{C}_i \ln C_i + (1-\hat{C}_i)\ln(1-C_i)\right]$ (3) $L_{cls} = -\sum_{i=0}^{S^2} I_{ij}^{obj} \sum_{c \in classes}\left[\hat{p}_i(c) \ln P_i(c) + (1-\hat{p}_i(c))\ln(1-P_i(c))\right]$ (4) $L_{loc} = 1 - IoU(X, Y) + \frac{\rho^2(X_{ctr}, Y_{ctr})}{m^2} + u v$, with $v = \frac{4}{\pi^2}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w}{h}\right)^2$ and $u = \frac{v}{(1 - IoU(X, Y)) + v}$ (5), where S² is the number of grids, B is the number of prediction boxes in each grid, $I_{ij}^{obj}$ and $I_{ij}^{noobj}$ indicate whether or not a prediction box contains the target, $\hat{C}_i$ is the true confidence, $C_i$ is the predicted confidence, $\lambda_{noobj}$ is the penalty weight factor, $\hat{p}_i(c)$ is the actual probability that the target in the cell belongs to category c, $P_i(c)$ is the predicted probability of category c, wgt and hgt are the width and height of the ground-truth box, IoU(X, Y) is the intersection-over-union of the predicted box X and the ground-truth box Y, ρ²(Xctr, Yctr) is the squared Euclidean distance between the center points of the predicted and ground-truth boxes, m is the diagonal length of the minimum closed region containing both boxes, u is the balance adjustment parameter, and v measures the consistency of the aspect ratio.</p><p>To balance the loss sensitivity of different detection scales, the three prediction heads in the network structure are multiplied by different weights when calculating the total loss. The weights assigned to YOLO Head1, YOLO Head2, and YOLO Head3 are 0.4, 1.0, and 4.0, respectively <ref type="bibr" target="#b31">[31]</ref>.</p></div>
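The CIoU term of Eq. (5) can be sketched for a single box pair. This is a minimal illustration using the [x1, y1, x2, y2] corner format (an assumption; YOLO internally uses center-width-height coordinates):

```python
import math

def ciou_loss(box_p, box_g):
    # CIoU loss for one predicted / ground-truth box pair, Eq. (5):
    # 1 - IoU + rho^2 / m^2 + u*v.
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter)

    # rho^2: squared distance between box centers
    cx_p, cy_p = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cx_g, cy_g = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2

    # m^2: squared diagonal of the minimum enclosing box
    ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    m2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2

    # v: aspect-ratio consistency; u: balance parameter
    w_p, h_p = box_p[2] - box_p[0], box_p[3] - box_p[1]
    w_g, h_g = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
    denom = (1 - iou) + v
    u = v / denom if denom > 0 else 0.0
    return 1 - iou + rho2 / m2 + u * v
```

Identical boxes give a loss of exactly 0, while a disjoint pair is penalized both for zero overlap and for its center distance, which is the behaviour that distinguishes CIoU from plain IoU loss.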
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Conclusion</head><p>This paper focuses on one-stage target detection methods, which face higher requirements on detection speed and deployment cost. The work helps cameras in traffic scenes recognize vehicle information and perform vehicle model discrimination. A lightweight target detection algorithm based on attention and feature augmentation is proposed to address the demand for vehicle detection in smart-city construction, while the complexity of the algorithm is strictly controlled. The proposed algorithm uses YOLOv4 as the base architecture: (i) it significantly reduces the number of model parameters by adopting DenseNet, which has excellent performance, as the backbone feature extraction network; (ii) it reconstructs the existing FPN module, using the ECA attention structure for the transition and transfer of feature information between the Backbone and Neck, and adds the cross-fusion of information provided by the ASFF structure before the final detection layer of the network; (iii) it further optimizes the loss function and image preprocessing.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: YOLOv4 network structure.</figDesc><graphic coords="3,140.80,85.05,313.80,169.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Various FPN modes.</figDesc><graphic coords="4,120.10,264.45,355.20,134.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3</head><label>3</label><figDesc>Figure 3 shows the architecture of the proposed algorithm. It takes the one-stage target detection algorithm YOLOv4 as the reference architecture and divides the algorithm framework into four parts: data pre-processing and input, backbone network, FPN structure, and prediction network. The pre-processed images are sent to the backbone network, which adopts a lightweight DenseNet structure consisting of different numbers of Dense Blocks and Transition Layers. Depending on the number of sub-module overlays, the backbone network extracts feature information at different scales and passes it into the FPN network. Before this information is passed into SPPNet and PANet, it is further filtered and refined by three ECA attention modules. Then the information output from the bidirectional fusion network PANet is fed into the complex fusion network ASFF, which makes the feature-map information at different scales interact. Finally, the information extracted from the ASFF network is fed into the YOLO detection head, and the prediction results for the image are obtained after information decoding and other operations. Next, the backbone network, FPN structure, and loss function of the algorithm in this paper are described in more detail.</figDesc><graphic coords="6,100.30,326.85,394.80,161.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The structure of the proposed algorithm.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>2 .</head><label>2</label><figDesc>More emphasis and encouragement on feature reuse. 3. The network is easier to train and has some regularization effect. 4. The problems of gradient vanishing and model degradation are alleviated.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: DenseNet main structure.</figDesc><graphic coords="7,134.70,85.05,340.20,198.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: SENet and ECANet structures.</figDesc><graphic coords="8,165.60,126.45,278.40,212.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: ASFF schematic.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: ASFF specific operations.</figDesc><graphic coords="10,165.00,328.35,279.60,136.80" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">You Only Look Once: Unified, Real-Time Object Detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Redmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Divvala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2016.91</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="779" to="788" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">SSD: Single Shot MultiBox Detector</title>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anguelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-46448-0_2</idno>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision (ECCV)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">YOLO9000: Better, Faster, Stronger</title>
		<author>
			<persName><forename type="first">J</forename><surname>Redmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2017.690</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="6517" to="6525" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Examination of Abnormal Behavior Detection Based on Improved YOLOv3</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Przystupa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Majka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kochan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">197</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Focal Loss for Dense Object Detection</title>
		<author>
			<persName><forename type="first">T. -Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<idno type="DOI">10.1109/TPAMI.2018.2858826</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="318" to="327" />
			<date type="published" when="2020-02-01">1 Feb. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno type="DOI">10.1109/TPAMI.2015.2389824</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="1904" to="1916" />
			<date type="published" when="2015-09-01">1 Sept. 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Mish: A Self Regularized Non-Monotonic Neural Activation Function</title>
		<author>
			<persName><forename type="first">D</forename><surname>Misra</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.08681</idno>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Fine-grained vehicle recognition method based on improved ResNet</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Ailing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ning</surname></persName>
		</author>
		<idno type="DOI">10.1109/ITCA52113.2020.00129</idno>
	</analytic>
	<monogr>
		<title level="m">2nd International Conference on Information Technology and Computer Application (ITCA)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="588" to="592" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Feature Pyramid Networks for Object Detection</title>
		<author>
			<persName><forename type="first">T. -Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hariharan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2017.106</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="936" to="944" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Path Aggregation Network for Instance Segmentation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jia</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2018.00913</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="8759" to="8768" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Densely Connected Convolutional Networks</title>
		<author>
			<persName><forename type="first">G</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2017.243</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2261" to="2269" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Hu</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR42600.2020.01155</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="11531" to="11539" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Learning spatial fusion for single-shot object detection</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1911.09516</idno>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Dynamic Head: Unifying Object Detection Heads with Attentions</title>
		<author>
			<persName><forename type="first">X</forename><surname>Dai</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR46437.2021.00729</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="7369" to="7378" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Mask R-CNN</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gkioxari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision</title>
				<meeting>the IEEE international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2961" to="2969" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection</title>
		<author>
			<persName><forename type="first">G</forename><surname>Ghiasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T. -Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2019.00720</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="7029" to="7038" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">EfficientDet: Scalable and Efficient Object Detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR42600.2020.01079</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="10778" to="10787" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Effective Fusion Factor in FPN for Tiny Object Detection</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Han</surname></persName>
		</author>
		<idno type="DOI">10.1109/WACV48630.2021.00120</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Winter Conference on Applications of Computer Vision (WACV)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1159" to="1167" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Online measurement error detection for the electronic transformer in a smart grid</title>
		<author>
			<persName><forename type="first">G</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Przystupa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Teng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Energies</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page">3551</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Fault diagnosis of RV reducer based on denoising time-frequency attention neural network</title>
		<author>
			<persName><forename type="first">K</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kochan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">238</biblScope>
			<biblScope unit="page">121762</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Social Recommendation Algorithm Based on Self-Supervised Hypergraph Attention</title>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Przystupa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kochan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronics</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page">906</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Yolo+FPN: 2D and 3D Fused Object Detection With an RGB-D Camera</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zell</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICPR48806.2021.9413066</idno>
	</analytic>
	<monogr>
		<title level="m">25th International Conference on Pattern Recognition (ICPR)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4657" to="4664" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Spatial transformer networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Jaderberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page">28</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Squeeze-and-Excitation Networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sun</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2018.00745</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="7132" to="7141" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">CBAM: Convolutional Block Attention Module</title>
		<author>
			<persName><forename type="first">S</forename><surname>Woo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European Conference on Computer Vision (ECCV)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="3" to="19" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Residual Attention Network for Image Classification</title>
		<author>
			<persName><forename type="first">F</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2017.683</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="6450" to="6458" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="page">101182</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Network In Network</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1312.4400</idno>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Polyphonic Sound Event Detection Based on Residual Convolutional Recurrent Neural Network With Semi-Supervised Loss Function</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">K</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">K</forename><surname>Kim</surname></persName>
		</author>
		<idno type="DOI">10.1109/ACCESS.2020.3048675</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="7564" to="7575" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Distance-IoU loss: Faster and better learning for bounding box regression</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI conference on artificial intelligence</title>
				<meeting>the AAAI conference on artificial intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="12993" to="13000" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Singularity intensity function analysis of autoregressive spectrum and its application in weak target detection under sea clutter background</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Radio Science</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="1" to="8" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
