<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Ming Xue CEUR Workshop Proceedings</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Ming</forename><surname>Xue</surname></persName>
							<email>mingxue202@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Wuhan Fiberhome Technical Services Co., Ltd</orgName>
								<address>
									<addrLine>88 Youkeyuan Rd., Hongshan District</addrLine>
									<postCode>430068</postCode>
									<settlement>Wuhan</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Ming Xue CEUR Workshop Proceedings</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">5B1AE30DC5D5556E752E4899F815D25A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>target detection</term>
					<term>vehicle detection</term>
					<term>YOLOv4</term>
					<term>feature fusion</term>
					<term>attention mechanism</term>
					<term>lightweighting</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Considering the deployment cost of traffic recognition algorithms, this paper adopts YOLOv4 as the base architecture. The lightweight DenseNet is used as the backbone feature extraction network, and efficient channel attention (ECA) and Adaptive Spatial Feature Fusion (ASFF) are used to enhance the PANet structure with attention-guided fusion. The weight ratio of the loss function is optimized and the mosaic method is used for training data augmentation. The results show that the proposed algorithm improves both detection accuracy and detection speed while reducing the number of parameters by 64%. The research results provide a useful reference for the traffic construction of smart cities.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>As the pioneer of one-stage target detection, the YOLO family departs from the detection principle of the Faster R-CNN series by abandoning the region proposal network (RPN) and regressing the bounding-box coordinates directly. YOLOv1, which adopts an end-to-end identification approach, is therefore known as a one-stage target detection algorithm. Thanks to its dramatic increase in detection speed, the algorithm was quickly deployed in many real-world projects and was even used in military devices. A large number of one-stage target detection algorithms have emerged since then, and these algorithms have evolved through iterations in pursuit of faster and more accurate recognition results <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>Bochkovskiy et al. <ref type="bibr" target="#b4">[5]</ref> proposed YOLOv4, a one-stage target detection algorithm published in 2020. It builds on the architecture of the classical YOLO detection family and was endorsed by the authors of YOLOv3. Such algorithms concentrate both target classification and localization in the same network architecture, enabling end-to-end detection.</p><p>The YOLOv4 algorithm consists of the CSPDarknet53 backbone network, SPPNet, the PANet feature fusion network, and the YOLO-Head detection head module inherited from YOLOv3. Its network structure is shown in figure <ref type="figure" target="#fig_0">1</ref>.</p><p>In this paper, we further improve the training and inference speed of the one-stage detection algorithm by modifying the backbone network of the model, based on the YOLOv4 algorithm, and improve the model structure with an attention mechanism and a feature fusion module to enhance the detection performance of the algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Improved YOLOv4 method</head><head n="2.1.">Feature Pyramid Network (FPN)</head><p>The Feature Pyramid Network (FPN) is a common approach to addressing the challenge of scale variation in target detection. Its layered structure allows the model to more fully utilize the feature information extracted from the backbone network.</p><p>Various FPNs are designed to maximize the utilization of the multi-scale feature maps from the backbone, and their optimization leads to significant performance improvements in object detection. Therefore, the algorithm in this paper combines PANet and ASFF fusion to enhance the reuse and extraction of feature maps and avoid the loss of effective information <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Attention mechanisms</head><p>The attention mechanism focuses on local information while suppressing distracting information. Attention mechanisms have made important breakthroughs in recent years in areas such as image processing and natural language processing, and have widely demonstrated their effectiveness in improving model performance.</p><p>From a mathematical point of view, the attention mechanism provides a weight-based model for performing operations. A network layer calculates the weight values corresponding to the relevant feature maps and then applies these weights to the feature maps, so that the feature maps that contribute most to information extraction become more influential on the overall result. With respect to the content of interest, attention mechanisms can be split into three types: the channel attention mechanism, the spatial attention mechanism, and the mixed spatial and channel attention mechanism.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1.">The spatial attention mechanism</head><p>The Spatial Transformer Network (STN) <ref type="bibr" target="#b9">[10]</ref> proposed by Google DeepMind is a spatial attention mechanism that learns the shape change of the input in order to accomplish preprocessing operations suited to a specific task. The ST module consists of a localisation net, a grid generator and a sampler. The localisation net determines the parameters 𝜃 of the transformation required for the input. The grid generator finds the mapping 𝑇 (𝜃) of the output to the input features from 𝜃 and the defined transformation. The sampler combines the location mapping and transformation parameters to select the input features and combines them with bilinear interpolation for the output.</p></div>
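<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal PyTorch sketch of such an ST module (the localisation-net layer sizes are illustrative choices, not taken from any particular network): the small localisation net predicts the affine parameters 𝜃, affine_grid acts as the grid generator 𝑇 (𝜃), and grid_sample performs the bilinear sampling.</p><p><code>
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Localisation net -> grid generator -> sampler."""
    def __init__(self, channels):
        super().__init__()
        # Localisation net: predicts the 6 parameters theta of a 2D affine transform.
        self.localisation = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(channels * 8 * 8, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, 6),
        )
        # Initialise to the identity transform.
        self.localisation[-1].weight.data.zero_()
        self.localisation[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.localisation(x).view(-1, 2, 3)                  # transformation parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)   # grid generator T(theta)
        return F.grid_sample(x, grid, align_corners=False)           # bilinear sampler
</code></p></div>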
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2.">Channel attention mechanism</head><p>The Squeeze-and-Excitation Network (SENet) <ref type="bibr" target="#b10">[11]</ref> is a channel-type attention model, which automatically enhances or suppresses channels after model learning by modeling the importance of each feature channel. It splits off a bypass branch after the normal convolution operation, and this branch is compressed and passed through fully connected layers to obtain a set of weight values. The importance of the different channels can be learned by applying this set of weights to each of the original feature channels.</p></div>
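<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal PyTorch sketch of an SE block as described above, assuming a reduction ratio of 16 for the fully connected bottleneck:</p><p><code>
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # global average pooling per channel
        self.excite = nn.Sequential(                      # compression + fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                      # reweight each original feature channel
</code></p></div>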
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.3.">Fusion of spatial and channel attention mechanisms</head><p>The Convolutional Block Attention Module (CBAM) <ref type="bibr" target="#b11">[12]</ref> is a representative network that combines spatial and channel attention mechanisms. It applies channel attention first and spatial attention second, so that the model models the important information of channel and spatial locations separately. Besides these, there is much other related research on attention mechanisms, such as the residual attention mechanism, the multi-scale attention mechanism, the recursive attention mechanism, etc. <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Related methods</head><p>Figure <ref type="figure" target="#fig_1">2</ref> shows the overall architecture of the proposed algorithm. The algorithm takes the one-stage target detection algorithm YOLOv4 as the reference architecture and divides the framework into four parts: data pre-processing and input, backbone network, FPN structure and prediction network. The pre-processed images are first passed into the backbone network, which adopts a lightweight DenseNet structure consisting of different numbers of dense blocks and transition layers. Depending on the number of stacked sub-modules, the backbone network extracts feature information at different scales and passes it into the FPN network. Before this information is passed into SPPNet and PANet, it is further filtered and refined by three ECA attention modules. Then the information output from the bidirectional fusion network PANet is fed into the fusion network ASFF, which allows the feature map information at different scales to interact. Finally, the information extracted from the ASFF network is fed into the YOLO detection head, and the prediction results for the image are obtained after decoding and other operations. Next, the backbone network, FPN structure and loss function of the proposed algorithm are described in more detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.1.">Lightweighting of the backbone</head><p>The backbone network is replaced with the lightweight network DenseNet-121, and the rest of the architecture is optimized on the basis of YOLOv4.</p><p>The main components of the DenseNet network are dense blocks and transition layers. A dense block is composed of several bottlenecks. Each bottleneck uses the same number of output channels, and the input and output of each bottleneck are connected in the channel dimension. The structure of the bottleneck is shown in the upper part of figure <ref type="figure" target="#fig_2">3</ref>. Each bottleneck contains two convolutions: the first is a 1*1 convolution with 4𝑘 output channels. Here, 𝑘 is the feature map growth factor, i.e. the number of feature maps contributed by each bottleneck. The second, 3*3 convolution has 𝑘 output channels. Finally, the input of the module and the output of the 3*3 convolution are concatenated, so the overall number of output channels of the module is 𝐶 ′ + 𝑘. The dense block structure is shown in the middle part of figure <ref type="figure" target="#fig_2">3</ref>; it consists of several bottlenecks. Suppose the number of input channels of the whole dense block is 𝐶 0 . Since each bottleneck concatenates its input with the output of its final convolution, the number of feature channels increases by 𝑘 for each bottleneck passed through. Therefore, the number of output feature maps of a dense block composed of 𝑛 bottlenecks is 𝐶 0 + 𝑛𝑘.</p><p>By looking at the dense block structure, it can be seen that the input of each bottleneck is a stack of all the outputs of its preceding layers. This essentially densely connected network structure is the reason why DenseNet achieves good results.</p><p>The transition layer is used to control the model complexity, and its structure is shown in the lower part of figure <ref type="figure" target="#fig_2">3</ref>. Since the number of channels increases with each dense block, using too many would result in an overly complex model. Therefore, the transition layer first reduces the number of channels by a 1*1 convolution layer, and then, to compress the height and width of the feature map, an average pooling layer with stride=2 is used for downsampling, which further reduces the model complexity.</p></div>
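<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal PyTorch sketch of the bottleneck, dense block and transition layer described above, with growth rate 𝑘; the BN-ReLU-Conv ordering follows common DenseNet practice:</p><p><code>
import torch
import torch.nn as nn

class BottleNeck(nn.Module):
    def __init__(self, in_channels, k):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * k, kernel_size=1, bias=False),   # 1*1 conv, 4k channels
            nn.BatchNorm2d(4 * k), nn.ReLU(inplace=True),
            nn.Conv2d(4 * k, k, kernel_size=3, padding=1, bias=False),  # 3*3 conv, k channels
        )

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)   # output has C' + k channels

class DenseBlock(nn.Module):
    def __init__(self, in_channels, k, n):
        super().__init__()
        # n bottlenecks; the i-th one sees in_channels + i*k input channels,
        # so the block outputs in_channels + n*k channels in total.
        self.layers = nn.Sequential(*[BottleNeck(in_channels + i * k, k) for i in range(n)])

    def forward(self, x):
        return self.layers(x)

class Transition(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)  # channel reduction
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)                               # spatial downsampling

    def forward(self, x):
        return self.pool(self.reduce(x))
</code></p></div>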
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Incorporating the attention mechanism</head><p>To maintain the detection accuracy of the model while performing lightweight optimization, this paper intersperses attention mechanism modules in the network structure.</p><p>To keep the balance between model complexity and performance, this paper adopts the efficient channel attention (ECA) module, which contains only a small number of parameters while delivering significant performance gains. ECA-Net is optimized on the basis of SE-Net; the structure of SE-Net is shown in figure <ref type="figure" target="#fig_3">4</ref>(a). Global average pooling is first performed separately for each input channel, followed by two fully connected layers using different activation functions. This computational process maps the channel features from high to low and then back to high dimensions. The dimensionality reduction reduces the complexity of the model, but it also hinders the direct correspondence between channels and their weights, which may result in the loss of critical information.</p><p>By analyzing SE-Net and improving on it, ECA-Net shows empirically that it is important to avoid dimensionality reduction when learning channel attention, and that an appropriate cross-channel interaction maintains performance while increasing the complexity of the model only slightly. Its structural design is given in figure <ref type="figure" target="#fig_3">4(b)</ref>. On the left are the features of the original input image, which are first subjected to global average pooling (GAP) <ref type="bibr" target="#b17">[18]</ref> to obtain a 1*1*𝐶 feature map, on which ECA obtains the local cross-channel interaction by a fast one-dimensional convolution of size 𝑘. The parameter 𝑘 is derived by an adaptive function of the input channel dimension 𝐶 and represents the local coverage of the cross-channel interaction. Then, the sigmoid activation function generates the weight for every single channel. After that, the features with channel attention are obtained by combining the original input features with the channel weights. A network based on this module extracts discriminative features of images along the channel dimension more easily.</p><p>In order to avoid the consumption of large computational resources due to manual tuning, the size of the convolution kernel 𝑘 can be generated adaptively by the function:</p><formula xml:id="formula_0">k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}}<label>(1)</label></formula><p>where |𝑡| odd denotes the odd number nearest to 𝑡, 𝛾 is set to 2, and 𝑏 is set to 1. From the equation, it is clear that the interaction range of high-dimensional channels is longer, while the interaction range of low-dimensional channels is relatively contracted. In this paper, three ECA layers are inserted at the connections between the backbone and neck of the model to avoid dimensionality reduction while better bridging the two components, making the feature transfer of the model more efficient and preventing the disappearance of feature information. At the same time, the ECA layers allow the model to focus on more critical features and suppress unnecessary ones, thus ignoring interference from the image background, which further enhances the detection accuracy of the model.</p></div>
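<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal PyTorch sketch of an ECA layer using the adaptive kernel size of Eq. (1) with 𝛾 = 2 and 𝑏 = 1:</p><p><code>
import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                        # nearest odd kernel size, Eq. (1)
        self.pool = nn.AdaptiveAvgPool2d(1)              # GAP -> 1*1*C
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)  # fast 1D conv across channels
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        y = self.pool(x)                                   # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))       # 1D conv over the channel dimension
        y = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))  # per-channel weights in (0, 1)
        return x * y.expand_as(x)                          # channel-weighted features
</code></p></div>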
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Spatially adaptive fusion of feature layers</head><p>ASFF can further enhance the extraction capability of PANet and can fuse the information of multiple feature layers simultaneously. Its core idea is to adaptively adjust, by learning, the spatial weights of the features at each scale during fusion. Its underlying structure is shown in figure <ref type="figure" target="#fig_4">5</ref>. Taking ASFF-3 as an example, 𝑋1 and 𝑋2 are downsampled by convolutions with a 3*3 kernel, a stride of 2, and a padding of 1 until they match 𝑋3 in spatial size and number of channels, and are denoted level_1_resized and level_2_resized. The number of channels and the dimensions of level_1_resized, level_2_resized, and 𝑋3 are then the same. Finally, level_1_resized, level_2_resized, and 𝑋3 are multiplied by 𝛼, 𝛽 and 𝛾, respectively, the results are summed, and the number of channels is adjusted by a final convolutional layer to obtain a new feature layer with multi-level receptive field fusion. The expression is as follows.</p><formula xml:id="formula_1">y_{ij}^{l} = \alpha_{ij}^{l} \cdot X_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot X_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot X_{ij}^{3 \to l}<label>(2)</label></formula><p>where 𝑦 𝑙 𝑖𝑗 represents the new feature map of a layer obtained by ASFF, 𝛼 𝑙 𝑖𝑗 , 𝛽 𝑙 𝑖𝑗 , and 𝛾 𝑙 𝑖𝑗 represent the weight parameters learned for the three feature layers, and 𝛼 𝑙 𝑖𝑗 + 𝛽 𝑙 𝑖𝑗 + 𝛾 𝑙 𝑖𝑗 = 1 is guaranteed by the Softmax function.</p></div>
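<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal PyTorch sketch of the weighted fusion in Eq. (2); here each weight map is produced by a 1*1 convolution and the three maps are normalised with Softmax so that they sum to 1 at every spatial position (the channel widths are illustrative):</p><p><code>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFuse(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1*1 convs compress each resized input into a single-channel weight map.
        self.w1 = nn.Conv2d(channels, 1, kernel_size=1)
        self.w2 = nn.Conv2d(channels, 1, kernel_size=1)
        self.w3 = nn.Conv2d(channels, 1, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # final channel adjustment

    def forward(self, x1_resized, x2_resized, x3):
        # All three inputs already share the same spatial size and channel count.
        w = torch.cat([self.w1(x1_resized), self.w2(x2_resized), self.w3(x3)], dim=1)
        alpha, beta, gamma = F.softmax(w, dim=1).chunk(3, dim=1)     # alpha + beta + gamma = 1 per pixel
        fused = alpha * x1_resized + beta * x2_resized + gamma * x3  # Eq. (2)
        return self.out(fused)
</code></p></div>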
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Designing the loss function</head><p>The loss function of the YOLOv4 algorithm contains three components: the confidence error 𝐿 conf , the classification error 𝐿 cls , and the regression frame prediction error 𝐿 loc . Among them, the confidence error and classification error follow the design of YOLOv3 <ref type="bibr" target="#b18">[19]</ref>, while the CIoU loss is used for the regression frame prediction error. CIoU builds on IoU, GIoU, and DIoU and takes into account three geometric factors: overlap area, center distance, and aspect ratio. They are calculated by the following equations <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22]</ref>.</p><formula xml:id="formula_2">L_{\mathrm{conf}} = -\sum_{i=0}^{S^2}\sum_{j=0}^{B} I_{ij}^{\mathrm{obj}}\left[\hat{C}_i^j \log(C_i^j) + \left(1-\hat{C}_i^j\right)\log\left(1-C_i^j\right)\right] - \lambda_{\mathrm{noobj}}\sum_{i=0}^{S^2}\sum_{j=0}^{B} I_{ij}^{\mathrm{noobj}}\left[\hat{C}_i^j \log(C_i^j) + \left(1-\hat{C}_i^j\right)\log\left(1-C_i^j\right)\right]<label>(3)</label></formula><formula xml:id="formula_3">L_{\mathrm{cls}} = -\sum_{i=0}^{S^2} I_{ij}^{\mathrm{obj}} \sum_{c\in\mathrm{classes}} \left\{\hat{P}_i^j(c)\log\left[P_i^j(c)\right] + \left[1-\hat{P}_i^j(c)\right]\log\left[1-P_i^j(c)\right]\right\}<label>(4)</label></formula><formula xml:id="formula_3a">\mathrm{CIoU}(X, Y) = \mathrm{IoU}(X, Y) - \frac{\rho^2(X_{\mathrm{ctr}}, Y_{\mathrm{ctr}})}{m^2} - \alpha v<label>(5)</label></formula><p>where 𝑆 2 is the number of grids, 𝐵 is the number of prediction frames in each grid, 𝐼 obj 𝑖𝑗 and 𝐼 noobj 𝑖𝑗 are the indicator values of the prediction frames containing and not containing the target, 𝐶̂ 𝑗 𝑖 is the ground-truth confidence, 𝐶 𝑗 𝑖 is the predicted confidence, 𝜆 noobj is the penalty weight factor, 𝑃̂ 𝑗 𝑖 (𝑐) is the actual probability that the target in the cell belongs to category 𝑐, 𝑃 𝑗 𝑖 (𝑐) is the predicted probability of category 𝑐, IoU(𝑋, 𝑌 ) is the intersection over union of the predicted frame 𝑋 and the real frame 𝑌 , 𝜌 2 (𝑋 ctr , 𝑌 ctr ) is the squared Euclidean distance between the center points of the predicted and real frames, 𝑚 is the diagonal length of the minimum closed region containing both the predicted and real frames, 𝛼 is the balance adjustment parameter, and 𝑣 is the parameter measuring the consistency of the aspect ratio.</p><p>The regression frame prediction error 𝐿 loc is defined using the CIoU loss:</p><formula xml:id="formula_4">L_{\mathrm{loc}} = \sum_{i=0}^{S^2}\sum_{j=0}^{B} I_{ij}^{\mathrm{obj}}\left(1 - \mathrm{CIoU}(b_i^j, \hat{b}_i^j)\right)<label>(6)</label></formula><p>where 𝑏 𝑗 𝑖 represents the predicted bounding box and 𝑏̂ 𝑗 𝑖 represents the ground-truth bounding box. In order to balance the loss sensitivity of the different detection scales, the three prediction heads in the network structure are multiplied by different weights when calculating the total loss. The weights assigned to Yolo Head1, Yolo Head2 and Yolo Head3 are 0.4, 1.0 and 4.0, respectively <ref type="bibr" target="#b22">[23]</ref>.</p></div>
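<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, a minimal PyTorch sketch of the CIoU term of Eq. (5) for axis-aligned boxes given as (x1, y1, x2, y2); the regression loss of Eq. (6) is then 1 − CIoU:</p><p><code>
import math
import torch

def ciou(box_p, box_g, eps=1e-7):
    # Intersection over union.
    xi1 = torch.max(box_p[..., 0], box_g[..., 0]); yi1 = torch.max(box_p[..., 1], box_g[..., 1])
    xi2 = torch.min(box_p[..., 2], box_g[..., 2]); yi2 = torch.min(box_p[..., 3], box_g[..., 3])
    inter = (xi2 - xi1).clamp(0) * (yi2 - yi1).clamp(0)
    area_p = (box_p[..., 2] - box_p[..., 0]) * (box_p[..., 3] - box_p[..., 1])
    area_g = (box_g[..., 2] - box_g[..., 0]) * (box_g[..., 3] - box_g[..., 1])
    iou = inter / (area_p + area_g - inter + eps)

    # Squared centre distance rho^2 and squared diagonal m^2 of the smallest enclosing box.
    cxp = (box_p[..., 0] + box_p[..., 2]) / 2; cyp = (box_p[..., 1] + box_p[..., 3]) / 2
    cxg = (box_g[..., 0] + box_g[..., 2]) / 2; cyg = (box_g[..., 1] + box_g[..., 3]) / 2
    rho2 = (cxp - cxg) ** 2 + (cyp - cyg) ** 2
    cw = torch.max(box_p[..., 2], box_g[..., 2]) - torch.min(box_p[..., 0], box_g[..., 0])
    ch = torch.max(box_p[..., 3], box_g[..., 3]) - torch.min(box_p[..., 1], box_g[..., 1])
    m2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency v and balance parameter alpha.
    wp = box_p[..., 2] - box_p[..., 0]; hp = box_p[..., 3] - box_p[..., 1]
    wg = box_g[..., 2] - box_g[..., 0]; hg = box_g[..., 3] - box_g[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / m2 - alpha * v   # Eq. (5); the loss is 1 - CIoU as in Eq. (6)
</code></p></div>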
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Experiment and analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Experimental setup</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.1.">Dataset</head><p>In this paper, a variety of datasets is used for performance evaluation.</p><p>1. The RSOV dataset is published by Brno University of Technology and consists of three sub-datasets with different viewpoints, namely a rear-view shot dataset, an eye-level view shot dataset, and an unconstrained shot dataset, each of which contains 5000 annotated vehicle images. The dataset therefore has a total of 15,000 images covering 41 different vehicle brands and categories. In this paper, we divide the dataset according to the ratio of 8:1:1, finally obtaining 12,000 training images, 1,500 validation images and 1,500 test images.</p><p>2. The BIT-Vehicle dataset is captured by two cameras at different times and locations, and the images vary in lighting conditions, vehicle color, and camera viewpoint. All vehicles in the dataset are classified into six categories: Bus, Microbus, Minivan, Sedan, SUV, and Truck. The dataset has a total of 9,850 images and is divided according to the ratio of 8:1:1, resulting in 7,880 training images, 985 validation images and 985 test images.</p><p>3. To verify the detection generality of the proposed algorithm, the classical multi-category dataset PASCAL VOC is used. The PASCAL VOC Challenge is a world-class competition in computer vision covering several sub-tasks such as image classification and target detection and segmentation. VOC2007 and VOC2012 are two classical benchmark datasets publicly provided by the competition, covering a total of 20 categories including people, airplanes, and cars, and each version of the dataset is produced in a uniform manner. In this paper, we use a total of 16,551 trainval images from VOC2007 and VOC2012 as the overall dataset, randomly partitioned according to the ratio of 0.81:0.09:0.1, resulting in 13,405 training images, 1,490 validation images and 1,656 test images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.2.">Dataset pre-processing</head><p>Since there are large differences in the number of samples and an uneven distribution of some images in some datasets, mosaic data augmentation is performed on the dataset before model training. The operation process is shown in figure <ref type="figure" target="#fig_6">6</ref>. The mosaic method first takes a batch of data from the dataset, then randomly selects four images that are scaled or shifted by different proportions and placed towards the four corners of a rectangle, crops away the parts of the images that exceed the specified input size, and finally obtains a new image as training data. Using the mosaic method for data preprocessing not only enhances the diversity of the data and enriches the image dataset, but also effectively increases the batch size and improves the training efficiency of the model.</p></div>
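<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the mosaic operation, simplified so that each of the four images is scaled to fill its quadrant; the bounding-box labels, which would need the same shift and scale, are omitted:</p><p><code>
import random
import numpy as np
import cv2

def mosaic(images, out_size=608):
    """images: list of four HxWx3 arrays; returns one out_size x out_size x 3 mosaic image."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    # Random split point dividing the canvas into four quadrants.
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        # Scale each image into its quadrant; anything outside the canvas is cropped away.
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas
</code></p></div>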
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.3.">Evaluation index</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Evaluation metrics for model inference</head><p>For the target detection task, the precision 𝑃 , the recall 𝑅, and the mean average precision 𝑚𝐴𝑃 are commonly used as evaluation metrics for model identification. The calculation of the relevant metrics is given below.</p><p>a) Precision 𝑃 and recall 𝑅. Precision is the ratio of correct predictions to the total number of predictions, and is one of the simplest metrics. Recall is the ratio of the number of correctly predicted positive cases to the total number of positive labels. They are calculated as follows:</p><formula xml:id="formula_5">P = \frac{TP}{TP + FP},<label>(7)</label></formula><formula xml:id="formula_6">R = \frac{TP}{TP + FN},<label>(8)</label></formula><p>where 𝑇 𝑃 denotes the number of positive samples correctly identified as positive, 𝐹 𝑃 denotes the number of negative samples incorrectly identified as positive, and 𝐹 𝑁 denotes the number of positive samples incorrectly identified as negative.</p><p>b) 𝐴𝑃 and 𝑚𝐴𝑃 . The average precision (𝐴𝑃 ) is the area under the precision-recall curve: the better a classifier, the higher its 𝐴𝑃 value. It is used to evaluate the detection accuracy for a single class and is calculated as follows:</p><formula xml:id="formula_7">AP = \int_{0}^{1} P(R)\, dR<label>(9)</label></formula><p>The mean average precision (𝑚𝐴𝑃 ) is the average of the 𝐴𝑃 values over all categories. The 𝑚𝐴𝑃 lies in the interval [0, 1] and the larger its value, the better; it is the most important metric for target detection algorithms and is calculated as follows.</p><formula xml:id="formula_8">mAP = \frac{1}{n} \sum_{i=1}^{n} AP(i)<label>(10)</label></formula></div>
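<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal NumPy sketch of Eqs. (7)-(10), with 𝐴𝑃 computed as the area under a monotonised precision-recall curve:</p><p><code>
import numpy as np

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0       # Eq. (7)
    r = tp / (tp + fn) if tp + fn else 0.0       # Eq. (8)
    return p, r

def average_precision(recalls, precisions):
    """recalls/precisions: arrays sampled along the PR curve, recalls increasing."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]     # make the precision envelope monotonically decreasing
    return float(np.sum(np.diff(r) * p[1:]))     # Eq. (9), numerically integrated

def mean_average_precision(ap_per_class):
    return float(np.mean(ap_per_class))          # Eq. (10)
</code></p></div>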
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Evaluation index of model parameters</head><p>In a deep learning model, the number of parameters determines to some extent the depth of the model, the speed of inference and even the detection accuracy. A large deep learning architecture is often accompanied by high accuracy because it is closer to the neuronal composition of the human brain. However, a large number of parameters also means a sacrifice of inference time and response speed. Therefore, a small number of parameters is important for deploying deep learning models to embedded devices or platforms and for handling large numbers of concurrent requests. FLOPs refers to the number of floating point operations; it can be understood as the amount of computation and is reported here in billions (G). It can be used to measure the complexity of an algorithm/model. Params refers to the total number of parameters to be trained and is reported here in millions (M).</p><formula xml:id="formula_9">\mathrm{Params} = \sum_{l=1}^{L} P_l<label>(11)</label></formula><p>where 𝐿 is the total number of layers in the model and 𝑃 𝑙 is the number of parameters in layer 𝑙.</p><p>For a typical convolutional layer, the number of parameters can be calculated as:</p><formula xml:id="formula_11">P_{\mathrm{conv}} = (k_h \times k_w \times c_{\mathrm{in}} + 1) \times c_{\mathrm{out}}<label>(12)</label></formula><p>where 𝑘 ℎ and 𝑘 𝑤 are the kernel height and width, 𝑐 in is the number of input channels, 𝑐 out is the number of output channels, and the +1 term accounts for the bias parameter of each output channel.</p><p>For the 𝐹 𝐿𝑂𝑃 𝑠 of a convolutional layer:</p><formula xml:id="formula_13">\mathrm{FLOPs}_{\mathrm{conv}} = h_{\mathrm{out}} \times w_{\mathrm{out}} \times (2 \times k_h \times k_w \times c_{\mathrm{in}} - 1) \times c_{\mathrm{out}}<label>(13)</label></formula><p>where ℎ out and 𝑤 out are the height and width of the output feature map. The term (2 × 𝑘 ℎ × 𝑘 𝑤 × 𝑐 in − 1) accounts for the multiply-add operations of each output element. For a fully connected layer:</p><formula xml:id="formula_15">P_{\mathrm{fc}} = (n_{\mathrm{in}} + 1) \times n_{\mathrm{out}}<label>(14)</label></formula><formula xml:id="formula_17">\mathrm{FLOPs}_{\mathrm{fc}} = (2 \times n_{\mathrm{in}} - 1) \times n_{\mathrm{out}}<label>(15)</label></formula><p>where 𝑛 in is the number of input neurons and 𝑛 out is the number of output neurons. The total computational complexity of a neural network model can then be expressed as:</p><formula xml:id="formula_19">\mathrm{Total\ FLOPs} = \sum_{l=1}^{L} \mathrm{FLOPs}_l<label>(16)</label></formula><p>where FLOPs 𝑙 is the number of floating point operations in layer 𝑙.</p><p>Model efficiency can be characterized by the ratio:</p><formula xml:id="formula_20">\mathrm{Efficiency} = \frac{\mathrm{Accuracy}}{\mathrm{FLOPs} \times \mathrm{Params}}<label>(17)</label></formula><p>This metric helps quantify the trade-off between model performance and the computational resources required.</p></div>
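<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of Eqs. (12)-(15) as Python helpers, with a small worked example:</p><p><code>
def conv_params(kh, kw, c_in, c_out):
    return (kh * kw * c_in + 1) * c_out                      # Eq. (12), +1 for the bias

def conv_flops(kh, kw, c_in, c_out, h_out, w_out):
    return h_out * w_out * (2 * kh * kw * c_in - 1) * c_out  # Eq. (13)

def fc_params(n_in, n_out):
    return (n_in + 1) * n_out                                # Eq. (14)

def fc_flops(n_in, n_out):
    return (2 * n_in - 1) * n_out                            # Eq. (15)

# Example: a 3*3 convolution with 64 input and 128 output channels on a 52*52 output map.
print(conv_params(3, 3, 64, 128))            # 73,856 parameters
print(conv_flops(3, 3, 64, 128, 52, 52))     # about 0.40 GFLOPs
</code></p></div>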
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.4.">Training strategies</head><p>In this paper, we use transfer learning to speed up the training of the models. Transfer learning helps to train a new model by transferring the weight parameters of an already trained model to the new one. The basis of this approach is that most data or tasks are correlated, so by sharing some of the parameters of a pre-trained model with the new model to be trained, the training process of the new model can be significantly accelerated and optimized, saving computational resources <ref type="bibr" target="#b23">[24]</ref>.</p><p>In deep neural networks, the earlier convolutional layers generally learn shallow, general features, while the later convolutional layers learn more task-specific, higher-level abstract features for the current training target. Freezing some of the network layers first can speed up training and also prevent the weights from being corrupted in the early stage of training. Transfer learning can also effectively avoid the problem of poor generalization caused by local minima of the objective function. The transfer learning strategy in this paper is as follows; a minimal code sketch is given after the list.</p><p>1. Select publicly available DenseNet weight files that have been trained on ImageNet or other large datasets as the source of the pre-trained weights for transfer learning. 2. Load the weight files into the backbone network of this paper's model, and then freeze the backbone network so that it does not participate in back propagation. The other, unfrozen network layers are trained for a certain number of epochs to perform gradient updates. 3. After a certain number of epochs, the frozen layers are unfrozen and all network layers participate in the backpropagation update to finally obtain the appropriate parameter matrices and bias vectors.</p></div>
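<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal PyTorch sketch of the freeze/unfreeze schedule above; build_model and the weight file path are placeholders, and the model is assumed to expose its backbone as an attribute:</p><p><code>
import torch

model = build_model()                                        # placeholder constructor for the detector
pretrained = torch.load("densenet121_imagenet.pth", map_location="cpu")  # hypothetical path
model.backbone.load_state_dict(pretrained, strict=False)     # step 1: load the pre-trained weights

for p in model.backbone.parameters():                        # step 2: freeze the backbone
    p.requires_grad = False
# ... train the unfrozen layers for the first 50 epochs ...

for p in model.backbone.parameters():                        # step 3: unfreeze and fine-tune all layers
    p.requires_grad = True
</code></p></div>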
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.5.">Experimental conditions and parameter settings</head><p>The experiments in this paper are conducted under Linux with Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz processor, 100GB RAM, NVIDIA GTX1070Ti graphics card, and Pytorch 1.8.0 framework for model training and testing. The training parameters were set as shown in table <ref type="table" target="#tab_1">1</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Comparative experiments and discussion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.1.">Comparison with YOLOv4</head><p>Since the framework of this paper is inspired by the structure of YOLOv4, this paper mainly uses YOLOv4 as the comparison object to test the improvement results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Training process comparison.</head><p>To show the convergence of the model, the YOLOv4 algorithm is compared with the proposed algorithm in terms of the loss values during training. Since the initial training loss values are large and would distort the overall plot, the values of the first 5 epochs are removed from the loss curves. The final loss curves of the two models are shown in figure <ref type="figure" target="#fig_7">7</ref>. In the figure, we can see that the loss values level off in the last 10 epochs with almost no change in magnitude, and that, with both models using the same loss function calculation, the final converged loss of the proposed algorithm is smaller than that of the original YOLOv4 model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Model complexity comparison</head><p>The FLOPs and Params introduced in the previous section are used here to measure the complexity of the models.</p><p>According to the different model components and their usage, the experiments are divided into four groups for comparison. The group marked A1 is the original YOLOv4 algorithm; the second group (A2) is based on YOLOv4 with the backbone network replaced by DenseNet and an ECA layer added; the third group (A3) differs from A2 in that the ECA layer is replaced by an ASFF structure; and the fourth group (A4) is the proposed algorithm, which replaces the backbone network with DenseNet and adds both the ECA and ASFF layers. The results of each experimental group are given in table 2. As we can see, the original YOLOv4 algorithm has the highest complexity in the table, so it takes longer to train and infer, and the final generated weight file takes up more space. The comparison between A3 and A4 shows that the ECA attention structure not only works well but also adds minimal computational overhead; comparing A2 and A4, and knowing that the computation of the ECA layer is negligible, shows that the ASFF structure accounts for a certain fraction of both FLOPs and Params. By comparing A1 and A4, we can see that the computation of the proposed algorithm is only 51% of that of YOLOv4 and the number of parameters is only 36% of the original, so the proposed algorithm can significantly reduce the cost of using the model and facilitates the deployment of vehicle detection algorithms in practice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Comparison of detection effect</head><p>In terms of detection effectiveness, YOLOv4 is used as the baseline and compared with the proposed algorithm on a variety of datasets. The results are shown in table <ref type="table" target="#tab_3">3</ref>. As can be seen from the table, the proposed algorithm improves on the baseline for all datasets. On the vehicle detection datasets, the mAP improves by 3.53% on the RSOV dataset and by 2.46% on the BIT-Vehicle dataset. This confirms that the proposed algorithm is well suited to the vehicle detection task. Meanwhile, the mAP on the PASCAL VOC dataset improves by 2.86%, confirming that the proposed algorithm generalizes well and still performs strongly. Figure <ref type="figure" target="#fig_8">8</ref> shows the recognition results and detection heat maps of the two algorithms on the generic task. The comparison between D1 and D2 shows that both algorithms perform well at detecting people; however, D1 misses detections of the occluded vehicles, and the heat map E1 also shows that the YOLOv4 algorithm pays little attention to the locations of the vehicles in the missed-detection area. The recognition ability of D2 is greatly improved, and E2 shows that the attention of the proposed algorithm to the region missed in D1 is much better.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.2.">Performance comparison with other algorithms</head><p>To further validate the proposed algorithm on the downstream detection task, it is compared with other mainstream algorithms with respect to detection effect, complexity and other indicators. The experimental conditions and parameter settings are the same as in section 6.1.5, and the selected datasets are RSOV for the vehicle detection task and the VOC dataset representing the generic detection task. The comparison results are shown in table <ref type="table" target="#tab_4">4</ref>. As can be seen, the proposed algorithm outperforms the other algorithms on both datasets. The YOLOv4-MobilenetV3 network, which combines the MobilenetV3 backbone with YOLOv4, is the simplest model with a low number of parameters, but its mAP shows a significant gap compared with the other algorithms, making it difficult to meet the accuracy requirements of vehicle recognition in traffic scenarios. EfficientDet-d3 has a parameter count at the same level as YOLOv4-MobilenetV3 and excellent detection accuracy, but it takes longer to train, on average 2-3 times longer than other one-stage algorithms, so it is not suitable for scenarios with strict speed requirements. YOLOv3 and YOLOv4 are both classic algorithms of the YOLO family, and although their detection capabilities are good, both have relatively large numbers of parameters, which brings hardware costs at the practical deployment level for traffic vehicle detection tasks. RetinaNet reduces the number of parameters compared with the former, but its computational cost is significantly higher, and its detection effect is still some distance from that of the proposed algorithm. The above comparison therefore shows that the proposed algorithm balances computation, number of parameters and detection effect: it not only has excellent detection performance, but also has advantages in training difficulty and model deployment cost compared with the other algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Ablation experiments and analysis</head><p>Ablation experiments are mainly used to analyze the degree of influence of different components on the whole model. To highlight the effectiveness and synergy of the improvements in the proposed algorithm, ablation experiments are conducted using subsets of the improvements in this paper. The datasets are RSOV for the vehicle detection task and the VOC dataset representing the generic detection task, and the experimental configuration and parameter settings for the ablation experiments are the same as in section 6.1.5.</p><p>In this ablation experiment, the component modules are classified as follows: the DenseNet backbone network module, the ECA attention module, the ASFF feature fusion module, and the mosaic data augmentation module. The mAP performance of each group is shown in table 5, where "✓" indicates that the method is used and "-" indicates that it is not.</p><p>Firstly, we can see that there is a considerable improvement in detection performance when all the improved modules work in concert. From the experimental data of G1 and G2, it can be seen that the detection network using the lightweight DenseNet backbone has good feature extraction ability on a variety of datasets when trained with transfer learning. Comparing G4, G5 and G6, it can be seen that the most obvious enhancements to mAP come from the ASFF module and the mosaic data augmentation method. These two methods enhance the utilization of feature information and the generalization ability of the model, and their synergy makes the enhancement even more obvious. As can be seen from G3, the enhancement from the ECA attention module is small in percentage terms; however, its ability to refine the effective information in the channels allows it to raise the mAP to a new high when used in combination with the ASFF module and the mosaic method. Thus, this ablation experiment demonstrates the effectiveness of each component module of the proposed algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>This paper focuses on the one-stage target detection method, which has higher requirements for detection speed and deployment cost. The proposed method uses YOLOv4 as the base architecture and significantly reduces the number of parameters of the model by replacing the backbone feature extraction network with DenseNet, which has excellent performance; it reconstructs the existing FPN network module, uses the ECA attention structure for the transition and transfer of feature information between backbone and neck, and adds the cross-scale information fusion of the ASFF structure before the final detection layer of the network, while also optimizing the loss function and the image preprocessing. The efficiency of the proposed method is studied on the RSOV, BIT-Vehicle and VOC datasets.</p><p>The training process converges faster and to a lower value of the loss function for the proposed method. A comparison of complexity between the proposed method and the basic YOLOv4 shows an almost twofold decrease in complexity, and the number of parameters is reduced by 64%. The detection accuracy improves on the various datasets to different degrees; for example, the mAP reaches 98.70% on the test set of the RSOV dataset. The research results provide a useful reference for the traffic construction of smart cities.</p><p>For the data augmentation algorithm and traffic information recognition algorithm proposed in this article, the expansion and optimization of the dataset can be considered. By expanding and optimizing the dataset, the accuracy and generalization of traffic information recognition can be improved. For example, images of different vehicle types can be added to enrich the dataset of traffic scenes and improve data diversity, and more data augmentation techniques such as scaling and random cropping can be used to increase the data volume. We also considered the possibility of deploying the model on embedded devices in vehicles, and therefore strictly controlled the complexity of the algorithm.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: YOLOv4 network structure.</figDesc><graphic coords="2,98.89,65.61,397.50,214.88" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Overall algorithm structure.</figDesc><graphic coords="3,100.20,537.56,394.88,161.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: DenseNet main structure.</figDesc><graphic coords="4,102.64,233.61,390.00,226.88" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: SENet and ECANet structures.</figDesc><graphic coords="5,110.51,252.93,374.25,286.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: ASFF schematic.</figDesc><graphic coords="6,112.39,271.66,370.50,124.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Principle of mosaic operation.</figDesc><graphic coords="8,88.95,275.92,417.38,93.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Loss trend comparison.</figDesc><graphic coords="11,162.80,131.74,269.67,203.24" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Visualization of general task detection effect.</figDesc><graphic coords="13,72.00,65.61,451.28,483.14" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Hyperparameter settings for model training.</figDesc><table><row><cell cols="7">Training strategies Epoch BatchSize Optimizer Weight decay num_workers Learning rate</cell></row><row><cell>Backbone-freeze</cell><cell>1-50</cell><cell>32</cell><cell>Adam</cell><cell>0</cell><cell>4</cell><cell>5𝑒 −4</cell></row><row><cell cols="2">Backbone-unfreeze 51-100</cell><cell>8</cell><cell>Adam</cell><cell>0</cell><cell>4</cell><cell>9.7𝑒 −6</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Comparison of parameters of different models.</figDesc><table><row><cell>Experimental group</cell><cell>Method</cell><cell cols="2">FLOPs/G Params/M</cell></row><row><cell>A1</cell><cell>YOLOv4</cell><cell>59.896</cell><cell>64.040</cell></row><row><cell>A2</cell><cell>DenseNet+ECA</cell><cell>25.880</cell><cell>16.116</cell></row><row><cell>A3</cell><cell>DenseNet+ASFF</cell><cell>30.796</cell><cell>23.293</cell></row><row><cell>A4</cell><cell>DenseNet+ECA+ASFF</cell><cell>30.799</cell><cell>23.293</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>mAP for the proposed algorithm and YOLOv4 with multiple datasets (%).</figDesc><table><row><cell cols="4">Algorithm RSOV BIT-Vehicle PASCAL VOC</cell></row><row><cell>YOLOv4</cell><cell>95.17</cell><cell>93.55</cell><cell>84.83</cell></row><row><cell>Our</cell><cell>98.70</cell><cell>96.01</cell><cell>87.69</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc>Comparison of the effects of different algorithms on multiple datasets.</figDesc><table><row><cell>Algorithm</cell><cell></cell><cell>mAP/%</cell><cell cols="2">FLOPs/G Params/M</cell></row><row><cell></cell><cell cols="2">RSOV VOC07+12</cell><cell></cell><cell></cell></row><row><cell>YOLOv3</cell><cell>93.99</cell><cell>83.15</cell><cell>65.658</cell><cell>61.626</cell></row><row><cell>YOLOv4</cell><cell>95.17</cell><cell>84.83</cell><cell>59.896</cell><cell>64.040</cell></row><row><cell cols="2">YOLOv4-MobilenetV3 86.63</cell><cell>73.67</cell><cell>7.162</cell><cell>11.406</cell></row><row><cell>Retinanet</cell><cell>96.14</cell><cell>85.16</cell><cell>151.013</cell><cell>36.724</cell></row><row><cell>EfficientDet-d3</cell><cell>96.76</cell><cell>83.11</cell><cell>46.871</cell><cell>11.931</cell></row><row><cell>Ours</cell><cell>98.70</cell><cell>87.69</cell><cell>30.799</cell><cell>23.293</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5</head><label>5</label><figDesc>Results of ablation experiments.</figDesc><table><row><cell>Experience</cell><cell></cell><cell cols="2">Component</cell><cell></cell><cell></cell><cell>mAP/%</cell></row><row><cell>group</cell><cell cols="6">DenseNet ECA ASFF mosaic RSOV VOC07+12</cell></row><row><cell>G1</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>95.17</cell><cell>84.83</cell></row><row><cell>G2</cell><cell>✓</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>95.55</cell><cell>85.84</cell></row><row><cell>G3</cell><cell>✓</cell><cell>-</cell><cell>✓</cell><cell>✓</cell><cell>98.31</cell><cell>86.80</cell></row><row><cell>G4</cell><cell>✓</cell><cell>✓</cell><cell>✓</cell><cell>-</cell><cell>98.17</cell><cell>86.43</cell></row><row><cell>G5</cell><cell>✓</cell><cell>✓</cell><cell>-</cell><cell>✓</cell><cell>97.24</cell><cell>86.67</cell></row><row><cell>G6</cell><cell>✓</cell><cell>✓</cell><cell>✓</cell><cell>✓</cell><cell>98.70</cell><cell>87.69</cell></row></table></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Recently, new versions of YOLO have appeared. In our future studies, we will focus on implementing the developed algorithm in modern versions of YOLO.</p><p>Declaration on Generative AI: The author has not employed any generative AI tools.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Dynamic Head: Unifying Object Detection Heads with Attentions</title>
		<author>
			<persName><forename type="first">X</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR46437.2021.00729</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="page" from="7369" to="7378" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<ptr target="http://proceedings.mlr.press/v97/tan19a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International Conference on Machine Learning, ICML 2019</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</editor>
		<meeting>the 36th International Conference on Machine Learning, ICML 2019<address><addrLine>Long Beach, California, USA; PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-06-15">9-15 June 2019. 2019</date>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page" from="6105" to="6114" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2018.00716</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2018">2018. 2018</date>
			<biblScope unit="page" from="6848" to="6856" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kalenichenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Weyand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Andreetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Adam</surname></persName>
		</author>
		<idno>CoRR abs/1704.04861</idno>
		<ptr target="http://arxiv.org/abs/1704.04861" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">YOLOv4: Optimal Speed and Accuracy of Object Detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bochkovskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">M</forename><surname>Liao</surname></persName>
		</author>
		<idno>CoRR abs/2004.10934</idno>
		<ptr target="https://arxiv.org/abs/2004.10934" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Effective Fusion Factor in FPN for Tiny Object Detection</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Han</surname></persName>
		</author>
		<idno type="DOI">10.1109/WACV48630.2021.00120</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Winter Conference on Applications of Computer Vision (WACV)</title>
				<imprint>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="page" from="1159" to="1167" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Multi-scale Pulmonary Nodule Detection by Fusion of Cascade R-CNN and FPN</title>
		<author>
			<persName><forename type="first">N</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Bai</surname></persName>
		</author>
		<idno type="DOI">10.1109/CCAI50917.2021.9447531</idno>
	</analytic>
	<monogr>
		<title level="m">2021 International Conference on Computer Communication and Artificial Intelligence (CCAI)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="15" to="19" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">FPN-GAN: Multi-class Small Object Detection in Remote Sensing Images</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Saqlain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ma</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCCBDA51879.2021.9442506</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA)</title>
				<imprint>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="page" from="478" to="482" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Yolo+FPN: 2D and 3D Fused Object Detection With an RGB-D Camera</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zell</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICPR48806.2021.9413066</idno>
	</analytic>
	<monogr>
		<title level="m">25th International Conference on Pattern Recognition (ICPR)</title>
				<imprint>
			<date type="published" when="2020">2020. 2021</date>
			<biblScope unit="page" from="4657" to="4664" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Spatial Transformer Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Jaderberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2015/hash/33ceb07bf4eeb3da587e268d663aba1a-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Cortes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Lawrence</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sugiyama</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<meeting><address><addrLine>Montreal, Quebec, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">December 7-12, 2015. 2015</date>
			<biblScope unit="page" from="2017" to="2025" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Squeeze-and-Excitation Networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sun</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2018.00745</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2018">2018. 2018</date>
			<biblScope unit="page" from="7132" to="7141" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">CBAM: Convolutional Block Attention Module</title>
		<author>
			<persName><forename type="first">S</forename><surname>Woo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">S</forename><surname>Kweon</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-01234-2_1</idno>
	</analytic>
	<monogr>
		<title level="m">Computer Vision -ECCV 2018 -15th European Conference</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">V</forename><surname>Ferrari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hebert</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Sminchisescu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Weiss</surname></persName>
		</editor>
		<meeting><address><addrLine>Munich, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">September 8-14, 2018. 2018</date>
			<biblScope unit="volume">11211</biblScope>
			<biblScope unit="page" from="3" to="19" />
		</imprint>
	</monogr>
	<note>Proceedings, Part VII</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Residual Attention Network for Image Classification</title>
		<author>
			<persName><forename type="first">F</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2017.683</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="6450" to="6458" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Single Image Super-Resolution using Residual Channel Attention Network</title>
		<author>
			<persName><forename type="first">H</forename><surname>Basak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kundu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Giri</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICIIS51140.2020.9342688</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 15th International Conference on Industrial and Information Systems (ICIIS)</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
			<biblScope unit="page" from="219" to="224" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Recurrent Models of Visual Attention</title>
		<author>
			<persName><forename type="first">V</forename><surname>Mnih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Heess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2014/hash/09c6c3783b4a70054da74f2538ed47c6-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014</title>
				<editor>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Cortes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Lawrence</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</editor>
		<meeting><address><addrLine>Montreal, Quebec, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">December 8-13 2014. 2014</date>
			<biblScope unit="page" from="2204" to="2212" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Spatiotemporal Fusion of Remote Sensing Images using a Convolutional Neural Network with Attention and Multiscale Mechanisms</title>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dong</surname></persName>
		</author>
		<idno type="DOI">10.1080/01431161.2020.1809742</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Remote Sensing</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="1973" to="1993" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1016/J.CSL.2020.101182</idno>
	</analytic>
	<monogr>
		<title level="j">Comput. Speech Lang</title>
		<imprint>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="page">101182</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Network In Network</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yan</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1312.4400" />
	</analytic>
	<monogr>
		<title level="m">2nd International Conference on Learning Representations, ICLR 2014</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</editor>
		<meeting><address><addrLine>Banff, AB, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">April 14-16, 2014. 2014</date>
		</imprint>
	</monogr>
	<note>Conference Track Proceedings</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Polyphonic Sound Event Detection Based on Residual Convolutional Recurrent Neural Network With Semi-Supervised Loss Function</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">K</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">K</forename><surname>Kim</surname></persName>
		</author>
		<idno type="DOI">10.1109/ACCESS.2020.3048675</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="7564" to="7575" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ren</surname></persName>
		</author>
		<idno type="DOI">10.1609/AAAI.V34I07.6999</idno>
	</analytic>
	<monogr>
		<title level="m">The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2020">February 7-12, 2020. 2020</date>
			<biblScope unit="page" from="12993" to="13000" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression</title>
		<author>
			<persName><forename type="first">H</forename><surname>Rezatofighi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tsoi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gwak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sadeghian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2019.00075</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2019">2019. 2019</date>
			<biblScope unit="page" from="658" to="666" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zuo</surname></persName>
		</author>
		<idno type="DOI">10.1109/TCYB.2021.3095305</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Cybernetics</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="8574" to="8586" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Singularity Intensity Function Analysis of Autoregressive Spectrum and Its Application in Weak Target Detection Under Sea Clutter Background</title>
		<author>
			<persName><forename type="first">Z.-J</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>-F. Fan</surname></persName>
		</author>
		<idno type="DOI">.org/10.1029/2020RS007108</idno>
	</analytic>
	<monogr>
		<title level="j">Radio Science</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="e2020R" to="S7108" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Incremental learning based multi-domain adaptation for object detection</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<idno type="DOI">10.1016/J.KNOSYS.2020.106420</idno>
	</analytic>
	<monogr>
		<title level="j">Knowl. Based Syst</title>
		<imprint>
			<biblScope unit="volume">210</biblScope>
			<biblScope unit="page">106420</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
