<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>PG-PRNet: A Lightweight Parallel Gated Feature Extractor Based on an Adaptive Progressive Regularization Algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhe Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ming Ye</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yongsheng Xie</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yan Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chongqing Market Supervision Administration Archives Information Center</institution>
          ,
          <addr-line>Chongqing, 400700</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Artificial Intelligence, Southwest University</institution>
          ,
          <addr-line>Chongqing, 400700</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>263</fpage>
      <lpage>271</lpage>
      <abstract>
<p>The residual block in deeper DNNs has a positive effect on feature extraction, but it is limited by practical computational resources. Deeper structures yield limited performance gains in later stages, while residuals in lightweight DNNs reduce the abstract feature representation capability. We propose a lightweight parallel gating framework (PG-PRNet) based on the adaptive progressive regularization algorithm (APR), which changes the constant mapping of the residual, increases the representation of structural information, and compresses the structure via Hard Sigmoid, layer pruning, etc. The APR algorithm avoids the irrationality of applying the same regularization rules in different cases. This better preserves the shallow spatial location information and deep abstract semantic information, improving the performance of lightweight models of different specifications. PG-PRNet is embedded in two vision tasks. It outperforms the listed models on the GTSRB and BDD100K datasets while maintaining low storage and computational overhead.</p>
      </abstract>
      <kwd-group>
<kwd>parallel gating</kwd>
        <kwd>progressive regularization</kwd>
        <kwd>feature extraction</kwd>
        <kwd>residual block</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. Introduction</title>
      <p>… improving the performance of lightweight models of different specifications. The shallow spatial location information and deep abstract semantic
information are better preserved.</p>
      <p>We embed the proposed feature extraction framework into image recognition and object detection, and validate the performance of PG-PRNet on the GTSRB and BDD100K datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
<p>
          Early work focused on improving accuracy by building more complex neural networks.
AlexNet designed a DNN with 60 million parameters and 650,000 neurons, which earned first place in
the ImageNet LSVRC-2012 competition [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Parallel computing on dozens of devices by Google
confirms that distributing the model across multiple devices is another solution [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. However, in
recent years, scholars have found that simply increasing the depth of the model can lead to
performance degradation. ResNet shows that as the depth of the network increases, the accuracy gain
obtained later decreases due to overfitting, vanishing gradients, etc. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The residual structure
adopted by ResNet preserves the shallow spatial location information as much as possible. This
avoids the above problems to a large extent.
        </p>
      </sec>
      <sec id="sec-2-2">
<p>
          SOTA models usually use neural architecture search (NAS) to find the best structural
parameters for building the network [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This places higher demands on the hardware. Some
researchers are working on network compression. DepthwiseConv, which assigns only one set of
convolutional kernels to each channel, can achieve great speedups with little loss of accuracy [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
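<p>As a concrete illustration (our sketch, not code from the paper), a depthwise convolution is obtained in PyTorch by setting the groups argument equal to the number of input channels, which assigns one kernel set per channel:</p>
        <preformat>
import torch
import torch.nn as nn

channels = 32
x = torch.randn(1, channels, 56, 56)

# Standard conv: every output channel mixes all input channels.
standard = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

# Depthwise conv: one kernel set per channel (groups == channels),
# which cuts parameters and FLOPs roughly by a factor of `channels`.
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels)

print(sum(p.numel() for p in standard.parameters()))   # 9248
print(sum(p.numel() for p in depthwise.parameters()))  # 320
        </preformat>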
      </sec>
      <sec id="sec-2-3">
<p>GhostModule presents a plug-and-play module that reduces intermediate feature maps and allows models to be easily deployed on mobile devices. In this paper, GhostModule and DepthwiseConv are used to build the lightweight PG-PRNet, combined with layer pruning and Hard Sigmoid.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>The overall network structure is shown in figure 1. In the feature extraction part, in order to avoid</title>
        <p>the computational overload caused by the large feature embedding in the later stage, the input image
is first passed through a CBR block, which increases the channel dimension and reduces the width and
height scales. Then there are multiple Fused-PG and PG units proposed in this paper. The detailed
description of the improvement points is as follows:</p>
      </sec>
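<p>Assuming CBR denotes a Conv-BatchNorm-ReLU stem (the acronym is not expanded in the text), a minimal sketch of such a block could be:</p>
      <preformat>
import torch.nn as nn

class CBR(nn.Module):
    """Conv-BatchNorm-ReLU stem: raises the channel count and halves
    the spatial resolution.  Channel sizes here are our assumption."""
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2,
                      padding=1, bias=False),  # stride 2: shrink H and W
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
      </preformat>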
    </sec>
    <sec id="sec-4">
      <title>3.1. Model compression.</title>
      <p>In this paper, layer pruning is used to reduce the overall scale. In order to minimize the sacrifice of
accuracy, parallel gating is used, and the representation of residual branches is added. Multiple
parallel gating units form a cascade feature representation. According to GhostNet, each trained DNN
contains many similar intermediate feature maps. We start by generating only half of the intermediate
feature maps, generating the same number of features by linear mapping, called Ghost. Finally,
connect the</p>
      <p>two parts in series. Extensive use of GhostModule and DepthwiseConv in the PG unit reduces the
amount of computation.</p>
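<p>A minimal sketch of this Ghost idea, following the description above (the ratio s = 2 comes from the text; layer sizes and normalization placement are our assumptions):</p>
      <preformat>
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost module [2]: generate half of the output channels
    with an ordinary conv, derive the other half by a cheap depthwise
    (linear) mapping, then concatenate the two parts in series."""
    def __init__(self, in_ch, out_ch, kernel_size=1):
        super().__init__()
        init_ch = out_ch // 2  # ratio s = 2: only half are "real" features
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, init_ch, 3, padding=1,
                      groups=init_ch, bias=False),  # cheap linear mapping
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)       # half of the intermediate feature maps
        ghost = self.cheap(y)     # "Ghost" features from the linear mapping
        return torch.cat([y, ghost], dim=1)  # connect the two parts
      </preformat>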
      <sec id="sec-4-1">
<p>
          The squeeze-and-excitation (SE) module is inserted into the PG unit to impose an attention
mechanism with low computational cost [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The Squeeze part of the main branch compresses the
features in the channel dimension, and the Excitation part learns the feature weights of the channel.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>The core idea is that the model learns the attention weight of the channel by loss, so that the weight of the effective feature map is relatively large, and the weight of the invalid feature map is relatively small. We use two 1x1 convolutions instead of fully connected layers, and use Hard Sigmoid activation instead of ReLU, which reduces the amount of computation.</title>
        <p>
          activates and weights the output of the last convolutional layer and visualizes the result in different
colors, which shows which parts of the image the model focuses more on [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Due to the
effectiveness of the proposed method, most of the categories can be highly focused.
        </p>
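<p>A sketch of this SE variant with 1x1 convolutions and Hard Sigmoid (the reduction ratio is our assumption):</p>
        <preformat>
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation [8] variant described above: 1x1 convolutions
    in place of fully connected layers, and Hard Sigmoid in place of the
    usual sigmoid, to lower the computational cost."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.pool = nn.AdaptiveAvgPool2d(1)       # Squeeze: global context
        self.fc = nn.Sequential(
            nn.Conv2d(channels, squeezed, 1),     # Excitation, stage 1
            nn.ReLU(inplace=True),
            nn.Conv2d(squeezed, channels, 1),     # Excitation, stage 2
            nn.Hardsigmoid(inplace=True),         # cheap gating activation
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))          # channel-wise reweighting
        </preformat>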
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.2. PG and Fused-PG Units</title>
<p>
        MBConv differs from the traditional dimensionality-reduction process of residual blocks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
features input to the inverted residual block are first expanded to a higher dimension and then depthwise-
mapped to a lower-dimensional space. The PG unit inherits this process. As shown in figure 1, the
parameters of the network structure are reduced in the PG unit by using a modified GhostModule
instead of a plain CNN. Constant mapping is redundant in a lightweight model (meaning that the input
features go to the next layer without modification) because it deprives the branch of the ability
to obtain abstract features. We add Pooling and a GhostModule to the branch to selectively output
the generated branch-specific folded embeddings by thresholding, so the branch can freely choose its
relationship to the backbone; we call this parallel gating.
      </p>
      <p><bold>3.2.1. PG unit.</bold></p>
<p>[Table 1. Accuracy (%) of the B0, B1 and B2 variants under different Depthwise configurations. B0: 97.3, 96.8, 98.3; B1: 98.0, 96.9, 98.4; B2: 97.9, 95.4, 99.0.]</p>
<p>
        In the backbone part, a 3x3 DepthwiseConv expands the previous feature. The attention
score is computed in the SE module, which makes the model focus on the features that are more
important to the channel. The dimension is then reduced with a 1x1 GhostModule. The average pooling
layer in the branch selectively compresses features, acting as gating and local-area feature
aggregation. Downsampling and fusion are performed using the Ghost module. Finally, the
trunk and branch parts are connected. Stochastic depth is used to prevent network model degradation. The
simplified mathematical expression of the whole process is equation (1) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
<disp-formula id="eq1">
        <label>(1)</label>
        <tex-math>y_i = h\big( b_i \cdot F_i(x_i) + G_i(x_i) \big), \qquad b_i \sim \mathrm{Bernoulli}(P_i)</tex-math>
      </disp-formula>
      <p>where F<sub>i</sub> and G<sub>i</sub> represent the features generated by the backbone and branch of the i-th module, h is the activation, and P<sub>i</sub> represents the survival probability of the gate b<sub>i</sub>, which follows a Bernoulli distribution, P<sub>i</sub> ∈ [0, 1]. Equations (2) and (3) expand the backbone and branch terms through the GhostModule with ratio s = 2, i.e. half of the intermediate feature maps are generated by cheap linear mappings.</p>
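<p>A schematic sketch of the PG unit under our reading of figure 1 and equation (1), reusing the SEBlock and GhostModule sketches above; the survival probability, normalization placement and activation are assumptions:</p>
      <preformat>
import torch
import torch.nn as nn

class PGUnit(nn.Module):
    """Schematic PG unit: DepthwiseConv + SE + 1x1 GhostModule in the
    backbone; average pooling + GhostModule as the gated branch.  During
    training the backbone is kept with probability p (stochastic
    depth [11]), matching equation (1).  In a Fused-PG unit the 1x1 conv
    and DepthwiseConv would be replaced by a single 3x3 convolution."""
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.p = survival_prob
        self.backbone = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1,
                      groups=channels, bias=False),  # 3x3 DepthwiseConv
            nn.BatchNorm2d(channels),
            SEBlock(channels),                 # channel attention
            GhostModule(channels, channels),   # 1x1 Ghost reduction
        )
        self.branch = nn.Sequential(
            nn.AvgPool2d(3, stride=1, padding=1),  # gating / local aggregation
            GhostModule(channels, channels),       # branch fusion
        )

    def forward(self, x):
        f, g = self.backbone(x), self.branch(x)
        if self.training:
            b = float(torch.rand(1).item() &lt; self.p)  # Bernoulli gate b_i
            return torch.relu(b * f + g)
        return torch.relu(self.p * f + g)  # expectation at inference
      </preformat>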
<p><bold>3.2.2. Fused-PG.</bold></p>
<p>PG uses DepthwiseConv to reduce computation, but this is of limited benefit in the early stages. As can be
seen from table 1, if all modules use Depthwise, the performance drops. Therefore, we use the fused
variant only in the first few stages of the model. In the Fused-PG module, the 1x1 convolution and the
DepthwiseConv are replaced by a single 3x3 convolution. The
simplified mathematical expression is equation (4).</p>
<disp-formula id="eq4">
        <label>(4)</label>
        <tex-math>y_i = h\big( b_i \cdot \tilde{F}_i(x_i) + G_i(x_i) \big)</tex-math>
      </disp-formula>
      <p>where the backbone term is computed by the fused 3x3 convolution.</p>
      <p><bold>Algorithm 1.</bold> Adaptive progressive regularization (APR)</p>
      <p>Input: network blocks length n, initial image size S<sub>0</sub>, final image size S<sub>e</sub>, initial regularization dropout rate d<sub>0</sub>, adjustment factors λ, β, μ and threshold ϱ. Output: trained model.</p>
      <preformat>
 1: for i = 1 to n do
 2:   if i ≥ ϱ then
 3:     last block survival probability: p_n ← …
 4:   else
 5:     last block survival probability: p_n ← …
 6:   end if
 7:   image size or feature map size: S_i ← S_e − (…)(S_e − S_0)
 8:   dropout rate: d_i ← μ(…)
 9:   survival probability: p_i ← 1 − (1 − p_n)(…)
10:   train model with S_i, d_i and p_i
11: end for
      </preformat>
    </sec>
    <sec id="sec-6">
      <title>3.3. Adaptive Progressive Regularization</title>
<p>Similar to EfficientNetV2, we consider the regularization problem for the training of a
multi-granularity variable model. First, regularization is adapted to the network depth. Second, the survival
probability in stochastic depth is considered. Third, the adaptive probability calculation
expression of dropout is improved. In PG-PRNet, the head has more redundant information, and a
larger regularization factor is required to improve the generalization ability. In the tail, the features are
mapped to a high-dimensional abstract space with smaller feature maps, so a smaller regularization factor
is used. For lightweight models, residuals are very important. When the model is very shallow, the residuals
should be kept as far as possible; when the model is complex, the residuals can be discarded appropriately. It
is therefore not reasonable to use the same regularization rules all the time, and the survival probability
and dropout rate need to be flexibly adjusted to fit the feature size and network depth. The
identifiers are defined as follows:</p>
      <sec id="sec-6-1">
        <title>The length of the network module is  , and if  is larger, a higher regularization rate is required,</title>
      </sec>
      <sec id="sec-6-2">
        <title>The whole model has</title>
        <p>and the ratio of the two is controlled by λ.
map size.</p>
        <p>stages. And the features of the middle hidden layer gradually decrease
from the first stage to the last stage, and the dropout rate is positively related to the feature</p>
      </sec>
      <sec id="sec-6-3">
        <title>The scale coefficients of feature map size and survival probability are β, μ, respectively, and the</title>
        <p>overall steps can be described as algorithm 1. The ablation experiments in Section 4.3 further
elaborate and demonstrate the effectiveness of APR.</p>
      </sec>
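<p>A hypothetical rendering of such a schedule (linear interpolation across stages is our assumption; the exact update rules of algorithm 1 are only partially recoverable):</p>
      <preformat>
def apr_schedule(num_stages, s0, s_end, d0, p_end=0.8):
    """APR-style schedule sketch: image size and dropout grow with the
    stage index (progressive learning, cf. EfficientNetV2), while the
    survival probability of deeper blocks decays linearly (stochastic
    depth).  Linear interpolation is an assumption, not the paper's rule."""
    for i in range(1, num_stages + 1):
        t = i / num_stages
        size = round(s0 + t * (s_end - s0))    # image / feature map size
        drop = d0 * t                          # dropout rate, tied to size
        survival = 1.0 - t * (1.0 - p_end)     # survival probability p_i
        yield size, drop, survival

# Example: four stages from 48x48 up to 224x224 resolution.
for size, drop, p in apr_schedule(4, 48, 224, 0.3):
    print(size, round(drop, 3), round(p, 3))
      </preformat>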
    </sec>
    <sec id="sec-7">
      <title>4. Experiments</title>
      <sec id="sec-7-1">
        <title>All our experiments were done on a Nvidia RTX 2080Ti server using Pytorch. In the parameters of</title>
        <p>adaptive regularization, we set the threshold ϱ = 11,  = 0.25, β = 1, λ = 7 . We validate the
PG</p>
      </sec>
      <sec id="sec-7-2">
        <title>PRNet feature extraction performance on two tasks on two datasets.</title>
      </sec>
    </sec>
    <sec id="sec-8">
<title>4.1. PG-PRNet for Image Recognition</title>
<p>
        The recognition of traffic signs is a challenging real-world problem related to intelligent transportation
systems. The German Traffic Sign Recognition Benchmark (GTSRB) contains more than 50,000
images of daytime and nighttime scenes from 43 categories [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Images that are too similar are
removed using the Structural Similarity Index (SSIM) algorithm. The mean and variance of the local
and global luminance of each image were calculated for adaptive luminance and contrast
enhancement, and the distribution of each category was approximated after processing. Using the
cross-entropy loss function, Adam optimizer and Cosine Annealing scheduler, we set the weight
decay factor = 0.0005, initial learning rate = 0.001, batch size = 64, and epochs = 100. The design
resolution varies from 48 to 224. The training set was preprocessed using random cropping and Gaussian
noise. To validate the performance of the feature extractor, a PG-PRNet feature extractor with a
classification head was added to evaluate its image recognition performance. It mainly includes a
global average pooling layer, aggregated features and features compressed by a fully connected layer,
and softmax output of category probabilities.
      </p>
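<p>This training setup maps directly onto standard PyTorch components; a minimal sketch (the model stand-in is ours, not the actual PG-PRNet):</p>
      <preformat>
import torch
import torch.nn as nn

# Stand-in for the PG-PRNet classifier; any nn.Module works here.
# Global average pooling, flatten, then a fully connected layer over
# the 43 GTSRB classes (softmax is folded into the loss below).
model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 43))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):       # epochs = 100
    # ... one pass over the GTSRB loader (batch size = 64) goes here ...
    scheduler.step()
      </preformat>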
    </sec>
    <sec id="sec-9">
<title>4.2. Result Analysis</title>
      <p>We use the inference time of a single 224-resolution image and the number of parameters as
indicators of network complexity, perform five runs, and take the average. The
results are shown in table 2. The PG-PRNet model uses the SE module and the GhostModule, so the
number of parameters increases; however, because both are computationally cheap, and thanks to the
use of Hard Sigmoid, layer pruning and DepthwiseConv, the single-image inference speed is
the best (56 ms &lt; 70 ms &lt; 74 ms). The number of parameters of B0 is second only to
EfficientNetV1, but the accuracy of the latter is much lower than that of our method.</p>
      <sec id="sec-9-3">
        <title>Thanks to the parallel gating unit, our model can obtain good shallow spatial position information while keeping light weight. Because of the parallel gating, it also has the function of selecting input features in the branch part, and mapping the features to high dimensions.</title>
        <p>224
98.3
98.4
99.0
87.0
96.5
98.3
97.7
Params(M)
3.2
5.4
7.2
10.2
0.7
22.4
4.0</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4.3. Ablation experiments</title>
      <sec id="sec-10-1">
        <title>Two adaptive regularization methods are considered: Dropout and Stochasitc depth. Larger  is</title>
        <p>used for larger features and smaller  is used for smaller features. The lower dimension contains more
spatial location information, but the higher dimension contains more abstract semantic information.</p>
      </sec>
      <sec id="sec-10-2">
        <title>Both kinds of information are very important for inference. It is not reasonable to use the same</title>
        <p>and  for the whole structure. In algorithm 1, the survival probability  and dropout rate  are
adaptively adjusted according to the size of the feature map. This problem is mitigated to some extent.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>4.4. PG-PRNet for Object Detection</title>
      <p>
BDD100K is a driving video dataset that can be used for a variety of autonomous driving
tasks, containing up to 100,000 images across 10 task scenarios. In this paper, suited to
lightweight models, we use a subset of 10,000 of these autonomous driving images to test
performance on object detection. As shown in figure 3, the output features of the last three
layers of PG-PRNet are extracted using the Neck of YOLOv4 [
        <xref ref-type="bibr" rid="ref13">13</xref>
          ]. In addition to Mosaic data augmentation,
we introduce Copy-Paste data augmentation to improve the detection accuracy of small targets
[14]. Finally, the three scales of features are output and the corresponding target detection results are
obtained after post-processing (NMS). We list three images as a reference for the results. The
parameters are set to batch size = 16, loop scheduler and Adam optimizer.
      </p>
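<p>The NMS post-processing step can be reproduced with torchvision; a toy sketch (the threshold value is illustrative):</p>
      <preformat>
import torch
from torchvision.ops import nms

# Toy post-processing: keep detections whose IoU with any
# higher-scoring box does not exceed the threshold.
boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.],
                      [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): the overlapping lower-score box is suppressed
      </preformat>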
    </sec>
    <sec id="sec-12">
<title>4.5. Results Analysis</title>
      <p>It can be seen that, trained for only 300 epochs, our model is competitive. The best result is
achieved with 11.5 million parameters and 311 ms inference time, and the deepened
model has a significant performance improvement. This demonstrates that the parallel gating unit
effectively improves the feature representation of the branch, making the feature extraction capability
of PG-PRNet still highly applicable even after many model compression methods. In figure 4, six
typical difficulties of target detection in real-time traffic scenarios are listed. Our approach maintains
high detection accuracy and robustness. A parallel gating unit is used in combination with an adaptive
progressive regularization algorithm. The Copy-Paste and Mosaic based approach reduces overfitting,
improves model generalization, and enhances performance in scenes with occlusion, too many small
targets, rain, multiple categories, and video motion blur.</p>
    </sec>
    <sec id="sec-13">
      <title>5. Conclusion</title>
<p>In this work, we propose a lightweight parallel gated feature extraction framework that represents the
residual branch information of a given feature in a new cascade, changing the constant
mapping of the residual structure in lightweight networks. In addition, an adaptive progressive
regularization algorithm adapts the regularization rules to features of different sizes and
networks of different scales; the resulting model is called PG-PRNet. The framework is embedded into image recognition and
object detection to verify its feature extraction capability, and our model achieves the best results in model
size and accuracy among the compared methods. Its efficiency at variable resolution is also demonstrated.</p>
    </sec>
    <sec id="sec-14">
      <title>6. References</title>
      <p>[14] Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T. Y., Cubuk, E. D. and Zoph, B., 2021. Simple
copy-paste is a strong data augmentation method for instance segmentation. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online Meeting. (pp.
2918-2928).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Efficientnet: Rethinking model scaling for convolutional neural networks</article-title>
          .
          <source>In: International conference on machine learning</source>
          .
          <source>Long Beach</source>
          , California.
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Ghostnet: More features from cheap operations</article-title>
          .
          <source>In: Proceedings of the IEEE/CVF conference on computer vision</source>
          and pattern recognition. Seattle, WA, USA. (pp.
          <fpage>1580</fpage>
          -
          <lpage>1589</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G. E.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>25</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , Cheng, Y.,
          <string-name>
            <surname>Bapna</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firat</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , (
          <year>2019</year>
          ).
          <article-title>Gpipe: Efficient training of giant neural networks using pipeline parallelism</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>32</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision</source>
          and pattern recognition.
Las Vegas, NV, USA. (pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Zoph</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Neural architecture search with reinforcement learning</article-title>
          .
<source>arXiv preprint arXiv:1611.01578</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <year>2017</year>
          . Xception:
          <article-title>Deep learning with depthwise separable convolutions</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition. Honolulu</source>
, HI, USA. (pp.
          <fpage>1251</fpage>
          -
          <lpage>1258</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <year>2018</year>
          .
          <article-title>Squeeze-and-excitation networks</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision</source>
          and pattern recognition.
          <source>Salt Lake City</source>
, UT, USA. (pp.
          <fpage>7132</fpage>
          -
          <lpage>7141</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Selvaraju</surname>
            ,
            <given-names>R. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cogswell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vedantam</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <year>2017</year>
          .
          <article-title>Grad-cam: Visual explanations from deep networks via gradient-based localization</article-title>
          .
          <source>In: Proceedings of the IEEE international conference on computer vision</source>
          . Venice, Italy. (pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalenichenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weyand</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andreetto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
<string-name>
            <surname>Adam</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <year>2017</year>
          . MobileNets:
          <article-title>Efficient Convolutional Neural Networks for Mobile Vision Applications</article-title>
          . ArXiv, abs/1704.04861.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sedra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K. Q.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Deep networks with stochastic depth</article-title>
          .
          <source>In: European conference on computer vision</source>
          . Amsterdam, Netherlands. (pp.
          <fpage>646</fpage>
          -
          <lpage>661</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Houben</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stallkamp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salmen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlipsing</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Igel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark</article-title>
          .
          <source>In: The 2013 international joint conference on neural networks (IJCNN)</source>
          . Dallas, TX, USA. (pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Bochkovskiy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C. Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>H. Y. M.</given-names>
          </string-name>
          ,
          <year>2020</year>
          . Yolov4:
          <article-title>Optimal speed and accuracy of object detection</article-title>
. arXiv preprint arXiv:2004.10934.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>