PG-Prnet: A Lightweight Parallel Gated Feature Extractor Based on An Adaptive Progressive Regularization Algorithm Zhe Zhang1, Ming Ye1*, Yongsheng Xie1 and Yan Liu2 1 College of Artificial Intelligence, Southwest University, Chongqing, 400700, China 2 Chongqing Market Supervision Administration Archives Information Center, Chongqing, 400700, China Abstract The residual block in deeper DNNs has a positive effect on feature extraction, but it is limited by practical computational resources. Deeper structures have limited performance gains in later stages, while residuals in lightweight DNNs reduce the abstract feature representation capability. We propose a lightweight parallel gating framework (PG-PRNet) based on the adaptive progressive regularization algorithm (APR), which changes the constant mapping of residual, increases the representation of structural information, and compresses the structure by Hard-Sigmoid, layer pruning, etc. The APR algorithm avoids the irrationality of using the same regularization rules in different cases. This better preserves the shallow spatial location information and deep abstract semantic information, improving the performance of the lightweight model for different specification. PG-PRNet is embedded in two vision tasks. It outperforms the listed models on the GTSRB and BDD100K datasets while maintaining low storage and computational overhead. Keywords parallel gating; progressive regularization; feature extraction; residual block 1. Introduction 1 DNNs can learn the intrinsic properties and underlying semantic features of data from a large number of samples. To a certain extent, the more complex the network is, the more high-dimensional abstract semantic features are obtained. Researchers have proposed many methods to design deeper models. EfficientNetv2 finds a balance between depth, width and resolution to build complex structures [1]. Performs well after pre-training on large datasets. However, practical hardware conditions limit this. In this paper, we propose a GhostModule-based parallel gated feature extractor (PG-PRNet) to selectively control feature embedding into branches, change the traditional constant mapping of residual branches to lighten the network, introduce stochastic depth to prevent network overfitting [2]. We also use Hard Sigmoid and Layer Pruning to further reduce the model parameters. Due to the variety in the dimensionality of the inputs and the depth of the network, it is not reasonable to train the model using the same regularization rules. Therefore, an Adaptive Progressive Regularization (APR) algorithm is also proposed to solve this problem. The effectiveness of PG- PRNet in two vision tasks was experimentally demonstrated. The main contributions of this paper can be summarized as follows.  We propose a parallel gating unit (PG and Fused-PG) consisting of GhostModule, SE and DepthwiseConv as an intermediate module of the network, improve the constant mapping of residual branches, and configure three specifications of PG-PRNetB0 to PG-PRNetB2.  We use an adaptive progressive regularization algorithm to solve the unreasonable problem of using the same regularization rules for features of different sizes, resolutions, and network ICBASE2022@3rd International Conference on Big Data & Artificial Intelligence & Software Engineering, October 21- 23, 2022, Guangzhou, China zhangandzhe@foxmail.com (Zhe Zhang), 2323247608@qq.com (Yongsheng Xie), 12167292@qq.com (Yan Liu) * Corresponding author: zmxym@swu.edu.cn (Ming Ye) Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) 263 specifications. The shallow spatial location information and deep abstract semantic information are better preserved.  We embed the proposed feature extraction framework into image recognition and object detection, and validate the performance of PG-PRNet on GTSRB and BDD100K datasets. 2. Related Work In its early years focused on improving accuracy by building more complex neural networks. AlexNet designed a DNNs with 60 million parameters and 60,000 neurons, which earned first place in the ImageNetLSVRC-2012 competition [3]. Parallel computing on dozens of devices by Google confirms that distributing the model across multiple devices is another solution [4]. However, in recent years, scholars have found that simply increasing the depth of the model can lead to performance degradation. ResNet shows that as the depth of the network increases, the accuracy gain obtained later decreases due to overfitting, gradient disappearance, etc [5]. The residual structure adopted by ResNet preserves the shallow spatial location information as much as possible. This avoids the above problems to a large extent. SOTA models usually use neural network structure search (NAS) to find the best structural parameters for building the network [6]. This places higher demands on the hardware. Some researchers are working on network compression. DepthwiseConv most assigning only one set of convolutional kernels to each channel can achieve great speedups with little loss of accuracy [7]. GhostModule presents a plug-and-play module that reduces intermediate feature maps and allows models to be easily deployed on mobile devices. In this paper, GhostModule and DepthwiseConv are used to build PG-PRNet lightweight networks, which combines layer pruning and Hard Sigmoid. 3. Methodology The overall network structure is shown in figure 1. In the feature extraction part, in order to avoid the computational overload caused by the large feature embedding in the later stage, the input image is first passed through a CBR block, which increases the channel dimension and reduces the width and height scales. Then there are multiple Fused-PG and PG units proposed in this paper. The detailed description of the improvement points is as follows: 3.1. Model compression. In this paper, layer pruning is used to reduce the overall scale. In order to minimize the sacrifice of accuracy, parallel gating is used, and the representation of residual branches is added. Multiple parallel gating units form a cascade feature representation. According to GhostNet, each trained DNN contains many similar intermediate feature maps. We start by generating only half of the intermediate feature maps, generating the same number of features by linear mapping, called Ghost. Finally, connect the two parts in series. Extensive use of GhostModule and DepthwiseConv in the PG unit reduces the amount of computation. The squeeze and excitation module (SE) is inserted into the PG unit to impose an attention mechanism with low computational cost [8]. The Squeeze part of the main branch compresses the features in the channel dimension, and the Excitation part learns the feature weights of the channel. The core idea is that the model learns the attention weight of the channel by loss, so that the weight of the effective feature map is relatively large, and the weight of the invalid feature map is relatively small. We use two 1x1 convolutions instead of fully connected layers, and use Hard Sigmoid activation instead of ReLU, which reduces the amount of computation. 264 Figure 1. The pipeline of PG-PRNet. Four images from the GTSRB dataset are listed here. Grad-CAM activates and weights the output of the last convolutional layer and visualizes the result in different colors, which shows which parts of the image the model focuses more on [9]. Due to the effectiveness of the proposed method, most of the categories can be highly focused. 3.2. PG and Fused-PG Units MBConv differs from the traditional process of dimensionality reduction of residuals [10]. The features input to the inverse residual block are first expanded to higher dimensions and then deeply mapped to the lower dimensional space. The PG unit inherits this process. As shown in figure 1, the parameters of the network structure are reduced in the PG unit by using a modified GhostModule instead of CNN. Constant mapping is redundant in the lightweight model (which means that the input features go to the next layer without modification) because it deprives the branching part of the ability to obtain abstract features. We add Pooling and GhostModule to the branch part to selectively output the generated branch-specific folded embeddings by thresholding, which can freely choose the relationship to the backbone part, which is called parallel gating. Table 1. Compare Top-1 classification accuracy with Fused or not (224 pixels on GTSRB). Resolution:224 B0(%) B1(%) B2(%) All-Fused 97.3 98.0 97.9 No-Fused 96.8 96.9 95.4 Partial-Fused 98.3 98.4 99.0 3.2.1. PG unit. In the backbone part, a DepthwiseConv of size 3x3 extends the previous feature. The attention score is computed in the SE module, which makes the model focus on features that are more important to the channel. Then reduce the dimension with 1x1GhostModule. The average pooling layer in the branch section selectively compresses features, acting as gating and local area feature aggregation. Downsampling and fusion are performed using the Ghost module. Finally connect the trunk and branch parts. Stochastic depth is used to prevent network model degradation. The simplified mathematical expression of the whole process is equation (1) [11]. 𝐹 =𝐹 𝑃 𝐹 (1) 265 Where 𝐹 , 𝐹 represent the generated feature by the 𝐿 -th module backbone and branc. P represents the survival probability of 𝐹 , which fits the Bernoulli distribution, 𝑃 ∈ [0,1]. 𝐹 , 𝐹 are expressed as. 𝐹 = 𝐺𝑀{𝑆𝐸[π·π‘’π‘π‘‘β„Ž(𝐹 )]} (2) 𝐺𝑀 π‘ƒπ‘œπ‘œπ‘™ 𝐹 ,𝑠 = 2 𝐹 = (3) 𝐺𝑀 𝐹 , 𝑒𝑙𝑠𝑒 3.2.2. Fused-PG. PG uses DepthwiseConv to reduce computation, but it is limited in the early stages. As can be seen from table 1, if all modules use Depthwise, the performance will drop. Therefore, we only use it in the first few stages of the model. In the Fused-PG module, the 1x1 CNN and Depthwise are replaced by 3x3 for convolution to reduce computation, and DepthwiseConv is removed. The simplified mathematical expression is equation (4). 𝐹 = 𝐺𝑀{𝑆𝐸[𝐺𝑀 𝐹 ]} (4) Algorithm 1. Adaptive progressive regularization (APR) Input: Network blocks length 𝐿 , initial image size 𝑆 , final image size 𝑆 , initial regularization dropout rate 𝑑 , adjustment factor πœ†, 𝛽, πœ‡, 𝜚 Output: Trained model. 1: if 𝐿 β‰₯ 𝜚 then 2: Last blockc survival probability: 𝑃 ← 3: else 4: Last blockc survival probability: 𝑃 ← 1.0 5: end if 6: for 𝑖 = 1 to 𝐿 do 7: Image size or feature map size: 𝑆 ← 𝑆 βˆ’ (𝑆 βˆ’ 𝑆 ) 8: Dropout rate: 𝑑 ← ( ) 9: Survival probability: 𝑃 ← 1 βˆ’ (1 βˆ’ 𝑃 ) 10: Train model with 𝑑 and 𝑃 11: end for 3.3. Adaptive Progressive Regularization Similar to EfficientNetv2, we consider the regularization problem for the training of a multi- granularity variable model. First, we add the regularization to the network depth. Second, the survival probability problem at stochastic depth is considered. Third, the adaptive probability calculation expression of dropout is improved. In PG-PRNet, the head has more redundant information, and a larger regularization factor is required to improve the generalization ability. In the tail, the features are mapped to a high-dimensional abstract space with smaller features, so a smaller regularization factor is used. For lightweight models, residuals are very important. When the model is very shallow, try to keep the residuals. When the model is complex, the residuals are discarded appropriately. Therefore it is not reasonable to use the same regularization rules all the time. Therefore, the survival probability and dropout rate need to be flexibly adjusted to fit the feature size and network depth. There are identifiers defined as.  The length of the network module is 𝑙, and if 𝑙 is larger, a higher regularization rate is required, and the ratio of the two is controlled by Ξ».  The whole model has 𝑀 stages. And the features of the middle hidden layer gradually decrease from the first stage to the last stage, and the dropout rate is positively related to the feature map size. 266 The scale coefficients of feature map size and survival probability are Ξ², ΞΌ, respectively, and the overall steps can be described as algorithm 1. The ablation experiments in Section 4.3 further elaborate and demonstrate the effectiveness of APR. 4. Experiments All our experiments were done on a Nvidia RTX 2080Ti server using Pytorch. In the parameters of adaptive regularization, we set the threshold Ο± = 11, 𝑒 = 0.25, Ξ² = 1, Ξ» = 7. We validate the PG- PRNet feature extraction performance on two tasks on two datasets. 4.1. PG-PRNet for Image Recongnition The recognition of traffic signs is a challenging real-world problem related to intelligent transportation systems. The German Traffic Sign Recognition Benchmark (GTSRB) contains more than 50,000 images of daytime and nighttime scenes from 43 categories [12]. Images that are too similar are removed using the Structural Similarity Index (SSMI) algorithm. The mean and variance of the local and global luminance of each image were calculated for adaptive luminance and contrast enhancement, and the distribution of each category was approximated after processing. Using the cross-entropy loss function, Adam optimizer and Cosine Annealing scheduler, we set the weight decay factor = 0.0005, initial learning rate = 0.001, batch size = 64, epoch = 100. the resolution of the design varies from 48 to 224. The training set was preprocessed using random cropping, Gaussian noise. To validate the performance of the feature extractor, a PG-PRNet feature extractor with classification head was added to evaluate its image recognition performance. It mainly includes a global average pooling layer, aggregated features and features compressed by a fully connected layer, and softmax output of category probabilities. 4.2. Result analyse We use the inference time of a single image with 224 resolution and the amount of parameters as an indicator of network complexity, perform five calculations, and finally take the average. The results are shown in table 2. The PG-PRNet model uses the SE module and GhostModule, so the amount of parameters has been improved, but due to the calculation amount of the two, as well as the use of Hard Sigmoid, layer pruning and DepthwiseConv, therefore, the picture The inference speed is the best (56 ms < 70 ms < 74 ms), where the number of parameters of B0 is second only to EfficientNetV1, but the accuracy of the latter is much lower than our method. Thanks to the parallel gating unit, our model can obtain good shallow spatial position information while keeping light weight. Because of the parallel gating, it also has the function of selecting input features in the branch part, and mapping the features to high dimensions. Table 2. Performance comparison of image recognition tasks on the GTSRB dataset (TOP-1 accuracy (%)). Methods 48 96 160 224 Params(M) Infer-time(ms) PG-PRNet_B0(Ours) 92.7 93.4 96.8 98.3 3.2 56 PG-PRNet_B1(Ours) 92.5 93.1 96.9 98.4 5.4 81 PG-PRNet_B2(Ours) 91.8 93.3 98.4 99.0 7.2 100 Vision Transformer(P=16) 86.3 89.7 87.5 87.0 10.2 110 EfficientNet V1 89.5 92.3 94.6 96.5 0.7 74 EfficientNet V2 92.0 93.4 97.8 98.3 22.4 245 GhostNet 80.6 89.7 96.5 97.7 4.0 70 267 4.3. Ablation experiments Two adaptive regularization methods are considered: Dropout and Stochasitc depth. Larger 𝑝 is used for larger features and smaller 𝑝 is used for smaller features. The lower dimension contains more spatial location information, but the higher dimension contains more abstract semantic information. Both kinds of information are very important for inference. It is not reasonable to use the same 𝑝 and 𝑑 for the whole structure. In algorithm 1, the survival probability 𝑝 and dropout rate 𝑑 are adaptively adjusted according to the size of the feature map. This problem is mitigated to some extent. Figure 2. Comparison of Top-1 accuracy with and without adaptive progressive regularization algorithm. Solid dots: With APR. Hollow rectangle: Not APR. Figure 3. Pipeline to embed PG-PRNet into YOLOv4 model. Using PG-PRNet to generate the feature vectors of the last three layers, through the Neck of YOLOv4, the three output feature matrices are obtained, and the final detection results are generated after some complex post processing. 268 Figure 4. List the detection results under 6 typical target detection difficulties (Video Motion Blur, Continuously Stacked Small Objects, Multi-Category At Night, Rain Disturbance, Complex Road Conditions, Highway During At Day). 4.4. PG-PRNet for Object Detection BDD100K is a traffic driving video dataset that can be used for a variety of autonomous driving task scenarios, containing up to 100,000 images for 10 task scenarios. In this paper, for the use of lightweight models, we use a subset of 10,000 of these images of autonomous driving scenarios to test performance in terms of target detection. As shown in figure 3, the output features of the last three layers of PG-PRNet are extracted using the Neck of YOLOv4 [13]. Using Mosaic data enhancement, we introduced copy-and-paste data enhancement to improve the detection accuracy of small targets [14]. Finally, the three scales of features are output and the corresponding target detection results are obtained after post-processing (NMS). We list three images as a reference for the results. The parameters are set to batch size = 16, loop scheduler and Adam optimizer. Table 3. We use the listed model as the Backbone, connected to the YOLOv4 Neck. Compare our proposed method (Bold) with other methods. We complete mAP50, mAP75, single image parameter and inference time calculation on a 608-pixel image. Method mAP50(%) mAP75(%) Params(M) Infer-time(ms) PG-PRNet_B0(Ours) 55.6 27.7 11.5 311 PG-PRNet_B1(Ours) 56.2 28.9 13.0 332 PG-PRNet_B2(Ours) 56.8 30.2 15.8 356 GhostNet 43.8 19.7 11.9 344 MobileNetv1 49.3 22.5 12.5 320 MobileNetv2 41.9 16.2 10.2 372 MobileNetv3 42.2 18.3 11.4 363 DenseNet121 48.8 20.4 16.5 645 DenseNet169 49.1 20.8 22.6 873 DenseNet201 54.7 22.1 27.8 946 4.5. Results Analyze. It can be seen that in the case of training only 300 epoches, our model is advanced. The optimum is achieved with 11.5 millions number of parameters and 311 ms inference time, and, the deepened model has a significant performance improvement. This demonstrates that the parallel gating unit effectively improves the feature representation of the branch, making the feature extraction capability of PG-PRNet still highly applicable even after many model compression methods. In figure 4, six 269 typical difficulties of target detection in real-time traffic scenarios are listed. Our approach maintains high detection accuracy and robustness. A parallel gating unit is used in combination with an adaptive progressive regularization algorithm. The Copy-Paste and Mosaic based approach reduces overfitting, improves model generalization, and enhances performance in scenes with occlusion, too many small targets, rain, multiple categories, and video motion blur. 5. Conclusion In this work, we propose a lightweight parallel gated feature extraction framework to represent the residual branching information of a given feature in a new cascade, which changes the constant mapping of the residual structure in lightweight networks. In addition, an adaptive progressive regularization algorithm is used to adapt the regularization rules for different size features and different scale networks, called PG-PRNet. The framework is embedded into image recognition and object detection to verify its feature extraction capability, and our model achieves optimality in model volume and accuracy. Its efficiency at variable resolution is demonstrated. 6. References [1] Tan, M., Le, Q. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. Long Beach, California. 6105-6114. [2] Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C. and Xu, C. 2020. Ghostnet: More features from cheap operations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Seattle, WA, USA. (pp. 1580-1589). [3] Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25. [4] Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M. and Wu, Y., (2019). Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32. [5] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Las Vegas, NV, USA. (pp. 770-778). [6] Zoph, B. and Le, Q. V., 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. [7] Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Honolulu, HI, USA. (pp. 1251-1258). [8] Hu, J., Shen, L. and Sun, G., 2018. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Salt Lake City, UT, USA. (pp. 7132-7141). [9] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. and Batra, D., 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. Venice, Italy. (pp. 618-626). [10] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H., 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. ArXiv, abs/1704.04861. [11] Huang, G., Sun, Y., Liu, Z., Sedra, D. and Weinberger, K. Q., 2016. Deep networks with stochastic depth. In: European conference on computer vision. Amsterdam, Netherlands. (pp. 646-661). [12] Houben, S., Stallkamp, J., Salmen, J., Schlipsing, M. and Igel, C., 2013. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In: The 2013 international joint conference on neural networks (IJCNN). Dallas, TX, USA. (pp. 1-8). [13] Bochkovskiy, A., Wang, C. Y. and Liao, H. Y. M., 2020. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. 270 [14] Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T. Y., Cubuk, E. D. and Zoph, B., 2021. Simple copy-paste is a strong data augmentation method for instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online Meeting. (pp. 2918-2928). 271