=Paper=
{{Paper
|id=Vol-3131/regular7
|storemode=property
|title=Pruning Network Based Knowledge Distillation for Offline Handwritten Chinese Character Recognition
|pdfUrl=https://ceur-ws.org/Vol-3131/paper7.pdf
|volume=Vol-3131
|authors=Zhuo Li,Yongping Dan,Zongnan Zhu,Dinggen Zhang
|dblpUrl=https://dblp.org/rec/conf/atait/LiDZZ21
}}
==Pruning Network Based Knowledge Distillation for Offline Handwritten Chinese Character Recognition==
Zhuo Li (lizhuo970604@gmail.com), Yongping Dan (6100@zut.edu.cn), Zongnan Zhu, and Dinggen Zhang
Zhongyuan University of Technology, Zhengzhou, Henan, China

Abstract

Recently, deep convolutional neural networks have brought great breakthroughs in image classification and provide an effective solution to the handwritten Chinese character recognition problem. Researchers have experimented with various networks to increase recognition accuracy. Although good accuracy has been achieved with these networks, they tend to be computation- and memory-intensive, which makes them difficult to deploy on resource-constrained devices. To address this problem, this paper proposes an optimization that reduces the number of model parameters through network pruning and knowledge distillation. In addition, an attention mechanism is adopted to improve the model's ability to extract input features. The experimental results show that the number of parameters decreases by nearly 26%, while the recognition accuracy improves by 1.17% over the original model, reaching 96.99%. The proposed optimization therefore not only improves the accuracy of handwritten Chinese character recognition but also reduces the number of model parameters.

1 Introduction

With the continuous development of Chinese culture, handwritten Chinese character recognition (HCCR) has attracted more and more attention and has become an important research topic. HCCR is widely used in many fields, such as automatic bill recognition, handwritten Chinese character entry, cultural heritage preservation [1], and automated teaching and office work. In recent years, convolutional neural networks (CNNs) have made great progress and breakthroughs in the field of computer vision, mainly due to the design of different network structures; for example, AlexNet [2], VGG [3], GoogLeNet [4] and ResNet [5] have shown excellent performance on HCCR tasks. Although these networks have been very successful for HCCR, they place large demands on computing resources, power consumption and storage space, which makes them difficult to deploy on embedded devices such as ARM boards and FPGAs with limited hardware resources, largely because CNNs contain a large number of redundant computations [6]. Therefore, reducing the number of parameters while maintaining accuracy has become a major research topic for HCCR.

The remainder of this paper is organized as follows: Section 2 briefly reviews related work. Section 3 introduces the attention mechanism, model pruning and knowledge distillation methods. Section 4 describes the experimental procedure and results in detail. Section 5 summarizes the paper and outlines future work.

2 Related work

Attention mechanisms are widely used in deep learning to enhance the performance of CNNs. [7] introduces the "Squeeze-and-Excitation" (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. SE blocks bring significant performance improvements to existing state-of-the-art CNNs at slight additional computational cost.
[8] proposes the convolutional block attention module (CBAM), which sequentially infers attention maps along two separate dimensions, channel and spatial; the attention maps are then multiplied with the input feature map for adaptive feature refinement. [9] proposes an efficient channel attention (ECA) module, which involves only a handful of parameters while bringing a clear performance gain; avoiding dimensionality reduction is important for learning channel attention, and appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. [10, 11] use weight quantization and Huffman coding to further minimize storage space. [12] eliminates unimportant channels by applying L1 regularization to the scale factors of the batch normalization (BN) layers. [13] uses least absolute shrinkage and selection operator (LASSO) regression to sparsify the weights and remove unimportant channels, and then uses least squares to ensure that the pruning operation has little impact on the features. [14] adopts a Taylor-expansion-based criterion to approximate the change in the loss function caused by pruning network parameters, giving a formulation for efficient inference in neural networks through pruning of convolution kernels. [15] introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. [16] presents matching guided distillation (MGD) as an efficient and parameter-free way to solve the problem of the adaptation module added in classic distillation methods.

3 Methods

3.1 Attention mechanism

The channel attention mechanism has demonstrated great potential for improving the performance of CNNs used for classification. The attention mechanism in deep learning draws on human attentional thinking to focus on the key information in an image rather than the whole image [17]. As shown in Fig. 1, the lightweight ECA attention module is used for HCCR.

Figure 1: ECA module. Given an H × W × C input feature map, global average pooling (GAP) produces a 1 × 1 × C aggregated feature; a 1D convolution with adaptively selected kernel size k = ψ(C) followed by a sigmoid produces the 1 × 1 × C channel weights, which rescale the input through an element-wise product.

Given an aggregated feature y ∈ R^C, channel attention without dimensionality reduction can be learned by Eq. (1),

ω = σ(W y),   (1)

where W is a C × C parameter matrix. In order to capture local cross-channel interaction while guaranteeing both efficiency and effectiveness, channel attention is learned with the band matrix W_k of Eq. (2):

W_k =
\begin{pmatrix}
w^{1,1} & \cdots & w^{1,k} & 0 & 0 & \cdots & \cdots & 0 \\
0 & w^{2,2} & \cdots & w^{2,k+1} & 0 & \cdots & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & \cdots & 0 & 0 & \cdots & w^{C,C-k+1} & \cdots & w^{C,C}
\end{pmatrix}   (2)

W_k involves k × C parameters. When all channels share the same learning parameters, the attention weight of channel i becomes

ω_i = σ( Σ_{j=1}^{k} w^j y_i^j ),   y_i^j ∈ Ω_i^k,   (3)

where Ω_i^k indicates the set of k adjacent channels of y_i. Eq. (3) can be readily implemented by a fast 1D convolution with kernel size k,

ω = σ(C1D_k(y)),   (4)

where C1D indicates 1D convolution. The method of Eq. (4) is called the ECA module and involves only k parameters. The channel dimension C is related to the convolution kernel size k as shown in Eq. (5), so k is chosen as in Eq. (6):

C = φ(k) = 2^{γ k − b},   (5)

k = ψ(C) = | log₂(C)/γ + b/γ |_odd,   (6)

where |t|_odd indicates the nearest odd number to t. γ and b are set to 2 and 1, respectively. Clearly, through the non-linear mapping ψ, high-dimensional channels have longer-range interaction while low-dimensional ones undergo shorter-range interaction.
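To make the computation in Eqs. (4)-(6) concrete, the following is a minimal PyTorch-style sketch of an ECA-like channel attention block. The class name, its placement in the network, and the rounding of k to the nearest odd value are illustrative assumptions based on the public ECA-Net formulation, not the authors' released code.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> 1D conv (kernel k) -> sigmoid -> rescale.
    The kernel size k is derived from the channel count C via k = psi(C), Eq. (6)."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1              # round to an odd kernel size
        self.avg_pool = nn.AdaptiveAvgPool2d(1)     # global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        y = self.avg_pool(x).view(n, 1, c)          # (N, 1, C): one 1D signal per sample
        y = self.sigmoid(self.conv(y)).view(n, c, 1, 1)
        return x * y                                # element-wise product: channel re-weighting
```

For example, with γ = 2 and b = 1, a layer with C = 512 channels gives k = 5, which is consistent with the kernel size listed later in Table 2.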
3.2 Model pruning

As shown in Fig. 2, channel pruning is a relatively coarse-grained form of pruning, accomplished by deleting redundant channels of the feature maps. A scale factor γ is introduced for each channel and multiplied with the channel output. The network weights and these scale factors are trained jointly, with sparsity regularization imposed on the latter. The channels deemed redundant according to their scale factors are pruned after training. The training objective is given by Eq. (7),

L = Σ_{(x,y)} l(f(x, w), y) + λ Σ_{γ∈Γ} g(γ),   (7)

where (x, y) denotes a training input and target, w denotes the trainable weights, the first term of Eq. (7) is the loss of regular convolutional network training, g(·) is a sparsity-inducing penalty on the scale factors, and λ is the balance factor between the two terms. In the experiments, g(s) = |s| is chosen, i.e. the L1 norm, which is widely used to achieve sparsity.

BN has been adopted by most modern CNNs as a standard technique for fast convergence and better generalization. Let z_in and z_out be the input and output of a BN layer and B the current mini-batch; the BN layer performs the transformation in Eq. (8):

ẑ = (z_in − µ_B) / √(σ_B² + ε),   z_out = γẑ + β,   (8)

where µ_B and σ_B are the mean and standard deviation of the input activations over B, and γ and β are trainable affine transformation parameters that make it possible to linearly transform the normalized activations back to any scale. In this way, the γ parameters of the BN layers can serve directly as the per-channel scale factors on which the sparsity penalty of Eq. (7) is applied.

Figure 2: Channel pruning
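As an illustration of Eq. (7) with g(s) = |s| on the BN scale factors, here is a minimal PyTorch-style training-step sketch. The helper names and the value of λ (lam) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def l1_bn_penalty(model: nn.Module) -> torch.Tensor:
    """Sum of |gamma| over all BatchNorm2d scale factors:
    the sparsity term of Eq. (7) with g(s) = |s|."""
    return sum(m.weight.abs().sum()
               for m in model.modules() if isinstance(m, nn.BatchNorm2d))

def training_step(model, x, y, optimizer, criterion, lam=1e-4):
    """One joint training step: task loss + lambda * L1 penalty on BN scales."""
    optimizer.zero_grad()
    loss = criterion(model(x), y) + lam * l1_bn_penalty(model)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the network slimming formulation [12], the same L1 term is often implemented equivalently by adding λ·sign(γ) to the gradient of each BN scale factor before the optimizer step; the sketch above simply lets autograd differentiate the penalty.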
3.3 Knowledge distillation

The goal of knowledge distillation is to use a large model's knowledge to guide the training of a small model so that the small model can match the large model's output. The large and small models are referred to as the teacher model and the student model, respectively. Fig. 3 depicts the structure. To obtain a better soft target, a temperature parameter T is introduced, as shown in Eq. (9),

q_i = exp(Z_i / T) / Σ_j exp(Z_j / T),   (9)

where Z_i is the value of the i-th category in the output vector (the logit), j ∈ {1, 2, ..., k}, and k is the total number of categories; exp is the exponential function and q_i is the soft target produced by the function. For the same input, when T is set to 1 the student network produces a hard target, while a higher value of T produces a softer probability distribution over classes, and the teacher network and the student network each generate a soft target. The hard target and the two soft targets are fed into cross-entropy loss functions to learn the weights. As a result, the objective function of knowledge distillation can be summarized as Eq. (10):

L = α L_soft + β L_hard.   (10)

Figure 3: Knowledge distillation. The teacher and student networks both apply a softmax at temperature T = t to produce soft labels and soft predictions for the distillation loss, while the student's softmax at T = 1 produces the hard prediction that is compared with the ground-truth label y for the student loss.
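Below is a minimal PyTorch-style sketch of the distillation objective in Eq. (10) using the temperature-softened probabilities of Eq. (9). The use of KL divergence for the soft term, the T² scaling, and the example values of α and β are common practice and are assumptions where the paper does not state them; the temperature itself is set to 5 in the experiments (Table 2).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 5.0, alpha: float = 0.7, beta: float = 0.3):
    """L = alpha * L_soft + beta * L_hard, as in Eq. (10).
    L_soft compares the temperature-softened distributions of student and teacher;
    L_hard is the usual cross-entropy against the ground-truth label (softmax at T = 1)."""
    soft_student = F.log_softmax(student_logits / T, dim=1)   # log of q_i, Eq. (9)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    # The T*T factor keeps the soft-target gradients on a scale comparable to the hard ones.
    l_soft = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    l_hard = F.cross_entropy(student_logits, labels)
    return alpha * l_soft + beta * l_hard
```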
4 Experiment

4.1 Experimental dataset

As shown in Fig. 4, the data on the left comes from the CASIA-HWDB1.1 dataset, a publicly available HCCR dataset provided by the Institute of Automation of the Chinese Academy of Sciences. Sixteen classes are selected from CASIA-HWDB1.1 as part of the dataset used in this paper. The right side shows the same types of Chinese characters written by different volunteers. The two parts are combined to form a new dataset, named the MiniHWDB dataset. As shown in Table 1, MiniHWDB contains 12,000 images. The dataset is split into a training set and a test set with a ratio of 8:2.

Figure 4: Offline handwritten Chinese character dataset

Table 1: MiniHWDB dataset

  Dataset    Total images   Training ratio   Image size   Classes
  MiniHWDB   12,000         0.8              256×256      16

4.2 Experimental process

During training and inference, the input is resized to 224×224 and the batch size is set to 64. The SGD optimizer is used to optimize the loss function. All experiments are conducted on a computer with a 3.00 GHz Intel(R) Core(TM) i7-9700 processor, 2×8 GB of RAM, and a GeForce RTX 2060 graphics card with 6 GB of video memory.

First, ResNet18 is adopted as the original network. As shown in Fig. 4, different people have their own writing styles, and the input contains a lot of useless information (white area), so the ECA attention mechanism is used to improve feature extraction from the input. In this way, the teacher network ECA-ResNet56 and the student network ECA-ResNet18 are obtained. Next, the student network is pruned at the channel level according to the pruning ratio, which is set to 0.4 or 0.6; a pruning rate of 0.4 means that 40% of the channels are pruned. The pruned student networks SC-ECA-ResNet18(0.4) and SC-ECA-ResNet18(0.6) are obtained after pruning is completed. Finally, the teacher network ECA-ResNet56 is used to guide the pruned student networks; the distilled student network is named KD-SC-ECA-ResNet18. The parameters used in the experiments are set as shown in Table 2.

Table 2: Parameter settings

  Description                            Value
  Adaptive selection of kernel size k    5
  Temperature T                          5
  Batch size                             64
  Minimum number of epochs               30
  Maximum number of epochs               100
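To show how a pruning ratio such as 0.4 or 0.6 can be turned into a concrete channel selection, below is a small PyTorch-style sketch that ranks the BN scale factors globally and builds per-layer keep-masks. The function name and the global-threshold strategy follow network slimming [12] and are assumptions, not necessarily the authors' exact procedure.

```python
import torch

def channel_masks_from_bn(bn_weights, prune_ratio=0.4):
    """Given the BN scale factors (gamma) of all prunable layers, return per-layer
    boolean masks that keep roughly the largest (1 - prune_ratio) fraction of channels
    under a single global magnitude threshold."""
    all_gammas = torch.cat([w.abs().flatten() for w in bn_weights])
    k = int(all_gammas.numel() * prune_ratio)        # number of channels to remove
    threshold = torch.sort(all_gammas)[0][k]          # global magnitude threshold
    # Channels whose |gamma| exceeds the threshold are kept.
    return [w.abs() > threshold for w in bn_weights]

# Hypothetical usage: collect gamma from every BatchNorm2d in the student network.
# bn_weights = [m.weight.detach() for m in model.modules()
#               if isinstance(m, torch.nn.BatchNorm2d)]
# masks = channel_masks_from_bn(bn_weights, prune_ratio=0.6)
```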
4.3 Results and analysis

The accuracy of the original network ResNet18 reaches 94.40%. By introducing the ECA attention mechanism, the accuracy is improved by 1.42% while the number of parameters increases by only 4.5%; since ECA is a lightweight module, the parameter increase caused by the attention mechanism is negligible. Channel pruning reduces the number of parameters by removing unimportant channels, but at the cost of accuracy. To recover the accuracy lost through pruning, the teacher network is used to guide the pruned network. Teacher networks are usually deep networks; although increasing the depth of the network improves accuracy, it also brings a significant increase in parameters. For example, the teacher network is much deeper than the student network but yields only a 3.39% accuracy improvement, while its number of parameters is 2.19 times that of the student network. Although parameters and accuracy are difficult to balance in HCCR, the number of parameters is given priority here, because these networks are mostly deployed on devices such as mobile phones that do not have large storage.

As the number of parameters increases, the model becomes harder to deploy on embedded devices, so channel pruning is adopted to reduce the number of parameters, which in turn causes a loss of accuracy. Therefore, knowledge distillation is used to improve the accuracy of the pruned network. When the pruning rate is 0.4, the accuracy of KD-SC-ECA-ResNet18(0.4) is improved by 1.71% and the number of parameters is reduced by 16.7% compared with ECA-ResNet18. When the pruning rate is 0.6, the accuracy of the student network is improved by 1.17% and the number of parameters is reduced by 25.6% compared with the network before pruning and knowledge distillation. When the pruning rate exceeds 0.6, it is difficult to obtain good results even after knowledge distillation. The results of the different models are shown in Table 3.

Table 3: Recognition accuracy and number of model parameters

  Model                      Accuracy (%)   Params
  ResNet18                   94.40          11.19M
  ECA-ResNet18               95.82          11.69M
  ECA-ResNet56               99.21          25.56M
  KD-SC-ECA-ResNet18(0.4)    97.53          9.74M
  KD-SC-ECA-ResNet18(0.6)    96.99          8.70M

5 Conclusion and future work

This paper focuses on image classification for offline handwritten Chinese character recognition. The proposed method, which combines an attention mechanism, channel pruning and knowledge distillation, not only obtains higher recognition accuracy but also has fewer parameters than the original network. The attention mechanism improves the network's ability to extract features, channel pruning effectively reduces the number of parameters, and knowledge distillation recovers the accuracy. This makes the model easier to deploy on resource-constrained devices. In future work, the model can be compressed with other methods to further reduce its size, which is useful for the development of artificial intelligence, especially in the field of computer vision.

References

[1] Lin Meng, Bing Lyu, Zhiyu Zhang, C. V. Aravinda, Naoto Kamitoku, and Katsuhiro Yamazaki. Oracle bone inscription detector based on SSD. ICIAP 2019, pages 126–136, 2019.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1904–1916, 2015.
[4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CVPR 2015, pages 1–9, 2015.
[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR 2015, 2015.
[6] Hengyi Li, Zhichen Wang, Xuebin Yue, Wenwen Wang, Hiroyuki Tomiyama, and Lin Meng. A comprehensive analysis of low-impact computations in deep learning workloads. Proceedings of the 2021 Great Lakes Symposium on VLSI, 2021.
[7] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 2011–2023, 2019.
[8] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. Springer, Cham, 2018.
[9] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. ECA-Net: Efficient channel attention for deep convolutional neural networks. CVPR 2020, 2020.
[10] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. MIT Press, 2015.
[11] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. arXiv:1608.04493, 2016.
[12] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. ICCV 2017, 2017.
[13] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. ICCV 2017, 2017.
[14] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv:1611.06440, 2017.
[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. Computer Science, pages 38–39, 2015.
[16] Kaiyu Yue, Jiangfan Deng, and Feng Zhou. Matching guided distillation. arXiv:2008.09958, 2020.
[17] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. Advances in Neural Information Processing Systems, 2014.