<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adjoined Networks: A Training Paradigm With Applications to Network Compression</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Utkarsh Nath</string-name>
          <email>unath@asu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shrinu Kushagra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yingzhen Yang</string-name>
          <email>yingzhen.yang@asu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In A. Martin, K. Hinkelmann</institution>
          ,
          <addr-line>H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.)</addr-line>
          ,
          <institution>Proceedings of the AAAI 2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence (AAAI-MAKE 2022), Stanford University</institution>
          ,
          <addr-line>Palo Alto, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing and Augmented Intelligence, Arizona State University</institution>
          ,
          <addr-line>Tempe, AZ 85281</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Waterloo</institution>
          ,
          <addr-line>Waterloo, ON N2L 3G1</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Compressing deep neural networks while maintaining accuracy is important when we want to deploy large, powerful models in production and/or on edge devices. One common technique used to achieve this goal is knowledge distillation. Typically, the output of a static pre-defined teacher (a large base network) is used as soft labels to train and transfer information to a student (or smaller) network. In this paper, we introduce Adjoined Networks, or AN, a learning paradigm that trains both the original base network and the smaller compressed network together. In our training approach, the parameters of the smaller network are shared across both the base and the compressed networks. Using our training paradigm, we can simultaneously compress (the student network) and regularize (the teacher network) any architecture. In this paper, we focus on popular CNN-based architectures used for computer vision tasks. We conduct an extensive experimental evaluation of our training paradigm on various large-scale datasets. Using ResNet-50 as the base network, AN achieves 71.8% top-1 accuracy with only 1.8M parameters and 1.6 GFLOPs on the ImageNet dataset. We further propose Differentiable Adjoined Networks (DANs), a training paradigm that augments AN by using neural architecture search to jointly learn both the width and the weights for each layer of the smaller network. DAN achieves ResNet-50 level accuracy on ImageNet with 3.8× fewer parameters and 2.2× fewer FLOPs.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Distillation</kwd>
        <kwd>Differentiable Adjoined Networks</kwd>
        <kwd>Neural Architecture Search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Deep Neural Networks (DNNs) have achieved state-of-the-art performance on many tasks
such as classification, object detection and image segmentation. However, the large number
of parameters often required to achieve this performance makes it difficult to deploy them at
the edge (e.g., on mobile phones, IoT, and embedded devices). Unlike cloud servers, these
edge devices are constrained in terms of memory, compute, and energy resources. A large
network performs many computations, consumes more energy, and is difficult to transport
and update. It also has a high per-image prediction time, which is a constraint
when real-time inference is needed. Thus, compressing neural networks while maintaining
accuracy and improving inference time has received significant attention in the last few years.
Popular techniques for network compression include pruning and knowledge distillation.</p>
      <p>
        Pruning methods remove parameters (or weights) of overparameterized DNNs based on some
pre-defined criteria. For example, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] removes weights whose absolute value is smaller than a
threshold. While weight pruning methods are successful at reducing the number of parameters
of the network, they often work by creating sparse tensors that may require special hardware
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or special software [
        <xref ref-type="bibr" rid="ref3">3</xref>
         ] to provide inference-time speed-ups. These methods, also known
as unstructured pruning, have been extensively studied in [
        <xref ref-type="bibr" rid="ref1 ref4 ref5 ref6 ref7">1, 4, 5, 6, 7</xref>
        ]. To overcome this
limitation, channel pruning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and filter pruning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] techniques are used. These structured
pruning methods work by removing entire convolution channels or sometimes even filters
based on some pre-defined criteria and can often provide significant improvement in inference
times. In this paper, we show that our algorithm, Adjoined Networks or AN, achieves accuracy
similar to the current state-of-the-art structured pruning methods while using a significantly lower
number of parameters and FLOPs (Fig. 1).
      </p>
      <p>
        The AN training paradigm works as follows. A given input image x is processed by two
networks, the larger network (or the base network) and the smaller network (or the compressed
network). The base network outputs a probability vector p and the compressed network outputs
a probability vector q. This setup is similar to the student-teacher training used in Knowledge
Distillation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] where the base network (or the teacher) is used to train the compressed
network (or the student). However, there are two very important distinctions. (1) In knowledge
distillation, the parameters of the base (or larger or teacher) network are fixed and the output
of the base network is used as a "soft label" to train the compressed (or smaller or student)
network. In the paradigm of the adjoined network, both the base and the compressed network
are trained together. The output of the base network influences the compressed network and
vice-versa. (2) The parameters of the compressed network are shared across both the smaller
and larger networks (Fig. 2). We train the two networks using a novel time-dependent loss
function called adjoined loss. An additional benefit of training the two networks together is that
the smaller network can have a regularizing effect on the larger network. In our experiments
(Section 6), we see that on many datasets and for many architectures, the base network trained
in the adjoined fashion has greater prediction accuracy than the same network trained alone in
the standard fashion. We also provide theoretical justification for this observation in
the appendix materials. The details of our design, the loss function, and how it supports fast
inference are discussed in Section 3.</p>
      <p>Figure 2: Training paradigm based on adjoined networks. The original and the compressed version of
the network are trained together with the parameters of the smaller network shared across both. The
network outputs two probability vectors p (original network) and q (smaller network).</p>
      <p>As discussed in the previous paragraph, in the AN training paradigm, all the parameters of
the smaller network are shared across both the smaller and larger network. Our compression
architecture design involves selecting and tuning a hyper-parameter α, which controls the size (the number
of parameters in each convolution layer) of the smaller network relative to the larger
base network. In our experiments (Section 6) with the AN paradigm, we found that choosing a
value of α = 2 or 4 as a global (same across all the layers of the network) constant typically
worked well. To get a further boost in compression performance, we propose the framework of
Differentiable Adjoined Networks (DAN). DAN uses techniques from Neural Architecture Search
(NAS) to further optimize and choose the right value of α at each layer of our compressed model.
The details of DAN are discussed in Section 5.</p>
      <p>
        Below are the main contributions of this work.
1. We propose a novel training paradigm based on Adjoined Networks or AN, that can compress
any CNN based neural architecture. This involves adjoined training where the original network
and the smaller network are trained together. This has twin benefits of compression and
regularization whereby the larger network (or teacher) transfers information and helps compress
the smaller network while the smaller network helps regularize the larger teacher network.
2. We further propose Differentiable Adjoined Networks, or DAN, which jointly learns some of
the hyper-parameters of the smaller network including the number of filters in each layer of
the smaller network.
3. We conducted an exhaustive experimental evaluation of our method and compared it against
several state-of-the-art methods on datasets such as ImageNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
         ], CIFAR-10, and
CIFAR-100 [
        <xref ref-type="bibr" rid="ref12">12</xref>
         ]. We consider different architectures such as ResNet-18, -50, -20, -32, -44, -56, -110, and
DenseNet-121. On ImageNet, using the adjoined training paradigm, we can compress
ResNet-50 by 4× with a 2× FLOPs reduction while achieving 75.1% accuracy. Moreover, the base
network gains 0.7% in accuracy when compared against the same network trained in the
standard (non-adjoined) fashion. We further increase the accuracy of the compressed model
to 75.7% by augmenting our approach with architecture search (DAN), clearly showing that
it is better to train the networks together. Furthermore, we compare our approach against
several state-of-the-art knowledge distillation methods on CIFAR-10 on various architectures
like ResNet-20, -32, -44, -56, and -110. On each of these architectures, the student trained using
the adjoined method outperforms those trained using other methods (Table 2).
      </p>
      <p>The paper is organized as follows. In Section 2, we discuss some of the other methods that
are related to the discussions in the paper. In Section 3, we provide details of the architecture
for adjoined networks and the loss function. In Section 4, we show how training both the base
and compressed network together provides compression (for the smaller network) as well as
regularization (for the larger network). In Section 5, we combine AN with neural architecture
search and introduce Differentiable Adjoined Networks (or DANs). In Section 6, we provide the
details of our experimental results. In Section A of the appendix, we provide strong theoretical
guarantees on the regularization behaviour of adjoined training.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section, we discuss various techniques used to design eficient neural networks in terms
of size and FLOPs. We also compare our approach to other similar approaches and ideas in the
literature.</p>
      <p>
        Knowledge Distillation is the transfer of knowledge from a cumbersome model to a small
model. [
        <xref ref-type="bibr" rid="ref10">10</xref>
         ] proposed the teacher-student model, where soft targets from the teacher are used to
train the student model. This forces the student to generalize in the same manner as the teacher.
Various knowledge transfer methods have been proposed recently. [
        <xref ref-type="bibr" rid="ref13">13</xref>
         ] used intermediate-layer
information from the teacher model to train a thinner and deeper student model. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposes to
use instance-level correlation congruence rather than just instance congruence between
the teacher and student. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] tried to maximize the mutual information between teacher and
student models using variational information maximization. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] aims at transferring structural
knowledge from teacher to student. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] argues that directly transferring a teacher’s knowledge
to a student is difficult due to inherent differences in structure, layers, channels, etc.; therefore,
they paraphrase the output of the teacher in an unsupervised manner, making it easier for the
student to understand. Most of these methods use a trained teacher model to train a student
model. In contrast, in this work we train both the teacher and the student together from scratch.
In recent work, [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], rather than using a teacher to train a student, they let a cohort of students
train together using a distillation loss function. In this paper, we consider a teacher and a
student together rather than using a pre-trained teacher. We also use a novel time-dependent
loss function. Moreover, we also provide theoretical guarantees on the efficacy of our approach.
We compare AN with various knowledge distillation methods in the experiments
section.
      </p>
      <p>
        Pruning techniques aim to achieve network compression by removing parameters or weights
from a network while still maintaining accuracy. These techniques can be broadly classified
into two categories: unstructured and structured. Unstructured pruning methods are generic
and do not take network architecture (channel, filters) into account. These methods induce
sparsity based on some pre-defined criteria and often achieve a state-of-the-art reduction in the
number of parameters. However, one drawback of these methods is that they are often unable
to provide inference time speed-ups on commodity hardware due to their unstructured nature.
Unstructured sparsity has been extensively studied in [
        <xref ref-type="bibr" rid="ref1 ref4 ref5 ref6 ref7">1, 4, 5, 6, 7</xref>
        ]. Structured pruning aims
to address the issue of inference time speed-up by taking network architecture into account.
As an example, for CNN architectures, these methods try to remove entire channels or filters,
or blocks. This ensures that the reduction in the number of parameters also translates to a
reduction in inference time on commodity hardware. For example, ABCPruner [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] decides
the convolution filters to be removed in each layer using an artificial bee colony algorithm.
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] prunes filters with low-rank feature maps. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] uses Taylor expansion to estimate the
change in the loss function by removing a particular filter, and finally removes the filters with
max change. The AN compression technique proposed in this paper can also be thought of as
a structured pruning method where the architecture choice at the start of training fixes the
convolution filters to be pruned and the amount of pruning at each layer. Another related
work is that of Slimmable Networks [
        <xref ref-type="bibr" rid="ref22">22</xref>
         ]. Here, different networks (or architectures) are switched
on one at a time and trained using the standard cross-entropy loss function. By contrast, in
this work, both the networks are trained together at the same time using a novel loss function
(adjoined-loss). We have compared our work with Slimmable Networks in Table 1.
Neural Architecture Search (NAS) is a technique that automatically designs neural
architectures without human intervention. The best architecture could be found by training all
architectures in the given search space from scratch to convergence, but this is computationally
impractical. Earlier studies in NAS were based on RL [
        <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
        ] and EA [
        <xref ref-type="bibr" rid="ref25">25</xref>
         ]; however, they
required large amounts of computational resources. More recent studies [
        <xref ref-type="bibr" rid="ref26 ref27 ref28">26, 27, 28</xref>
        ] encode architectures as
a weight-sharing super-net and optimize the weights using gradient descent. A recent study
Meta Pruning [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] searches over the number of channels in each layer. It generates weights for
all candidates and then selects the architecture with the highest validation accuracy. Many of
these techniques focus on designing compact architectures from scratch. In this paper, we use
architecture search to help guide the choice of architecture for compression, that is, the fraction
of filters which should be removed from each layer.
      </p>
      <p>
        Small architectures - Another research direction that is orthogonal to ours is to design smaller
architectures that can be deployed on edge devices, such as SqueezeNet [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], MobileNet [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] and
EfficientNet [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. In this paper, our focus is to compress existing architectures while ensuring
inference time speedups as well as maintaining prediction accuracy.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Adjoined networks</title>
      <p>
        In our training paradigm, the original (larger) and the smaller network are trained together.
The motivation for this kind of training comes from the principle that good teachers are lifelong
learners. Hence, the larger network which serves as a teacher for the smaller network should
not be frozen (as in standard teacher-student architecture designs [
        <xref ref-type="bibr" rid="ref10">10</xref>
         ]). Rather, both should
learn together in a "combined learning environment", that is, adjoined networks. By learning
together, both networks can become better.
      </p>
      <p>We are now ready to describe our approach and discuss the design of adjoined networks.
Before that, let us take a re-look at the standard convolution operator. Let x ∈ R^{h × w × c_in} be
the input to a convolution layer with weights W ∈ R^{c_in × c_out × k × k}, where c_in and c_out denote the
number of input and output channels, k the kernel size, and h, w the height and width of the
image. Then, the output of the convolution z is given by</p>
      <p>z = conv(x, W)
In the adjoined paradigm, a convolution layer with weight matrix W and a binary mask matrix
M ∈ {0, 1}^{c_in × c_out × k × k} receives two inputs x1 and x2 of size h × w × c_in and outputs two
vectors z1 and z2 as defined below.</p>
      <p>z1 = conv(x1, W),   z2 = conv(x2, W * M)   (1)
Here M is of the same shape as W and * represents element-wise multiplication. Note
that the parameters of the matrix M are fixed before training and not learned. The vector x1
represents an input to the original (bigger) network while the vector x2 is the input to the
smaller, compressed network. For the first convolution layer of the network x1 = x2, but the
two vectors are not necessarily equal for the deeper convolution layers (Fig. 2). The mask
matrix M serves to zero out some of the parameters of the convolution layer, thereby enabling
network compression. In this paper, we consider matrices M of the following form:</p>
      <p>M := M_α = the matrix in which the first c_out/α filters are all 1 and the rest are 0   (2)</p>
      <p>In Section 6, we run experiments with M := M_α for α ∈ {2, 4, 8, 16}. Putting this all together,
we see that any CNN-based architecture can be converted and trained in an adjoined fashion by
replacing the standard convolution operation with the adjoined convolution operation (Eqn. 1).
Since the first layer receives a single input (Fig. 2), two copies are created which are passed to
the adjoined network. The network finally gives two outputs: p, corresponding to the original
(bigger or unmasked) network, and q, corresponding to the smaller (compressed) network, where
each convolution operation is done using the subset of the parameters described by the mask
matrix M (or M_α). We train the network using a novel time-dependent loss function which
forces p and q to be close to one another (Defn. 1).</p>
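      <p>As an illustration, the following is a minimal PyTorch sketch of the adjoined convolution of Eqns. 1 and 2. The class name AdjoinedConv2d and its exact interface are ours, for exposition only, and are not taken from the released code. A single weight tensor is shared; the small-network path multiplies it by a fixed binary mask that keeps only the first c_out/α filters.</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdjoinedConv2d(nn.Module):
    """Adjoined convolution: two inputs, one shared weight tensor (illustrative sketch)."""

    def __init__(self, in_channels, out_channels, kernel_size, alpha=2, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=False)
        mask = torch.zeros_like(self.conv.weight)
        mask[: out_channels // alpha] = 1.0   # M_alpha: first c_out/alpha filters are 1, rest 0
        self.register_buffer("mask", mask)    # fixed before training, never learned

    def forward(self, x1, x2):
        z1 = self.conv(x1)                                      # z1 = conv(x1, W)
        z2 = F.conv2d(x2, self.conv.weight * self.mask,         # z2 = conv(x2, W * M)
                      stride=self.conv.stride, padding=self.conv.padding)
        return z1, z2
      </preformat>
      <p>At the first layer the same image is passed as both x1 and x2; deeper layers receive the two streams produced by the previous adjoined layer.</p>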
    </sec>
    <sec id="sec-4">
      <title>4. Regularization and Compression</title>
      <p>
        In the previous section, we looked at the design of adjoined networks. For one input (X, y) ∈
R^{h × w × c} × [0, 1]^N, the network outputs two vectors p and q ∈ [0, 1]^N, where N denotes the
number of classes and c denotes the number of input channels (equal to 3 for RGB images).
Definition 1 (Adjoined loss). Let y be the ground-truth one-hot encoded vector and p and q be
the output probabilities of the adjoined network. Then
ℒ(y, p, q) = − Σ_i y_i log p_i + λ(t) KL(p, q)   (3)
where KL(p, q) = Σ_i p_i log(p_i / q_i) is the measure of difference between two probability measures
[
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. The regularization term λ : [0, 1] → R is a function which changes with the number of
epochs during training. Here t = (current epoch) / (total number of epochs) equals zero at the start of training and equals
one at the end.
      </p>
      <p>In our definition of the loss function, the first term is the standard cross-entropy loss function
which trains the bigger network. To train the smaller network, we use the predictions from
the bigger network as a soft ground-truth signal. We use KL-divergence to measure how far
the output of the smaller network is from the bigger network. This also has a regularizing
effect as it forces the network to learn from a smaller set of parameters. Note that, in our
implementations, we use KL(p, q) = Σ_i p_i log((p_i + ε)/(q_i + ε)) to avoid rounding and
division-by-zero errors, where ε = 10^{−6}.</p>
      <p>At the start of training, p is not a reliable indicator of the ground-truth labels. To compensate
for this, the regularization term λ changes with time. In our experiments, we used λ(t) =
min{4t², 1}. Thus, the contribution of the second term in the loss is zero at the beginning and
steadily grows to one at 50% of training.</p>
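      <p>A short sketch of the adjoined loss of Defn. 1 in PyTorch follows. The function name is illustrative, and we assume p and q are already softmax probabilities:</p>
      <preformat>
import torch
import torch.nn.functional as F


def adjoined_loss(p, q, y, t, eps=1e-6):
    """Adjoined loss (Defn. 1): cross-entropy on p plus lambda(t) * KL(p || q).

    p, q : probabilities from the full and the small network, shape (batch, classes)
    y    : integer class labels, shape (batch,)
    t    : fraction of training completed, in [0, 1]
    """
    ce = F.nll_loss(torch.log(p + eps), y)                     # - sum_i y_i log p_i
    kl = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()
    lam = min(4.0 * t ** 2, 1.0)                               # lambda(t) = min{4 t^2, 1}
    return ce + lam * kl
      </preformat>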
    </sec>
    <sec id="sec-5">
      <title>5. DAN: Differentiable Adjoined Networks</title>
      <p>In Sections 3 and 4, we described the framework of adjoined networks and the corresponding
loss function. An important parameter in the design of these networks is the choice of the parameter
α. Currently, the choice of α is global, that is, we choose the same value of α for all the layers of
our network. However, choosing α independently for each layer would add more flexibility and
possibly improve the performance of the current framework. To solve this problem, we propose the
framework of Differentiable Adjoined Networks (or DANs).</p>
      <p>Consider the following example of a convolution network with one layer and with the following
choices of α ∈ A = {1, 2, 4}, each producing an output vector z_α. Finding the optimal network structure
is equivalent to solving arg max_{α ∈ A} ℓ(z_α), where ℓ is some loss function. For a one-layer
network, we can solve this problem by computing ℓ(z_α) for all the different values and then
taking the max. However, this becomes intractable as the number of layers increases; for a
50-layer network, the search space has size 3^50.</p>
      <p>
        Definition 2 (Gumbel-softmax ([
        <xref ref-type="bibr" rid="ref34">34</xref>
         ])). Given a vector v = [v_1, . . . , v_n] and a constant τ, the
Gumbel-softmax function is defined as g(v) = [g_1, . . . , g_n], where
      </p>
      <p>g_i = exp[(v_i + ε_i)/τ] / Σ_j exp[(v_j + ε_j)/τ]   (4)
and ε_i ∼ U(0, 1) is uniform random noise (also referred to as Gumbel noise). Note that as τ → 0,
Gumbel-softmax tends to the arg max function.</p>
      <p>Gumbel-softmax is a "re-parametrization trick" that can be viewed as a differentiable
approximation to the arg max function. Returning to the one-layer example, the optimization
objective now becomes Σ_{α ∈ A} g_α ℓ(z_α), where g_α represents the Gumbel weight corresponding
to the particular α. This objective is now differentiable and can be optimized using standard
techniques like back-propagation.</p>
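      <p>A minimal sketch of the Gumbel-softmax of Defn. 2, written exactly as in Eqn. 4 with uniform noise (the function name is illustrative):</p>
      <preformat>
import torch


def gumbel_softmax(v, tau):
    """Differentiable relaxation of arg max over architecture scores v (Eqn. 4)."""
    noise = torch.rand_like(v)                      # epsilon_i ~ U(0, 1) as in Defn. 2
    return torch.softmax((v + noise) / tau, dim=-1)

# torch.nn.functional.gumbel_softmax offers a built-in alternative; it samples the
# noise as -log(-log(U)), i.e. from the standard Gumbel distribution.
      </preformat>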
      <p>With this insight, we propose the DAN architecture (Fig. 3) where the standard convolution
operation is replaced by a DAN convolution operation. As before, let x ∈ R^{h × w × c_in} be the
input to the DAN convolution layer with weights W ∈ R^{c_in × c_out × k × k}, where c_in and c_out denote
the number of input and output channels, k the kernel size, and h, w the height and width of
the image. Let A = {α_1, . . . , α_s} be the range of values of α for the layer. Then, the output z
of the DAN convolution layer is given by
z(x) = Σ_{i=1}^{s} g_i(a) conv(x, W * M_{α_i})   (5)
where a = [a_1, . . . , a_s] denotes the mixing weights corresponding to the different α_i's, g
is the Gumbel-softmax function, and M_{α_i} is the mask matrix
corresponding to α_i (as in Eqn. 2). Thus, each DAN convolution layer combines its
outputs according to the Gumbel weights. Choosing the hyper-parameter α now corresponds
to learning the values of the mixing weights a for each layer of our DAN conv network. Note that,
as before, our network outputs two probability vectors p and q. But these vectors now also
depend upon the weight vector a at each layer. We are now ready to define our main loss
function.</p>
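      <p>The DAN convolution of Eqn. 5 can be sketched in PyTorch as a Gumbel-weighted mixture of masked convolutions. The class name DANConv2d and the interface are illustrative assumptions, not the paper's released code:</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F


class DANConv2d(nn.Module):
    """DAN convolution: Gumbel-weighted sum of conv(x, W * M_alpha) over alpha in A."""

    def __init__(self, in_channels, out_channels, kernel_size,
                 alphas=(1, 2, 4), stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=False)
        masks = torch.zeros(len(alphas), *self.conv.weight.shape)
        for i, a in enumerate(alphas):
            masks[i, : out_channels // a] = 1.0      # M_alpha_i keeps first c_out/alpha_i filters
        self.register_buffer("masks", masks)
        self.arch_weights = nn.Parameter(torch.zeros(len(alphas)))  # mixing weights a (learned)

    def forward(self, x, tau):
        g = F.gumbel_softmax(self.arch_weights, tau=tau)             # g = GS(a), Eqn. 4
        outs = [F.conv2d(x, self.conv.weight * m,
                         stride=self.conv.stride, padding=self.conv.padding)
                for m in self.masks]
        return sum(gi * zi for gi, zi in zip(g, outs))               # Eqn. 5
      </preformat>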
      <p>Definition 3 (Differentiable Adjoined loss). Let the search space be A = {α_1, . . . , α_s}. Let y
be the ground-truth one-hot encoded vector and p and q be the output probabilities of the adjoined
network. Then</p>
      <p>ℒ(y, p, q) = − Σ_i y_i log p_i + λ(t)( KL(p, q) + γ F(a) )   (6)
where KL(p, q) and λ(t) are the same as in Defn. 1, and a = [a^1, . . . , a^L], where a^l is the mixing
weight vector for the l-th convolution layer. F represents the Gumbel-weighted FLOPs, or floating
point operations, of the given network. That is,</p>
      <p>F(a) = Σ_l Σ_{i=1}^{s} g_i(a^l) FLOPs(l, α_i)</p>
      <p>where FLOPs(l, α_i) measures the number of floating point operations at the l-th convolution layer
corresponding to the hyper-parameter α_i. Also, note that γ in Eqn. 6 is a normalization constant.</p>
      <p>The Differentiable Adjoined Loss is similar to the Adjoined Loss defined in Eqn. 3. However, the key
difference is the F term. First, note that larger architectures tend to have higher accuracies.
Hence, DAN learning tends to prefer a network with low alpha (a large network) over one
with high alpha (a small network). Thus, the F term acts as a regularization penalty against DAN
preferring large architectures. Another point to note is that for a large network, say ResNet-50,
the number of FLOPs corresponding to any setting of the mixing weights can be very large.
Gamma (γ) normalizes it so that all the terms in the loss function are on the same scale.</p>
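      <p>Under the same assumptions, the FLOPs-regularized loss of Defn. 3 could be sketched as follows, reusing the adjoined_loss and gumbel_softmax sketches above; flops_per_alpha is a hypothetical per-layer table of FLOPs(l, α_i) values:</p>
      <preformat>
import torch


def dan_loss(p, q, y, t, arch_weights, flops_per_alpha, gamma, tau, eps=1e-6):
    """Differentiable adjoined loss (Defn. 3): adjoined loss + lambda(t) * gamma * F(a).

    arch_weights    : list of per-layer mixing-weight vectors a^l
    flops_per_alpha : list of per-layer tensors with FLOPs(l, alpha_i) for each alpha_i
    """
    base = adjoined_loss(p, q, y, t, eps)                 # cross-entropy + lambda(t) * KL(p || q)
    flops = sum((gumbel_softmax(a, tau) * f).sum()        # Gumbel-weighted FLOPs, summed over layers
                for a, f in zip(arch_weights, flops_per_alpha))
    lam = min(4.0 * t ** 2, 1.0)
    return base + lam * gamma * flops
      </preformat>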
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <p>
        We are now ready to describe our experiments in detail. We run experiments on three different
datasets. (1) ImageNet - an image classification dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
         ] with 1000 classes and about 1.2
million images. (2) CIFAR-10 - a collection of 60,000 images in 10 classes. (3) CIFAR-100 - same as
CIFAR-10 but with 100 classes [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For each of these datasets, we use standard data augmentation
techniques such as random resized cropping and random flipping.
      </p>
      <p>We train different architectures such as ResNet-100, ResNet-50, ResNet-18, ResNet-110,
ResNet-56, DenseNet-121 on all of the above datasets. On each dataset, we first train these
architectures in the standard non-adjoined fashion using the cross-entropy loss function. We
will refer to it by the name Standard. Next, we train the adjoined network, obtained by replacing
the standard convolution operation with the adjoined convolution operation, using the adjoined
loss function. In the second step, we obtain two different networks. In this section, we refer
to them as the AN-X-Full α and the AN-X-Small α networks, where X represents the number of
layers and α indexes the mask matrix M_α as defined in Eqn. 2. For example, AN-50-Full 2 and AN-50-Small 2
represent the larger and smaller networks obtained by adjoinedly training ResNet-50 with
α = 2, while AN-121-Full 4 and AN-121-Small 4 represent the models obtained by adjoinedly training
DenseNet-121 with α = 4. We compare the performance of the AN-X-Full α and
AN-X-Small α networks against the standard network. One point to note is that we do not replace
the convolutions in the stem layers but only those in the residual blocks. Since most of the
weights are in the later layers, this leads to significant space and time savings while retaining
competitive accuracy. DAN denotes the performance of the adjoined network on architectures
found by Differentiable Adjoined Networks. DAN-50 has the same number of blocks as ResNet-50,
whereas DAN-100 has twice the number of blocks of ResNet-50.</p>
      <p>We ran our experiments on a GPU-enabled machine using PyTorch. We have also open-sourced
our implementation.1 Hyperparameters for the experiments are listed on our GitHub page.</p>
      <p>In Section 6.1, we compare our compression results against other structured pruning methods.
In Section 6.2, we compare AN with various types of knowledge distillation methods. In Section
6.3, we describe our results for compression and performance of architectures found by DAN.
In Section 6.4, we show the strong regularizing effect of AN training.</p>
      <sec id="sec-6-1">
        <title>6.1. Comparison against other Structured Pruning works</title>
        <p>Table 1 compares our approach against other structured pruning methods on the ImageNet dataset for the ResNet-50 architecture. Note that these methods provide inference
speed-up without special hardware or software. We see that the adjoined training regime can
achieve compression that is significantly better than other methods considered in the literature.
In Figure 1, models trained using our paradigm sit clearly on the left side of the graph, while
other methods are clustered toward the right side. Other methods obtain compression ratios in the
range 2-3×, whereas our method achieves up to 12× compression in size. Similarly,
the reduction in GFLOPs for our method is among the highest compared to other state-of-the-art works,
while suffering only a small accuracy drop compared against the base ResNet-50 model. Figure 4
compares the performance of AN against various pruning methods on the CIFAR-10 dataset for the
ResNet-56 architecture.
1The code can be found at https://github.com/utkarshnath/Adjoint-Network.git</p>
        <p>
          Table 1 lists the methods compared on ImageNet for ResNet-50: ABCPruner-0.8 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], ABCPruner-0.7 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], GBN-50 [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], GBN-60 [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], DCP [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], HRank [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] (three settings), MetaPruning [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] (two settings), Slimmable Net [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] (two settings), and our AN-50-Small 4, DAN-50, DAN-100, and AN-50-Small 2.
Models trained using the AN paradigm achieve the highest accuracy with
the fewest parameters on CIFAR-10. AN exceeds the next best model (Hinge [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ]) by
0.8% while being smaller than 9 of the 11 models. The smallest AN model achieves accuracy
similar to Hinge but with 35% fewer parameters. We see similar results for FLOPs.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Comparison against other Knowledge Distillation Works</title>
        <p>In this section, we discuss the effectiveness of weight sharing and training two networks together.
We compare AN against the various state-of-the-art variants of knowledge distillation. In Table
2, we compare accuracy (Top-1%) of AN-X-Small against the same architecture trained using
various KD variants on CIFAR-10 dataset. The corresponding pre-trained ResNet architecture
was used as the teacher model for the KD variants. Teacher models were trained on
CIFAR-10 using the standard training paradigm. We see that all models trained using the AN paradigm
significantly outperform the models trained using the various teacher-student paradigms, showing
the effectiveness of training a subset of weights together.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Ablation study: Compression</title>
        <p>Table 3 reports the following networks: ResNet-20 / AN-20-Small 2, ResNet-32 / AN-32-Small 2, ResNet-44 / AN-44-Small 2, ResNet-56 / AN-56-Small 2, ResNet-110 / AN-110-Small 2, ResNet-50 / AN-50-Small 2 / AN-50-Small 4 / DAN-50, and ResNet-100 / AN-100-Small 4 / DAN-100. Here α denotes the masking matrix M_α (defined in Eqn. 2).</p>
        <p>In this section, we evaluate the performance of models compressed by Adjoined training
paradigm. Table 3 compares the performance (top-1 accuracy) of the models compressed using
AN against the performance of the standard network. For AN, we use M_α as the masking matrix
(defined in Eqn. 2). The mask is such that the last (1 − 1/α) fraction of the filters are zero. Hence, these can be
pruned away to support fast inference. For CIFAR-10, 4 out of 5 models compressed using the AN
paradigm exceed their base architecture by 0.5%-0.8%. These models achieve a 3.5-4× reduction in
parameters and a 2× reduction in FLOPs.</p>
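        <p>Since the mask zeroes a contiguous block of filters, a deployed small network does not need the mask at all: the kept filters can simply be sliced out into a dense, smaller layer. A sketch under that assumption follows (the helper name is ours; in deeper layers the input channels beyond c_in/α are also zero and can be sliced in the same way):</p>
        <preformat>
import torch.nn as nn


def extract_small_conv(adjoined_conv, alpha):
    """Slice the first out_channels // alpha filters out of a trained adjoined layer."""
    full = adjoined_conv.conv
    keep = full.out_channels // alpha
    small = nn.Conv2d(full.in_channels, keep, full.kernel_size,
                      stride=full.stride, padding=full.padding, bias=False)
    small.weight.data.copy_(full.weight.data[:keep])   # dense copy, no sparse kernels needed
    return small
        </preformat>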
        <p>
          We also observe that ResNet-50 is a bigger network and can be compressed more. Also,
different datasets can be compressed by different amounts. For example, on the CIFAR-100 dataset,
the network can be compressed by a factor of ∼35×, while for other datasets the factor ranges from 2×
to 12×. DAN is able to search for compressed architectures with minimal loss in accuracy
compared to the base architecture. For ImageNet, DAN architectures were searched on Imagewoof
(a proxy dataset with 10 different dog breeds from ImageNet [
          <xref ref-type="bibr" rid="ref46 ref47">46, 47</xref>
          ]). γ, as defined in Defn. 3, is
−13 and −19 for DAN-50 and DAN-100, respectively. During architecture search, the temperature τ
in the Gumbel softmax was initialized to 15 and exponentially annealed by −0.045 every epoch.
        </p>
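        <p>For concreteness, one reading of this schedule (an assumption on our part, since only the initial value and the per-epoch rate are stated) is an exponential decay of the temperature:</p>
        <preformat>
import math


def temperature(epoch, tau0=15.0, rate=0.045):
    """Assumed temperature schedule: start at 15 and decay by a factor of exp(-0.045) per epoch."""
    return tau0 * math.exp(-rate * epoch)
        </preformat>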
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Ablation study: Regularization</title>
          <p>Table 4 (AN-Full vs. Standard) covers ResNet-20, -32, -44, -56, -110, ResNet-50, ResNet-18, DenseNet-121, and ResNet-50 across CIFAR-10, CIFAR-100, and ImageNet, with α ∈ {2, 4, 8}. Here α denotes the masking matrix M_α (defined in Eqn. 2).</p>
          <p>In this section, we study the regularization effect of the Adjoined training paradigm on the AN-Full
network. Table 4 compares the performance of the base network trained in the adjoined fashion
(AN-Full) to the same network trained in Standard fashion. We see a consistent trend that the
network trained adjoinedly outperforms the same network trained in the standard way. We see
maximum gains on CIFAR-100, improving accuracy by as much as 1.8%. Even on ImageNet, we
see a gain of about 0.77%.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this work, we introduced the paradigm of Adjoined Network training where both the larger
teacher (or base) network and the smaller student network are trained together. We showed
how this approach to training neural networks can allow us to reduce the number of parameters
of large networks like ResNet-50 by 12× (even going up to 35× on some datasets) without
significant loss in classification accuracy, along with a 2-3× reduction in the number of FLOPs. We showed
(both theoretically and experimentally) that adjoining a large and a small network together
has a regularizing effect on the larger network. We also introduced DAN, a search strategy
that automatically selects the best architecture for the smaller student network. Augmenting
adjoined training with DAN, the smaller network achieves accuracy that is close to that of the
base teacher network.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Regularization theory</title>
      <p>Theorem A.1. Given a deep neural network N which consists of only convolution and linear
layers. Let the network use one of σ(x) = max{x, 0} (ReLU) or σ(x) = x (linear) as the activation
function. Let the network be trained using the adjoined loss function as defined in Eqn. 3. Let
X be the set of parameters of the network N which are shared across both the smaller and bigger
networks. Let Y be the set of parameters of the bigger network not shared with the smaller network.
Let p be the output of the larger network and let q be the output of the smaller network, where
p_i, q_i represent their i-th components. Then, the adjoined loss function induces a data-dependent
regularizer with the following properties.</p>
      <p>• For all w ∈ X, the induced ℓ2 penalty is given by Σ_i p_i ( log′ p_i − log′ q_i )²
• For all u ∈ Y, the induced ℓ2 penalty is given by Σ_i p_i ( log′ p_i )²
Proof. We are interested in analyzing the regularizing behavior of the following loss function:
− Σ_i y_i log p_i + KL(p, q), where y is the ground-truth label, p is the output probability vector of the bigger
network, and q is the output probability vector of the smaller network. Recall that the parameters
of the smaller network are shared across both. We will look at the second-order Taylor expansion
of the KL-divergence term. This will give us insight into the regularization behavior of the loss
function.</p>
      <p>Let w be a parameter which is common across both networks and u be a parameter in the
bigger network but not in the smaller one. Write the KL term as a function of each parameter:
f(w) = Σ_i p_i(w)( log p_i(w) − log q_i(w) )   and   f(u) = Σ_i p_i(u)( log p_i(u) − log q_i )
For the parameter u, q is a constant. Now, computing the first-order derivatives, we get
f′(w) = Σ_i [ p_i′(w)( log p_i(w) − log q_i(w) ) + p_i′(w) − p_i(w) q_i′(w) / q_i(w) ]
f′(u) = Σ_i [ p_i′(u)( log p_i(u) − log q_i ) + p_i′(u) ]</p>
      <p>Now, computing the second derivatives for both types of parameters, we get
f′′(w) = Σ_i [ p_i′′(w)( log p_i(w) − log q_i(w) ) + p_i′(w)( p_i′(w)/p_i(w) − q_i′(w)/q_i(w) ) + p_i′′(w)
− ( q_i(w) p_i′(w) q_i′(w) + q_i(w) q_i′′(w) p_i(w) − q_i′(w) q_i′(w) p_i(w) ) / q_i²(w) ]   (7)
f′′(u) = Σ_i [ p_i′′(u)( log p_i(u) − log q_i ) + p_i′(u) p_i′(u) / p_i(u) + p_i′′(u) ]   (8)
Similarly, for the parameters only in the bigger network, we get that
f′′(u) = Σ_i p_i′(u) p_i′(u) / p_i(u) = Σ_i p_i ( log′ p_i )²</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          ,
          <article-title>Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding</article-title>
          ,
          <source>arXiv preprint arXiv:1510.00149</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pedram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Horowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          , Eie:
          <article-title>Efficient inference engine on compressed deep neural network</article-title>
          ,
          <year>2016</year>
          . arXiv:
          <volume>1602</volume>
          .
          <fpage>01528</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T. P.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <article-title>Faster cnns with direct sparse convolutions and guided pruning</article-title>
          ,
          <year>2017</year>
          . arXiv:
          <volume>1608</volume>
          .
          <fpage>01409</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>To prune, or not to prune: exploring the efficacy of pruning for model compression</article-title>
          ,
          <year>2017</year>
          . arXiv:
          <fpage>1710</fpage>
          .
          <year>01878</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Elsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hooker</surname>
          </string-name>
          ,
          <article-title>The state of sparsity in deep neural networks</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1902</year>
          .09574.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kusupati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanujan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Somani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kakade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>Soft threshold weight reparameterization for learnable sparsity</article-title>
          ,
          <year>2020</year>
          . arXiv:
          <year>2002</year>
          .03231.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>U.</given-names>
            <surname>Evci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Menick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Castro</surname>
          </string-name>
          , E. Elsen, Rigging the lottery:
          <source>Making all tickets winners</source>
          ,
          <year>2021</year>
          . arXiv:
          <year>1911</year>
          .11134.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Learning efficient convolutional networks through network slimming</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2736</fpage>
          -
          <lpage>2744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadav</surname>
          </string-name>
          , I. Durdanovic,
          <string-name>
            <given-names>H.</given-names>
            <surname>Samet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Graf</surname>
          </string-name>
          ,
          <article-title>Pruning filters for efficient convnets</article-title>
          ,
          <source>arXiv preprint arXiv:1608.08710</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network</article-title>
          ,
          <source>arXiv preprint arXiv:1503.02531</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , L. Fei-Fei,
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          (IJCV)
          <volume>115</volume>
          (
          <year>2015</year>
          )
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          . doi:10.1007/s11263-015-0816-y.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nair</surname>
          </string-name>
          , G. Hinton, Cifar-10 and cifar-100 datasets, URL: https://www.cs.toronto.edu/kriz/cifar.html
          <volume>6</volume>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Kahou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chassang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gatta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Fitnets: Hints for thin deep nets</article-title>
          ,
          <year>2015</year>
          . arXiv:
          <volume>1412</volume>
          .
          <fpage>6550</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z. Zhang,</surname>
          </string-name>
          <article-title>Correlation congruence for knowledge distillation</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1904</year>
          .
          <year>01802</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Damianou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>Variational information distillation for knowledge transfer</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1904</year>
          .05835.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cho</surname>
          </string-name>
          , Relational knowledge distillation,
          <year>2019</year>
          . arXiv:
          <year>1904</year>
          .05068.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <article-title>Paraphrasing complex network: Network compression via factor transfer</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cesa-Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>31</volume>
          ,
          Curran Associates, Inc.,
          <year>2018</year>
          . URL: https://proceedings.neurips.cc/paper/2018/file/6d9cb7de5e8ac30bd5e8734bc96a35c1-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Hospedales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Deep mutual learning</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4320</fpage>
          -
          <lpage>4328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Channel pruning via automatic structure search</article-title>
          , arXiv preprint arXiv:2001.08565 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <article-title>Hrank: Filter pruning using high-rank feature map</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1529</fpage>
          -
          <lpage>1538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks</article-title>
          , arXiv preprint arXiv:1909.08174 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Slimmable neural networks</article-title>
          , arXiv preprint arXiv:1812.08928 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Neural architecture search with reinforcement learning</article-title>
          ,
          <year>2017</year>
          . arXiv:1611.01578.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Mnasnet: Platform-aware neural architecture search for mobile</article-title>
          ,
          <year>2019</year>
          . arXiv:1807.11626.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Real</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Selle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. L.</given-names>
            <surname>Suematsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurakin</surname>
          </string-name>
          ,
          <article-title>Large-scale evolution of image classifiers</article-title>
          ,
          <year>2017</year>
          . arXiv:1703.01041.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Darts: Differentiable architecture search</article-title>
          ,
          <year>2019</year>
          . arXiv:1806.09055.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Proxylessnas: Direct neural architecture search on target task and hardware</article-title>
          ,
          <year>2019</year>
          . arXiv:1812.00332.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vajda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <article-title>Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search</article-title>
          ,
          <year>2019</year>
          . arXiv:1812.03443.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.-T.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Metapruning: Meta learning for automatic neural network channel pruning</article-title>
          ,
          <year>2019</year>
          . arXiv:1903.10258.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>F. N.</given-names>
            <surname>Iandola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Moskewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <article-title>Squeezenet: Alexnet-level accuracy with 50x fewer parameters and &lt;0.5 MB model size</article-title>
          ,
          <source>arXiv preprint arXiv:1602.07360</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhmoginov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Mobilenetv2: Inverted residuals and linear bottlenecks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4510</fpage>
          -
          <lpage>4520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Efficientnet: Rethinking model scaling for convolutional neural networks</article-title>
          , arXiv preprint arXiv:1905.11946 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kullback</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Leibler</surname>
          </string-name>
          ,
          <article-title>On information and sufficiency</article-title>
          ,
          <source>The Annals of Mathematical Statistics</source>
          <volume>22</volume>
          (
          <year>1951</year>
          )
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vajda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <article-title>Fbnetv2: Differentiable neural architecture search for spatial and channel dimensions</article-title>
          , in: CVPR, IEEE,
          <year>2020</year>
          , pp.
          <fpage>12962</fpage>
          -
          <lpage>12971</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Discrimination-aware channel pruning for deep neural networks</article-title>
          , arXiv preprint arXiv:1810.11809 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Doermann</surname>
          </string-name>
          ,
          <article-title>Towards optimal structured cnn pruning via generative adversarial learning</article-title>
          ,
          <year>2019</year>
          . arXiv:1903.09291.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. U. K.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Kyung</surname>
          </string-name>
          ,
          <article-title>Efficient neural network compression</article-title>
          ,
          <year>2019</year>
          . arXiv:1811.12781.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Morariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>Nisp: Pruning networks using neuron importance score propagation</article-title>
          ,
          <year>2018</year>
          . arXiv:1711.05908.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Durdanovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Samet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Graf</surname>
          </string-name>
          ,
          <article-title>Pruning filters for efficient convnets</article-title>
          ,
          <year>2017</year>
          . arXiv:1608.08710.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>B.</given-names>
            <surname>Minnehan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Savakis</surname>
          </string-name>
          ,
          <article-title>Cascaded projection: End-to-end network compression and acceleration</article-title>
          ,
          <year>2019</year>
          . arXiv:1903.04988.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Doermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>Exploiting kernel sparsity and entropy for interpretable cnn compression</article-title>
          ,
          <year>2019</year>
          . arXiv:1812.04368.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Filter pruning via geometric median for deep convolutional neural networks acceleration</article-title>
          ,
          <year>2019</year>
          . arXiv:1811.00250.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Gool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Timofte</surname>
          </string-name>
          ,
          <article-title>Group sparsity: The hinge between filter pruning and decomposition for network compression</article-title>
          ,
          <year>2020</year>
          . arXiv:2003.08935.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Distilling knowledge via knowledge review</article-title>
          ,
          <source>in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Online knowledge distillation via collaborative learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>J.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <article-title>Imagenette</article-title>
          , GitHub repository with links to the dataset. URL: https://github.com/fastai/imagenette (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Prokop</surname>
          </string-name>
          ,
          <article-title>Using small proxy datasets to accelerate hyperparameter search</article-title>
          , arXiv preprint arXiv:1906.04887 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>