<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adjoined Networks: A Training Paradigm With Applications to Network Compression</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Utkarsh Nath</string-name>
          <email>unath@asu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shrinu Kushagra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yingzhen Yang</string-name>
          <email>yingzhen.yang@asu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In A. Martin, K. Hinkelmann</institution>
          ,
          <addr-line>H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.)</addr-line>
          ,
          <institution>Proceedings of the AAAI 2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence (AAAI-MAKE 2022), Stanford University</institution>
          ,
          <addr-line>Palo Alto, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing and Augmented Intelligence, Arizona State University</institution>
          ,
          <addr-line>Tempe, AZ 85281</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Waterloo</institution>
          ,
          <addr-line>Waterloo, ON N2L 3G1</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Compressing deep neural networks while maintaining accuracy is important when we want to deploy large, powerful models in production and/or on edge devices. One common technique used to achieve this goal is knowledge distillation. Typically, the output of a static pre-defined teacher (a large base network) is used as soft labels to train and transfer information to a student (or smaller) network. In this paper, we introduce Adjoined Networks, or AN, a learning paradigm that trains both the original base network and the smaller compressed network together. In our training approach, the parameters of the smaller network are shared across both the base and the compressed networks. Using our training paradigm, we can simultaneously compress (the student network) and regularize (the teacher network) any architecture. In this paper, we focus on popular CNN-based architectures used for computer vision tasks. We conduct an extensive experimental evaluation of our training paradigm on various large-scale datasets. Using ResNet-50 as the base network, AN achieves 71.8% top-1 accuracy with only 1.8M parameters and 1.6 GFLOPs on the ImageNet dataset. We further propose Differentiable Adjoined Networks (DANs), a training paradigm that augments AN by using neural architecture search to jointly learn both the width and the weights for each layer of the smaller network. DAN achieves ResNet-50 level accuracy on ImageNet with 3.8× fewer parameters and 2.2× fewer FLOPs.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Distillation</kwd>
        <kwd>Differentiable Adjoined Networks</kwd>
        <kwd>Neural Architecture Search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Deep Neural Networks (DNNs) have achieved state-of-the-art performance on many tasks
such as classification, object detection and image segmentation. However, the large number
of parameters often required to achieve this performance makes it difficult to deploy them at
the edge (e.g., on mobile phones, IoT, and embedded devices). Unlike cloud servers, these
edge devices are constrained in terms of memory, compute, and energy resources. A large
network performs many computations, consumes more energy, and is difficult to transport
and update. It also has a high per-image prediction time, which is a constraint
when real-time inference is needed. Thus, compressing neural networks while maintaining
accuracy and improving inference time has received significant attention in the last few years.
Popular techniques for network compression include pruning and knowledge distillation.</p>
      <p>
        Pruning methods remove parameters (or weights) of overparameterized DNNs based on some
pre-defined criteria. For example, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] removes weights whose absolute value is smaller than a
threshold. While weight pruning methods are successful at reducing the number of parameters
of the network, they often work by creating sparse tensors that may require special hardware
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or special software [
        <xref ref-type="bibr" rid="ref3">3</xref>
         ] to provide inference-time speed-ups. These methods, also known
as unstructured pruning, have been extensively studied in [
        <xref ref-type="bibr" rid="ref1 ref4 ref5 ref6 ref7">1, 4, 5, 6, 7</xref>
        ]. To overcome this
limitation, channel pruning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and filter pruning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] techniques are used. These structured
pruning methods work by removing entire convolution channels or sometimes even filters
based on some pre-defined criteria and can often provide significant improvement in inference
times. In this paper, we show that our algorithm, Adjoined Networks or AN, achieves accuracy
similar to the current state-of-the-art structured pruning methods while using a significantly lower
number of parameters and FLOPs (Fig. 1).
      </p>
      <p>
        The AN training paradigm works as follows. A given input image x is processed by two
networks, the larger network (or the base network) and the smaller network (or the compressed
network). The base network outputs a probability vector p and the compressed network outputs
a probability vector q. This setup is similar to the student-teacher training used in Knowledge
Distillation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] where the base network (or the teacher) is used to train the compressed
network (or the student). However, there are two very important distinctions. (1) In knowledge
distillation, the parameters of the base (or larger or teacher) network are fixed and the output
of the base network is used as a "soft label" to train the compressed (or smaller or student)
network. In the paradigm of the adjoined network, both the base and the compressed network
are trained together. The output of the base network influences the compressed network and
vice-versa. (2) The parameters of the compressed network are shared across both the smaller
and larger networks (Fig. 2). We train the two networks using a novel time-dependent loss
function called adjoined loss. An additional benefit of training the two networks together is that
the smaller network can have a regularizing effect on the larger network. In our experiments
(Section 6), we see that on many datasets and for many architectures, the base network trained
in the adjoined fashion has greater prediction accuracy than the same network trained alone in
the standard fashion. We also provide theoretical justification for this observation in
the appendix materials. The details of our design, the loss function, and how it supports fast
inference are discussed in Section 3.</p>
      <p>Figure 2: Training paradigm based on adjoined networks. The original and the compressed version of
the network are trained together with the parameters of the smaller network shared across both. The
network outputs two probability vectors p (original network) and q (smaller network).</p>
      <p>As discussed in the previous paragraph, in the AN training paradigm, all the parameters of
the smaller network are shared across both the smaller and larger network. Our compression
architecture design involves selecting and tuning a hyper-parameter α, which controls the size (the number
of parameters in each convolution layer) of the smaller network relative to the larger
base network. In our experiments (Section 6) with the AN paradigm, we found that choosing a
value of α = 2 or 4 as a global (same across all the layers of the network) constant typically
worked well. To get a further boost in compression performance, we propose the framework of
Differentiable Adjoined Networks (DAN). DAN uses techniques from Neural Architecture Search
(NAS) to further optimize and choose the right value of α at each layer of our compressed model.
The details of DAN are discussed in Section 5.</p>
      <p>
        Below are the main contributions of this work.
1. We propose a novel training paradigm based on Adjoined Networks or AN, that can compress
any CNN based neural architecture. This involves adjoined training where the original network
and the smaller network are trained together. This has twin benefits of compression and
regularization whereby the larger network (or teacher) transfers information and helps compress
the smaller network while the smaller network helps regularize the larger teacher network.
2. We further propose Differentiable Adjoined Networks, or DAN, which jointly learns some of
the hyper-parameters of the smaller network including the number of filters in each layer of
the smaller network.
3. We conducted an exhaustive experimental evaluation of our method and compared it against
several state-of-the-art methods on datasets such as ImageNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
         ], CIFAR-10, and
CIFAR-100 [
        <xref ref-type="bibr" rid="ref12">12</xref>
         ]. We consider different architectures such as ResNet-18, -50, -20, -32, -44, -56, -110, and
DenseNet-121. On ImageNet, using the adjoined training paradigm, we can compress
ResNet-50 by 4× with a 2× FLOPs reduction while achieving 75.1% accuracy. Moreover, the base
network gains 0.7% in accuracy when compared against the same network trained in the
standard (non-adjoined) fashion. We further increase the accuracy of the compressed model
to 75.7% by augmenting our approach with architecture search (DAN), clearly showing that
it is better to train the networks together. Furthermore, we compare our approach against
several state-of-the-art knowledge distillation methods on CIFAR-10 on various architectures
like ResNet-20, -32, -44, -56, and -110. On each of these architectures, the student trained using
the adjoined method outperforms those trained using other methods (Table 2).
      </p>
      <p>The paper is organized as follows. In Section 2, we discuss some of the other methods that
are related to the discussions in the paper. In Section 3, we provide details of the architecture
for adjoined networks and the loss function. In Section 4, we show how training both the base
and compressed network together provides compression (for the smaller network) as well as
regularization (for the larger network). In Section 5, we combine AN with neural architecture
search and introduce Differentiable Adjoined Networks (or DANs). In Section 6, we provide the
details of our experimental results. In Section A of the appendix, we provide strong theoretical
guarantees on the regularization behaviour of adjoined training.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section, we discuss various techniques used to design eficient neural networks in terms
of size and FLOPs. We also compare our approach to other similar approaches and ideas in the
literature.</p>
      <p>
        Knowledge Distillation is the transfer of knowledge from a cumbersome model to a small
model. [
        <xref ref-type="bibr" rid="ref10">10</xref>
         ] proposed the teacher-student model, where soft targets from the teacher are used to
train the student model. This forces the student to generalize in the same manner as the teacher.
Various knowledge transfer methods have been proposed recently. [
        <xref ref-type="bibr" rid="ref13">13</xref>
         ] used intermediate-layer
information from the teacher model to train a thinner and deeper student model. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposes to
use instance-level correlation congruence rather than just instance congruence between
the teacher and student. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] tried to maximize the mutual information between teacher and
student models using variational information maximization. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] aims at transferring structural
knowledge from teacher to student. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] argues that directly transferring a teacher’s knowledge
to a student is difficult due to inherent differences in structure, layers, channels, etc.; therefore,
they paraphrase the output of the teacher in an unsupervised manner, making it easier for the
student to understand. Most of these methods use a trained teacher model to train a student
model. In contrast, in this work we train both the teacher and the student together from scratch.
In recent work, [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], rather than using a teacher to train a student, they let a cohort of students
train together using a distillation loss function. In this paper, we consider a teacher and a
student together rather than using a pre-trained teacher. We also use a novel time-dependent
loss function. Moreover, we also provide theoretical guarantees on the efficacy of our approach.
We compare AN with various knowledge distillation methods in the experiments
section.
      </p>
      <p>
        Pruning techniques aim to achieve network compression by removing parameters or weights
from a network while still maintaining accuracy. These techniques can be broadly classified
into two categories: unstructured and structured. Unstructured pruning methods are generic
and do not take network architecture (channel, filters) into account. These methods induce
sparsity based on some pre-defined criteria and often achieve a state-of-the-art reduction in the
number of parameters. However, one drawback of these methods is that they are often unable
to provide inference time speed-ups on commodity hardware due to their unstructured nature.
Unstructured sparsity has been extensively studied in [
        <xref ref-type="bibr" rid="ref1 ref4 ref5 ref6 ref7">1, 4, 5, 6, 7</xref>
        ]. Structured pruning aims
to address the issue of inference time speed-up by taking network architecture into account.
As an example, for CNN architectures, these methods try to remove entire channels or filters,
or blocks. This ensures that the reduction in the number of parameters also translates to a
reduction in inference time on commodity hardware. For example, ABCPruner [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] decides
the convolution filters to be removed in each layer using an artificial bee colony algorithm.
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] prunes filters with low-rank feature maps. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] uses Taylor expansion to estimate the
change in the loss function by removing a particular filter, and finally removes the filters with
max change. The AN compression technique proposed in this paper can also be thought of as
a structured pruning method where the architecture choice at the start of training fixes the
convolution filters to be pruned and the amount of pruning at each layer. Another related
work is that of Slimmable Networks [
        <xref ref-type="bibr" rid="ref22">22</xref>
         ]. Here, different networks (or architectures) are switched
on one at a time and trained using the standard cross-entropy loss function. By contrast, in
this work, both the networks are trained together at the same time using a novel loss function
(adjoined-loss). We have compared our work with Slimmable Networks in Table 1.
Neural Architecture Search (NAS) is a technique that automatically designs neural
architectures without human intervention. The best architecture could be found by training all
architectures in the given search space from scratch to convergence, but this is computationally
impractical. Earlier studies in NAS were based on RL [
        <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
        ] and EA [
        <xref ref-type="bibr" rid="ref25">25</xref>
         ]; however, they
required large amounts of computational resources. More recent studies [
        <xref ref-type="bibr" rid="ref26 ref27 ref28">26, 27, 28</xref>
        ] encode architectures as
a weight-sharing super-net and optimize the weights using gradient descent. A recent study
Meta Pruning [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] searches over the number of channels in each layer. It generates weights for
all candidates and then selects the architecture with the highest validation accuracy. Many of
these techniques focus on designing compact architectures from scratch. In this paper, we use
architecture search to help guide the choice of architecture for compression, that is, the fraction
of filters which should be removed from each layer.
      </p>
      <p>
        Small architectures - Another research direction that is orthogonal to ours is to design smaller
architectures that can be deployed on edge devices, such as SqueezeNet [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], MobileNet [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] and
EfficientNet [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. In this paper, our focus is to compress existing architectures while ensuring
inference time speedups as well as maintaining prediction accuracy.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Adjoined networks</title>
      <p>
        In our training paradigm, the original (larger) and the smaller network are trained together.
The motivation for this kind of training comes from the principle that good teachers are lifelong
learners. Hence, the larger network which serves as a teacher for the smaller network should
not be frozen (as in standard teacher-student architecture designs [
        <xref ref-type="bibr" rid="ref10">10</xref>
         ]). Rather, both should
learn together in a "combined learning environment", that is, adjoined networks. By learning
together, both networks can become better.
      </p>
      <p>We are now ready to describe our approach and discuss the design of adjoined networks.
Before that, let us take a re-look at the standard convolution operator. Let x ∈ R^{h × w × c_in} be
the input to a convolution layer with weights W ∈ R^{c_in × c_out × k × k}, where c_in and c_out denote the
number of input and output channels, k the kernel size, and h, w the height and width of the
image. Then, the output of the convolution z is given by</p>
      <p>z = conv(x, W)
In the adjoined paradigm, a convolution layer with weight matrix W and a binary mask matrix
M ∈ {0, 1}^{c_in × c_out × k × k} receives two inputs x1 and x2 of size h × w × c_in and outputs two
vectors z1 and z2 as defined below.</p>
      <p>z1 = conv(x1, W),   z2 = conv(x2, W * M)   (1)
Here M is of the same shape as W and * represents element-wise multiplication. Note
that the parameters of the matrix M are fixed before training and not learned. The vector x1
represents an input to the original (bigger) network while the vector x2 is the input to the
smaller, compressed network. For the first convolution layer of the network x1 = x2, but the
two vectors are not necessarily equal for the deeper convolution layers (Fig. 2). The mask
matrix M serves to zero out some of the parameters of the convolution layer, thereby enabling
network compression. In this paper, we consider matrices M of the following form:</p>
      <p>M := M_α = the matrix in which the first c_out/α filters are all 1 and the rest are 0   (2)</p>
      <p>In Section 6, we run experiments with M := M_α for α ∈ {2, 4, 8, 16}. Putting this all together,
we see that any CNN-based architecture can be converted and trained in an adjoined fashion by
replacing the standard convolution operation with the adjoined convolution operation (Eqn. 1).
Since the first layer receives a single input (Fig. 2), two copies are created which are passed to
the adjoined network. The network finally gives two outputs: p, corresponding to the original
(bigger or unmasked) network, and q, corresponding to the smaller (compressed) network, where
each convolution operation is done using the subset of the parameters described by the mask
matrix M (or M_α). We train the network using a novel time-dependent loss function which
forces p and q to be close to one another (Defn. 1).</p>
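      <p>As an illustration, the following is a minimal PyTorch sketch of the adjoined convolution of Eqns. 1 and 2. The class name AdjoinedConv2d and its exact interface are ours, for exposition only, and are not taken from the released code. A single weight tensor is shared; the small-network path multiplies it by a fixed binary mask that keeps only the first c_out/α filters.</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdjoinedConv2d(nn.Module):
    """Adjoined convolution: two inputs, one shared weight tensor (illustrative sketch)."""

    def __init__(self, in_channels, out_channels, kernel_size, alpha=2, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=False)
        mask = torch.zeros_like(self.conv.weight)
        mask[: out_channels // alpha] = 1.0   # M_alpha: first c_out/alpha filters are 1, rest 0
        self.register_buffer("mask", mask)    # fixed before training, never learned

    def forward(self, x1, x2):
        z1 = self.conv(x1)                                      # z1 = conv(x1, W)
        z2 = F.conv2d(x2, self.conv.weight * self.mask,         # z2 = conv(x2, W * M)
                      stride=self.conv.stride, padding=self.conv.padding)
        return z1, z2
      </preformat>
      <p>At the first layer the same image is passed as both x1 and x2; deeper layers receive the two streams produced by the previous adjoined layer.</p>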
    </sec>
    <sec id="sec-4">
      <title>4. Regularization and Compression</title>
      <p>
        In the previous section, we looked at the design of adjoined networks. For one input (X, y) ∈
R^{h × w × c} × [0, 1]^N, the network outputs two vectors p and q ∈ [0, 1]^N, where N denotes the
number of classes and c denotes the number of input channels (equal to 3 for RGB images).
Definition 1 (Adjoined loss). Let y be the ground-truth one-hot encoded vector and p and q be
the output probabilities of the adjoined network. Then
ℒ(y, p, q) = − Σ_i y_i log p_i + λ(t) KL(p, q)   (3)
where KL(p, q) = Σ_i p_i log(p_i / q_i) is the measure of difference between two probability measures
[
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. The regularization term λ : [0, 1] → R is a function which changes with the number of
epochs during training. Here t = (current epoch) / (total number of epochs) equals zero at the start of training and equals
one at the end.
      </p>
      <p>In our definition of the loss function, the first term is the standard cross-entropy loss function
which trains the bigger network. To train the smaller network, we use the predictions from
the bigger network as a soft ground-truth signal. We use KL-divergence to measure how far
the output of the smaller network is from the bigger network. This also has a regularizing
effect as it forces the network to learn from a smaller set of parameters. Note that, in our
implementations, we use KL(p, q) = Σ_i p_i log((p_i + ε)/(q_i + ε)) to avoid rounding and
division-by-zero errors, where ε = 10^{−6}.</p>
      <p>At the start of training, p is not a reliable indicator of the ground-truth labels. To compensate
for this, the regularization term λ changes with time. In our experiments, we used λ(t) =
min{4t², 1}. Thus, the contribution of the second term in the loss is zero at the beginning and
steadily grows to one at 50% of training.</p>
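      <p>A short sketch of the adjoined loss of Defn. 1 in PyTorch follows. The function name is illustrative, and we assume p and q are already softmax probabilities:</p>
      <preformat>
import torch
import torch.nn.functional as F


def adjoined_loss(p, q, y, t, eps=1e-6):
    """Adjoined loss (Defn. 1): cross-entropy on p plus lambda(t) * KL(p || q).

    p, q : probabilities from the full and the small network, shape (batch, classes)
    y    : integer class labels, shape (batch,)
    t    : fraction of training completed, in [0, 1]
    """
    ce = F.nll_loss(torch.log(p + eps), y)                     # - sum_i y_i log p_i
    kl = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()
    lam = min(4.0 * t ** 2, 1.0)                               # lambda(t) = min{4 t^2, 1}
    return ce + lam * kl
      </preformat>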
    </sec>
    <sec id="sec-5">
      <title>5. DAN: Differentiable Adjoined Networks</title>
      <p>In Sections 3 and 4, we described the framework of adjoined networks and the corresponding
loss function. An important parameter in the design of these networks is the choice of the parameter
α. Currently, the choice of α is global, that is, we choose the same value of α for all the layers of
our network. However, choosing α independently for each layer would add more flexibility and
possibly improve the performance of the current framework. To solve this problem, we propose the
framework of Differentiable Adjoined Networks (or DANs).</p>
      <p>Consider the following example of a convolution network with one layer and with the following
choices of α ∈ A = {1, 2, 4}, each producing an output vector z_α. Finding the optimal network structure
is equivalent to solving arg max_{α ∈ A} ℓ(z_α), where ℓ is some loss function. For a one-layer
network, we can solve this problem by computing ℓ(z_α) for all the different values and then
taking the max. However, this becomes intractable as the number of layers increases; for a
50-layer network, the search space has size 3^50.</p>
      <p>
        Definition 2 (Gumbel-softmax ([
        <xref ref-type="bibr" rid="ref34">34</xref>
         ])). Given a vector v = [v_1, . . . , v_n] and a constant τ, the
Gumbel-softmax function is defined as g(v) = [g_1, . . . , g_n], where
      </p>
      <p>g_i = exp[(v_i + ε_i)/τ] / Σ_j exp[(v_j + ε_j)/τ]   (4)
and ε_i ∼ U(0, 1) is uniform random noise (also referred to as Gumbel noise). Note that as τ → 0,
Gumbel-softmax tends to the arg max function.</p>
      <p>Gumbel-softmax is a "re-parametrization trick" that can be viewed as a differentiable
approximation to the arg max function. Returning to the one-layer example, the optimization
objective now becomes Σ_{α ∈ A} g_α ℓ(z_α), where g_α represents the Gumbel weight corresponding
to the particular α. This objective is now differentiable and can be optimized using standard
techniques like back-propagation.</p>
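      <p>A minimal sketch of the Gumbel-softmax of Defn. 2, written exactly as in Eqn. 4 with uniform noise (the function name is illustrative):</p>
      <preformat>
import torch


def gumbel_softmax(v, tau):
    """Differentiable relaxation of arg max over architecture scores v (Eqn. 4)."""
    noise = torch.rand_like(v)                      # epsilon_i ~ U(0, 1) as in Defn. 2
    return torch.softmax((v + noise) / tau, dim=-1)

# torch.nn.functional.gumbel_softmax offers a built-in alternative; it samples the
# noise as -log(-log(U)), i.e. from the standard Gumbel distribution.
      </preformat>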
      <p>With this insight, we propose the DAN architecture (Fig. 3) where the standard convolution
operation is replaced by a DAN convolution operation. As before, let x ∈ R^{h × w × c_in} be the
input to the DAN convolution layer with weights W ∈ R^{c_in × c_out × k × k}, where c_in and c_out denote
the number of input and output channels, k the kernel size, and h, w the height and width of
the image. Let A = {α_1, . . . , α_s} be the range of values of α for the layer. Then, the output z
of the DAN convolution layer is given by
z(x) = Σ_{i=1}^{s} g_i(a) conv(x, W * M_{α_i})   (5)
where a = [a_1, . . . , a_s] denotes the mixing weights corresponding to the different α_i's, g
is the Gumbel-softmax function, and M_{α_i} is the mask matrix
corresponding to α_i (as in Eqn. 2). Thus, each DAN convolution layer combines its
outputs according to the Gumbel weights. Choosing the hyper-parameter α now corresponds
to learning the values of the mixing weights a for each layer of our DAN conv network. Note that,
as before, our network outputs two probability vectors p and q. But these vectors now also
depend upon the weight vector a at each layer. We are now ready to define our main loss
function.</p>
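      <p>The DAN convolution of Eqn. 5 can be sketched in PyTorch as a Gumbel-weighted mixture of masked convolutions. The class name DANConv2d and the interface are illustrative assumptions, not the paper's released code:</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F


class DANConv2d(nn.Module):
    """DAN convolution: Gumbel-weighted sum of conv(x, W * M_alpha) over alpha in A."""

    def __init__(self, in_channels, out_channels, kernel_size,
                 alphas=(1, 2, 4), stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=False)
        masks = torch.zeros(len(alphas), *self.conv.weight.shape)
        for i, a in enumerate(alphas):
            masks[i, : out_channels // a] = 1.0      # M_alpha_i keeps first c_out/alpha_i filters
        self.register_buffer("masks", masks)
        self.arch_weights = nn.Parameter(torch.zeros(len(alphas)))  # mixing weights a (learned)

    def forward(self, x, tau):
        g = F.gumbel_softmax(self.arch_weights, tau=tau)             # g = GS(a), Eqn. 4
        outs = [F.conv2d(x, self.conv.weight * m,
                         stride=self.conv.stride, padding=self.conv.padding)
                for m in self.masks]
        return sum(gi * zi for gi, zi in zip(g, outs))               # Eqn. 5
      </preformat>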
      <p>Definition 3 (Differentiable Adjoined loss). Let the search space be A = {α_1, . . . , α_s}. Let y
be the ground-truth one-hot encoded vector and p and q be the output probabilities of the adjoined
network. Then</p>
      <p>ℒ(y, p, q) = − Σ_i y_i log p_i + λ(t)( KL(p, q) + γ F(a) )   (6)
where KL(p, q) and λ(t) are the same as in Defn. 1, and a = [a^1, . . . , a^L], where a^l is the mixing
weight vector for the l-th convolution layer. F represents the Gumbel-weighted FLOPs, or floating
point operations, of the given network. That is,</p>
      <p>F(a) = Σ_l Σ_{i=1}^{s} g_i(a^l) FLOPs(l, α_i)</p>
      <p>where FLOPs(l, α_i) measures the number of floating point operations at the l-th convolution layer
corresponding to the hyper-parameter α_i. Also, note that γ in Eqn. 6 is a normalization constant.</p>
      <p>The Differentiable Adjoined Loss is similar to the Adjoined Loss defined in Eqn. 3. However, the key
difference is the F term. First, note that larger architectures tend to have higher accuracies.
Hence, DAN learning tends to prefer a network with low alpha (a large network) over one
with high alpha (a small network). Thus, the F term acts as a regularization penalty against DAN
preferring large architectures. Another point to note is that for a large network, say ResNet-50,
the number of FLOPs corresponding to any setting of the mixing weights can be very large.
Gamma (γ) normalizes it so that all the terms in the loss function are on the same scale.</p>
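      <p>Under the same assumptions, the FLOPs-regularized loss of Defn. 3 could be sketched as follows, reusing the adjoined_loss and gumbel_softmax sketches above; flops_per_alpha is a hypothetical per-layer table of FLOPs(l, α_i) values:</p>
      <preformat>
import torch


def dan_loss(p, q, y, t, arch_weights, flops_per_alpha, gamma, tau, eps=1e-6):
    """Differentiable adjoined loss (Defn. 3): adjoined loss + lambda(t) * gamma * F(a).

    arch_weights    : list of per-layer mixing-weight vectors a^l
    flops_per_alpha : list of per-layer tensors with FLOPs(l, alpha_i) for each alpha_i
    """
    base = adjoined_loss(p, q, y, t, eps)                 # cross-entropy + lambda(t) * KL(p || q)
    flops = sum((gumbel_softmax(a, tau) * f).sum()        # Gumbel-weighted FLOPs, summed over layers
                for a, f in zip(arch_weights, flops_per_alpha))
    lam = min(4.0 * t ** 2, 1.0)
    return base + lam * gamma * flops
      </preformat>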
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <p>
        We are now ready to describe our experiments in detail. We run experiments on three different
datasets. (1) ImageNet - an image classification dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
         ] with 1000 classes and about 1.2
million images. (2) CIFAR-10 - a collection of 60,000 images in 10 classes. (3) CIFAR-100 - same as
CIFAR-10 but with 100 classes [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For each of these datasets, we use standard data augmentation
techniques such as random resized cropping and random flipping.
      </p>
      <p>We train different architectures such as ResNet-100, ResNet-50, ResNet-18, ResNet-110,
ResNet-56, DenseNet-121 on all of the above datasets. On each dataset, we first train these
architectures in the standard non-adjoined fashion using the cross-entropy loss function. We
will refer to it by the name Standard. Next, we train the adjoined network, obtained by replacing
the standard convolution operation with the adjoined convolution operation, using the adjoined
loss function. In the second step, we obtain two different networks. In this section, we refer
to them as the AN-X-Full α and the AN-X-Small α networks, where X represents the number of
layers and α indexes the mask matrix M_α as defined in Eqn. 2. For example, AN-50-Full 2 and AN-50-Small 2
represent the larger and smaller networks obtained by adjoinedly training ResNet-50 with
α = 2, while AN-121-Full 4 and AN-121-Small 4 represent the models obtained by adjoinedly training
DenseNet-121 with α = 4. We compare the performance of the AN-X-Full α and
AN-X-Small α networks against the standard network. One point to note is that we do not replace
the convolutions in the stem layers but only those in the residual blocks. Since most of the
weights are in the later layers, this leads to significant space and time savings while retaining
competitive accuracy. DAN denotes the performance of the adjoined network on architectures
found by Differentiable Adjoined Networks. DAN-50 has the same number of blocks as ResNet-50,
whereas DAN-100 has twice the number of blocks of ResNet-50.</p>
      <p>We ran our experiments on a GPU-enabled machine using PyTorch. We have also open-sourced
our implementation.1 Hyperparameters for the experiments are listed on our GitHub page.</p>
      <p>In Section 6.1, we compare our compression results against other structured pruning methods.
In Section 6.2, we compare AN with various types of knowledge distillation methods. In Section
6.3, we describe our results for compression and performance of architectures found by DAN.
In Section 6.4, we show the strong regularizing effect of AN training.</p>
      <sec id="sec-6-1">
        <title>6.1. Comparison against other Structured Pruning works</title>
        <p>Table 1 compares our approach against other structured pruning methods on the ImageNet dataset for the ResNet-50 architecture. Note that these methods provide inference
speed-up without special hardware or software. We see that the adjoined training regime can
achieve compression that is significantly better than other methods considered in the literature.
In Figure 1, models trained using our paradigm sit clearly on the left side of the graph, while
other methods are clustered toward the right side. Other methods obtain compression ratios in the
range 2-3×, whereas our method achieves up to 12× compression in size. Similarly,
the reduction in GFLOPs for our method is among the highest compared to other state-of-the-art works,
while suffering only a small accuracy drop compared against the base ResNet-50 model. Figure 4
compares the performance of AN against various pruning methods on the CIFAR-10 dataset for the
ResNet-56 architecture.
1The code can be found at https://github.com/utkarshnath/Adjoint-Network.git</p>
        <p>
          Table 1 lists the methods compared on ImageNet for ResNet-50: ABCPruner-0.8 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], ABCPruner-0.7 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], GBN-50 [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], GBN-60 [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], DCP [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], HRank [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] (three settings), MetaPruning [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] (two settings), Slimmable Net [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] (two settings), and our AN-50-Small 4, DAN-50, DAN-100, and AN-50-Small 2.
Models trained using the AN paradigm achieve the highest accuracy with
the fewest parameters on CIFAR-10. AN exceeds the next best model (Hinge [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ]) by
0.8% while being smaller than 9 of the 11 models. The smallest AN model achieves accuracy
similar to Hinge but with 35% fewer parameters. We see similar results for FLOPs.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Comparison against other Knowledge Distillation Works</title>
        <p>In this section, we discuss the effectiveness of weight sharing and training two networks together.
We compare AN against the various state-of-the-art variants of knowledge distillation. In Table
2, we compare accuracy (Top-1%) of AN-X-Small against the same architecture trained using
various KD variants on CIFAR-10 dataset. The corresponding pre-trained ResNet architecture
was used as the teacher model for the KD variants. Teacher models were trained on
CIFAR-10 using the standard training paradigm. We see that all models trained using the AN paradigm
significantly outperform the models trained using the various teacher-student paradigms, showing
the effectiveness of training a subset of weights together.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Ablation study: Compression</title>
        <p>Table 3 reports the following networks: ResNet-20 / AN-20-Small 2, ResNet-32 / AN-32-Small 2, ResNet-44 / AN-44-Small 2, ResNet-56 / AN-56-Small 2, ResNet-110 / AN-110-Small 2, ResNet-50 / AN-50-Small 2 / AN-50-Small 4 / DAN-50, and ResNet-100 / AN-100-Small 4 / DAN-100. Here α denotes the masking matrix M_α (defined in Eqn. 2).</p>
        <p>In this section, we evaluate the performance of models compressed by Adjoined training
paradigm. Table 3 compares the performance (top-1 accuracy) of the models compressed using
AN against the performance of the standard network. For AN, we use M_α as the masking matrix
(defined in Eqn. 2). The mask is such that the last (1 − 1/α) fraction of the filters are zero. Hence, these can be
pruned away to support fast inference. For CIFAR-10, 4 out of 5 models compressed using the AN
paradigm exceed their base architecture by 0.5%-0.8%. These models achieve a 3.5-4× reduction in
parameters and a 2× reduction in FLOPs.</p>
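        <p>Since the mask zeroes a contiguous block of filters, a deployed small network does not need the mask at all: the kept filters can simply be sliced out into a dense, smaller layer. A sketch under that assumption follows (the helper name is ours; in deeper layers the input channels beyond c_in/α are also zero and can be sliced in the same way):</p>
        <preformat>
import torch.nn as nn


def extract_small_conv(adjoined_conv, alpha):
    """Slice the first out_channels // alpha filters out of a trained adjoined layer."""
    full = adjoined_conv.conv
    keep = full.out_channels // alpha
    small = nn.Conv2d(full.in_channels, keep, full.kernel_size,
                      stride=full.stride, padding=full.padding, bias=False)
    small.weight.data.copy_(full.weight.data[:keep])   # dense copy, no sparse kernels needed
    return small
        </preformat>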
        <p>
          We also observe that ResNet-50 is a bigger network and can be compressed more. Also,
different datasets can be compressed by different amounts. For example, on the CIFAR-100 dataset,
the network can be compressed by a factor of ∼35×, while for other datasets the factor ranges from 2×
to 12×. DAN is able to search for compressed architectures with minimal loss in accuracy
compared to the base architecture. For ImageNet, DAN architectures were searched on Imagewoof
(a proxy dataset with 10 different dog breeds from ImageNet [
          <xref ref-type="bibr" rid="ref46 ref47">46, 47</xref>
          ]). γ, as defined in Defn. 3, is
−13 and −19 for DAN-50 and DAN-100, respectively. During architecture search, the temperature τ
in the Gumbel softmax was initialized to 15 and exponentially annealed by −0.045 every epoch.
        </p>
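        <p>For concreteness, one reading of this schedule (an assumption on our part, since only the initial value and the per-epoch rate are stated) is an exponential decay of the temperature:</p>
        <preformat>
import math


def temperature(epoch, tau0=15.0, rate=0.045):
    """Assumed temperature schedule: start at 15 and decay by a factor of exp(-0.045) per epoch."""
    return tau0 * math.exp(-rate * epoch)
        </preformat>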
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Ablation study: Regularization</title>
          <p>Table 4 (AN-Full vs. Standard) covers ResNet-20, -32, -44, -56, -110, ResNet-50, ResNet-18, DenseNet-121, and ResNet-50 across CIFAR-10, CIFAR-100, and ImageNet, with α ∈ {2, 4, 8}. Here α denotes the masking matrix M_α (defined in Eqn. 2).</p>
          <p>In this section, we study the regularization effect of the Adjoined training paradigm on the AN-Full
network. Table 4 compares the performance of the base network trained in the adjoined fashion
(AN-Full) to the same network trained in Standard fashion. We see a consistent trend that the
network trained adjoinedly outperforms the same network trained in the standard way. We see
maximum gains on CIFAR-100, improving accuracy by as much as 1.8%. Even on ImageNet, we
see a gain of about 0.77%.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this work, we introduced the paradigm of Adjoined Network training where both the larger
teacher (or base) network and the smaller student network are trained together. We showed
how this approach to training neural networks can allow us to reduce the number of parameters
of large networks like ResNet-50 by 12× (even going up to 35× on some datasets) without
significant loss in classification accuracy, along with a 2-3× reduction in the number of FLOPs. We showed
(both theoretically and experimentally) that adjoining a large and a small network together
has a regularizing effect on the larger network. We also introduced DAN, a search strategy
that automatically selects the best architecture for the smaller student network. Augmenting
adjoined training with DAN, the smaller network achieves accuracy that is close to that of the
base teacher network.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Regularization theory</title>
      <p>Theorem A.1. Given a deep neural network N which consists of only convolution and linear
layers. Let the network use one of σ(x) = max{x, 0} (ReLU) or σ(x) = x (linear) as the activation
function. Let the network be trained using the adjoined loss function as defined in Eqn. 3. Let
X be the set of parameters of the network N which are shared across both the smaller and bigger
networks. Let Y be the set of parameters of the bigger network not shared with the smaller network.
Let p be the output of the larger network and let q be the output of the smaller network, where
p_i, q_i represent their i-th components. Then, the adjoined loss function induces a data-dependent
regularizer with the following properties.</p>
      <p>• For all w ∈ X, the induced ℓ2 penalty is given by Σ_i p_i ( log′ p_i − log′ q_i )²
• For all u ∈ Y, the induced ℓ2 penalty is given by Σ_i p_i ( log′ p_i )²
Proof. We are interested in analyzing the regularizing behavior of the following loss function:
− Σ_i y_i log p_i + KL(p, q), where y is the ground-truth label, p is the output probability vector of the bigger
network, and q is the output probability vector of the smaller network. Recall that the parameters
of the smaller network are shared across both. We will look at the second-order Taylor expansion
of the KL-divergence term. This will give us insight into the regularization behavior of the loss
function.</p>
      <p>Let w be a parameter which is common across both networks and u be a parameter in the
bigger network but not in the smaller one. Write the KL term as a function of each parameter:
f(w) = Σ_i p_i(w)( log p_i(w) − log q_i(w) )   and   f(u) = Σ_i p_i(u)( log p_i(u) − log q_i )
For the parameter u, q is a constant. Now, computing the first-order derivatives, we get
f′(w) = Σ_i [ p_i′(w)( log p_i(w) − log q_i(w) ) + p_i′(w) − p_i(w) q_i′(w) / q_i(w) ]
f′(u) = Σ_i [ p_i′(u)( log p_i(u) − log q_i ) + p_i′(u) ]</p>
      <p>Now, computing the second derivatives for both types of parameters, we get
f′′(w) = Σ_i [ p_i′′(w)( log p_i(w) − log q_i(w) ) + p_i′(w)( p_i′(w)/p_i(w) − q_i′(w)/q_i(w) ) + p_i′′(w)
− ( q_i(w) p_i′(w) q_i′(w) + q_i(w) q_i′′(w) p_i(w) − q_i′(w) q_i′(w) p_i(w) ) / q_i²(w) ]   (7)
f′′(u) = Σ_i [ p_i′′(u)( log p_i(u) − log q_i ) + p_i′(u) p_i′(u) / p_i(u) + p_i′′(u) ]   (8)
Similarly, for the parameters only in the bigger network, we get that
f′′(u) = Σ_i p_i′(u) p_i′(u) / p_i(u) = Σ_i p_i ( log′ p_i )²</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          ,
          <article-title>Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding</article-title>
          ,
          <source>arXiv preprint arXiv:1510.00149</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pedram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Horowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          , Eie:
          <article-title>Efficient inference engine on compressed deep neural network</article-title>
          ,
          <year>2016</year>
          . arXiv:
          <volume>1602</volume>
          .
          <fpage>01528</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T. P.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <article-title>Faster cnns with direct sparse convolutions and guided pruning</article-title>
          ,
          <year>2017</year>
          . arXiv:
          <volume>1608</volume>
          .
          <fpage>01409</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>To prune, or not to prune: exploring the efficacy of pruning for model compression</article-title>
          ,
          <year>2017</year>
          . arXiv:
          <fpage>1710</fpage>
          .
          <year>01878</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Elsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hooker</surname>
          </string-name>
          ,
          <article-title>The state of sparsity in deep neural networks</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1902</year>
          .09574.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kusupati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanujan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Somani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kakade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>Soft threshold weight reparameterization for learnable sparsity</article-title>
          ,
          <year>2020</year>
          . arXiv:
          <year>2002</year>
          .03231.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>U.</given-names>
            <surname>Evci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Menick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Castro</surname>
          </string-name>
          , E. Elsen, Rigging the lottery:
          <source>Making all tickets winners</source>
          ,
          <year>2021</year>
          . arXiv:
          <year>1911</year>
          .11134.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Learning efficient convolutional networks through network slimming</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2736</fpage>
          -
          <lpage>2744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadav</surname>
          </string-name>
          , I. Durdanovic,
          <string-name>
            <given-names>H.</given-names>
            <surname>Samet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Graf</surname>
          </string-name>
          ,
          <article-title>Pruning filters for efficient convnets</article-title>
          ,
          <source>arXiv preprint arXiv:1608.08710</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network</article-title>
          ,
          <source>arXiv preprint arXiv:1503.02531</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , L. Fei-Fei,
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          (IJCV)
          <volume>115</volume>
          (
          <year>2015</year>
          )
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          . doi:10.1007/s11263-015-0816-y.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nair</surname>
          </string-name>
          , G. Hinton, Cifar-10 and cifar-100 datasets, URL: https://www.cs.toronto.edu/kriz/cifar.html
          <volume>6</volume>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Kahou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chassang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gatta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Fitnets: Hints for thin deep nets</article-title>
          ,
          <year>2015</year>
          . arXiv:
          <volume>1412</volume>
          .
          <fpage>6550</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z. Zhang,</surname>
          </string-name>
          <article-title>Correlation congruence for knowledge distillation</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1904</year>
          .
          <year>01802</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Damianou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>Variational information distillation for knowledge transfer</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1904</year>
          .05835.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cho</surname>
          </string-name>
          , Relational knowledge distillation,
          <year>2019</year>
          . arXiv:
          <year>1904</year>
          .05068.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <article-title>Paraphrasing complex network: Network compression via factor transfer</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cesa-Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>31</volume>
          ,
          Curran Associates, Inc.,
          <year>2018</year>
          . URL: https://proceedings.neurips.cc/paper/2018/file/6d9cb7de5e8ac30bd5e8734bc96a35c1-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Hospedales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Deep mutual learning</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4320</fpage>
          -
          <lpage>4328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Channel pruning via automatic structure search</article-title>
          , arXiv preprint arXiv:2001.08565 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <article-title>Hrank: Filter pruning using high-rank feature map</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1529</fpage>
          -
          <lpage>1538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks</article-title>
          , arXiv preprint arXiv:1909.08174 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Slimmable neural networks</article-title>
          , arXiv preprint arXiv:1812.08928 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Neural architecture search with reinforcement learning</article-title>
          ,
          <year>2017</year>
          . arXiv:1611.01578.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Mnasnet: Platform-aware neural architecture search for mobile</article-title>
          ,
          <year>2019</year>
          . arXiv:1807.11626.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Real</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Selle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. L.</given-names>
            <surname>Suematsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurakin</surname>
          </string-name>
          ,
          <article-title>Large-scale evolution of image classifiers</article-title>
          ,
          <year>2017</year>
          . arXiv:1703.01041.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Darts: Differentiable architecture search</article-title>
          ,
          <year>2019</year>
          . arXiv:1806.09055.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Proxylessnas: Direct neural architecture search on target task and hardware</article-title>
          ,
          <year>2019</year>
          . arXiv:1812.00332.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vajda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <article-title>Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search</article-title>
          ,
          <year>2019</year>
          . arXiv:1812.03443.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.-T.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Metapruning: Meta learning for automatic neural network channel pruning</article-title>
          ,
          <year>2019</year>
          . arXiv:1903.10258.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>F. N.</given-names>
            <surname>Iandola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Moskewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <article-title>Squeezenet: Alexnet-level accuracy with 50x fewer parameters and &lt;0.5 MB model size</article-title>
          ,
          <source>arXiv preprint arXiv:1602.07360</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhmoginov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Mobilenetv2: Inverted residuals and linear bottlenecks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4510</fpage>
          -
          <lpage>4520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Efficientnet: Rethinking model scaling for convolutional neural networks</article-title>
          , arXiv preprint arXiv:1905.11946 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kullback</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Leibler</surname>
          </string-name>
          ,
          <article-title>On information and sufficiency</article-title>
          ,
          <source>The Annals of Mathematical Statistics</source>
          <volume>22</volume>
          (
          <year>1951</year>
          )
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vajda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <article-title>Fbnetv2: Differentiable neural architecture search for spatial and channel dimensions</article-title>
          , in: CVPR, IEEE,
          <year>2020</year>
          , pp.
          <fpage>12962</fpage>
          -
          <lpage>12971</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Discrimination-aware channel pruning for deep neural networks</article-title>
          , arXiv preprint arXiv:1810.11809 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Doermann</surname>
          </string-name>
          ,
          <article-title>Towards optimal structured cnn pruning via generative adversarial learning</article-title>
          ,
          <year>2019</year>
          . arXiv:1903.09291.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. U. K.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Kyung</surname>
          </string-name>
          ,
          <article-title>Efficient neural network compression</article-title>
          ,
          <year>2019</year>
          . arXiv:1811.12781.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Morariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>Nisp: Pruning networks using neuron importance score propagation</article-title>
          ,
          <year>2018</year>
          . arXiv:1711.05908.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Durdanovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Samet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Graf</surname>
          </string-name>
          ,
          <article-title>Pruning filters for efficient convnets</article-title>
          ,
          <year>2017</year>
          . arXiv:1608.08710.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>B.</given-names>
            <surname>Minnehan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Savakis</surname>
          </string-name>
          ,
          <article-title>Cascaded projection: End-to-end network compression and acceleration</article-title>
          ,
          <year>2019</year>
          . arXiv:1903.04988.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Doermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>Exploiting kernel sparsity and entropy for interpretable cnn compression</article-title>
          ,
          <year>2019</year>
          . arXiv:1812.04368.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Filter pruning via geometric median for deep convolutional neural networks acceleration</article-title>
          ,
          <year>2019</year>
          . arXiv:1811.00250.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Gool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Timofte</surname>
          </string-name>
          ,
          <article-title>Group sparsity: The hinge between filter pruning and decomposition for network compression</article-title>
          ,
          <year>2020</year>
          . arXiv:2003.08935.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Distilling knowledge via knowledge review</article-title>
          ,
          <source>in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Online knowledge distillation via collaborative learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>J.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <article-title>Imagenette</article-title>
          , GitHub repository with links to the dataset. URL: https://github.com/fastai/imagenette (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Prokop</surname>
          </string-name>
          ,
          <article-title>Using small proxy datasets to accelerate hyperparameter search</article-title>
          , arXiv preprint arXiv:1906.04887 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>