<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEBD</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Speeding up Vision Transformers Through Reinforcement Learning</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Cauteruccio</string-name>
          <email>fcauteruccio@unisa.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Marchetti</string-name>
          <email>m.marchetti@pm.univpm.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Traini</string-name>
          <email>davide.traini@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Ursino</string-name>
          <email>d.ursino@univpm.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Virgili</string-name>
          <email>luca.virgili@univpm.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CHIMOMO, University of Modena and Reggio Emilia</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIEM, University of Salerno</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>DII, Polytechnic University of Marche</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>32</volume>
      <fpage>23</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>In recent years, Transformers have led a revolution in Natural Language Processing, and Vision Transformers (ViTs) promise to do the same in Computer Vision. The main obstacle to the widespread use of ViTs is their computational cost. Indeed, given an image divided into a list of patches, ViTs compute, for each layer, the attention of each patch with respect to all others. In the literature, many solutions try to reduce the computational cost of attention layers using quantization, knowledge distillation, and input perturbation. In this paper, we aim to make a contribution in this setting. In particular, we propose AgentViT, a framework that uses Reinforcement Learning to train an agent whose task is to identify the least important patches during the training of a ViT. Once such patches are identified, AgentViT removes them, thus reducing the number of patches processed by the ViT. Our goal is to reduce the training time of the ViT while maintaining competitive performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Vision Transformers</kwd>
        <kwd>Training Time Reduction</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>CIFAR10</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, thanks also to the massive development of deep learning systems, Artificial
Intelligence is experiencing a golden age in many sectors, including Natural Language Processing
(NLP) and Computer Vision (CV) [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Transformers are one of the key players in this
development [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Initially designed in the context of NLP, they were adapted for Computer
Vision tasks through the introduction of Vision Transformers (ViTs) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The working principle
of ViTs is similar to those of Transformers, but instead of dividing a sentence into words, they
split an image into non-overlapping rectangular patches and look for semantic correlations
between them. ViTs have proven to be very competitive, and in some contexts, their performance
has been superior to that of Convolutional Neural Networks (CNNs) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The main problem with
ViTs is their computational cost, since for each layer it is necessary to compute the attention of
each token with respect to all others. To overcome this problem, several variants of ViTs have
been proposed to reduce the cost of the attention layers [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9</xref>
        ].
      </p>
      <p>
        Another area of Artificial Intelligence that has shown great promise in recent years is
Reinforcement Learning (RL). In fact, it is being applied in a wide range of contexts, from robotics
to intelligent transportation systems [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15">10, 11, 12, 13, 14, 15</xref>
        ].
      </p>
      <p>
        In this paper, we propose AgentViT, a framework for ViT optimization. To achieve its goal,
AgentViT uses RL to reduce the computational complexity of the attention layer and thus the
training time of ViTs. In AgentViT, an RL agent selects a subset of the image patches so that the
ViT has to process only them for its training, thus reducing the training time while maintaining
competitive performance. The RL agent is a Deep Q-Network [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] that returns a list
of selected patches. The agent is composed of three dense layers. For each training batch, it
observes the attention values produced by the first ViT layer and returns a subset of the original
patches to use for the training of the ViT. After a certain number of training epochs, the agent
receives a reward that takes into account training loss and training time. The user can decide
how much weight to give to each of these two parameters, thus favoring a set of patches that
guarantees a low training time or one that guarantees a low training loss.
      </p>
      <p>
        Several approaches have been proposed in the literature to reduce the computational load
of attention layers. They are based on different techniques such as quantization [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20 ref21 ref22 ref23">17, 18, 19, 20,
21, 22, 23</xref>
        ], pruning [
        <xref ref-type="bibr" rid="ref24 ref25 ref26 ref27">24, 25, 26, 27</xref>
        ], low-rank factorization [
        <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
        ] and knowledge distillation
[
        <xref ref-type="bibr" rid="ref30 ref31 ref32 ref33 ref34 ref35">30, 31, 32, 33, 34, 35</xref>
        ]. Other approaches perturb the input of a ViT to optimize the resources
it uses [
        <xref ref-type="bibr" rid="ref36 ref37 ref8">36, 8, 37</xref>
        ]. Others compute the importance of each token and remove less important
tokens as inference proceeds [
        <xref ref-type="bibr" rid="ref38 ref6 ref7 ref9">9, 6, 7, 38, 39, 40</xref>
        ]. AgentViT shares with some of the above
approaches the policy of setting a variable number of tokens based on the input images. This
allows it to fit the images in the best possible way. However, it has a completely different
fitting mechanism than the other approaches. In fact, the latter require the user to specify the
maximum number of tokens to be used. If, after resampling, the number of tokens is greater
than the number specified by the user, the excess tokens are removed. AgentViT also allows the
user to specify the maximum number of tokens desired, and it trains its RL agent to select a
number of tokens as close as possible to the number specified by the user. However, if it obtains
a particularly low training loss during training, its reward mechanism will prompt it to select a
smaller number of tokens for that particular batch of images. Conversely, if it obtains a high
training loss for a particular batch, its reward mechanism will prompt it to increase the number
of tokens to be used, giving less weight to the number specified by the user.
      </p>
      <p>This paper is organized as follows: in Section 2, we describe AgentViT. In Section 3, we
illustrate our experimental campaign aimed at determining the values of its hyperparameters,
comparing it with related approaches, and deriving interesting insights. Finally, in Section 4,
we draw our conclusions and look at some possible future developments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Description of AgentViT</title>
      <sec id="sec-2-1">
        <title>2.1. Schematic workflow of AgentViT</title>
        <p>AgentViT uses a Markov Decision Process (MDP) ℳ = ⟨𝒮, 𝒜, ℛ, 𝒫, γ⟩ [41]. Here, 𝒮 is a state
space, 𝒜 is a discrete set of actions, ℛ : 𝒮 × 𝒜 → ℝ is a reward function, 𝒫 : 𝒮 × 𝒜 → 𝒮 is a
transition kernel, and γ ∈ [0, 1) is a discount factor. In an MDP, a stationary policy π : 𝒮 → 𝒜
is a mapping from states to actions; it specifies the action an agent takes when it is in a given
state. It is used to describe how an agent interacts with the environment.</p>
        <p>AgentViT uses an Action Value Function Q(s, a), introduced in Q-Learning [42], to estimate
the expected cumulative reward an RL agent can obtain from a given state-action pair (s, a).
Q-Learning uses a table with a row for each observable state and a column for each possible
action. As the algorithm runs, the values in the table are updated using the formula expressed
in Equation 2.1. This allows us to recursively obtain the cumulative reward Q(s, a) associated
with each state-action pair (s, a).</p>
        <p>Q(s, a) ← Q(s, a) + α · [ℛ_{s,a} + γ · max_{a′} Q(s*, a′) − Q(s, a)]  (2.1)</p>
        <p>Here:
• α is the learning rate; it is a number in the real range [0, 1] and specifies the rate at which
the agent learns;
• γ is the discount factor; it represents the importance of the immediacy of the reward. If γ
is closer to 0, actions with immediate rewards are favored; if γ is closer to 1, all rewards
are given equal weight, regardless of their immediacy, which favors a long-term view;
• s* is the next state, i.e., the state in which the agent arrives when it starts from s and
executes the action a.</p>
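        <p>For illustration, the tabular update of Equation 2.1 can be sketched in a few lines of code (a toy example with hypothetical integer states and actions, not AgentViT’s code):</p>

```python
# Minimal tabular Q-Learning update (Equation 2.1), for illustration only.
# alpha is the learning rate and gamma the discount factor described above.

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Move Q[s][a] towards the bootstrapped target and return the new value."""
    best_next = max(Q[s_next])                        # max over next actions
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])
    return Q[s][a]

# Toy table: 2 states, 2 actions, all Q-values start at zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
print(q_update(Q, s=0, a=1, reward=1.0, s_next=1))  # 0.1
```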
        <p>The Q-Learning algorithm struggles in the presence of a large number of states or when the
states involved are continuous. In these cases, the table is replaced by a neural network called
a Deep Q-Network. It receives the vector representing the state s and computes the Q-values
corresponding to each pair (s, a*), where a* represents any action the agent can take in s. The
agent chooses the action with the highest Q-value. After executing the chosen action, the agent
receives a reward, which the Deep Q-Network uses to tune its behavior for the next step.</p>
        <p>[Figure: schematic workflow of AgentViT. The Vision Transformer produces attention
scores observed by the Deep Q-Network; the Deep Q-Network returns the selected patches on
which the ViT is trained, and receives a reward in return.]</p>
      </sec>
      <sec id="sec-2-state">
        <title>2.2. State</title>
        <p>In the context of the MDP, a state s ∈ 𝒮 observed by the agent is represented by a vector of real
numbers and models the current conditions of the environment. In AgentViT, the state of an
environment is represented by the attention scores obtained from a batch of images processed
by the first attention layer of the ViT.</p>
        <p>More specifically, given an image composed of n patches, the output of the attention layer
is represented by an n × e matrix, where e is the embedding size of the image. This matrix is
given in input to a ViT module downstream of the attention layer, which transforms the matrix
itself into a vector of n elements obtained by averaging the values along the e dimensions.
This vector represents the average values of attention (and thus importance) that the attention
layer assigns to the different patches. It is also the state given as input to AgentViT’s Deep
Q-Network.</p>
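        <p>A hedged sketch of this state construction, assuming a NumPy array whose layout mirrors the attention output of the first layer (the actual tensor shapes inside the ViT may differ):</p>

```python
import numpy as np

# Sketch: reduce the (n_patches x emb_size) attention output of the first
# ViT layer to the n_patches-element state vector observed by the agent,
# by averaging along the embedding dimension. Shapes are illustrative.

def attention_to_state(attn_matrix):
    """attn_matrix: (n_patches, emb_size) -> vector of n_patches mean scores."""
    return attn_matrix.mean(axis=1)

n_patches, emb_size = 64, 128
state = attention_to_state(np.random.rand(n_patches, emb_size))
print(state.shape)  # (64,)
```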
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Action</title>
        <p>Given a vector of n elements representing a state s ∈ 𝒮, the agent returns a list of n elements.
Each element is associated with a patch and is a real value that, according to the Deep Q-Learning
algorithm, represents the cumulative reward associated with the corresponding patch as
estimated by the agent [42]. The list is sorted in descending order so that the first elements
represent the most important patches. AgentViT selects all patches in the list whose associated
value is greater than the average value of the elements in the list. Therefore, the action a ∈ 𝒜
associated with the state s ∈ 𝒮 corresponds to the selection of the most promising patches.</p>
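        <p>This selection rule can be sketched as follows (plain Python, with illustrative Q-values):</p>

```python
# Sketch of the action step: keep every patch whose estimated cumulative
# reward is strictly greater than the mean over all patches.

def select_patches(q_values):
    """Return the indices of patches with above-average estimated reward."""
    mean_q = sum(q_values) / len(q_values)
    return [i for i, q in enumerate(q_values) if q > mean_q]

print(select_patches([0.9, 0.1, 0.5, 0.3]))  # [0, 2]
```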
        <p>The transition kernel function 𝒫 (see Section 2.1) is the one provided by Deep Q-Learning.
Based on it, after the agent chooses an action a ∈ 𝒜 and receives a reward ℛ_{s,a}, the update of
the Q-value associated with the pair (s, a) is done through the following formula [42]:</p>
        <p>Q(s, a) ← Q(s, a) + α · ℒ(ℛ_{s,a} + γ · max_{a′} Q(s*, a′), Q(s, a))</p>
        <p>Similarly to Equation 2.1, this formula describes the update of Q(s, a) by taking into account
the previous value and the distance between the maximum cumulative reward associated with
the next state s* and the Q-value associated with the current state. In AgentViT, we adopted a
Huber function ℒ [43] to compute this distance (unlike [42], which used an algebraic difference).
The reasoning behind this choice is that ℒ is not sensitive to outliers and, in some cases,
prevents the gradient explosion problem. The cumulative reward for the next state is predicted
by the Target Network, which consists of an exact copy of the agent network except that its
weights are not updated by backpropagation, but are periodically copied from the agent network
by a soft-copy mechanism. As shown in [44], this way of proceeding allows us to stabilize the
learning process.</p>
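        <p>A minimal sketch of these two ingredients, with plain weight vectors standing in for the agent and target networks (the constants are illustrative assumptions, not AgentViT’s tuned values):</p>

```python
import numpy as np

# Sketch: Huber function used as the distance in the Q-value update, and
# the soft-copy of agent weights into the target network.

def huber(x, delta=1.0):
    """Quadratic near zero, linear in the tails; robust to outliers."""
    ax = np.abs(x)
    return np.where(ax > delta, delta * (ax - 0.5 * delta), 0.5 * x**2)

def soft_copy(target_w, agent_w, tau=0.005):
    """Move target weights a small step towards the agent weights."""
    return (1 - tau) * target_w + tau * agent_w

print(huber(np.array([0.5, 3.0])))                  # quadratic (0.125), linear (2.5)
print(soft_copy(np.array([0.0]), np.array([1.0])))  # small step towards 1.0
```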
        <p>AgentViT also has a mechanism to avoid falling into a local minimum. Indeed, the agent
chooses a random action with a probability equal to ε instead of the action that maximizes
the value of Q. The value of ε decays exponentially as training progresses to avoid instability.
In this way, AgentViT is able to ensure good exploratory analysis in the early stages of ViT
training and good stability of results as training progresses.</p>
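        <p>This ε-greedy scheme with exponential decay can be sketched as follows (the decay constants are assumptions for illustration):</p>

```python
import math
import random

# Sketch of epsilon-greedy action selection with exponentially decaying
# epsilon: explore early in training, act greedily later on.

def epsilon(step, eps_start=1.0, eps_end=0.05, decay=200.0):
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay)

def choose_action(q_values, step, rng=random.random):
    if rng() >= epsilon(step):                             # exploit
        return max(range(len(q_values)), key=q_values.__getitem__)
    return random.randrange(len(q_values))                 # explore

print(round(epsilon(0), 2))       # 1.0
print(round(epsilon(10_000), 2))  # 0.05
```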
        <p>Finally, AgentViT uses a replay memory [45, 46] to improve the stability and generalizability
of the agent. It can store observed data for later use during training in a way that breaks
unwanted temporal correlations.</p>
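        <p>A replay memory of this kind can be sketched as a bounded buffer sampled uniformly at random (a generic illustration, not AgentViT’s exact implementation):</p>

```python
import random
from collections import deque

# Sketch of a replay memory: a bounded buffer of transitions sampled
# uniformly, which breaks the temporal correlations of consecutive batches.

class ReplayMemory:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest entries are evicted

    def push(self, transition):
        """Store a (state, action, reward, next_state) tuple."""
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push((t, 0, 0.0, t + 1))
print(len(memory))  # 3
```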
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Reward</title>
        <p>In AgentViT, the reward ℛ_t obtained at iteration t by starting from a state s and executing
an action a plays a key role since it serves to define the quality of training. As mentioned
above, this quality must take into account the time required for training and the accuracy.
Consequently, ℛ_t must consider both the training loss and the training time. For this purpose,
it is defined as a weighted mean of the training loss and the number of patches selected by the
agent, which is proportional to the training time.</p>
        <p>Based on this reasoning, ℛ_t can be formulated as:</p>
        <p>ℛ_t = ω · ℛ_loss(t) + (1 − ω) · ℛ_patch(t)  (2.3)</p>
        <p>Here:
• ℛ_loss(t) is the reward related to the training loss; it is equal to the ratio between the value
ℒ(0) of the loss function of the ViT at the starting iteration and the value ℒ(t) of the
same function at iteration t;
• ℛ_patch(t) is the reward related to the number of patches; it is defined as the ratio of the
difference between the actual number of patches selected by the agent and the user’s
desired number of selected patches, to the user’s desired number of selected patches;
• ω is a value belonging to the real interval [0, 1] that determines the weight to assign to
ℛ_loss with respect to ℛ_patch.</p>
        <p>In this way, the agent is incentivized to select a number of patches close to the number
desired by the user (or a very small number if the user does not specify a value). But, it is also
incentivized to select a subset of patches that can minimize training loss. These two goals are
represented by ℛ_loss and ℛ_patch in Equation 2.3. The weight ω allows the user to specify how
much importance to place on each of these goals. If ω tends to 1, the agent has an incentive
to choose a large number of patches. On the other hand, if ω tends to 0, it is incentivized to
minimize the number of patches selected, subject to the accuracy constraints to be achieved.</p>
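        <p>Equation 2.3 can be sketched as follows (the exact normalization of the two terms in AgentViT may differ; omega plays the role of ω above):</p>

```python
# Sketch of the reward of Equation 2.3: a weighted mean of a loss-based
# term and a patch-count term. The normalization is illustrative.

def reward(loss_0, loss_t, n_selected, n_desired, omega=0.5):
    r_loss = loss_0 / loss_t                         # grows as the loss drops
    r_patch = (n_selected - n_desired) / n_desired   # deviation from target
    return omega * r_loss + (1 - omega) * r_patch

# Loss halved, exactly the desired number of patches selected:
print(reward(loss_0=2.0, loss_t=1.0, n_selected=32, n_desired=32))  # 1.0
```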
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Testbed</title>
        <p>AgentViT can be applied to any ViT, since it is based on observing the attention scores it returns.
Consequently, in our experiments, we could have applied AgentViT to any ViT proposed in the
literature. To conduct the experiments in a reasonable time, we decided to employ SimpleViT
[47] that splits images into 64 patches (SimpleViT64), since it can be trained faster than a
classical ViT. We performed our experiments on the CIFAR10 dataset, which is a collection of
60,000 32×32 color images divided into 10 different classes, designed for training and testing
machine learning models in computer vision tasks. CIFAR10 is widely used for benchmarking
classification algorithms in the deep learning field. For the training and testing phases of our
experiments, we used Google Colab, which provides an Intel Xeon CPU with 2 vCPUs, 13 GB
of RAM, and an NVIDIA Tesla K80 GPU with 12 GB of VRAM. We refer the reader to the link
https://github.com/DavideTraini/RL-for-ViT for the code used to implement AgentViT.</p>
        <p>As a first step in our experiments, we had to define the values of the hyperparameters of
AgentViT. Due to space limitations, we cannot report in detail the tasks we performed to
determine these values. At the end of these tasks, we obtained the values reported in Table 1.</p>
        <p>
          As a next step, we decided to compare the performance of AgentViT with that of related
approaches already proposed in the literature. Specifically, the approaches we considered
for comparison are the original ViT, SimpleViT64, and ATSViT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]; the latter, to the best of
our knowledge, is the approach most similar to AgentViT. For each of these approaches, we
computed their Cumulative Training Time (measured in seconds), Accuracy, Precision, Recall,
and F1-Score. Table 2 shows the corresponding values.
        </p>
        <p>This table shows that there are approaches able to guarantee low values of Cumulative
Training Time, but at the expense of Accuracy, Precision, Recall, and F1-Score. Conversely,
other approaches can obtain high values of these measures, but at the expense of Cumulative
Training Time. AgentViT achieves a near-best value for all five metrics. In other
words, it does not achieve the top value for any single metric, but it ensures the best
compromise among all of them.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Discussion</title>
        <p>As seen above, the goal of AgentViT is to use RL to select patches for optimal filtering.
Comparison with other related approaches has shown us that it has a satisfactory performance.
In fact, it is able to train a ViT in less time than the ViT itself trained without removing
patches. This saving in training time does not come at the expense of accuracy, which remains
comparable to that of the SimpleViT trained without patch removal. Moreover, AgentViT allows
the user to specify the desired trade-off between accuracy and training time. Finally, AgentViT
is the approach capable of providing the best trade-off between Cumulative Training Time on
the one hand, and Accuracy, Precision, Recall, and F1-Score on the other hand.</p>
        <p>In addition, AgentViT has other interesting implications. One of them is the possibility of
using larger Vision Transformers. In fact, AgentViT’s ability to reduce training time makes
it possible to adopt architectures that would not normally have been adopted due to their
excessive computational load. A second implication concerns the use of AgentViT to build
smaller synthetic datasets from the original ones, which can be used to train deep neural
networks. A further implication concerns the possibility of extending the use of AgentViT
to contexts other than Vision Transformers. In fact, the idea behind AgentViT is general and
independent of the type of transformers to which it is applied; therefore, it could work with
any transformers, such as those used in the context of NLP. The only condition is that the
RL agent within AgentViT can receive an attention matrix as input.</p>
        <p>Finally, we highlight some limitations of AgentViT. The first concerns the fact that the decision
to use Deep Q-Learning within AgentViT involves the need to set various hyperparameters,
which makes the setup phase rather complex. A second limitation is related to the number of
patches required for AgentViT to work properly. In fact, if the Vision Transformer underlying
AgentViT works with only a few patches, the agent has difficulty selecting the most important
ones.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we proposed AgentViT, a framework that uses RL to reduce the training time of a
ViT without significantly reducing its performance. The RL agent present in AgentViT uses a
classical MDP-based mechanism to represent an environment for the image classification task.
As for this process, we redefined the state, action, and reward needed to train our RL agent. We
tested AgentViT using SimpleViT64 as the internal ViT and Deep Q-Network as the internal
RL agent. Our experiments showed that AgentViT can achieve the best trade-off between
Cumulative Training Time on the one hand and Accuracy, Precision, Recall, and F1-Score on
the other hand. The experiments conducted allowed us to draw several implications regarding
the strengths and limitations of our framework.</p>
      <p>We can think of several possible future developments of our approach. For example, we
could improve the reward function to consider validation loss instead of training loss. We
could also define a new metric, similar to the Akaike information criterion [48], which takes
into account both model performance and the number of tokens. Moreover, we could test
other Reinforcement Learning algorithms, such as Multi-Agent RL and Contextual Multi-Armed
Bandit, instead of the Deep Q-Network, and see if they can further improve the performance of
ViTs. These algorithms could assist in selecting the best actions and the corresponding patches
to speed up ViT training. Finally, we could evaluate the impact of our approach on different ViT
architectures, possibly including multiple attention layers, which would make our framework
more robust and versatile.</p>
      <p>Recognition (CVPR’22), New Orleans, LA, USA, 2022, pp. 12165–12174.
[39] Y. Rao, Z. Liu, W. Zhao, J. Zhou, J. Lu, Dynamic spatial sparsification for efficient vision
transformers and convolutional neural networks, IEEE Transactions on Pattern Analysis
and Machine Intelligence (2023). IEEE.
[40] L. Meng, H. Li, B.-C. Chen, S. Lan, Z. Wu, Y.-G. Jiang, S.-N. Lim, AdaViT: Adaptive vision
transformers for efficient image recognition, in: Proc. of the International Conference on
Computer Vision and Pattern Recognition (CVPR’22), New Orleans, LA, USA, 2022, pp. 12309–12318.
[41] C. L. Lan, S. Tu, A. Oberman, R. Agarwal, M. Bellemare, On the generalization of
representations in reinforcement learning, arXiv preprint arXiv:2203.00543 (2022).
[42] K. Arulkumaran, M. Deisenroth, M. Brundage, A. Bharath, Deep Reinforcement Learning:
A brief survey, IEEE Signal Processing Magazine 34 (2017) 26–38. IEEE.
[43] P. J. Huber, Robust estimation of a location parameter, in: Breakthroughs in statistics:
Methodology and distribution, Springer, 1992, pp. 492–518.
[44] J. Fan, Z. Wang, Y. Xie, Z. Yang, A theoretical analysis of deep Q-learning, in: Proc. of the
International Conference on Learning for Dynamics and Control (L4DC’20), Berkeley, CA,
USA, 2020, pp. 486–489. PMLR.
[45] R. Liu, J. Zou, The effects of memory replay in reinforcement learning, in: Proc. of the
Annual Allerton Conference on Communication, Control, and Computing (Allerton’18),
Monticello, IL, USA, 2018, pp. 478–485. IEEE.
[46] L. Lin, Self-improving reactive agents based on reinforcement learning, planning and
teaching, Machine Learning 8 (1992) 293–321. Springer.
[47] L. Beyer, X. Zhai, A. Kolesnikov, Better plain ViT baselines for ImageNet-1k, arXiv preprint
arXiv:2205.01580 (2022).
[48] H. Akaike, A new look at the statistical model identification, IEEE Transactions on
Automatic Control 19 (1974) 716–723. IEEE.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sainz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Heintz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>Recent advances in natural language processing via large pre-trained language models: A survey</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>56</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ngai</surname>
          </string-name>
          ,
          <article-title>Deep learning in computer vision: A critical review of emerging techniques and application scenarios</article-title>
          ,
          <source>Machine Learning with Applications</source>
          <volume>6</volume>
          (
          <year>2021</year>
          )
          <fpage>100134</fpage>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is All you Need,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ). Curran Associates, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:
          <year>2010</year>
          .
          <volume>11929</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Generating long sequences with sparse transformers</article-title>
          ,
          <source>arXiv preprint arXiv:1904.10509</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fayyaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Koohpayegani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. R.</given-names>
            <surname>Jafari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R. V.</given-names>
            <surname>Joze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sommerlade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirsiavash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gall</surname>
          </string-name>
          ,
          <article-title>Adaptive token sampling for efficient vision transformers</article-title>
          ,
          <source>in: Proc. of the European Conference on Computer Vision (ECCV'22)</source>
          , Tel Aviv, Israel,
          <year>2022</year>
          , pp.
          <fpage>396</fpage>
          -
          <lpage>414</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vahdat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kautz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Molchanov</surname>
          </string-name>
          ,
          <article-title>A-ViT: Adaptive tokens for efficient vision transformer</article-title>
          ,
          <source>in: Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR'22)</source>
          , New Orleans, LA, USA,
          <year>2022</year>
          , pp.
          <fpage>10809</fpage>
          -
          <lpage>10818</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Renggli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Puigcerver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Riquelme</surname>
          </string-name>
          ,
          <article-title>Learning to merge tokens in vision transformers</article-title>
          ,
          <source>arXiv preprint arXiv:2202.12015</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Not all patches are what you need: Expediting vision transformers via token reorganizations</article-title>
          ,
          <source>arXiv preprint arXiv:2202.07800</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Polydoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nalpantidis</surname>
          </string-name>
          ,
          <article-title>Survey of model-based reinforcement learning: Applications on robotics</article-title>
          ,
          <source>Journal of Intelligent &amp; Robotic Systems</source>
          <volume>86</volume>
          (
          <year>2017</year>
          )
          <fpage>153</fpage>
          -
          <lpage>173</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Coronato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Pietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paragliola</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for intelligent healthcare applications: A survey</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          <volume>109</volume>
          (
          <year>2020</year>
          )
          <fpage>101964</fpage>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>A review on reinforcement learning: Introduction and applications in industrial process control</article-title>
          ,
          <source>Computers &amp; Chemical Engineering</source>
          <volume>139</volume>
          (
          <year>2020</year>
          )
          <fpage>106886</fpage>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Niyato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Applications of deep reinforcement learning in communications and networking: A survey</article-title>
          ,
          <source>IEEE Communications Surveys &amp; Tutorials</source>
          <volume>21</volume>
          (
          <year>2019</year>
          )
          <fpage>3133</fpage>
          -
          <lpage>3174</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Haydari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yılmaz</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for intelligent transportation systems: A survey</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>23</volume>
          (
          <year>2020</year>
          )
          <fpage>11</fpage>
          -
          <lpage>32</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <article-title>Survey on the application of deep reinforcement learning in image processing</article-title>
          ,
          <source>Journal on Artificial Intelligence</source>
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <fpage>39</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Playing atari with deep reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:1312.5602</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>HAQ: Hardware-aware automated quantization with mixed precision</article-title>
          ,
          <source>in: Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR'19)</source>
          , Long Beach, CA, USA,
          <year>2019</year>
          , pp.
          <fpage>8612</fpage>
          -
          <lpage>8620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          ,
          <article-title>Compressing deep convolutional networks using vector quantization</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6115</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization</article-title>
          ,
          <source>in: Proc. of the European Conference on Computer Vision (ECCV'22)</source>
          , Tel Aviv, Israel,
          <year>2022</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>207</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Fq-vit: Post-training quantization for fully quantized vision transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2111.13824</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Towards Accurate Post-Training Quantization for Vision Transformer</article-title>
          ,
          <source>in: Proc. of the International Conference on Multimedia (MM'22)</source>
          , Lisbon, Portugal,
          <year>2022</year>
          , pp.
          <fpage>5380</fpage>
          -
          <lpage>5388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <article-title>Q-ViT: Fully differentiable quantization for vision transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2201.07703</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Post-training quantization for vision transformer</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>28092</fpage>
          -
          <lpage>28103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Sun,
          <article-title>Channel pruning for accelerating very deep neural networks</article-title>
          ,
          <source>in: Proc. of the International Conference on Computer Vision (ICCV'17)</source>
          , Venice, Italy,
          <year>2017</year>
          , pp.
          <fpage>1389</fpage>
          -
          <lpage>1397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Runtime network routing for efficient image classification</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>41</volume>
          (
          <year>2018</year>
          )
          <fpage>2291</fpage>
          -
          <lpage>2304</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Vision transformer pruning</article-title>
          ,
          <source>arXiv preprint arXiv:2104.08500</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <article-title>Width &amp; depth pruning for vision transformers</article-title>
          ,
          <source>in: Proc. of the International Conference on Artificial Intelligence (AAAI'22)</source>
          , volume
          <volume>36</volume>
          ,
          Virtual Only
          ,
          <year>2022</year>
          , pp.
          <fpage>3143</fpage>
          -
          <lpage>3151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>On compressing deep models by low rank and sparse decomposition</article-title>
          ,
          <source>in: Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR'17)</source>
          , Honolulu, HI, USA,
          <year>2017</year>
          , pp.
          <fpage>7370</fpage>
          -
          <lpage>7379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaderberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Speeding up convolutional neural networks with low rank expansions</article-title>
          ,
          <source>arXiv preprint arXiv:1405.3866</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network</article-title>
          ,
          <source>arXiv preprint arXiv:1503.02531</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <article-title>MetaDistiller: Network self-boosting via meta-learned top-down distillation</article-title>
          ,
          <source>in: Proc. of the European Conference on Computer Vision (ECCV'20)</source>
          , Glasgow, Scotland, UK,
          <year>2020</year>
          , pp.
          <fpage>694</fpage>
          -
          <lpage>709</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>5776</fpage>
          -
          <lpage>5788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roitberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stiefelhagen</surname>
          </string-name>
          ,
          <article-title>TransKD: Transformer knowledge distillation for efficient semantic segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2202.13393</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>DearKD: Data-efficient early knowledge distillation for vision transformers</article-title>
          ,
          <source>in: Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR'22)</source>
          , New Orleans, LA, USA,
          <year>2022</year>
          , pp.
          <fpage>12052</fpage>
          -
          <lpage>12062</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Training data-efficient image transformers &amp; distillation through attention</article-title>
          ,
          <source>in: Proc. of the International Conference on Machine Learning (ICML'21)</source>
          , Virtual Only,
          <year>2021</year>
          , pp.
          <fpage>10347</fpage>
          -
          <lpage>10357</lpage>
          . PMLR.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. E.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Parameter-efficient model adaptation for vision transformers</article-title>
          ,
          <source>in: Proc. of the International Conference on Artificial Intelligence (AAAI'23)</source>
          , volume
          <volume>37</volume>
          , Washington, DC, USA,
          <year>2023</year>
          , pp.
          <fpage>817</fpage>
          -
          <lpage>825</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <article-title>DynamicViT: Efficient vision transformers with dynamic token sparsification</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>13937</fpage>
          -
          <lpage>13949</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Patch slimming for efficient vision transformers</article-title>
          ,
          <source>in: Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR'22)</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>