<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Speeding up Vision Transformers Through Reinforcement Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Francesco</forename><surname>Cauteruccio</surname></persName>
							<email>fcauteruccio@unisa.it</email>
							<affiliation key="aff0">
								<orgName type="department">DIEM</orgName>
								<orgName type="institution">University of Salerno</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Michele</forename><surname>Marchetti</surname></persName>
							<email>m.marchetti@pm.univpm.it</email>
							<affiliation key="aff1">
								<orgName type="department">DII</orgName>
								<orgName type="institution">Polytechnic University of Marche</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Davide</forename><surname>Traini</surname></persName>
							<email>davide.traini@unimore.it</email>
							<affiliation key="aff1">
								<orgName type="department">DII</orgName>
								<orgName type="institution">Polytechnic University of Marche</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">CHIMOMO</orgName>
								<orgName type="institution">University of Modena and Reggio Emilia</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Domenico</forename><surname>Ursino</surname></persName>
							<email>d.ursino@univpm.it</email>
							<affiliation key="aff1">
								<orgName type="department">DII</orgName>
								<orgName type="institution">Polytechnic University of Marche</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luca</forename><surname>Virgili</surname></persName>
							<email>luca.virgili@univpm.it</email>
							<affiliation key="aff1">
								<orgName type="department">DII</orgName>
								<orgName type="institution">Polytechnic University of Marche</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Speeding up Vision Transformers Through Reinforcement Learning</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">535BA9A31139F1B47BA48BA10E028304</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:07+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Vision Transformers</term>
					<term>Training Time Reduction</term>
					<term>Reinforcement Learning</term>
					<term>Computer Vision</term>
					<term>CIFAR10</term>
					<term>0000-0001-8400-1083 (F. Cauteruccio)</term>
					<term>0000-0003-3692-3600 (M. Marchetti)</term>
					<term>0009-0007-3098-9349 (D. Traini)</term>
					<term>0000-0003-1360-8499 (D. Ursino)</term>
					<term>0000-0003-1509-783X (L. Virgili)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In recent years, Transformers have led a revolution in Natural Language Processing, and Vision Transformers (ViTs) promise to do the same in Computer Vision. The main obstacle to the widespread use of ViTs is their computational cost. Indeed, given an image divided into a list of patches, ViTs compute, for each layer, the attention of each patch with respect to all others. In the literature, many solutions try to reduce the computational cost of attention layers using quantization, knowledge distillation, and input perturbation. In this paper, we aim to make a contribution in this setting. In particular, we propose AgentViT, a framework that uses Reinforcement Learning to train an agent whose task is to identify the least important patches during the training of a ViT. Once such patches are identified, AgentViT removes them, thus reducing the number of patches processed by the ViT. Our goal is to reduce the training time of the ViT while maintaining competitive performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years, thanks also to the massive development of deep learning systems, Artificial Intelligence is experiencing a golden age in many sectors, including Natural Language Processing (NLP) and Computer Vision (CV) <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>. Transformers are one of the key players in this development <ref type="bibr" target="#b2">[3]</ref>. Initially designed in the context of NLP, they were adapted for Computer Vision tasks through the introduction of Vision Transformers (ViTs) <ref type="bibr" target="#b3">[4]</ref>. The working principle of ViTs is similar to that of Transformers, but instead of dividing a sentence into words, they split an image into non-overlapping rectangular patches and look for semantic correlations between them. ViTs have proven to be very competitive, and in some contexts, their performance has been superior to that of Convolutional Neural Networks (CNNs) <ref type="bibr" target="#b3">[4]</ref>. The main problem with ViTs is their computational cost, since for each layer it is necessary to compute the attention of each token with respect to all others. To overcome this problem, several variants of ViTs have been proposed to reduce the cost of the attention layers <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>.</p><p>Another area of Artificial Intelligence that has shown great promise in recent years is Reinforcement Learning (RL). 
In fact, it is being applied in a wide range of contexts, from robotics to intelligent transportation systems <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>.</p><p>In this paper, we propose AgentViT, a framework for ViT optimization. To achieve its goal, AgentViT uses RL to reduce the computational complexity of the attention layer and thus the training time of ViTs. In AgentViT, an RL agent selects a subset of the image patches so that the ViT has to process only them during training, thus reducing the training time while maintaining competitive performance. The RL agent is a Deep Q-Learning Network <ref type="bibr" target="#b15">[16]</ref> that returns a list of selected patches. The agent is composed of three dense layers. For each training batch, it observes the attention values produced by the first ViT layer and returns a subset of the original patches to use for the training of the ViT. After a certain number of training epochs, the agent receives a reward that takes into account training loss and training time. The user can decide how much weight to give to each of these two parameters, thus favoring a set of patches that guarantees a low training time or one that guarantees a low training loss.</p><p>Several approaches have been proposed in the literature to reduce the computational load of attention layers. 
They are based on different techniques such as quantization <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23]</ref>, pruning <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b26">27]</ref>, low-rank factorization <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref> and knowledge distillation <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b30">31,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b32">33,</ref><ref type="bibr" target="#b33">34,</ref><ref type="bibr" target="#b34">35]</ref>. Other approaches perturb the input of a ViT to optimize the resources it uses <ref type="bibr" target="#b35">[36,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b36">37]</ref>. Others compute the importance of each token and remove less important tokens as inference proceeds <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b37">38,</ref><ref type="bibr" target="#b38">39,</ref><ref type="bibr" target="#b39">40]</ref>. AgentViT shares with some of the above approaches the policy of setting a variable number of tokens based on the input images. This allows it to fit the images in the best possible way. However, it has a completely different fitting mechanism than the other approaches. In fact, the latter require the user to specify the maximum number of tokens to be used. If, after resampling, the number of tokens is greater than the number specified by the user, the excess tokens are removed. 
AgentViT also allows the user to specify the maximum number of tokens desired, and it trains its RL agent to select a number of tokens as close as possible to the number specified by the user. However, if it obtains a particularly low training loss during training, its reward mechanism will prompt it to select a smaller number of tokens for that particular batch of images. Conversely, if it obtains a high training loss for a particular batch, its reward mechanism will prompt it to increase the number of tokens to be used, giving less weight to the number specified by the user.</p><p>This paper is organized as follows: in Section 2, we describe AgentViT. In Section 3, we illustrate our experimental campaign aimed at determining the values of its hyperparameters, comparing it with related approaches, and deriving interesting insights. Finally, in Section 4, we draw our conclusions and look at some possible future developments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Description of AgentViT</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Schematic workflow of AgentViT</head><p>AgentViT uses a Markov Decision Process (MDP) ℳ = ⟨𝒮, 𝒜, ℛ, 𝒫, 𝛾⟩ <ref type="bibr" target="#b40">[41]</ref>. Here, 𝒮 is a state space, 𝒜 is a discrete set of actions, ℛ : 𝒮 × 𝒜 → R is a reward function, 𝒫 : 𝒮 × 𝒜 → 𝒮 is a transition kernel, and 𝛾 ∈ [0, 1) is a discount factor. In an MDP, a stationary policy 𝜋 : 𝒮 → 𝒜 is a mapping from states to actions; it specifies the action an agent takes when it is in a given state. It is used to describe how an agent interacts with the environment.</p><p>AgentViT uses an Action Value Function 𝑄(𝑠, 𝑎), introduced in Q-Learning <ref type="bibr" target="#b41">[42]</ref>, to estimate the expected cumulative reward an RL agent can obtain from a given state-action pair (𝑠, 𝑎). Q-Learning uses a table with a row for each observable state and a column for each possible action. As the algorithm runs, the values in the table are updated using the formula expressed in Equation <ref type="formula">2.1</ref>. This allows us to recursively obtain the cumulative reward 𝑄(𝑠, 𝑎) associated with each state-action pair (𝑠, 𝑎).</p><formula xml:id="formula_0">𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝜂 • [ ℛ 𝑠,𝑎 + 𝛾 • max 𝑎′∈𝒜 𝑄(𝑠*, 𝑎′) − 𝑄(𝑠, 𝑎) ] (2.1)</formula><p>Here:</p><p>• 𝜂 is the learning rate; it is a real number in [0, 1] and specifies the rate at which the agent learns; • 𝛾 is the discount factor; it represents the importance of the immediacy of the reward. If 𝛾 is closer to 0, actions with immediate rewards are favored; if 𝛾 is closer to 1, all rewards are given equal weight, regardless of their immediacy, which favors a long-term view; • 𝑠* is the next state, i.e., the state in which the agent arrives when it starts from 𝑠 and executes the action 𝑎.</p><p>The Q-Learning algorithm struggles in the presence of a large number of states or when the states involved are continuous. 
In these cases, the table is replaced by a neural network called Deep Q-Network. It receives the vector representing the state 𝑠 and computes the Q-values corresponding to each pair (𝑠, 𝑎 * ), where 𝑎 * represents any action the agent can take in 𝑠. The agent chooses the action with the highest Q-value. </p></div>
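The Deep Q-Network just described (three dense layers mapping the state vector of per-patch attention scores to one Q-value per patch) can be sketched as follows. This is a minimal NumPy forward pass only: the hidden size, ReLU activations, and random initialization are illustrative assumptions, and training machinery (backpropagation, target network) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # One fully connected layer: x @ w + b
    return x @ w + b

class QNetwork:
    """Three-dense-layer Q-network sketch (forward pass only).

    Maps a state of n per-patch attention scores to n Q-values.
    Hidden size 128 is an assumption, not a value from the paper.
    """
    def __init__(self, n_patches, hidden=128):
        self.w1 = rng.normal(0.0, 0.1, (n_patches, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, hidden))
        self.b2 = np.zeros(hidden)
        self.w3 = rng.normal(0.0, 0.1, (hidden, n_patches))
        self.b3 = np.zeros(n_patches)

    def forward(self, state):
        h = np.maximum(0.0, dense(state, self.w1, self.b1))  # ReLU
        h = np.maximum(0.0, dense(h, self.w2, self.b2))      # ReLU
        return dense(h, self.w3, self.b3)                    # Q-values

net = QNetwork(n_patches=64)
q_values = net.forward(rng.normal(size=64))
best_patch = int(np.argmax(q_values))  # greedy choice: highest Q-value
```

A greedy agent would act on `np.argmax` as above; Section 2.3 describes the selection rule AgentViT actually applies to these Q-values.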
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">State</head><p>In the context of the MDP, a state 𝑠 ∈ 𝒮 observed by the agent is represented by a vector of real numbers and models the current conditions of the environment. In AgentViT, the state of an environment is represented by the attention scores obtained from a batch of images processed by the first attention layer of the ViT.</p><p>More specifically, given an image composed of 𝑛 patches, the output of the attention layer is represented by an 𝑛 × 𝑑 matrix, where 𝑑 is the embedding size of the image. This matrix is given as input to a ViT module downstream of the attention layer, which transforms the matrix into a vector of 𝑛 elements by averaging, for each of the 𝑛 patches, its 𝑑 embedding values. This vector represents the average values of attention (and thus importance) that the attention layer assigns to the different patches. It is also the state given as input to AgentViT's Deep Q-Network.</p></div>
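The state construction above amounts to a per-patch mean over the embedding dimension of the first attention layer's n × d output. A minimal sketch follows; the sizes used (64 patches, embedding size 192) are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def attention_state(attn_output):
    """Collapse the n x d attention output of the first ViT layer into the
    n-element state vector fed to the Deep Q-Network, by averaging the
    d embedding values of each patch (Section 2.2)."""
    return attn_output.mean(axis=1)

# Illustrative sizes: 64 patches, embedding size 192.
attn = np.arange(64 * 192, dtype=float).reshape(64, 192)
state = attention_state(attn)
```

Each entry of `state` is the average attention assigned to one patch, so higher values mark patches the ViT currently considers more important.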
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Action</head><p>Given a vector of 𝑛 elements representing a state 𝑠 ∈ 𝒮, the agent returns a list of 𝑛 elements. Each element is associated with a patch and is a real value that, according to the Deep Q-Learning algorithm, represents the cumulative reward associated with the corresponding patch as estimated by the agent <ref type="bibr" target="#b41">[42]</ref>. The list is sorted in descending order so that the first elements represent the most important patches. AgentViT selects all patches in the list whose associated value is greater than the average value of the elements in the list. Therefore, the action 𝑎 ∈ 𝒜 associated with the state 𝑠 ∈ 𝒮 corresponds to the selection of the most promising patches.</p><p>The transition kernel function 𝒫 (see Section 2.1) is the one provided by Deep Q-Learning. Based on it, after the agent chooses an action 𝑎 ∈ 𝒜 and receives a reward ℛ 𝑠,𝑎 , the update of the Q-value associated with the pair (𝑠, 𝑎) is done through the following formula <ref type="bibr" target="#b41">[42]</ref>:</p><formula xml:id="formula_1">𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝜂 • [ ℒ 𝐻 (ℛ 𝑠,𝑎 + 𝛾 • max 𝑎′∈𝒜 𝑄(𝑠*, 𝑎′), 𝑄(𝑠, 𝑎)) ] (2.2)</formula><p>Similarly to Equation 2.1, this formula describes the update of 𝑄(𝑠, 𝑎) by taking into account the previous value and the distance between the maximum cumulative reward associated with the next state 𝑠* and the Q-value associated with the current state. In AgentViT, we adopted a Huber function ℒ 𝐻 <ref type="bibr" target="#b42">[43]</ref> to compute this distance (unlike <ref type="bibr" target="#b41">[42]</ref>, which used an algebraic difference). The reasoning behind this choice is that ℒ 𝐻 is less sensitive to outliers and, in some cases, prevents the gradient explosion problem. 
The cumulative reward for the next state is predicted by the Target Network, which consists of an exact copy of the agent network except that its weights are not updated by backpropagation, but are periodically copied from the agent network by a soft-copy mechanism. As shown in <ref type="bibr" target="#b43">[44]</ref>, this way of proceeding allows us to stabilize the learning process.</p><p>AgentViT also has a mechanism to avoid falling into a local minimum. Indeed, the agent chooses a random action with a probability equal to 𝜖 instead of the action that maximizes the value of 𝑄. The value of 𝜖 decays exponentially as training progresses to avoid instability. In this way, AgentViT is able to ensure good exploratory analysis in the early stages of ViT training and good stability of results as training progresses.</p><p>Finally, AgentViT uses a replay memory <ref type="bibr" target="#b44">[45,</ref><ref type="bibr" target="#b45">46]</ref> to improve the stability and generalizability of the agent. It can store observed data for later use during training in a way that breaks unwanted temporal correlations.</p></div>
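The patch-selection action described at the start of Section 2.3 (keep every patch whose Q-value exceeds the mean of all Q-values) can be sketched in a few lines; the sample Q-values are illustrative only.

```python
import numpy as np

def select_patches(q_values):
    """Patch-selection action (Section 2.3): return the indices of all
    patches whose estimated Q-value exceeds the mean of all Q-values."""
    q_values = np.asarray(q_values, dtype=float)
    return np.flatnonzero(q_values > q_values.mean())

# The mean of the four sample values below is 0.3875,
# so only patches 0 and 2 are kept.
idx = select_patches([0.9, 0.1, 0.5, 0.05])
```

During training this greedy rule would be combined with the epsilon-greedy exploration and replay memory described above.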
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Reward</head><p>In AgentViT, the reward ℛ 𝑡 obtained at iteration 𝑡 by starting from a state 𝑠 𝑡 and executing an action 𝑎 𝑡 plays a key role since it serves to define the quality of training. As mentioned above, this quality must take into account the time required for training and the accuracy. Consequently, ℛ 𝑡 must consider both the training loss and the training time. For this purpose, it is defined as a weighted mean of the training loss and the number of patches selected by the agent, which is proportional to the training time.</p><p>Based on this reasoning, ℛ 𝑡 can be formulated as:</p><formula xml:id="formula_2">ℛ 𝑡 = 𝛼 • ℛ 𝑙𝑜𝑠𝑠 𝑡 + (1 − 𝛼) • ℛ 𝑝𝑎𝑡𝑐ℎ 𝑡 (2.3)</formula><p>Here:</p><p>• ℛ 𝑙𝑜𝑠𝑠 𝑡 is the reward related to the training loss; it is equal to the ratio between the value ℒ(0) of the loss function of the ViT at the starting iteration and the value ℒ(𝑡) of the same function at iteration 𝑡.</p><p>• ℛ 𝑝𝑎𝑡𝑐ℎ 𝑡 is the reward related to the number of patches; it is defined as the ratio of the difference between the actual number of patches selected by the agent and the user's desired number of selected patches, to the user's desired number of selected patches.</p><p>• 𝛼 is a value belonging to the real interval [0, 1] that determines the weight to assign to ℛ 𝑙𝑜𝑠𝑠 𝑡 with respect to ℛ 𝑝𝑎𝑡𝑐ℎ 𝑡 .</p><p>In this way, the agent is incentivized to select a number of patches close to the number desired by the user (or a very small number if the user does not specify a value). However, it is also incentivized to select a subset of patches that can minimize training loss. These two goals are represented by ℛ 𝑙𝑜𝑠𝑠 𝑡 and ℛ 𝑝𝑎𝑡𝑐ℎ 𝑡 in Equation <ref type="formula">2.3</ref>. The weight 𝛼 allows the user to specify how much importance to place on each of these goals. If 𝛼 tends to 1, the agent has an incentive to choose a large number of patches. 
On the other hand, if 𝛼 tends to 0, it is incentivized to minimize the number of patches selected, subject to the accuracy constraints to be achieved.</p></div>
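Equation 2.3 and the two terms defined above translate directly into a short function. This is a sketch following the definitions in Section 2.4; the sample loss values, patch counts, and alpha below are illustrative assumptions only.

```python
def reward(loss_0, loss_t, n_selected, n_desired, alpha):
    """Reward of Equation 2.3: a weighted mean of a loss term and a patch term.

    r_loss  = L(0) / L(t)                       (Section 2.4, first bullet)
    r_patch = (n_selected - n_desired) / n_desired  (second bullet)
    alpha in [0, 1] weighs r_loss against r_patch.
    """
    r_loss = loss_0 / loss_t
    r_patch = (n_selected - n_desired) / n_desired
    return alpha * r_loss + (1 - alpha) * r_patch

# Illustrative call: initial loss 2.0, current loss 1.0,
# 40 patches selected against 32 desired, equal weighting.
r = reward(loss_0=2.0, loss_t=1.0, n_selected=40, n_desired=32, alpha=0.5)
```

With alpha = 1 the patch term vanishes and only the loss improvement drives the agent, matching the discussion of the two extremes of alpha above.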
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Testbed</head><p>AgentViT can be applied to any ViT, since it is based on observing the attention scores the ViT returns. Consequently, in our experiments, we could have applied AgentViT to any ViT proposed in the literature. To conduct the experiments in a reasonable time, we decided to employ SimpleViT <ref type="bibr" target="#b46">[47]</ref>, which splits images into 64 patches (SimpleViT64), since it can be trained faster than a classical ViT. We performed our experiments on the CIFAR10 dataset, which is a collection of 60,000 32x32 color images divided into 10 different classes, designed for training and testing machine learning models in computer vision tasks. CIFAR10 is widely used for benchmarking classification algorithms in the deep learning field. For the training and testing phases of our experiments, we used Google Colab, which provides an Intel Xeon CPU with 2 vCPUs, 13 GB of RAM, and an NVIDIA Tesla K80 GPU with 12 GB of VRAM. We refer the reader to the link https://github.com/DavideTraini/RL-for-ViT for the code used to implement AgentViT.</p><p>As a first step in our experiments, we had to define the values of the hyperparameters of AgentViT. Due to space limitations, we cannot report in detail the tasks we performed to determine these values. At the end of these tasks, we obtained the values reported in Table <ref type="table" target="#tab_0">1</ref>.</p><p>As a next step, we decided to compare the performance of AgentViT with that of related approaches already proposed in the literature. Specifically, the approaches we considered for comparison are the original ViT, SimpleViT64, and ATSViT <ref type="bibr" target="#b5">[6]</ref>; the latter, to the best of our knowledge, is the approach most similar to AgentViT. For each of these approaches, we computed their Cumulative Training Time (measured in seconds), Accuracy, Precision, Recall, and F1-Score. 
Table <ref type="table" target="#tab_1">2</ref> shows the corresponding values.</p><p>This table shows that there are approaches able to guarantee low values of Cumulative Training Time, but at the expense of Accuracy, Precision, Recall, and F1-Score. Conversely, other approaches can obtain high values of these measures, but at the expense of Cumulative Training Time. AgentViT is able to achieve a suboptimal value for all five metrics. In other words, it does not achieve the maximum value for any metric but is able to ensure the best compromise among all metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Discussion</head><p>As seen above, the goal of AgentViT is to use RL to select patches for optimal filtering. Comparison with other related approaches has shown us that its performance is satisfactory. In fact, it is able to train a ViT in less time than the same ViT trained without patch removal. This saving in training time does not come at the expense of accuracy, which remains comparable to that of the SimpleViT trained without patch removal. Moreover, AgentViT allows the user to specify the desired trade-off between accuracy and training time. Finally, AgentViT is the approach capable of providing the best trade-off between Cumulative Training Time on the one hand, and Accuracy, Precision, Recall, and F1-Score on the other hand.</p><p>In addition, AgentViT has other interesting implications. One of them is the possibility of using larger Vision Transformers. In fact, AgentViT's ability to reduce training time makes it possible to adopt architectures that would not normally have been adopted due to their excessive computational load. A second implication concerns the use of AgentViT to build smaller synthetic datasets from the original ones, which can be used to train deep neural networks. A further implication concerns the possibility of extending the use of AgentViT to contexts other than Vision Transformers. In fact, the idea behind AgentViT is general and independent of the type of transformers to which it is applied; therefore, it could work with any transformers, such as those used in the context of NLP. The only condition is that the RL agent within AgentViT can receive an attention matrix as input.</p><p>Finally, we highlight some limitations of AgentViT. The first concerns the fact that the decision to use Deep Q-Learning within AgentViT involves the need to set various hyperparameters, which makes the setup phase rather complex. 
A second limitation is related to the number of patches required for AgentViT to work properly. In fact, if the Vision Transformer underlying AgentViT works with only a few patches, the agent has difficulty selecting the most important ones.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>In this paper, we proposed AgentViT, a framework that uses RL to reduce the training time of a ViT without significantly reducing its performance. The RL agent present in AgentViT uses a classical MDP-based mechanism to represent an environment for the image classification task. As for this process, we redefined the state, action, and reward needed to train our RL agent. We tested AgentViT using SimpleViT64 as the internal ViT and Deep Q-Network as the internal RL agent. Our experiments showed that AgentViT can achieve the best trade-off between Cumulative Training Time on the one hand and Accuracy, Precision, Recall, and F1-Score on the other hand. The experiments conducted allowed us to draw several implications regarding the strengths and limitations of our framework.</p><p>We can think of several possible future developments of our approach. For example, we could improve the reward function to consider validation loss instead of training loss. We could also define a new metric, similar to the Akaike information criterion <ref type="bibr" target="#b47">[48]</ref>, which takes into account both model performance and the number of tokens. Moreover, we could test other Reinforcement Learning algorithms, such as Multi-Agent RL and Contextual Multi-Armed Bandit, instead of the Deep Q-Network, and see if they can further improve the performance of ViTs. These algorithms could assist in selecting the best actions and the corresponding patches to speed up ViT training. Finally, we could evaluate the impact of our approach on different ViT architectures, possibly including multiple attention layers, which would make our framework more robust and versatile.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Schematic workflow of AgentViT</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Values of the hyperparameters of AgentViT</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Values of Cumulative Training Time, Accuracy, Precision, Recall and F1-Score obtained by the approaches considered in our experiments</figDesc><table><row><cell>Approach</cell><cell>Cumulative Training Time (s)</cell><cell>Accuracy</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-Score</cell></row><row><cell>OriginalViT</cell><cell>5,955</cell><cell>0.8377</cell><cell>0.8136</cell><cell>0.8571</cell><cell>0.8348</cell></row><row><cell>SimpleViT64</cell><cell>4,870</cell><cell>0.7844</cell><cell>0.7917</cell><cell>0.7857</cell><cell>0.7886</cell></row><row><cell>ATSViT</cell><cell>5,813</cell><cell>0.7429</cell><cell>0.7324</cell><cell>0.7544</cell><cell>0.7432</cell></row><row><cell>AgentViT</cell><cell>3,730</cell><cell>0.8011</cell><cell>0.8013</cell><cell>0.8010</cell><cell>0.7997</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Recent advances in natural language processing via large pre-trained language models: A survey</title>
		<author>
			<persName><forename type="first">B</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sulem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Veyseh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Sainz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Agirre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Heintz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="page" from="1" to="40" />
			<date type="published" when="2023">2023</date>
			<publisher>ACM</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Deep learning in computer vision: A critical review of emerging techniques and application scenarios</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ngai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning with Applications</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page">100134</biblScope>
			<date type="published" when="2021">2021</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Attention is All you Need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
			<publisher>Curran Associates, Inc</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<title level="m">An image is worth 16x16 words: Transformers for image recognition at scale</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.10509</idno>
		<title level="m">Generating long sequences with sparse transformers</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Adaptive token sampling for efficient vision transformers</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fayyaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Koohpayegani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">R</forename><surname>Jafari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sengupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">R V</forename><surname>Joze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sommerlade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pirsiavash</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the European Conference on Computer Vision (ECCV&apos;22)</title>
				<meeting>of the European Conference on Computer Vision (ECCV&apos;22)<address><addrLine>Tel Aviv, Israel</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="396" to="414" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A-vit: Adaptive tokens for efficient vision transformer</title>
		<author>
			<persName><forename type="first">H</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vahdat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Alvarez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mallya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kautz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Molchanov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;22)</title>
				<meeting>of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;22)<address><addrLine>New Orleans, LA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="10809" to="10818" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Renggli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mustafa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Puigcerver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Riquelme</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.12015</idno>
		<title level="m">Learning to merge tokens in vision transformers</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xie</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.07800</idno>
		<title level="m">Not all patches are what you need: Expediting vision transformers via token reorganizations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Survey of model-based reinforcement learning: Applications on robotics</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Polydoros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Nalpantidis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Intelligent &amp; Robotic Systems</title>
		<imprint>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="153" to="173" />
			<date type="published" when="2017">2017</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Reinforcement learning for intelligent healthcare applications: A survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Coronato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Naeem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>De Pietro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Paragliola</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence in Medicine</title>
		<imprint>
			<biblScope unit="volume">109</biblScope>
			<biblScope unit="page">101964</biblScope>
			<date type="published" when="2020">2020</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A review on reinforcement learning: Introduction and applications in industrial process control</title>
		<author>
			<persName><forename type="first">R</forename><surname>Nian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers &amp; Chemical Engineering</title>
		<imprint>
			<biblScope unit="volume">139</biblScope>
			<biblScope unit="page">106886</biblScope>
			<date type="published" when="2020">2020</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Applications of deep reinforcement learning in communications and networking: A survey</title>
		<author>
			<persName><forename type="first">N</forename><surname>Luong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hoang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Niyato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">I</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Communications Surveys &amp; Tutorials</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="3133" to="3174" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>IEEE</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Deep reinforcement learning for intelligent transportation systems: A survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Haydari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yılmaz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Intelligent Transportation Systems</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="11" to="32" />
			<date type="published" when="2020">2020</date>
			<publisher>IEEE</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Survey on the application of deep reinforcement learning in image processing</title>
		<author>
			<persName><forename type="first">W</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal on Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="39" to="58" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Mnih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Antonoglou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wierstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedmiller</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1312.5602</idno>
		<title level="m">Playing atari with deep reinforcement learning</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">HAQ: Hardware-aware automated quantization with mixed precision</title>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;19)</title>
				<meeting>of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;19)<address><addrLine>Long Beach, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="8612" to="8620" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bourdev</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6115</idno>
		<title level="m">Compressing deep convolutional networks using vector quantization</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the European Conference on Computer Vision (ECCV&apos;22)</title>
				<meeting>of the European Conference on Computer Vision (ECCV&apos;22)<address><addrLine>Tel Aviv, Israel</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="191" to="207" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2111.13824</idno>
		<title level="m">Fq-vit: Post-training quantization for fully quantized vision transformer</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Towards Accurate Post-Training Quantization for Vision Transformer</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Multimedia (MM&apos;22)</title>
				<meeting>of the International Conference on Multimedia (MM&apos;22)<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="5380" to="5388" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cheng</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2201.07703</idno>
		<title level="m">Q-vit: Fully differentiable quantization for vision transformer</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Post-training quantization for vision transformer</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Gao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="28092" to="28103" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Channel pruning for accelerating very deep neural networks</title>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Computer Vision (ICCV&apos;17)</title>
				<meeting>of the International Conference on Computer Vision (ICCV&apos;17)<address><addrLine>Venice, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1389" to="1397" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Runtime network routing for efficient image classification</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="2291" to="2304" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>IEEE</note>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Han</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.08500</idno>
		<title level="m">Vision transformer pruning</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Width &amp; depth pruning for vision transformers</title>
		<author>
			<persName><forename type="first">F</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Artificial Intelligence (AAAI&apos;22)</title>
				<meeting>of the International Conference on Artificial Intelligence (AAAI&apos;22)<address><addrLine>Virtual Only</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="3143" to="3151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">On compressing deep models by low rank and sparse decomposition</title>
		<author>
			<persName><forename type="first">X</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;17)</title>
				<meeting>of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;17)<address><addrLine>Honolulu, HI, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="7370" to="7379" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Jaderberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vedaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1405.3866</idno>
		<title level="m">Speeding up convolutional neural networks with low rank expansions</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1503.02531</idno>
		<title level="m">Distilling the knowledge in a neural network</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Metadistiller: Network self-boosting via metalearned top-down distillation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hsieh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the European Conference on Computer Vision (ECCV&apos;20)</title>
				<meeting>of the European Conference on Computer Vision (ECCV&apos;20)<address><addrLine>Glasgow, Scotland, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="694" to="709" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="5776" to="5788" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roitberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stiefelhagen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.13393</idno>
		<title level="m">TransKD: Transformer knowledge distillation for efficient semantic segmentation</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Dearkd: data-efficient early knowledge distillation for vision transformers</title>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;22)</title>
				<meeting>of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;22)<address><addrLine>New Orleans, LA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="12052" to="12062" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Training data-efficient image transformers &amp; distillation through attention</title>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Douze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Massa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jégou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Machine Learning (ICML&apos;21), Virtual Only</title>
				<meeting>of the International Conference on Machine Learning (ICML&apos;21), Virtual Only</meeting>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="10347" to="10357" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Parameter-efficient model adaptation for vision transformers</title>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">E</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Artificial Intelligence (AAAI&apos;23)</title>
				<meeting>of the International Conference on Artificial Intelligence (AAAI&apos;23)<address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="817" to="825" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Dynamicvit: Efficient vision transformers with dynamic token sparsification</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hsieh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="13937" to="13949" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Patch slimming for efficient vision transformers</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;22)</title>
				<meeting>of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;22)<address><addrLine>New Orleans, LA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="12165" to="12174" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Dynamic spatial sparsification for efficient vision transformers and convolutional neural networks</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>IEEE</note>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">AdaViT: Adaptive vision transformers for efficient image recognition</title>
		<author>
			<persName><forename type="first">L</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">G</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">N</forename><surname>Lim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;22)</title>
				<meeting>of the International Conference on Computer Vision and Pattern Recognition (CVPR&apos;22)<address><addrLine>New Orleans, LA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="12309" to="12318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Le Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Oberman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bellemare</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.00543</idno>
		<title level="m">On the generalization of representations in reinforcement learning</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Deep Reinforcement Learning: A brief survey</title>
		<author>
			<persName><forename type="first">K</forename><surname>Arulkumaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Deisenroth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brundage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bharath</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Magazine</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="26" to="38" />
			<date type="published" when="2017">2017</date>
			<publisher>IEEE</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Robust estimation of a location parameter</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Huber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Breakthroughs in statistics: Methodology and distribution</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1992">1992</date>
			<biblScope unit="page" from="492" to="518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">A theoretical analysis of deep Q-learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the International Conference on Learning for Dynamics and Control (L4DC&apos;20)</title>
				<meeting>of the International Conference on Learning for Dynamics and Control (L4DC&apos;20)<address><addrLine>Berkeley, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="486" to="489" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">The effects of memory replay in reinforcement learning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Annual Allerton Conference on Communication, Control, and Computing (Allerton&apos;18)</title>
				<meeting>of the Annual Allerton Conference on Communication, Control, and Computing (Allerton&apos;18)<address><addrLine>Monticello, IL, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="478" to="485" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Self-improving reactive agents based on reinforcement learning, planning and teaching</title>
		<author>
			<persName><forename type="first">L</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="293" to="321" />
			<date type="published" when="1992">1992</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.01580</idno>
		<title level="m">Better plain ViT baselines for ImageNet-1k</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">A new look at the statistical model identification</title>
		<author>
			<persName><forename type="first">H</forename><surname>Akaike</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Automatic Control</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="716" to="723" />
			<date type="published" when="1974">1974</date>
			<publisher>IEEE</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
