<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Token Pruning within the Attention Mechanism for Efficient Vision Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shuto Kusaki</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryuto Ishibashi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graduate School of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <fpage>71</fpage>
      <lpage>82</lpage>
      <abstract>
<p>In recent years, Vision Transformers (ViTs) have garnered significant attention in the field of image recognition for their superior performance, outperforming conventional Convolutional Neural Networks (CNNs). However, a major challenge with ViTs is their substantial computational cost and memory usage, which makes them difficult to deploy in resource-constrained environments. In particular, their application to devices with limited computational power and memory, such as Internet of Things (IoT) devices, faces numerous technical barriers. Against this backdrop, the efficient utilization of computational resources is essential to make ViTs practical for real-world use. Therefore, this research focuses on pruning as a means to reduce computational complexity. In this study, we propose a novel pruning method that diverges from conventional token pruning. Our approach prunes the attention mechanism itself, a core component of ViTs, to reduce the computational overhead generated within it. While traditional token pruning only reduces the number of tokens, our proposed method streamlines the attention mechanism, enabling a more significant reduction in computational complexity. This allows for further computational savings that are unattainable with token pruning alone, leading to a substantial decrease in resource consumption while maintaining overall performance. Experiments conducted on the CIFAR-10 dataset show that by applying our proposed attention mechanism pruning, we achieved a 47% reduction in computational complexity with only a 0.86% decrease in accuracy. This result is highly beneficial for running ViTs in computationally restricted settings and indicates the potential for their practical application on IoT and edge devices. Thus, we believe that our novel pruning method significantly enhances the computational efficiency of ViTs, contributing to the expansion of their applicability in resource-constrained environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Vision Transformer</kwd>
        <kwd>Token Pruning</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Image Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Image recognition is a fundamental task in Computer Vision (CV), with widespread applications
in diverse fields such as autonomous driving systems, medical image diagnostics, and security
systems. Advances in this technology have dramatically improved the ability of machines to
understand and analyze images in a human-like manner, underpinning modern technological
innovation. Historically, Convolutional Neural Networks (CNNs) [
        <xref ref-type="bibr" rid="ref1">1, 2, 3</xref>
        ] have been the
predominant model for image recognition. CNNs exhibit excellent performance in local feature
extraction, establishing them as the standard model in the field [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">4, 5, 6</xref>
        ]. However, CNNs face
several limitations. In particular, they are constrained by their inability to capture the global
context of an entire image, failing to fully reflect long-range dependencies. Consequently, there
has been a demand for new approaches to enhance recognition capabilities by considering the
broader context of an image.
      </p>
      <p>
        In this context, the Vision Transformer (ViT) [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ] has emerged, adapting the Transformer
architecture [
        <xref ref-type="bibr" rid="ref6">8</xref>
        ] that has rapidly proliferated in the field of Natural Language Processing (NLP).
ViT has garnered significant attention by achieving performance superior to that of CNNs
in image recognition. Its architecture involves dividing an image into fixed-size patches and
feeding them into the model as a sequence of tokens. This approach enables the model to
effectively capture the global context of the entire image, overcoming the bias towards local
features inherent in CNNs. Indeed, ViTs have produced results surpassing traditional CNNs in
various image recognition tasks, and their continued development is highly anticipated.
      </p>
      <p>
        On the other hand, ViTs present several practical challenges [
        <xref ref-type="bibr" rid="ref7">9</xref>
        ]. The most significant of
these are their immense computational complexity and massive memory usage compared to
CNNs. The core Self-Attention mechanism in ViT has a computational complexity that scales
quadratically with the number of input tokens, N. This implies that as the input image resolution
increases or the number of patches grows, the computational cost explodes. This makes it
extremely difficult to deploy ViT in resource-constrained environments such as mobile and
edge devices. Therefore, to resolve these issues and make ViT practical for a wide range of
applications, methods to improve its computational efficiency are urgently needed.
      </p>
      <p>
        As one of the most promising approaches to address this computational bottleneck, token
pruning has gained significant attention in recent years [
        <xref ref-type="bibr" rid="ref8 ref9">10, 11</xref>
        ]. Token pruning is a technique
for creating more efficient models by reducing the computational cost and memory usage.
It achieves this by decreasing the number of tokens in the sequence processed by the ViT,
specifically by identifying and removing those deemed to have a low contribution to the final
prediction or to be redundant. In this method, after the input image is converted into tokens,
low-importance tokens are identified and removed, which directly mitigates the load on the
Self-Attention mechanism. This pruning process allows for the creation of lighter and faster
models by reducing the required computations while minimizing accuracy degradation.
      </p>
      <p>Although token pruning is a crucial technology for advancing the practical application of
ViTs, existing research still leaves room for improvement. Many methods have potential for
further optimization in the design of their importance criteria (scoring) and in the timing and
method of applying the pruning. Therefore, motivated by this gap, this paper proposes an
improved token pruning methodology. In this work, we aim to validate the effectiveness of our
newly proposed method and to provide new insights into enhancing ViT eficiency.</p>
      <p>Overall, our main contributions can be summarized below:
• Integrated Attention Pruning: We introduce a novel pruning mechanism that operates
within the self-attention computation, enabling more efficient processing by reducing
both token and attention computation simultaneously.
• Superior Computational Efficiency: Our method achieves significant computational
savings while maintaining competitive accuracy. Specifically, ATPViT reduces FLOPs
by up to 47% and memory usage by up to 36.4% compared to baseline models, with only
minimal accuracy degradation (0.86–1.12%).
• Enhanced Resource Utilization: Compared to conventional Top-K and EViT
methods, ATPViT achieves additional reductions of 0.9% in FLOPs and 3.6–4.3% in memory
usage without further accuracy loss, demonstrating improved efficiency over existing
approaches.
• Practical Benefits for Edge Deployment: The proposed method enables larger batch
sizes within the same GPU memory constraints and reduces overall energy consumption,
making it particularly suitable for resource-constrained mobile and edge devices.</p>
      <p>Extensive experiments on standard benchmarks demonstrate that ATPViT consistently
outperforms existing token pruning methods in terms of computational efficiency while
maintaining comparable or superior accuracy. The results suggest that integrating pruning directly
into the attention mechanism represents a promising direction for developing efficient Vision
Transformers suitable for practical deployment scenarios.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Vision Transformer</title>
        <p>
          The Vision Transformer (ViT)[
          <xref ref-type="bibr" rid="ref5">7</xref>
          ] revolutionized computer vision by successfully adapting the
Transformer architecture from natural language processing to image-related tasks. Unlike
traditional Convolutional Neural Networks (CNNs)[
          <xref ref-type="bibr" rid="ref1">1, 2, 3</xref>
          ], ViT reimagines an image as a
sequence of fixed-size patches. These patches are treated as tokens, similar to words in a
sentence, allowing the model’s self-attention mechanism to weigh the importance of every
patch in relation to all others. This enables the model to capture global context across the
entire image from its earliest layers, fundamentally challenging the long-held dominance of
convolutional approaches.
        </p>
        <p>
          Initially proving its strength in image classification, the ViT architecture was quickly extended
to more complex, dense prediction tasks like object detection [
          <xref ref-type="bibr" rid="ref10">12</xref>
          ] and semantic segmentation
[
          <xref ref-type="bibr" rid="ref11">13</xref>
          ]. However, the scalability of the original design was a limitation, prompting significant
architectural advancements. The most impactful of these have been hierarchical ViTs, such
as the Swin Transformer [
          <xref ref-type="bibr" rid="ref12">14</xref>
          ]. By computing self-attention within local, non-overlapping
windows that are shifted between layers, Swin Transformer efficiently builds a hierarchical
feature representation. This design established ViTs as a powerful and versatile backbone for a
wide array of vision applications.
        </p>
        <p>
          Despite their success, ViTs are notoriously demanding in terms of computational resources.
The self-attention mechanism’s complexity scales quadratically with the number of input
patches, making ViTs computationally expensive for high-resolution images and difficult to
deploy on resource-constrained devices. This efficiency challenge has become a major focus
for the research community, driving the development of various optimization strategies. These
include architectural redesigns, knowledge distillation [
          <xref ref-type="bibr" rid="ref13">15</xref>
          ], and, central to the work in this
paper, pruning methods designed to reduce computational load without a significant loss in
performance.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Model acceleration</title>
        <p>The fundamental idea behind Token Pruning is based on the observation that the token
sequences processed by ViTs contain numerous redundant tokens that contribute little to the final
prediction. For instance, in a typical image, background regions such as the sky, ground, or walls
often contain less information and have uniform textures. The numerous tokens corresponding
to these areas hold mutually similar information, and it is considered unnecessary to process all
of them in detail. This issue becomes increasingly critical as model size grows because larger
models divide an image into finer and more numerous tokens, which increases the number of
redundant tokens and leads to wasted computational resources.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Top-k Token Selection</title>
          <p>Top-k token selection is a straightforward pruning approach that leverages attention weights
to identify the most important tokens for retention. The method computes token importance
scores based on the attention weights from the class token to patch tokens:</p>
          <p>$s_i = \frac{1}{H}\sum_{h=1}^{H} A^{(h)}_{\mathrm{cls},i}$ (1)</p>
          <p>where $s_i$ represents the importance score of the $i$-th token, $H$ is the number of attention
heads, and $A^{(h)}_{\mathrm{cls},i}$ denotes the attention weight from the class token to the $i$-th patch token in
the $h$-th head. The method then selects the top-$K$ tokens with the highest importance scores:</p>
          <p>$\mathcal{S} = \mathrm{TopK}(\{s_i\}_{i=1}^{N}, K)$ (2)</p>
          <p>where $N$ is the total number of tokens and $K = N - P$, with $P$ being the number of tokens to
prune. This approach offers computational efficiency and requires minimal architectural
modifications. However, it suffers from complete information loss of discarded tokens, potentially
limiting performance when pruned tokens contain relevant information.</p>
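          <p>To make Eqs. (1)–(2) concrete, the following is a minimal PyTorch sketch of this selection step (an illustration under our assumptions, not an original implementation; tensor names such as attn and tokens are ours):</p>
          <preformat>
import torch

def topk_token_selection(tokens, attn, num_prune):
    """Keep the K = N - num_prune patch tokens with the highest
    class-token attention, averaged over heads (Eqs. 1-2).

    tokens: (B, 1 + N, D)  class token followed by N patch tokens
    attn:   (B, H, 1 + N, 1 + N) softmax attention weights
    """
    # Eq. (1): s_i = mean over heads of class-token attention to patch i
    scores = attn[:, :, 0, 1:].mean(dim=1)           # (B, N)

    # Eq. (2): indices of the K highest-scoring patch tokens,
    # re-sorted so the original token order is preserved
    keep = scores.size(1) - num_prune
    topk = scores.topk(keep, dim=1).indices.sort(dim=1).values

    # Gather the selected patch tokens and re-attach the class token
    idx = topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    kept_patches = tokens[:, 1:].gather(1, idx)      # (B, K, D)
    return torch.cat([tokens[:, :1], kept_patches], dim=1)
          </preformat>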
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. EViT (Efficient Vision Transformer)</title>
          <p>
            EViT [
            <xref ref-type="bibr" rid="ref14">16</xref>
            ] addresses the information loss problem by introducing a token merging mechanism
that preserves information from low-importance tokens. Instead of simply discarding tokens,
EViT aggregates information from pruned tokens into a single representative token:
          </p>
          <p>$t_{\mathrm{merged}} = \sum_{i \in \mathcal{P}} w_i t_i, \quad w_i = \frac{s_i}{\sum_{j \in \mathcal{P}} s_j}$ (3)</p>
          <p>where $\mathcal{P}$ represents the set of tokens to be pruned, $w_i$ are normalized importance
weights, and $t_i$ is the feature vector of the $i$-th token. The final output maintains a fixed
sequence length while preserving global information through the merged token. This weighted
aggregation approach demonstrates superior accuracy-efficiency trade-offs compared to simple
pruning methods, showing that thoughtful information preservation enhances token reduction
techniques in Vision Transformers.</p>
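          <p>A minimal sketch of the merging step in Eq. (3), with our own variable names (an illustration of the idea, not EViT's released code):</p>
          <preformat>
import torch

def evit_merge(tokens, scores, keep_idx, prune_idx):
    """Fuse the pruned tokens into one importance-weighted token (Eq. 3).

    tokens:    (B, N, D) patch tokens (class token handled separately)
    scores:    (B, N)    importance scores s_i
    keep_idx:  (B, K)    indices of retained tokens
    prune_idx: (B, P)    indices of tokens to be fused
    """
    D = tokens.size(-1)
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    pruned = tokens.gather(1, prune_idx.unsqueeze(-1).expand(-1, -1, D))

    # w_i = s_i / sum_j s_j over the pruned set, then t = sum_i w_i t_i
    s = scores.gather(1, prune_idx)                           # (B, P)
    w = s / s.sum(dim=1, keepdim=True).clamp_min(1e-6)        # (B, P)
    merged = (w.unsqueeze(-1) * pruned).sum(dim=1, keepdim=True)

    # Fixed-length output: K retained tokens plus one merged token
    return torch.cat([kept, merged], dim=1)                   # (B, K+1, D)
          </preformat>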
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this study, we propose the Attention-based Token Pruning in Vision Transformer (ATPViT),
which performs pruning within the attention mechanism itself to dynamically remove
low-importance tokens based on attention scores. This approach is designed to minimize information
loss while simultaneously reducing the computational overhead within the attention mechanism.</p>
      <sec id="sec-3-1">
        <title>3.1. ATPViT: TopK-Based</title>
        <p>In conventional ViT pruning, one of the most common approaches, top-k pruning, inserts a
pruning block between the attention block and the MLP block. Pruning is thus performed after
the attention computation is complete. However, with this approach, the computational cost
of the attention mechanism itself is not reduced. In contrast, rather than computing the full
attention output and then pruning tokens, ATPViT optimizes the attention computation itself.</p>
        <p>The detailed methodology of ATPViT is as follows. First, the computation within the
Multi-Head Self-Attention (MHSA) of a Transformer Encoder can be described by the following
equations.</p>
        <p>$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$ (4)</p>
        <p>Our pruning process is introduced immediately after the initial operation in this calculation: the
dot product of $Q$ and the transposed $K$ ($QK^{\top}$). At this stage, based on a criterion for identifying
which tokens are redundant (importance scoring), we retain the top-K rows corresponding to
the most important tokens and discard the rest. The indices of these top-K rows, which represent
the tokens to be kept after the attention calculation, are then saved. While the importance
scores in this study are determined by the methods we describe later, this step can be performed
using any arbitrary method. Next, using the saved indices, we select the top-K tokens from
the original input tokens to the MHSA block and discard the rest. A residual connection is
then formed between these pruned input tokens and the output of the MHSA, thereby reducing
the total number of tokens carried forward. By performing this operation at each layer of the
Transformer block, our method achieves a much greater reduction in computational complexity
compared to conventional approaches.</p>
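        <p>The following is a minimal single-head sketch of this procedure as we read it (module and variable names are ours, and the importance score shown, class-token attention logits, is only one of the arbitrary choices the text allows):</p>
        <preformat>
import torch
import torch.nn.functional as F

def atpvit_attention(x, q, k, v, keep):
    """Prune rows of QK^T before the softmax and the A @ V product (Eq. 4).

    x:       (B, N, D) input tokens to the MHSA block (residual path)
    q, k, v: (B, N, d) projected queries, keys, values (one head)
    keep:    number of token rows to retain
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d**0.5        # (B, N, N)

    # Placeholder importance scoring: attention logits from the class
    # token (row 0); clone so the pin below does not corrupt the logits.
    scores = logits[:, 0, :].clone()                 # (B, N)
    scores[:, 0] = float("inf")                      # always keep class token
    idx = scores.topk(keep, dim=1).indices.sort(dim=1).values

    # Retain only the top rows, so the discarded rows never reach the
    # softmax or the attention-value multiplication.
    row_idx = idx.unsqueeze(-1).expand(-1, -1, logits.size(-1))
    out = F.softmax(logits.gather(1, row_idx), dim=-1) @ v   # (B, keep, d)

    # Select the same tokens from the block input for the residual
    # connection (the head output would still be projected back to D).
    x_idx = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
    return out, x.gather(1, x_idx)
        </preformat>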
        <sec id="sec-3-1-1">
          <title>3.1.1. Computational Benefits</title>
          <p>This approach provides several computational advantages (a worked example follows this list):
• Reduced Attention Computation: By pruning attention matrix rows, we reduce the
computational complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}((N - K) \times N)$ for the attention-value
multiplication.
• Memory Efficiency: The output tensor size is reduced from $N$ to $N - K$ tokens, decreasing
memory requirements for subsequent layers.
• Cascading Speedup: Token reduction in early layers accelerates computation in all
subsequent transformer blocks.</p>
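          <p>For intuition, a rough calculation with illustrative numbers of our own (not figures from the paper): with $N = 197$ tokens and $K = 59$ rows pruned in a layer, the attention-value product shrinks by roughly 30%:</p>
          <preformat>
% illustrative arithmetic for the attention-value multiplication
\[ N = 197, \qquad K = 59 \]
\[ N^2 = 38{,}809 \]
\[ (N - K) \times N = 138 \times 197 = 27{,}186 \approx 0.70\, N^2 \]
          </preformat>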
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Integration with Transformer Blocks</title>
          <p>The pruning mechanism is seamlessly integrated into the transformer architecture. Each
transformer block processes fewer tokens as the network progresses, maintaining the quality
of representations while achieving significant computational savings. The method requires
minimal modifications to the original Vision Transformer architecture and can be applied to
any layer within the network.</p>
          <p>The overall algorithm maintains the structural integrity of the transformer while providing
adaptive token reduction based on learned attention patterns, making it particularly suitable
for scenarios requiring both accuracy and efficiency.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. ATPViT: EViT-Based</title>
        <p>While ATPViT-Pruning achieves computational efficiency through direct token elimination,
ATPViT-Merge addresses the inherent information loss limitation by introducing an intelligent
token merging mechanism. This variant extends our attention-based pruning framework with
EViT-inspired information preservation strategies, creating a hybrid approach that combines
computational efficiency with enhanced accuracy retention. Similar to ATPViT-Pruning, the
method first computes token importance scores using class token attention weights and identifies
tokens for removal. However, instead of simply discarding these tokens, ATPViT-Merge employs
a weighted aggregation mechanism that preserves their information content.</p>
        <p>The merging process operates directly within the attention computation. After identifying
the set of tokens to be removed ($\mathcal{P}$), the method extracts their corresponding attention rows from the
attention matrix and computes weighted combinations based on their importance scores:</p>
        <p>$a_{\mathrm{merged}} = \sum_{i \in \mathcal{P}} \frac{s_i}{\sum_{j \in \mathcal{P}} s_j}\, a_i$ (5)</p>
        <p>where $s_i$ represents the importance score of token $i$ and $a_i$ is its attention vector. This creates
a single representative attention row that encapsulates information from all removed tokens.</p>
        <p>The merged attention row is then concatenated with the attention rows of retained tokens,
resulting in a reduced but information-preserving attention matrix. This approach maintains a
fixed output sequence length while ensuring that no information is completely lost during the
pruning process.</p>
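        <p>A minimal sketch of this row-merging step (Eq. 5), again with our own names and shown for a single head:</p>
        <preformat>
import torch

def merge_attention_rows(attn_rows, scores, keep_idx, prune_idx):
    """Fuse the attention rows of pruned tokens into one row (Eq. 5).

    attn_rows: (B, N, N) attention matrix for one head
    scores:    (B, N)    importance scores s_i
    keep_idx:  (B, K)    rows to retain
    prune_idx: (B, P)    rows to fuse into one representative row
    """
    N = attn_rows.size(-1)
    kept = attn_rows.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, N))
    pruned = attn_rows.gather(1, prune_idx.unsqueeze(-1).expand(-1, -1, N))

    # Eq. (5): a_merged = sum_i (s_i / sum_j s_j) a_i over the pruned rows
    s = scores.gather(1, prune_idx)                           # (B, P)
    w = (s / s.sum(dim=1, keepdim=True).clamp_min(1e-6)).unsqueeze(-1)
    merged = (w * pruned).sum(dim=1, keepdim=True)            # (B, 1, N)

    # Reduced but information-preserving attention matrix: K + 1 rows
    return torch.cat([kept, merged], dim=1)
        </preformat>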
        <p>Compared to ATPViT-Pruning, ATPViT-Merge typically achieves better accuracy retention
at the cost of slightly increased computational overhead due to the merging operations. The
method provides a valuable trade-off option for applications where accuracy preservation is
prioritized over maximum computational reduction, making it particularly suitable for scenarios
requiring high-quality outputs while still benefiting from significant efficiency improvements.</p>
        <p>This dual approach demonstrates the flexibility of our attention-based pruning framework,
allowing practitioners to choose between aggressive pruning (ATPViT-Pruning) and
information-preserving reduction (ATPViT-Merge) based on their specific requirements.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>In this section, we evaluate the computational complexity and performance of various ViT
pruning methods in order to compare our proposed method with existing approaches.</p>
      <sec id="sec-4-1">
        <title>4.1. Implementation details</title>
        <p>For our Vision Transformer models, we use vit_small_patch16_224 and vit_base_patch16_224
from the timm library. These are pretrained models configured for an input image size of
224x224 pixels and a patch size of 16x16. The vit_small model consists of 22.05M parameters,
6 attention heads, and 12 Transformer layers, while the vit_base model consists of 86.57M
parameters, 12 attention heads, and 12 Transformer layers. The learning rate is set according to
the formula $\frac{\text{batch size}}{1024} \times 0.001$, where the batch size is adaptively adjusted for each model
based on the available GPU memory.</p>
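        <p>A minimal sketch of this setup (the optimizer and the batch size shown are our placeholders; only the model names, pretraining, and the learning-rate rule come from the text):</p>
        <preformat>
import timm
import torch

# Pretrained ViT backbones from timm (224x224 input, 16x16 patches)
model = timm.create_model("vit_small_patch16_224", pretrained=True,
                          num_classes=10)   # CIFAR-10 classification head

# Linear learning-rate scaling: lr = batch_size / 1024 * 0.001
batch_size = 256   # placeholder; chosen per model to fit GPU memory
lr = batch_size / 1024 * 0.001

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimizer assumed
        </preformat>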
        <p>All experiments were conducted on a platform equipped with an NVIDIA GeForce RTX 4090
GPU, an AMD Ryzen 9 7900X3D 12-Core Processor, and running Ubuntu 20.04.6 LTS as the
operating system. The deep learning framework used was PyTorch, with Python 3.9 and CUDA
12.1.</p>
        <p>
          Dataset For our experiments, we use two datasets: CIFAR-10 [
          <xref ref-type="bibr" rid="ref15">17</xref>
          ] and the Oxford-IIIT Pet
Dataset.
        </p>
        <p>The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images, each being
a 32x32 pixel RGB (3-channel) image. The images are categorized into 10 classes: airplane,
automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Due to its small image size, CIFAR-10
is well-suited for addressing fundamental image recognition tasks while keeping computational
costs manageable.</p>
        <p>The Oxford-IIIT Pet Dataset was created through a joint effort by the Visual Geometry Group
at the University of Oxford and the International Institute of Information Technology (IIIT),
Hyderabad. It is an image dataset featuring 37 distinct breeds: 25 dog breeds and 12 cat breeds.
The dataset comprises a total of 7,349 images, with approximately 200 images available for each
breed. The annotations for each image include a precise category label for the breed, species
information (dog or cat), and pixel-level segmentation masks that define the exact outline of
each pet. This segmentation data is crucial for tasks that require a clear distinction between
the subject and the background. Consequently, this dataset is used for Fine-Grained Visual
Categorization (FGVC), an advanced classification task that involves distinguishing between
highly similar sub-categories beyond simple labels like "dog" or "cat".</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Evaluation metrics</title>
          <p>In this study, to conduct a multifaceted evaluation of the performance of token pruning methods
in ViT models, we performed a comprehensive analysis using the following three key metrics
in addition to accuracy.</p>
          <p>FLOPs FLOPs represents the number of floating-point operations executed during one forward
inference pass, indicating the computational complexity of the model. In this study, we used the
FlopCountAnalysis from the fvcore library to analyze the number of operations in each layer in
detail, reporting results in Giga-FLOPs (GFLOPs) units.</p>
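          <p>A minimal sketch of this measurement with fvcore (the input shape is ours; FlopCountAnalysis counts one forward pass):</p>
          <preformat>
import timm
import torch
from fvcore.nn import FlopCountAnalysis

model = timm.create_model("vit_small_patch16_224", pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)          # one 224x224 RGB image

flops = FlopCountAnalysis(model, dummy)
print(f"GFLOPs: {flops.total() / 1e9:.2f}")  # flops.by_module() for per-layer detail
          </preformat>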
          <p>Throughput Throughput is defined as the number of images processed per unit time
(images/second), serving as a metric to evaluate the practical processing performance of models.
For measurement, we employed high-precision time measurement using CUDA event timers
with GPU synchronization processing. Specifically, after 20 warm-up executions, we measured
the inference execution time for 100 iterations and calculated throughput from the average
value.</p>
          <p>$\text{Throughput} = \frac{\text{Batch Size}}{\text{Average Inference Time}}$</p>
          <p>Memory Usage Memory Usage measures GPU memory consumption (MB) during inference
execution, evaluating memory efficiency. Using PyTorch’s CUDA memory statistics
functionality, we calculated the difference between peak memory usage before and after inference
execution.</p>
          <p>$\text{Memory Usage} = \text{Peak Memory} - \text{Initial Memory}$</p>
          <p>By employing these evaluation metrics, we achieved comprehensive performance evaluation of
token pruning methods with emphasis on practical applicability, while considering the trade-off
relationship with accuracy. A combined measurement sketch follows.</p>
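          <p>A combined sketch of the two measurement protocols above (the warm-up and iteration counts follow the text; everything else is our scaffolding):</p>
          <preformat>
import torch

@torch.no_grad()
def measure(model, batch, warmup=20, iters=100):
    """CUDA-event throughput timing and peak-memory delta."""
    model.eval().cuda()
    batch = batch.cuda()

    for _ in range(warmup):                  # 20 warm-up executions
        model(batch)

    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    initial = torch.cuda.memory_allocated()  # Initial Memory

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):                   # 100 timed iterations
        model(batch)
    end.record()
    torch.cuda.synchronize()

    avg_s = start.elapsed_time(end) / 1000.0 / iters   # ms to s, per batch
    throughput = batch.size(0) / avg_s                 # images / second
    memory_mb = (torch.cuda.max_memory_allocated() - initial) / 1024**2
    return throughput, memory_mb
          </preformat>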
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Main result</title>
        <p>The main results for ATPViT on the CIFAR-10 dataset are presented in Table 1 and Table 2,
while the results on the Oxford-IIIT Pet Dataset are shown in Table 3. We compare our proposed
method, ATPViT, with the baseline methods it is derived from, Top-K and EViT. The accuracy
measurements in Table 1 and Table 3 were conducted after 50 epochs of training. The
measurements in Table 2 were obtained by applying each respective pruning model to the pretrained
base models.</p>
        <p>For both ViT-Small and ViT-Base, the proposed ATPViT reduces computational cost while
incurring only a minor degradation in accuracy, achieving greater savings compared to
conventional methods. Specifically, on CIFAR-10 with ViT-Small, our ATPViT-Topk reduces FLOPs
by 47% and memory usage by 36.4% compared to the baseline, with only a 1.12% drop in
accuracy. This represents an additional reduction of 0.9% in FLOPs and 3.6% in memory usage
over the conventional Top-K method, without any further loss in accuracy. Furthermore, on
ViT-Small, our ATPViT-EViT reduces FLOPs by 43.7% and memory usage by 31.5% compared to
the baseline, with an accuracy drop of only 0.86%. This achieves an additional 0.9% reduction in
FLOPs and 4.3% in memory usage compared to the conventional EViT method, again without
further degrading accuracy. When comparing ATPViT-EViT with ATPViT-TopK, the former
exhibits improved accuracy at the cost of increased computation, indicating a trade-off between
accuracy and computational cost.</p>
        <p>[Tables 1–3: accuracy, FLOPs, throughput, and memory usage for Baseline (ViT-Small / ViT-Base), Top-K, EViT, ATPViT(topk), and ATPViT(evit)]</p>
        <p>Moreover, even when using the more challenging Oxford-IIIT Pet Dataset, ATPViT reduces
computational cost while maintaining high accuracy. This suggests that ATPViT can be used as
a general-purpose method across various datasets.</p>
        <p>On the other hand, the throughput of our proposed ATPViT was lower than that of the
conventional methods, despite its reduction in GFLOPs. We attribute this to the implementation
overhead introduced by our more intricate pruning process. Specifically, ATPViT performs
several sequential operations within the attention computation itself: 1) calculating importance
scores, 2) identifying the top-K indices, and 3) gathering the corresponding data from the
attention matrix and input tokens. Optimizing this overhead remains a key challenge for future
work.</p>
        <p>These results indicate that ATPViT holds significant advantages for applications on
resource-constrained mobile and edge devices. Its benefits also include improving training efficiency
by enabling larger batch sizes within the same GPU memory and reducing overall energy
consumption.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we have proposed a new pruning methodology for Vision Transformers. Through
a pruning strategy for the attention matrix that simultaneously reduces both tokens and the
attention computation itself, our proposed ATPViT achieves greater reductions in computational
cost and memory usage compared to conventional methods, without sacrificing accuracy.
Furthermore, this method can be easily adapted to any ViT architecture as it requires no
additional parameters or specialized training procedures. We therefore conclude that ATPViT
offers significant potential for the application of Vision Transformers in resource-constrained
environments, such as mobile and edge devices.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1a">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>25</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref1b">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref1">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Hardware-aware approach to deep neural network optimization</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>559</volume>
          (
          <year>2023</year>
          )
          <fpage>126808</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>A survey of deep learning for industrial visual anomaly detection</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>58</volume>
          (
          <year>2025</year>
          )
          <fpage>279</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>A multi-scale information fusion framework with interaction-aware global attention for industrial vision anomaly detection and localization</article-title>
          ,
          <source>Information Fusion</source>
          <volume>124</volume>
          (
          <year>2025</year>
          )
          <fpage>103356</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ishibashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Automatic pruning rate adjustment for dynamic token reduction in vision transformer</article-title>
          ,
          <source>Applied Intelligence</source>
          <volume>55</volume>
          (
          <year>2025</year>
          )
          <fpage>342</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Dataset purification-driven lightweight deep learning model construction for empty-dish recycling robot</article-title>
          ,
          <source>IEEE Transactions on Emerging Topics in Computational Intelligence</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          , L. Meng,
          <article-title>Yolo-sm: A lightweight single-class multi-deformation object detection network</article-title>
          ,
          <source>IEEE Transactions on Emerging Topics in Computational Intelligence</source>
          <volume>8</volume>
          (
          <year>2024</year>
          )
          <fpage>2467</fpage>
          -
          <lpage>2480</lpage>
          . doi:10.1109/TETCI.2024.3367821.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Carion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          , G. Synnaeve,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zagoruyko</surname>
          </string-name>
          ,
          <article-title>End-to-end object detection with transformers</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2005.12872. arXiv:2005.12872.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          , B. Cheng, H. Shen,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>End-to-end video instance segmentation with transformers</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8741</fpage>
          -
          <lpage>8750</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2103.14030. arXiv:2103.14030.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network</article-title>
          ,
          <year>2015</year>
          . URL: https://arxiv.org/abs/1503.02531. arXiv:1503.02531.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Not all patches are what you need: Expediting vision transformers via token reorganizations</article-title>
          ,
          <source>arXiv preprint arXiv:2202.07800</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <article-title>Learning multiple layers of features from tiny images</article-title>
          (
          <year>2009</year>
          ). URL: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>