<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lightweight YOLOX-Based Object Detection with Structural Pruning for Edge Device Deployment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikihisa Ishino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electronic and Computer Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <fpage>121</fpage>
      <lpage>132</lpage>
      <abstract>
        <p>Research on object detection using deep learning has achieved remarkable accuracy. However, real-time performance remains a significant challenge on edge devices with limited computational resources, posing a substantial hurdle for practical implementation. Despite significant advancements in the YOLO series, including improved accuracy, speed, and capabilities for detecting small objects, as well as the move towards anchor-free designs, the resulting models remain computationally demanding for resource-constrained edge devices, particularly those limited to a CPU. Moreover, while reducing model complexity, for instance by decreasing the number of layers, can shorten computation time, this approach often compromises accuracy, rendering the model impractical for real-world applications. To address the limitations of object detection under constrained computational resources, our paper proposes an efficient object detection model designed for deployment on edge devices. The proposed model is based on YOLOX and applies structural pruning to its backbone, which typically has a large number of parameters, to reduce the number of channels and thus the parameter count. The proposal is achieved by identifying and removing less important channels based on the scale factor (γ) of the batch normalization layers. Specifically, to improve inference speed under limited resources, we apply iterative structural pruning to the YOLOX backbone, reducing the number of parameters while maintaining inference accuracy. To demonstrate the effectiveness of the proposed model, we create a unique dataset consisting of plastic bottles, cans, and glass bottles for recycling purposes and conduct comprehensive experiments. The experimental results indicate that significant computational reduction is achieved by iterative structural pruning, demonstrating the effectiveness of our model. The work represents a significant step towards enabling high-performance object detection on edge devices with limited computational resources.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>YOLOX</kwd>
        <kwd>pruning</kwd>
        <kwd>Object Detection</kwd>
        <kwd>Edge Device</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The concept of object detection emerges with the advent of digital cameras, initially aiming
to automatically detect objects and adjust brightness and focus. The idea gains significant
societal demand because it can help solve challenges across diverse fields like
autonomous driving, medicine, surveillance, retail, and agriculture, leading to a surge in research.
Coupled with the rise of deep learning in 2012, object detection witnesses a remarkable leap in
accuracy through the application of image recognition models like R-CNN. However, detection
speed remains a bottleneck for applications requiring real-time performance. This challenge is
addressed in 2015 with the introduction of YOLO, which boasts exceptional processing speed.
YOLO’s groundbreaking performance profoundly impacts the field of object detection, raising
expectations for real-time applications and shaping the research landscape to this day.</p>
      <p>
        Throughout the evolution, the YOLO series has seen a variety of enhancements aimed at
improving accuracy, speed, and the number of detectable categories[
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. The evolution
includes better small-object detection and anchor-box-independent detection. Despite these strides,
the YOLO series remains too slow for real-world applications on edge devices with limited computational
power (such as those relying solely on a CPU), because real-world scenes change quickly.
Although reducing the number of layers can cut processing time in resource-limited
environments, accuracy decreases, making it hard to develop a practical solution.
      </p>
      <p>To address the limitations of object detection on restricted computational resources, our paper
proposes an improved object detection model for edge device implementation. The proposal
applies structural pruning to the backbone of YOLOX. The method reduces the number of
parameters by removing unnecessary channels. To identify which channels to remove, we use
the scale factor (γ) of the batch normalization (BN) layers.</p>
      <p>This paper addresses the social issue of environmental degradation caused by the waste from
beverage containers. We aim to develop an object detection model for edge devices that can
automatically detect, sort, and collect such waste.</p>
      <p>Overall, our main contributions can be summarized below:
• We demonstrate the successful application of iterative structural pruning to the modern,
anchor-free YOLOX architecture, providing a practical methodology for deployment on
resource-constrained edge devices.
• We introduce a novel custom dataset of recycling objects (plastic bottles, cans,
and glass bottles) specifically designed to address real-world object detection challenges,
uniquely incorporating overlapping objects, extreme deformations, and image blur.</p>
      <p>The remainder of the article is organized as follows. Section 2 reviews previous lightweight
object detection models and related YOLO models. The detailed proposal is introduced in Section
3. Section 4 presents the experiments and the datasets. Finally, Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Object detection aims to classify and predict specific objects belonging to predefined categories
within a dataset. With advances in deep learning, object detection is becoming increasingly
practical[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, a challenge has emerged: the difficulty of use under conditions with
limited computational resources or memory capacity. To solve the problem, various methods
have been proposed, successfully reducing the size of the model and improving the speed of
inference. Among these methods, YOLO models stand out for their excellent performance,
being relatively lightweight and offering a good balance with accuracy, leading to their use in
diverse applications.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Lightweight models for object detection</title>
        <p>
          Research into making object detection models lighter has really taken off in recent years.
Around 2017, lightweight object detection research is still in its early stages. During this period,
the main approach is to replace the backbone (feature extraction part) of object detectors with
lightweight CNN architectures originally proposed for image classification tasks[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          In 2017, Google’s MobileNet paper[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] proposes a groundbreaking network using depth-wise
and point-wise convolutional layers, reducing the computation by about nine times.
MobileNet-SSD[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] employs the MobileNet architecture as a backbone for the then-fast SSD
object detector. The model’s introduction pioneers real-time object detection on mobile devices.
The model allows many developers to integrate object detection features into smartphone apps.
        </p>
        <p>
          MobileNetV2[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], released in 2018, builds on MobileNet by introducing Inverted Residuals and Linear
Bottlenecks, leading to networks that balance higher accuracy and efficiency. Network models
like ShuffleNet[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which introduces the Channel Shuffle operation, are combined with detectors
like SSD and YOLO, resulting in numerous higher-performing models.
        </p>
        <p>
          In 2020, Google’s EfficientDet[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] proposes techniques like BiFPN and Compound Scaling,
marking a significant breakthrough in the history of lightweight models. The proposed scaling
method provides a systematic approach for selecting an optimal model based on a device’s
computational power, establishing a new base for lightweight models across various fields.
        </p>
        <p>
          In recent years, further efficiency has been achieved by fundamentally rethinking the entire
detector architecture, not just the backbone. The Generalized Focal Loss, an innovative loss
function used in NanoDet[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] released in 2021, integrates classification and
regression, making the detection part’s structure significantly simpler and more lightweight,
enabling high performance even with lightweight models.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The Evolution of the YOLO Series</title>
        <p>
          The YOLO (You Only Look Once) object detection model[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is developed with the purpose of
balancing real-time performance and accuracy. Before YOLO, object detection models require
multistage processing: first, finding candidate regions for objects, and then classifying each
region. Although accurate, this multistage design makes them slow and challenging to apply in real-time
scenarios. YOLO integrates the candidate region search into a single neural network,
simultaneously predicting object location and class. The change dramatically improves processing
speed, enabling use in fields where real-time capability is crucial, such as autonomous driving,
surveillance cameras, and robot vision.
        </p>
        <p>
          Since the initial release in 2015, YOLO has undergone numerous improvements. YOLOv2[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
released in 2016, improves the accuracy of location prediction by incorporating batch
normalization and anchor boxes. YOLOv3[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], released in 2018, adopts the concept of FPN (Feature
Pyramid Network), predicting objects on three different scales, significantly increasing the
detection accuracy of small objects.
        </p>
        <p>
          YOLOv4[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], released in 2020, introduces the "Bag of Specials" concept, optimally combining
various techniques considered effective in object detection research. The model greatly improves
accuracy while maintaining the practicality of being trainable and runnable on a single GPU.
YOLOv5[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] introduces flexible model scaling, making it easier to balance speed and accuracy
depending on the device and application, thus accelerating practical adoption. YOLOv6[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ],
released in 2022, improves the head section, increasing inference speed. YOLOv7[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], released
in the same year, achieves state-of-the-art performance in both speed and accuracy through
advances in network architecture and training methods.
        </p>
        <p>
          YOLOX[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], introduced in 2021 (between v5 and v6), differs from previous models in the
YOLO series by integrating a suite of contemporary high-performance techniques, signifying a
new design approach.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this study, we apply structural pruning to the YOLOX backbone, CSPDarknet-53, to reduce the number
of parameters. Experiments are conducted using an object detection dataset comprising plastic
bottles, cans, and glass bottles.</p>
      <sec id="sec-3-0">
        <title>3.1. YOLOX</title>
        <p>YOLOX, released in July 2021, is positioned as a technologically significant model that brings a
major transformation to the design philosophy of the YOLO series. While based on YOLOv3,
YOLOX incorporates the state-of-the-art techniques in object detection research at the time,
such as anchor-free detection and decoupled heads.</p>
      <p>The YOLOX network is composed of three stages: backbone, neck, and head. The model’s
scaling factor allows for adjusting the network’s width and depth, enabling the modification of
detection performance and inference time based on specific objectives and devices.</p>
      <p>The backbone network is responsible for efficiently extracting essential features from the
input image for object detection. The network uses an architecture called CSPDarknet, a hybrid
design combining CSPNet with the Darknet-53 framework introduced in YOLOv3. Darknet-53
is a 52-layer convolutional network with one fully connected layer. The network’s main feature
is a structure designed to prevent gradient vanishing even in deep networks, enabling efficient
training. CSPNet’s design aims to resolve computational bottlenecks within the network,
allowing for richer feature extraction with lower computational cost. The integrated networks
further improve the computational efficiency and gradient flow compared to the original
Darknet-53.</p>
      <p>The neck is responsible for creating an optimal set of feature maps for detection, combining
both precise location information and rich semantic information. The neck incorporates PAFPN,
which combines a Feature Pyramid Network (FPN) that transmits semantic information from
deeper to shallower backbone layers, and a Path Aggregation Network (PANet) that feeds
back localization information from shallower to deeper layers. The integration has improved
detection accuracy for objects of various sizes.</p>
      <p>The head employs a decoupled structure, a notable departure from conventional YOLOv5,
that establishes distinct pathways for class classification and localization. The architectural
separation allows for task-specific optimization, resulting in considerable gains in model accuracy
and training efficiency.</p>
      <p>The overall YOLOX architecture introduces an anchor-free method, eliminating the need for
the previous intermediate concept of anchor boxes. Consequently, SimOTA is introduced as the
label assignment algorithm. The method represents innovative changes from previous models.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Pruning</title>
        <p>
          Pruning is a technique that aims to make trained deep learning models lighter and faster by
removing less important parameters. The method is highly valued because the deployment
of standard deep learning models on edge devices with limited computational resources is
extremely difficult. Among pruning methods, structural pruning is particularly promising as
the method can lead to practical speed-ups by removing entire structural units like filters or
channels. A widely recognized and effective method for channel-level pruning is to use the
scaling factor (γ) in Batch Normalization (BN) layers as an importance criterion, as proposed
by Liu et al. in their work on Network Slimming [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The method is expected to improve the
processing speed because the dense matrix structure is maintained even after pruning.
        </p>
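        <p>To make the criterion concrete, the following PyTorch sketch collects the absolute BN scale factors (γ) as per-channel importance scores and selects the lowest-scoring channels for removal. It is a minimal illustration in the spirit of Network Slimming [20]; the helper names and the global 10% quantile threshold are our own assumptions, not the exact implementation used in this work.</p>
        <preformat>import torch
import torch.nn as nn

def bn_channel_scores(model):
    """Collect |gamma| of every BatchNorm2d layer as a per-channel importance score."""
    scores = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            scores[name] = module.weight.detach().abs()
    return scores

def channels_to_prune(scores, prune_ratio=0.10):
    """Select the channels whose gamma falls below the global prune_ratio quantile."""
    all_gammas = torch.cat(list(scores.values()))
    threshold = torch.quantile(all_gammas, prune_ratio)
    # indices of low-importance channels, per BN layer
    return {name: s.le(threshold).nonzero(as_tuple=True)[0] for name, s in scores.items()}</preformat>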
        <p>Consider a single convolutional layer in basic structural pruning:</p>
        <p>X ∈ R^(c × h × w) (1)</p>
        <p>W ∈ R^(n × c × k_h × k_w) (2)</p>
        <p>Y ∈ R^(n × h × w) (3)</p>
        <p>Here, X represents the input feature map, W represents the weight filter, and Y represents
the output feature map. h and w denote the height and width of the feature map, k_h and k_w
denote the kernel sizes, and c and n denote the numbers of input and output channels. For this
layer, m filters are removed through structural pruning. Pruning reduces the dimensionality of
W and, consequently, the computational cost of Y. The effect implies a cascading reduction not
only in the computational cost of the specific layer but also in the subsequent processing. The
overall reduction in computational load across the model directly leads to improved inference
performance. The pruned weight W′ and output Y′ are represented as follows:</p>
        <p>W′ ∈ R^((n − m) × c × k_h × k_w) (4)</p>
        <p>Y′ ∈ R^((n − m) × h × w) (5)</p>
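        <p>The shape change in Equations (4) and (5) corresponds to physically rebuilding the layer with fewer output filters. The sketch below, again an illustrative assumption rather than the code used in this study, keeps only the selected output channels of one Conv2d/BatchNorm2d pair; the following layer’s input channels must then be reduced to match.</p>
        <preformat>import torch
import torch.nn as nn

def prune_conv_bn(conv, bn, keep):
    """Rebuild a Conv2d/BatchNorm2d pair keeping only the output channels listed in `keep`."""
    new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()   # W' has shape (n - m, c, k_h, k_w)
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()
    new_bn = nn.BatchNorm2d(len(keep))
    new_bn.weight.data = bn.weight.data[keep].clone()
    new_bn.bias.data = bn.bias.data[keep].clone()
    new_bn.running_mean = bn.running_mean[keep].clone()
    new_bn.running_var = bn.running_var[keep].clone()
    return new_conv, new_bn</preformat>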
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Improved structure</title>
        <p>In this study, we aim to leverage powerful parameter reduction techniques and lightweight
design to reduce the number of parameters, improve FLOPs and FPS compared to the standard
YOLO series, and achieve better performance on edge devices with limited computational
resources. The model consists of three parts: backbone, neck, and head. Figure 1 shows
the backbone before pruning, and Figure 2 illustrates how channels are removed from each
convolutional layer through pruning. The backbone, responsible for feature extraction, is a
crucial component for the accuracy of an object detection model and is composed of numerous
convolutional layers. Thus, we aim to reduce the parameter count. In the CSP layer, features are
passed to the next output by separating into a path that goes through the traditional Darknet
block and a path that performs only 1x1 convolutions, and then reintegrating. By applying
structural pruning to the layers and reducing the number of channels, the model’s parameter
count can be significantly decreased, improving performance on edge devices.</p>
        <p>[Figure 1: The YOLOX backbone before pruning, built from Focus, ConvLayer, CSPLayer, and SPPBottleneck blocks. Figure 2: The internal structure of the ConvLayer (conv, Batch Norm, Activation) and the CSPLayer (BottleNeck ×N, 1x1 convolutions, concat), indicating the channels removed from each convolutional layer by pruning.]</p>
        <p>In the model’s neck, feature maps from deep layers and shallow feature maps are integrated
to create new fused feature maps. Feature maps from deeper layers are upsampled to match
the feature size of a shallower layer and then integrated. The process of extracting features
using a CSP layer is repeated twice, after which downsampling is performed by a convolutional
layer before being passed to the head. The head is divided into three branches, each handling
detection processing for objects of specific sizes.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>In this section, we create a unique dataset of beverage containers for recycling. With the dataset,
we conduct comprehensive experiments with models featuring backbones of varying channel
counts, demonstrating the effectiveness of our proposed model.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>For the experiment, we use a custom dataset specifically created for the task of beverage
container object detection. The dataset used in this study comprises three categories: plastic
bottles, glass bottles, and cans. To address the lack of data diversity, online image augmentation
is applied to each image.</p>
        <p>The beverage container dataset totals 720 images, with 583 images in the training set, 65
in the validation set, and 72 in the test set. We design the dataset to present three specific
challenges for object detection:
1. Overlapping Objects: We include images where multiple objects overlap, assuming that
scattered waste in real-world scenarios isn’t always isolated.
2. Extreme Deformation: Recognizing that empty beverage containers found on roadsides
aren’t always in pristine condition, we include flattened plastic bottles and cans in their
respective categories.
3. Image Blur: To account for operation in environments where edge devices might be in
motion, we include images with blurred objects.</p>
        <p>To demonstrate the unique characteristics of our dataset, Figure 4 shows sample images
featuring (a) overlapping objects, (b) extreme deformations, and (c) image blur. The dataset
is not publicly available at this time.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation details</title>
        <p>All experiments are conducted on a platform equipped with an NVIDIA GeForce RTX 4090
graphics card, utilizing PyTorch 2.4.0 with CUDA 12.4 as the deep learning framework. The method
proposed in this study is implemented based on a highly flexible, open-source framework.</p>
        <p>The experiment is primarily divided into two stages:</p>
        <p>In the first stage, the conventional YOLOX model is trained on the custom dataset. The
number of epochs and batch size are set to 1000 and 8, respectively. Training is performed using
the SGD optimizer, with an initial learning rate of 0.01 and a minimum learning rate set to 0.01
times the initial learning rate. Momentum is set to 0.937, and weight decay to 0.0005. A cosine
annealing learning rate decay schedule is adopted. Data augmentation is applied to the dataset.
Both Mosaic and Mixup data augmentations are applied with a 50% probability and set to occur
during the first 70% of the training epochs.</p>
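        <p>For reference, the first-stage hyperparameters above translate roughly into the following optimizer and scheduler setup; this is our own reconstruction with a placeholder model, not the authors’ training script.</p>
        <preformat>import torch
import torch.nn as nn

EPOCHS, BATCH_SIZE = 1000, 8
BASE_LR, MIN_LR_RATIO = 0.01, 0.01      # minimum lr = 0.01 x initial lr
MOMENTUM, WEIGHT_DECAY = 0.937, 5e-4
AUG_PROB, AUG_EPOCH_RATIO = 0.5, 0.7    # Mosaic/Mixup at 50% probability for the first 70% of epochs

model = nn.Conv2d(3, 16, 3)             # placeholder; in practice the YOLOX model being trained
optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR,
                            momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS, eta_min=BASE_LR * MIN_LR_RATIO)</preformat>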
        <p>
          In the second stage, structural pruning is performed on the previously trained model. The
importance of each channel in the backbone’s convolutional layers is evaluated using the
scale factor (γ) of the BN layer immediately following each convolutional layer, removing
approximately 10% of the backbone’s parameters. After pruning, the created model is fine-tuned
to recover accuracy. For fine-tuning, the number of epochs is set to 150, with other parameters
remaining the same as in the first stage. This cycle of pruning and fine-tuning, known as
iterative pruning, is repeated until the number of parameters in the backbone is less than 50%
of the original size. The iterative approach is a widely adopted practice, as the approach has
been shown to maintain model accuracy more effectively than a single, large pruning step [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
The 50% reduction threshold is determined through preliminary experiments aiming to find the
optimal balance between improving real-time performance on the target device and minimizing
accuracy degradation. We observe that pruning beyond 50% initiates a non-negligible drop
in accuracy, leading us to conclude that the threshold is the optimal trade-off point between
computational cost and precision.
        </p>
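        <p>The prune-and-finetune cycle can be summarized by the loop below; prune_step and finetune are hypothetical helpers standing in for the BN-γ channel pruning and the 150-epoch fine-tuning stages described above, not functions from the YOLOX codebase.</p>
        <preformat>def iterative_pruning(model, prune_step, finetune, count_backbone_params,
                      num_steps=7, step_ratio=0.10, finetune_epochs=150):
    """Alternate pruning (about 10% of backbone parameters) and fine-tuning.

    In this study, seven iterations bring the backbone below 50% of its
    original parameter count while accuracy is recovered after each step.
    """
    original = count_backbone_params(model)
    for _ in range(num_steps):
        model = prune_step(model, ratio=step_ratio)       # remove low-gamma channels
        model = finetune(model, epochs=finetune_epochs)   # recover accuracy
    return model, count_backbone_params(model) / original</preformat>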
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation metrics</title>
        <p>In this study, in addition to typical evaluation metrics commonly used in object detection tasks,
including Precision, Recall, mAP50, and mAP50-95, we also evaluate the model using metrics
such as pruning rate, parameter reduction rate, FLOPs reduction rate, and FPS to assess the
effectiveness of structural pruning. The definitions of these metrics are as follows:</p>
        <p>Precision = TP / (TP + FP) (6)</p>
        <p>Recall = TP / (TP + FN) (7)</p>
        <p>AP = Σ_{i=1} (r_{i+1} − r_i) · Precision_max(r_{i+1}) (8)</p>
        <p>where r_i denotes the i-th recall level and Precision_max(r_{i+1}) is the maximum precision observed at recall r_{i+1} or higher.</p>
        <p>TP represents the number of correctly predicted objects, FP represents false positives
(misdetections), and FN represents false negatives (undetected objects). Precision and Recall are
calculated using TP, FP and FN. AP is determined by the area under the precision-recall curve.
mAP50 is obtained by averaging the AP for each class in the dataset when IoU = 0.5. mAP50-95
is obtained by averaging the mAP at different IoU thresholds from 0.5 to 0.95. FLOPs represent
the computational load of the model, and FPS represents the number of images that can be
processed per second.</p>
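        <p>As a small worked example of Equations (6)-(8), the snippet below computes Precision and Recall from detection counts and AP as the interpolated area under a precision-recall curve; the numbers are illustrative only, not results from our experiments.</p>
        <preformat>def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """recalls must be sorted in increasing order; precisions are the paired values."""
    ap = 0.0
    for i in range(len(recalls) - 1):
        p_max = max(precisions[i + 1:])                   # interpolated precision
        ap += (recalls[i + 1] - recalls[i]) * p_max
    return ap

p, r = precision_recall(tp=8, fp=2, fn=4)                 # p = 0.80, r = 0.67</preformat>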
        <p>Furthermore, to estimate the real-time performance on resource-constrained hardware, we
define the estimated inference time and FPS based on the model’s computational load as follows:</p>
        <p>Estimated Inference Time (s) = Model GFLOPs / (CPU GFLOPS × k_efficiency) (10)</p>
        <p>Estimated FPS = 1 / Estimated Inference Time (s) (11)</p>
        <p>Here, k_efficiency represents the CPU efficiency factor.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Performance Comparison</title>
        <p>To demonstrate the effectiveness of the proposal using an open-source framework, we compare our
pruned YOLOX model with other structurally pruned models. Table 1 shows the result. The
table evaluates the object detection accuracy using the test set when structural pruning, at a
pruning rate of 10% on the backbone, is applied 7 times until the backbone’s parameters fall
below 50%. The table indicates that despite reducing up to approximately 43% of parameters
solely through backbone pruning, there is almost no change in the accuracy metrics compared to
the original model. The result suggests that the iterative structural pruning effectively reduces
parameters while maintaining robust feature representation.</p>
        <p>Table 2 presents a comparison of the YOLOX and the final pruned model in terms of parameter
count, channel count, computational load, and FPS. In the comparison, the computational load
is calculated with an input size of 3x640x640, and the FPS is computed using the same input
size in addition to a batch size of 1. The table clearly demonstrates that the proposed model
significantly reduces computational load compared to the original model.</p>
        <p>While the experiments in this study are conducted in a GPU environment, we perform a
theoretical estimation to evaluate the model’s performance on an edge device, which is a key
objective of our paper. We select the Raspberry Pi 4 as the target device because we plan to
deploy deep learning models in the future. Based on the official specifications, we assume
a theoretical computational performance of 6 GFLOPS for the CPU. Assuming a commonly
used CPU efficiency factor of 30% and using Equations (10) and (11), we estimate the inference
time and FPS. Table 3 shows the estimated values for the edge CPU. The result suggests a
potential performance improvement of approximately 71% on the target edge device, while also
demonstrating the difficulty of using high-end GPUs—which showed only a 2.5% speedup—to
properly evaluate performance on the edge device.</p>
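        <p>Plugging the GFLOPs figures from Table 2 into Equations (10) and (11) with the assumed 6 GFLOPS CPU and 30% efficiency factor gives the following rough estimates; this is back-of-the-envelope arithmetic, not a measurement on the device.</p>
        <preformat>def estimated_fps(model_gflops, cpu_gflops=6.0, k_efficiency=0.30):
    inference_time_s = model_gflops / (cpu_gflops * k_efficiency)   # Equation (10)
    return 1.0 / inference_time_s                                   # Equation (11)

print(estimated_fps(1.28))   # original YOLOX-nano: about 1.4 FPS
print(estimated_fps(0.74))   # pruned model: about 2.4 FPS, roughly a 70% improvement</preformat>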
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption>
            <p>Comparison of the original YOLOX-nano and the final pruned model.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Parameters (rate [%])</th><th>Channels (rate [%])</th><th>GFLOPs</th><th>FPS</th></tr>
            </thead>
            <tbody>
              <tr><td>Original YOLOX-nano</td><td>897,144 (100.0)</td><td>7,720 (100.0)</td><td>1.28</td><td>237</td></tr>
              <tr><td>YOLOX-nano-prune*7</td><td>507,938 (56.62)</td><td>5,845 (75.71)</td><td>0.74</td><td>243</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we create a unique dataset and propose a lightweight model with a reduced number
of parameters to achieve a practical inference speed within limited computational resources.
By performing iterative structural pruning on the backbone, we reduce computational load
and improve inference speed while maintaining the backbone’s feature extraction capabilities.
The experiment demonstrates that our model, built through structural pruning of the YOLOX
backbone, significantly reduces computational complexity, proving its effectiveness. In essence,
our improved network with iterative structural pruning could be a significant step towards
achieving robust object detection performance on edge devices.</p>
      <p>We acknowledge a limitation of our study is that the evaluation is conducted solely on a custom
dataset. However, this dataset includes unique, real-world challenges, such as overlapping
objects, extreme deformation, and image blur, that are often underrepresented in public datasets.
More importantly, the iterative structural pruning approach we applied to YOLOX is a
task-agnostic and universal strategy for model compression. Therefore, we believe the method is
broadly applicable for other object detection tasks requiring deployment on edge devices.</p>
      <p>Looking ahead, our future work proceeds in two directions. First, we plan to validate
the generality of our approach on public benchmark datasets. Second, we aim to apply this
lightweight model to even more resource-constrained hardware, such as CPU-only devices, to
realize our ultimate goal of developing a system that addresses environmental issues caused by
beverage container waste.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative Al</title>
      <sec id="sec-6-1">
        <title>The author(s) have not employed any Generative Al tools.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Dataset purification-driven lightweight deep learning model construction for empty-dish recycling robot</article-title>
          ,
          <source>IEEE Transactions on Emerging Topics in Computational Intelligence</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          , L. Meng,
          <article-title>Yolo-sm: A lightweight single-class multi-deformation object detection network</article-title>
          ,
          <source>IEEE Transactions on Emerging Topics in Computational Intelligence</source>
          <volume>8</volume>
          (
          <year>2024</year>
          )
          <fpage>2467</fpage>
          -
          <lpage>2480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          , L. Meng,
          <article-title>Yolo-msa: A multiscale stereoscopic attention network for empty-dish recycling robots</article-title>
          ,
          <source>IEEE Transactions on Instrumentation and Measurement</source>
          <volume>72</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>A survey of deep learning for industrial visual anomaly detection</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>58</volume>
          (
          <year>2025</year>
          )
          <fpage>279</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>A multi-scale information fusion framework with interaction-aware global attention for industrial vision anomaly detection and localization</article-title>
          ,
          <source>Information Fusion</source>
          <volume>124</volume>
          (
          <year>2025</year>
          )
          <fpage>103356</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weyand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Andreetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <article-title>Mobilenets: Efficient convolutional neural networks for mobile vision applications</article-title>
          ,
          <source>arXiv preprint arXiv:1704.04861</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <article-title>Ssd: Single shot multibox detector</article-title>
          ,
          <source>in: European conference on computer vision</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhmoginov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Mobilenetv2: Inverted residuals and linear bottlenecks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4510</fpage>
          -
          <lpage>4520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Shufflenet: An extremely efficient convolutional neural network for mobile devices</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>6848</fpage>
          -
          <lpage>6856</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Efficientdet: Scalable and efficient object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>10781</fpage>
          -
          <lpage>10790</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <article-title>NanoDet: Super-fast and light-weight anchor-free object detection model</article-title>
          , https://github.com/RangiLyu/nanodet,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You only look once: Unified, real-time object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>Yolo9000: better, faster, stronger</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>7263</fpage>
          -
          <lpage>7271</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>Yolov3: An incremental improvement</article-title>
          ,
          <source>arXiv preprint arXiv:1804.02767</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y. M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <article-title>Yolov4: Optimal speed and accuracy of object detection</article-title>
          ,
          <source>arXiv preprint arXiv:2004.10934</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jocher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaurasia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stoken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Borovec</surname>
          </string-name>
          , et al.,
          <article-title>Yolov5 by ultralytics</article-title>
          , https://github.com/ultralytics/yolov5,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          , et al.,
          <article-title>Yolov6: A single-stage object detection framework for industrial applications</article-title>
          ,
          <source>arXiv preprint arXiv:2209.02976</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y. M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <article-title>Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>7464</fpage>
          -
          <lpage>7475</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Yolox: Exceeding yolo series in 2021</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>34</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>26024</fpage>
          -
          <lpage>26035</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Learning efficient convolutional networks through network slimming</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Dally</surname>
          </string-name>
          ,
          <article-title>Learning both weights and connections for efficient neural networks</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>