<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>RetinaGate: A Gated Feature Pyramid Network for Improved Object Detection with SE-based Attention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mahtab Jamali</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Axis Communications AB</institution>
          ,
          <addr-line>Lund</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science and Media Technology, Sustainable Digitalisation Research Centre, Malmö University</institution>
          ,
          <addr-line>Malmö</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Object detection is a critical task in computer vision with wide-ranging applications, from autonomous driving to surveillance systems. Despite notable progress, challenges such as detecting small objects, managing occlusions, and effectively integrating multiscale features persist. We propose RetinaGate, a novel object detection architecture that introduces a Gated Feature Pyramid Network (G-FPN) to adaptively fuse multi-scale features, enhanced by Squeeze-and-Excitation-based channel attention for improved accuracy. As a plug-and-play module, G-FPN can be seamlessly integrated into existing detection models to enhance their accuracy. These enhancements strengthen the model's capacity to capture fine-grained details and leverage contextual information more effectively. Experimental results on three benchmark datasets demonstrate that RetinaGate outperforms the baseline RetinaNet in terms of detection accuracy, particularly in challenging detection scenarios such as underwater environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Object Detection</kwd>
        <kwd>RetinaNet</kwd>
        <kwd>FPN</kwd>
        <kwd>Gated Fusion</kwd>
        <kwd>RetinaGate</kwd>
        <kwd>SEBlock</kwd>
        <kwd>Attention</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Object detection has become a cornerstone in the field of computer vision, with wide-ranging
applications that include autonomous driving, medical diagnostics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and real-time video analysis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As an
essential component of intelligent systems, object detection aims to locate and classify objects within
an image, making it crucial for tasks requiring both precision and computational efficiency [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        While deep learning detectors such as RetinaNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Faster R-CNN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and YOLO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have achieved
remarkable progress, some challenges persist. Small object detection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] remains a significant hurdle due
to insufficient feature resolution at higher pyramid levels. Other challenges include occlusion, where
objects are partially hidden from view, and cluttered backgrounds, which can lead to false positives
or missed detections [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Furthermore, the semantic gap between low-level and high-level features
can hinder precise localization and classification, especially in complex environments [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. These
limitations highlight the need for more sophisticated backbone architectures and robust feature fusion
mechanisms to improve detection accuracy across diverse scenarios.
      </p>
      <p>
        RetinaNet [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], a one-stage object detector known for its efficiency and Focal Loss, provides a robust
baseline for addressing common detection challenges. However, its default architecture can be further
enhanced to improve performance in complex scenarios, such as detecting small or occluded objects. One
limitation of the standard ResNet-50 backbone is its inability to adaptively focus on the most informative
feature channels, which can reduce its effectiveness in cluttered or context-rich scenes. In addition,
the standard Feature Pyramid Network (FPN) processes each pyramid level independently, without
explicitly fusing cross-level information. This limits its ability to fully exploit the complementary
strengths of multi-scale features.
      </p>
      <p>
To address these limitations, we propose a novel enhancement to RetinaNet, titled RetinaGate, which
incorporates Squeeze-and-Excitation (SE) blocks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and a novel FPN, titled G-FPN (Gated Fusion
FPN). SE blocks, integrated into the ResNet-50 backbone, improve channel-wise attention, enabling the
model to prioritize the most informative features. The Gated Fusion module, applied after the Feature
Pyramid Network (FPN), enhances the fusion of multiscale features, ensuring robust performance across
diverse object sizes and challenging conditions. These modifications specifically target the weaknesses
in handling small objects, occlusions, and the integration of multiscale features, which are critical for
achieving higher detection accuracy.
      </p>
      <p>This paper is structured as follows. In Section 2, we discuss related works, focusing on advancements
in backbone architectures, feature fusion techniques, and one-stage detectors. Section 3 outlines the
methodology behind our proposed enhancements, detailing the integration of SE blocks and Gated
Fusion. Section 4 presents the datasets used for evaluation, and Section 5 reports the experimental
results, demonstrating the superiority of RetinaGate over the baseline RetinaNet. Finally, Section 6
concludes with future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>The field of object detection has witnessed substantial progress with the development of various
architectures and techniques. Among them, RetinaNet has stood out as a significant contribution,
offering a balance between accuracy and computational efficiency. However, several studies have
identified limitations in RetinaNet and proposed enhancements to address them:</p>
      <p>
        RetinaNet and Multiscale Detection: Lin et al. introduced RetinaNet with Focal Loss to mitigate
the impact of class imbalance in object detection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Despite its success, challenges such as small object
detection and effective multiscale feature integration remain. For instance, the work by Kong et al.
introduced the Deep Feature Pyramid Network (DFPN) [14], which augments FPNs with enhanced
connectivity to improve multiscale detection, particularly for small objects. Similarly, Libra R-CNN [15]
addressed multiscale imbalance by introducing balanced feature pyramid integration. The NAS-FPN
[16] utilized neural architecture search to optimize feature pyramid designs, achieving state-of-the-art
performance. However, these solutions often introduce significant computational complexity. BiFPN,
proposed in EfficientDet [17], enhanced multiscale detection by employing lightweight, bidirectional
feature fusion. While effective, BiFPN requires fine-tuned hyperparameters and is not tailored for
one-stage detectors like RetinaNet.
      </p>
      <p>Feature Enhancement Mechanisms: Researchers have proposed various mechanisms to enhance
feature representation. Hu et al. introduced Squeeze-and-Excitation Networks to recalibrate
channel-wise feature responses dynamically. These networks have been integrated into different architectures to
improve attention mechanisms. For example, SENet [18] was successfully applied to image classification
tasks, and Zhang et al. (2020) extended it to Faster R-CNN for improving object detection. Similarly, Woo
et al. proposed Convolutional Block Attention Module (CBAM) [19], which combines channel and spatial
attention for enhanced feature extraction. CBAM has been incorporated into architectures like YOLOv4,
demonstrating improvements in feature selectivity. More recently, Efficient Attention Networks (EANet)
[20] introduced lightweight attention mechanisms for real-time object detection, which significantly
reduced computational overhead. However, these methods have primarily focused on classification
tasks or two-stage detectors, with limited exploration in one-stage models like RetinaNet.</p>
      <p>Contextual and Multiscale Fusion: Feature fusion is another area of focus for improving object
detection [21]. Works such as PANet [22] and NAS-FPN [23] emphasize enhancing information
flow across scales. PANet introduced bottom-up path augmentation to complement FPN’s top-down
feature flow, improving multiscale detection capabilities. More recently, Auto-FPN [24] employed
neural architecture search to automatically design efficient feature fusion paths, addressing multiscale
detection while maintaining computational efficiency. Additionally, Dynamic FPN [25] integrated
adaptive mechanisms to dynamically adjust the contributions of feature levels based on the input image
characteristics, further enhancing context-aware fusion. Although effective, these approaches often
involve high computational costs, making them less suitable for real-time applications. Our approach
incorporates a Gated Fusion module, which selectively integrates multiscale features while maintaining
efficiency, addressing both contextual relevance and multiscale challenges.</p>
      <p>Enhancements in FPN Design: Enhancements to the original FPN architecture have focused
on improving information flow and balancing feature contributions. Libra R-CNN [15] introduced a
balanced semantic path to reduce feature-level imbalance, significantly improving object detection
across scales. NAS-FPN [23] used neural architecture search to automate FPN design, resulting in
high-performing but computationally expensive structures. BiFPN, proposed in EfficientDet [17],
employed bidirectional fusion to refine multiscale feature integration while reducing computational cost.
Additionally, works like Path Aggregation Network (PANet) [22] extended FPN with bottom-up paths,
enabling improved feature reuse for instance segmentation and detection tasks. Recently, Dynamic
FPN [25] adapted FPN contributions dynamically based on input image requirements, addressing both
efficiency and adaptability. While these approaches provide valuable insights, many require extensive
computational resources or are highly domain-specific, limiting their generalizability. Our work adopts
a simpler, yet effective Gated Fusion strategy, ensuring scalability and efficiency for diverse detection
tasks.</p>
      <p>Related Enhancements in One-Stage Detectors: One-stage object detectors such as YOLO [26],
SSD [27], and RetinaNet have been the subject of extensive research and development. SSD (Single Shot
MultiBox Detector) introduced a novel approach to predict object locations and class scores directly
from feature maps, leveraging multiple feature scales for detecting objects of various sizes. However,
its fixed anchor configurations posed challenges for small object detection. YOLOv3 and its successors,
YOLOv4 [28] and YOLOv5 [29], addressed these limitations by employing improved feature extraction
backbones such as CSPNet and introducing techniques like mosaic augmentation to enhance training
data diversity. YOLOv7 [30] and YOLOv8 further explored decoupled head architectures, lightweight
attention modules, and optimized training pipelines to improve accuracy and efficiency. Similarly,
FCOS (Fully Convolutional One-Stage Object Detection) removed the need for anchor boxes altogether,
relying on a center-ness score to predict object locations directly, thus simplifying the pipeline while
maintaining competitive performance. Despite these advances, integrating robust feature attention and
fusion mechanisms, as proposed in our work, remains a critical gap for improving small and occluded
object detection in one-stage detectors.</p>
      <p>Our work differentiates itself by integrating Squeeze-and-Excitation blocks with Gated Fusion directly
into RetinaNet’s architecture. By addressing the limitations of both the backbone and FPN, our approach
provides a comprehensive solution for object detection and multiscale feature integration without
incurring significant computational overhead. Additionally, our method uniquely combines adaptive
feature prioritization and gated feature fusion, filling the gap between lightweight design and robust
feature representation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <sec id="sec-3-0">
        <title>3.1. Overview of the Proposed Approach</title>
        <p>The proposed approach is illustrated in Figure 1, comprising four main components: (a) ResNet Backbone,
(b) SE Blocks, (c) Feature Pyramid Network (FPN), and (d) G-FPN. This architecture combines the SE block and
the novel G-FPN (Gated Fusion FPN), which contains a gated fusion module, to address challenges such
as small object detection, occlusions, and domain-specific variations, resulting in enhanced detection
accuracy and robustness. Unlike a standard FPN that directly passes feature maps to the classification and
regression heads without additional refinement, G-FPN integrates the Gated Fusion module to generate
an enriched feature map. This additional feature map enhances the multiscale feature representation,
improving detection accuracy by enabling better contextual understanding and feature refinement. The
proposed model consists of the following components:</p>
      <p>(a) ResNet Backbone: The ResNet-50 backbone extracts hierarchical feature maps from the input
image, capturing both low-level and high-level representations.</p>
      <p>(b) SE Blocks: A Squeeze-and-Excitation (SE) block is incorporated into the ResNet-50 backbone after
each major layer group (layer1, layer2, layer3, and layer4). Each block enhances the model’s capability to
recalibrate channel-wise feature responses adaptively by modeling dependencies between channels. By
prioritizing informative features and suppressing less relevant ones, SE blocks improve the robustness
of feature representations.</p>
      <p>The primary reason for placing SE blocks in ResNet-50 is to enhance hierarchical feature learning
across different layers:
• Early-Layer Enhancement: SE blocks in lower layers focus on improving edge and texture details,
critical for small object detection.
• Mid-Layer Refinement: At intermediate layers, they refine semantic feature representation for
medium-sized objects.
• Deep-Layer Contextualization: In the final layer group, SE blocks emphasize high-level semantic
features, which are essential for addressing occlusions and complex object shapes.</p>
      <p>This block integration ensures that features at all scales are adaptively weighted, contributing to
improved multiscale detection performance.</p>
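      <p>The following is a minimal PyTorch sketch of the SE block described in (b) and of one possible way to insert it after the four ResNet-50 layer groups. The reduction ratio of 16 and the use of torchvision's ResNet-50 are illustrative assumptions rather than settings reported in this paper.</p>
      <preformat>
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global average pooling
        self.fc = nn.Sequential(                 # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                       # channel-wise recalibration

# Hypothetical placement: wrap each of the four ResNet-50 layer groups with an SE block
# (channel widths 256, 512, 1024, 2048 in the standard torchvision ResNet-50).
backbone = resnet50(weights=None)
for name, channels in [("layer1", 256), ("layer2", 512), ("layer3", 1024), ("layer4", 2048)]:
    setattr(backbone, name, nn.Sequential(getattr(backbone, name), SEBlock(channels)))
      </preformat>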
      <p>(c) Feature Pyramid Network (FPN): The FPN aggregates multiscale features from the ResNet backbone,
enabling robust detection of objects at varying scales.</p>
      <p>(d) G-FPN (Gated Fusion FPN): The original FPN aggregates multiscale features without any dynamic
weighting, treating all scales equally. In contrast, G-FPN introduces:
• Dynamic Feature Prioritization: Ensures relevant scales contribute more significantly.
• Enhanced Feature Representation: Combines fused features with original multiscale outputs,
providing richer context.
• Plug-and-Play Flexibility: Can be integrated into various detection architectures without
significant modification.</p>
      <p>By dynamically weighting the contributions of each scale, G-FPN ensures that the most relevant
features are prioritized, improving the model’s ability to handle objects of varying sizes and complexities.</p>
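      <p>To make the plug-and-play character of G-FPN concrete, the following is a minimal, hypothetical PyTorch sketch of how the four components could be wired together. The class and head names are placeholders, the gated fusion stage is the module sketched after the mathematical formulation below, and the choice to append the fused map as an extra feature level for the detection heads is an assumption made only for illustration.</p>
      <preformat>
import torch
import torch.nn as nn

class RetinaGateSketch(nn.Module):
    def __init__(self, backbone, fpn, gated_fusion, cls_head, reg_head):
        super().__init__()
        self.backbone = backbone          # ResNet-50 with SE blocks after each layer group
        self.fpn = fpn                    # standard FPN producing multi-scale maps
        self.gated_fusion = gated_fusion  # G-FPN's gating stage (see Section 3.1)
        self.cls_head = cls_head          # classification subnet
        self.reg_head = reg_head          # box regression subnet

    def forward(self, images: torch.Tensor):
        c_feats = self.backbone(images)        # hierarchical backbone features
        p_feats = self.fpn(c_feats)            # list of pyramid levels
        fused = self.gated_fusion(p_feats)     # enriched, adaptively fused map
        all_feats = list(p_feats) + [fused]    # keep the originals, add the enriched map
        cls_out = [self.cls_head(f) for f in all_feats]
        reg_out = [self.reg_head(f) for f in all_feats]
        return cls_out, reg_out
      </preformat>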
      <p>The structure of G-FPN is shown in Figure 2. The Gated Fusion Module is designed to enhance the
integration of multiscale feature maps, enabling adaptive fusion based on feature relevance. Unlike
the standard FPN, which aggregates features in a static manner, the Gated Fusion module incorporates
gating mechanisms to modulate the contributions of each scale dynamically. This ensures a more
context-aware and robust feature representation. The module selectively integrates feature maps from
multiple levels of the FPN and produces an enriched feature map, using a gating mechanism that
dynamically adjusts the contribution of each individual feature map to the final fused representation.
The gated fusion is computed as described in the following formulation.</p>
      </sec>
      <sec id="sec-3-1">
        <title>Mathematical Formulation</title>
        <p>Let F_1, F_2, …, F_N represent the feature maps from the N levels of the FPN,
where F_i ∈ ℝ^{C_i × H_i × W_i}, and C_i, H_i, and W_i are the channel, height, and width dimensions of the feature
map at level i. The Gated Fusion Module combines these feature maps as follows:</p>
        <p>1. Spatial Alignment: Each feature map is resized to a common spatial resolution, denoted as
(H_r, W_r), which corresponds to the resolution of a reference feature map (e.g., the first feature map, F_1):
F̂_i = Interpolate(F_i, size = (H_r, W_r), mode = ’nearest’),
where F̂_i is the resized feature map at level i.</p>
        <p>2. Gating Mechanism: For each resized feature map F̂_i, a gating mechanism is applied to compute
the importance weights. The gating function is defined as:
G(F̂_i) = σ(W_2 ∗ ReLU(W_1 ∗ F̂_i)),
where:
• W_1 ∈ ℝ^{C × (C/r) × 1 × 1} and W_2 ∈ ℝ^{(C/r) × C × 1 × 1} are learnable weight tensors.
• r is the reduction ratio, which controls the dimensionality reduction in the gating mechanism.
• ∗ denotes convolution, and ReLU(⋅) is the Rectified Linear Unit activation function.
• σ(⋅) represents the sigmoid activation function, which scales the importance weights between
0 and 1.</p>
        <p>3. Feature Weighting: The gated feature map is obtained by element-wise multiplication of the
gating weights and the resized feature map:
F̂_i^gated = G(F̂_i) ⊙ F̂_i,
where ⊙ denotes element-wise multiplication.</p>
        <p>4. Feature Fusion: The final fused feature map is computed by summing the gated feature maps
from all levels:
F_fused = ∑_{i=1}^{N} F̂_i^gated.</p>
        <p>The gating mechanism adaptively learns the importance of features at each level of the FPN, ensuring
that only the most relevant features contribute to the final fused representation. The interpolation step
aligns the spatial dimensions of all feature maps, enabling effective fusion across scales. The reduction
ratio r controls the complexity of the gating mechanism, allowing for efficient computation.</p>
        <p>Advantages of the Gated Fusion Module:
• Enables selective emphasis on important features from different levels of the FPN.
• Facilitates multi-scale feature integration, enhancing the network’s ability to capture both fine
and coarse details.
• Reduces the impact of redundant or irrelevant features, improving the overall performance of the
object detection model.</p>
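        <p>The following is a minimal PyTorch sketch of the Gated Fusion Module following steps 1 to 4 above: nearest-neighbour spatial alignment to a reference resolution, a per-level 1×1 convolutional gating branch with reduction ratio r, element-wise weighting, and a summation over levels. A shared channel width across pyramid levels (as in a standard FPN) and a reduction ratio of 16 are assumptions made to keep the sketch short.</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    def __init__(self, channels: int, num_levels: int, reduction: int = 16):
        super().__init__()
        self.gates = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels // reduction, kernel_size=1),  # W1: C to C/r
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),  # W2: C/r to C
                nn.Sigmoid(),                                               # sigma: weights in (0, 1)
            )
            for _ in range(num_levels)
        ])

    def forward(self, features):
        # Step 1: align every level to the spatial size of the reference (first) map.
        h, w = features[0].shape[-2:]
        aligned = [F.interpolate(f, size=(h, w), mode="nearest") for f in features]
        # Steps 2 and 3: compute gating weights per level and re-weight each aligned map.
        gated = [gate(f) * f for gate, f in zip(self.gates, aligned)]
        # Step 4: sum the gated maps into the fused representation.
        return torch.stack(gated, dim=0).sum(dim=0)

# Example usage: fuse five 256-channel pyramid levels (RetinaNet's P3-P7 convention is an assumption).
# fusion = GatedFusion(channels=256, num_levels=5)
# enriched = fusion([p3, p4, p5, p6, p7])
        </preformat>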
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Datasets</title>
      <p>To evaluate the performance and generalization capability of our proposed model, we conducted
experiments on three datasets: Pascal VOC 2007, Pascal VOC 2012, and the Aqua dataset. These datasets
encompass a range of object categories and challenging conditions, allowing us to demonstrate the
versatility and robustness of our enhancements.</p>
      <p>1. Pascal VOC 2007</p>
      <p>Pascal VOC 2007 [31] consists of 5,000 training images and 4,900 testing images, covering 20 object
categories. We performed an ablation study using subsets of Pascal VOC 2007 to analyze the effectiveness
of our modifications. Initially, we tested the model with 100 images from three classes (person, car, bus),
enabling a focused evaluation of the model’s improvements in a simplified setting. Subsequently, we
increased the dataset to 1,000 images covering four classes (person, car, bus, motorbike) to examine the
scalability and consistency of the enhancements. Finally, the model was evaluated on the complete
Pascal VOC 2007 dataset to assess its generalization capability across diverse object classes and a larger
number of images.</p>
      <p>2. Pascal VOC 2012</p>
      <p>Pascal VOC 2012 [32] consists of 13,690 training images and 3,422 testing images, providing a more
comprehensive dataset compared to Pascal VOC 2007. This dataset includes additional images and
variations in image conditions, allowing us to validate the model’s generalization ability across different
distributions. Testing on Pascal VOC 2012 ensures the robustness of our approach in handling diverse
object classes and environmental variations.</p>
      <p>3. Aqua Dataset</p>
      <p>The Aqua dataset contains 575 training images and 63 testing images, specifically designed for
underwater object detection. This dataset presents unique challenges, such as blurred objects, low
visibility, and occlusions caused by underwater conditions. These factors often complicate the detection
of marine life, such as fish, which are not only camouflaged but also exhibit irregular shapes and
movements. By applying our model to this dataset, we demonstrate its adaptability and capability to
handle complex environments outside the standard datasets used for object detection.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We conducted an ablation study using the Pascal VOC 2007 dataset, progressively analyzing the impact
of the SE block and G-FPN on the baseline RetinaNet model. The study included testing with subsets of
Pascal VOC 2007 (100 images and 1,000 images) and the complete Pascal VOC 2007 dataset to understand
the contribution of each module. Additionally, the complete approach (Model 4) was evaluated on Pascal
VOC 2012 and the Aqua dataset to assess its generalization across different domains and challenging
scenarios. For all three complete datasets, we trained the proposed model five times and calculated the
standard deviation to demonstrate the stability of the results.</p>
      <p>Table 1 presents the mean Average Precision (mAP) results for four configurations: Model 1 (original
RetinaNet), Model 2 (adding SEBlock), Model 3 (adding Gated Fusion), and Model 4 (adding both SEBlock
and Gated Fusion).</p>
      <p>We also compared our proposed approach with other methods across the Pascal VOC 2007,
Pascal VOC 2012, and Aquarium datasets. The methods compared on each dataset were as follows:</p>
      <sec id="sec-5-1">
        <title>Dataset</title>
        <sec id="sec-5-1-1">
          <title>Pascal VOC 2007</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>Pascal VOC 2012</title>
        </sec>
        <sec id="sec-5-1-3">
          <title>Aquarium Dataset</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Method</title>
        <sec id="sec-5-2-1">
          <title>RetinaGate (ours)</title>
        </sec>
        <sec id="sec-5-2-2">
          <title>FemtoDet [33]</title>
        </sec>
        <sec id="sec-5-2-3">
          <title>Deformable Parts Model [34]</title>
        </sec>
        <sec id="sec-5-2-4">
          <title>TinyissimoYOLO-v8 [35]</title>
        </sec>
        <sec id="sec-5-2-5">
          <title>RetinaGate (ours)</title>
        </sec>
        <sec id="sec-5-2-6">
          <title>CenterNet [36] DETR [37]</title>
        </sec>
        <sec id="sec-5-2-7">
          <title>RetinaGate (ours) SCL [38] SCAN [39] SIGMA [40]</title>
          <p>YOLOv5 [41]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented RetinaGate, an enhanced RetinaNet-based object detection model
incorporating Squeeze-and-Excitation (SE) blocks and a novel FPN, Gated Fusion FPN (G-FPN). By integrating
SE blocks into the ResNet-50 backbone and introducing G-FPN for adaptive multiscale feature fusion,
our approach effectively addressed challenges such as small object detection, occlusions, and complex
feature integration.</p>
      <p>Experimental results across Pascal VOC 2007, Pascal VOC 2012, and the Aquarium dataset
demonstrated the superiority of the proposed model compared to baseline RetinaNet and several state-of-the-art
methods. Our results highlight the strength of the G-FPN as a plug-and-play module that can be
integrated into other architectures to improve detection performance, particularly in scenarios involving
challenging domains such as underwater environments where objects are often blurred or occluded.
This flexibility and the observed performance gains underline the potential of our proposed
enhancements for broader applications in object detection tasks. Future research will focus on further evaluating
the generalizability of the G-FPN across more diverse datasets and exploring its integration into other
backbone architectures to fully leverage its capabilities.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for some sections to check grammar.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <article-title>Object detection human activity recognition for improved patient mobility and caregiver ergonomics (</article-title>
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Davidsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khoshkangini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Ljungqvist</surname>
          </string-name>
          , R.-C.
          <article-title>Mihailescu, Context in object detection: a systematic literature review</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>58</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Davidsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khoshkangini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Ljungqvist</surname>
          </string-name>
          , R.-C. Mihailescu,
          <article-title>Specialized indoor and outdoor scene-specific object detection models</article-title>
          ,
          <source>in: Sixteenth International Conference on Machine Vision (ICMV</source>
          <year>2023</year>
          ), volume
          <volume>13072</volume>
          ,
          <string-name>
            <surname>SPIE</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Davidsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khoshkangini</surname>
          </string-name>
          , R.-
          <string-name>
            <given-names>C.</given-names>
            <surname>Mihailescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sexton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Johannesson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tillström</surname>
          </string-name>
          ,
          <article-title>Video-audio multimodal fall detection method</article-title>
          ,
          <source>in: Pacific Rim International Conference on Artificial Intelligence</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Ross</surname>
          </string-name>
          , G. Dollár,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <surname>Faster</surname>
          </string-name>
          r-cnn:
          <article-title>Towards real-time object detection with region proposal networks</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>39</volume>
          (
          <year>2016</year>
          )
          <fpage>1137</fpage>
          -
          <lpage>1149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ergu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A review of yolo algorithm developments</article-title>
          ,
          <source>Procedia computer science 199</source>
          (
          <year>2022</year>
          )
          <fpage>1066</fpage>
          -
          <lpage>1073</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wergeles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <article-title>A survey and performance evaluation of deep learning methods for small object detection</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>172</volume>
          (
          <year>2021</year>
          )
          <fpage>114602</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , J. Ma,
          <article-title>Development and challenges of object detection: A survey</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>598</volume>
          (
          <year>2024</year>
          )
          <fpage>128102</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Hariharan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Belongie</surname>
          </string-name>
          ,
          <article-title>Feature pyramid networks for object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2117</fpage>
          -
          <lpage>2125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tajgardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shiranzaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khoshkangini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbani</surname>
          </string-name>
          ,
          <article-title>Advanced stock market prediction using unsupervised federated learning techniques</article-title>
          , in: 2025 29th International Computer Conference, Computer Society of Iran (CSICC), IEEE,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>2017</year>
          , pp.
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          . doi:10.1109/ICCV.2017.324.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Squeeze-and-excitation networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>7132</fpage>
          -
          <lpage>7141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] F. Hou, Q. Gao, Y. Song, Z. Wang, Z. Bai, Y. Yang, Z. Tian, Deep feature pyramid network for EEG emotion recognition, Measurement 201 (2022) 111724.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, D. Lin, Libra R-CNN: Towards balanced learning for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 821–830.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] G. Ghiasi, T.-Y. Lin, Q. V. Le, NAS-FPN: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7036–7045.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Tan, R. Pang, Q. V. Le, EfficientDet: Scalable and efficient object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10781–10790.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141. doi:10.1109/CVPR.2018.00745.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, CBAM: Convolutional block attention module, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19. doi:10.1007/978-3-030-01234-2_1.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, Z. Liu, Efficient attention network: Accelerate attention by searching where to plug, arXiv preprint arXiv:2206.01659 (2022).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] R. Khoshkangini, M. Tajgardan, M. Jamali, M. G. Ljungqvist, R.-C. Mihailescu, P. Davidsson, Hierarchical transfer multi-task learning approach for scene classification, in: International Conference on Pattern Recognition, Springer, 2024, pp. 231–248.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path aggregation network for instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8759–8768. doi:10.1109/CVPR.2018.00913.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] G. Ghiasi, T.-Y. Lin, Q. V. Le, NAS-FPN: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7036–7045. doi:10.1109/CVPR.2019.00721.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Q. Wang, T. Yang, J. Zhang, Z. Li, Y. Chen, J. Wang, J. Sun, Auto-FPN: Automatic network architecture adaptation for object detection beyond classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6649–6658. doi:10.1109/CVPR42600.2020.00668.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] S. Zhang, C. Chi, Y. Yao, Z. Lei, S. Z. Li, Dynamic feature pyramid networks for object detection, arXiv preprint arXiv:2012.00779 (2021).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, in: Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 21–37. doi:10.1007/978-3-319-46448-0_2.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, YOLOv4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934 (2020).</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] G. Jocher, A. Chaurasia, J. Qiu, YOLO by Ultralytics, 2020. URL: https://github.com/ultralytics/yolov5.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] C.-Y. Wang, A. Bochkovskiy, H.-Y. M. Liao, YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, arXiv preprint arXiv:2207.02696 (2022).</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The Pascal visual object classes (VOC) challenge, International Journal of Computer Vision 88 (2010) 303–338.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The Pascal visual object classes challenge: A retrospective, International Journal of Computer Vision 111 (2015) 98–136.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] P. Tu, X. Xie, G. Ai, Y. Li, Y. Huang, Y. Zheng, FemtoDet: An object detection baseline for energy versus performance tradeoffs, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13318–13327.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] R. Girshick, F. Iandola, T. Darrell, J. Malik, Deformable part models are convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 437–446.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] J. Moosmann, P. Bonazzi, Y. Li, S. Bian, P. Mayer, L. Benini, M. Magno, Ultra-efficient on-device object detection on AI-integrated smart glasses with TinyissimoYOLO, arXiv preprint arXiv:2311.01057 (2023).</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, Q. Tian, CenterNet: Keypoint triplets for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6569–6578.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] A. Löchner, Semantic clustering by adopting nearest neighbor (SCAN), in: Der andere Sport: Esports zwischen gesellschaftlichem Strukturwandel und Marketingstrategie, Springer, 2025, pp. 365–388.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, Y. Bengio, Learning deep representations by mutual information estimation and maximization, arXiv preprint arXiv:1808.06670 (2018).</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] W. Li, X. Liu, Y. Yuan, SIGMA: Semantic-complete graph matching for domain adaptive object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5291–5300.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41] G. Jocher, Ultralytics YOLOv5, 2020. URL: https://github.com/ultralytics/yolov5. doi:10.5281/zenodo.3908559.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>