1. Introduction

August

comprehensive survey on object detection YOLO

Xiangheng Wang

Hengyi Li

lihengyi@fc.ritsumei.ac.jp 2

Xuebin Yue

yue-xb@fc.ritsumei.ac.jp 2

Lin Meng

menglin@fc.ritsumei.ac.jp 0 0 College of Science and Engineering, Ritsumeikan University , 1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577 , Japan 1 Graduate School of Science and Engineering, Ritsumeikan University , 1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577 2 Research Organization of Science and Technology, Ritsumeikan University , 1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577

2023

2 8 29

As a single-stage object detection framework, the YOLO (You Only Look Once) technique has emerged as a prominent technique for various object detection tasks owing to its impressive balance between speed and precision. This research article presents a comprehensive review of the YOLO family of algorithms. This review covers the evolutionary journey of YOLO from its initial release to the latest versions, encompassing an in-depth analysis of the performance and critical characteristics exhibited by each iteration. Particular emphasis is given to exploring the applications of YOLO in diverse domains, focusing on its role in real-time object detection on embedded systems. Furthermore, the paper delves into the latest advancements in compressing algorithms for optimizing the cumbersome YOLO models and practical implementation examples. The potential of deploying YOLO on resource-constrained devices is further unlocked by addressing the challenge of model size reduction. Finally, this study outlines potential research trends and improvements for the YOLO family of algorithms, including novel architectural designs and innovative training strategies. Overall, the thorough investigation presented in this review is a valuable reference for researchers seeking to explore the YOLO framework and its evolving landscape in object detection.

YOLO object detection deep learning application compressing

1. Introduction

Object detection is the main task of computer vision which is to locate the object of interest from the input image target and then accurately judge the class of each target of interest. In recent years, object detection as a popular task in computer vision due to its wide range of applications and recent technological breakthroughs.

Traditional target detection algorithm uses sliding window or image segmentation technology to generate a large number of candidate regions and then extracts image features for each The 5th International Symposium on Advanced Technologies and Applications in the Internet of Things (ATAIT 2023), CEUR Workshop Proceedings candidate region such as Histograms of Oriented Gradients (HOG)[ 1 ]. These features are passed to a classifier like Support Vector Machine (SVM)[ 2 ] to judge the category of the candidate region. However, traditional target detection algorithm often needs to produce many candidate boxes, resulting in low eficiency. The speed and accuracy of detection can not meet the requirement of practical application, so developing a traditional target detection algorithm falls into the bottleneck. The deep convolutional neural network has been applied in various domains since AlexNet[ 3 ] was released in 2012. Deep learning provides a new method for object detection. Since then, a series of models based on deep learning have been proposed, and researchers have focused more on deep learning.

Object detection based on deep learning can be divided into two categories according to detection methods: Region-based and regression-based. The region-based system is known as a two-stage detector that first determines the candidate boxes of the samples and then classifies the samples through the convolutional neural networks (CNN), such as Regional CNN (R-CNN)[ 4 ].

The regression-based system is called a one-stage detector that does not generate candidate boxes during processing and directly realizes target detection based on a specific regression analysis. The comparative analysis shows that the characteristics of the two methods are diferent, the two-stage detectors were better than one-stage models in accuracy, but the realtime performance was slightly slower. Due to one-stage detectors’ comprehensive performance and excellent operational eficiency, researchers have focused more on the one-stage detector. The most typical representative of one-stage detectors is YOLO[ 5 ].

Due to the superior performance of YOLO, researchers have made many improvements to it, and YOLO has been updated to YOLOv8[ 6 ]. In addition, there are YOLOR[ 7 ] and YOLOX[ 8 ].

Shao et al.[ 9 ] provide a detailed description of the YOLO algorithms, but the article is limited to YOLOv5[ 10 ] and does not include the latest version of the YOLO family. Terven et al.[ 11 ] make a summary of YOLO but need to be more intuitive. In this article, we introduce the features of YOLOv1-v8, YOLOR, and YOLOX to give researchers an understanding of the YOLO families. It provides help for researchers to choose models according to their own needs.

The rest of this paper is organized as follows. The architectures of the YOLO family are discussed in Section 2 and summarized according to structure, image size, AP, 50, FPS, and Parameters. Section 3 elaborates on the improvements and applications of YOLO in various domains. Section 4 introduces model compression and summarizes the applications of compressed YOLO models. Finally, section 5 summarizes the development trend of the YOLO framework and gives an outlook on the future of YOLO.

2. Systematic overview on the key features of YOLO families

The initial YOLO was presented by Joseph Redmon et al. in CVPR 2016. The YOLO means “You Only Look Once,” which glances at an image like a human and ofhandedly knows what object is in the image, what they are doing, and where they are. Unlike traditional object detection models based on deep learning, YOLO had excellent accuracy and speed as a regression-based object detection algorithm. YOLO is an advanced one-stage object detection framework that has evolved over the years and spawned several versions. This section introduces the diferences between YOLOv1-YOLOv8, YOLOX, and YOLOR. YOLO algorithm forgoes the traditional sliding window technique, and it divides the input image into × grids, each of which predicts B bounding boxes of the same class and the confidence of each grid for diferent classes. Each bounding box predicts five values: ( , , ,ℎ, ), representing the bounding box’s position, size, and confidence, respectively. Each grid predicts ( ×5+ ) values, after which Non-Maximum Suppression (NMS) is used to remove duplicate detections. 2.1. YOLOv1 Accuracy, including classification accuracy and localization accuracy, and detection speed are essential criteria to judge the quality of image object detection model[ 12 ]. The YOLO model does not need to generate candidate boxes and classify them through the CNN network but directly regresses the candidate boxes and categories of the target in multiple positions of the image. YOLOv1 resizes the input image to 448×448 and trains it with the features extracted by CNN. And then, YOLOv1 processes the prediction results to achieve end-to-end object detection.

YOLOv1[ 5 ] uses a backbone network similar to GoogLeNet[ 13 ], with 24 convolutional layers and two fully connected layers, and 1×1 convolutional layers are used to reduce the number of feature maps. YOLOv1 is pre-trained on ImageNet[ 14 ] and then transferred to validate on VisualObject Classes (VOC) dataset. YOLOv1 divides the input image into 7×7 grids, and each grid predicts two bounding boxes, so there are 7×7×2 bounding boxes. A maximum of 49 targets were identified. YOLOv1 is not good at identifying dense targets and small targets.

Evaluated on the PASCAL VOC2007, YOLOv1 scores 63.4% average precision (AP). 2.2. YOLOv2 YOLOv2[ 15 ] inspires by the architecture of VGG[ 16 ] and constructs a Darknet-19 network that contains 19 convolutional layers and five max pooling layers. YOLOv1 uses the fully connected layer to predict the bounding box directly and loses more spatial information, causing inaccurate positioning. YOLOv2 introduces anchor boxes from Faster R-CNN[ 17 ] instead of fully connected layers to predict bounding boxes. Furthermore, YOLOv2 uses batch normalization to improve convergence—the authors through changing the input size to achieve robustness.

In addition, to achieve the goal of extending object detection to objects lacking detection samples, YOLOv2 uses various datasets to optimize the training jointly. The WordTree method is trained synchronously on the ImageNet classification dataset and MS COCO dataset to achieve real-time detection with more than 9000 object categories. The improved YOLOv2 is also known as YOLO9000.

Evaluated on the PASCAL VOC2007, YOLOv2 achieves 78.6% AP, which is 15.2% higher than YOLOv1. 2.3. YOLOv3 YOLOv3[ 18 ] uses the Darknet-53 as the backbone. The architecture of YOLOv3 draws on the residual structure of ResNet[ 19 ] to deepen the network structure, which solves the problem of network gradient explosion that makes the network dificult to converge. YOLOv3 utilizes a method similar to Feature Pyramid Network (FPN)[ 20 ] and performs multi-scale training to predict three boxes at three diferent scales. Moreover, YOLOv3 also uses k-means to determine the prior of the bounding box of the anchor box. Unlike YOLOv2, YOLOv3’s architecture uses three prior boxes for the scales of the large, medium, and small objects. Furthermore, feature maps of three diferent scales are used for object detection by feature fusion, and logistics is used to replace softmax for category prediction to achieve multi-label object detection. The proposed network not only improves the performance of small objects but also achieves 3 to 4 times faster than previous YOLO models when the bounding box prediction is not strict, and the detection accuracy is similar.

Due to the rapid development of computer vision, evaluation datasets that can better reflect the comprehensive performance of detection algorithms are needed. When YOLOv3 was released, the evaluation benchmark for object detection changed from PASCAL VOC to MS COCO. Therefore, YOLOv3 and subsequent YOLO models are evaluated on the MS COCO dataset. The YOLOv3 algorithm meets the accuracy and speed requirements of real-time detection and has become one of the preferred target detection algorithms in the engineering field.

Evaluated on the MS COCO dataset test-dev 2017, YOLOv3 achieves 31.0% AP and 55.3% 50 at 20 FPS. 2.4. YOLOv4 After YOLOv3, YOLOv4[ 21 ] adopts various improvement methods, which are divided into bag-of-freebies and bag-of-specials: the bag-of-freebies refer to modules that improve training without afecting inference speed, and the bag-of-specials refer to modules that have less impact on inference time and higher performance returns. The key features include Cross Stage Partial (CSP)[ 22 ], and Mish activation function[ 23 ] are adopted in the backbone network. CSPNet solves the bottleneck of the repetition of gradient information in other large neural network frameworks. It uses a unique method to integrate the gradient into the feature map fully, so this method can efectively reduce the number of parameters and FLOPs values of the model.

Therefore, compared with other YOLO models, YOLOv4 has higher accuracy while maintaining a higher inference speed. Furthermore, YOLOv4 is more suitable for training on a single GPU. The architecture of YOLOv4 operates CSPDarknet-53 as the backbone. For the neck, authors also use tricks from YOLOv3-SPP, including a modified version of spatial pyramid pooling (SPP)[ 24 ], multi-scale prediction, and a modified path aggregation network (PANet)[ 25 ] instead of FPN and improved Spatial Attention Module (SAM)[ 26 ]. Moreover, anchor boxes are used in the head to extract features.

Evaluated on the MS COCO dataset test-dev 2017, YOLOv4 achieves 41.2% AP and 62.8% 50 at over 96 FPS by NVIDIA Tesla V100. 2.5. YOLOv5 The basic structure of YOLOv5[ 10 ] is similar to YOLOv4. The significant diference is the scaling based on diferent channels. YOLOv5 provides five scale models of YOLOv5-N (nano) /S (small) / M (medium) / L (large) / X (extra large) are constructed from small to large models. Pytorch develops YOLOv5, which is easier to deploy on hardware than YOLOv4. As of this writing, oficial papers have yet to be published for YOLOv5. According to the homepage, YOLOv5 has been updated to the seventh edition. In the latest version, it is capable of handling classification and instancing segmentation tasks and speeds up training.

Evaluated on the MS COCO dataset test-dev 2017, with image size is 1536×1536, YOLOv5x6 obtains 55.8%AP. In case the image size is 640×640, YOLOv5x achieves 50.7% AP and exceeds 200 FPS on NVIDIA Tesla V100. 2.6. YOLOR The YOLOv4 team releases You Only Learn One Representation (YOLOR)[ 7 ] in 2021. They use implicit knowledge to address the fact that features extracted from previously trained convolutional neural networks are often less adaptable to other problems.

Human beings can acquire explicit knowledge through regular learning and implicit knowledge through subconscious learning. Even things that people have not seen can be judged by experience. YOLOR proposes a unified network combining implicit and explicit knowledge to give the network a learning ability similar to the human brain. YOLOR can be applied in many ifelds, such as segmentation and detection.

Evaluated on the MS COCO dataset test-dev 2017, YOLOR-D6 achieves 55.4% AP and 73.3% 50 at 30 FPS by NVIDIA Tesla V100. The results demonstrate that the performance of all tasks improves after introducing implicit knowledge into the neural network. 2.7. YOLOX YOLOX[ 8 ] is released by Megvii Technology in 2021. The model is based on YOLOv3 by Pytorch. Compared with other YOLO models, YOLOX has three main changes: decoupled head, anchor-free, and advanced label assigning strategy (SimOTA).

Decoupled head: The conflict between classification and regression tasks is an unavoidable problem[ 27 ][ 28 ]. According to the author’s experimental analysis, the coupled detection head may damage the performance, so the head is replaced with a decoupled head. The decoupled head is divided into two parts, one for regression tasks and the other for classification tasks. The structure speeds up the convergence of the network.

Anchor-free: Although the anchor mechanism works well for specific domains, it increases the complexity of the detection head and may cause delays when deployed on edge hardware. Inspired by the target detection models of FCOS[ 29 ], YOLOX uses an anchor-free mechanism that reduces the number of parameters and GFLOPs of the detector and makes the model faster.

SimOTA: Based on the research of OTA[ 30 ], YOLOX reconsiders the tag assignment from a global perspective and proposes to formulate the assignment process as an Optimal Transport (OT) problem. The SimOTA technique proposed by YOLOX achieves better performance with reduced training time.

Evaluated on the MS COCO dataset test-dev 2017, YOLOX-L achieves the best performance is 50.1% AP on COCO at a speed of 68.9 FPS on NVIDIA Tesla V100. 2.8. YOLOv6 The Meituan Vision AI Department released YOLOv6[ 31 ] in 2022. The overall structure refers to YOLOv4 but uses a more advanced mechanism. Firstly, they use a new backbone eficiency developed based on RepVGG[ 32 ]. Secondly, the neck of YOLOv6 adopts the PANet[ 25 ], and RepPAN is obtained after the neck is enhanced. YOLOv6’s architecture is similar to YOLOX. YOLOv6 also uses Eficient Decoupled Head and anchor-free technology. In addition, YOLOv6 also uses a self-distillation strategy to reduce the cost of reasoning. Similar to YOLOv5, YOLOv6 also provides multiple versions to facilitate model quantification and hardware deployment. It is also suitable for application in the industrial field.

Evaluated on the MS COCO dataset test-dev 2017, YOLOv6-L achieves an AP of 52.5% and 50 of 70.0% on a NVIDIA Tesla T4 in the same environment with TensorRT. 2.9. YOLOv7 The research team of YOLOv4 and YOLOR propose YOLOv7[ 33 ] 2022. YOLOv7 outperforms all known object detectors in speed and accuracy, ranging from 5 FPS to 160 FPS. Like YOLOv4, YOLOv7 is trained from scratch on the MS COCO dataset.

The main changes in the architecture of YOLOv7 are Extended eficient layer aggregation networks (E-ELAN) by improved ELAN and Model scaling for concatenation-based models. If more computational blocks are stacked indefinitely, it may destroy the stable state of the network. ELAN[ 34 ] enables the deeper network to learn and converge eficiently by controlling the shortest and longest gradient path. E-ELAN proposed by YOLOv7 further takes expand, shufles, and merges cardinality to achieve the ability to continuously enhance the learning ability of the network without destroying the original gradient path.

Unlike architectures such as ResNet, applying model scaling to a concatenation-based architecture results in a change in the ratio of input and output channels, which leads to the model’s ineficiency for hardware. YOLOv7 proposed a composite scaling approach that maintains the characteristics of the model at the time of initial design and maintains the optimal architecture. The composite scaling method proposed by YOLOv7 maintains the model’s properties at the initial design. It preserves the optimal structure by scaling the width factor on the transition layer with the same amount of variation.

Evaluated on the MS COCO dataset test-dev 2017, when input image size is 640×640, YOLOv7 achieves AP of 51.4% and 50 of 69.7% at 161 FPS by NVIDIA Tesla V100. 2.10. YOLOv8 The YOLOv5 team releases YOLOv8[ 6 ] in January 2023, and an oficial article still needs to be published. The main improvement as follows:

Backbone: The backbone of YOLOv8 is CSPDraknet-53. As YOLOv5, YOLOv8’s C3 module is replaced by the C2f module with richer gradient flow, and diferent channel numbers are adjusted for diferent scale models to achieve further lightweight. Moreover, YOLOv8 still uses the SPPF module used in YOLOv5.

Head: The Head part has two significant improvements compared with YOLOv5. Firstly, it is replaced with the current mainstream Decoupled-Head, which separates the classification and detection. Secondly, it is also changed from Anchor-Based to Anchor-Free.

Loss: YOLOv8 abandons the previous IOU matching or unilateral ratio distribution method but takes the Task-Aligned Assigner positive and negative sample matching method. Furthermore, YOLOv8 introduced Distribution Focal Loss (DFL)[ 35 ].

Data augmentation can improve model performance, but introducing mosaic augmentation in training may have adverse efects. YOLOv8 turns of the Mosaic augmentation in the last ten epochs, improving accuracy. To meet the needs of diferent scenarios, YOLOv8 provides diferent size models of N / S / M / L / X scales.

Evaluated on MS COCO dataset test-dev 2017, YOLOv8X achieves an AP of 53.9% with 283 FPS on NVIDIA Tesla A100 and TensorRT. 2.11. Summary The architecture of the YOLO families is shown in Table??, including the backbone, neck, and anchor. YOLOv2 first proposes darknet-19 as the backbone, and YOLOv3 deepened the 19 convolutional layers to 53 layers. Almost all subsequent YOLO models use Darknet-53 or an improved version of darknet-53 as the backbone architecture. The initial YOLO does not use anchors, which were used in YOLOv2, and improves the prediction accuracy until YOLOX takes an anchor-free approach and performs well. Since then, subsequent versions of YOLO dropped the use of anchors.

Table2 shows the performance, size, FPS, parameters, and GPU used for training mainstream YOLO versions. Among them, YOLOv1 and YOLOv2 were tested on PASCAL VOC2007. When YOLOv3 was released, the benchmark for object detection had changed from PASCAL VOC to Microsoft COCO. Therefore, the performance of subsequent versions is tested on Microsoft COCO. In addition, the YOLOv6 and YOLOv8x models are quantized by TensorRT. From the parameters in Table2, YOLO increasingly favors lighter models. In YOLOv8, the model YOLOv8v8m is only 3.7% less than the 50 of YOLOv8x, but the parameters are only about 38% of that of YOLOv8x, which is 25.9M. In many versions of YOLO, researchers can choose models according to their needs to find a balance between accuracy and speed.

3. Improvements and Applications

YOLO models have been used for Industry, AIoT, Health Care, and the Protection of Cultural Heritage. In industry, many YOLO applications exist for robots, trafic, and personal protection equipment. In terms of the empty-dish recycling robots, Yue et al.[ 36 ][ 37 ] solve the problem that the traditional object detection model requires store parameters, and propose a lightweight dish detection model based on the YOLOX for an empty-dish recycling robot. Ge et al.[ 38 ] through lightweight YOLO-GG to enable the recycling of empty plates eficiently. For trafic, He et al.[ 39 ] design a flexible and eficient one-stage object detection network FE-YOLO for the rail transit scene. Li et al.[ 40 ] optimize the YOLOv5 model and propose a helmet detection system to ensure the safety of workers.

The field of AIOT has also attracted much attention. AIoT integrates artificial intelligence (AI) technology and the Internet of Things (IoT) in practical applications. Morioka et al.[ 41 ] propose a YOLO model-based Android system for ancient text recognition, implemented by communicating with a server equipped to recognize AI models. Building Smart City Trafic Management Systems based on artificial intelligence and big data has become a trend. Liu et al.[ 42 ] use YOLOv3 to detect vehicles and estimate their length by image processing.

YOLO is also widely used in other health care, [ 43 ] propose a system based on YOLOv4-tiny that was quantified and deployed on Jetson-nano at the beginning of COVID-19. The system identifies the wearing status of masks and measures social distance. It efectively protects people’s health. Zhuang et al.[ 44 ] combine YOLO and the improved two-dimensional continuity equation for the cardiac Vector Flow Mapping (VFM) analysis and evaluation.

Besides, ancient books record much information, and decoding these classic books is favorable for studying history, politics, and culture. Liu[ 45 ] and Fujikawa[ 46 ] use YOLO to detect and identify Oracle Bone Inscription (OBI).

To sum up, YOLO models have been shown to play an essential role in various fields. It has become a trend to optimize the YOLO model for diferent applications.

4. Compression methods for YOLO models

The YOLO model has become more complex in pursuit of better performance. Model compression has become the focus of research to facilitate deployment on edge devices. This section introduces several techniques for model compression, such as model pruning, parameter quantization, knowledge distillation, and lightweight model design[ 47 ].

Model Pruning: Model pruning is achieved by searching for redundant layers/channels in the model and deleting them with little or no impact on performance. By pruning the YOLOv3tiny network, Shi et al.[ 48 ] reduce the network computation by 68.7% Wu et al.[ 49 ] propose a real-time apple flower detection using the channel-pruned YOLOv4 model. The number of parameters is reduced by 96.74%.

Parameter Quantization: Parameter quantization converts floating-point calculations to low-bit-rate integer calculations, such as converting float32 to int8 or int4. We quantize the YOLOv4 model on 8-bit through TensorRT and deploy it to the Jetson Nano, and it achieves the purpose of real-time detection of dirty eggs[ 50 ]. Wang et al.[ 51 ] deploy the pruned YOLOv3 detection model on the FPGA, and the model size is reduced by 80% with little change in accuracy.

Knowledge Distillation: The teacher network is a complex pre-trained network, and the student network is a simple small network. By transferring knowledge, the student network that is more suitable for reasoning can be obtained through the teacher network. Chen et al.[ 52 ] propose a lightweight ship detector by knowledge distillation. Xing et al.[ 53 ] use ResNet101 as the teacher network and DD-YOLO as the student network, reducing the model complexity to 61.4%, which is more suitable for mobile deployment.

Lightweight Model Design: Lightweight DNN model design refers to the redesign based on the existing deep neural network structure to reduce the number of parameters and computational complexity. Liu et al.[ 54 ] replace the backbone of YOLOv3 with ShufleNet to realize real-time vehicle detection. Liu et al.[ 55 ] uses the backbone of YOLOv4 with Mobilenetv3, which improves the accuracy of an extensive pedestrian detection network.

5. Conclusion

This paper summarizes the models of the YOLO series and provides a detailed analysis of the critical feature of each model.Afterward, the application of YOLO in diferent scenarios was introduced. Finally, the method of YOLO model compression was briefly described, and an illustration of the application of YOLO in model compression.

Although the YOLO series is the leader in the speed-accuracy balance in the field of target detection, its main work is for the computer side. In further work, how to make YOLO lighter and faster is worth pondering, especially embedded devices such as Nvidia Jetson Nano and Raspberry Pi. Moreover, it will become a trend to carry DNN models on FPGA to make the model run more eficiently. In addition, the technology combining AI and IoT will also bring more convenience to human life. Finally, since the release of YOLOv4, integrating various advanced algorithms has become an essential way to develop the YOLO algorithm. With the development of the YOLO framework, YOLO is more versatile and powerful and will be applied in a broader range of fields.

[1]

Dalal ,

Triggs , Histograms of oriented gradients for human detection , in: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05) , volume 1 , San Diego, CA, USA, 2005 , pp. 886 - 893 . doi:1 0 . 1 1 0 9 / C V P R . 2 0 0 5 . 1 7 7 .

[2]

Cristianini ,

Shawe-Taylor , An introduction to support vector machines and other kernel-based learning methods , Cambridge university press, 2000 .

[3]

Krizhevsky , I. Sutskever,

G. E.

Hinton , Imagenet classification with deep convolutional neural networks , Communications of the ACM 60 ( 2017 ).

[4]

Girshick ,

Donahue ,

Darrell , J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation , in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH , USA, 2014 , pp. 580 - 587 . doi:1 0 . 1 1 0 9 / C V P R . 2 0 1 4 . 8 1 .

[5]

Redmon ,

Divvala ,

Girshick ,

Farhadi , You only look once: Unified, real-time object detection , in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas , NV , USA, 2016 , pp. 779 - 788 . doi:1 0 . 1 1 0 9 / C V P R . 2 0 1 6 . 9 1 .

[6]

Glenn

Jocher and

Ayush

Chaurasia and

Jing

Qiu , Yolo by ultralytics, 2023 . URL: https: //github.com/ultralytics/ultralytics.

[7]

C.-Y.

Wang ,

I.-H.

Yeh , H. -Y. M. Liao , You only learn one representation: Unified network for multiple tasks , arXiv preprint arXiv:2105.04206 ( 2021 ).

[8]

Ge ,

Liu ,

Wang ,

Li ,

Sun , Yolox: Exceeding yolo series in 2021 , arXiv preprint arXiv: 2107 .08430 ( 2021 ).

[9]

Xiao ,

Tian ,

Yu ,

Zhang , S. Liu,

Du ,

Lan , A survey of deep learning-based object detection , Multimedia Tools and Applications 79 ( 2020 ). doi:1 0 . 1 0 0 7 / s 1 1 0 4 2 - 0 2 0 - 0 8 9 7 6 - 6 .

[10] Glenn

Jocher

, Yolov5 by ultralytics, 2020 . URL: https://github.com/ultralytics/yolov5.

[11]

Terven ,

Cordova-Esparza , A comprehensive review of yolo: From yolov1 to yolov8 and beyond , arXiv preprint arXiv:2304.00501 ( 2023 ).

[12]

Zou ,

Chen ,

Shi ,

Guo ,

Ye , Object detection in 20 years: A survey , Proceedings of the IEEE 111 ( 2023 ). doi:1 0 . 1 1 0 9 / J P R O C . 2 0 2 3 . 3 2 3 8 5 2 4 .

[13]

Szegedy , W. Liu,

Jia ,

Sermanet ,

Reed ,

Anguelov ,

Erhan ,

Vanhoucke ,

Rabinovich , Going deeper with convolutions , in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , Boston, MA, USA, 2015 , pp. 1 - 9 . doi:1 0 . 1 1 0 9 / C V P R . 2 0 1 5 . 7 2 9 8 5 9 4 .

[14]

Russakovsky ,

Deng ,

Su ,

Krause ,

Satheesh ,

Ma ,

Huang ,

Karpathy ,

Khosla ,

Bernstein , et al., Imagenet large scale visual recognition challenge , International Journal of Computer Vision 115 ( 2015 ). doi:1 0 . 1 0 0 7 / s 1 1 2 6 3 - 0 1 5 - 0 8 1 6 - y .

[15]

Redmon ,

Farhadi , Yolo9000: Better, faster, stronger, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu , HI , USA, 2017 , pp. 6517 - 6525 . doi:1 0 . 1 1 0 9 / C V P R . 2 0 1 7 . 6 9 0 .

[16]

Simonyan ,

Zisserman , Very deep convolutional networks for large-scale image recognition , arXiv preprint arXiv:1409.1556 ( 2014 ).

[17]

Girshick , Fast r-cnn, in: 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015 , pp. 1440 - 1448 . doi: 1 0 . 1 1 0 9 / I C C V . 2 0 1 5 . 1 6 9 .

[18]

Redmon , A. Farhadi, Yolov3: An incremental improvement , CoRR abs/ 1804 .02767 ( 2018 ). URL: http://arxiv.org/abs/ 1804 .02767. a r X i v : 1 8 0 4 . 0 2 7 6 7 .

[19]

He ,

Zhang , S. Ren,

Sun , Deep residual learning for image recognition , in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas , NV , USA, 2016 , pp. 770 - 778 . doi:1 0 . 1 1 0 9 / C V P R . 2 0 1 6 . 9 0 .

[20] T.-Y. Lin , P.

Dollár , R.

Girshick , K.

He , B.

Hariharan , S.

Belongie , Feature pyramid networks for object detection , in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu , HI , USA, 2017 , pp. 936 - 944 . doi:1 0 . 1 1 0 9 / C V P R . 2 0 1 7 . 1 0 6 .

[21]

Bochkovskiy , C.-Y. Wang, H. -Y. M. Liao , Yolov4: Optimal speed and accuracy of object detection , arXiv preprint arXiv: 2004 . 10934 ( 2020 ).

[22] C.-Y. Wang , H. -Y. Mark Liao , Y. -H. Wu , P.-Y. Chen, J.-W.

Hsieh , I.-H.

Yeh , Cspnet: A new backbone that can enhance learning capability of cnn , in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle , WA, USA, 2020 , pp. 1571 - 1580 . doi:1 0 . 1 1 0 9 / C V P R W 5 0 4 9 8 . 2 0 2 0 . 0 0 2 0 3 .

[23]

Misra , Mish: A self regularized non-monotonic activation function , arXiv preprint arXiv: 1908 . 08681 ( 2019 ).

[24]

He ,

Zhang , S. Ren,

Sun , Spatial pyramid pooling in deep convolutional networks for visual recognition , IEEE Transactions on Pattern Analysis and Machine Intelligence 37 ( 2015 ). doi:1 0 . 1 1 0 9 / T P A M I . 2 0 1 5 . 2 3 8 9 8 2 4 .

[25]

Liu ,

Qi ,

Qin ,

Shi ,

Jia , Path aggregation network for instance segmentation , in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT , USA, 2018 , pp. 8759 - 8768 . doi:1 0 . 1 1 0 9 / C V P R . 2 0 1 8 . 0 0 9 1 3 .

[26]

Woo ,

Park , J.-

Lee ,

I. S.

Kweon , Cbam: Convolutional block attention module , in: Computer Vision - ECCV 2018 , 2018 , pp. 3 - 19 . doi:1 0 . 1 1 0 9 / C V P R . 2 0 1 8 . 0 0 9 1 3 .

[27]

Song ,

Liu ,

Wang , Revisiting the sibling head in object detector , in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , Seattle, WA, USA, 2020 , pp. 11560 - 11569 . doi:1 0 . 1 1 0 9 / C V P R 4 2 6 0 0 . 2 0 2 0 . 0 1 1 5 8 .

[28]

Wu ,

Chen ,

Yuan ,

Liu ,

Wang ,

Li ,

Fu , Rethinking classification and localization for object detection , in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , Seattle, WA, USA, 2020 , pp. 10183 - 10192 . doi:1 0 . 1 1 0 9 / C V P R 4 2 6 0 0 . 2 0 2 0 . 0 1 0 2 0 .

[29]

Tian ,

Shen ,

Chen , T. He, Fcos: Fully convolutional one-stage object detection , in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019 , pp. 9626 - 9635 . doi: 1 0 . 1 1 0 9 / I C C V . 2 0 1 9 . 0 0 9 7 2 .

[30]

Ge ,

Liu ,

Li ,

Yoshie ,

Sun , Ota: Optimal transport assignment for object detection , in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville , TN , USA, 2021 , pp. 303 - 312 . doi:1 0 . 1 1 0 9 / C V P R 4 6 4 3 7 . 2 0 2 1 . 0 0 0 3 7 .

[31]

Li ,

Jiang ,

Weng ,

Geng ,

Li ,

Ke ,

Li , M. Cheng, W. Nie, et al., Yolov6: A single-stage object detection framework for industrial applications , arXiv preprint arXiv:2209.02976 ( 2022 ).

[32]

Ding ,

Zhang , N. Ma, J. Han, G . Ding,

Sun , Repvgg: Making vgg-style convnets great again , in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville , TN , USA, 2021 , pp. 13728 - 13737 . doi:1 0 . 1 1 0 9 / C V P R 4 6 4 3 7 . 2 0 2 1 . 0 1 3 5 2 .

[33] C.-Y. Wang , A. Bochkovskiy , H. -Y. M. Liao , Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors , arXiv preprint arXiv:2207.02696 ( 2022 ).

[34] C.-Y. Wang , H. -Y. M. Liao , I.-H. Yeh , Designing network design strategies through gradient path analysis , arXiv preprint arXiv:2211.04800 ( 2022 ).

[35]

Li ,

Wang ,

Wu ,

Chen ,

Hu ,

Li ,

Tang ,

Yang , Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection , Advances in Neural Information Processing Systems 33 ( 2020 ).

[36]

Yue ,

Li ,

Shimizu ,

Kawamura , L. Meng, Yolo-gd: A deep learning-based object detection algorithm for empty-dish recycling robots , Machines 10 ( 2022 ). doi:1 0 . 3 3 9 0 / m a c h i n e s 1 0 0 5 0 2 9 4 .

[37]

Yue ,

Li ,

Meng , An ultralightweight object detection network for empty-dish recycling robots , IEEE Transactions on Instrumentation and Measurement 72 ( 2023 ). doi:1 0 . 1 1 0 9 / T I M . 2 0 2 3 . 3 2 4 1 0 7 8 .

[38]

Ge ,

Yue ,

Meng , A high-eficiency dirty-egg detection system based on yolov4 and tensorrt , in: 2022 International Conference on Advanced Mechatronic Systems (ICAMechS) , Toyama, Japan, 2022 , pp. 59 - 63 .

[39]

He ,

Zou ,

Chen ,

Liu ,

Miao , Rail transit obstacle detection based on improved cnn , IEEE Transactions on Instrumentation and Measurement 70 ( 2021 ). doi:1 0 . 1 1 0 9 / T I M . 2 0 2 1 . 3 1 1 6 3 1 5 .

[40]

Li ,

Xie ,

Zhang ,

Lu ,

Xie ,

Su ,

Du ,

Hou , Toward eficient safety helmet detection based on yolov5 with hierarchical positive sample selection and box density filtering , IEEE Transactions on Instrumentation and Measurement 71 ( 2022 ). doi:1 0 . 1 1 0 9 / T I M . 2 0 2 2 . 3 1 6 9 5 6 4 .

[41]

Morioka ,

Aravinda ,

Meng , An ai-based android application for ancient documents text recognition , in: Proceedings of the 2021 International Symposium on Advanced Technologies and Applications in the Internet of Things , Virtual, Kusatsu, Japan, 2021 , pp. 91 - 98 .

[42]

Liu ,

Reynolds ,

Huynh , G. Hassan, Study of accurate and fast estimation method of vehicle length based on yolos , in: 2020 IEEE International Conference on Artificial Intelligence and Information Systems (ICAIIS) , Dalian, China, 2020 , pp. 118 - 121 . doi:1 0 . 1 1 0 9 / I C A I I S 4 9 3 7 7 . 2 0 2 0 . 9 1 9 4 9 3 0 .

[43]

Yue ,

Li ,

Meng , Ai-based prevention embedded system against covid-19 in daily life , Procedia Computer Science 202 ( 2022 ).

[44]

Zhuang , G. Liu,

Ding ,

A. N. J.

Raj ,

Qiu ,

Guo ,

Yuan , Cardiac vfm visualization and analysis based on yolo deep learning model and modified 2d continuity equation , Computerized Medical Imaging and Graphics 82 ( 2020 ).

[45]

Liu ,

Xing ,

Xiong , Spatial pyramid block for oracle bone inscription detection , in: Proceedings of the 2020 9th International Conference on Software and Computer Applications , New York, NY, USA, 2020 „ p. 133 - 140 . doi:1 0 . 1 1 4 5 / 3 3 8 4 5 4 4 . 3 3 8 4 5 6 1 .

[46]

Fujikawa ,

Li ,

Yue ,

Aravinda ,

G. A.

Prabhu ,

Meng , Recognition of oracle bone inscriptions by using two deep learning models , International Journal of Digital Humanities ( 2022 ). doi:1 0 . 1 0 0 7 / s 4 2 8 0 3 - 0 2 2 - 0 0 0 4 4 - 9 .

[47]

Li ,

Meng , Model compression for deep neural networks: A survey , Computers 12 ( 2023 ). doi:1 0 . 3 3 9 0 / c o m p u t e r s 1 2 0 3 0 0 6 0 .

[48]

Shi ,

Li ,

Yamaguchi , An attribution-based pruning method for real-time mango detection with yolo network, Computers and Electronics in Agriculture 169 ( 2020 ). doi:1 0 . 1 0 0 7 / s 1 1 2 6 3 - 0 1 4 - 0 7 3 3 - 5 .

[49]

Wu ,

Lv ,

Jiang ,

Song , Using channel pruning-based yolo v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments , Computers and Electronics in Agriculture 178 ( 2022 ). doi:1 0 . 1 0 1 6 / j . c o m p a g . 2 0 2 0 . 1 0 5 7 4 2 .

[50]

Wang ,

Yue ,

Li ,

Meng , A high-eficiency dirty-egg detection system based on yolov4 and tensorrt , in: 2021 International Conference on Advanced Mechatronic Systems (ICAMechS) , Tokyo, Japan, 2021 , pp. 75 - 80 . doi:1 0 . 1 1 0 9 / I C A M e c h S 5 4 0 1 9 . 2 0 2 1 . 9 6 6 1 5 0 9 .

[51]

Wang ,

Li ,

Yue ,

Meng , Design and acceleration of field programmable gate array-based deep learning for empty-dish recycling robots , Applied Sciences 12 ( 2022 ). doi:1 0 . 3 3 9 0 / a p p 1 2 1 4 7 3 3 7 .

[52]

Chen ,

Zhan ,

Wang ,

Zhang , Learning slimming sar ship object detector through network pruning and knowledge distillation , IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 ( 2021 ). doi:1 0 . 1 1 0 9 / J S T A R S . 2 0 2 0 . 3 0 4 1 7 8 3 .

[53]

Xing ,

Chen ,

Pang , Dd-yolo: An object detection method combining knowledge distillation and diferentiable architecture search , IET Computer Vision 16 ( 2022 ). doi: 1 0 .

1 0 4 9 / c v i 2 . 1 2 0 9 7 .

[54]

Liu ,

Zhang , Vehicle detection and ranging using two diferent focal length cameras , Journal of Sensors 14 ( 2020 ). doi:1 0 . 1 1 5 5 / 2 0 2 0 / 4 3 7 2 8 4 7 .

[55]

Liu ,

Ke ,

Lin ,

Xu , et al., Research on pedestrian detection algorithm based on mobilenet-yolo, Computational intelligence and neuroscience 2022 ( 2022 ). doi:1 0 . 1 1 5 5 / 2 0 2 2 / 8 9 2 4 0 2 7 .