<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>URL Detection based on YOLO Network in Various Conditions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leila Boussaad</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aldjia Boucetta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aymene Zeroual</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wail Ala-eddine Zeroual</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science dept., Batna2 University</institution>
          ,
          <addr-line>Batna, 05000</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LAMIE Laboratory, Computer science dept., Batna2 University</institution>
          ,
          <addr-line>Batna, 05000</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Management Sciences dept., Batna1 University</institution>
          ,
          <addr-line>Batna, 05000</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Object detection is a key technique in computer vision, as it is considered a necessary step in any recognition process. It is the procedure for determining the instance of the class to which an object belongs and estimating its location by displaying its bounding box. It is widely accepted that advances in object detection have generally gone through two periods: "the traditional object detection period", where detection was performed through classical machine learning techniques, and "the deep learning based detection period", where classical machine learning techniques have been completely replaced by methods based on deep neural networks. In this paper, we focus on object detection based on deep learning. The main objective is to carry out a comparative study of three models of the YOLO family that have already proven effective for object detection, namely YOLOv3, YOLOv4, and YOLOv5, in the context of the detection of URLs in photos taken by a mobile phone. The experimental results, expressed in terms of average precision, show the generalization ability of the three models, YOLOv3, YOLOv4, and YOLOv5, as well as the stability of the YOLOv4 model against several difficulties added to the images.</p>
      </abstract>
      <kwd-group>
        <kwd>Object Detection</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Convolutional Neural Networks (CNN)</kwd>
        <kwd>YOLO</kwd>
        <kwd>URL Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Internet. These pages and documents are interconnected</title>
        <p>through hypertext links, which users can click to access
Object detection plays a central role in any recognition information. This information can take various forms,
insystem, encompassing the task of identifying an object’s cluding text, images, audio, and video. To visit a website,
class and estimating its spatial coordinates by delineating a specific page on a site, or more precisely, an "online
a bounding frame around the object. Recent advance- resource" (such as content or an online service), users
ments in deep learning-based object detection have de- can enter its address, known as a Uniform Resource
Lolivered remarkable outcomes. However, the real-world cator (URL), into the browser’s address bar. The URL is
implementation of object detection faces a host of chal- indispensable for pinpointing a particular page within
lenges when confronted with actual images, including the vast sea of billions of web pages. Each web resource
factors like noise, occlusion, lighting fluctuations, rota- possesses a unique URL, which serves as the web address
tions, and others. These elements have a pronounced displayed in your browser.
impact on the precision of object detection and demand To be more eficient and remove the step of entering
thorough scrutiny during the detection process. the URL, especially with increasing processing
capabili</p>
        <p>Conversely, the web has consistently served as a ties such as the availability of smart phones and visual
medium that allows the transfer of data in a simple and input devices such as cameras built into smartphones,
fast way. It counts as a necessary tool in modern life, this process can be divided into three steps: image
acofering a multitude of prospects for both individuals and quisition, URL localization, and URL recognition. In this
large corporations. paper, we will mainly focus on the second step, which</p>
        <p>World Wide Web, often referred to as the Web or plays a crucial role in the localization of URLs in a
capWWW, encompasses all publicly accessible websites and tured image containing text. Object detection methods
pages that users can access on their local devices via the are well suited to accomplish this process. The detection
of a URL can be useful in several fields, particularly in
6th International Hybrid Conference On Informatics And Applied Math- the field of tourism. The tourist can take a picture of a
*emCoartircess,pDoencdeimngbearu6t-h7o,r2.023 Guelma, Algeria URL and view website information without having to
† These authors contributed equally. type on their keyboard. Businesses, store owners, and
$ boussaad.mous@gmail.com (L. Boussaad); their customers can advertise and post information by
boucetta_batna@yahoo.fr (A. Boucetta); leaving URLs to the services they ofer on their ad slots,
aymenezeroual@gmail.com (A. Zeroual); and users can retrieve the URL(s) by clicking a button.
wailalaeddinezeroual@gmail.com (W. A. Zeroual)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License The model can also be used to capture URL references
Attribution 4.0 International (CC BY 4.0).
while listening to a presentation at a conference. Given
the capability of smart phone cameras lately, taking a
picture of a URL in transit should result in a "good
quality" image that can be passed as input to a model, and
the URL(s) of interest will be recovered.</p>
        <p>The primary goal of this study is to implement and
assess the efectiveness of three established models in
the domain of object detection—specifically, YOLOv3,
YOLOv4, and YOLOv5—for the identification of URLs
within images characterized by numerous challenges.</p>
        <p>This includes the application of traditional image
processing techniques like rotation and noise addition, which
exist in real-world scenarios.</p>
        <p>The subsequent sections of the manuscript are
designed as follows: Section II provides a concise overview
of the key concepts underpinning this paper. Section III
outlines the evaluation process we employed. Section IV
is dedicated to the presentation of experimental results
and ensuing discussions. Finally, in Section V, the paper
draws its conclusions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Backgrounds</title>
      <p>Prior to delving into the process and the diverse techniques employed in object detection, it is essential to establish a precise comprehension of object detection itself. Frequently, this term is used interchangeably with techniques like image classification, object recognition, segmentation, and more. Nonetheless, it is imperative to acknowledge that many of the techniques mentioned are distinct tasks typically encompassed within the broader realm of object detection. Treating them as synonyms is inaccurate, as each corresponds to a task of equal importance. Thus, we can distinguish the following computer vision tasks.</p>
      <p>Image classification is about predicting the class of an element in an image, while object localization is about locating the presence of objects in an image and indicating their location using a bounding box (see figure 1 (a)), and object detection is about locating the presence of objects with a bounding box together with the types or classes of the objects located in an image. Figure 1 (b) clearly shows the result of an object detection process in a road scene.</p>
      <p>Another extension of this division of computer vision tasks is semantic image segmentation, where instances of recognized objects are indicated by highlighting the specific pixels of the object. This technique gives a precise location (at the pixel level) of an object through the pixels found; the pixels produced can also be called a mask (see figure 1 (c)). Combining semantic segmentation with object detection leads to instance segmentation, which first detects object instances and then segments each one into detected boxes (in this case called regions of interest). In other words, each object in the image gets its own unique mask, even if there are other objects of the same class (see figure 1 (d)). Figure 1 illustrates these different computer vision tasks.</p>
      <p>
        In this paper, we will focus on object detection, where it is widely accepted that progress in this field has generally crossed two periods: "the traditional object detection period (before 2014)" and "the period of deep learning-based detection (after 2014)" [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        During the traditional object detection period (before 2014), object detection predominantly relied on classical machine learning techniques. Among the notable methods that emerged during this period, three significant approaches are worthy of mention. These include the Viola-Jones detector, originally developed in 2001 by Paul Viola and Michael Jones [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for real-time human face detection, which has demonstrated its utility in diverse applications. The Histogram of Oriented Gradients (HOG), introduced in 2005 by N. Dalal and B. Triggs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], represents an enhancement over SIFT descriptors, shape contexts, and contour orientation histograms; HOG provides a robust, scale-invariant solution. Additionally, there is the Deformable Part Model (DPM) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], initially proposed by P. Felzenszwalb in 2008 as an extension of the HOG detector. DPM introduced a novel strategy involving learning the components of an object and their overall structure.
      </p>
      <p>Despite the success of traditional approaches, the effort required to create effective and efficient detection models remains significant. Therefore, they have been completely replaced by methods based on deep neural networks, resulting in greater accuracy and generalization. In the era of deep learning, object detectors are grouped into two classes: "two-step detection" and "one-step detection". Typically, an object detector solves two successive tasks: finding an arbitrary number of objects (perhaps even zero), and classifying each object and estimating its size with a bounding box. Methods that combine both tasks in one step are called single-step detectors.</p>
      <p>One-stage detectors skip the region proposal step and
perform detection directly on a dense sampling of
locations. They generally consider all positions in the image
as potential objects and try to classify each region of
interest as a background or target object.</p>
      <p>
        Within the realm of single-step object detection
algorithms, a noteworthy mention goes to the YOLO (You
Only Look Once) family of algorithms [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which serves
as the central focus of this investigation. This approach
employs a single, fully trained neural network that
receives an image as input and directly generates
predictions for bounding boxes and their associated class labels.
      </p>
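      <p>As an illustration of this single-pass behaviour, the following minimal sketch (our own, assuming a Python environment with PyTorch and the publicly released generic YOLOv5 weights rather than the models trained in this work) obtains bounding boxes, confidences, and class labels from one forward pass:</p>
      <preformat>
# Minimal sketch: a single forward pass yields boxes, confidences and class labels.
# Assumes torch and the public ultralytics/yolov5 hub entry point; the image path
# is a placeholder.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # generic pretrained detector
results = model("sample_photo.jpg")                       # one inference pass
print(results.xyxy[0])  # rows of [x_min, y_min, x_max, y_max, confidence, class_id]
      </preformat>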
      <p>In the subsequent sections, we provide a concise overview of the three models under consideration.</p>
      <sec id="sec-2-1">
        <title>2.1. YOLOv3 [<xref ref-type="bibr" rid="ref6">6</xref>]</title>
        <p>YOLOv3 is primarily comprised of two key components: a feature extractor and a detector. The initial step involves passing the image through the feature extractor, known as Darknet-53, which processes the image and generates feature maps at various scales. These feature maps are subsequently directed into distinct branches of the detector. The detector's primary function is to process these multiple feature maps at diverse scales, culminating in the creation of output grids that contain objectness scores and bounding boxes. The complete architecture of YOLOv3 is illustrated in Figure 2.</p>
        <p>Darknet-53 integrates both residual blocks and Feature Pyramid Networks (FPNs), as illustrated in Figure 3. Serving as a feature extractor, Darknet-53 accepts single-scale images of arbitrary dimensions as input and yields appropriately scaled multi-level feature maps. This design enables exceptional performance across a broad range of input resolutions.</p>
      </sec>
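      <p>For intuition about these multiple scales, the short sketch below (a back-of-the-envelope illustration of ours, not code from this study) computes the three output grid sizes that YOLOv3 produces for a 416 × 416 input, using its strides of 32, 16, and 8 with three anchors per grid cell:</p>
      <preformat>
# Illustration: YOLOv3's three detection scales for a 416x416 input.
input_size = 416
anchors_per_cell = 3
for stride in (32, 16, 8):
    g = input_size // stride
    print(f"stride {stride}: {g}x{g} grid -> {g * g * anchors_per_cell} candidate boxes")
      </preformat>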
      <sec id="sec-2-1">
        <title>Developed by Ultralytics in 2020, this development</title>
        <p>
          Developed in 2020 by Alexei Bochkovsky, the YOLOv4 ar- marked a substantial enhancement in facilitating
realchitecture features CSPDarknet53 as its backbone, which time object detection. The shift from Darknet to PyTorch
builds upon the foundation of DarkNet-53. It incorpo- as a framework played a pivotal role in this improvement.
rates a CSPNet strategy, as referenced in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], to parti- Darknet, known for its complexity in configuration and
tion the base layer’s feature map into two segments and limited production readiness, was surpassed by PyTorch,
subsequently reunite them through a multi-step hierar- leading to significant reductions in both training and
chy. This division and reunification approach facilitates prediction times.
more degraded flow within the network. Following the The model’s architecture bears similarities to YOLOv4,
backbone, YOLOv4 adopts PANet, as cited in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], as a incorporating CSPDarknet53 as the backbone, SPP and
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>PANet for the neck, and employing YOLOv3 as the head, as illustrated in Figure 5. models’ generalization aptitude across distinct usage scenarios featuring diverse content.</title>
        <sec id="sec-2-2-1">
          <title>3.1. Creating and preparing the Training and Testing Datasets</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>The chosen object detection algorithms are rooted in</title>
        <p>deep learning, and their intricate architecture
necessitates training on specific datasets to attain the desired
objectives. The dataset plays a pivotal role in influencing
the models’ performance; thus, it is imperative to have a
robust dataset in order to achieve optimal performance.</p>
        <p>In the context of this research, our focus lies on the
application of various models for the detection of URLs.
A challenge surfaced during the preliminary phase of our
study, as no standardized database was readily available
for conducting a comprehensive evaluation and
comparison of these models. In response, we undertook the task
of creating our own dataset, comprising a total of 160
images, all of which feature URLs.</p>
        <p>The URL starts with three consecutive letters ’w’ and
a dot, followed by a label. The label is a series of English
letters from a to z (not case-sensitive) and can also
contain digits from 0 to 9. Hyphens can be added, but not
at the beginning or at the end, and adding more than
one consecutively is not allowed. The label length is
between 3 and 63 characters maximum. In the end, after a
point, an extension is added. The most used extensions
are ".com", ".net" and ".org". This part can be called the
domain name.The URL can start with a protocol such
as http://, but modern web clients like browsers
automatically add the protocol before the URL if it doesn’t
contain one. The URL can also contain, after the
extension name, more data, such as the filename /index.html
or subdirectories like /dir1/dir2 (see figure 6).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology Steps Description</title>
      <sec id="sec-3-1">
        <title>This section ofers an in-depth clarification of the evalu</title>
        <p>ation methodology employed in this study. The research For our image dataset, we created random tag names
involves a comparative analysis of three deep models according to the previously listed conventions with
within the YOLO family: YOLOv3, YOLOv4, and YOLOv5. the defined extension added at the end, which
These models are single-stage detectors recognized for is:.com,.net,.org,.fr,.dz,.ca,.uk. These extensions are
their fast, high-accuracy detection capabilities and are widely spread, especially in our region. We added a few
ideally suited for deployment on low-end systems, such URLs with additional data at the end, but for the majority
as embedded platforms. of images, we focused heavily on the domain name (label</p>
        <p>We commence this section by delineating the primary and extension).
stage, which entails the creation and preparation of the Object detection techniques exclusively process
pixeltraining dataset. Subsequently, we delve into the train- level data, which implies that they perceive distinct
variaing process for the three selected models. The section tions between the letter ’A’ in one font style and the same
culminates with a series of tests conducted on images letter ’A’ in a diferent font style. Moreover, there can be
presenting various challenges, aimed at assessing the substantial disparities between a handwritten letter and
its printed equivalent, despite both conveying the same
semantic meaning through diferent visual
representations (see figure 7). So, as a starting point for creating the
dataset, the printed URLs are written using the popular
Arial font. and for color, black is chosen.
• For YOLOv5, the medium model is chosen,
balancing precision and speed more efectively. The
input image dimensions are set to 640× 640 with 30
epochs. Remarkably, unlike its predecessors, this
model required just 15 minutes to complete the
training process, showcasing exceptional speed.</p>
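        <p>As shown in the sketch below (an illustrative reconstruction of ours, not the exact script used in this work; function names and the normalised box values are placeholders, since the actual boxes were drawn manually), such random domain names and their YOLO-format ground-truth lines can be produced in a few lines of Python:</p>
        <preformat>
# Sketch: generate a random domain name following the conventions above and
# emit one YOLO-format annotation line. Box values are illustrative placeholders.
import random
import string

EXTENSIONS = [".com", ".net", ".org", ".fr", ".dz", ".ca", ".uk"]

def random_label(min_len=3, max_len=12):
    """Letters and digits with optional single hyphens, never leading, trailing or doubled."""
    length = random.randint(min_len, max_len)
    chars = string.ascii_lowercase + string.digits
    label = [random.choice(chars)]
    while len(label) != length:
        c = random.choice(chars + "-")
        if c == "-" and (label[-1] == "-" or len(label) == length - 1):
            continue  # skip doubled or trailing hyphens
        label.append(c)
    return "".join(label)

def random_url():
    return "www." + random_label() + random.choice(EXTENSIONS)

def yolo_annotation(class_id, x_center, y_center, width, height):
    """One ground-truth line: class_id x_center y_center width height, normalised to [0, 1]."""
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

print(random_url())                              # e.g. www.example-shop.dz
print(yolo_annotation(0, 0.5, 0.42, 0.6, 0.08))  # single 'URL' class, id 0
        </preformat>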
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model training</title>
        <p>The models are trained on 80% (127 images) of all the data. The training is carried out in a Google Colab environment. The training parameters are as follows:</p>
        <p>• In the case of YOLOv3, the input images are set to dimensions of 416 × 416, and the training covers 30 epochs, accumulating a total training duration of 7 hours.</p>
        <p>• In the case of YOLOv4, the input images are configured at dimensions of 608 × 608, and the training persisted for 30 epochs, amounting to a total training duration of 7 hours.</p>
        <p>• For YOLOv5, the medium model is chosen, balancing precision and speed more effectively. The input image dimensions are set to 640 × 640, with 30 epochs. Remarkably, unlike its predecessors, this model required just 15 minutes to complete the training process, showcasing exceptional speed.</p>
        <p>The training of YOLOv3 and YOLOv4 is executed within the Darknet framework, whereas YOLOv5 is trained using PyTorch. In all cases, official pre-trained weights are chosen to apply transfer learning. The final weights of the models are transferred to the local machine to be used in the evaluation phase.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Model evaluation, results and discussion</title>
      <sec id="sec-4-1">
        <title>The majority of global printing is performed on A4 pa</title>
        <p>per, and thus, our dataset will exclusively feature sample
URLs that have been printed on A4 paper. These URLs
will be juxtaposed with randomly generated text writ- The evaluation is conducted through a two-part process.
ten in various languages, encompassing Latin, Arabic, In the initial stage, we assess the models’ capacity for
genChinese, Russian, and Indian scripts. eralization in URL detection by considering their overall</p>
        <p>After the generation of numerous simulated images, performance, which involves the utilization of the entire
the next step consists of a manual labeling process. Dur- test dataset. In the subsequent stage, we subject the three
ing this phase, each image’s linked URL box is defined models to testing under various conditions commonly
enand manually set. Subsequently, these annotations are countered in photos taken with mobile phones, thereby
stored using the YOLO image annotation format, wherein evaluating their stability.
each image corresponds to an individual text annotation In the field of object detection, the evaluation of model
ifle, denoted by the same name as the image itself. Each performance relies on several crucial metrics that provide
line within the annotation file serves to define a ground- a comprehensive assessment of a model’s object detection
truth object present within the image, represented like capabilities. In the context of this work, we have utilized
this: the defined metrics below:</p>
        <p>"&lt;  &gt;&lt;  &gt;&lt;  &gt;&lt; ℎ &gt;&lt; The Intersection over Union (IoU), also known as the
ℎℎ &gt;" Jaccard Index, quantifies the similarity between predicted</p>
        <p>This dataset format is compatible exclusively with bounding boxes and actual bounding boxes. Formally,
Darknet-based versions of YOLO, namely YOLOv3 and IoU equals the intersection between the real and
preYOLOv4, and is not compatible with YOLOv5. To accom- dicted bounding boxes divided by their union. The figure
modate YOLOv5, we adopted the Robooflw web platform, 8 clearly illustrates this concept of IoU. IoU ranges from 0
which serves as a comprehensive solution for hosting, to 1; the closer the actual and predicted bounding boxes,
annotating, and converting datasets across diverse for- the closer the IoU measure is to 1 (see Figure 9).
mats.</p>
        <sec id="sec-4-1-1">
          <title>3.2. Model training</title>
        </sec>
      </sec>
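      <p>A compact sketch of this computation for axis-aligned boxes, written by us for illustration (boxes are assumed to be given as x_min, y_min, x_max, y_max, which the paper does not specify), is:</p>
      <preformat>
# Sketch: Intersection over Union for two axis-aligned boxes (x_min, y_min, x_max, y_max).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((10, 10, 60, 40), (30, 20, 80, 50)))  # partial overlap -> 0.25
      </preformat>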
      <sec id="sec-4-2">
        <title>The models are trained on 80 % (127 images) of all the data.</title>
        <p>The training is carried out in a Google Colab environment.</p>
        <p>The training parameters are:
• In the case of YOLOv3, the input images are set to
dimensions of 416 × 416, and the training covers
30 epochs, accumulating a total training duration Figure 8: Intersection over Union (IoU) metric.
of 7 hours.
• In the case of YOLOv4, the input images are con- Precision and recall. Precision assesses the
proporifgured at dimensions of 608 × 608, and the train- tion of correct predictions among all positive predictions,
ing persisted for 30 epochs, amounting to a total while recall measures the proportion of true positives
training duration of 7 hours. identified among all actual objects.</p>
      </sec>
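      <p>To make the use of these metrics concrete, the following sketch (our own simplified greedy matching, not the exact evaluation code of this study) computes precision and recall from a set of detections at the IoU threshold of 0.5 used later in this work, reusing the iou() helper sketched above:</p>
      <preformat>
# Sketch: precision and recall at a fixed IoU threshold, with a simplified greedy match.
def precision_recall(detections, ground_truths, iou_threshold=0.5):
    """detections: list of (box, confidence); ground_truths: list of boxes."""
    matched = set()
    true_positives = 0
    for box, confidence in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_gt = 0.0, None
        for i, gt in enumerate(ground_truths):
            if i in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, i
        if best_gt is not None and best_iou >= iou_threshold:
            true_positives += 1
            matched.add(best_gt)
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    return precision, recall
      </preformat>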
      <sec id="sec-4-3">
        <title>Finally, the precision-recall curve illustrates the tradeof between precision and recall for diferent confidence thresholds, providing an overall view of the model’s performance across a range of confidence thresholds.</title>
        <sec id="sec-4-3-1">
          <title>4.1. Generalization ability evaluation</title>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>The evaluation involves a subset of 20% of the complete</title>
        <p>dataset, comprising 33 images. We have selected an IoU
(Intersection over Union) threshold of 0.5 for this
assessment. Results are depicted in figure ??.</p>
        <sec id="sec-4-4-1">
          <title>4.2. Evaluation in diferent conditions</title>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>In this section, we assess the model’s performance using images that present challenges not encountered in the training dataset. These challenges include:</title>
      </sec>
      <sec id="sec-4-6">
        <title>In the evaluation of our object detection models, dis</title>
        <p>tinct performance characteristics emerge. YOLOv3, for
instance, achieved an average accuracy of 70.53%.
Notably, precision begins to decline once recall surpasses
75%.</p>
        <p>Conversely, YOLOv4 demonstrates a notably higher The table below showcases the models’ performance as
average accuracy of 90.91%, with a subsequent drop in measured by the average precision (AP) for each dificulty
precision observed after reaching a recall rate of 98%. As category.
for YOLOv5, it achieves an average accuracy of 88.63%, From results presented in Table 1, we can draw the
with precision exhibiting a decrease once recall exceeds following conclusions:
91%.</p>
        <p>Despite the limited dataset size, the above observations 4.2.1. Testing with distinct typestyles:
underscore the model’s ability to generalize efectively.
• Distinct typestyles, such as Algerian, Bradley</p>
        <p>Hand ITC, and Jokerman.
• Diferent background colors and character colors.
• Rotation of images of 90° and 180°.
• URLs prefixed with the https:// protocol tag.
• Handwritten URL characters.</p>
        <p>• Images with Gaussian Noise.</p>
      </sec>
      <sec id="sec-4-7">
        <title>This challenge had a relatively minor impact on the models’ performance, as all three models demonstrated</title>
        <p>closely aligned average accuracies. YOLOv3 yielded
the lowest average accuracy of 72.7%, while YOLOv5
achieved the highest average accuracy of 88.07%, and
YOLOv4 84.2%. Furthermore, figures 13, 14, and 15
provide a comprehensive representation of URLs successfully
identified within text samples rendered in three diferent
fonts.
4.2.2. Testing with diferent color polices and
backgrounds:
This challenge posed a significant impact on the models’
performance. YOLOv3, for instance, exhibited the
inability to detect URLs against a colored background, yet it
demonstrated success in detecting URLs with colored
characters, achieving an average accuracy of 31.71%. In
contrast, YOLOv4 emerged as the top-performing model
in this context, attaining an average accuracy of 54.44%.
YOLOv4 managed to successfully detect all URLs with
colored characters and a majority of URLs against
colored backgrounds. Conversely, YOLOv5 displayed the
weakest performance with an average accuracy of 27.27%,
struggling to detect URLs with colored objects, as well as
those against colored backgrounds. Additionally, figures
16, 17, and 18 provide a visual representation of the
identiifcation of URLs within text samples that feature colored
characters and are placed against colored backgrounds.
4.2.3. Testing with diferent image rotations (90 ° et
180°):</p>
      </sec>
      <sec id="sec-4-8">
        <title>Rotating the images by 180° had no notable influence</title>
        <p>on the models’ performance. However, when rotated by
90°, the models’ performance was significantly decreased,
with all three models achieving an average accuracy of
0%. as evident in figures 19, 20, and 21.
4.2.4. URL prefixed with the http:// protocol tag:
In the case of URLs prefixed with the http:// tag, the
models encountered dificulties in their detection, with
some models failing to identify the complete URL,
recognizing only the domain name. YOLOv3 exhibited an
average accuracy of 12.12%, YOLOv4 outperformed the
others with the highest accuracy at 28.9%, and YOLOv5
achieved an accuracy of 13.64%. Examples of detections
for this particular challenge are illustrated in Figures 22,
23 and 24.
4.2.5. Handwritten URL characters:</p>
      </sec>
      <sec id="sec-4-9">
        <title>In this case, all models were unable to identify the URL, resulting in an average accuracy of 0% across the board. Figure 25 provides an image depicting a sheet with var</title>
        <p>ious handwritten URLs that remained unrecognized by
all three models.
4.2.6. Gaussian noise addition:</p>
      </sec>
      <sec id="sec-4-10">
        <title>The introduction of noise had a profound impact on the</title>
        <p>models’ accuracy, leading to the detection of false objects.
YOLOv3’s accuracy dropped to 36.36%, and YOLOv4
also experienced a decrease, with an accuracy of 37.36%.
In contrast, YOLOv5 achieved the highest accuracy of
83.98% under these conditions (see figures 26, 27, and 28).</p>
      </sec>
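      <p>The rotated and noisy test variants discussed above can be reproduced with standard image operations; the sketch below is our own illustration using NumPy and Pillow, with placeholder file names and an illustrative noise level, since the paper does not specify these parameters:</p>
      <preformat>
# Sketch: produce rotation and Gaussian-noise variants of a test image.
import numpy as np
from PIL import Image

def rotate(image_path, angle):
    """Rotate by 90 or 180 degrees, expanding the canvas so nothing is cropped."""
    return Image.open(image_path).rotate(angle, expand=True)

def add_gaussian_noise(image_path, sigma=25.0):
    """Add zero-mean Gaussian noise to every pixel and clip back to [0, 255]."""
    pixels = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32)
    noisy = pixels + np.random.normal(0.0, sigma, size=pixels.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

rotate("url_sample.jpg", 90).save("url_sample_rot90.jpg")
add_gaussian_noise("url_sample.jpg").save("url_sample_noisy.jpg")
      </preformat>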
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <sec id="sec-5-1">
        <title>In this study, we investigated a very interesting topic</title>
        <p>in the field of computer vision, specifically focusing on
object detection, a pivotal stage in recognition processes.
Our primary goal was to assess the generalization
capability and robustness of three distinct models—YOLOv3,
YOLOv4, and YOLOv5—in the context of URL detection
within mobile phone-captured images.</p>
        <p>The experimental results, expressed in terms of
average precision, allowed us to deduce the following
conclusions:</p>
        <p>The three models gave very satisfactory generalization
results, and the best is YOLOv4.</p>
        <p>Concerning stability for several dificulties, the 3
models did not completely recognize URLs rotated by a
90° rotation angle, where the average precision achieved
is 0:0%. Also, for handwritten URLs, all three models
provided an average accuracy of 0.0%.</p>
      </sec>
      <sec id="sec-5-2">
        <title>To improve these results, we propose to:</title>
        <p>• increase the size of the dataset.
• Augment the image set with images containing
diferent dificulties for the training dataset.
• Test other versions of the YOLO family, even
other models of the two-stage detector family.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <article-title>Object detection in 20 years: A survey</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>111</volume>
          (
          <year>2023</year>
          )
          <fpage>257</fpage>
          -
          <lpage>276</lpage>
          . doi:
          <volume>10</volume>
          .1109/JPROC.
          <year>2023</year>
          .
          <volume>3238524</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>Rapid object detection using a boosted cascade of simple features</article-title>
          ,
          <source>in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR</source>
          <year>2001</year>
          , volume
          <volume>1</volume>
          ,
          <year>2001</year>
          , pp.
          <source>I-I. doi:10</source>
          .1109/CVPR.
          <year>2001</year>
          .
          <volume>990517</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Triggs</surname>
          </string-name>
          ,
          <article-title>Histograms of oriented gradients for human detection</article-title>
          ,
          <source>in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)</source>
          , volume
          <volume>1</volume>
          ,
          <year>2005</year>
          , pp.
          <fpage>886</fpage>
          -
          <lpage>893</lpage>
          vol.
          <volume>1</volume>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2005</year>
          .
          <volume>177</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Felzenszwalb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McAllester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <article-title>Object detection with discriminatively trained part-based models</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>32</volume>
          (
          <year>2010</year>
          )
          <fpage>1627</fpage>
          -
          <lpage>1645</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2009</year>
          .
          <volume>167</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You only look once: Unified, real-time object detection</article-title>
          ,
          <source>in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2016</year>
          .
          <volume>91</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Farhadi,</surname>
          </string-name>
          <article-title>Yolov3: An incremental improvement</article-title>
          , arXiv preprint arXiv:
          <year>1804</year>
          .
          <volume>02767</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          , C.-Y. Wang, H.
          <string-name>
            <surname>-Y. M. Liao</surname>
          </string-name>
          ,
          <article-title>Yolov4: Optimal speed and accuracy of object detection</article-title>
          , arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>10934</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-Y. Mark Liao</surname>
          </string-name>
          , Y.
          <string-name>
            <surname>-H. Wu</surname>
            , P.-Y. Chen,
            <given-names>J.-W.</given-names>
          </string-name>
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>I.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Yeh</surname>
          </string-name>
          ,
          <article-title>Cspnet: A new backbone that can enhance learning capability of cnn</article-title>
          ,
          <source>in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1571</fpage>
          -
          <lpage>1580</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPRW50498.
          <year>2020</year>
          .
          <volume>00203</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Path aggregation network for instance segmentation</article-title>
          ,
          <source>in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>8759</fpage>
          -
          <lpage>8768</lpage>
          . doi:
          <volume>10</volume>
          . 1109/CVPR.
          <year>2018</year>
          .
          <volume>00913</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Spatial pyramid pooling in deep convolutional networks for visual recognition</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>37</volume>
          (
          <year>2015</year>
          )
          <fpage>1904</fpage>
          -
          <lpage>1916</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2015</year>
          .
          <volume>2389824</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>