Traffic sign recognition using Mask R-CNN
                                Mykola Korablyov1,†, Natalia Axak1,†, Ihor Ivanisenko2,3,*,†, Maksym Kushnaryov1,†
                                and Igor Kobzev4,†

                                1 Kharkiv National University of Radio Electronics, Kharkiv 61166, Ukraine

                                2 University of Jyväskylä, Jyväskylä 40014, Finland

                                3 Kharkiv National Automobile and Highway University, Kharkiv 61002, Ukraine

                                4 Simon Kuznets Kharkiv National University of Economics, Kharkiv 61166, Ukraine


                                                Abstract
                                                Today, intelligent technologies are developing at a rapid pace, which, in turn, leads to the
                                                development of intelligent transport systems. Therefore, constructing traffic sign recognition
                                                systems using machine and deep learning technologies is an urgent task. Traffic sign recognition is a
                                                computer vision problem that can be solved using convolutional neural networks. The
                                                analysis of the most effective models of convolutional neural networks of image processing was
                                                carried out to choose the most suitable one for recognizing traffic signs: R-CNN, Fast R-CNN,
                                                Faster R-CNN, and Mask R-CNN. The analysis showed that applying Mask R-CNN for traffic sign
                                                recognition is appropriate. It effectively detects objects in the image, creates a high-quality
                                                segmentation mask for each instance, and can be used in vehicle systems. Considering issues of
                                                traffic sign recognition using Mask R-CNN, the work consists of implementing relevant stages and
                                                components. The training of Mask R-CNN, which must learn to detect objects in the image and
                                                segment images, is considered. Experimental studies on Mask R-CNN for traffic sign recognition
                                                were conducted, for which a neural network training web application was created. Examples of
                                                training and testing of the work of Mask R-CNN on the recognition of traffic signs are presented,
                                                from which it is clear that Mask R-CNN, based on the trained classes, clearly finds and processes
                                                several traffic signs in the image. This makes it possible to expand the number of classes and
                                                objects for recognition and improve image processing quality.

                                                Keywords
                                                recognition, traffic sign, model, regional convolutional neural networks, learning, segmentation,
                                                dataset, web application


                                MoMLeT-2024: 6th International Workshop on Modern Machine Learning Technologies, May 31 - June 1, 2024,
                                Lviv-Shatsk, Ukraine
                                ∗ Corresponding author.
                                † These authors contributed equally.
                                mykola.korablyov@nure.ua (M. Korablyov); nataliia.axak@nure.ua (N. Axak); ihor.i.ivanisenko@jyu.fi (I.
                                Ivanisenko); maksym.kushnarov@nure.ua (M. Kushnaryov); ikobzev12@gmail.com (I. Kobzev)
                                0000-0002-8931-4350 (M. Korablyov); 0000-0001-8372-8432 (N. Axak); 0000-0002-2679-959X (I.
                                Ivanisenko); 0000-0002-3772-3195 (M. Kushnaryov); 0000-0002-7182-5814 (I. Kobzev)
                                © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction
Object detection in images is a key component of many deep-learning models and has
undergone several significant transformations in recent years. Object detection algorithms
are used in areas such as self-driving vehicles, security cameras, robotics, and almost all
applications that involve visualization, such as medicine, as well as emerging areas such as
self-service stores, cash registers, etc. The main difficulty has been that many applications
require object detection in real time. This difficulty has largely been overcome: a whole set of
methods, models, and algorithms now exists that can be used to detect objects
in real time. A distinct and pressing object detection task is traffic sign recognition.
The number of vehicles on the roads is growing significantly, which leads to
traffic jams and casualties. Improving the efficiency and safety of road traffic is therefore
an extremely important need. The attention of many car manufacturers is now
focused on the creation of unmanned vehicles, which requires a whole
complex of software and hardware solutions, including ones based on artificial
intelligence technologies.
    Every year, the need for automatic traffic sign recognition systems
becomes more and more urgent. These systems are widely used in autopilots and driver
assistants to increase the safety of motor vehicles. They help drivers keep to
speed limits and observe travel and overtaking restrictions, which helps to
significantly reduce accidents on the roads.
    Today, intelligent technologies are developing at a rapid pace, which, in turn, leads to the
development of intelligent transport systems. Unmanned vehicles are already in use in various
countries of the world, with the aim of improving traffic efficiency, significantly
reducing the number of traffic accidents, and easing traffic jams. This is possible
if autonomous vehicles are equipped with navigation aids and a traffic sign recognition
system based on machine and deep learning technologies.
    Traffic signs installed on the sides of the road provide navigational information and
necessary warnings during trips. Therefore, vehicles must be equipped with automated
systems that can analyze traffic events in real time and recognize and interpret traffic
signs placed on the roadsides while driving to ensure traffic safety. Established traffic sign
recognition systems work with text signs or datasets containing clear traffic data, without
regard to operating conditions. Therefore, the construction of traffic sign recognition systems using
machine and deep learning technologies is an urgent task.

2. Analysis of approaches to traffic sign recognition

Image processing methods, models, and algorithms can solve the following tasks [1]:

      Classification of objects.
      Localization of objects.
      Object detection.
      Segmentation.

   When classifying objects, the input data is the image, the output is the class of the object
represented in the image. The number of classes is determined during system design.
   Localization of objects means the determination of the location of the object in the image.
As a rule, the object is highlighted in the image by a rectangle.
   Detection is understood as a set of localization and classification operations performed
sequentially. If several objects are detected in the image, each of them is classified
separately. There are two main approaches to detection:

       Bounding box detection.
       Landmark detection.

   Bounding box detection selects, with a rectangle, the
part of the image in which the object is located; the rectangle is described by the
coordinates of its center, its height, and its width. Segmentation refers to the process of
dividing an image into several segments
by identifying pixels with identical visual characteristics that belong to the same type of
object in the image. The result of the algorithm is a set of segments covering the image.
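
   As a minimal sketch of how these two representations relate, the Python fragment below converts a box given by its center, width, and height into corner coordinates, and derives such a box from a binary segment mask. The function names and the NumPy mask layout (indexed as mask[y, x]) are assumptions made for illustration and are not taken from any of the cited methods.

```python
import numpy as np


def center_to_corners(cx, cy, width, height):
    """Convert (center x, center y, width, height) to (x_min, y_min, x_max, y_max)."""
    return (cx - width / 2.0, cy - height / 2.0,
            cx + width / 2.0, cy + height / 2.0)


def mask_to_center_box(mask):
    """Return the tightest (cx, cy, width, height) box around the nonzero pixels.

    `mask` is assumed to be a 2-D array indexed as mask[y, x].
    Returns None when the mask contains no nonzero pixel.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    # +1 turns the last pixel index into the pixel's outer edge.
    x_min, x_max = int(xs.min()), int(xs.max()) + 1
    y_min, y_max = int(ys.min()), int(ys.max()) + 1
    width, height = x_max - x_min, y_max - y_min
    return (x_min + width / 2.0, y_min + height / 2.0,
            float(width), float(height))
```

   A segment that occupies rows 1 to 3 and columns 2 to 5 of an image, for example, yields a box of width 4 and height 3 centered at (4.0, 2.5).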
   Traffic sign recognition is a computer vision task that involves identifying a specific
object, scene, or other feature in an image. There are two main approaches to image
recognition [2]:

   1.   Feature-based approach.
   2.   Learning-based approach.

   A feature-based approach first extracts certain features from an image, which are
then used to classify the image. Features can be low-level, such as color, texture, or shape,
or high-level, such as context or semantics. A learning-based approach trains a model to
recognize certain objects or scenes using a labeled image dataset.
   Color segmentation is a common method for identifying traffic
signs in real time using simple, inexpensive equipment. In [3], color-based
segmentation is used in the detection stage, and taking into account the differences
between RGB components makes the results more reliable. Color segmentation works when
analyzing images with small or low-resolution traffic signs. After pixel-level color-based
segmentation, traffic signs are classified using Support Vector Machines (SVM)
[4]. In traffic sign recognition, the SVM method is widely used together with other methods
such as decision trees, random forests, and histograms of oriented gradients (HOG) [4-6].
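
   The idea of segmenting by differences between RGB components can be illustrated with a short sketch. The code below marks pixels whose red channel exceeds both the green and the blue channel by a margin, which is one simple way to pick out red sign borders. It is not the procedure of [3]; the function name and the threshold value of 40 are assumptions chosen only for the example.

```python
import numpy as np


def red_dominance_mask(image_rgb, min_diff=40):
    """Return a boolean mask of pixels where red dominates green and blue.

    `image_rgb` is assumed to be an (H, W, 3) uint8 array in RGB order.
    `min_diff` is an illustrative threshold, not a value from the literature.
    """
    # Widen the type first so that channel differences cannot wrap around.
    img = image_rgb.astype(np.int16)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return (r - g >= min_diff) & (r - b >= min_diff)
```

   Working with channel differences rather than absolute values makes the rule less sensitive to overall brightness: a white pixel has high red but no red dominance, so it is rejected.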
   Today, there are many approaches to traffic sign recognition that use machine learning
techniques [7-9]. The general approach to traffic sign recognition consists of two main
stages: detection and classification. A large number of tasks for the detection and
classification of objects in the image are solved using neural networks with different
architectures [10]. Combining different types of neural networks
with other methods makes it possible to achieve high accuracy. Many
different neural network architectures are used for traffic sign recognition. Variants of
Convolutional Neural Networks (CNN) such as R-CNN, Fast R-CNN, and Faster R-CNN have been
adopted most actively, since they eliminate the need for manual feature
extraction [11,12]. It was shown in [12] that using Faster R-CNN for traffic sign recognition
increased the speed of the model compared to Fast R-CNN. To detect objects in the image,
the more recent Mask R-CNN is used [10, 13]; it is an extension of Faster R-CNN
that, in addition to the class label and the bounding box coordinates, provides an object
mask, and it also increases the accuracy of traffic sign recognition.
   Thus, to solve the problem of traffic sign recognition, we will use a model based
on a neural network.

3. Selection of a neural network model for traffic sign recognition
The most common neural network models that are an effective means of object
recognition include CNNs, autoencoders, transformers, global models, etc. It should be noted
that most models based on neural networks offer real-time operation, reliable functionality,
and a high level of accuracy. We will analyze the most effective neural network models
for image processing to choose the most suitable one for recognizing traffic signs.
    Region-based Convolutional Neural Networks (R-CNN) first obtain a set of
regions that probably contain objects to classify and then process these regions with
a convolutional neural network [10]. Representatives of such networks are R-CNN, Fast R-CNN,
Faster R-CNN, and Mask R-CNN, which is one of the latest models in the family of object
detection algorithms [11-13].
    R-CNN accepts an image as input and forms up to 2000 regions of different sizes on it
using a selective search algorithm. A region is a part of an image where there is a high
probability of finding target objects. In the next step, R-CNN uses a large CNN to compute
features for each proposed region. At the final stage, each region is classified by an SVM,
and linear regression refines the coordinates of the object, so that each region is assigned
a class and a bounding box. The disadvantage of the R-CNN model
is its low speed and high energy consumption.
    Since R-CNN runs the CNN on each region independently, the model is slowed down
significantly. To solve this problem, Fast R-CNN runs the CNN only once, on the
entire image, to obtain a feature map, while the selective search algorithm, as before,
supplies proposals for the regions where objects may be located. The obtained features and
regions are then
processed in the region subsampling layer (RoI pooling), in which each region is transformed
from image coordinates to feature map coordinates and pooled into a feature vector of fixed
length.
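
    A minimal NumPy sketch of this pooling step is given below. It assumes a feature map of shape (H, W, C), a region given in image pixels as (x_min, y_min, x_max, y_max), and a fixed downscaling factor between image and feature map; the function name, the stride of 16, and the 2 x 2 output grid are illustrative choices rather than the settings of any cited implementation.

```python
import numpy as np


def roi_max_pool(feature_map, roi, stride=16, out_size=2):
    """Pool one region of a (H, W, C) feature map into a fixed-length vector.

    `roi` is (x_min, y_min, x_max, y_max) in image pixels; `stride` is the
    assumed downscaling factor between the image and the feature map.
    """
    fm_h, fm_w, channels = feature_map.shape
    # Project image coordinates onto the feature map (rounded to whole cells).
    x1, y1, x2, y2 = (int(round(v / stride)) for v in roi)
    x1 = min(max(x1, 0), fm_w - 1)
    y1 = min(max(y1, 0), fm_h - 1)
    x2 = min(max(x2, x1 + 1), fm_w)
    y2 = min(max(y2, y1 + 1), fm_h)
    region = feature_map[y1:y2, x1:x2]

    # Split the region into an out_size x out_size grid of cells.
    row_edges = np.linspace(0, region.shape[0], out_size + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    pooled = np.zeros((out_size, out_size, channels), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            r0, c0 = row_edges[i], col_edges[j]
            r1 = max(row_edges[i + 1], r0 + 1)  # each cell covers >= 1 element
            c1 = max(col_edges[j + 1], c0 + 1)
            pooled[i, j] = region[r0:r1, c0:c1].max(axis=(0, 1))
    return pooled.reshape(-1)  # length: out_size * out_size * channels
```

    Whatever the size of the region, the output always has out_size * out_size * C elements, which is what allows the subsequent fully connected layers to accept it.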
    Each feature vector is fed into fully connected layers (FC), the result of which is then
output to two output layers:

      Softmax – to assess whether an object belongs to a class.
      Regression – to specify the coordinates of the object bounding box.

   The first layer, using the softmax function, estimates the probability that the
object belongs to each class, including a background class.
The second layer outputs real numbers describing the position of the bounding box for each
object. Thus, the main differences of Fast R-CNN are as follows:
      During processing, a set of features is generated for the entire image at once, and not
       for each region; the features of the individual regions are then extracted from it
       using a special layer.
      The SVM and linear regression are replaced by additional layers of the
       neural network itself.
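
   The two output layers can be sketched as follows. The softmax turns raw class scores into probabilities, and the regression output is read here as four offsets (dx, dy, dw, dh) applied to a proposal box, which is the parameterization commonly used in the R-CNN family. The text above does not fix this parameterization, so both it and the function names should be read as assumptions.

```python
import numpy as np


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = np.asarray(logits, dtype=float) - np.max(logits)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def apply_box_deltas(box, deltas):
    """Refine a proposal (x_min, y_min, x_max, y_max) with (dx, dy, dw, dh).

    Assumed convention: dx, dy shift the center in units of the box size;
    dw, dh scale width and height on a logarithmic scale.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * np.exp(dw), h * np.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```

   With all offsets equal to zero the proposal is returned unchanged, so the regression layer only has to learn corrections to the proposals.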

   In Faster R-CNN, the selective search algorithm is replaced by the Region Proposal
Network (RPN) for finding regions, while Fast R-CNN is used for detection. The object detection
algorithm predicts the object category and the offset from the true
bounding box for a large number of generated anchor boxes and then filters them. The
RPN receives features from the CNN as input and, based on them, forms a set of scored
region proposals for the placement of objects. The
Non-Maximum Suppression (NMS) algorithm is then applied, which significantly reduces the
number of regions. The remaining proposals are fed into the Fast R-CNN algorithm. Because
both networks share the same convolutional layers, the speed of operation increases
significantly and the object detection model can work in a mode close to real time.
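
   The filtering step can be sketched as a greedy non-maximum suppression: the highest-scoring box is kept, every remaining box that overlaps it too strongly is discarded, and the procedure repeats. Overlap is measured here by intersection over union (IoU); the threshold of 0.5 and the function name are assumptions for the example, and production detectors typically rely on an optimized library routine instead.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  iterable of (x_min, y_min, x_max, y_max)
    scores: one confidence score per box
    Returns the indices of the kept boxes, best score first.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        ix1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        iy1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        ix2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        iy2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        union = np.maximum(areas[best] + areas[rest] - inter, 1e-12)
        # Keep only boxes that do not overlap the best box too much.
        order = rest[inter / union <= iou_threshold]
    return keep
```

   For two heavily overlapping proposals around the same sign and one proposal elsewhere in the image, only the stronger of the two overlapping boxes and the separate box survive.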
   Among the R-CNN models, the Mask Region-based Convolutional Neural Network (Mask
R-CNN) should be singled out; it has many advantages and effectively detects objects in
the image. Mask R-CNN is an extension of Faster R-CNN
designed for instance segmentation of objects in images. The main idea is
to add a branch to the Faster R-CNN architecture that generates a binary mask for each
detected object. Mask R-CNN predicts the mask covering the detected object, solving
the problem of instance segmentation at the pixel level, which significantly
improves the recognition accuracy.
   Unlike Faster R-CNN, which only finds objects in the image, Mask R-CNN can also create
a high-quality segmentation mask for each instance. That is, Mask R-CNN can not only
determine the location of objects in the image but also accurately outline the shape of each
object. The architecture of Mask R-CNN, which is shown in Figure 1, consists of convolution
layers, Region Proposal Networks (RPNs), and fully connected layers [13].


Figure 1: Mask R-CNN architecture.
   In the Mask R-CNN architecture, two stages can be distinguished:

      Region proposal network for object search.
      Head networks for object classification and segmentation mask prediction.

   The first step in processing the input image is a pre-trained backbone CNN, such as ResNet, which
extracts high-level features from the image that are important for finding the complex
patterns required for object detection. A feature pyramid network (FPN) combines
features from different layers of the backbone into a multi-level feature pyramid that
contains feature maps with different spatial resolutions: low-resolution maps
carry strong semantic information, and high-resolution maps provide more accurate
spatial details of objects.
   The RPN plays a crucial role in identifying potential objects in the image. Using a sliding
window, the RPN scans the feature maps to identify areas that may contain objects. By creating
candidate bounding boxes, the RPN narrows the region of interest. These proposals are then refined
and used in subsequent stages for more detailed analysis. Integrating the RPN into Mask
R-CNN allows processing close to real time and lower computational costs compared to
stand-alone region proposal methods.
   Region of Interest Align (RoI Align) solves the problem of spatial misalignment caused
by the quantization of region coordinates in RoI pooling. RoI Align uses bilinear interpolation to accurately extract
feature maps for each proposed region. This method ensures that the obtained
features exactly match the objects, resulting in more accurate segmentation and
classification. The classification and bounding box regression heads are the Mask R-CNN components
that simultaneously classify objects and refine bounding boxes. For each
region proposed by the RPN, the network predicts an object class, distinguishing between different
types of image objects. In addition to classification, the coordinates of each proposed
bounding box are adjusted, refining its size and position for more accurate coverage of
the object.
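
   The role of bilinear interpolation can be shown with a simplified sketch. Instead of rounding region coordinates to whole feature map cells, the code below samples the feature map at real-valued positions. For brevity it takes a single sample at the center of each output bin and works on one 2-D channel, whereas [13] averages several samples per bin; the function names and the region format (x_min, y_min, x_max, y_max) in feature map coordinates are assumptions.

```python
import numpy as np


def bilinear_sample(feature_map, y, x):
    """Sample a 2-D feature map at a real-valued position (y, x)."""
    h, w = feature_map.shape
    y = min(max(float(y), 0.0), h - 1.0)
    x = min(max(float(x), 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feature_map, roi, out_size=2):
    """Simplified RoI Align: one bilinear sample at the center of each bin.

    `roi` is (x_min, y_min, x_max, y_max) in real-valued feature map
    coordinates; no rounding is applied to it.
    """
    x_min, y_min, x_max, y_max = roi
    bin_w = (x_max - x_min) / out_size
    bin_h = (y_max - y_min) / out_size
    out = np.zeros((out_size, out_size), dtype=float)
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(feature_map,
                                        y_min + (i + 0.5) * bin_h,
                                        x_min + (j + 0.5) * bin_w)
    return out
```

   Because the region boundaries are never rounded, a shift of the proposal by a fraction of a cell changes the extracted features smoothly, which is what makes the pixel-level masks line up with the objects.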
   Using a Fully Convolutional Network (FCN) for each ROI, a binary mask is generated that
outlines the exact shape of the object. Mask prediction is done pixel by pixel, which allows
for detailed and accurate segmentation. This is especially important for tasks that require a
detailed outline of objects, such as the task of recognizing traffic signs.
   Thus, Mask R-CNN detects objects in the image while creating a high-quality
segmentation mask for each instance. Mask R-CNN is easy to train and easy to
generalize to other tasks, for example, estimating human pose within the same framework.
Mask R-CNN makes it possible to obtain not only a set of regions but also
accurate masks for objects in the image. This makes it very effective for tasks where the
accuracy of determining the outline of an object is important.
   All these factors make Mask R-CNN a powerful tool for solving various computer vision
problems, including object segmentation in images.
   The analysis of neural networks for traffic sign
recognition therefore showed that it is appropriate to use Mask R-CNN, which
effectively detects objects in the image, creates a high-quality segmentation mask for each
instance, and can be used in motor vehicle systems.

4. Implementation of traffic sign recognition using Mask R-CNN
The work of Mask R-CNN on the recognition of traffic signs consists of the following main
stages and components:

   1. Region proposal. The RPN selects regions that are likely to contain objects.
   2. Feature extraction. The images and regions selected by the RPN are fed into a
   convolutional neural network for feature extraction.
   3. Main branch. Passes the features on for classification and regression, similar
   to Faster R-CNN.
   4. Mask head. An additional branch that is responsible for generating binary masks for
   objects. This branch has its own convolutional architecture and is used to accurately define the
   shape and position of each object in the image.
   5. Loss function and training. A loss function is used that takes into account
   classification and regression losses as well as the mask generation loss.
   Training a Mask R-CNN can be a challenging task. This is because the neural network
must learn to perform two tasks: object detection in the image and image segmentation.
These tasks are quite complex and the neural network must be large enough to perform
them. Mask R-CNN training consists of the following stages:

   1. Data preparation. In this step, you need to collect a dataset of human-labeled images.
      Annotations should include the coordinates of the contours of objects in the image,
      as well as their class.
   2. Data conversion. At this stage, you need to convert the data into a format that can be
      used for neural network training.
   3. Setting the neural network parameters. At this stage, you need to configure the
      neural network parameters, such as learning rate and batch size.
   4. Neural Network Training. At this stage, the neural network is trained on a set of
      image data.
   5. Experimental evaluation. In this step, you need to evaluate the accuracy of the neural
      network on an image dataset that was not used for training.
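
   For the evaluation stage, one common way to quantify segmentation accuracy is the intersection over union (IoU) between predicted and annotated masks. The paper does not state which metric was used, so the sketch below is only an illustration of the idea; the function names are hypothetical.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match
    return float(np.logical_and(pred, target).sum() / union)


def mean_mask_iou(pairs):
    """Average IoU over (predicted mask, annotated mask) pairs."""
    values = [mask_iou(pred, target) for pred, target in pairs]
    return sum(values) / len(values) if values else 0.0
```

   Applied to held-out images only, such a score indicates how well the learned masks transfer to signs the network has not seen during training.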

   The image dataset for training the Mask R-CNN should include images with the different
objects to be detected. Annotations must be accurate and consistent. If the annotations are
not accurate, the neural network can learn to detect false objects.
   Many different datasets can be used to train a Mask R-CNN. Some of the more common
datasets include:

      COCO. This dataset includes 80,000 images with 80 different object classes.
      PASCAL VOC. This dataset includes 11,000 images with 20 different object classes.
      MS COCO-Stuff. This dataset includes 118,000 images with 171 different object
       classes.

   Once the image dataset is collected, it needs to be converted into a format that can be
used to train the neural network. For this, you can use special tools such as COCO API or
PASCAL VOC API. Learning rate and batch size are two important parameters that need to
be set before starting to train a neural network. The learning rate determines how
quickly the neural network will update its parameters. The batch size determines how many
images will be used for one neural network update. Training a neural network can take a
long time, depending on the dataset size, batch size, and learning rate. Once the neural
network is trained, it needs to be evaluated on an image dataset that was not used for
training. This will help determine how well the neural network performs on images it has
not seen before.
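
   The relationship between dataset size, batch size, and the number of parameter updates, together with the held-out split, can be sketched in a few lines. The validation fraction of 0.2, the fixed seed, and the function names are assumptions made for the example.

```python
import math
import random


def split_dataset(items, val_fraction=0.2, seed=0):
    """Shuffle the items and split them into (training, held-out) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed: reproducible split
    n_val = int(round(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


def updates_per_epoch(n_train, batch_size):
    """Number of parameter updates needed to see every training image once."""
    return math.ceil(n_train / batch_size)
```

   With 100 annotated images and the defaults above, 80 images are used for training and 20 are held out; a batch size of 2 then gives 40 updates per pass over the training set, so halving the batch size doubles the number of updates per pass.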

5. Experimental results
    When using Mask R-CNN, a sequence of actions must be performed. It is necessary to
prepare a dataset with images of the object that needs to be recognized. Importantly, the
more images of an object the dataset contains, with different angles, backgrounds, and
colors, the more accurately the neural network will be able to recognize subsequent images
and objects. Next, it is necessary to annotate the dataset, which consists of the
following steps:
    Step 1 – add many different images of the object, in different positions, under different
lighting, among other objects, and against different backgrounds.
    Step 2 – move on to annotation, i.e. manual determination of object position and label
assignment. This can be done using the open-source project “www.makesense.ai”.
    Step 3 – create a project in the web application and upload the collected dataset with
images.
    Step 4 – create a label that will later be used to determine which class an object found in
the image belongs to.
    Step 5 – manually select objects in the images using the “polygon” tool
so that Mask R-CNN learns from the initial dataset and can recognize similar
objects in images that are not in the dataset.
    Step 6 – after manually selecting an object in the image (in some cases, there may be
several of them in one image), assign the created labels to each selected
object in the image.
    Step 7 – process all subsequent images according to the previous steps.
    Step 8 – after processing all the images in the dataset, export the “JSON”
file in the “COCO” format. With this file, which contains the coordinates of all selected
points in each image, Mask R-CNN can be trained.
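
    To illustrate what such an exported file contains, the sketch below reads a COCO-style annotation structure and groups the labeled polygons by image. It assumes that every annotation stores its outline as flat polygon lists [x1, y1, x2, y2, ...] under the "segmentation" key; the file name, the class name, and the coordinates in the example are invented for illustration.

```python
import json
from collections import defaultdict

# Hypothetical miniature export with one image and one annotated sign.
EXAMPLE_JSON = """
{
  "images": [{"id": 1, "file_name": "road_001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "stop"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1,
     "segmentation": [[100, 50, 140, 50, 140, 90, 100, 90]]}
  ]
}
"""


def polygons_by_image(coco):
    """Map each image file name to a list of (class name, polygon points)."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {im["id"]: im["file_name"] for im in coco["images"]}
    result = defaultdict(list)
    for ann in coco["annotations"]:
        for flat in ann["segmentation"]:
            # Turn [x1, y1, x2, y2, ...] into [(x1, y1), (x2, y2), ...].
            points = list(zip(flat[0::2], flat[1::2]))
            result[files[ann["image_id"]]].append((names[ann["category_id"]], points))
    return dict(result)


def polygon_bbox(points):
    """Bounding box (x, y, width, height) enclosing a polygon."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


coco = json.loads(EXAMPLE_JSON)
annotations = polygons_by_image(coco)
```

    A check of this kind before training helps to catch images without annotations or polygons assigned to the wrong label.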
    To obtain recognition results, a neural network training application was created. First,
all modules and libraries required for Mask R-CNN to work must be downloaded. After compilation,
the system creates a trained Mask R-CNN model. Next, it is necessary to upload the dataset
archive in ".zip" format and the exported annotation file in "JSON" format to the
files of the created project. The web application was used to extract the image archive and
the annotations and to assign values to the variables. The results of
extracting the manually annotated dataset images from the "JSON" file, which was created
using the web service, are presented in Figure 2, which shows the original images of objects
(left) and their images with a neural network mask (right).
   Figure 2: Image of traffic signs with a neural network mask.

   Then, the following actions were performed in two stages. In the first stage, the number
of images was checked and preparatory processing was performed for Mask R-CNN
training.
   In the second stage, the neural network was trained on the prepared
annotations and dataset. In the “logs” folder, files in “.h5” format were created, which are
the results of Mask R-CNN training. Then, for subsequent image processing based on the
trained model for traffic sign recognition, the last trained model, which is the most accurate
of all the previous ones, was loaded.
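
   Locating the last saved model can be sketched as follows. The code searches a log folder recursively for ".h5" files and returns the most recently modified one, on the assumption that the modification time reflects the training order. Loading the weights depends on the Mask R-CNN library in use, which is not named here, so that step is left out; the function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(log_dir) -> Optional[Path]:
    """Return the most recently modified .h5 file under log_dir, or None."""
    candidates = sorted(Path(log_dir).rglob("*.h5"),
                        key=lambda path: path.stat().st_mtime)
    return candidates[-1] if candidates else None
```

   A call such as latest_checkpoint("logs") would then give the path of the weights file to load for recognition.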


Figure 3: Examples of original images (left) and processed images (right)
Examples of testing Mask R-CNN on the recognition of traffic signs
in random images were considered. Figure 3 shows examples of original images of objects with traffic signs
(left) and examples of images processed by the Mask R-CNN network (right). From the given
examples, it can be seen that Mask R-CNN, based on the trained classes, finds and processes
several traffic signs in the image. This makes it possible to expand the number of classes and
objects for recognition and improve the quality of image processing with Mask R-CNN. In
general, Mask R-CNN is a powerful tool for object segmentation, which significantly
improves the capabilities of computer vision systems in various fields of application, in
particular, in traffic sign recognition systems.

6. Conclusions
    Every year, the need for automatic traffic sign recognition systems becomes more and
more urgent. These systems are widely used in autopilots and driver assistants to increase
the safety of motor vehicles. They help drivers keep to speed limits
and observe travel and overtaking restrictions, which helps to significantly reduce
accidents on the roads.
    An analysis of approaches to traffic sign recognition as a task of object detection in the
image was carried out. Since traffic sign recognition is a computer vision task, both a
feature-based approach and a learning-based approach can be used to solve it. The most
effective approach to traffic sign recognition is the use of machine and deep learning
technologies, in particular, convolutional neural networks.
    The analysis of the most effective models of convolutional neural networks of image
processing was carried out to choose the most suitable one for recognizing traffic signs: R-
CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN. Their analysis showed that it is
appropriate to use Mask R-CNN for traffic sign recognition, which effectively detects objects
in the image, creates a high-quality segmentation mask for each instance, and can be used
in vehicle systems.
    The implementation of traffic sign recognition using Mask R-CNN was
considered; its operation is presented as a sequence of relevant stages
and components. Mask R-CNN training is focused on the tasks of image object detection and
image segmentation and consists of the following stages: data preparation, data conversion,
neural network parameter settings, neural network training, and experimental
evaluation of its effectiveness.
    Experimental studies on the use of Mask R-CNN for traffic sign recognition were
conducted, for which a corresponding dataset was prepared. Annotations were performed
for the dataset, consisting of the implementation of the corresponding steps. To obtain
recognition results, a neural network training web application was created, with the help of
which images of traffic signs with a neural network mask were obtained using downloaded
relevant modules and libraries.
    Test examples of Mask R-CNN recognizing traffic signs
in random images were considered. They showed that Mask R-CNN, based on the trained classes, finds and
processes several traffic signs in the image, which makes it possible to expand the number of classes
and objects for recognition and improve image processing quality.

References
[1] R. Archana, P.S. Eliahim Jeevaraj. Deep learning models for digital image processing: a
     review. Artificial Intelligence Review (2024) 57:11. https://doi.org/10.1007/s10462-
     023-10631-z.
[2] R. Szeliski. Computer Vision: Algorithms and Applications, 2nd Edition. Springer
     Nature, (2022) 925.
     https://cord.isir.upmc.fr/pdfs/courses/rdfia/SzeliskiBook_draft.pdf.
[3] A. Ruta, Y. Li, and X. Liu. Real-time traffic sign recognition from video by class-specific
     discriminative features. Pattern Recognition, 43(1) (2010) 416–430. doi:
     10.1016/j.patcog.2009.05.018.
[4] F. Zaklouta, and B. Stanciulescu. Real-time traffic sign recognition using tree classifiers.
     IEEE Transactions on Intelligent Transportation Systems, 13(4) (2012) 1507–1514.
     doi:10.1109/TITS.2012.2225618.
[5] A. Ellahyani, M. El Ansari, and I. El Jaafari. Traffic sign detection and recognition based
     on random forests. Applied Soft Computing, 46 (2016) 805–815. doi:
     10.1016/j.asoc.2015.12.041.
[6] J. Greenhalgh, and M. Mirmehdi. Real-time detection and recognition of road traffic
     signs. IEEE Transactions on Intelligent Transportation Systems, 13(4) (2012) 1498–
     1506. doi:10.1109/TITS.2012.2208909.
[7] H.H. Aghdam, E.J. Heravi, and D. Puig. A practical approach for detection and
     classification of traffic signs using convolutional neural networks. Robotics and
     Autonomous Systems, 84 (2016) 97–112. doi: 10.1016/j.robot.2016.07.003.
[8] M.M. William, P.S. Zaki, B.K. Soliman, K.G. Alexsan, M. Mansour, M. El-Moursy, and K.
     Khalil. Traffic Signs Detection and Recognition System using Deep Learning. Ninth
     International Conference on Intelligent Computing and Information Systems (ICICIS)
     (2019). 160-166. doi:10.1109/ICICIS46948.2019.9014763.
[9] A. Karne, R. Karne, K. K. Vaigandla, and A. Arunkumar. Convolutional Neural Networks
     for Object Detection and Recognition. Journal of Artificial Intelligence Machine
     Learning and Neural Network, vol.3, no 2 (2023) 1-13. doi:10.55529/jaimlnn.32.1.13.
[10] A. Barade, H. Poornachandran, K.M. Harshitha, E.D. Shiloah, R.R.C. Sunil. Automatic
     Traffic Sign Recognition System Using CNN. International Journal of Information
     Retrieval Research, IGI Global, Vol. 12, Iss. 1 (2022) 1-14.
     https://ideas.repec.org/s/igg/jirr00.html.
[11] G. Zhang, Y. Peng, and H. Wang. Road Traffic Sign Detection Method Based on RTS R-CNN
     Instance Segmentation Network. Sensors (2023) 23, 6543.
     https://doi.org/10.3390/s23146543.
[12] Z. Zuo, K. Yu, Q. Zhou, X. Wang, and T. Li. Traffic signs detection based on Faster R-CNN.
     IEEE 37th International Conference on Distributed Computing Systems Workshops
     (ICDCSW) (2017) 286-288. doi:10.1109/ICDCSW.2017.34.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. IEEE International Conference
     on Computer Vision (ICCV) (2017) 2980-2988. doi:10.1109/ICCV.2017.322.