YOLOv3-based Mask and Face Recognition Algorithm for Individual Protection Applications

Roberta Avanzato (a), Francesco Beritelli (a), Michele Russo (b), Samuele Russo (c) and Mario Vaccaro (b)

(a) Department of Electrical, Electronic and Computer Engineering, University of Catania, Catania, CT, Italy
(b) VICOSYSTEMS S.r.l., V.le Odorico da Pordenone 33, Catania, CT, Italy
(c) Sapienza University of Rome, Piazzale Aldo Moro 5, Roma, Italy

ICYRIME 2020: International Conference for Young Researchers in Informatics, Mathematics, and Engineering, Online, July 09 2020
roberta.avanzato@phd.unict.it (R. Avanzato); francesco.beritelli@dieei.unict.it (F. Beritelli); m.russo@vicosystems.it (M. Russo); samuelerussoct@gmail.com (S. Russo); m.vaccaro@vicosystems.it (M. Vaccaro)

Abstract
To combat the spread of the COVID-19 pandemic, it is essential to strictly obey social distancing measures, as well as to possess and wear personal protective equipment. This paper proposes a mask and face recognition algorithm based on YOLOv3 for individual protection applications. The proposed method processes images directly in raw data format as input to a neural network trained with deep learning techniques. System training was performed on a set of images obtained from the MAFA dataset by selecting those with surgical masks, for a total of about 6,000 cases. The performances obtained indicate 84% accuracy in recognizing a mask and 96% in the case of a face.

Keywords
Image processing, Face recognition, Mask recognition, Computer vision, Deep learning

1. Introduction

In an emergency phase, the fight against the spread of COVID-19 contamination is regulated by procedures of medical-scientific rigor and by official protocols adopted as regulations until the epidemic is definitively defeated on a global scale. For the return to normality, which is expected to be gradual and of medium-long duration, it is essential to strictly obey social distancing measures, as well as to possess and wear personal protective equipment for those who continue to work in potentially contagious environments. Thus, it becomes strategic to focus on solutions that can remotely and non-intrusively monitor people's behaviour and health, while ensuring respect for privacy.
One solution is represented by innovative video intelligence technologies for the automatic detection of body temperature and of the proximity distance between individuals, in order to guarantee, and possibly certify, in outdoor or indoor environments, compliance with the regulations on the distance between individuals (and/or the maximum capacity of a given environment), access to indoor environments only for individuals without critical health conditions, and, possibly, where necessary, compliance with the restrictions on individual protection (masks, gloves, overalls, etc.). There are several important advantages: the safeguard of people's health, the mitigation of the risk of a return of contamination, the possibility of timely interventions by the law enforcement engaged in preserving public health orders, as well as a safe and fast return to work.

The key issues forming the basis of the proposal described in this paper are the following:

• need for social distancing outdoors (streets, squares, parks, etc.) and indoors (offices, schools, shopping centers, theaters, restaurants, pubs, shops, etc.);
• need to manage quotas for access and use of public areas and public carriers;
• need for timely notification of gatherings to the managers of the frequented areas and, in the most serious cases, to the law enforcement, possibly via the certification of critical events;
• need to monitor the state of health (by checking the temperature) of people who access an indoor environment;
• need to monitor compliance with the use of protective equipment (masks, gloves, overalls), especially in the most at-risk work contexts.

The last point is the one the present study focuses on, by proposing a mask/face recognition algorithm. In the state of the art there are many studies dealing with face recognition and, in particular, with the recognition of masked faces.

In [1] the authors propose a masked face detection technique useful for monitoring and identifying criminals or terrorists. They propose a CNN-based cascade framework, which consists of three carefully designed convolutional neural networks to detect masked faces. The accuracy in recognizing masked faces is 87.8%. In [2] the authors propose a further method of identifying masked faces based on the LLE-CNN network and the MAFA database [3]. With this approach, the authors achieved a performance of 76.4%. The authors in [4] address the importance of greater accuracy in face recognition during the COVID-19 period. The study proposes a face-eye based multi-granular recognition model, with which the accuracy of masked face recognition rises from the initial 50% to 95%.

In the present study, a mask/face recognition technique based on a high-performing convolutional neural network, YOLOv3, is proposed. This method makes it possible to detect and classify "faces" and "masks" within videos and/or images.

The paper is structured as follows: Section 2 describes the proposed method; Section 3 illustrates the neural network used; Section 4 describes the database used; Section 5 shows the performances obtained by the proposed technique; the last section is dedicated to conclusions.

2. Proposed Method

This section describes the process of detecting masks and faces. Figure 1 shows the block diagram of the proposed technique.

Figure 1: Block diagram of the proposed method.

The first block represents the acquisition of the video signal by means of cameras, which can be installed in indoor or outdoor environments. Once the video signal is acquired, a pre-processing phase is performed (Video processing block), which is responsible for extracting the frames at a frame rate of 30 fps. Subsequently, the frames are fed into the previously trained YOLOv3 neural network. The network output provides a detection and classification confidence (percentage) for the faces and masks present in the input frames.
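As an illustration of this acquisition and pre-processing pipeline, the following Python sketch wires a camera stream, frame extraction and a detector call together with OpenCV. It is a minimal sketch under stated assumptions, not the authors' implementation: the function detect_faces_and_masks is a hypothetical placeholder for the trained YOLOv3 model.

```python
# Illustrative sketch of the acquisition/pre-processing pipeline of Section 2.
# Assumption: detect_faces_and_masks() is a hypothetical wrapper around the trained
# YOLOv3 network returning (label, confidence, (x, y, w, h)) tuples per frame.
import cv2

def detect_faces_and_masks(frame):
    # Placeholder for the trained YOLOv3 detector; one possible OpenCV-DNN-based
    # implementation is sketched in the next section.
    return []  # list of (label, confidence, (x, y, w, h)) tuples

def process_stream(source=0):
    cap = cv2.VideoCapture(source)      # camera index or video file path
    cap.set(cv2.CAP_PROP_FPS, 30)       # target frame rate of 30 fps
    while True:
        ok, frame = cap.read()          # extract one frame from the video signal
        if not ok:
            break
        for label, confidence, (x, y, w, h) in detect_faces_and_masks(frame):
            # Draw the bounding box with the class and confidence returned by the network
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, f"{label} {confidence:.2f}", (x, y - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        cv2.imshow("mask/face detection", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```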
3. Adopted Neural Network

The application of artificial intelligence and machine learning algorithms turns out to be a very complex approach if the problems requiring a solution are not clearly highlighted [5, 6, 7, 8, 9, 10, 11, 12, 13]. In this study, we are interested in recognizing the face, and any mask worn, of the various people present in the video recordings.

The theme of face and mask recognition falls within the subject of object detection. Object detection is the basis of computer vision, and specifically of applications such as instance segmentation, image captioning and object detection/tracking. From an application point of view, it is possible to group object detection into two macro-categories:

• General object detection: the goal is to investigate methods for identifying different types of objects using a single framework, in order to simulate human vision and cognition;
• Detection applications: the recognition of objects of a certain class in specific application scenarios, for example pedestrian detection, face detection or text detection.

Several models implementing object recognition are present in the state of the art. One of them is Faster R-CNN [14], which represents the current state of the art among models that divide the task of identifying objects into several phases. This network makes it possible to simultaneously train a recognizer and a bounding box predictor within a single model. The procedure carried out by this network is of the "proposal detection and verification" type.

A second model is YOLO (You Only Look Once). In [15, 16] the authors completely abandon the pre-existing "proposal detection and verification" paradigm. Instead, YOLO follows a different philosophy: applying a single model to the entire image. YOLO, in fact, divides the image into regions, predicts the bounding boxes and, for each of them, determines the probabilities of belonging to a certain class, all using a single network.

In [17] the authors define the SSD (Single Shot Detector) model. This method has greatly contributed to the change of perspective towards the generation of bounding boxes: unlike the previous models, which were concerned with accurately predicting the location of an object within the image, SSD starts from a default set of bounding boxes; from this set, a deviation and a classification are predicted for each box. Thanks to a set of operations and filters, SSD also obtains excellent accuracy in the prediction of object classes.

In order to make an exhaustive comparison of the various convolutional models presented, and to maintain a certain consistency in the results, it was decided to use the work done in [18] as a framework to compare the performances. In that study, the authors indicate that YOLOv3 is clearly superior to the other CNNs, both in terms of computational time and of accuracy. However, it should be noted that Fast R-CNN, despite the huge gap in terms of computational time, allows, among other things, very accurate segmentations (polylines) to be identified, when compared with the "simple" bounding boxes provided by YOLOv3 or SSD. Therefore, depending on the specific application context, there may be cases in which Fast R-CNN is the optimal solution.
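To make the single-pass YOLO philosophy concrete, the sketch below runs a YOLOv3 model through OpenCV's DNN module and turns the raw grid predictions into class-labelled boxes. The configuration/weight file names and the two class labels are assumptions chosen to match this paper's task; the paper does not specify how its YOLOv3 model was deployed, so this is an illustration rather than the authors' code.

```python
# Minimal YOLOv3 inference sketch using OpenCV's DNN module.
# Assumption: "yolov3-mask-face.cfg" / ".weights" are hypothetical files for a network
# trained on the two classes used in this paper ("mask", "face").
import cv2
import numpy as np

CLASSES = ["mask", "face"]
net = cv2.dnn.readNetFromDarknet("yolov3-mask-face.cfg", "yolov3-mask-face.weights")
out_layers = net.getUnconnectedOutLayersNames()

def detect(image, conf_thr=0.5, nms_thr=0.4):
    h, w = image.shape[:2]
    # A single forward pass over the whole image (the YOLO philosophy)
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, confidences, class_ids = [], [], []
    for output in net.forward(out_layers):
        for row in output:                     # row = [cx, cy, bw, bh, obj, p_mask, p_face]
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_thr:
                cx, cy, bw, bh = row[0] * w, row[1] * h, row[2] * w, row[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(confidence)
                class_ids.append(class_id)
    # Non-maximum suppression keeps the best box per detected object
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thr, nms_thr)
    return [(CLASSES[class_ids[i]], confidences[i], boxes[i])
            for i in np.array(keep).flatten()]
```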
4. Database

Once the neural network model was defined, we moved on to the search for a database containing faces and masks with which to train the model. At first, the MAFA database [3], designed to recognize faces partially occluded by objects and containing 25,000 images for training and 10,000 images for testing, was used as a reference.

In this dataset, a large number of images was noted in which the masks were not suitable for individual protection, such as scarves, sweaters and hands covering the face, full masks used for masquerades, etc. For this reason, image filtering was performed, in particular selecting the images that contained surgical masks. Subsequently, a re-labeling of the dataset was performed, in order to obtain an automatic recognition system for the presence of a protective mask on a face. Via the new labeling, a dataset of 5,800 images was extracted, of which 3,800 images were used for training the neural network and 2,000 images were used for testing.

Each image can contain one or more "Mask" and "Face" objects. In this regard, Table 1 shows the number of objects of the two types in the training and testing datasets.

Table 1
Objects in the training and testing datasets for each class.

Class   Training   Testing
Mask    5555       1855
Face    4173       1299
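The re-labeling step can be pictured as converting each selected MAFA annotation into the two-class label format expected by a Darknet/YOLOv3 training pipeline and then splitting the images 3,800/2,000. The sketch below is a hypothetical illustration of that step: the annotation structure and file naming are assumptions, not the authors' conversion script.

```python
# Hypothetical re-labeling sketch: filtered MAFA-style box annotations -> YOLO labels.
# Assumed annotation structure: {"image": path, "width": W, "height": H,
#                                "boxes": [(class_name, x, y, w, h), ...]} in pixels.
import random

CLASS_IDS = {"mask": 0, "face": 1}

def to_yolo_label(box, img_w, img_h):
    """Convert a pixel box (class, x, y, w, h) to the normalized YOLO line format."""
    cls, x, y, w, h = box
    cx, cy = (x + w / 2) / img_w, (y + h / 2) / img_h
    return f"{CLASS_IDS[cls]} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"

def relabel_and_split(annotations, n_train=3800, seed=0):
    """Write one .txt label file per image and split the image list 3,800/2,000."""
    for ann in annotations:
        lines = [to_yolo_label(b, ann["width"], ann["height"]) for b in ann["boxes"]]
        with open(ann["image"].rsplit(".", 1)[0] + ".txt", "w") as f:
            f.write("\n".join(lines))
    random.Random(seed).shuffle(annotations)
    return annotations[:n_train], annotations[n_train:]
```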
5. Performance Evaluation

Once the neural network model and the dataset in use are defined, it is possible to analyse the performances obtained when the dataset described above is fed to the network. After the training phase of the neural network model, the testing dataset, containing images completely unknown to the network, was applied. The performances obtained by the network on the testing dataset are shown in Tables 2 and 3.

Table 2 shows the confusion matrix produced by the neural network. The performances obtained are quite high, implying that the network is able to perform a good detection of the two classes on images that it has never seen before.

Table 2
Confusion matrix.

       Face   Mask
Face   0.94   0.06
Mask   0.14   0.86

Table 3 shows the performances, in percentage, obtained using the statistical classification parameters: accuracy, recall (or sensitivity), precision and F1 score [19]. The table shows that the obtained results in terms of accuracy, recall, precision and F1 score are quite high.

Table 3
Global performance of the proposed method.

       Accuracy [%]   Recall [%]   Precision [%]   F1 [%]
Face   94             94           87
Mask   86             86           93.5
Mean   90             90           90.3            89.98

For further validation of the network model and of the performance obtained from the statistical classification parameters, the ROC (Receiver Operating Characteristic) curve and the ROC AUC (Area Under the Curve) have been produced. Applying this concept to our classification method, in Figure 2 we observe the resulting ROC curve, which indicates a certain degree of variance between the various classes, and that the average ROC AUC [20] lies between the perfect score (1.0) and the diagonal (0.5).

Figure 2: ROC curve.

The graph shows that the area under the ROC curve is very large. This means that our model has excellent performance. In fact, the accuracy (averaged over the two classes) is equal to 89.31%.

The results show that the proposed neural network and the new re-labeled database perform better than the state-of-the-art methods. In particular, comparing our study with the one presented in [2], it is clear that, conducting the research on almost the same dataset, the performances obtained in the present study are 13.6 percentage points higher (90% vs. 76.4%).
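The per-class values in Table 3 are consistent with the confusion matrix in Table 2 when the two classes are weighted equally, which is one plausible way they were obtained. The sketch below recomputes recall, precision and F1 from the matrix on that assumption; it is an illustration of the standard formulas, not the authors' evaluation code.

```python
# Recomputing Table 3 from the confusion matrix of Table 2, assuming equal weight for
# the two classes (the normalized rates are treated as counts per 100 test objects).
import numpy as np

CLASSES = ["Face", "Mask"]
cm = np.array([[0.94, 0.06],   # row = true class, column = predicted class (Table 2)
               [0.14, 0.86]])

f1_scores = []
for i, name in enumerate(CLASSES):
    recall = cm[i, i] / cm[i, :].sum()      # fraction of class i correctly detected
    precision = cm[i, i] / cm[:, i].sum()   # fraction of class-i predictions that are correct
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)
    print(f"{name}: recall = {recall:.1%}, precision = {precision:.1%}, F1 = {f1:.1%}")

print(f"Mean accuracy = {np.trace(cm) / cm.sum():.1%}")   # (0.94 + 0.86) / 2 = 90%
print(f"Mean F1 = {np.mean(f1_scores):.2%}")              # ~89.98%, as reported in Table 3
```

Under this reading, the face precision of 87% and the mask precision of 93.5% follow directly from the columns of Table 2, and the mean F1 of 89.98% is the average of the two per-class F1 scores.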
6. Conclusion

From the point of view of the recognition of the individual protective garment (the mask), which is the subject of our study, we focused on the simple detection of faces and masks within a frame (image or video), obtaining an average accuracy of 90%. A future development includes the extension of the neural network by adding one or more decision-making (classification) layers, in order to be able to identify not only the presence of the mask in the photo but also its position with respect to the person's face, so as to define whether it is worn properly or not.

References

[1] W. Bu, J. Xiao, C. Zhou, M. Yang, C. Peng, A cascade framework for masked face detection, in: 2017 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), IEEE, 2017, pp. 458–462.
[2] S. Ge, J. Li, Q. Ye, Z. Luo, Detecting masked faces in the wild with LLE-CNNs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2682–2690.
[3] MAFA open dataset, 2019. URL: http://221.228.208.41/gl/dataset/0b33a2ece1f549b18c7ff725fb50c561.
[4] Z. Wang, G. Wang, B. Huang, Z. Xiong, Q. Hong, H. Wu, P. Yi, K. Jiang, N. Wang, Y. Pei, et al., Masked face recognition dataset and application, arXiv preprint arXiv:2003.09093 (2020).
[5] R. Avanzato, F. Beritelli, F. Di Franco, V. F. Puglisi, A convolutional neural networks approach to audio classification for rainfall estimation, in: 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), volume 1, IEEE, 2019, pp. 285–289.
[6] S. Spanò, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Matta, A. Nannarelli, M. Re, An efficient hardware implementation of reinforcement learning: The Q-learning algorithm, IEEE Access 7 (2019) 186340–186351.
[7] R. Avanzato, F. Beritelli, A CNN-based differential image processing approach for rainfall classification, Advances in Science, Technology and Engineering Systems Journal 5 (2020) 438–444.
[8] S. I. Illari, S. Russo, R. Avanzato, C. Napoli, A cloud-oriented architecture for the remote assessment and follow-up of hospitalized patients, in: Symposium for Young Scientists in Technology, Engineering and Mathematics, volume 2694, CEUR-WS, 2020.
[9] R. Avanzato, F. Beritelli, A. Raspanti, M. Russo, Assessment of multimodal rainfall classification systems based on an audio/video dataset, International Journal on Advanced Science, Engineering and Information Technology 10 (2020) 1163–1168.
[10] R. Avanzato, F. Beritelli, Automatic ECG diagnosis using convolutional neural network, Electronics 9 (2020) 951.
[11] C. Napoli, F. Bonanno, G. Capizzi, Exploiting solar wind time series correlation with magnetospheric response by using an hybrid neuro-wavelet approach, Proceedings of the International Astronomical Union 6 (2010) 156–158.
[12] C. Napoli, F. Bonanno, G. Capizzi, An hybrid neuro-wavelet approach for long-term prediction of solar wind, Proceedings of the International Astronomical Union 6 (2010) 153–155.
[13] G. Capizzi, C. Napoli, L. Paternò, An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7267 LNAI (2012) 21–29.
[14] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 1137–1149.
[15] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[16] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, in: European Conference on Computer Vision, Springer, 2016, pp. 21–37.
[18] X. Zhang, W. Yang, X. Tang, J. Liu, A fast learning method for accurate and robust lane detection using two-stage feature extraction with YOLO v3, Sensors 18 (2018) 4308.
[19] C. Beleites, R. Salzer, V. Sergo, Validation of soft classification models using partial class memberships: An extended concept of sensitivity & co. applied to grading of astrocytoma tissues, Chemometrics and Intelligent Laboratory Systems 122 (2013) 12–22.
[20] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30 (1997) 1145–1159.