YOLOv3-based Mask and Face Recognition Algorithm for Individual Protection Applications

Roberta Avanzato (a), Francesco Beritelli (a), Michele Russo (b), Samuele Russo (c) and Mario Vaccaro (b)

(a) Department of Electrical, Electronic and Computer Engineering, University of Catania, Catania, CT, Italy
(b) VICOSYSTEMS S.r.l., V.le Odorico da Pordenone 33, Catania, CT, Italy
(c) Sapienza University of Rome, Piazzale Aldo Moro 5, Roma, Italy

ICYRIME 2020: International Conference for Young Researchers in Informatics, Mathematics, and Engineering, Online, July 09 2020
roberta.avanzato@phd.unict.it (R. Avanzato); francesco.beritelli@dieei.unict.it (F. Beritelli); m.russo@vicosystems.it (M. Russo); samuelerussoct@gmail.com (S. Russo); m.vaccaro@vicosystems.it (M. Vaccaro)

Abstract
To combat the spread of the COVID-19 pandemic, it is essential to strictly obey social distancing measures, as well as to possess and wear personal protective equipment. This paper proposes a mask and face recognition algorithm based on YOLOv3 for individual protection applications. The proposed method processes images directly in raw data format as input to a neural network trained with deep learning techniques. System training was performed on a set of images obtained from the MAFA dataset by selecting those with surgical masks, for a total of about 6,000 cases. The performances obtained indicate 84% accuracy in recognizing a mask and 96% in the case of a face.

Keywords
Image processing, Face recognition, Mask recognition, Computer vision, Deep learning

1. Introduction

In an emergency phase, the fight against the spread of COVID-19 contamination is regulated by procedures of medical-scientific rigor and by official protocols adopted as regulations until the epidemic is definitively defeated on a global scale. For the return to normality, which is expected to be gradual and of medium-long duration, it is essential to strictly obey social distancing measures, as well as to possess and wear personal protective equipment for those who continue to work in potentially contagious environments. Thus, it becomes strategic to focus on solutions that can remotely and non-intrusively monitor people's behaviour and health, while ensuring respect for privacy.
One solution is represented by innovative video intelligence technologies for the automatic detection of body temperature and of the proximity distance between individuals, in order to guarantee, and possibly certify, in outdoor or indoor environments, compliance with the regulations on the distance between individuals (and/or the maximum capacity of a given environment), access to indoor environments only for individuals without critical health conditions, and, possibly, where necessary, compliance with the restrictions on individual protection (masks, gloves, overalls, etc.). There are several important advantages: the safeguard of people's health, the mitigation of the risk of a return of contamination, the possibility of timely interventions by the law enforcement engaged in preserving public health orders, as well as a safe and fast return to work.

The key issues forming the basis of the proposal described in this paper are the following:

• need for social distancing outdoors (streets, squares, parks, etc.) and indoors (offices, schools, shopping centers, theaters, restaurants, pubs, shops, etc.);
• need to manage quotas for access and use of public areas and public carriers;
• need for timely notification of gatherings to the managers of the frequented areas and, in the most serious cases, to the law enforcement, possibly via the certification of critical events;
• need to monitor the state of health (by checking the temperature) of people who access an indoor environment;
• need to monitor compliance with the use of protective equipment (masks, gloves, overalls), especially in the most at-risk work contexts.

The last point is the one the present study focuses on, by proposing a mask/face recognition algorithm. In the state of the art there are many studies dealing with face recognition and, in particular, with the recognition of masked faces.

In [1] the authors propose a masked face detection technique useful for monitoring and identifying criminals or terrorists. They propose a CNN-based cascade framework, which consists of three carefully designed convolutional neural networks to detect masked faces. The accuracy in recognizing masked faces is 87.8%. In [2] the authors propose a further method of identifying masked faces based on the LLE-CNN network and the MAFA database [3]. With this approach, the authors achieved a performance of 76.4%. The authors in [4] address the importance of greater accuracy in face recognition during the COVID-19 period. The study proposes a face-eye based multi-granular recognition model, with which the accuracy of masked face recognition rises from the initial 50% to 95%.

In the present study, a mask/face recognition technique based on a high-performing convolutional neural network, YOLOv3, is proposed. This method makes it possible to detect and classify "faces" and "masks" within videos and/or images.

The paper is structured as follows: Section 2 describes the proposed method; Section 3 illustrates the neural network used; Section 4 describes the database used; Section 5 shows the performances obtained by the proposed technique; the last section is dedicated to conclusions.

2. Proposed Method

This section describes the process of detecting masks and faces. Figure 1 shows the block diagram of the proposed technique.

Figure 1: Block diagram of the proposed method.

The first block represents the acquisition of the video signal by means of cameras, which can be installed in indoor or outdoor environments. Once the video signal is acquired, a pre-processing phase is performed (Video processing block), which is responsible for extracting the frames at a frame rate of 30 fps. Subsequently, the frames are fed into the previously trained YOLOv3 neural network. The network output provides a detection and classification confidence (percentage) for the faces and masks present in the input frames.
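As an illustration of this acquisition and pre-processing pipeline, the following Python sketch wires a camera stream, frame extraction and a detector call together with OpenCV. It is a minimal sketch under stated assumptions, not the authors' implementation: the function detect_faces_and_masks is a hypothetical placeholder for the trained YOLOv3 model.

```python
# Illustrative sketch of the acquisition/pre-processing pipeline of Section 2.
# Assumption: detect_faces_and_masks() is a hypothetical wrapper around the trained
# YOLOv3 network returning (label, confidence, (x, y, w, h)) tuples per frame.
import cv2

def detect_faces_and_masks(frame):
    # Placeholder for the trained YOLOv3 detector; one possible OpenCV-DNN-based
    # implementation is sketched in the next section.
    return []  # list of (label, confidence, (x, y, w, h)) tuples

def process_stream(source=0):
    cap = cv2.VideoCapture(source)      # camera index or video file path
    cap.set(cv2.CAP_PROP_FPS, 30)       # target frame rate of 30 fps
    while True:
        ok, frame = cap.read()          # extract one frame from the video signal
        if not ok:
            break
        for label, confidence, (x, y, w, h) in detect_faces_and_masks(frame):
            # Draw the bounding box with the class and confidence returned by the network
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, f"{label} {confidence:.2f}", (x, y - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        cv2.imshow("mask/face detection", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```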
3. Adopted Neural Network

The application of artificial intelligence and machine learning algorithms turns out to be a very complex approach if the problems requiring a solution are not clearly highlighted [5, 6, 7, 8, 9, 10, 11, 12, 13]. In this study, we are interested in recognizing the face, and any mask worn, of the various people present in the video recordings.

The theme of face and mask recognition falls within the subject of object detection. Object detection is the basis of computer vision, and specifically of applications such as instance segmentation, image captioning and object detection/tracking. From an application point of view, it is possible to group object detection into two macro-categories:

• General object detection: the goal is to investigate methods for identifying different types of objects using a single framework, in order to simulate human vision and cognition;
• Detection applications: the recognition of objects of a certain class in specific application scenarios, for example pedestrian detection, face detection or text detection.

Several models implementing object recognition are present in the state of the art. One of them is Faster R-CNN [14], which represents the current state of the art among models that divide the task of identifying objects into several phases. This network makes it possible to simultaneously train a recognizer and a bounding box predictor within a single model. The procedure carried out by this network is of the "proposal detection and verification" type.

A second model is YOLO (You Only Look Once). In [15, 16] the authors completely abandon the pre-existing "proposal detection and verification" paradigm. Instead, YOLO follows a different philosophy: applying a single model to the entire image. YOLO, in fact, divides the image into regions, predicts the bounding boxes and, for each of them, determines the probabilities of belonging to a certain class, all using a single network.

In [17] the authors define the SSD (Single Shot Detector) model. This method has greatly contributed to the change of perspective towards the generation of bounding boxes: unlike the previous models, which were concerned with accurately predicting the location of an object within the image, SSD starts from a default set of bounding boxes; from this set, a deviation and a classification are predicted for each box. Thanks to a set of operations and filters, SSD also obtains excellent accuracy in the prediction of object classes.

In order to make an exhaustive comparison of the various convolutional models presented, and to maintain a certain consistency in the results, it was decided to use the work done in [18] as a framework to compare the performances. In that study, the authors indicate that YOLOv3 is clearly superior to the other CNNs, both in terms of computational time and of accuracy. However, it should be noted that Fast R-CNN, despite the huge gap in terms of computational time, allows, among other things, very accurate segmentations (polylines) to be identified, when compared with the "simple" bounding boxes provided by YOLOv3 or SSD. Therefore, depending on the specific application context, there may be cases in which Fast R-CNN is the optimal solution.
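To make the single-pass YOLO philosophy concrete, the sketch below runs a YOLOv3 model through OpenCV's DNN module and turns the raw grid predictions into class-labelled boxes. The configuration/weight file names and the two class labels are assumptions chosen to match this paper's task; the paper does not specify how its YOLOv3 model was deployed, so this is an illustration rather than the authors' code.

```python
# Minimal YOLOv3 inference sketch using OpenCV's DNN module.
# Assumption: "yolov3-mask-face.cfg" / ".weights" are hypothetical files for a network
# trained on the two classes used in this paper ("mask", "face").
import cv2
import numpy as np

CLASSES = ["mask", "face"]
net = cv2.dnn.readNetFromDarknet("yolov3-mask-face.cfg", "yolov3-mask-face.weights")
out_layers = net.getUnconnectedOutLayersNames()

def detect(image, conf_thr=0.5, nms_thr=0.4):
    h, w = image.shape[:2]
    # A single forward pass over the whole image (the YOLO philosophy)
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, confidences, class_ids = [], [], []
    for output in net.forward(out_layers):
        for row in output:                     # row = [cx, cy, bw, bh, obj, p_mask, p_face]
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_thr:
                cx, cy, bw, bh = row[0] * w, row[1] * h, row[2] * w, row[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(confidence)
                class_ids.append(class_id)
    # Non-maximum suppression keeps the best box per detected object
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thr, nms_thr)
    return [(CLASSES[class_ids[i]], confidences[i], boxes[i])
            for i in np.array(keep).flatten()]
```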
4. Database

Once the neural network model was defined, we moved on to the search for a database containing faces and masks with which to train the model. At first, the MAFA database [3], designed to recognize faces partially occluded by objects and containing 25,000 images for training and 10,000 images for testing, was used as a reference.

In this dataset, a large number of images was noted in which the masks were not suitable for individual protection, such as scarves, sweaters and hands covering the face, full masks used for masquerades, etc. For this reason, image filtering was performed, in particular selecting the images that contained surgical masks. Subsequently, a re-labeling of the dataset was performed, in order to obtain an automatic recognition system for the presence of a protective mask on a face. Via the new labeling, a dataset of 5,800 images was extracted, of which 3,800 images were used for training the neural network and 2,000 images were used for testing.

Each image can contain one or more "Mask" and "Face" objects. In this regard, Table 1 shows the number of objects of the two types in the training and testing datasets.

Table 1
Objects in the training and testing datasets for each class.

Class   Training   Testing
Mask    5555       1855
Face    4173       1299
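The re-labeling step can be pictured as converting each selected MAFA annotation into the two-class label format expected by a Darknet/YOLOv3 training pipeline and then splitting the images 3,800/2,000. The sketch below is a hypothetical illustration of that step: the annotation structure and file naming are assumptions, not the authors' conversion script.

```python
# Hypothetical re-labeling sketch: filtered MAFA-style box annotations -> YOLO labels.
# Assumed annotation structure: {"image": path, "width": W, "height": H,
#                                "boxes": [(class_name, x, y, w, h), ...]} in pixels.
import random

CLASS_IDS = {"mask": 0, "face": 1}

def to_yolo_label(box, img_w, img_h):
    """Convert a pixel box (class, x, y, w, h) to the normalized YOLO line format."""
    cls, x, y, w, h = box
    cx, cy = (x + w / 2) / img_w, (y + h / 2) / img_h
    return f"{CLASS_IDS[cls]} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"

def relabel_and_split(annotations, n_train=3800, seed=0):
    """Write one .txt label file per image and split the image list 3,800/2,000."""
    for ann in annotations:
        lines = [to_yolo_label(b, ann["width"], ann["height"]) for b in ann["boxes"]]
        with open(ann["image"].rsplit(".", 1)[0] + ".txt", "w") as f:
            f.write("\n".join(lines))
    random.Random(seed).shuffle(annotations)
    return annotations[:n_train], annotations[n_train:]
```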
5. Performance Evaluation

Once the neural network model and the dataset in use are defined, it is possible to analyse the performances obtained when the dataset described above is fed to the network. After the training phase of the neural network model, the testing dataset, containing images completely unknown to the network, was applied. The performances obtained by the network on the testing dataset are shown in Tables 2 and 3.

Table 2 shows the confusion matrix produced by the neural network. The performances obtained are quite high, implying that the network is able to perform a good detection of the two classes on images that it has never seen before.

Table 2
Confusion matrix.

       Face   Mask
Face   0.94   0.06
Mask   0.14   0.86

Table 3 shows the performances, in percentage, obtained using the statistical classification parameters: accuracy, recall (or sensitivity), precision and F1 score [19]. The table shows that the obtained results in terms of accuracy, recall, precision and F1 score are quite high.

Table 3
Global performance of the proposed method.

       Accuracy [%]   Recall [%]   Precision [%]   F1 [%]
Face   94             94           87
Mask   86             86           93.5
Mean   90             90           90.3            89.98

For further validation of the network model and of the performance obtained from the statistical classification parameters, the ROC (Receiver Operating Characteristic) curve and the ROC AUC (Area Under the Curve) have been produced. Applying this concept to our classification method, in Figure 2 we observe the resulting ROC curve, which indicates a certain degree of variance between the various classes, and that the average ROC AUC [20] lies between the perfect score (1.0) and the diagonal (0.5).

Figure 2: ROC curve.

The graph shows that the area under the ROC curve is very large. This means that our model has excellent performance. In fact, the accuracy (averaged over the two classes) is equal to 89.31%.

The results show that the proposed neural network and the new re-labeled database perform better than the state-of-the-art methods. In particular, comparing our study with the one presented in [2], it is clear that, conducting the research on almost the same dataset, the performances obtained in the present study are 13.6 percentage points higher (90% vs. 76.4%).
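The per-class values in Table 3 are consistent with the confusion matrix in Table 2 when the two classes are weighted equally, which is one plausible way they were obtained. The sketch below recomputes recall, precision and F1 from the matrix on that assumption; it is an illustration of the standard formulas, not the authors' evaluation code.

```python
# Recomputing Table 3 from the confusion matrix of Table 2, assuming equal weight for
# the two classes (the normalized rates are treated as counts per 100 test objects).
import numpy as np

CLASSES = ["Face", "Mask"]
cm = np.array([[0.94, 0.06],   # row = true class, column = predicted class (Table 2)
               [0.14, 0.86]])

f1_scores = []
for i, name in enumerate(CLASSES):
    recall = cm[i, i] / cm[i, :].sum()      # fraction of class i correctly detected
    precision = cm[i, i] / cm[:, i].sum()   # fraction of class-i predictions that are correct
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)
    print(f"{name}: recall = {recall:.1%}, precision = {precision:.1%}, F1 = {f1:.1%}")

print(f"Mean accuracy = {np.trace(cm) / cm.sum():.1%}")   # (0.94 + 0.86) / 2 = 90%
print(f"Mean F1 = {np.mean(f1_scores):.2%}")              # ~89.98%, as reported in Table 3
```

Under this reading, the face precision of 87% and the mask precision of 93.5% follow directly from the columns of Table 2, and the mean F1 of 89.98% is the average of the two per-class F1 scores.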
6. Conclusion

From the point of view of the recognition of the individual protective garment (the mask), which is the subject of our study, we focused on the simple detection of faces and masks within a frame (image or video), obtaining an average accuracy of 90%. A future development includes the extension of the neural network by adding one or more decision-making (classification) layers, in order to be able to identify not only the presence of the mask in the photo but also its position with respect to the person's face, so as to define whether it is worn properly or not.

References

[1] W. Bu, J. Xiao, C. Zhou, M. Yang, C. Peng, A cascade framework for masked face detection, in: 2017 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), IEEE, 2017, pp. 458–462.
[2] S. Ge, J. Li, Q. Ye, Z. Luo, Detecting masked faces in the wild with LLE-CNNs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2682–2690.
[3] MAFA open dataset, 2019. URL: http://221.228.208.41/gl/dataset/0b33a2ece1f549b18c7ff725fb50c561.
[4] Z. Wang, G. Wang, B. Huang, Z. Xiong, Q. Hong, H. Wu, P. Yi, K. Jiang, N. Wang, Y. Pei, et al., Masked face recognition dataset and application, arXiv preprint arXiv:2003.09093 (2020).
[5] R. Avanzato, F. Beritelli, F. Di Franco, V. F. Puglisi, A convolutional neural networks approach to audio classification for rainfall estimation, in: 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), volume 1, IEEE, 2019, pp. 285–289.
[6] S. Spanò, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Matta, A. Nannarelli, M. Re, An efficient hardware implementation of reinforcement learning: The Q-learning algorithm, IEEE Access 7 (2019) 186340–186351.
[7] R. Avanzato, F. Beritelli, A CNN-based differential image processing approach for rainfall classification, Advances in Science, Technology and Engineering Systems Journal 5 (2020) 438–444.
[8] S. I. Illari, S. Russo, R. Avanzato, C. Napoli, A cloud-oriented architecture for the remote assessment and follow-up of hospitalized patients, in: Symposium for Young Scientists in Technology, Engineering and Mathematics, volume 2694, CEUR-WS, 2020.
[9] R. Avanzato, F. Beritelli, A. Raspanti, M. Russo, Assessment of multimodal rainfall classification systems based on an audio/video dataset, International Journal on Advanced Science, Engineering and Information Technology 10 (2020) 1163–1168.
[10] R. Avanzato, F. Beritelli, Automatic ECG diagnosis using convolutional neural network, Electronics 9 (2020) 951.
[11] C. Napoli, F. Bonanno, G. Capizzi, Exploiting solar wind time series correlation with magnetospheric response by using an hybrid neuro-wavelet approach, Proceedings of the International Astronomical Union 6 (2010) 156–158.
[12] C. Napoli, F. Bonanno, G. Capizzi, An hybrid neuro-wavelet approach for long-term prediction of solar wind, Proceedings of the International Astronomical Union 6 (2010) 153–155.
[13] G. Capizzi, C. Napoli, L. Paternò, An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7267 LNAI (2012) 21–29.
[14] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 1137–1149.
[15] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[16] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, in: European Conference on Computer Vision, Springer, 2016, pp. 21–37.
[18] X. Zhang, W. Yang, X. Tang, J. Liu, A fast learning method for accurate and robust lane detection using two-stage feature extraction with YOLO v3, Sensors 18 (2018) 4308.
[19] C. Beleites, R. Salzer, V. Sergo, Validation of soft classification models using partial class memberships: An extended concept of sensitivity & co. applied to grading of astrocytoma tissues, Chemometrics and Intelligent Laboratory Systems 122 (2013) 12–22.
[20] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30 (1997) 1145–1159.