<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Face Recognition in the Presence of Non‐Gaussian Noise </article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleg Rudenko</string-name>
          <email>oleh.rudenko@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksandr Bezsonov</string-name>
          <email>oleksandr.bezsonov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denys Yakovliev</string-name>
          <email>denys.yakovliev@hneu.net</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Nauky Ave. 14, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Simon Kuznets Kharkiv National University of Economics</institution>
          ,
          <addr-line>Nauky Ave. 9, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A method is proposed for detecting a face and recognizing a person or a group of persons in pictures or videos that can be corrupted by non-Gaussian noise, using different architectures of convolutional neural networks. The technical details of building a deep learning-based face recognition system are discussed. The test results confirm the prospects of using the developed method for solving face detection and recognition tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional neural network</kwd>
        <kwd>learning algorithm</kwd>
        <kwd>facial boundaries</kwd>
        <kwd>marker</kwd>
        <kwd>measure of similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction </title>
      <p>Visual pattern recognition is one of the most important components of information management
and processing systems, automated systems and decision-making systems. Tasks related to the
classification and identification of objects, phenomena and signals, characterized by a finite set of
certain properties and characteristics arise in areas such as robotics, information retrieval, monitoring
and analysis of visual data, and are the subject of artificial intelligence. The main task of artificial
intelligence is to build intelligent information systems whose effectiveness in solving informal
problems matches or exceeds human capabilities.</p>
      <p>When interacting with other people, a person's face is an important source of information. Facial
expressions, gestures during conversation, and head movements are a convenient and natural way to
convey information. The computer's inability, on the one hand, to perceive and, on the other hand, to
reproduce natural human ways of communication complicates the transmission and perception of
information when working with a computer. To achieve computer recognition of head movements
and facial expressions, it is necessary to implement robust algorithms for the analysis and
classification of digital images of human faces, keeping in mind that they can be corrupted by noisy
pixels.</p>
      <p>Such algorithms can perform a wide range of commercial and non-commercial tasks. To date,
there is no algorithm that could identify a human face with 100% accuracy. However, it is already
possible to implement programs for detecting faces, which can be successfully used in education,
medicine, security and safety, computer games, advertising and other areas.</p>
      <p>The purpose of this work is the development and testing of a method for detecting faces and
recognizing a person or a group of persons in pictures and videos that can be corrupted by
non-Gaussian noise, based on the use of convolutional neural networks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Recognition stages </title>
      <p>It is generally assumed that full face recognition consists of four main steps:
1. Face detection
2. Preliminary processing
3. Feature extraction
4. Classification</p>
      <p>
        The first step is to detect faces in the image, regardless of scale and location. To do this, an
advanced filtering procedure is often used to distinguish locations representing faces and filter them
using some classifiers. It is noteworthy that all changes in displacement, scaling and rotation must be
processed during the face detection phase. Note that facial expressions and changing hairstyles or
smiling and frowning faces still present difficulties at the stage of pattern recognition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        While the accuracy and speed of face detection systems have improved, two major challenges
remain. Face detectors must cope with large and complex variations of facial appearance and
effectively distinguish between faces and non-faces in unconstrained environments. In addition, large
differences in face position and size in a large search space create problems that reduce detection
efficiency [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This requires a trade-off between high detection accuracy and computational
efficiency of the detection procedure.
      </p>
      <p>In the next step, a system based on an anthropometric dataset predicts the approximate location of
key features such as eyes, nose and mouth. The entire procedure is repeated to predict small features
in relation to major features and is checked with collocation statistics to reject any misplaced features.</p>
      <p>Anchor points are generated as a result of geometric combinations on the face image, and then
the actual recognition process begins by finding a local representation of the facial appearance at each
of the anchor points. The representation scheme depends on the approach used.</p>
      <p>
        Feature extraction usually occurs immediately after face detection and can be considered one of
the most important steps in face recognition systems, since their effectiveness depends on the quality
of the extracted features. This is because the facial landmarks are identified by a given network that
determines how accurately features are represented. Traditional landmark locators are model-based,
while many recent methods are based on cascade regression [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>After feature extraction, face recognition is performed. The classification stage itself is often
performed using a general nearest-neighbor technique, while more specific algorithms are
developed for the preprocessing stage. Preprocessing usually consists of cropping, zooming, and
aligning the detected faces. While cropping and zooming require no sophisticated techniques,
alignment is not an easy problem to solve. In practice, alignment consists in detecting a
series of landmarks on the face (nose, eyes, etc.) and then transforming the face pictures so that
the position of these landmarks is constant.</p>
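      <p>The alignment step described above can be sketched as follows. This is an illustrative example, not the method used in this work: the canonical eye coordinates are arbitrary assumptions, and only two landmarks are used, whereas practical systems typically use five or more:</p>
      <p>
```python
import numpy as np

def align_by_eyes(left_eye, right_eye, canonical_left=(30.0, 40.0), canonical_right=(70.0, 40.0)):
    """Build a 2x3 similarity transform (rotation + uniform scale + translation)
    that maps the detected eye centers onto fixed canonical positions."""
    le, re = np.asarray(left_eye, float), np.asarray(right_eye, float)
    cl, cr = np.asarray(canonical_left, float), np.asarray(canonical_right, float)
    src_vec, dst_vec = re - le, cr - cl
    scale = np.linalg.norm(dst_vec) / np.linalg.norm(src_vec)
    angle = np.arctan2(dst_vec[1], dst_vec[0]) - np.arctan2(src_vec[1], src_vec[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    t = cl - R @ le                      # translation that pins the left eye
    return np.hstack([R, t[:, None]])    # 2x3 affine matrix

def apply_affine(M, pts):
    """Apply the 2x3 affine matrix to an array of 2-D points."""
    pts = np.asarray(pts, float)
    return pts @ M[:, :2].T + M[:, 2]
```
      </p>
      <p>Applying the returned matrix to the whole image (with any warping routine) places both eyes at the chosen canonical positions, so subsequent feature extraction sees faces in a constant pose.</p>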
    </sec>
    <sec id="sec-3">
      <title>3. Convolutional neural networks for face recognition </title>
      <p>Significant progress has been made in face detection and recognition through the use of
convolutional neural networks (CNN).</p>
      <p>
        The first CNN framework, known as LeNet-5, was developed in 1990 to classify handwritten
digits by recognizing visual patterns in image pixels without the need for preprocessing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a
neural network used for vertical, frontal face recognition in grayscale images was presented for the
first time, which, although quite simple compared with today's architectures, was comparable in
accuracy to the best methods of its time.
      </p>
      <p>Face recognition differs from object recognition in that it requires alignment before feature
extraction; this is reflected in the differences between CNNs designed for face recognition and those
used for object recognition.</p>
      <p>Because of the proven ability of deep convolutional neural network (DCNN)-based systems to
surpass human performance in face verification tasks, research in this area has skyrocketed.</p>
      <p>
        Face detection and recognition research is currently focusing on DCNNs, which have
demonstrated impressive accuracy on highly complex databases such as the WIDER FACE dataset
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and MegaFace Challenge [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], as well as older databases such as Labeled Faces in the Wild (LFW)
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The growth in deep neural network research has been accompanied by the emergence of many
deep learning frameworks, including Caffe, TensorFlow, Torch, MXNet, and Theano, which use
platforms such as CUDA and libraries such as cuDNN. These frameworks can be used from a number
of programming languages such as C++, Python, or Matlab.</p>
      <p>
        Rapid progress in this direction has been driven by the increase in the availability of powerful
GPUs and improvements in CNN's architecture for real-world applications. In addition, the
development of large annotated datasets and a better understanding of nonlinear mapping between
input images and class labels have contributed to increased research interest in these networks.
DCNNs are very efficient due to their ability to approximate non-linear functions. Their significant
disadvantage is high computational costs due to the presence of intensive convolution and nonlinear
operations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, DCNNs are expected to have a bright future, and are currently being developed
by such large corporations as Google, Facebook and Microsoft [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Using CNN in face recognition tasks consists of two main stages: training and inference. Training
is a global optimization process that involves identifying network parameters by observing huge
datasets. Inference essentially involves deploying a trained CNN to classify observed data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
training process includes minimizing the loss function to determine the parameters of the network
and the number of layers required, and the organization of connections between the layers.
      </p>
      <p>
        CNN face recognition systems are characterized by the training data used to build the model, the
network architecture and settings, and the type of the loss function [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. DCNNs have the ability to
learn highly discriminatory and invariant representations of objects when trained on very large
datasets. Training is reduced to the application of optimization algorithms to minimize the loss
function. In this case, the role of the loss function is to determine the forecast error, and its choice
depends on the problem being solved (regression, classification, etc.). Different loss functions
produce different error values for an identical forecast and thus largely determine the network
performance. To minimize the error, the backpropagation algorithm is used, and the weights are
adjusted (corrected) by some optimization algorithm.
      </p>
      <p>Modern face recognition systems that use DCNN implement deep feature extraction and similarity
comparison, which involves conversion of test images to deep representations and computation of
Euclidean or cosine distance.</p>
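      <p>The similarity comparison mentioned above can be illustrated with a short sketch (the embedding vectors and the decision threshold here are illustrative assumptions; in practice the threshold is tuned on a validation set):</p>
      <p>
```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two deep embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_same_person(emb1, emb2, threshold=0.5):
    # Threshold is illustrative; real systems calibrate it per model.
    return cosine_similarity(emb1, emb2) >= threshold
```
      </p>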
    </sec>
    <sec id="sec-4">
      <title>4. Convolutional neural network architecture </title>
      <p>Initially, the structure of the convolutional neural network was created taking into account the
structural features of some parts of the human brain that are responsible for vision. The development
of such networks is based on three mechanisms:
- local perception;
- formation of layers in the form of a set of feature maps (shared weights);
- subsampling.</p>
      <p>According to these mechanisms, three main types of layers are used to build a convolutional
neural network: convolution, pooling (or subsampling), and fully connected layers.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Convolution Layer </title>
      <p>
        The convolution formula for the l-th (l = 1, ..., L) layer of the network, which looks like [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
x_ij^l = Σ_a Σ_b w_ab^l · y_{(is+a)(js+b)}^{l-1} + b^l, (1)
reflects the movement of the kernel w^l over the input image or feature map y^{l-1} of this layer.
Here y_ij^{l-1} = f(x_ij^{l-1}) is the image after the (l−1)-th layer; f(·) is the activation function used;
b^l is the offset (bias); i, j, a, b are the indices of the elements in the matrices; s is the size of the
convolution step (stride).
      </p>
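      <p>Equation (1) can be written directly in code. The following sketch is a naive single-channel implementation for illustration only; real frameworks use heavily optimized routines:</p>
      <p>
```python
import numpy as np

def conv2d(y_prev, w, b, s=1):
    """Direct implementation of eq. (1):
    x[i, j] = sum over a, b of w[a, b] * y_prev[i*s + a, j*s + b], plus the bias b."""
    H, W = y_prev.shape
    kh, kw = w.shape
    out_h = (H - kh) // s + 1
    out_w = (W - kw) // s + 1
    x = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the current window, then sum
            x[i, j] = np.sum(w * y_prev[i*s:i*s+kh, j*s:j*s+kw]) + b
    return x
```
      </p>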
      <p>As can be seen from (1), the convolution operation is performed for each element i, j of the image
matrix x^l.</p>
      <p>Convolution preserves spatial relationships between pixels.</p>
      <p>Each convolutional layer is followed by a downsampling (subsampling) layer, or a computational
layer, which serves to reduce the image dimension by local averaging of the neuron output values.</p>
    </sec>
    <sec id="sec-6">
      <title>4.2. Subsampling layer (pooling, MAX‐pooling) </title>
      <p>The subsampling layer reduces the scale of the planes by locally averaging the neuron outputs.
Thus, a hierarchical organization is achieved. Subsequent layers extract more general characteristics
less dependent on image distortion.</p>
      <p>The pooling layer is described by the expression</p>
      <p>x_ij^l = β^l · down(y_{(is+a)(js+b)}^{l-1}) + b^l, (2)</p>
      <p>where down(·) is the pooling function. This function aggregates blocks of the input image, thus
reducing the dimension of the output image of the layer. In this case, any output map is set by two
offset parameters: the multiplicative β^l and the additive b^l.</p>
      <p>The difference between the convolution layer and the subsampling layer is that in the convolution
layer the regions of neighboring neurons overlap, which does not happen in the subsampling layer.</p>
      <p>The pooling layer operates independently of the slice depth of the input data and scales the volume
spatially using a maximum function.</p>
      <p>In the architecture of the convolutional network, it is generally accepted that the presence of a
feature is more important than information about its location. Therefore, from several neighboring
neurons in the feature map, the maximum is selected and its value is considered one neuron in the
feature map of a lower dimension.</p>
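      <p>The MAX-pooling operation described above can be sketched as follows (a minimal single-channel version for illustration):</p>
      <p>
```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max-pooling: keep only the strongest activation in each size x size block."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out
```
      </p>
      <p>Replacing <code>.max()</code> with <code>.mean()</code> gives the averaging subsampling variant mentioned below.</p>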
      <p>In addition to the maximum subsampling, pooling layers can perform other functions, such as
averaging subsampling or even L2-normalized subsampling.</p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Non‐linear activation layer </title>
      <p>In these layers, a non-linear activation function f(·) is applied to all input values, and the result is
sent to the output. Thus, the activation layer does not change the size of the input.</p>
      <p>Usually, due to its significant positive properties, ReLU (ReLU(x) = max(0, x)) and its various
modifications (Leaky ReLU, Parametric ReLU, Randomized ReLU) are used for the hidden layers,
while for the fully connected layer the SoftMax function f_j^L = e^{x_j^L} / Σ_{i=1}^{N_L} e^{x_i^L}
is used (when solving classification problems), or a linear function (for regression problems).</p>
    </sec>
    <sec id="sec-8">
      <title>4.4. Dropout layer </title>
      <p>Various regularization methods are used to avoid overfitting the network.</p>
      <p>Dropout is a simple and effective regularization method: during training, a subnet is repeatedly
randomly selected from the aggregate network topology, i.e. some neurons are switched off, and the
next update of the weights occurs only within the selected subnet. Thus, the weights change only in
the remaining neurons. Each neuron is dropped from the aggregate network with a certain
probability, which is called the dropout rate.</p>
      <p>This layer reduces the time of one training epoch due to the smaller number of parameters to be
optimized, and also makes it possible to deal with network overfitting better than standard
regularization methods.</p>
    </sec>
    <sec id="sec-9">
      <title>4.5. Normalizing layer </title>
      <p>On this layer, standard normalization of the inputs is performed (the sample mean of their values
is subtracted, and the result is divided by the square root of the sample variance). The sample
statistics are calculated taking into account the values at the inputs of this layer at previous training
iterations. This approach increases the speed of network learning and improves the final result.</p>
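      <p>The normalization described above can be sketched as follows (a simplified batch-style version, without the learned scale and shift parameters and running statistics used in full batch normalization):</p>
      <p>
```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Standard normalization over the batch axis: subtract the sample mean and
    divide by the square root of the sample variance (eps guards against division by zero)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)
```
      </p>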
    </sec>
    <sec id="sec-10">
      <title>4.6. Fully connected layer </title>
      <p>This layer is a conventional multilayer perceptron, the purpose of which is classification. It
simulates a complex nonlinear function, optimization of which improves the recognition quality.</p>
      <p>The neurons of each map of the previous subsampling layer are associated with one neuron of the
hidden layer. Thus, the number of neurons in the hidden layer is equal to the number of maps in the
subsampling layer.</p>
      <p>As in conventional neural networks, in a fully connected layer, neurons are connected to all
activations in the previous layer. Their activations can be calculated by multiplying matrices and
applying an offset.</p>
      <p>The difference between fully connected and convolutional layers is that the neurons of the
convolutional layer
1) are connected only to a local input region;
2) can share parameters.</p>
    </sec>
    <sec id="sec-11">
      <title>4.7. Choice of training criterion </title>
      <p>CNN training is an iterative process. Each iteration calculates the network outputs for one (or
more) samples in the training set and adjusts the network weights to reduce the error between the
actual network outputs y_{i,p}^L (i = 1, ..., N_L) and the target outputs d_{i,p} for a given sample p.
Therefore, training is reduced to minimizing some functional.</p>
      <p>In practice, a quadratic function, cross entropy, or some combined functional is used as a criterion
for the error function.</p>
      <p>A backpropagation neural network (BPNN) is a multi-layer feedforward neural network that uses
a supervised training algorithm known as the error backpropagation algorithm. Errors accumulated at
the output layer are propagated back through the network to correct the weights. A conventional
multilayer perceptron, which consists of three types of layers (input, hidden and output), performs no
backward computation other than the operations used in training; during simulation, all operations
are performed in the forward direction.</p>
    </sec>
    <sec id="sec-12">
      <title>5. Training of convolutional neural network </title>
      <p>CNN training is similar to the training of any feedforward ANN and consists in correcting its
weight parameters based on minimizing some selected loss function, usually a quadratic function or
cross entropy.</p>
      <p>
        To train convolutional neural networks, both the standard backpropagation method and its various
modifications can be used. The derivation of the backpropagation algorithm for CNN training is
discussed in sufficient detail in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This method is based on the stochastic gradient descent
algorithm (SGD).
      </p>
      <p>
        In practice, learning algorithms based on the gradient descent method (the error backpropagation
method) and its modifications are most commonly used: SGD, Adam,
AdaGrad, AdaDelta, etc. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
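      <p>For illustration, minimal sketches of the SGD and Adam update rules are given below (the learning-rate values and the quadratic test loss are illustrative assumptions, not settings from this work):</p>
      <p>
```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Plain (stochastic) gradient descent update: theta = theta - lr * grad."""
    return theta - lr * grad

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected moving averages of the gradient (m)
    and of its element-wise square (v) adapt the step size per parameter."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimizing the toy loss J(theta) = theta^2, whose gradient is 2 * theta, with SGD:
theta = 5.0
for _ in range(100):
    theta = sgd_step(theta, 2.0 * theta)
```
      </p>
      <p>On this toy quadratic loss, SGD with lr = 0.1 contracts theta by a factor of 0.8 per step, so it quickly approaches the minimum at zero.</p>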
    </sec>
    <sec id="sec-13">
      <title>6. Training neurons of the output (fully connected) layer  </title>
      <p>To train the neurons of this layer, a gradient backpropagation algorithm is used</p>
      <p> (k 1)  (k)  J ( (k)),  
where  – network parameters (elements of weight matrices, displacements, angles of inclination
of activation functions, etc.) ; J ( (k )) – objective function (loss function E or C);  – the parameter
of the learning rate.
6.1.</p>
    </sec>
    <sec id="sec-14">
      <title>6.1. Training neurons when choosing a quadratic function E </title>
      <p>If the quadratic function E is chosen as the network loss function, and the sigmoidal function is
chosen as the activation function of the neurons, then the following equations are used to adjust the
weights matrix of the L-th layer:</p>
      <p>∂E/∂w_ki^L = (∂E/∂y_i^L)(∂y_i^L/∂x_i^L)(∂x_i^L/∂w_ki^L) = δ_i^L · (∂x_i^L/∂w_ki^L) = δ_i^L · y_k^{L-1},
i = (0, ..., N_L), k = (0, ..., N_{L-1}), (3)</p>
      <p>where</p>
      <p>δ_i^L = (∂E/∂y_i^L)(∂y_i^L/∂x_i^L); (4)</p>
      <p>∂E/∂y_i^L = y_i^L − d_i; (5)</p>
      <p>∂y_i^L/∂x_i^L = y_i^L (1 − y_i^L); (6)</p>
      <p>d_i – the required value of the i-th output.</p>
      <p>If, however, SoftMax is selected as the activation function of the neurons, then in (4) one should
take into account that, besides ∂y_i^L/∂x_i^L = y_i^L (1 − y_i^L), the cross derivatives
∂y_i^L/∂x_j^L = −y_i^L y_j^L (i ≠ j) are non-zero, so that
δ_i^L = Σ_{j=1}^{N_L} (∂E/∂y_j^L)(∂y_j^L/∂x_i^L), i = (1, ..., N_L).</p>
      <p>Similarly, one can obtain the procedure for adjusting the offsets b_i^L of the neurons of the fully
connected layer: ∂E/∂b_i^L = (∂E/∂y_i^L)(∂y_i^L/∂x_i^L)(∂x_i^L/∂b_i^L) = δ_i^L · (∂x_i^L/∂b_i^L) = δ_i^L,
since ∂x_i^L/∂b_i^L = 1.</p>
      <p>When training the neurons of the other (hidden, l-th, l = 1, ..., L − 1) layers, it is necessary to
calculate the gradients
∂E/∂y_k^l = Σ_{i=1}^{N_{l+1}} δ_i^{l+1} · (∂x_i^{l+1}/∂y_k^l) = Σ_{i=1}^{N_{l+1}} δ_i^{l+1} · w_ki^{l+1}, (7)
where w_ki^{l+1} is the weight of the connection between neuron k of the current (hidden) layer and
neuron i of the next layer.</p>
      <p>When the cross entropy is chosen as the loss function, the training procedure for the output (fully
connected) layer of the CNN takes the form</p>
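      <p>The output-layer gradients for the quadratic loss can be checked numerically. The sketch below is an illustrative example assuming E = 0.5·Σ(y − d)², which gives ∂E/∂y = y − d as in (5); it compares the analytic gradient with central finite differences:</p>
      <p>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def analytic_grad_w(w, y_prev, d):
    """Analytic gradient of E = 0.5 * sum((y - d)^2) for a sigmoid output layer:
    dE/dw[k, i] = delta[i] * y_prev[k], delta[i] = (y[i] - d[i]) * y[i] * (1 - y[i])."""
    y = sigmoid(y_prev @ w)              # forward pass of the output layer
    delta = (y - d) * y * (1.0 - y)      # combined error-times-derivative term
    return np.outer(y_prev, delta)       # one column per output neuron

def numeric_grad_w(w, y_prev, d, h=1e-6):
    """Central finite differences for the same gradient, as an independent check."""
    g = np.zeros_like(w)
    for k in range(w.shape[0]):
        for i in range(w.shape[1]):
            wp, wm = w.copy(), w.copy()
            wp[k, i] += h
            wm[k, i] -= h
            Ep = 0.5 * np.sum((sigmoid(y_prev @ wp) - d) ** 2)
            Em = 0.5 * np.sum((sigmoid(y_prev @ wm) - d) ** 2)
            g[k, i] = (Ep - Em) / (2.0 * h)
    return g
```
      </p>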
      <p> (k 1)  (k)  C( (k)),   (8) 
where   (w, b) .</p>
      <p>Calculating the partial derivatives with respect to the tunable parameters, we have (here it is taken
into account that for the output layer yiL  fiL )
wC 
C
wiLk </p>
      <p>yCiL wyiiLLk  fkL1( fkL1  di );  
bC 
C
biL
 C yiL  fiL  di ;  </p>
      <p>yiL biL
 f C  fCjL   fjL  in1 di ln fiL    dfjLj  fiL  di.  
When calculating the gradient, the following derivatives are used:
  2
 ex Lj  ex Lj 
fiL   kn1exkL   kn1exkL  i  j;   fiL (1L fLiL ),i  j;  
x Lj   enxiL ex Lj 2 i  j;   fi f j
  exkL </p>
      <p>
  k1 
C
xiL
    n d j ln fiL   n C f jL  yiL  diL.  </p>
      <p>xiL  j1  j1 f jL xiL
6.3.</p>
      <p>Training neurons in the subsampling layer  
(9) 
(10) 
(11) 
(12) 
(13) 
(14) 
The backpropagation algorithm is also used to train the neurons of this layer.</p>
      <p>Core is a filter that slides over the entire image and finds its features anywhere, i.e. provides
invariance to displacements.</p>
      <p>
        Formula for updating the convolution kernel walb has the form [
        <xref ref-type="bibr" rid="ref16 ref17 ref18">16–18</xref>
        ]
E
walb
   E yilj  y(lis1a)( jsb) ,a  (,...,)b  (,...,).  
      </p>
      <p>i j yilj xilj
Only one offset can be used for one feature map bl , which is “connected” with all the elements of
The peculiarity of this layer is that it sits in front of both fully connected and convolutional layers.
In the first case, it has the same neurons and connections as a fully connected one. Therefore  is
calculated in the same way as in a hidden layer. If the subsampling layer is in front of convolutional
layer,  is computed by the inverse convolution method using the rotation by 1800 rot180o  ilj  [16–
6.4.</p>
      <p>Training of neurons of the convolutional layer  
(15) 
E
yilj1
this map. Accordingly, when correcting the value of this offset, all values from the map obtained
during the back propagation of the error should be taken into account. In this case (when using one
displacement for one feature map) we have
E
bl
  
i j yilj xilj bl
E yilj xilj   
i j yilj xilj
E yilj .  </p>
      <p>As an alternative, one can take as many offsets bilj for a separate feature map, as many elements
parameters of the convolution kernels themselves). For this case</p>
      <p>To train neurons in the convolution layer, the gradient procedure uses the derivative
calculated as follows:
E
yilj1
  
i j yilj xilj yilj1
E yilj xilj   
E yilj  w(isi)( js j).  </p>
      <p>l
i j yilj xilj
(16) </p>
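      <p>The cross-entropy gradient ∂C/∂x^L = f^L − d derived above for the SoftMax output can also be verified numerically (an illustrative check with an arbitrary input vector and a one-hot target):</p>
      <p>
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(f, d):
    return -np.sum(d * np.log(f))

x = np.array([0.5, -1.2, 2.0])       # arbitrary pre-activations of the output layer
d = np.array([0.0, 0.0, 1.0])        # one-hot target vector
f = softmax(x)

# Analytic gradient: dC/dx = f - d
analytic = f - d

# Central finite-difference approximation of the same gradient
h = 1e-6
numeric = np.array([
    (cross_entropy(softmax(x + h * e), d) - cross_entropy(softmax(x - h * e), d)) / (2.0 * h)
    for e in np.eye(3)
])
```
      </p>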
    </sec>
    <sec id="sec-15">
      <title>7. Histogram of oriented gradients (HOG) </title>
      <p>
        Histogram of Oriented Gradients (HOG) is a method for describing characteristic points of an
image, based on counting the occurrences of gradient orientations in the neighborhood of these
points. The basic idea of the algorithm is the assumption that the appearance and shape of an object
in an image region can be described by the distribution of intensity gradients or the direction of the
edges [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Such a description is carried out by dividing the image into cells and constructing
histograms of directional gradients of cell pixels. The result of the algorithm is a descriptor, which
includes a combination of the obtained histograms.
      </p>
      <p>The HOG algorithm consists of the following stages:
1. Gradient calculation.
2. Calculation of histograms for the image cells.
3. Forming and normalizing descriptor blocks.
4. Classification of the HOG descriptors using supervised learning.</p>
      <p>The result of the classifier's work is two sets of support-vector weights, positive and negative.
Positive weights mean that the features belong to the target class, and negative weights mean that the
features belong to the background.</p>
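      <p>Stages 1 and 2 of the HOG algorithm can be sketched as follows (an illustrative single-channel version; the cell size and number of orientation bins are the commonly used defaults, and block normalization and classification are omitted):</p>
      <p>
```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """HOG stages 1-2: image gradients, then per-cell histograms of gradient
    orientation (unsigned, 0-180 degrees) weighted by gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))        # gradients along rows, columns
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    H, W = img.shape
    hists = np.zeros((H // cell, W // cell, bins))
    bin_w = 180.0 / bins
    for ci in range(H // cell):
        for cj in range(W // cell):
            m = mag[ci*cell:(ci+1)*cell, cj*cell:(cj+1)*cell].ravel()
            a = ang[ci*cell:(ci+1)*cell, cj*cell:(cj+1)*cell].ravel()
            idx = np.minimum((a // bin_w).astype(int), bins - 1)
            for k in range(bins):
                hists[ci, cj, k] = m[idx == k].sum()
    return hists
```
      </p>
      <p>Concatenating the normalized cell histograms over overlapping blocks yields the final descriptor that is fed to the classifier.</p>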
    </sec>
    <sec id="sec-16">
      <title>8. SSD network </title>
      <p>The SSD method is based on such CNN architectures as Faster R-CNN and YOLO; however, the
authors of [20] took their shortcomings into account, thanks to which SSD could achieve new peaks
of accuracy and speed. The method is based on a feed-forward convolutional network, which creates
a finite set of bounding rectangular frames and quantitative estimates of the presence of objects of
various classes in these frames, after which non-maximum suppression is performed. The structure of
the network is shown in Fig. 1 and can be divided into four functional parts:
1. Network input. It accepts a three-channel (color) image with a size of 300×300.</p>
      <p>2. Backbone network. Some standard image classification architecture is usually used as the
backbone network, in this case VGG-16. However, the fully connected layers of VGG-16 are not
used in SSD. This architecture is described in detail in [20].</p>
      <p>3. Layer of additional features. These are convolutional layers and subsampling layers that
generate feature maps of different resolution for detecting objects of different scales.</p>
      <p>4. At this stage operations are performed to merge the outputs of various layers and suppress
maximums to obtain the output of the network.
The SSD method has the following key features:
1. Detection of objects of various scales. Several convolutional layers of different sizes are
sequentially added to the base network, which make it possible to obtain maps of signs of different
resolutions and, accordingly, to predict objects at different scales (unlike, for example, YOLO, which
receives only one map).</p>
      <p>2. Convolutional predictors for detection. Each feature layer generates a finite set of predictions
about the class and position of the object using a set of convolutional filters. Each such convolutional
filter produces either a score for belonging to a class or an offset relative to a default frame.</p>
      <p>3. Default bounding frames. The default set of frames is linked to the feature maps of different
resolutions described in point 1: each cell of a feature map corresponds to a frame from the default
set. For each cell of the feature map, scores are determined for each class of objects, as well as four
offsets of the default frame relative to its initial position.</p>
      <p>Combining these properties allowed efficiently sampling many different forms of the resulting
bounding frames, which had a positive impact on the speed of the network.</p>
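      <p>The non-maximum suppression step mentioned above can be sketched as follows (a minimal greedy version; the IoU threshold of 0.5 is an illustrative assumption):</p>
      <p>
```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union of one box [x1, y1, x2, y2] against many boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring frame and
    discard every remaining frame whose IoU with it exceeds the threshold."""
    order = np.argsort(scores)[::-1]     # indices sorted by decreasing score
    keep = []
    while order.size:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        suppressed = iou(boxes[best], boxes[rest]) > iou_threshold
        order = rest[~suppressed]
    return keep
```
      </p>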
    </sec>
    <sec id="sec-17">
      <title>9. MTCNN network </title>
      <p>
        Of the neural network approaches to face detection, the Multi-task Cascaded CNN (MTCNN) is
especially significant. This network is described in detail in [
        <xref ref-type="bibr" rid="ref20">21</xref>
        ].
      </p>
      <p>10. RetinaFace Network</p>
      <p>
        The architecture of the RetinaFace network [
        <xref ref-type="bibr" rid="ref21">22</xref>
        ] is shown in Fig. 2. This network is a single-stage
face detector that detects faces through joint supervised and self-supervised multi-task learning.
The network has several features that improve the face detection performance of the ArcFace system
and surpass most network architectures in the vast majority of tests for such systems:
- RetinaFace uses five additional key points on faces to dramatically improve face recognition
results;
      </p>
      <p>- RetinaFace has a self-supervised mesh decoder branch that predicts per-pixel 3D face shape
information in parallel with the existing supervised branches;</p>
      <p>- By utilizing lightweight backbones, RetinaFace delivers performance that allows it to be run on
the CPU, while most similar systems achieve such performance solely using the GPU.
</p>
      <p>Table 1. System performance results</p>
      <p>Method HOG: fr128d 86.67%; arc512d 87%</p>
      <p> Acknowledgements </p>
      <p>The European Commission's support for the production of this publication does not constitute an
endorsement of the contents, which reflect the views only of the authors, and the Commission cannot
be held responsible for any use which may be made of the information contained therein.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Aggregate channel features for multi-view face detection</article-title>
          , in IEEE International Joint Conference on Biometrics,
          <year>2014</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <source>Detecting Face with Densely Connected Face Proposal Network</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranjan</surname>
          </string-name>
          , et al.,
          <article-title>Deep Learning for Understanding Faces: Machines May Be Just as Good, or Better, than Humans</article-title>
          .
          <source>Signal Processing Magazine</source>
          , IEEE,
          <year>2018</year>
          .
          <volume>35</volume>
          (
          <issue>1</issue>
          ), p.
          <fpage>66</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          et al.,
          <article-title>Handwritten digit recognition with a back-propagation network</article-title>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.A.</given-names>
            <surname>Rowley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Baluja</surname>
          </string-name>
          , T. Kanade,
          <article-title>Neural network-based face detection</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence</source>
          , IEEE Transactions on,
          <year>1998</year>
          .
          <volume>20</volume>
          (
          <issue>1</issue>
          ), p.
          <fpage>23</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.,
          <source>WIDER FACE: A Face Detection Benchmark</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kemelmacher</surname>
          </string-name>
          , et al.,
          <source>The MegaFace Benchmark: 1 Million Faces for Recognition at Scale</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>4873</fpage>
          -
          <lpage>4882</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.B.</given-names>
            <surname>Huang</surname>
          </string-name>
          , et al.,
          <article-title>Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments</article-title>
          , in Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition
          , Marseille, France,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Funnel-structured cascade for multi-view face detection with alignment-awareness</article-title>
          . Neurocomputing, v.
          <volume>221</volume>
          ,
          <year>2017</year>
          , p.
          <fpage>138</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever, G. Hinton,
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <year>2017</year>
          .
          <volume>60</volume>
          (
          <issue>6</issue>
          ), p.
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Parashar</surname>
          </string-name>
          , et al.,
          <article-title>SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks</article-title>
          . ACM SIGARCH Computer Architecture News, v.
          <volume>45</volume>
          (
          <issue>2</issue>
          ),
          <year>2017</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          .
          <source>ArcFace: Additive Angular Margin Loss for Deep Face Recognition</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <source>Derivation of Backpropagation in Convolutional Neural Network (CNN)</source>
          ,
          <source>Technical Report</source>
          , University of Tennessee, Knoxville, TN, October 18,
          <year>2016</year>
          . URL: http://web.eecs.utk.edu/~zzhang61/docs/reports/2016.pdf
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Y. Bengio,
          <article-title>Convolutional networks for images, speech, and timeseries</article-title>
          ,
          <source>The Handbook of Brain Theory and Neural Networks</source>
          ,
          <year>1995</year>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14a">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Rudenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bezsonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Oliinyk</surname>
          </string-name>
          ,
          <article-title>First-Order Optimization (Training) Algorithms in Deep Learning</article-title>
          ,
          <source>Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2020), Volume I: Main Conference</source>
          , Lviv, Ukraine, April 23-24,
          <year>2020</year>
          , pp.
          <fpage>921</fpage>
          -
          <lpage>935</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>Deep Learning for Image Processing Applications</article-title>
          , Ed.
          <source>Hemanth DY</source>
          ,
          <string-name>
            <surname>Estrella</surname>
            <given-names>VV</given-names>
          </string-name>
          ,
          <year>2017</year>
          , 273 p. URL: http://ebooks.iospress.nl/volume/deep
          <article-title>-learning-for-image-processing-applications</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <article-title>Deep learning tutorial</article-title>
          , Stanford University. Autoencoders. URL: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>Convolutional network in python. Part 2. Derivation of formulas for training the model</article-title>
          . URL: https://habr.com/ru/company/ods/blog/344116/
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            <surname>Jones</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Snow</surname>
          </string-name>
          ,
          <article-title>Detecting pedestrians using patterns of motion and appearance</article-title>
          , The 9-th ICCV, Nice, France, v.
          <volume>1</volume>
          ,
          <year>2003</year>
          , pp.
          <fpage>734</fpage>
          -
          <lpage>741</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          , C.-Y. Fu,
          <source>SSD: Single Shot MultiBox Detector</source>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1512.02325
        </mixed-citation>
      </ref>
      <ref id="ref19a">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          . URL: https://arxiv.org/abs/1409.1556v6
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          , Joint Face Detection and
          <article-title>Alignment using Multi-task Cascaded Convolutional Networks</article-title>
          . URL: https://arxiv.org/pdf/1604.02878
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kotsia</surname>
          </string-name>
          , S. Zafeiriou,
          <article-title>RetinaFace: Single-stage Dense Face Localisation in the Wild</article-title>
          . URL: https://arxiv.org/abs/1905.00641
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>