<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Face Recognition in the Presence of Non‐Gaussian Noise </article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleg Rudenko</string-name>
          <email>oleh.rudenko@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksandr Bezsonov</string-name>
          <email>oleksandr.bezsonov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denys Yakovliev</string-name>
          <email>denys.yakovliev@hneu.net</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Nauky Ave. 14, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Simon Kuznets Kharkiv National University of Economics</institution>
          ,
          <addr-line>Nauky Ave. 9, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A method is proposed for detecting a face and recognizing a person or a group of persons in pictures or videos that can be corrupted by non-Gaussian noise, using different architectures of convolutional neural networks. The technical details of building a deep learning-based face recognition system are discussed. The test results confirm the prospects of using the developed method for solving face detection and recognition tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional neural network</kwd>
        <kwd>learning algorithm</kwd>
        <kwd>facial boundaries</kwd>
        <kwd>marker</kwd>
        <kwd>measure of similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction </title>
      <p>Visual pattern recognition is one of the most important components of information management
and processing systems, automated systems and decision-making systems. Tasks related to the
classification and identification of objects, phenomena and signals, characterized by a finite set of
certain properties and characteristics arise in areas such as robotics, information retrieval, monitoring
and analysis of visual data, and are the subject of artificial intelligence. The main task of artificial
intelligence is to build intelligent information systems whose effectiveness in solving informal
problems matches or exceeds human capabilities.</p>
      <p>When interacting with other people, a person's face is an important source of information. Facial
expressions, gestures during conversation, and head movements are a convenient and natural way to
convey information. The computer's inability, on the one hand, to perceive and, on the other hand, to
reproduce natural human ways of communication complicates the transmission and perception of
information when working with a computer. To achieve computer recognition of head movements
and facial expressions, it is necessary to implement robust algorithms for the analysis and
classification of digital images of human faces, keeping in mind that they can be corrupted by noisy
pixels.</p>
      <p>Such algorithms can perform a wide range of commercial and non-commercial tasks. To date,
there is no algorithm that could identify a human face with 100% accuracy. However, it is already
possible to implement programs for detecting faces, which can be successfully used in education,
medicine, security and safety, computer games, advertising and other areas.</p>
      <p>The purpose of this work is the development and testing of a method for detecting faces and
recognizing a person or a group of persons in pictures and videos that can be corrupted by
non-Gaussian noise, based on the use of convolutional neural networks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Recognition stages </title>
      <p>It is generally assumed that full face recognition consists of four main steps:
1. Face detection
2. Preliminary processing
3. Feature extraction
4. Classification</p>
      <p>
        The first step is to detect faces in the image, regardless of scale and location. To do this, an
advanced filtering procedure is often used to distinguish locations representing faces and filter them
using some classifiers. It is noteworthy that all changes in displacement, scaling and rotation must be
processed during the face detection phase. Note that facial expressions and changing hairstyles or
smiling and frowning faces still present difficulties at the stage of pattern recognition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        While the accuracy and speed of face detection systems have improved, two major challenges
remain. Face detectors must cope with large and complex variations of facial appearance and
effectively distinguish between faces and non-faces in unconstrained environments. In addition, large
differences in face position and size in a large search space create problems that reduce detection
efficiency [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This requires a trade-off between high detection accuracy and computational
efficiency of the detection procedure.
      </p>
      <p>In the next step, a system based on an anthropometric dataset predicts the approximate location of
key features such as eyes, nose and mouth. The entire procedure is repeated to predict small features
in relation to major features and is checked with collocation statistics to reject any misplaced features.</p>
      <p>Anchor points are generated as a result of geometric combinations on the face image, and then
the actual recognition process begins by finding a local representation of the facial appearance at each
of the anchor points. The representation scheme depends on the approach used.</p>
      <p>
        Feature extraction usually occurs immediately after face detection and can be considered one of
the most important steps in face recognition systems, since their effectiveness depends on the quality
of the extracted features. This is because the facial landmarks are identified by a given network that
determines how accurately features are represented. Traditional landmark locators are model-based,
while many recent methods are based on cascade regression [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>After feature extraction, face recognition is performed. The classification stage itself is often
performed using a general nearest-neighbor technique, while more specific algorithms are
developed for the preprocessing stage. Preprocessing usually consists of cropping, zooming, and
aligning the detected faces. While cropping and zooming require no sophisticated techniques,
alignment is not an easy problem to solve. In practice, alignment consists in detecting a
series of landmarks on the face (nose, eyes, etc.) and then transforming the face pictures so that
the position of these landmarks is constant.</p>
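      <p>The alignment step described above can be sketched as follows. This is an illustrative example, not the method used in this work: the canonical eye coordinates are arbitrary assumptions, and only two landmarks are used, whereas practical systems typically use five or more:</p>
      <p>
```python
import numpy as np

def align_by_eyes(left_eye, right_eye, canonical_left=(30.0, 40.0), canonical_right=(70.0, 40.0)):
    """Build a 2x3 similarity transform (rotation + uniform scale + translation)
    that maps the detected eye centers onto fixed canonical positions."""
    le, re = np.asarray(left_eye, float), np.asarray(right_eye, float)
    cl, cr = np.asarray(canonical_left, float), np.asarray(canonical_right, float)
    src_vec, dst_vec = re - le, cr - cl
    scale = np.linalg.norm(dst_vec) / np.linalg.norm(src_vec)
    angle = np.arctan2(dst_vec[1], dst_vec[0]) - np.arctan2(src_vec[1], src_vec[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    t = cl - R @ le                      # translation that pins the left eye
    return np.hstack([R, t[:, None]])    # 2x3 affine matrix

def apply_affine(M, pts):
    """Apply the 2x3 affine matrix to an array of 2-D points."""
    pts = np.asarray(pts, float)
    return pts @ M[:, :2].T + M[:, 2]
```
      </p>
      <p>Applying the returned matrix to the whole image (with any warping routine) places both eyes at the chosen canonical positions, so subsequent feature extraction sees faces in a constant pose.</p>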
    </sec>
    <sec id="sec-3">
      <title>3. Convolutional neural networks for face recognition </title>
      <p>Significant progress has been made in face detection and recognition through the use of
convolutional neural networks (CNN).</p>
      <p>
        The first CNN framework, known as LeNet-5, was developed in 1990 to classify handwritten
digits by recognizing visual patterns in image pixels without the need for preprocessing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a
neural network used for vertical, frontal face recognition in grayscale images was presented for the
first time, which, although quite simple compared with today's architectures, was comparable in
accuracy to the best methods of its time.
      </p>
      <p>Face recognition differs from object recognition in that it requires alignment before feature
extraction; this is reflected in the differences between CNNs designed for face recognition and those
used for object recognition.</p>
      <p>Because of the proven ability of deep convolutional neural network (DCNN)-based systems to
surpass human performance in face verification tasks, research in this area has skyrocketed.</p>
      <p>
        Face detection and recognition research is currently focusing on DCNNs, which have
demonstrated impressive accuracy on highly complex databases such as the WIDER FACE dataset
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and MegaFace Challenge [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], as well as older databases such as Labeled Faces in the Wild (LFW)
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The growth in deep neural network research has been accompanied by the emergence of many
deep learning frameworks, including Caffe, TensorFlow, Torch, MXNet, and Theano, which use
platforms such as CUDA and libraries such as cuDNN. These frameworks can be used from a number
of programming languages such as C++, Python, or Matlab.</p>
      <p>
        Rapid progress in this direction has been driven by the increase in the availability of powerful
GPUs and improvements in CNN's architecture for real-world applications. In addition, the
development of large annotated datasets and a better understanding of nonlinear mapping between
input images and class labels have contributed to increased research interest in these networks.
DCNNs are very efficient due to their ability to approximate non-linear functions. Their significant
disadvantage is high computational costs due to the presence of intensive convolution and nonlinear
operations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, DCNNs are expected to have a bright future, and are currently being developed
by such large corporations as Google, Facebook and Microsoft [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Using CNN in face recognition tasks consists of two main stages: training and inference. Training
is a global optimization process that involves identifying network parameters by observing huge
datasets. Inference essentially involves deploying a trained CNN to classify observed data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
training process includes minimizing the loss function to determine the parameters of the network
and the number of layers required, and the organization of connections between the layers.
      </p>
      <p>
        CNN face recognition systems are characterized by the training data used to build the model, the
network architecture and settings, and the type of the loss function [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. DCNNs have the ability to
learn highly discriminatory and invariant representations of objects when trained on very large
datasets. Training is reduced to the application of optimization algorithms to minimize the loss
function. In this case, the role of the loss function is to determine the forecast error, and its choice
depends on the problem being solved (regression, classification, etc.). Different loss functions
produce different error values for an identical forecast and thus largely determine the network
performance. To minimize the error, the backpropagation algorithm is used, and the weights are
adjusted (corrected) by some optimization algorithm.
      </p>
      <p>Modern face recognition systems that use DCNN implement deep feature extraction and similarity
comparison, which involves conversion of test images to deep representations and computation of
Euclidean or cosine distance.</p>
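      <p>The similarity comparison mentioned above can be illustrated with a short sketch (the embedding vectors and the decision threshold here are illustrative assumptions; in practice the threshold is tuned on a validation set):</p>
      <p>
```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two deep embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_same_person(emb1, emb2, threshold=0.5):
    # Threshold is illustrative; real systems calibrate it per model.
    return cosine_similarity(emb1, emb2) >= threshold
```
      </p>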
    </sec>
    <sec id="sec-4">
      <title>4. Convolutional neural network architecture </title>
      <p>Initially, the structure of the convolutional neural network was created taking into account the
structural features of some parts of the human brain that are responsible for vision. The development
of such networks is based on three mechanisms:
- local perception;
- formation of layers in the form of a set of feature maps (shared weights);
- subsampling.</p>
      <p>According to these mechanisms, three main types of layers are used to build a convolutional
neural network: convolution, pooling (or subsampling), and fully connected layers.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Convolution Layer </title>
      <p>
        The convolution formula for the l-th (l = 1, ..., L) layer of the network, which looks like [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
x_ij^l = Σ_a Σ_b w_ab^l · y_{(is+a)(js+b)}^{l-1} + b^l, (1)
reflects the movement of the kernel w^l over the input image or feature map y^{l-1} of this layer.
Here y_ij^{l-1} = f(x_ij^{l-1}) is the image after the (l−1)-th layer; f(·) is the activation function used;
b^l is the offset (bias); i, j, a, b are the indices of the elements in the matrices; s is the size of the
convolution step (stride).
      </p>
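      <p>Equation (1) can be written directly in code. The following sketch is a naive single-channel implementation for illustration only; real frameworks use heavily optimized routines:</p>
      <p>
```python
import numpy as np

def conv2d(y_prev, w, b, s=1):
    """Direct implementation of eq. (1):
    x[i, j] = sum over a, b of w[a, b] * y_prev[i*s + a, j*s + b], plus the bias b."""
    H, W = y_prev.shape
    kh, kw = w.shape
    out_h = (H - kh) // s + 1
    out_w = (W - kw) // s + 1
    x = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the current window, then sum
            x[i, j] = np.sum(w * y_prev[i*s:i*s+kh, j*s:j*s+kw]) + b
    return x
```
      </p>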
      <p>As can be seen from (1), the convolution operation is performed for each element i, j of the image
matrix x^l.</p>
      <p>Convolution preserves spatial relationships between pixels.</p>
      <p>Each convolutional layer is followed by a downsampling (subsampling) layer, or a computational
layer, which serves to reduce the image dimension by local averaging of the neuron output values.</p>
    </sec>
    <sec id="sec-6">
      <title>4.2. Subsampling layer (pooling, MAX‐pooling) </title>
      <p>The subsampling layer reduces the scale of the planes by locally averaging the neuron outputs.
Thus, a hierarchical organization is achieved. Subsequent layers extract more general characteristics
less dependent on image distortion.</p>
      <p>The pooling layer is described by the expression</p>
      <p>x_ij^l = β^l · down(y_{(is+a)(js+b)}^{l-1}) + b^l, (2)</p>
      <p>where down(·) is the pooling function. This function aggregates blocks of the input image, thus
reducing the dimension of the output image of the layer. In this case, any output map is set by two
offset parameters: the multiplicative β^l and the additive b^l.</p>
      <p>The difference between the convolution layer and the subsampling layer is that in the convolution
layer the regions of neighboring neurons overlap, which does not happen in the subsampling layer.</p>
      <p>The pooling layer operates independently of the slice depth of the input data and scales the volume
spatially using a maximum function.</p>
      <p>In the architecture of the convolutional network, it is generally accepted that the presence of a
feature is more important than information about its location. Therefore, from several neighboring
neurons in the feature map, the maximum is selected and its value is considered one neuron in the
feature map of a lower dimension.</p>
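      <p>The MAX-pooling operation described above can be sketched as follows (a minimal single-channel version for illustration):</p>
      <p>
```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max-pooling: keep only the strongest activation in each size x size block."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out
```
      </p>
      <p>Replacing <code>.max()</code> with <code>.mean()</code> gives the averaging subsampling variant mentioned below.</p>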
      <p>In addition to the maximum subsampling, pooling layers can perform other functions, such as
averaging subsampling or even L2-normalized subsampling.</p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Non‐linear activation layer </title>
      <p>In these layers, a non-linear activation function f(·) is applied to all input values, and the result is
sent to the output. Thus, the activation layer does not change the size of the input.</p>
      <p>Usually, due to its significant positive properties, ReLU (ReLU(x) = max(0, x)) and its various
modifications (Leaky ReLU, Parametric ReLU, Randomized ReLU) are used for the hidden layers,
while for the fully connected layer the SoftMax function f_j^L = e^{x_j^L} / Σ_{i=1}^{N_L} e^{x_i^L}
is used (when solving classification problems), or a linear function (for regression problems).</p>
    </sec>
    <sec id="sec-8">
      <title>4.4. Dropout layer </title>
      <p>Various regularization methods are used to avoid overfitting the network.</p>
      <p>Dropout is a simple and effective regularization method: during training, a subnet is repeatedly
randomly selected from the aggregate network topology, i.e. some neurons are switched off, and the
next update of the weights occurs only within the selected subnet. Thus, the weights change only in
the remaining neurons. Each neuron is dropped from the aggregate network with a certain
probability, which is called the dropout rate.</p>
      <p>This layer reduces the time of one training epoch due to the smaller number of parameters to be
optimized, and also makes it possible to deal with network overfitting better than standard
regularization methods.</p>
    </sec>
    <sec id="sec-9">
      <title>4.5. Normalizing layer </title>
      <p>On this layer, standard normalization of the inputs is performed (the sample mean of their values
is subtracted, and the result is divided by the square root of the sample variance). The sample
statistics are calculated taking into account the values at the inputs of this layer at previous training
iterations. This approach increases the speed of network learning and improves the final result.</p>
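      <p>The normalization described above can be sketched as follows (a simplified batch-style version, without the learned scale and shift parameters and running statistics used in full batch normalization):</p>
      <p>
```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Standard normalization over the batch axis: subtract the sample mean and
    divide by the square root of the sample variance (eps guards against division by zero)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)
```
      </p>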
    </sec>
    <sec id="sec-10">
      <title>4.6. Fully connected layer </title>
      <p>This layer is a conventional multilayer perceptron, the purpose of which is classification. It
simulates a complex nonlinear function, optimization of which improves the recognition quality.</p>
      <p>The neurons of each map of the previous subsampling layer are associated with one neuron of the
hidden layer. Thus, the number of neurons in the hidden layer is equal to the number of maps in the
subsampling layer.</p>
      <p>As in conventional neural networks, in a fully connected layer, neurons are connected to all
activations in the previous layer. Their activations can be calculated by multiplying matrices and
applying an offset.</p>
      <p>The difference between fully connected and convolutional layers is that the neurons of the
convolutional layer
1) are connected only to a local input region;
2) can share parameters.</p>
    </sec>
    <sec id="sec-11">
      <title>4.7. Choice of training criterion </title>
      <p>CNN training is an iterative process. Each iteration calculates the network outputs for one (or
more) samples in the training set and adjusts the network weights to reduce the error between the
actual network outputs y_{i,p}^L (i = 1, ..., N_L) and the target outputs d_{i,p} for a given sample p.
Therefore, training is reduced to minimizing some functional.</p>
      <p>In practice, a quadratic function, cross entropy, or some combined functional is used as a criterion
for the error function.</p>
      <p>A backpropagation neural network (BPNN) is a multi-layer feedforward neural network that uses
a supervised training algorithm known as the error backpropagation algorithm. Errors accumulated at
the output layer are propagated back through the network to correct the weights. A conventional
multilayer perceptron, which consists of three types of layers (input, hidden and output), performs no
backward computation other than the operations used in training; during simulation, all operations
are performed in the forward direction.</p>
    </sec>
    <sec id="sec-12">
      <title>5. Training of convolutional neural network </title>
      <p>CNN training is similar to the training of any feedforward ANN and consists in correcting its
weight parameters based on minimizing some selected loss function, usually a quadratic function or
cross entropy.</p>
      <p>
        To train convolutional neural networks, both the standard backpropagation method and its various
modifications can be used. The derivation of the backpropagation algorithm for CNN training is
discussed in sufficient detail in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This method is based on the stochastic gradient descent
algorithm (SGD).
      </p>
      <p>
        In practice, learning algorithms based on the gradient descent method (the error backpropagation
method) and its modifications are most commonly used: SGD, Adam,
AdaGrad, AdaDelta, etc. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
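      <p>For illustration, minimal sketches of the SGD and Adam update rules are given below (the learning-rate values and the quadratic test loss are illustrative assumptions, not settings from this work):</p>
      <p>
```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Plain (stochastic) gradient descent update: theta = theta - lr * grad."""
    return theta - lr * grad

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected moving averages of the gradient (m)
    and of its element-wise square (v) adapt the step size per parameter."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimizing the toy loss J(theta) = theta^2, whose gradient is 2 * theta, with SGD:
theta = 5.0
for _ in range(100):
    theta = sgd_step(theta, 2.0 * theta)
```
      </p>
      <p>On this toy quadratic loss, SGD with lr = 0.1 contracts theta by a factor of 0.8 per step, so it quickly approaches the minimum at zero.</p>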
    </sec>
    <sec id="sec-13">
      <title>6. Training neurons of the output (fully connected) layer  </title>
      <p>To train the neurons of this layer, a gradient backpropagation algorithm is used</p>
      <p> (k 1)  (k)  J ( (k)),  
where  – network parameters (elements of weight matrices, displacements, angles of inclination
of activation functions, etc.) ; J ( (k )) – objective function (loss function E or C);  – the parameter
of the learning rate.
6.1.</p>
    </sec>
    <sec id="sec-14">
      <title>6.1. Training neurons when choosing a quadratic function E </title>
      <p>If the quadratic function E is chosen as the network loss function, and the sigmoidal function is
chosen as the activation function of the neurons, then the following equations are used to adjust the
weights matrix of the L-th layer:</p>
      <p>∂E/∂w_ki^L = (∂E/∂y_i^L)(∂y_i^L/∂x_i^L)(∂x_i^L/∂w_ki^L) = δ_i^L · (∂x_i^L/∂w_ki^L) = δ_i^L · y_k^{L-1},
i = (0, ..., N_L), k = (0, ..., N_{L-1}), (3)</p>
      <p>where</p>
      <p>δ_i^L = (∂E/∂y_i^L)(∂y_i^L/∂x_i^L); (4)</p>
      <p>∂E/∂y_i^L = y_i^L − d_i; (5)</p>
      <p>∂y_i^L/∂x_i^L = y_i^L (1 − y_i^L); (6)</p>
      <p>d_i – the required value of the i-th output.</p>
      <p>If, however, SoftMax is selected as the activation function of the neurons, then in (4) one should
take into account that, besides ∂y_i^L/∂x_i^L = y_i^L (1 − y_i^L), the cross derivatives
∂y_i^L/∂x_j^L = −y_i^L y_j^L (i ≠ j) are non-zero, so that
δ_i^L = Σ_{j=1}^{N_L} (∂E/∂y_j^L)(∂y_j^L/∂x_i^L), i = (1, ..., N_L).</p>
      <p>Similarly, one can obtain the procedure for adjusting the offsets b_i^L of the neurons of the fully
connected layer: ∂E/∂b_i^L = (∂E/∂y_i^L)(∂y_i^L/∂x_i^L)(∂x_i^L/∂b_i^L) = δ_i^L · (∂x_i^L/∂b_i^L) = δ_i^L,
since ∂x_i^L/∂b_i^L = 1.</p>
      <p>When training the neurons of the other (hidden, l-th, l = 1, ..., L − 1) layers, it is necessary to
calculate the gradients
∂E/∂y_k^l = Σ_{i=1}^{N_{l+1}} δ_i^{l+1} · (∂x_i^{l+1}/∂y_k^l) = Σ_{i=1}^{N_{l+1}} δ_i^{l+1} · w_ki^{l+1}, (7)
where w_ki^{l+1} is the weight of the connection between neuron k of the current (hidden) layer and
neuron i of the next layer.</p>
      <p>When the cross entropy is chosen as the loss function, the training procedure for the output (fully
connected) layer of the CNN takes the form</p>
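      <p>The output-layer gradients for the quadratic loss can be checked numerically. The sketch below is an illustrative example assuming E = 0.5·Σ(y − d)², which gives ∂E/∂y = y − d as in (5); it compares the analytic gradient with central finite differences:</p>
      <p>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def analytic_grad_w(w, y_prev, d):
    """Analytic gradient of E = 0.5 * sum((y - d)^2) for a sigmoid output layer:
    dE/dw[k, i] = delta[i] * y_prev[k], delta[i] = (y[i] - d[i]) * y[i] * (1 - y[i])."""
    y = sigmoid(y_prev @ w)              # forward pass of the output layer
    delta = (y - d) * y * (1.0 - y)      # combined error-times-derivative term
    return np.outer(y_prev, delta)       # one column per output neuron

def numeric_grad_w(w, y_prev, d, h=1e-6):
    """Central finite differences for the same gradient, as an independent check."""
    g = np.zeros_like(w)
    for k in range(w.shape[0]):
        for i in range(w.shape[1]):
            wp, wm = w.copy(), w.copy()
            wp[k, i] += h
            wm[k, i] -= h
            Ep = 0.5 * np.sum((sigmoid(y_prev @ wp) - d) ** 2)
            Em = 0.5 * np.sum((sigmoid(y_prev @ wm) - d) ** 2)
            g[k, i] = (Ep - Em) / (2.0 * h)
    return g
```
      </p>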
      <p> (k 1)  (k)  C( (k)),   (8) 
where   (w, b) .</p>
      <p>Calculating the partial derivatives with respect to the tunable parameters, we have (here it is taken
into account that for the output layer yiL  fiL )
wC 
C
wiLk </p>
      <p>yCiL wyiiLLk  fkL1( fkL1  di );  
bC 
C
biL
 C yiL  fiL  di ;  </p>
      <p>yiL biL
 f C  fCjL   fjL  in1 di ln fiL    dfjLj  fiL  di.  
When calculating the gradient, the following derivatives are used:
  2
 ex Lj  ex Lj 
fiL   kn1exkL   kn1exkL  i  j;   fiL (1L fLiL ),i  j;  
x Lj   enxiL ex Lj 2 i  j;   fi f j
  exkL </p>
      <p>
  k1 
C
xiL
    n d j ln fiL   n C f jL  yiL  diL.  </p>
      <p>xiL  j1  j1 f jL xiL
6.3.</p>
      <p>Training neurons in the subsampling layer  
(9) 
(10) 
(11) 
(12) 
(13) 
(14) 
The backpropagation algorithm is also used to train the neurons of this layer.</p>
      <p>Core is a filter that slides over the entire image and finds its features anywhere, i.e. provides
invariance to displacements.</p>
      <p>
        Formula for updating the convolution kernel walb has the form [
        <xref ref-type="bibr" rid="ref16 ref17 ref18">16–18</xref>
        ]
E
walb
   E yilj  y(lis1a)( jsb) ,a  (,...,)b  (,...,).  
      </p>
      <p>i j yilj xilj
Only one offset can be used for one feature map bl , which is “connected” with all the elements of
The peculiarity of this layer is that it sits in front of both fully connected and convolutional layers.
In the first case, it has the same neurons and connections as a fully connected one. Therefore  is
calculated in the same way as in a hidden layer. If the subsampling layer is in front of convolutional
layer,  is computed by the inverse convolution method using the rotation by 1800 rot180o  ilj  [16–
6.4.</p>
      <p>Training of neurons of the convolutional layer  
(15) 
E
yilj1
this map. Accordingly, when correcting the value of this offset, all values from the map obtained
during the back propagation of the error should be taken into account. In this case (when using one
displacement for one feature map) we have
E
bl
  
i j yilj xilj bl
E yilj xilj   
i j yilj xilj
E yilj .  </p>
      <p>As an alternative, one can take as many offsets bilj for a separate feature map, as many elements
parameters of the convolution kernels themselves). For this case</p>
      <p>To train neurons in the convolution layer, the gradient procedure uses the derivative
calculated as follows:
E
yilj1
  
i j yilj xilj yilj1
E yilj xilj   
E yilj  w(isi)( js j).  </p>
      <p>l
i j yilj xilj
(16) </p>
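      <p>The cross-entropy gradient ∂C/∂x^L = f^L − d derived above for the SoftMax output can also be verified numerically (an illustrative check with an arbitrary input vector and a one-hot target):</p>
      <p>
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(f, d):
    return -np.sum(d * np.log(f))

x = np.array([0.5, -1.2, 2.0])       # arbitrary pre-activations of the output layer
d = np.array([0.0, 0.0, 1.0])        # one-hot target vector
f = softmax(x)

# Analytic gradient: dC/dx = f - d
analytic = f - d

# Central finite-difference approximation of the same gradient
h = 1e-6
numeric = np.array([
    (cross_entropy(softmax(x + h * e), d) - cross_entropy(softmax(x - h * e), d)) / (2.0 * h)
    for e in np.eye(3)
])
```
      </p>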
    </sec>
    <sec id="sec-15">
      <title>7. Histogram of oriented gradients (HOG) </title>
      <p>
        Histogram of Oriented Gradients (HOG) is a method for describing characteristic points of an
image, based on counting the occurrences of gradient orientations in the neighborhood of these
points. The basic idea of the algorithm is the assumption that the appearance and shape of an object
in an image region can be described by the distribution of intensity gradients or the direction of the
edges [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Such a description is carried out by dividing the image into cells and constructing
histograms of directional gradients of cell pixels. The result of the algorithm is a descriptor, which
includes a combination of the obtained histograms.
      </p>
      <p>The HOG algorithm consists of the following stages:
1. Gradient calculation.
2. Calculation of histograms for the image cells.
3. Forming and normalizing descriptor blocks.
4. Classification of the HOG descriptors using supervised learning.</p>
      <p>The result of the classifier's work is two sets of support-vector weights, positive and negative.
Positive weights mean that the features belong to the target class, and negative weights mean that the
features belong to the background.</p>
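      <p>Stages 1 and 2 of the HOG algorithm can be sketched as follows (an illustrative single-channel version; the cell size and number of orientation bins are the commonly used defaults, and block normalization and classification are omitted):</p>
      <p>
```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """HOG stages 1-2: image gradients, then per-cell histograms of gradient
    orientation (unsigned, 0-180 degrees) weighted by gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))        # gradients along rows, columns
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    H, W = img.shape
    hists = np.zeros((H // cell, W // cell, bins))
    bin_w = 180.0 / bins
    for ci in range(H // cell):
        for cj in range(W // cell):
            m = mag[ci*cell:(ci+1)*cell, cj*cell:(cj+1)*cell].ravel()
            a = ang[ci*cell:(ci+1)*cell, cj*cell:(cj+1)*cell].ravel()
            idx = np.minimum((a // bin_w).astype(int), bins - 1)
            for k in range(bins):
                hists[ci, cj, k] = m[idx == k].sum()
    return hists
```
      </p>
      <p>Concatenating the normalized cell histograms over overlapping blocks yields the final descriptor that is fed to the classifier.</p>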
    </sec>
    <sec id="sec-16">
      <title>8. SSD network </title>
      <p>The SSD method is based on such CNN architectures as Faster R-CNN and YOLO; however, the
authors of [20] took their shortcomings into account, thanks to which SSD could achieve new peaks
of accuracy and speed. The method is based on a feed-forward convolutional network, which creates
a finite set of bounding rectangular frames and quantitative estimates of the presence of objects of
various classes in these frames, after which non-maximum suppression is performed. The structure of
the network is shown in Fig. 1 and can be divided into four functional parts:
1. Network input. It accepts a three-channel (color) image with a size of 300×300.</p>
      <p>2. Backbone network. Some standard image classification architecture is usually used as the
backbone network, in this case VGG-16. However, the fully connected layers of VGG-16 are not
used in SSD. This architecture is described in detail in [20].</p>
      <p>3. Layer of additional features. These are convolutional layers and subsampling layers that
generate feature maps of different resolution for detecting objects of different scales.</p>
      <p>4. At this stage operations are performed to merge the outputs of various layers and suppress
maximums to obtain the output of the network.
The SSD method has the following key features:
1. Detection of objects of various scales. Several convolutional layers of different sizes are
sequentially added to the base network, which make it possible to obtain maps of signs of different
resolutions and, accordingly, to predict objects at different scales (unlike, for example, YOLO, which
receives only one map).</p>
      <p>2. Convolutional predictors for detection. Each feature layer generates a finite set of predictions
about the class and position of the object using a set of convolutional filters. Each such convolutional
filter produces either a score for belonging to a class or an offset relative to a default frame.</p>
      <p>3. Default bounding frames. The default set of frames is linked to the feature maps of different
resolutions described in point 1: each cell of a feature map corresponds to a frame from the default
set. For each cell of the feature map, scores are determined for each class of objects, as well as four
offsets of the default frame relative to its initial position.</p>
      <p>Combining these properties allowed efficiently sampling many different forms of the resulting
bounding frames, which had a positive impact on the speed of the network.</p>
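      <p>The non-maximum suppression step mentioned above can be sketched as follows (a minimal greedy version; the IoU threshold of 0.5 is an illustrative assumption):</p>
      <p>
```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union of one box [x1, y1, x2, y2] against many boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring frame and
    discard every remaining frame whose IoU with it exceeds the threshold."""
    order = np.argsort(scores)[::-1]     # indices sorted by decreasing score
    keep = []
    while order.size:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        suppressed = iou(boxes[best], boxes[rest]) > iou_threshold
        order = rest[~suppressed]
    return keep
```
      </p>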
    </sec>
    <sec id="sec-17">
      <title>9. MTCNN network </title>
      <p>
        Of the neural network approaches to face detection, the Multi-task Cascaded CNN (MTCNN) is
especially significant. This network is described in detail in [
        <xref ref-type="bibr" rid="ref20">21</xref>
        ].
      </p>
      <p>10. RetinaFace Network</p>
      <p>
        The architecture of the RetinaFace network [
        <xref ref-type="bibr" rid="ref21">22</xref>
        ] is shown in Fig. 2. This network is a single-stage
face detector that detects faces through joint supervised and self-supervised multi-task learning.
The network has several features that improve the face detection performance of the ArcFace system
and surpass most network architectures in the vast majority of tests for such systems:
- RetinaFace uses five additional key points on faces to dramatically improve face recognition
results;
      </p>
      <p>- RetinaFace has a self-supervised mesh decoder branch that predicts per-pixel 3D face shape
information in parallel with the existing supervised branches;</p>
      <p>- By utilizing lightweight backbones, RetinaFace delivers performance that allows it to be run on
the CPU, while most similar systems achieve such performance solely using the GPU.
</p>
      <p>Table 1. System performance results</p>
      <p>Method HOG: fr128d 86.67%; arc512d 87%</p>
      <p> Acknowledgements </p>
      <p>The European Commission's support for the production of this publication does not constitute an
endorsement of the contents, which reflect the views only of the authors, and the Commission cannot
be held responsible for any use which may be made of the information contained therein.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Aggregate channel features for multi-view face detection</article-title>
          , in IEEE International Joint Conference on Biometrics,
          <year>2014</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <source>Detecting Face with Densely Connected Face Proposal Network</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranjan</surname>
          </string-name>
          , et al.,
          <article-title>Deep Learning for Understanding Faces: Machines May Be Just as Good, or Better, than Humans</article-title>
          .
          <source>Signal Processing Magazine</source>
          , IEEE,
          <year>2018</year>
          .
          <volume>35</volume>
          (
          <issue>1</issue>
          ), p.
          <fpage>66</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          et al.,
          <article-title>Handwritten digit recognition with a back-propagation network</article-title>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.A.</given-names>
            <surname>Rowley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Baluja</surname>
          </string-name>
          , T. Kanade,
          <article-title>Neural network-based face detection</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence</source>
          , IEEE Transactions on,
          <year>1998</year>
          .
          <volume>20</volume>
          (
          <issue>1</issue>
          ), p.
          <fpage>23</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.,
          <source>WIDER FACE: A Face Detection Benchmark</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kemelmacher</surname>
          </string-name>
          , et al.,
          <source>The MegaFace Benchmark: 1 Million Faces for Recognition at Scale</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>4873</fpage>
          -
          <lpage>4882</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.B.</given-names>
            <surname>Huang</surname>
          </string-name>
          , et al.,
          <article-title>Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments</article-title>
          , in Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition
          , Marseille, France,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Funnel-structured cascade for multi-view face detection with alignment-awareness</article-title>
          . Neurocomputing, v.
          <volume>221</volume>
          ,
          <year>2017</year>
          , p.
          <fpage>138</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , I. Sutskever, G. Hinton,
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <year>2017</year>
          .
          <volume>60</volume>
          (
          <issue>6</issue>
          ), p.
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Parashar</surname>
          </string-name>
          , et al.,
          <article-title>SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks</article-title>
          . ACM SIGARCH Computer Architecture News, v.
          <volume>45</volume>
          (
          <issue>2</issue>
          ),
          <year>2017</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          .
          <source>ArcFace: Additive Angular Margin Loss for Deep Face Recognition</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <source>Derivation of Backpropagation in Convolutional Neural Network (CNN)</source>
          ,
          <source>Technical Report</source>
          , University of Tennessee, Knoxville, TN, October 18,
          <year>2016</year>
          . URL: http://web.eecs.utk.edu/~zzhang61/docs/reports/2016.pdf
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Y. Bengio,
          <article-title>Convolutional networks for images, speech, and timeseries</article-title>
          ,
          <source>The Handbook of Brain Theory and Neural Networks</source>
          ,
          <year>1995</year>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14a">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Rudenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bezsonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Oliinyk</surname>
          </string-name>
          ,
          <article-title>First-Order Optimization (Training) Algorithms in Deep Learning</article-title>
          ,
          <source>Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2020), Volume I: Main Conference</source>
          , Lviv, Ukraine, April 23-24,
          <year>2020</year>
          , pp.
          <fpage>921</fpage>
          -
          <lpage>935</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>Deep Learning for Image Processing Applications</article-title>
          , Ed.
          <source>Hemanth DY</source>
          ,
          <string-name>
            <surname>Estrella</surname>
            <given-names>VV</given-names>
          </string-name>
          ,
          <year>2017</year>
          , 273 p. URL: http://ebooks.iospress.nl/volume/deep
          <article-title>-learning-for-image-processing-applications</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <article-title>Deep learning tutorial</article-title>
          , Stanford University. Autoencoders. URL: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>Convolutional network in python. Part 2. Derivation of formulas for training the model</article-title>
          . URL: https://habr.com/ru/company/ods/blog/344116/
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            <surname>Jones</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Snow</surname>
          </string-name>
          ,
          <article-title>Detecting pedestrians using patterns of motion and appearance</article-title>
          , The 9-th ICCV, Nice, France, v.
          <volume>1</volume>
          ,
          <year>2003</year>
          , pp.
          <fpage>734</fpage>
          -
          <lpage>741</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          , C.-Y. Fu,
          <source>SSD: Single Shot MultiBox Detector</source>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1512.02325
        </mixed-citation>
      </ref>
      <ref id="ref19a">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          . URL: https://arxiv.org/abs/1409.1556v6
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          , Joint Face Detection and
          <article-title>Alignment using Multi-task Cascaded Convolutional Networks</article-title>
          . URL: https://arxiv.org/pdf/1604.02878
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kotsia</surname>
          </string-name>
          , S. Zafeiriou,
          <article-title>RetinaFace: Single-stage Dense Face Localisation in the Wild</article-title>
          . URL: https://arxiv.org/abs/1905.00641
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>