=Paper=
{{Paper
|id=Vol-2744/paper28
|storemode=property
|title=Improving the Neural Network Algorithm for Assessing the Quality of Facial Images
|pdfUrl=https://ceur-ws.org/Vol-2744/paper28.pdf
|volume=Vol-2744
|authors=Nikita Lisin,Alexander Gromov,Vadim Konushin,Anton Konushin
}}
==Improving the Neural Network Algorithm for Assessing the Quality of Facial Images==
<pdf width="1500px">https://ceur-ws.org/Vol-2744/paper28.pdf</pdf>
<pre>
    Improving the Neural Network Algorithm for Assessing
                the Quality of Facial Images*

        Nikita Lisin1 [0000-0003-4943-0733], Alexander Gromov2 [0000-0001-9818-3770],
     Vadim Konushin2 [0000-0003-3949-0548], and Anton Konushin1,3 [0000-0002-6152-0021]
                  1 Faculty of Computational Mathematics and Cybernetics,

                Lomonosov Moscow State University, Moscow, Russia
          {nikita.lisin, anton.konushin}@graphics.cs.msu.ru
                 2 Video Analysis Technologies LLC, Moscow, Russia

                  {alexander.gromov, vadim}@tevian.ru
                 3 NRU Higher School of Economics, Moscow, Russia


       Abstract. The paper considers the task of obtaining a quality assessment of facial
       images for usage in various video surveillance systems, video analytics and bio-
       metric identification. Accuracy of person recognition and classification depends
       on the quality of the input images. We consider an approach to obtaining single
       face image quality assessment using neural network model, which is trained on
       pairs of images that are split into two possible classes: the quality of the first
       image is better or worse than the quality of the second one. Two modifications
       of the selected baseline algorithm are proposed. A face recognition system is ap-
       plied to change the loss function and image and face quality attributes are used
       when training the model. Experimental studies of the proposed modifications
       show their effectiveness. The accuracy of selecting the best and worst frame is
       increased by 1.3% and 1.9%, respectively.

       Keywords: Computer Vision, Face Quality Assessment, Face Recognition


1      Introduction

Computer vision algorithms such as face recognition, algorithms for determining emo-
tions, demographic characteristics and key points of a human face, are widely used in
video surveillance systems, video analytics and biometric identification. The received
data in these systems is a video stream, which contains a set of several frames for each
person. But most algorithms are built so that they process frames independently of each
other and, as has been shown in many studies, their accuracy depends on the quality of
the input images [1]. Therefore, these systems use face quality assessment algorithms


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

*Publication is supported by RFBR grant №19-07-00844.
2 N. Lisin, A. Gromov, V. Konushin, A. Konushin


to select the best frame to improve system performance [2] or reject the worst frames
to improve the overall accuracy [3].
   The task of face quality assessment is to obtain one scalar value for the input image
that reflects the overall quality and takes into account both image quality attributes (il-
lumination, blur, noise, etc.) and face quality attributes (head pose, face occlusion, etc.).
Usually, this scalar value is enclosed in the range from 0 to 100, where the values 0 and
100 correspond to the image with the worst and best quality, respectively. Algorithms
for obtaining this value are trained either on pairs of images split into two possible
classes (the quality of the first image is better or worse than the quality of the second
one), as in [4], or using regression to obtain a specific quality value [5]. Obtaining
ground truth labels in these works is carried out with the help of experts.
   Many of the recent works consider the problem of face quality assessment from a
different point of view: as an indicator that reflects the usefulness of the image for the
specific algorithm being used. Algorithms for obtaining this value use one of the exist-
ing face recognition systems, based on which either the training and test dataset is
marked up [6] or the finished model is obtained directly [7], [8]. The main idea is that
the confidence of the face recognition system for a pair of images of the same person
and the difference in the quality of these images are interrelated. The lower the confi-
dence of the face recognition system that the images in a pair belong to the same person,
the more they differ from each other in quality, and vice versa.
   In this paper, we use the approach from article [4] to obtain an overall quality indi-
cator. Two modifications are proposed for the baseline algorithm. The first modifica-
tion is the use of face recognition system to change the loss function. Our approach
differs from the previous ones in that the developed algorithm remains universal: it can
be applied together with any other algorithm, and not only with the used face recogni-
tion system. The second proposed modification is to apply image and face attributes.
Algorithms based on this approach use training of the neural network model for several
tasks [9], when one of the tasks is quality assessment, and the remaining tasks are image
and face attributes assessments. For example, in [10], the authors use sharpness, tone
and colorfulness, and their experiments show that this leads to improved algorithm ac-
curacy. At the same time, there is a small number of works that use images of human
face and take into account new properties. An example of such work is [11], which uses
alignment, visibility, deflection and clarity. The disadvantage is that the markup method
chosen in this article is quite subjective: the ground truth labels are the mean opinion
scores in the range from 0 to 1, obtained with the help of experts. We use image attrib-
utes such as illumination and blur, which are marked up for image pairs, as well as face
attributes such as head rotation angles in the range from −90° to +90° and occlusion
marking for 23 areas of the face into two possible classes. We assume that the consid-
ered approach to applying attributes for face quality assessment is more reliable than
previously proposed.
        Improving the Neural Network Algorithm for Assessing the Quality of Facial Images 3


2      Baseline algorithm

In the article [4], on which our algorithm is based, a neural network model is trained in
two stages. At the first stage the neural network model is a siamese network with two
identical branches with shared weights. The input is a pair of images, where second
image in the pair is obtained from the first using some distortion, which degrades the
image quality. The first and second images are the input of the first and second branches
of the network respectively and the output is the scores Q1 and Q 2 . These values are
used in Hinge Loss as follows (1):

                       Hinge Loss = max(0, Q 2 − Q1 + 1).                              (1)

The minimum of this function is achieved when the quality score of the second image
in the pair is less than the quality score of the first image in more than 1. Proposed
approach allows to obtain a ranking model by training on pairs of images without man-
ual markup. At the second stage, only one branch of the siamese network is used to
evaluate the quality of a single face image. This branch is also fine-tuned on a separate
dataset using regression.
   Our algorithm trains in the same way on pairs of images, but we use pairs of different
images containing people's faces. For each pair, the markup into two possible classes
was obtained using experts: the quality of the first image is better or worse than the
quality of the second one. The best image in a pair was considered to be the one that
best matches the combination of the following properties: front projection, normal illu-
mination, absence of occlusion, noise and blur, etc. This approach for obtaining pairs
was chosen because, from our point of view, it allows us to take into account more
complex cases that occur between pairs of different images, which cannot be obtained
by applying distortion to the original image. Different strategies for selecting pairs for
markup were used to cover more cases, and pairs for which it is not possible to uniquely
define a class were not used later. We don't apply fine-tuning on a separate dataset.
Resnet-10 [12], shown in Fig. 1, is used as a neural network model.


                             Fig. 1. Neural network architecture
4 N. Lisin, A. Gromov, V. Konushin, A. Konushin


3       First modification

As the first modification of the baseline algorithm, a new loss function is proposed,
which uses the output of the face recognition system. In the baseline algorithm loss
function for quality assessment has the following form (2):
                        LFQA = max(0, Q 2 − Q1 + margin),                             (2)

where margin is a constant value equal to 1. For a part of the training dataset that con-
sists of pairs of images of the same person, we calculate a set of probabilities using the
face recognition system. A probability is in the range from 0 to 1, where the value 1
means that a pair of images belongs to the same person and the value 0 means that they
belong to different people. We use this probability as an indicator of the similarity of
two images in terms of quality. This set is then normalized so that the expected value
would be 0 and the standard deviation would be 1. In the new loss function, margin has
the following form (3):
                           1,    a pair of images of different people
      margin = {                                                                ,     (3)
                   max α, 1 − β ∗ FR),
                      (                     a pair of images of a single person

where α ∈ (0, 1) – new minimum value of the margin; β ∈ ℝ+ – custom parameter;
FR ∈ ℝ – output of the face recognition system after normalization. In our experiments,
we use the following parameter values: α = 0.4 , β = 2. Face recognition system is
developed by Video Analysis Technologies [16]. Our approach differs from the previ-
ous ones in that the resulting algorithm remains universal: it can be used in the future
together with any other algorithm, and not only with the face recognition system se-
lected for training.


4       Second modification

As a second modification of the baseline algorithm, we consider multi-task learning of
the neural network to evaluate the face quality and image attributes such as illumination
and blur, as well as face attributes such as head pose and face occlusion. The general
loss function has the following form (4):
Loss = w1 ∗ LFQA + w2 ∗ LIllumination + w3 ∗ LBlur + w4 ∗ LPose + w5 ∗ LOcclusion ,
                                                                                (4)

where {wi }5i=1 − weight coefficients. In our experiments, we use the following param-
eter values: w1 = 64, w2 = 48, w3 = 48, w4 = 2, w5 = 4.


4.1     Illumination and Blur
Evaluation of illumination and blur occurs on pairs of images similar to the face quality
assessment in the baseline algorithm, with the use of Hinge Loss as LIllumination and
        Improving the Neural Network Algorithm for Assessing the Quality of Facial Images 5


LBlur . During training, the value of the loss function for pairs without markup is as-
sumed to be zero.


4.2    Head Pose
The head pose estimation consists of determining three angles for each image in a pair:
pitch, roll and yaw. Each angle is enclosed in the range from −90° to +90°. Training is
performed using regression and Weighted Mean Absolute Error is used as the loss func-
tion (5):
                                        ̅ ∣+ β∗∣R−R
                                   α∗∣P−P         ̅ ∣+ γ∗∣Y−Y
                                                            ̅∣
                         LPose =                                 ,                     (5)
                                             α+β+γ

where P̅, R
          ̅, Y
             ̅ ∈ ℕ — ground truth labels; P, R, Y ∈ ℕ — neural network output; α,
β, γ ∈ ℝ — weight coefficients for pitch, roll and yaw respectively. In this work, we
use the following parameter values: α = 1, β = 1, γ = 0.05 . The low value of γ is
explained by the fact that the marking for yaw is less accurate than for pitch and roll.
Fig. 2 shows an example of head rotation angles using three guide vectors.


                          Fig. 2. Example of head rotation angles


           Fig. 3. An example of areas markup. Blue indicates the invisible class
6 N. Lisin, A. Gromov, V. Konushin, A. Konushin


4.3    Face Occlusion
Face occlusion task is to determine one of two types of visibility for 23 areas of the
face: the visible region and the invisible region due to the rotation of the head, overlap-
ping by an external object, or going beyond the boundaries of the image. The scheme
of dividing the face into regions is based on the approach proposed in [13], while some
regions were divided into subdomains, and new ones were added. Fig. 3 shows an ex-
ample of areas markup. The loss function has the following form (6):

                             LOcclusion = ∑23
                                           i=1 CE Lossi                                 (6)

where CE Lossi is the Cross Entropy Loss for the i-th region.


5      Training Dataset

Since the necessary dataset are not publicly available, we created our own training set.
Images for constructing pairs were provided by Video Analysis Technologies [16]. Ta-
ble 1 describes the methods of obtaining labels for quality assessment and attributes.
Expert markup for face quality assessment and blur was performed with the help of five
people and each pair was labeled only by one. The neural network models used for
markup of illumination, head pose, and face occlusion were trained on separate da-
tasets:

1. the dataset for training face occlusion classifier is marked up with the help of experts;
2. the dataset for training head poses classifier is marked up by determining the rotation
   angles based on 68 key points;
3. the dataset for training illumination classifier is marked up into 13 levels of illumi-
   nation as follows:
   a. the autoencoder was trained;
   b. outputs correlating with the degree of illumination were found in the intermediate
      representation of the autoencoder;
   c. based on the found outputs, all data was divided into 13 classes;
   d. additional expert markup was made to remove false cases.
    It should be noted that in the case of illumination, training is carried out on pairs of
images, the markup for which is obtained automatically based on their illumination
levels, because this approach turned out to be more stable. General characteristics of
the obtained dataset, taking into account the transitive closure and the number of pairs
used together with the face recognition system, are given in Table 2.
         Improving the Neural Network Algorithm for Assessing the Quality of Facial Images 7


                        Table 1. Methods used to get ground truth labels.
             Task                         Type of markup                    Markup method


                                    Paired markup into two pos-
    Face Quality Assessment                                                  Expert markup
                                            sible classes
                                                                      Neural network classifier for
                                    Paired markup into two pos-
         Illumination                                                          13 levels
                                            sible classes
                                                                            of illumination
                                    Paired markup into two pos-
             Blur                                                            Expert markup
                                            sible classes
                                                                       Neural network algorithm
                                    Values of three angles for an
          Head Pose                                                   for determining rotation an-
                                         individual image
                                                                            gles of the head
                                           Visibility type
                                                                      Neural network classifier of
        Face Occlusion                       for 23 areas
                                                                            face occlusion
                                       of an individual image

                                  Table 2. Dataset characteristics.

                Characteristic                                          Quantity
            Total number of pairs                                       417 240
           Total number of images                                       334 660
                    Blur, pairs                                         124 820
              Illumination, pairs                                       286 240
              Pairs of one person                                       65 010
           Face Quality Assessment                                      all pairs
                    Head Pose                                          all images
               Face Occlusion                                          all images


6       Test Dataset and Metrics

Since the most common datasets are designed to evaluate the quality of an arbitrary
image that does not necessarily contain a human face, we have created a new dataset
for this purpose. The test dataset consists of tracks – sets of 5 to 12 frames containing
images of faces belonging to one person. The total number of tracks is 7070, and for
each track the markup of the best and worst frames was made in terms of quality as-
sessment. The best frames are those that have the best matches the combination of the
following properties: front projection, normal illumination, absence of occlusion, noise
and blur, etc. Similarly, the worst frames are those that have the worst correspondence
8 N. Lisin, A. Gromov, V. Konushin, A. Konushin


to these properties. Each track was marked by three experts, and the frame was consid-
ered the best or worst only if the opinions of at least two experts were the same. It was
also required to select as few frames as possible. Fig. 4 shows an example of such a
track. For this dataset, we define three metrics: Best Shot Accuracy, Worst Shot Accu-
racy and Pair Accuracy.
         Frame 1          Frame 2          Frame 3             Frame 4         Frame 5


                       Best Frame                            Worst Frame

      Fig. 4. Track example. Blue indicates the best frame; red indicates the worst frame

Notation for the constructed dataset:

• S = {St }N
           t=1 – set of N tracks;
• Mt ∈ ℕ, Mt ∈ [5, 12] – number of frames for each track;
             Mt
• St = {Fkt }k=1  – track with the number t, which consists of frames Fkt;
• B = {Bt }Nt=1 – set of tracks with the best frames, Bt ⊂ St ;
• W = {Wt }N  t=1 – set of tracks with the worst frames, Wt ⊂ St .

Notation for the algorithm output:

• Q = {Q t }N
            t=1 – collection of N sets with quality scores;
               Mt
• Q t = {QFkt }k=1 – set of quality scores for a track with the number t.

We define two indicator functions:
                                     1, F t ∈ Bt
                              ItB = { mt         , m = argmax QFkt,                         (7)
                                     0, Fm ∉ Bt        ∀k∈[1, Mt ]

                                     1, F t ∈ Wt
                              ItW = { m   t      , m = argmin QFkt.                         (8)
                                     0, Fm  ∉ Wt       ∀k∈[1, Mt]

The indicator function (7) corresponds to the choice of the best frame – it is equal to 1
if and only if the frame with the highest quality score is contained in the set of the best
frames. Similarly, (8) corresponds to the selection of the worst frame. Based on the
indicator functions, we define two metrics:
                                                        N
                                                        ∑ IB
                                                           t
                                                       t=1
                             Best Shot Accuracy =              ,                             (9)
                                                         N
           Improving the Neural Network Algorithm for Assessing the Quality of Facial Images 9

                                                            N
                                                            ∑ IW
                                                               t
                                                           t=1
                              Worst Shot Accuracy =                .                           (10)
                                                             N

Since each track consists of frames of three types (best, worst and normal), we can also
create image pairs for each track that consist of two different types of frames. The re-
sulting pairs are marked up into two classes (the quality of the first image is better or
worse than the quality of the second one), which is uniquely determined based on the
types of images in the pair. Pair Accuracy is defined as the percentage of correctly
classified specified pairs, the total number of which is 173 940.


7         Experiments

On the test datasets four algorithms are compared: Baseline, Baseline with Modified
Loss, Baseline with Attributes, Baseline with Modified Loss and Attributes. On Fig. 5
the scheme of the baseline algorithm along with two modifications is given.
   During training a polynomial change of the learning rate with the degree of polyno-
mial 2 is used, and the initial value of the learning rate is 0.001. The total number of
epochs is 25. We also use the Ranger optimizer, which is a combination of two methods
proposed in [14] and [15]. Augmentation is the same transformation of both images in
a pair (changing saturation, illumination, contrast, additive Gaussian noise, blurring,
cropping the image, etc.). Inference latency of the baseline algorithm is 0.002 second
on a single core of Intel Core CPU i5-9400.


    Fig. 5. Scheme of the resulting algorithm. Blue is Baseline, green is Attributes and orange is
                                            Modified Loss

The results obtained on the test dataset are shown in Table 3. Experimental assessment
shows that the use of both modifications achieves the best result and increases the ac-
curacy of selecting the best frame by 1.3%, the accuracy of selecting the worst frame
10 N. Lisin, A. Gromov, V. Konushin, A. Konushin


by 1.9%, pair accuracy by 0.9%. The advantage of the proposed modifications is that
they do not increase the inference time of the baseline algorithm.

                            Table 3. Experimental assessment

                             Best Shot              Worst Shot
      Algorithm                                                           Pair Accuracy
                             Accuracy               Accuracy

       Baseline                0.710                   0.703                  0.855

 Baseline with Modi-
                               0.714                   0.712                  0.858
      fied Loss

    Baseline with
                               0.721                   0.722                  0.862
     Attributes
 Baseline with Mod-
  ified Loss and At-           0.723                   0.722                  0.864
        tributes


      Fig. 6. Dependence of the face recognition system on the quality of input images

To study the dependence of the face recognition system on the quality of input images
we use a second test dataset consisting of 61 500 pairs of images of a single person and
a face recognition system from the company Video Analysis Technologies [16]. The
dataset used has a limited variation in image quality and was not specially designed for
this purpose, so it is insufficient to demonstrate the difference between the modifica-
tions. But we present the results as additional confirmation of the applicability of the
developed algorithms. Fig. 6 shows the dependence of True Positive Rate on the per-
centage of lowest quality images removed, with a fixed False Acceptance Rate of
0.0001.
       Improving the Neural Network Algorithm for Assessing the Quality of Facial Images 11


8      Conclusion

In this paper the modified neural network algorithm for face quality assessment is pro-
posed. An approach from article [4] is used for baseline algorithm. As modifications
the face recognition system is applied to change the loss function and image and face
quality attributes are used when training the model. Changes increase the accuracy of
selecting the best and worst frame by 1.3% and 1.9%, respectively, without affecting
performance.


References

 1. Grother, P., Hom, A., Ngan, M., Hanaoka, K.: Ongoing Face Recognition Vendor Test
    (FRVT) Part 5: Face Image Quality Asssessment. Information Access Division Information
    Technology Laboratory, NIST (2020).
 2. Nikitin, M., Konushin, A., Konushin, V.: Face quality assessment for face verification in
    video. In: Proceedings of the 24th International Conference on Computer Graphics and Vi-
    sion GraphiCon'2014, pp. 111–114 (2014).
 3. Bagrov, N., Konushin, A., Konushin, V.: Face recognition with low false positive error rate.
    In: ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial In-
    formation Sciences, pp. 11–15 (2019).
 4. Liu, X., Van De Weijer, J., Bagdanov, A.: RankIQA: Learning from Rankings for No-Ref-
    erence Image Quality Assessment. In: 2017 IEEE International Conference on Computer
    Vision (ICCV), pp. 1040-1049 (2017).
 5. Best-Rowden, L., Jain, A.: Learning Face Image Quality From Human Assessments. In:
    IEEE Transactions on Information Forensics and Security, vol. 13, no. 12, pp. 3064-3077
    (2018).
 6. Hernandez-Ortega, J., Galbally, J., Fiérrez, J., Haraksim, R., Beslay, L.: FaceQnet: Quality
    Assessment for Face Recognition based on Deep Learning. In: 2019 International Confer-
    ence on Biometrics (ICB), pp.1-8 (2019).
 7. Terhorst, P., Kolf, J., Damer, N., Kirchbuchner, F., Kuijper, A.: SER-FIQ: Unsupervised
    Estimation of Face Image Quality Based on Stochastic Embedding Robustness (2020).
 8. Nikitin, M., Konushin, V., Konushin, A.: Neural network model for video-based face recog-
    nition with frames quality assessment. In: Computer Optics 2017, vol. 41, pp. 732-742
    (2017). (In Russian)
 9. Kuharenko, A., Konushin, A.: Simultaneous facial attribute classification with convolutional
    neural networks. In: 11th International Conference on Pattern Recognition and Image Anal-
    ysis: New Information Technologies (PRIA-11-2003), IPSI RAS Samara, vol. 2, pp. 623–
    626 (2013).
10. Yang, D., Peltoketo, V., Kämäräinen, J.: CNN-Based Cross-Dataset No-Reference Image
    Quality Assessment. In: 2019 IEEE/CVF International Conference on Computer Vision
    Workshop (ICCVW), Seoul, Korea (South), pp. 3913-3921 (2019).
11. Lijun, Z., Xiaohu, S., Fei, Y., Pingling, D., Xiang-dong, Z., Yu, S.: Multi-branch Face Qual-
    ity Assessment for Face Recognition. In: 2019 IEEE 19th International Conference on Com-
    munication Technology (ICCT), Xi'an, China, pp. 1659-1664 (2019).
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: IEEE
    Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778 (2016).
12 N. Lisin, A. Gromov, V. Konushin, A. Konushin


13. Maze, Brianna et al.: IARPA Janus Benchmark - C: Face Dataset and Protocol. In: 2018
    International Conference on Biometrics (ICB), pp. 158-165 (2018).
14. Zhang, M., Lucas, J., Hinton, G., Ba, J.: Lookahead Optimizer: k steps forward, 1 step back.
    NeurIPS, (2019).
15. Liu, Liyuan et al.: On the Variance of the Adaptive Learning Rate and Beyond, (2020).
16. Video Analysis Technologies Homepage, https://tevian.ru.

</pre>