<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identification of Modern Facial Emotion Recognition Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kirill Smelyakov</string-name>
          <email>kyrylo.smelyakov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksandr Bohomolov</string-name>
          <email>oleksandr.bohomolov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maksym Kizitskyi</string-name>
          <email>maksym.kizitskyi@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasiya Chupryna</string-name>
          <email>anastasiya.chupryna@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>14 Nauky Ave., Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper is devoted to the problem of developing a generalized algorithm for the effective identification of computational intelligence models used to recognize emotions from a person's facial expression. To solve this problem, a suitable dataset was selected; alternative recognition models, algorithms, and machine learning technologies were identified, as well as the performance indicators and metrics used in the comparative analysis of the obtained results. A series of experiments was carried out to identify the parameters of alternative neural network models used to recognize emotions and to evaluate the effectiveness of their application. Based on a comparative analysis of the experimental results, a generalized algorithm for identifying emotions was formulated, along with recommendations for using certain neural network architectures within facial emotion recognition tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer vision</kwd>
        <kwd>facial emotion recognition</kwd>
        <kwd>face recognition</kwd>
        <kwd>convolutional neural network</kwd>
        <kwd>transfer learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Research in recent years has focused on the facial emotion recognition (FER) task [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1-3</xref>
        ]. Such systems often supplement face recognition systems (Azure Face API, Face, FaceReader, etc.) [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4-6</xref>
        ] and can be used in many situations, from customer satisfaction analysis and service at the checkout to tracking emotions at a psychologist's appointment [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], and prospectively in drone vision services [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], etc.
      </p>
      <p>
        Researchers have described the most efficient approaches to the facial emotion recognition (FER) task, which use networks such as ResNet, AffectNet, MobileNet, etc. To simplify access to this information, a dedicated list has been organized [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>On the other hand, these approaches include various forms of ensembling and stacking of neural networks. This yields a gain in the quality of emotion classification, but it also has disadvantages. Firstly, the resulting model becomes quite large and heavy, and predictions take a lot of time. Because of this, applying models of this kind on mobile devices or in real-time systems is very complicated. Secondly, with several neural networks present, maintaining them within a production system becomes more complicated, and updating models while preserving the logic of the system is more difficult than with an end-to-end model. Therefore, the issue of developing a model that is perhaps not as effective, but much more compact and easier to maintain for use in face recognition systems, remains relevant and open.</p>
      <p>
        At the same time, a wide variety of machine learning models and algorithms, as well as a high degree
of uncertainty in the application conditions, often create great difficulties in choosing an appropriate
network architecture and tuning its parameters effectively [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11-13</xref>
        ].
      </p>
      <p>
        Why are neural networks and transfer learning considered for solving FER problems? In recent years, neural networks have become the standard tool in the area of computer vision [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. A large number of diverse architectural solutions (EfficientNet, ResNet, Yolov5, etc.) and machine learning methods have been proposed to solve the problems of image classification, object detection, and recognition. Their performance is affected by the quality of the images [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ], the result of image segmentation [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ], and the architecture and hyperparameter settings of the neural networks [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Moreover, research on the application of convolutions is being carried out to improve the effectiveness of CNNs by optimizing the convolution mask parameters, the number of layers, and a number of other parameters [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
        ].
      </p>
      <p>
        For the purpose of identifying the parameters of a neural network, a wide range of machine learning algorithms is currently used. One of the most effective is transfer learning [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Transfer learning (fine-tuning a neural network whose weights were pre-trained on a huge data set, for example ImageNet, to solve a specific problem) is widely used in all areas of computer vision and increases the quality of solving different kinds of problems [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ].
      </p>
      <p>The main advantage of this approach is that, thanks to the pre-trained weights, the model transforms the input image into a smaller set of meaningful features. Because of this, the landscape of the loss function is smoothed and the model converges faster to its minimum. Recently, in the field of face recognition, a state-of-the-art (SOTA) technique has often been used in which the model is trained to compress the image into a feature vector by which a person's face can be identified [26]. This is very similar to what transfer learning is used for, which is why we decided to compare classical transfer learning models with face recognition models in more detail. Besides, this domain was selected because it is quite a popular area and many pre-trained models are publicly available [27].</p>
      <p>For models to benefit from pre-trained weights, the task must be related to the domain on which the
models were trained.</p>
      <p>
        The research results are important not only for FER services, but also for solving a great number of related tasks, including the development of effective integrated E-learning services and AI solutions [28], as well as ICT solutions, network solutions, and security services [
        <xref ref-type="bibr" rid="ref12">12, 29, 30</xref>
        ]. In addition, if face recognition-based models show advantages over standard approaches, then face recognition learning approaches can improve the quality of transfer learning models in other areas, increase learning speed, and allow using less data for training. This would let specialists conduct more experiments and reduce spending on cloud learning services.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and Materials</title>
      <p>First of all, consider the data that will be used in further experiments, as well as some other materials and methods proposed to solve the problem under consideration.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Dataset Description</title>
      <p>In order to test our approach, we chose the well-known FER2013 data set [31]. The 2013 Facial Expression Recognition dataset (FER2013) is a Kaggle dataset, introduced by Pierre-Luc Carrier and Aaron Courville at the International Conference on Machine Learning (ICML) in 2013.</p>
      <p>This dataset was chosen because it is publicly available. It also contains photographs of people of different ages, genders, races, and nationalities, with different backgrounds and accessories (such as glasses and masks). This allows a better evaluation of the generalization ability in emotion recognition.</p>
      <p>This dataset contains grayscale images of faces, 48x48 pixels in size. These images were created using automatic face registration, so the faces are centered and occupy nearly the same amount of space in each image. When making our comparison, we therefore assume that the images have already been preprocessed, and we will not consider this issue within the framework of our paper. Each image is labeled with one of seven emotions: Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral.</p>
      <p>The Disgust expression has the minimal number of images – 547, while the other labels have nearly 5,000 samples each. More detailed information is presented in Table 1.</p>
      <p>The following performance indicators and metrics are used in the comparative analysis:
● loss on the training set;
● accuracy on the validation set;
● mean convergence rate (MCR):
MCR = (1/n) ∑_{i=1}^{n} (Metric_train_i − Metric_train_{i−1}), (1)
where n is the number of epochs and Metric_train_i is the performance metric on the training data set during the i-th epoch;
● mean overfitting rate (MOFR):
MOFR = (1/n) ∑_{i=1}^{n} ((Metric_train_i − Metric_train_{i−1}) − (Metric_val_i − Metric_val_{i−1})), (2)
where Metric_val_i is the performance metric on the validation data set during the i-th epoch;
● initial accuracy – accuracy after training for 1 epoch. We chose this metric because it shows how well the pre-trained weights of the model fit the domain;
● initial loss – loss after training for 1 epoch.</p>
      <p>In our experiments, Metric will be accuracy and loss (categorical cross entropy).</p>
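      <p>As an illustration (not part of the original experiment code), metrics (1) and (2) can be computed directly from per-epoch metric curves; the accuracy values below are hypothetical stand-ins for a framework's training log:</p>

```python
def mean_convergence_rate(train_metric):
    """Mean per-epoch change of a metric on the training set, Eq. (1).

    train_metric[i] is the metric value after epoch i (0-indexed)."""
    n = len(train_metric)
    return sum(train_metric[i] - train_metric[i - 1] for i in range(1, n)) / n

def mean_overfitting_rate(train_metric, val_metric):
    """Mean gap between train and validation per-epoch changes, Eq. (2)."""
    n = len(train_metric)
    return sum(
        (train_metric[i] - train_metric[i - 1])
        - (val_metric[i] - val_metric[i - 1])
        for i in range(1, n)
    ) / n

# Hypothetical accuracy curves over 5 epochs
train_acc = [0.40, 0.55, 0.65, 0.72, 0.78]
val_acc = [0.38, 0.50, 0.57, 0.60, 0.61]
mcr = mean_convergence_rate(train_acc)
mofr = mean_overfitting_rate(train_acc, val_acc)
```

      <p>A positive MOFR indicates that the training metric is improving faster than the validation metric, i.e. the model is drifting toward overfitting.</p>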
      <p>In general, this data set provides a wide variety of face images, which favorably affects the generalization ability of the model. However, it also has a class imbalance, which is why the accuracy of recognizing the emotion of disgust will probably be lower in comparison with the others.</p>
      <p>To split the data set, the standard train_test_split function from the sklearn package was used: a training set of 70% (25,121 images), a validation set of 10% (3,589 images), and a test set of 20% (7,177 images). The partition was stratified by the emotion in the image, with random_state = 42.</p>
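      <p>The 70/10/20 stratified partition described above can be reproduced with two calls to train_test_split; a minimal sketch, with synthetic stand-in data instead of the real FER2013 arrays:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the FER2013 images and labels
rng = np.random.default_rng(42)
X = rng.random((1000, 48 * 48))
y = rng.integers(0, 7, size=1000)  # seven emotion classes

# First carve off the 20% test set, stratified by emotion label
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Then split the remaining 80% into 70%/10% of the original data
# (0.125 of the remainder = 0.125 * 0.8 = 0.1 of the whole)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=42
)
```

      <p>Stratifying both calls keeps the per-emotion proportions (including the small Disgust class) roughly equal across the three subsets.</p>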
    </sec>
    <sec id="sec-5">
      <title>4. Experiment</title>
      <p>This section presents the plan of the experiment.</p>
      <p>In order to evaluate the effectiveness of transfer learning, we will compare several popular architectures, such as VGG-Face (Figure 2) and OpenFace (Figure 3), which are neural networks trained for face recognition. Our hypothesis is that, since the task of face recognition is in some ways similar to FER, the weights of these networks will already contain features that increase learning performance. We also chose ResNet-50 and MobileNet (Figure 4), pretrained on the ImageNet dataset, because they are the standard choice as a backbone in transfer learning. In these networks, the last layer was excluded, and all layers except the last 4 were frozen.</p>
      <p>The model structures of VGG-Face and OpenFace were loaded using the deepface library [35]; the pretrained weights are available at [36-38]. ResNet-50 and MobileNet were loaded using the Keras framework [39]. Each model will be trained with a fixed set of hyperparameters: the learning rate is 10^-4 and the number of epochs is 20. Key metrics will be measured every 5 epochs. As a loss function, we chose categorical cross entropy.</p>
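      <p>A rough sketch of this setup for the Keras-loaded backbones (our own illustrative code, not the authors' released implementation; the input size, channel replication, and single-layer classification head are our assumptions):</p>

```python
import numpy as np
from tensorflow import keras

def build_fer_model(weights="imagenet"):
    """MobileNet backbone with all but the last 4 layers frozen, as in the experiment.

    FER2013 images are 48x48 grayscale, so in practice they would be resized
    and replicated to 3 channels before being fed to an ImageNet backbone."""
    backbone = keras.applications.MobileNet(
        weights=weights, include_top=False, input_shape=(128, 128, 3), pooling="avg"
    )
    for layer in backbone.layers[:-4]:
        layer.trainable = False
    model = keras.Sequential([
        backbone,
        keras.layers.Dense(7, activation="softmax"),  # seven FER2013 emotions
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # fixed LR of 10^-4
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# weights=None here only to avoid downloading ImageNet weights in a quick check
model = build_fer_model(weights=None)
probs = model.predict(np.zeros((2, 128, 128, 3), dtype="float32"), verbose=0)
```

      <p>Training would then proceed with model.fit for 20 epochs; the same pattern applies to ResNet-50 by swapping the backbone constructor.</p>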
      <p>To compare the efficiency of transfer learning, we will train neural networks in 2 versions: with
pretrained weights and with randomly initialized weights. This approach will allow us to determine how
and at what stages the pre-trained weights affect the efficiency of the model.</p>
      <p>After the experiment we will find out in which model the pre-trained weights give the greatest value
compared to random initialization, determine which model converges faster than others, is more
resistant to overfitting and shows the highest accuracy.</p>
      <p>Training will be carried out in the Google Colaboratory environment.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <p>The results of the experiments are presented in Figures 5–8 and in Tables 2–5. High resolution versions of all images are available at [40].</p>
    </sec>
    <sec id="sec-7">
      <title>5.1. ML Results</title>
      <p>Figure 5: MobileNet training process: a) accuracy change over epochs for MobileNet; b) loss change over epochs for MobileNet.</p>
      <p>Figure 15: Classification result for the emotion “neutral”: a) an image example [26]; b) predicted emotion probabilities.</p>
    </sec>
    <sec id="sec-8">
      <title>6. Discussions</title>
      <p>As a result of the experiment, it was revealed that pre-trained models performed better than randomly initialized ones on the FER task. The pre-trained models also had a higher average convergence rate during the first epochs (1-10), but then the values became the same, and in some cases, at epochs 15-20, the randomly initialized model converged faster. This is mainly because the pretrained model had by that point reached an accuracy of more than 0.8, so its quality gains slowed down. On the other hand, pre-trained models are more prone to overfitting; therefore, when using them, it is desirable to apply various regularization methods or data augmentation.</p>
      <p>The best model in terms of initial and final accuracy on the validation set is VGGFace_pretrained. Its weights are therefore initially best suited for the FER task. But in our experiment, this model had the worst performance in terms of convergence rate. For its training, other hyperparameters should therefore be used, for example, increasing the learning rate or adding more dense classification layers.</p>
      <p>The second face recognition model, OpenFace, shows a level of accuracy comparable to the standard transfer learning solution, ResNet-50, while having far fewer parameters, so it fits and predicts faster: OpenFace has 3,743,280 parameters and ResNet-50 has 23,587,712. MobileNet has the fewest parameters (3,228,864), but its performance is lower than OpenFace's. OpenFace also has the highest convergence rate and overfitting rate in comparison with the other models.</p>
      <p>Thus, the face recognition-based models proved to be at a fairly high level, in some cases even surpassing standard models like ResNet-50 and MobileNet.</p>
      <p>As can be seen from Figures 8-14, emotions such as happiness, anger, fear, and surprise are recognized best, and disgust is recognized worst of all. This is because this class is the least represented in the dataset. In addition, some pictures are rather controversially labeled (for example, pictures 12-13); on these examples, the neural networks show low confidence in the image class.</p>
      <p>Based on the results of the experiment, a final learning algorithm was developed, which we can suggest for use in FER systems:</p>
      <p>Preprocessing:
1) apply a face detection model to the image. You can use one of the pre-trained models or train your own;
2) apply various augmentations to the images. This will balance the classes (if the original dataset is unbalanced) and also increase the stability of the model on new data.</p>
      <p>Training:
1) select a backbone model. If speed is more important within the task and there is enough data for training, we recommend choosing OpenFace. If recognition quality is more important and there are not enough resources for full model training, choose VGGFace;
2) freeze all layers of the neural network and add fully connected layers on top of them;
3) select hyperparameters and start the learning process with them.</p>
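      <p>The class-balancing part of preprocessing step 2 can be as simple as oversampling minority classes with a label-preserving transform; a minimal numpy sketch, assuming horizontal flips only (the class counts below are illustrative, not FER2013's):</p>

```python
import numpy as np

def balance_with_flips(images, labels, rng=None):
    """Oversample minority classes by adding horizontally flipped copies."""
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    extra_imgs, extra_lbls = [], []
    for cls, count in zip(classes, counts):
        need = target - count
        if need == 0:
            continue
        idx = rng.choice(np.flatnonzero(labels == cls), size=need, replace=True)
        extra_imgs.append(images[idx, :, ::-1])  # flip along the width axis
        extra_lbls.append(np.full(need, cls))
    if extra_imgs:
        images = np.concatenate([images] + extra_imgs)
        labels = np.concatenate([labels] + extra_lbls)
    return images, labels

# Illustrative: a disgust-like minority class (label 1) with far fewer samples
imgs = np.random.rand(60, 48, 48)
lbls = np.array([0] * 50 + [1] * 10)
bal_imgs, bal_lbls = balance_with_flips(imgs, lbls)
```

      <p>In practice one would combine several transforms (small rotations, shifts, brightness changes) rather than flips alone, so that the oversampled copies are not near-duplicates.</p>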
    </sec>
    <sec id="sec-9">
      <title>7. Conclusions</title>
      <p>As a result of the research the aim and goals of the work were reached. We formulated an effective
algorithm for neural network identification and usage within the framework of the FER task; determined
which architecture of neural networks was better to use as a backbone for FER tasks in different
situations; compared the effectiveness of face recognition based backbones with standard solutions for
transfer learning.</p>
      <p>We selected one of the most popular datasets for the FER task – FER2013. While analyzing its structure, we found that it is quite unbalanced. On the one hand, this is a drawback, because models will learn to distinguish the minority class less well. On the other hand, it shows how the models will perform on real-world datasets, which are often unbalanced.</p>
      <p>We then defined key metrics for analyzing network performance during learning. The proposed metrics showed the efficiency of transfer learning for each architecture and determined which pre-trained weights are most suitable for the FER task, leading to faster convergence and less overfitting.</p>
      <p>As part of this work, we organized an experiment and conducted a comparative analysis of the
quality of the most popular neural network architectures for transfer learning (ResNet-50, MobileNet)
with networks for face recognition (OpenFace, VGG-Face) within the FER task using various metrics.
The obtained results show only general performance of the networks because they were all trained under
the same conditions, and the best set of hyperparameters was not selected.</p>
      <p>Based on the analysis of the experimental results, we recommend using the algorithm proposed in this article with a pretrained VGGFace. Under conditions of limited resources and with the use of regularization methods, we recommend OpenFace as an alternative. We also recommend tuning the classifier for each specific task separately, because this will give a gain in quality.</p>
      <p>A deeper analysis of the effectiveness of these neural networks would require a broader study, which is beyond the scope of this work: testing a larger class of architectures on a larger number of data sets and using various types of classifiers on the embeddings (including those not based on neural networks).</p>
    </sec>
    <sec id="sec-10">
      <title>8. References</title>
      <p>Approaches," in IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1777-1787, Aug. 2019,
doi: 10.1109/TMI.2019.2894349.
[26] Deep Face Recognition: A Survey. URL: https://arxiv.org/pdf/1804.06655.pdf?source=post_page.
[27] Deepface. URL: https://github.com/serengil/deepface.
[28] Y. Lu, Q. Mao and J. Liu, "A Deep Transfer Learning Model for Packaged Integrated Circuit
Failure Detection by Terahertz Imaging," in IEEE Access, vol. 9, pp. 138608-138617, 2021, doi:
10.1109/ACCESS.2021.3118687.
[29] O. Lemeshko, O. Yeremenko and A. M. Hailan, "Two-level method of fast ReRouting in
softwaredefined networks," 2017 4th International Scientific-Practical Conference Problems of
Infocommunications. Science and Technology (PIC S&amp;T), 2017, pp. 376-379, doi:
10.1109/INFOCOMMST.2017.8246420.
[30] Shubin, I., Kyrychenko, I., Goncharov, P., Snisar, S., "Formal representation of knowledge for
infocommunication computerized training systems," 2017 IEEE 4th International
ScientificPractical Conference Problems of Infocommunications, Science and Technology (PIC S&amp;T),
2017, pp. 287–291, doi: 10.1109/INFOCOMMST.2017.8246399.
[31] Learn facial expressions from an image. URL: https://www.kaggle.com/msambare/fer2013.
[32] VGG-Face network architecture. URL:
https://www.researchgate.net/figure/VGG-Face-networkarchitecture_fig2_319284653.
[33] OpenFace architecture. URL: https://www.cs.cmu.edu/~satya/docdir/CMU-CS-16-118.pdf.
[34] MobileNet architecture. URL: https://arxiv.org/pdf/1704.04861.pdf.
[35] OpenFace: A general-purpose face recognition library with mobile applications. URL: http://reports-archive.adm.cs.cmu.edu/anon/2016/CMU-CS-16-118.pdf.
[36] VGG-Face weights. URL: https://drive.google.com/file/d/1CPSeum3HpopfomUEK1gybeuIVoeJT_Eo/view.
[37] OpenFace weights. URL: https://drive.google.com/file/d/1LSe1YCV1x-BfNnfb7DFZTNpv_Q9jITxn/view.
[38] ResNet and ResNetv2. URL: https://keras.io/api/applications/resnet/#resnet50-function.
[39] Keras. URL: https://keras.io/api/applications/mobilenet.
[40] All images. URL: https://docs.google.com/document/d/1Z_S_FpRkv4Xf2cRAqHxo23BUv7aYqt
MZ59aJrpvYf-M/edit?usp=sharing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          et al.,
          <article-title>"Dominant and Complementary Emotion Recognition from Still Images of Faces,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>6</volume>
          , pp.
          <fpage>26391</fpage>
          -
          <lpage>26403</lpage>
          ,
          <year>2018</year>
          , doi: 10.1109/ACCESS.
          <year>2018</year>
          .
          <volume>2831927</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>"Weakly Supervised Emotion Intensity Prediction for Recognition of Emotions in Images,"</article-title>
          <source>in IEEE Transactions on Multimedia</source>
          , vol.
          <volume>23</volume>
          , pp.
          <fpage>2033</fpage>
          -
          <lpage>2044</lpage>
          ,
          <year>2021</year>
          , doi: 10.1109/TMM.
          <year>2020</year>
          .
          <volume>3007352</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. -Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          -L. Liu and
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>"Multisource Transfer Learning for Cross-Subject EEG Emotion Recognition,"</article-title>
          <source>in IEEE Transactions on Cybernetics</source>
          , vol.
          <volume>50</volume>
          , no.
          <issue>7</issue>
          , pp.
          <fpage>3281</fpage>
          -
          <lpage>3293</lpage>
          ,
          <year>July 2020</year>
          , doi: 10.1109/TCYB.
          <year>2019</year>
          .
          <volume>2904052</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Smelyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Datsenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Skrypka</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Akhundov</surname>
          </string-name>
          ,
          <article-title>"The Efficiency of Images Reduction Algorithms with Small-Sized and Linear Details,"</article-title>
          <source>2019 IEEE International Scientific-Practical Conference Problems of Infocommunications, Science and Technology (PIC S&amp;T)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>745</fpage>
          -
          <lpage>750</lpage>
          , doi: 10.1109/PICST47496.
          <year>2019</year>
          .
          <volume>9061250</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>"A Review of Face Recognition Technology,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>139110</fpage>
          -
          <lpage>139120</lpage>
          ,
          <year>2020</year>
          , doi: 10.1109/ACCESS.
          <year>2020</year>
          .
          <volume>3011028</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <article-title>"Towards Age-Invariant Face Recognition,"</article-title>
          <source>in IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>44</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>474</fpage>
          -
          <issue>487</issue>
          , 1 Jan.
          <year>2022</year>
          , doi: 10.1109/TPAMI.
          <year>2020</year>
          .
          <volume>3011426</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N. -C.</given-names>
            <surname>Ristea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Duţu</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Radoi</surname>
          </string-name>
          ,
          <article-title>"Emotion Recognition System from Speech</article-title>
          and
          <source>Visual Information based on Convolutional Neural Networks," 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          , doi: 10.1109/SPED.
          <year>2019</year>
          .
          <volume>8906538</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Partila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tovarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Voznak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rozhon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sevcik</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Baran</surname>
          </string-name>
          ,
          <article-title>"Multi-Classifier Speech Emotion Recognition System," 2018 26th Telecommunications Forum (TELFOR</article-title>
          ),
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          , doi: 10.1109/TELFOR.
          <year>2018</year>
          .
          <volume>8612050</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Tokariev</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tkachov</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilina</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Partyka</surname>
            <given-names>S.</given-names>
          </string-name>
<article-title>Implementation of combined method in constructing a trajectory for structure reconfiguration of a computer system with reconstructible structure and programmable logic</article-title>
          ,
          <source>Selected Papers of the XIX International Scientific and Practical Conference "Information Technologies and Security" (ITS 2019), CEUR Workshop Proceedings</source>
          , 28 Nov.
          <year>2019</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
[10]
          <article-title>Facial Expression Recognition</article-title>
          . URL: https://paperswithcode.com/task/facial-expression-recognition.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Smelyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shupyliuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Martovytskyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tovchyrechko</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Ponomarenko</surname>
          </string-name>
          ,
          <article-title>"Efficiency of image convolution,"</article-title>
          <source>2019 IEEE 8th International Conference on Advanced Optoelectronics and Lasers (CAOL)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>578</fpage>
          -
          <lpage>583</lpage>
, doi: 10.1109/CAOL46282.2019.9019450.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          et al.,
          <article-title>"Enabling AI in Future Wireless Networks: A Data Life Cycle Perspective,"</article-title>
          <source>in IEEE Communications Surveys &amp; Tutorials</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>553</fpage>
          -
          <lpage>595</lpage>
          ,
          <year>Firstquarter 2021</year>
, doi: 10.1109/COMST.2020.3024783.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaterji</surname>
          </string-name>
          et al.,
          <article-title>"Lattice: A Vision for Machine Learning, Data Engineering, and Policy Considerations for Digital Agriculture at Scale,"</article-title>
          <source>in IEEE Open Journal of the Computer Society</source>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>227</fpage>
          -
          <lpage>240</lpage>
          ,
          <year>2021</year>
, doi: 10.1109/OJCS.2021.3085846.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
<article-title>"Emotion Recognition Based On CNN,"</article-title>
          <source>2019 Chinese Control Conference (CCC)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>8627</fpage>
          -
          <lpage>8630</lpage>
          , doi: 10.23919/ChiCC.2019.8866540.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>"Artificial Intelligence Image Recognition Method Based on Convolutional Neural Network Algorithm,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>125731</fpage>
          -
          <lpage>125744</lpage>
          ,
          <year>2020</year>
, doi: 10.1109/ACCESS.2020.3006097.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Smelyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chupryna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hvozdiev</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Sandrkin</surname>
          </string-name>
          ,
          <article-title>"Gradational Correction Models Efficiency Analysis of Low-Light Digital Image,"</article-title>
<source>2019 Open Conference of Electrical, Electronic and Information Sciences (eStream)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
, doi: 10.1109/eStream.2019.8732174.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Wright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. G. A.</given-names>
            <surname>Hutchins</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Treanor</surname>
          </string-name>
          ,
          <article-title>"The Effect of Quality Control on Accuracy of Digital Pathology Image Analysis,"</article-title>
          <source>in IEEE Journal of Biomedical and Health Informatics</source>
          , vol.
          <volume>25</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>307</fpage>
          -
          <lpage>314</lpage>
          , Feb.
          <year>2021</year>
, doi: 10.1109/JBHI.2020.3046094.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yuan</surname>
          </string-name>
,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>"Deep Guidance Network for Biomedical Image Segmentation,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>116106</fpage>
          -
          <lpage>116116</lpage>
          ,
          <year>2020</year>
, doi: 10.1109/ACCESS.2020.3002835.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          et al.,
          <article-title>"DeepIGeoS: A Deep Interactive Geodesic Framework for Medical Image Segmentation,"</article-title>
          <source>in IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>41</volume>
          , no.
          <issue>7</issue>
          , pp.
<fpage>1559</fpage>
          -
          <lpage>1572</lpage>
          , 1
          <year>July 2019</year>
          , doi: 10.1109/TPAMI.2018.2840695.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C.</given-names>
            <surname>Nunes</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pádua</surname>
          </string-name>
          ,
          <article-title>"A Convolutional Neural Network for Learning Local Feature Descriptors on Multispectral Images,"</article-title>
          <source>in IEEE Latin America Transactions</source>
          , vol.
          <volume>20</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>222</lpage>
          , Feb.
          <year>2022</year>
, doi: 10.1109/TLA.2022.9661460.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
<article-title>"Automatic CNN Compression Based on Hyperparameter Learning,"</article-title>
          <source>2021 International Joint Conference on Neural Networks (IJCNN)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          , doi: 10.1109/IJCNN52387.2021.9533329.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>"Parameter Distribution Balanced CNNs,"</article-title>
          <source>in IEEE Transactions on Neural Networks and Learning Systems</source>
          , vol.
          <volume>31</volume>
          , no.
          <issue>11</issue>
          , pp.
          <fpage>4600</fpage>
          -
          <lpage>4609</lpage>
          , Nov.
          <year>2020</year>
, doi: 10.1109/TNNLS.2019.2956390.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gonzales-Martínez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Machacuay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rotta</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Chinguel</surname>
          </string-name>
          ,
          <article-title>"Hyperparameters Tuning of Faster R-CNN Deep Learning Transfer for Persistent Object Detection in Radar Images,"</article-title>
          <source>in IEEE Latin America Transactions</source>
          , vol.
          <volume>20</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>677</fpage>
          -
          <lpage>685</lpage>
          ,
          <year>April 2022</year>
, doi: 10.1109/TLA.2022.9675474.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Griffith</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Golmie</surname>
          </string-name>
          ,
          <article-title>"Toward Deep Transfer Learning in Industrial Internet of Things,"</article-title>
          <source>in IEEE Internet of Things Journal</source>
          , vol.
          <volume>8</volume>
          , no.
          <issue>15</issue>
          , pp.
<fpage>12163</fpage>
          -
          <lpage>12175</lpage>
          , 1 Aug.
          <year>2021</year>
          , doi: 10.1109/JIOT.2021.3062482.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hussein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Bolan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Wallace</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Bagci</surname>
          </string-name>
          ,
          <article-title>"Lung and Pancreatic Tumor Characterization in the Deep Learning Era: Novel Supervised and Unsupervised Learning</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>