=Paper=
{{Paper
|id=Vol-2744/paper32
|storemode=property
|title=Facial Expression Recognition using Distance Importance Scores Between Facial Landmarks
|pdfUrl=https://ceur-ws.org/Vol-2744/paper32.pdf
|volume=Vol-2744
|authors=Elena Ryumina,Alexey Karpov
}}
==Facial Expression Recognition using Distance Importance Scores Between Facial Landmarks==
Facial Expression Recognition using Distance Importance Scores Between Facial Landmarks*

Elena Ryumina 1,2 [0000-0002-4135-6949] and Alexey Karpov 1 [0000-0003-3424-652X]

1 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), St. Petersburg, Russia
2 ITMO University, St. Petersburg, Russia
ryumina_ev@mail.ru, karpov@iias.spb.su

Abstract. In this paper, we present a feature extraction approach for facial expression recognition based on distance importance scores between the coordinates of facial landmarks. Two audio-visual speech databases (CREMA-D and RAVDESS) were used in the research. We conducted experiments with a Long Short-Term Memory Recurrent Neural Network model in single-corpus and cross-corpus setups with sequences of different lengths. The experiments were carried out using different sets and types of visual features. The facial expression recognition accuracy reached 79.1% and 98.9% for the CREMA-D and RAVDESS databases, respectively. The extracted features provide better recognition results than other methods based on the analysis of graphical facial regions.

Keywords: Visual Feature Extraction · Facial Landmarks · Facial Expression Recognition · Automatic Emotion Recognition

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
* This research is supported by the Russian Science Foundation (project No. 18-11-00145).

1 Introduction

Facial expressions are an important channel of nonverbal communication, so interest in the automatic recognition of human emotions from facial expressions increases every year. This is also due to the fact that smart emotion recognition technologies are in demand and are being introduced around the world; for example, automatic facial expression recognition systems are widely used in medicine [1], psychology [2], education [3], fraud detection [4], driver assistance systems [5], etc. In recent years, more research has focused on the analysis of facial expressions in video [6–9], since video captures changes in facial expressions over time. Feature extraction is one of the most important steps in facial expression recognition systems operating on a video stream [10].

The main problems faced by researchers in the field of facial expression recognition are high variability in illumination, occlusions, gender, age and national origin, as well as intra-class variation and inter-class similarities. The extraction of graphical facial regions, which is the most widely used approach, does not cope well with illumination and occlusion problems, while finding the coordinates of facial landmarks adapts well to illumination variation and partial occlusion.

In this work, we extracted the coordinates of facial landmarks from the video streams of two large-scale databases, CREMA-D and RAVDESS. Features were calculated as Euclidean distances between landmarks, and the importance of these features was evaluated using ensemble classifiers. We also extracted features according to the algorithm presented in [11] to compare the effectiveness of the proposed approach. The extracted features were formed into sequences of different lengths, which were used as input to a neural network.
The rest of the article is organized as follows: Section 2 presents an analysis of existing approaches in the field of facial expression recognition and a brief overview of available emotional databases, Section 3 describes a new approach to feature extraction from the coordinates of facial landmarks, Section 4 reports the results of the conducted experiments, and Section 5 contains the discussion and conclusions.

2 Related Work

2.1 Facial Features

There are two main approaches to feature extraction from a video stream: extracting graphical facial regions, where it is possible to keep the raw images or apply various preprocessing methods to the face images [8, 9]; and finding the coordinates of facial landmarks and deriving distances, angles, areas and other quantities from the detected coordinates [12, 13].

Detection of facial landmarks in face images is performed by finding points on the regions of the mouth, eyebrows, eyes, nose, etc. This is easily implemented using pre-trained models from the Dlib library [14]. To date, there are a few research works based on finding and tracking landmarks [15, 16]. The Emotii Android application for audio-visual mood analysis is presented in [11]. Emotii recognizes the user's mood from video by extracting the coordinates of facial landmarks, computing the distances from these coordinates to the "Center of Gravity" of the face, and correcting the face offset using the angle of the nose. A similar approach was previously proposed in [13]. OpenFace, an open-source framework, is described in [6]. OpenFace tracks facial landmarks, head pose and gaze, and estimates facial Action Units (AU) [17], which allows facial behavior to be analyzed in real time. A face can also be divided into regions of interest using the coordinates of the facial landmarks. The division of the face into 12 regions of interest is suggested in [18]; the regions are analyzed for changes in the intensity of each pixel using histograms, and tracking pixel intensities makes it possible to follow changes in micro-expressions across successive images. A facial expression recognition method based on 74 geometric features computed from the (x, y)-coordinates, namely 11 distances and 26 areas for each coordinate, is presented in [12].

To date, except for [11, 19], feature vectors based on facial landmarks have not been extracted from the CREMA-D [20] or RAVDESS [21] databases. In [11], an accuracy of 96.3% was achieved for RAVDESS with 7 classes (calmness was not considered) using a Support Vector Machine (SVM). In [19], the authors proposed using facial landmarks to detect facial regions in the image, with subsequent conversion to grayscale. Then 32 features are extracted from the images using Gabor filters and combined with the 68 positions of the facial landmarks. After reading all frames of a video, the values of the 2176 (32×68) features are averaged. An accuracy of 96.53% was obtained for RAVDESS with 8 classes of emotional speech. An approach based on the 3D Convolutional Neural Network (CNN) branch of a Two-Stream Inflated 3D ConvNet and random frame resizing for data augmentation is described in [9]. The temporal context of the detected face images is modeled using a Long Short-Term Memory (LSTM) network. Accuracies of 66.8% and 60.5% were obtained for the CREMA-D and RAVDESS databases, respectively. The use of Haar features to detect facial regions, with subsequent rotation of the images so that the pupils of the eyes lie at the same level, is proposed in [22].
An accuracy of 79.74% was achieved for 6 classes of emotional songs from the RAVDESS database using a pre-trained AlexNet CNN model.

2.2 Emotional Databases

An emotional database (EDB) is a key element in the emotion recognition task. EDBs are divided into multimodal, bimodal and unimodal ones. Visual databases contain images or video clips with facial expressions. Well-annotated data have a significant impact on the performance of machine learning classification algorithms. Most databases cover 5–7 basic emotions, namely happiness, sadness, anger, fear, disgust, surprise, and neutral. However, some databases also include valence-arousal dimensions and AU codes. EDBs are also divided into those collected in laboratory conditions (imitated emotional expressions) and in real ("in-the-wild", natural emotional expressions) conditions. An extended overview of multimodal databases is presented in [23]. Several of the most popular existing EDBs are compared in Table 1.

For our experiments, we selected two representative audio-visual databases with varying levels of emotional intensity: CREMA-D and RAVDESS.

The CREMA-D database contains 7442 videos of speech, in which 91 actors imitate 6 emotions: happiness (1271 videos), sadness (1271), anger (1271), fear (1271), disgust (1271), and neutral (1087). The cast covers different ethnicities and ranges in age from 20 to 74 years. The resolution of the video clips is 480×360 at 30 frames per second. The database was evaluated by 2443 people for audio, video, and audiovisual data, and the emotion recognition accuracy for these modalities was 40.9%, 58.2%, and 63.6%, respectively.

The RAVDESS database contains 4904 videos of speech and song, in which 24 actors imitate 8 emotions: happiness (752 videos), sadness (752), anger (752), fear (752), disgust (384), surprise (384), neutral (376), and calmness (752). The resolution of the video clips is 1280×720 at 30 frames per second. The database was evaluated by 247 people for audio, video, and audiovisual data, and the emotion recognition accuracy for these modalities was 60%, 75% and 80%, respectively.

Table 1. A comparison of multimodal emotional databases.

Database | # Subjects | # Emotions | # Videos | Specificity
CK+ [24] | 123 | 7 | 593 sequences | AU codes
MMI [25] | 75 | 5 | over 2900 (videos + images) | AU codes; various ethnicities
SAVEE [26] | 4 | 7 | 480 | 60 markers on the faces
Oulu-CASIA [27] | 80 | 6 | 480 sequences | Various illumination conditions and ages (23 to 58 years old)
CREMA-D [20] | 91 | 6 | 7442 | Various ages (20 to 74 years old), races and ethnicities
RAVDESS [21] | 24 | 8 | 4904 | Emotional speech and song
RAMAS [28] | 10 | 7 | 564 | Motion-capture data and physiological signals
Aff-Wild2 [29] | 458 | 7 | 558 | "In-the-wild" database; AU codes and valence-arousal dimensions

3 Proposed Method for Feature Extraction

The architecture of our proposed approach for feature extraction and facial expression recognition is depicted in Figure 1. The pipeline consists of three stages: Stage 1 performs data preprocessing and database creation (facial landmark detection, normalization of landmarks, saving of landmarks and metadata); Stage 2 performs feature importance scoring (extracting all observations, calculating M Euclidean distances, calculating feature importance scores, and obtaining the important landmark pairs); Stage 3 performs feature extraction (extracting N observations, normalizing features, creating sequences, learning models, and prediction).

Fig. 1. Pipeline of our approach for feature extraction and facial expression recognition.
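Stage 1 of this pipeline detects 68 facial landmarks per frame with Dlib and rescales their coordinates to a common 224×224 reference, as detailed below. The following is only a minimal sketch of that step: it assumes the publicly available 68-point model shape_predictor_68_face_landmarks.dat, the function name is illustrative, and the exact rescaling rule (dividing coordinates by the frame size) is our assumption rather than a detail stated in the paper.

<pre>
# Minimal sketch of Stage 1; assumes the publicly available Dlib 68-point model
# "shape_predictor_68_face_landmarks.dat". The rescaling rule below (scaling
# coordinates by 224 / frame size) is our assumption.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks_per_frame(video_path, size=224):
    """Return an array of shape (n_frames_with_face, 68, 2) with landmark
    (x, y) coordinates rescaled to a size x size reference frame."""
    cap = cv2.VideoCapture(video_path)
    points = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if not faces:
            continue  # skip frames where no face is detected
        shape = predictor(gray, faces[0])
        pts = np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)
        h, w = gray.shape
        pts[:, 0] *= size / w   # rescale x-coordinates
        pts[:, 1] *= size / h   # rescale y-coordinates
        points.append(pts)
    cap.release()
    return np.stack(points) if points else np.empty((0, 68, 2), dtype=np.float32)
</pre>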
Data preprocessing and creation of a database with facial landmarks and metadata are carried out at Stage 1. We used the Dlib open-source library [14] to find the coordinates of key facial landmarks. The detected coordinates were scaled to a resolution of 224×224 pixels, since the video resolutions in the research datasets differ. We then saved the obtained coordinates together with metadata about the video and frames (database, video title, video duration, frame number, emotion) for the subsequent extraction of features. As a result of this processing, it was revealed that the average video duration is 76 frames for the CREMA-D database and 122 frames for RAVDESS.

Feature importance scoring is performed at Stage 2. We randomly took 120K observations from the considered databases. We extracted 2278 unique Euclidean distances between the coordinates of the facial landmarks (for example, the distance between points 0 and 1 is equal to the distance between points 1 and 0, so only one of the two possible combinations was taken into account). Since not all of the considered distances have a positive impact on the decision-making of a classifier, only the most important features should be kept. The obtained observations were used as input to the ensemble classifiers Random Forest Classifier (RFC) [30], Extra Trees Classifier (ETC) [31] and AdaBoost Classifier (ABC) [32], which allow us to calculate feature importance scores. The parameters of the classifiers are shown in Table 2.

Table 2. Parameters of the machine classifiers applied.

Classifier | Optimised Parameters
RFC | n_jobs=3, n_estimators=500, warm_start=True, max_depth=6, min_samples_leaf=2, max_features=sqrt
ETC | n_jobs=3, n_estimators=500, max_depth=8, min_samples_leaf=2
ABC | n_estimators=n_estimators, learning_rate=0.75

We obtained feature importance scores from the 3 different classifiers and averaged them. We then set three different importance thresholds (0.0009, 0.001 and 0.002) and obtained three feature sets whose importance scores exceed the corresponding threshold. This resulted in sets of 368, 259 and 104 features, respectively. The algorithm for processing landmark pairs is open-sourced†.

† The annex to the article "Facial Expression Recognition using Distance Importance Scores Between Facial Landmarks", https://elenaryumina.github.io/GraphiCon_2020/

The 10 most important distances between the coordinates of facial landmarks, together with their feature importance scores, are depicted in Figure 2 using the example of a frame from the RAVDESS database. The highest score, 0.014, was obtained by the distance between facial landmarks 9 and 24. As can be seen from the figure, most of the 10 important distances lie in the lower part of the face.

[Figure 2: the ten most important distances overlaid on an example frame, with importance scores ranging from 0.014 for the top-ranked distance down to 0.005.]
Fig. 2. Top-10 important distances and their feature importance scores.

Feature extraction for the various sets, model learning, and obtaining predictions are carried out at Stage 3. We calculated the 368, 259 and 104 Euclidean distances for all observations. In addition, 272 features were extracted using the algorithm presented in [11] to compare the effectiveness of the proposed approach.
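A minimal sketch of Stage 2 is given below: the 2278 unique pairwise Euclidean distances are computed with NumPy, and the three scikit-learn ensemble classifiers from Table 2 are fitted so that their feature importance scores can be averaged and thresholded. Function names are illustrative, and since Table 2 does not give a numeric number of estimators for the AdaBoost classifier, the value used here is a placeholder.

<pre>
# Minimal sketch of Stage 2: unique pairwise distances and averaged feature
# importance scores. Classifier settings follow Table 2; the number of
# estimators for AdaBoost is not given in the paper, so 500 is a placeholder.
from itertools import combinations

import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)

PAIRS = list(combinations(range(68), 2))  # 2278 unique landmark pairs

def pairwise_distances(landmarks):
    """landmarks: (n_frames, 68, 2) -> (n_frames, 2278) Euclidean distances."""
    a = landmarks[:, [i for i, _ in PAIRS], :]
    b = landmarks[:, [j for _, j in PAIRS], :]
    return np.linalg.norm(a - b, axis=-1)

def important_pairs(X, y, threshold=0.001):
    """Fit the three ensemble classifiers, average their feature importance
    scores and keep the distances whose mean score exceeds the threshold."""
    classifiers = [
        RandomForestClassifier(n_jobs=3, n_estimators=500, warm_start=True,
                               max_depth=6, min_samples_leaf=2,
                               max_features="sqrt"),
        ExtraTreesClassifier(n_jobs=3, n_estimators=500, max_depth=8,
                             min_samples_leaf=2),
        AdaBoostClassifier(n_estimators=500, learning_rate=0.75),  # size assumed
    ]
    scores = np.zeros(X.shape[1])
    for clf in classifiers:
        clf.fit(X, y)
        scores += clf.feature_importances_
    scores /= len(classifiers)
    keep = np.where(scores > threshold)[0]
    return keep, scores
</pre>

Here X is the matrix of distances returned by pairwise_distances for the sampled observations and y holds the emotion labels; with the thresholds 0.0009, 0.001 and 0.002 the paper obtains 368, 259 and 104 features, respectively.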
Thus, 5 different feature sets with dimensionalities of 136 (the 68 (x, y)-coordinates of the facial landmarks), 272, 104, 259 and 368 were obtained. These were normalized by the mean values and standard deviations of the features of the training set, which improves the accuracy of facial expression classification. The feature vectors were used as input to an LSTM network that includes two LSTM layers with 128 and 256 output neurons and a dropout rate of 0.5 after each layer; the last layer is a fully connected layer with the number of neurons equal to the number of classes and a softmax activation function. The number of epochs in all experiments was 30. Adam was chosen as the optimizer, with a learning rate of 0.001 and a weight decay of 0.00005. The batch size was 64. These parameters were determined using a grid search at the first training stage.

4 Experimental Results

The experiments were carried out using the LSTM Recurrent Neural Network. Feature sequences of different lengths were used as LSTM input. First, we set the sequence length equal to the average video duration (76 or 122 frames). If a video was shorter than the average duration, the arrays were padded with zeros to the required length; if it was longer, frames were selected with a step equal to the video length divided by the average video duration. The sequence length was also set equal to the number of frames per second (30). In this case video sequences were divided into sections of 30 frames; if a section was shorter than 30 frames, the array was padded with zeros, so all frames were considered. We divided the datasets into 10 roughly identical sets to perform cross-validation. The reported results are the average over these 10 sets. We also conducted experiments in which training was performed on one dataset and testing on the other. Since the CREMA-D dataset does not contain the emotions surprise and calmness, and its average video duration is 76 frames, all 8 emotions and the sequence length of 122 frames were considered only in cross-validation on the RAVDESS database. Accuracy results for feature vectors of 136 components and the experiment numbers are shown in Table 3.

The best accuracy was achieved with a sequence length of 76; this is especially noticeable when training and testing on different databases: for RAVDESS the accuracy increased by 9.05%, and for CREMA-D by 5.60%. The results also show that the model trained on the CREMA-D database gives better accuracy on unfamiliar samples than the model trained on the RAVDESS database. The accuracy and improvement values obtained using feature vectors with dimensions of 272, 104, 259 and 368 components are presented in Table 4. The absolute improvement in accuracy is computed relative to the accuracy obtained without feature extraction from the coordinates of the facial landmarks (i.e., relative to the 136-component vectors). Thus, 8 experiments (with the setups presented in Table 3) were performed for each set of features.

Table 3. Accuracy results for feature vectors of 136 components.

No. | Training Database | Testing Database | Classes | Sequence length | Accuracy (%)
1 | CREMA-D | RAVDESS | 6 | 76 | 66.69
2 | CREMA-D | CREMA-D | 6 | 76 | 76.65
3 | RAVDESS | CREMA-D | 6 | 76 | 47.88
4 | RAVDESS | RAVDESS | 8 | 122 | 97.80
5 | CREMA-D | RAVDESS | 6 | 30 | 57.64
6 | CREMA-D | CREMA-D | 6 | 30 | 76.58
7 | RAVDESS | CREMA-D | 6 | 30 | 42.28
8 | RAVDESS | RAVDESS | 8 | 30 | 97.59
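For reference, a minimal TensorFlow/Keras sketch of the fixed-length sequence construction and the LSTM classifier described in Section 3 is shown below. Layer sizes, dropout, learning rate, batch size and number of epochs follow the text; the categorical cross-entropy loss and the exact frame-sampling rule are our assumptions, and the weight decay of 5e-5 is noted in a comment rather than wired into the optimizer.

<pre>
# Minimal sketch (TensorFlow/Keras) of sequence construction and the LSTM
# classifier; names are illustrative, the loss function is an assumption.
import numpy as np
import tensorflow as tf

def to_fixed_length(features, seq_len):
    """Zero-pad short videos; subsample long ones with a step of
    n_frames / seq_len (our reading of the sampling rule in Section 4)."""
    n = len(features)
    if n >= seq_len:
        idx = (np.arange(seq_len) * n / seq_len).astype(int)
        return features[idx]
    pad = np.zeros((seq_len - n, features.shape[1]), dtype=features.dtype)
    return np.concatenate([features, pad], axis=0)

def build_model(seq_len, n_features, n_classes):
    """Two LSTM layers (128 and 256 units) with dropout 0.5 after each,
    followed by a softmax layer, as described in Section 3."""
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(128, return_sequences=True,
                             input_shape=(seq_len, n_features)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.LSTM(256),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    # The paper additionally applies a weight decay of 5e-5 to the Adam optimizer.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Training setup from the paper: 30 epochs, batch size 64, features normalized
# by the training-set mean and standard deviation before sequence construction.
# model.fit(X_train, y_train, epochs=30, batch_size=64)
</pre>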
Table 4. Accuracy (A, %) and absolute improvement (D) values for various feature vectors; each cell shows A / D, where D is measured relative to the 136-component feature vectors.

No. | 104 comp. (A / D) | 259 comp. (A / D) | 272 comp. (A / D) | 368 comp. (A / D)
1 | 68.07 / 1.38 | 69.43 / 2.74 | 66.75 / 0.06 | 69.59 / 2.90
2 | 77.87 / 1.22 | 79.07 / 2.42 | 77.37 / 0.72 | 78.03 / 1.38
3 | 49.54 / 1.66 | 49.91 / 2.03 | 47.41 / -0.47 | 49.15 / 1.27
4 | 98.65 / 0.85 | 98.86 / 1.06 | 97.84 / 0.04 | 98.41 / 0.61
5 | 59.00 / 1.36 | 58.43 / 0.79 | 57.71 / 0.07 | 57.44 / -0.20
6 | 77.64 / 1.06 | 78.14 / 1.56 | 77.20 / 0.62 | 77.60 / 1.02
7 | 43.81 / 1.53 | 44.83 / 2.55 | 42.83 / 0.55 | 43.74 / 1.46
8 | 98.22 / 0.63 | 98.37 / 0.78 | 98.12 / 0.53 | 98.14 / 0.55

Feature extraction from the coordinates of facial landmarks using the method proposed in [11] also increases the accuracy of facial expression recognition, although by less than the proposed feature sets. As can be seen from the table, feature vectors of 259 components provide a greater growth in accuracy than feature vectors of 104 and 368 components; moreover, their accuracy exceeds that obtained for feature vectors of 272 and 136 components. This confirms the effectiveness of the proposed approach. With cross-validation, the average video duration as the sequence length, and feature vectors of 259 components, the classification accuracy was 79.07% and 98.86% for the CREMA-D and RAVDESS databases, respectively. When training and testing on different databases with 259-component feature vectors and a sequence length of 76, accuracies of 69.43% and 49.91% were achieved for RAVDESS and CREMA-D, respectively.

Table 5 compares our accuracy with other solutions proposed in the recent literature.

Table 5. Comparison of the proposed method with existing approaches.

CREMA-D
Method | Classes | Accuracy, %
Cao et al. 2014 [20] | 6 | 58.2
Ghaleb et al. 2019 [9] | 6 | 66.8
Proposed, seq. length 30 | 6 | 78.1
Proposed, seq. length 76 | 6 | 79.1

RAVDESS
Method | Classes | Accuracy, %
Livingstone et al. 2018 [21] | 8 | 75.0
Ghaleb et al. 2019 [9] | 8 | 60.5
He et al. 2019 [22] | 6 | 79.7
Alshamsi et al. 2019 [11] | 7 | 96.3
Jaratrotkamjorn et al. 2019 [19] | 8 | 96.5
Proposed, seq. length 30 | 8 | 98.4
Proposed, seq. length 122 | 8 | 98.9

As can be seen from the table, our approach outperforms recent results in the task of classifying facial expressions on the CREMA-D and RAVDESS datasets. Thus, using facial landmarks significantly increases the accuracy of facial expression recognition compared to methods based on the analysis of graphical facial regions.

5 Conclusions

In this paper, we have studied various feature extraction methods based on the coordinates of facial landmarks. The research was conducted on two large-scale datasets, CREMA-D and RAVDESS, containing various human emotions with different degrees of intensity. The highest recognition accuracy was achieved after carrying out the following proposed processing steps. The 68 detected coordinates of facial landmarks were scaled to an area of 224×224, since the videos have different resolutions. 2278 unique Euclidean distances were calculated between the 68 facial landmarks. Three configurations with different numbers of facial distances were studied, containing the distances that have the greatest importance scores and most accurately characterize changes in facial expressions. An LSTM network was applied to capture the long-term dependencies of frame-by-frame changes in facial expressions for different sequence lengths and feature sets. We analyzed the impact of the different feature sets on facial expression recognition in both single-corpus (10-fold cross-validation) and cross-corpus setups.
The experimental results showed that the largest absolute improvement in recognition accuracy is achieved with the average video duration as the sequence length and the feature set of 259 components. This suggests that the 259 components better generalize changes in facial expressions in both single-corpus and cross-corpus setups. The best recognition accuracies of 79.1% and 98.9% were obtained in the single-corpus setup for the CREMA-D and RAVDESS datasets, respectively. Our facial expression recognition results outperform state-of-the-art results for the same datasets and experimental setups.

In our future work, we are going to apply the proposed approach to other widely used databases, such as CK+, Aff-Wild2, etc.

References

1. Nijsse, B., Spikman, J.M., Visser-Meily, J.M., de Kort, P.L., van Heugten, C.M.: Social Cognition Impairments in the Long Term Post Stroke. Archives of Physical Medicine and Rehabilitation. vol. 100, no. 7, pp. 1300–1307 (2019)
2. Chen, L., Wu, M., Zhou, M., Liu, Z., She, J., Hirota, K.: Dynamic emotion understanding in human-robot interaction based on two-layer fuzzy SVR-TS model. IEEE Transactions on Systems, Man, and Cybernetics: Systems. vol. 50, no. 2, pp. 490–501 (2017)
3. Ninaus, M., Greipl, S., Kiili, K., Lindstedt, A., Huber, S., Klein, E., Moeller, K.: Increased emotional engagement in game-based learning - A machine learning approach on facial emotion detection data. Computers & Education. vol. 142, pp. 103641 (2019)
4. Prasad, N., Unnikrishnan, K., Jayakrishnan, R.: Fraud Detection by Facial Expression Analysis Using Intel RealSense and Augmented Reality. In: 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), pp. 919–923 (2018)
5. Izquierdo-Reyes, J., Ramirez-Mendoza, R.A., Bustamante-Bello, M.R., Navarro-Tuch, S., Avila-Vazquez, R.: Advanced driver monitoring for assistance system (ADMAS). International Journal on Interactive Design and Manufacturing (IJIDeM). vol. 12, no. 1, pp. 187–197 (2018)
6. Baltrušaitis, T., Robinson, P., Morency, L.P.: OpenFace: an open source facial behavior analysis toolkit. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Placid, NY, USA, pp. 1–10 (2016)
7. Jannat, R., Tynes, I., Lime, L.L., Adorno, J., Canavan, S.: Ubiquitous emotion recognition using audio and video data. In: Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pp. 956–959 (2018)
8. Fan, Y., Lu, X., Li, D., Liu, Y.: Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 445–450 (2016)
9. Ghaleb, E., Popa, M., Asteriadis, S.: Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition. In: 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII). Cambridge, United Kingdom, pp. 552–558 (2019)
10. Ryumina, E.V., Karpov, A.A.: Analytical review of methods for emotion recognition by human face expressions. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. vol. 20, no. 2, pp. 163–176 (2020) (in Russian)
11. Alshamsi, H., Kepuska, V., Alshamsi, H., Meng, H.: Automated Facial Expression and Speech Emotion Recognition App Development on Smart Phones using Cloud Computing. In: 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON). Vancouver, BC, Canada, pp. 730–738 (2018)
12. Nasir, M., Jati, A., Shivakumar, P.G., Nallan Chakravarthula, S., Georgiou, P.: Multimodal and multiresolution depression detection from speech and facial landmark features. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 43–50 (2016)
13. Van Gent, P.: Emotion Recognition Using Facial Landmarks, Python, DLib and OpenCV. A tech blog about fun things with Python and embedded electronics (2016)
14. King, D.E.: Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research. vol. 10, pp. 1755–1758 (2009)
15. Gite, B., Nikhal, K., Palnak, F.: Evaluating facial expressions in real time. In: 2017 Intelligent Systems Conference (IntelliSys). London, UK, pp. 849–855 (2017)
16. Al-Omair, O.M., Huang, S.: A Comparative Study of Algorithms and Methods for Facial Expression Recognition. In: 2019 IEEE International Systems Conference (SysCon). Orlando, FL, USA, pp. 1–6 (2019)
17. Ekman, P., Friesen, W.V.: Facial Action Coding System: Investigator's Guide. Consulting Psychologists Press (1978)
18. Li, Q., Zhan, S., Xu, L., Wu, C.: Facial micro-expression recognition based on the fusion of deep learning and enhanced optical flow. Multimedia Tools and Applications. vol. 78, no. 20, pp. 29307–29322 (2019). https://doi.org/10.1007/s11042-018-6857-9
19. Jaratrotkamjorn, A., Choksuriwong, A.: Bimodal Emotion Recognition using Deep Belief Network. In: 2019 23rd International Computer Science and Engineering Conference (ICSEC), pp. 103–109 (2019)
20. Cao, H., Cooper, D.G., Keutmann, M.K., Gur, R.C., Nenkova, A., Verma, R.: CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing. vol. 5, no. 4, pp. 377–390 (2014)
21. Livingstone, S.R., Russo, F.A.: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PloS One. vol. 13, no. 5, e0196391 (2018)
22. He, Z., Jin, T., Basu, A., Soraghan, J., Di Caterina, G., Petropoulakis, L.: Human emotion recognition in video using subtraction pre-processing. In: Proceedings of the 2019 11th International Conference on Machine Learning and Computing, pp. 374–379 (2019)
23. Siddiqui, M.F.H., Javaid, A.Y.: A Multimodal Facial Emotion Recognition Framework through the Fusion of Speech with Visible and Infrared Images. Multimodal Technologies and Interaction. vol. 4, no. 3:46, pp. 1–20 (2020)
24. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, San Francisco, CA, USA, pp. 94–101 (2010)
25. Valstar, M., Pantic, M.: Induced disgust, happiness and surprise: an addition to the MMI facial expression database. In: Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, pp. 65–70 (2010)
26. Haq, S., Jackson, P.J.: Multimodal emotion recognition. Machine Audition: Principles, Algorithms and Systems, pp. 398–423 (2010)
27. Zhao, G., Huang, X., Taini, M., Li, S.Z., Pietikäinen, M.: Facial expression recognition from near-infrared videos. Image and Vision Computing. vol. 29, no. 9, pp. 607–619 (2011)
28. Perepelkina, O., Kazimirova, E., Konstantinova, M.: RAMAS: Russian Multimodal Corpus of Dyadic Interaction for Affective Computing. Springer International Publishing. vol. 11096, pp. 501–510 (2018)
29. Kollias, D., Zafeiriou, S.: Aff-Wild2: Extending the Aff-Wild database for affect recognition. arXiv preprint arXiv:1811.07770 (2018)
30. Breiman, L.: Random forests. Machine Learning. vol. 45, no. 1, pp. 5–32 (2001)
31. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learning. vol. 63, no. 1, pp. 3–42 (2006)
32. Hastie, T., Rosset, S., Zhu, J., Zou, H.: Multi-class AdaBoost. Statistics and its Interface. vol. 2, no. 3, pp. 349–360 (2009)