A Multi-resolution Training for Expression Recognition in the Wild (Discussion Paper)

Fabio Valerio Massoli, Donato Cafarelli, Giuseppe Amato and Fabrizio Falchi
ISTI-CNR, via G. Moruzzi 1, 56124 Pisa, Italy

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
Email: fabio.massolli@isti.cnr.it (F. V. Massoli); donato.caf@gmail.com (D. Cafarelli); giuseppe.amato@isti.cnr.it (G. Amato); fabrizio.falchi@isti.cnr.it (F. Falchi)
ORCID: 0000-0001-6447-1301 (F. V. Massoli); 0000-0002-7575-0143 (D. Cafarelli); 0000-0003-0171-4315 (G. Amato); 0000-0001-6258-5313 (F. Falchi)

Abstract
Facial expressions play a fundamental role in human communication, and their study, a genuinely multidisciplinary subject, embraces a great variety of research fields, from psychology to computer science, among others. In the context of Deep Learning, the recognition of facial expressions is a task named Facial Expression Recognition (FER). With such an objective, the goal of a learning model is to classify human emotions starting from a facial image of a given subject. Typically, face images are acquired by cameras that have, by nature, different characteristics, such as their output resolution. Moreover, cameras might be placed far from the observed scene, thus yielding faces at very low resolution. Since the FER task might therefore involve analyzing face images acquired from heterogeneous sources, it is plausible to expect that resolution plays a vital role. In such a context, we propose a multi-resolution training approach to solve the FER task. We ground our intuition on the observation that, often, face images are acquired at different resolutions; thus, directly accounting for this property while training a model can help achieve higher performance in recognizing facial expressions. To this aim, we use a ResNet-like architecture, equipped with Squeeze-and-Excitation blocks, trained on the Affect-in-the-Wild 2 dataset. Since a test set is not available, we conduct tests and model selection on the validation set only, on which we achieve more than 90% accuracy in classifying the seven expressions that the dataset comprises.

Keywords
Facial Expression Recognition, Deep Convolutional Neural Networks, Multi-resolution training.

1. Introduction

Facial expressions play a fundamental role in human communication. Indeed, they typically reveal the actual emotional status of people beyond the spoken language. Moreover, the comprehension of human affect based on visual patterns is a crucial ingredient for any human-machine interaction system [1] and, for such reasons, the task of Facial Expression Recognition (FER) draws both scientific and industrial interest. In recent years, Deep Learning techniques have reached very high performance on FER by exploiting different architectures and learning paradigms. In such a context, we propose a multi-resolution approach to solve the FER task. We ground our intuition on the observation that, often, face images are acquired at different resolutions; thus, directly accounting for this property while training a model can help achieve higher performance in recognizing facial expressions.
To this aim, we use a ResNet-like architecture, equipped with Squeeze-and-Excitation blocks, trained on the Affect-in-the-Wild 2 dataset. Since a test set is not available, we conduct tests and model selection on the validation set only, on which we achieve more than 90% accuracy in classifying the seven expressions that the dataset comprises. To let other researchers reproduce our results, we made our code publicly available on GitHub (https://github.com/fvmassoli/affwild2-challenge.git).

The remainder of the paper is organized as follows. In Section 2 we report several works related to the FER task, while in Section 3 and Section 4 we describe our approach and the dataset we use, respectively. Moreover, we describe the experimental campaigns we perform and the corresponding model performance in Section 5. Finally, in Section 6 we conclude our work and report our future plans.

2. Related Works

Nowadays, the most promising approaches to the FER task are based on Deep Convolutional Neural Networks (DCNN). A typical approach consists of a pre-processing phase, where the images undergo various transformations, and a training phase, where these images are iteratively given as input to a DCNN model for feature extraction and expression classification.

In [2], the authors propose a new face-cropping approach to remove useless regions in an image and a novel rotation strategy to cope with data scarcity. Furthermore, they build a simplified DCNN structure to reduce training/inference time and achieve real-time FER on devices with limited resources. Their experiments were conducted on two databases, CK+ [3] and JAFFE [4], and achieved state-of-the-art results. In [5], a novel activation function based on the ReLU function, called LS-ReLU, is presented; it exploits an adjustable log function and the soft-sign function. Neural networks based on the LS-ReLU function can mitigate over-fitting during training and reduce oscillations. The experiments on the JAFFE [4] and FER2013 [6] datasets showed that a DCNN based on this novel activation function achieves better performance than most state-of-the-art activation functions.

With the transition of FER datasets from laboratory-controlled to in-the-wild conditions, the task has become more challenging due to variations in pose, brightness, and background, to mention a few. Therefore, in [7] the authors focus on the FER task by analyzing the contribution of different face areas to different emotions, including the nose, mouth, eyes, nose-to-mouth, nose-to-eyes, and mouth-to-eyes areas, together with the whole face. The paper [8] addresses the problem of class imbalance in in-the-wild FER datasets. To such an aim, the authors propose a novel Discriminant Distribution-Agnostic loss (DDA loss) to optimize the embedding space for extreme class-imbalance scenarios. Specifically, the DDA loss enforces inter-class separation of deep features for both majority and minority classes. In [9] the authors propose a multi-task learning framework to extract local-global and spatio-temporal information for a discriminative and robust representation of facial expressions. Their experiments achieved competitive results on the CK+ [3] and Oulu-CASIA [10] datasets.
To improve the performance on the FER task, [11] proposes a novel "Masking Idea", implemented in a Residual Masking Network that contains several masking blocks applied across the residual layers to improve the network's ability to attend to relevant information. Experiments showed competitive results on the FER2013 [6] dataset.

3. Approach

Usually, face images come from heterogeneous sources [12], e.g., cameras with different resolutions or placed at different distances from the scene. Such characteristics directly impact Deep Learning (DL) models on tasks such as Face Recognition (FR), dramatically lowering their performance [13]. Based on such an observation, we propose our approach grounded on the hypothesis that the images' resolution has a non-negligible impact on DL models' behavior when tested against the FER task. Specifically, we start from [13], in which the authors explicitly account for the multi-resolution nature of face images by designing a training technique that adequately accommodates this issue. In our work, we take inspiration from the authors' training procedure and adapt it to our case. In particular, we experimentally observe that we need neither a teacher-supervised signal nor curriculum learning. Thus, we simplify the training procedure by only exploiting the double random extraction to set the final image resolution.

To train the models and perform model selection, we employ the Aff-Wild2 [14] dataset. We refer the reader to Section 4 for a brief description of the dataset. Our base model is a ResNet-50 architecture [15], equipped with Squeeze-and-Excitation blocks [16], pre-trained on the VGGFace2 dataset [17]. To train our models, we use the Adam [18] optimizer. We set the weight decay to 1e-4 and the learning rate to 1e-3 for the last fully connected layer and to 1e-4 for all the other layers. Moreover, we set the batch size to 128, and we use data augmentation techniques to avoid overfitting. Specifically, we first resize the images so that the shortest side is 256 pixels (while keeping the original aspect ratio), then we randomly crop a 224x224-pixel square, and finally we normalize the input channels. Moreover, we apply a random grayscale conversion with a probability of 0.2. To test the model on the validation set, we substitute the random crop with a center crop and remove the grayscale operation. Concerning the random resolution extraction used to train the models, we perform several experiments considering different ranges for the final image size, with the minimum and maximum considered values being 8 and 256 pixels, respectively.
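As a concrete illustration of the input pipeline described above, the following PyTorch/torchvision sketch shows one possible way to combine the augmentation steps with a random resolution extraction. The RandomResolution class, its down-sampling probability, and the normalization statistics are illustrative assumptions rather than values prescribed by our method; the exact implementation is available in our repository.

```python
import random

from PIL import Image
from torchvision import transforms


class RandomResolution:
    """Down-sample the input to a random resolution and rescale it back, so that
    the network sees faces at heterogeneous effective resolutions. The probability
    p and the interpolation filter are illustrative choices, not values reported
    in the paper."""

    def __init__(self, min_res: int = 8, max_res: int = 256, p: float = 0.5):
        self.min_res, self.max_res, self.p = min_res, max_res, p

    def __call__(self, img: Image.Image) -> Image.Image:
        # First random draw: decide whether to alter the resolution at all.
        if random.random() > self.p:
            return img
        # Second random draw: target size of the shortest side, in [min_res, max_res].
        target = random.randint(self.min_res, self.max_res)
        w, h = img.size
        scale = target / min(w, h)
        small = img.resize((max(1, round(w * scale)), max(1, round(h * scale))), Image.BILINEAR)
        return small.resize((w, h), Image.BILINEAR)


# Training-time pipeline: shortest side to 256 px, random 224x224 crop,
# random resolution extraction, random grayscale (p=0.2), normalization.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    RandomResolution(min_res=8, max_res=256),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # illustrative statistics
])

# Validation-time pipeline: deterministic center crop, no grayscale conversion.
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```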
4. Dataset: Affect-in-the-Wild 2

The Aff-Wild2 [14] dataset is the first-ever database annotated for all three main behavior tasks: Valence-Arousal (VA) estimation, Action Unit (AU) detection, and Expression (EX) classification. Concerning the last one, the dataset consists of 547 videos (collected from YouTube) that account for ∼2.6M frames labeled with seven basic expressions: neutral, anger, disgust, fear, happiness, sadness, and surprise. The annotation is made frame-by-frame by a team of seven experts. The dataset is shipped with a protocol that divides it into three non-overlapping subsets for training, validation, and test purposes. Specifically, the three partitions consist of 253, 71, and 223 videos, respectively.

The cropped-aligned version of the dataset is made of images preprocessed to have a fixed resolution of 112x112 pixels. Among the ∼2.6M available images, ∼1.2M are available for training and validation on the FER task. We report in Figure 1 an example of training images from the Aff-Wild2 [14] dataset.

Figure 1: Example of face images from the Aff-Wild2 [14] dataset. On top of each image, we report the corresponding ground truth expression.

As we mentioned previously, the dataset comprises seven different expressions with very different cardinalities. In Table 1, we report the number of images for each class, both for the training and validation sets, while in Table 2, we report the classes' weights computed on the training images only.

Expression    Training (%)      Validation (%)
Neutral       585896 (63.5)     181884 (57.0)
Anger          23484 (2.5)        8003 (2.5)
Disgust        12497 (1.4)        5401 (1.7)
Fear           11120 (1.2)        9671 (3.0)
Happiness     149920 (16.3)      52842 (16.5)
Sadness       100548 (11.0)      38534 (12.1)
Surprise       38564 (4.1)       22988 (7.2)

Table 1: Classes' cardinality for the Aff-Wild2 [14] dataset.

As one can notice from Table 1, the classes are not balanced. For that reason, we leverage a balanced cross-entropy loss to account for the class imbalance. To such an aim, we use the weights reported in Table 2.

Expression    Weight
Neutral       0.365
Anger         0.975
Disgust       0.986
Fear          0.988
Happiness     0.837
Sadness       0.891
Surprise      0.958

Table 2: Classes' weights for the Aff-Wild2 [14] dataset. The reported values refer to the training set only. Note that the weights do not need to sum to one. The lower the weight, the higher the cardinality of the corresponding class.

5. Experimental Results

In this section, we report the experimental results we obtained on the Aff-Wild2 [14] dataset. Since the dataset is currently employed in the Affect-in-the-Wild Challenge [19], the ground truth labels of the test set are not available. For such a reason, we report the performance of our model on the validation set. Before training, we took a small subsample of the validation set and used it for model selection purposes to avoid bias. Subsequently, we tested the best model on the entire validation set.

To report our results, we use different metrics. First, we evaluate the F1 score on each class; then, we summarize the overall performance of our best model across all seven expressions by reporting the macro-averaged F1 score and the overall accuracy. Finally, we evaluate the score required by the Affect-in-the-Wild Challenge [19], which is equal to:

s = 0.33 · accuracy + 0.67 · F1,    (1)

where the accuracy and the F1 score are computed on the whole dataset. We report the results in Table 3 and Table 4, concerning the single classes and the whole dataset, respectively.

Expression    F1 score
Neutral       0.978
Anger         0.960
Disgust       0.965
Fear          0.971
Happiness     0.946
Sadness       0.987
Surprise      0.937

Table 3: F1 score for each class of the Aff-Wild2 [14] dataset.

Accuracy    F1 score (macro-average)    Challenge score
0.970       0.964                       0.966

Table 4: Summary statistics over all the classes of the Aff-Wild2 [14] dataset.

From the previous tables, we can notice that our model shows promising performance on the FER task. Moreover, we note the stability of the scores among the different classes, even though the dataset is highly unbalanced, as reported in Table 1.
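For completeness, the evaluation described above can be reproduced with standard tooling. The snippet below is a minimal sketch (the function and label names are illustrative) that, given the frame-level ground-truth and predicted labels, computes the per-class F1 scores, the macro-averaged F1, the accuracy, and the challenge score of Eq. (1).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

EXPRESSIONS = ["Neutral", "Anger", "Disgust", "Fear", "Happiness", "Sadness", "Surprise"]


def challenge_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Per-class F1, macro-averaged F1, accuracy, and the challenge score of Eq. (1)."""
    per_class_f1 = f1_score(y_true, y_pred, labels=list(range(len(EXPRESSIONS))), average=None)
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    accuracy = accuracy_score(y_true, y_pred)
    return {
        "per_class_f1": dict(zip(EXPRESSIONS, per_class_f1)),
        "macro_f1": macro_f1,
        "accuracy": accuracy,
        # Challenge score, Eq. (1): s = 0.33 * accuracy + 0.67 * F1.
        "challenge_score": 0.33 * accuracy + 0.67 * macro_f1,
    }
```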
6. Conclusions and Future Works

In this work, we report our first experimental campaign focused on the FER task. We tackle such a problem by giving more representational power to our models, assuming a multi-resolution context, and we observe promising results. As a next step, we will extend our experimental campaign to test our approach on different publicly available datasets such as FER2013 [6], RAF-DB [20], and Oulu-CASIA [10].

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research. This work was partially supported by WAC@Lucca, funded by Fondazione Cassa di Risparmio di Lucca, by AI4EU, an EC H2020 project (Contract n. 825619), and is based upon work from COST Action 16101 "MULTI-modal Imaging of FOREnsic SciEnce Evidence (MULTI-FORESEE)", supported by COST (European Cooperation in Science and Technology).

References

[1] V. Bettadapura, Face expression recognition and analysis: The state of the art, CoRR abs/1203.6722 (2012). URL: http://arxiv.org/abs/1203.6722.
[2] K. Li, Y. Jin, M. W. Akram, R. Han, J. Chen, Facial expression recognition with convolutional neural networks via a new face cropping and rotation strategy, The Visual Computer 36 (2020) 391–404.
[3] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression, in: 2010 IEEE Computer Society CVPR Workshops, IEEE, 2010, pp. 94–101.
[4] M. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, Coding facial expressions with Gabor wavelets, in: Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, 1998, pp. 200–205.
[5] Y. Wang, Y. Li, Y. Song, X. Rong, The influence of the activation function in a convolution neural network model of facial expression recognition, Applied Sciences 10 (2020) 1897.
[6] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee, et al., Challenges in representation learning: A report on three machine learning contests, in: International Conference on Neural Information Processing, Springer, 2013, pp. 117–124.
[7] Z. Lian, Y. Li, J.-H. Tao, J. Huang, M.-Y. Niu, Expression analysis based on face regions in real-world conditions, International Journal of Automation and Computing 17 (2020) 96–107.
[8] A. H. Farzaneh, X. Qi, Facial expression recognition in the wild via deep attentive center loss, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2402–2411.
[9] M. Yu, H. Zheng, Z. Peng, J. Dong, H. Du, Facial expression recognition based on a multi-task global-local network, Pattern Recognition Letters 131 (2020) 166–171.
[10] G. Zhao, X. Huang, M. Taini, S. Z. Li, M. Pietikäinen, Facial expression recognition from near-infrared videos, Image and Vision Computing 29 (2011) 607–619.
[11] P. Luan, V. Huynh, T. Tuan Anh, Facial expression recognition using residual masking network, in: IEEE 25th International Conference on Pattern Recognition, 2020, pp. 4513–4519.
[12] F. V. Massoli, F. Falchi, C. Gennaro, G. Amato, Cross-resolution deep features based image search, in: International Conference on Similarity Search and Applications, Springer, 2020, pp. 352–360.
[13] F. V. Massoli, G. Amato, F. Falchi, Cross-resolution learning for face recognition, Image and Vision Computing 99 (2020) 103927.
[14] D. Kollias, S. Zafeiriou, Aff-Wild2: Extending the Aff-Wild database for affect recognition, arXiv:1811.07770 (2018).
[15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, CoRR abs/1512.03385 (2015).
[16] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE CVPR, 2018, pp. 7132–7141.
[17] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman, VGGFace2: A dataset for recognising faces across pose and age, CoRR abs/1710.08092 (2017).
[18] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv:1412.6980 (2014).
[19] D. Kollias, S. Zafeiriou, First Affect-in-the-Wild Challenge, https://ibug.doc.ic.ac.uk/resources/first-affect-wild-challenge/, 2020.
[20] S. Li, W. Deng, J. Du, Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild, in: Proceedings of the IEEE CVPR, 2017, pp. 2852–2861.