Methods for emotions, mood, gender and age recognition

D D Pribavkin1, P Y Yakimov1,2

1 Samara National Research University, Moskovskoe Shosse 34А, Samara, Russia, 443086
2 Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS, Molodogvardejskaya street 151, Samara, Russia, 443001

e-mail: pribavkindenis@gmail.com

Abstract. Recognizing not only shapes in images but also metadata is becoming increasingly popular among researchers in the field of convolutional neural networks and deep learning. This article provides an analytical overview of modern software solutions that recognize the emotions, mood, gender and age of a person from images. Enthusiasts keep inventing new architectures of convolutional neural networks that solve these tasks with considerable recognition accuracy.

1. Introduction
In recent years, parallel data processing technologies have developed rapidly, in particular thanks to graphics processors, which are no longer intended only for computer graphics. This has made it possible to train even the most complex neural network architectures and has opened up a whole horizon of previously unsolvable tasks [1], [2], [3]. Modern intelligent systems focus not only on recognizing patterns in the input image, but also learn to extract metadata from recognized objects, such as emotions, mood, gender, or a person's age. Many researchers and enthusiasts in the field of machine learning and convolutional neural networks develop and offer their own unique solutions that differ both ideologically and technically.
This article offers an analytical review of the following software solutions in the field of recognizing the emotions, mood, gender and age of a person:
• Emotion recognition using a Deep Convolutional Neural Network [4].
• A Compact Soft Stagewise Regression Network [5].
• Real-time Convolutional Neural Networks for Emotion and Gender Classification [6].
• Age recognition using CNNs [7].

2. Review of existing solutions

2.1. Emotion recognition using Deep Convolutional Neural Networks
This solution is a trained neural network that recognizes, in real time, emotions on a human face detected in the input video stream.

V International Conference on "Information Technology and Nanotechnology" (ITNT-2019), Data Science. D D Pribavkin, P Y Yakimov

It was built using the TFLearn library for the Python programming language, based on the well-known TensorFlow machine learning framework developed by Google in 2015 [8]. This framework simplifies network development, since only the layers themselves need to be described instead of each neuron separately, and it also simplifies training by providing real-time feedback on the process and the learning accuracy. Moreover, the library allows a trained model to be saved for later use. The resulting neural network model is shown in Figure 1.

Figure 1. Neural network model.

In each frame of the video stream, an attempt is made to detect human faces. This is achieved using a detection method from the open OpenCV library [9]. If a face is found in the image, it is cropped and scaled to 48x48 pixels, and only then fed to the input of the neural network. Thus, we obtain optimized software that consumes neural network resources only if there is at least one human face in the frame.
The model was trained on the FERC-2013 dataset, which contains about 20,000 images with examples of the following emotions: anger, fear, happiness, sadness, surprise, indifference and disgust. The distribution of emotions in this data is shown in Figure 2.

Figure 2. Density of distribution of emotions in the FERC-2013 dataset.

According to the training results, an emotion recognition accuracy of 67% was achieved.
2.2. SSR-Net
This solution is an original neural network with soft stagewise regression for recognizing age and gender. The network works according to the following principle: 64x64 pixel images are fed to the input, a multi-stage classification over several classes is performed, where each stage refines the result of the previous one, and the classification result is then converted into a final value by regression. The model itself is very compact and takes only 0.32 MB, yet despite its compact size the performance of SSR-Net is close to that of state-of-the-art methods whose models are 1500 times larger. A model of this network with three stages and a pool size of 2 is shown in Figure 3.

Figure 3. Neural network model SSR-Net.

Datasets such as IMDB, WIKI and Morph2 [11] were used to train this model. About 80% of randomly selected images from the datasets were used for training, and the remaining 20% for testing. The dependence of the number of recognition errors of the SSR-Net, MobileNet and DenseNet networks trained on the Morph2 data on the number of epochs is presented in Figure 4.

Figure 4. Graph of recognition errors versus the number of epochs.

2.3. Face classification and detection from the B-IT-BOTS robotics team
This is a real-time classifier of a person's emotions and gender, based on a convolutional neural network and the open image processing library OpenCV [9]. The model of this neural network is shown in Figure 5 and contains 600,000 parameters. This model was trained on the IMDB dataset, which contains about 460,723 RGB images, each belonging to one of two classes: male or female [10]. On this dataset, a recognition accuracy of 96% was achieved.
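The soft stagewise regression at the heart of SSR-Net (Section 2.2) can be illustrated with a small NumPy sketch. This is a simplified version of the idea only: each stage outputs a probability distribution over a few bins, the bin width shrinks at every stage, and the expected values of all stages are summed into one regressed number. It deliberately omits the learned dynamic-range and bin-shift refinements of the actual network, and the numbers below are purely illustrative.

```python
import numpy as np

def ssr_expected_value(stage_probs, v=101.0):
    """Combine per-stage softmax outputs into a single regressed value.

    stage_probs: list of 1-D arrays, one per stage; array k is a probability
    distribution over that stage's bins.
    v: dynamic range of the output (e.g. ages 0..100).
    """
    total, width = 0.0, v
    for probs in stage_probs:
        probs = np.asarray(probs, dtype=np.float64)
        width /= len(probs)                 # bins get finer at every stage
        bin_indices = np.arange(len(probs))
        # "Soft" stage output: expected bin index instead of a hard argmax.
        total += width * float(probs @ bin_indices)
    return total

# Three stages of 3 bins each, echoing the 3-stage configuration in Figure 3.
coarse = [0.0, 1.0, 0.0]   # middle third of the range
mid    = [1.0, 0.0, 0.0]   # first ninth within that third
fine   = [0.0, 0.0, 1.0]   # last bin of the finest stage
age = ssr_expected_value([coarse, mid, fine])
```

Because each stage only needs a handful of output neurons, the classification heads stay tiny, which is one reason the full model fits in 0.32 MB.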
This model was also validated on the FER-2013 dataset, which includes 35,887 grayscale images, each belonging to one of the emotion classes: anger, disgust, fear, joy, sadness, surprise and indifference. On this dataset, an accuracy of 66% was achieved.

2.4. Age and gender estimation
This solution is an implementation of a convolutional neural network for recognizing the gender and age of a person from the input image. The VGG-16 network architecture was taken as the basis due to its depth and controllability. The network accepts 256x256 pixel images as input. Training was carried out on the IMDB-WIKI datasets, and a recognition accuracy of 64% was achieved [10].

Figure 5. Neural network model.

2.5. General comparison of implementations
As a result of the analytical review, the solutions from publications [6, 7] were selected:
1. A solution that recognizes a person's age, trained on the IMDB-WIKI dataset with a recognition accuracy of 64%.
2. A solution that recognizes a person's gender, trained on the IMDB dataset with a recognition accuracy of 96%.
3. A solution that recognizes a person's emotions, trained on the FER-2013 dataset with a recognition accuracy of 66%.
The source code of each solution was carefully analyzed and revised so that a digital image is provided as input to the software and the prediction of the convolutional neural network is returned as output.

3. Experimental studies
The previous section yielded software solutions whose main task is to recognize the age, gender and emotions of a person from a digital face image. These solutions were chosen as objects for an experimental study of their performance on 10 random images of people's faces from the IMDB-WIKI dataset. The following hardware and software were used in the experiments:
1. Processor: Intel Core i5-4570, 3.2 GHz.
2. RAM: 8 GB.
3. Operating system: Manjaro 18.0.4 "Illyria".
4. Programming language: Python 3.6.5.

The images used in the study (image numbers 1–10) are shown in Figures 6–15.

Since a neural network always produces the same result for the same input data, performing multiple independent recognition attempts in a row is of little value in itself; however, it allows a more accurate measurement of the execution time of a convolutional neural network on a single digital image. Tables 1, 2 and 3 present the results of recognizing age, gender and emotions, respectively, as well as the average execution time over 5 runs of the convolutional neural network on each digital image.

Table 1. Final age recognition.

Image № | Age | Average time, ms
1  | 64 | 523
2  | 31 | 642
3  | 39 | 627
4  | 34 | 510
5  | 47 | 481
6  | 17 | 524
7  | 56 | 537
8  | 25 | 450
9  | 34 | 583
10 | 25 | 592

Table 2. Final gender recognition.

Image № | Gender | Average time, ms
1  | Man   | 324
2  | Woman | 451
3  | Man   | 430
4  | Man   | 318
5  | Man   | 293
6  | Woman | 305
7  | Man   | 421
8  | Man   | 326
9  | Man   | 432
10 | Woman | 476

Table 3. Final emotion recognition.

Image № | Emotion | Average time, ms
1  | Happiness | 389
2  | Happiness | 513
3  | Surprise  | 527
4  | Happiness | 408
5  | Fear      | 362
6  | Happiness | 385
7  | Happiness | 503
8  | Happiness | 476
9  | Sadness   | 513
10 | Neutral   | 563

As Tables 1, 2 and 3 show, although the recognition accuracy stated by the authors still leaves room for refinement (for example, the gender network is clearly mistaken on image No.
8), these pre-trained convolutional neural networks are capable of producing meaningful results that reflect reality.

4. Conclusion
We can conclude that tasks such as the recognition of emotions, mood, gender and age are very popular among researchers all over the world. Enthusiasts use different approaches to implementing intelligent systems that solve such problems, and achieve good recognition accuracy even with limited resources. The main means of implementation are convolutional neural networks of various architectures, trained on well-known image datasets.

5. References
[1] Bibikov S A, Kazanskiy N L and Fursov V A 2018 Vegetation type recognition in hyperspectral images using a conjugacy indicator Computer Optics 42(5) 846-854 DOI: 10.18287/2412-6179-2018-42-5-846-854
[2] Shatalin R A, Fidelman V R and Ovchinnikov P E 2017 Abnormal behavior detection method for video surveillance applications Computer Optics 41(1) 37-45 DOI: 10.18287/2412-6179-2017-41-1-37-45
[3] Shustanov A, Yakimov P 2017 CNN Design for Real-Time Traffic Sign Recognition Procedia Engineering 201 718-725 DOI: 10.1016/j.proeng.2017.09.594
[4] Correa E, Jonker A, Ozo M, Stolk R Emotion Recognition using Deep Convolutional Neural Network URL: https://github.com/isseu/emotion-recognition-neural-networks/blob/master/paper/Report_NN.pdf (1.11.2018)
[5] Tsun-Yi Y, Yi-Hsuan H, Yen-Yu L, Pi-Cheng Hu, Yung-Yu Ch SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation URL: https://github.com/shamangary/SSR-Net/blob/master/ijcai18_ssrnet_pdfa_2b.pdf (14.11.2018)
[6] Arriaga O, Plöger P G, Valdenegro M Real-time Convolutional Neural Networks for Emotion and Gender Classification URL: https://github.com/oarriaga/face_classification/blob/master/report.pdf (8.10.2018)
[7] Pakulich D, Alyamkin S, Yakimov S 2019 Age estimation using face recognition with convolutional neural networks Avtometriya 55(3) 52-61 (in Russian) DOI: 10.15372/AUT20190307
[8] TFLearn library URL: http://tflearn.org/ (04.10.2018)
[9] OpenCV library URL: http://opencv.org (04.10.2018)
[10] IMDB-wiki dataset URL: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/ (04.10.2018)

Acknowledgements
This work was partly funded by the Russian Foundation for Basic Research – Project # 17-29-03112 ofi_m and the Russian Federation Ministry of Science and Higher Education within a state contract with the "Crystallography and Photonics" Research Center of the RAS under agreement 007-ГЗ/Ч3363/26.