=Paper=
{{Paper
|id=Vol-2786/Paper55
|storemode=property
|title=Recognition of Facial Expression using Landmark Detection in Deep Learning Model
|pdfUrl=https://ceur-ws.org/Vol-2786/Paper55.pdf
|volume=Vol-2786
|authors=Palak Girdhar,Vishu Madaan,Tanuj Ahuja,Shubham Rawat
|dblpUrl=https://dblp.org/rec/conf/isic2/GirdharMAR21
}}
==Recognition of Facial Expression using Landmark Detection in Deep Learning Model==
Palak Girdhar 1, Vishu Madaan 2, Tanuj Ahuja 1 and Shubham Rawat 1
1 Department of Computer Science and Engineering, Bhagwan Parshuram Institute of Technology, India
2 School of Computer Science Engineering, Lovely Professional University, Punjab, India
Abstract
With the advent of deep learning algorithms, Convolutional Neural Networks (CNNs) in particular are used to extract the important features from the face. Most of the discriminative features come from the mouth, nose and eye regions, whereas other regions such as the forehead, hair and ears play a very small role in the analysis. In this paper, we present a deep learning-based approach for facial expression recognition that combines landmark detection with a CNN, giving the model the ability to focus on the sensitive areas of the face and to ignore the less informative ones. The proposed work uses a CNN with landmark detection, trained with a learning rate of 0.001 for 50 epochs. The methodology is tested and validated on the JAFFE dataset using 10-fold cross-validation, with 141 images used for training. The empirical results show that the accuracy of facial expression recognition increases to 87.5%, compared with 78.1% for a classical CNN with the Adam optimizer. The methodology can further serve as a base for emotion and behavior analysis using soft computing techniques.
Keywords
Human Computer Interaction, Facial Expression Recognition, Landmark Detection, Convolutional Neural Network, Deep Learning
1. Introduction
Expressions on our face play a vital role in daily human-to-human communication. Automatic detection of these facial expressions has long been studied due to its potential applications in various domains such as service robots, driver drowsiness monitoring, and intelligent teaching systems. It is also gaining popularity in the field of Human Computer Interaction (HCI), which refers to the interaction between humans and computer technology and has impacted almost every area of our daily lives. It is gaining strength in areas such as visual interactive gaming, data-driven animation, robotics, surveillance systems and many more. Verbal communication (speech and textual data) and non-verbal communication (facial expressions, gestures, eye movement, body movement) are the two categories through which human emotions can be expressed. Emotions are the response of the human nervous system to external situations: the brain first sends the instructions for the corresponding feedback, which may be reflected in facial expressions, pitch of the voice, body movement and gestures, and which also influences organs such as the heart and brain (e.g., heart rate).
Facial expressions can be studied for several reasons: they hold numerous useful features for expression recognition, they are visible, and their datasets are readily available compared to other expression recognition modalities. Expressions of the face can be grouped into six principal classes: anger, surprise, sadness, disgust, fear and happiness.
Emotions are a critical part of our communication with other parties; they give clues about our current state of mind even without saying anything. Facial expression recognition (FER) has become an active area of research due to its applications in medicine, e-learning, monitoring, entertainment, marketing, human-computer interaction, etc. Therefore, there is a need to develop a mechanism to detect emotions. Traditionally, handcrafted features were used along with machine learning algorithms to address this problem. But with the recent success of deep learning, and especially convolutional neural networks (CNNs), in tasks such as object recognition, face recognition, and object detection, researchers have started to explore these methods in the area of expression recognition [1][2].
Despite achieving excellent results, robust facial expression recognition is still a tough task for these deep learning methods, because images in the wild vary a lot in pose, background, occlusion, etc. Moreover, deep learning requires a lot of data, while only relatively small datasets are available for emotion recognition tasks, making the training of deep networks difficult. Therefore, there is a need to develop a system that can accurately identify the emotional state of a person under these constraints. In this research, we aim to artificially increase the size of the dataset using data augmentation methods [4].
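As an illustration of this step, the following is a minimal sketch of such an augmentation pipeline, assuming a Keras-style setup; the specific transforms and their ranges are illustrative assumptions, not settings reported in this paper.

```python
# A minimal data augmentation sketch for grayscale face images.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Dummy stand-in for the training data: N grayscale 128x128 faces.
x_train = np.random.rand(32, 128, 128, 1).astype("float32")
y_train = np.random.randint(0, 7, size=(32,))  # 7 expression classes

datagen = ImageDataGenerator(
    rotation_range=10,       # small rotations keep the face geometry plausible
    width_shift_range=0.1,   # horizontal jitter
    height_shift_range=0.1,  # vertical jitter
    zoom_range=0.1,          # mild zoom in/out
    horizontal_flip=True,    # a mirrored face keeps the same expression label
)

# Each call yields a freshly transformed batch, artificially enlarging the set.
augmented_batches = datagen.flow(x_train, y_train, batch_size=16)
x_batch, y_batch = next(augmented_batches)
```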
Furthermore, motivated by the fact that human observers pay close attention to the regions where expressions are most prevalent, we decided to focus only on the essential parts of the image and to ignore irrelevant parts such as background details, as they contribute little or no information [3]. In the proposed method, landmark detection is used to discard irrelevant features of the image using the tool OpenFace developed by Zadeh et al. [9].
Outline: The paper is organized as follows: Section 2 discusses the state-of-the-art methods from the literature. Section 3 presents the materials and methods used in developing the proposed methodology. Sections 4 and 5 present the experimental results and the conclusion with future scope, respectively.

2. Related Work
Ekman et al. [6] performed one of the earliest works in the domain of emotion recognition. They identified the six basic emotions, namely anger, happiness, surprise, fear, sadness and disgust. This work set the foundation for all future work in the field of emotion recognition. Earlier works on facial expressions mainly involved a two-step process. In the first step, features are extracted from the faces manually by a human or automatically by computer software; in the second step, popular classification algorithms such as support vector machines or k-nearest neighbours are used to classify the emotions. A few of the traditional, well-known approaches to extracting features from images are Gabor wavelets [10], histograms of oriented gradients [11], and Haar features [12]. These methods worked well on limited-size datasets but did not generalize well to larger datasets or to datasets with more variation. In recent times, deep learning-based approaches, specifically CNNs, have gained a lot of popularity because of their ability to extract features automatically from images. Deep learning has been found to perform very well for object recognition and other vision-related problems; as a result, several researchers have proposed deep learning-driven facial expression recognition (FER) models. Recent works have concentrated on building compound networks and training them on the input images; the mixture of multiple structures makes such models extremely powerful. Mayya et al. [7] proposed a deep convolutional neural network (DCNN) based approach to identify facial expressions, using a well-known DCNN architecture pre-trained on ImageNet to extract the facial features. The last layer of the network gave them a high-dimensional feature vector, which they fed into a support vector machine (SVM) classifier to recognize the emotion on the faces. They obtained accuracies of 96.02% and 98.12% for 7 classes of emotions on two separate databases, namely CK+ and JAFFE, respectively. Despite achieving competitive results, their approach has three major downsides: firstly, it is difficult to understand; secondly, it is not an end-to-end approach; and lastly, it takes a lot of time to train. Zhang et al. [8] proposed a carefully architected CNN that, during training, maximizes between-emotion differences while minimizing within-emotion variations. They used a two-way soft-max activation function, which requires a high level of expertise and skill to tune. However, their model only performs smile detection, and the size of their dataset is far larger than any FER dataset, with close to 4000 images for a single expression.
Nwosu et al. [13] proposed a system based on a DCNN using facial parts, combining separate algorithms for feature extraction, face detection and classification. It used a two-channel CNN: the input to the first channel was the extracted eyes, and the input to the second channel was the extracted mouth. Although most of the previous works obtained notable improvements over orthodox approaches to facial emotion recognition, they have not focused on the principal regions of the face. In this work, we aim to tackle this problem and focus on the salient face regions.

3. Proposed Methodology
In this section, we discuss the materials and methods used in developing the proposed methodology: the dataset used (JAFFE) and the use of a CNN architecture with landmark detection.

3.1 Dataset Used
We used the JAFFE dataset, a widely used dataset for FER, to train our model. The JAFFE dataset contains 7 facial expressions posed by 10 Japanese female models, for a total of 213 images. The size of each image is 128 × 128. The expressions in the dataset are happiness, anger, disgust, fear, surprise, sadness, and neutral, with around 30 images each. Figure 1 shows a few images from the JAFFE dataset.
Figure 1: JAFFE dataset sample images
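For concreteness, a minimal sketch of loading this dataset is given below. It assumes the common JAFFE file-naming convention, in which the second token of a filename encodes the expression (e.g., KA.AN1.39.tiff for anger); the directory path and the helper name are hypothetical.

```python
# A minimal sketch of loading JAFFE images into arrays.
from pathlib import Path
import numpy as np
from PIL import Image

# Assumed mapping from filename codes to the 7 expression classes.
LABELS = {"AN": 0, "DI": 1, "FE": 2, "HA": 3, "NE": 4, "SA": 5, "SU": 6}

def load_jaffe(root="jaffe/"):
    images, labels = [], []
    for path in sorted(Path(root).glob("*.tiff")):
        code = path.name.split(".")[1][:2]   # e.g. "AN1" -> "AN"
        if code not in LABELS:
            continue
        img = Image.open(path).convert("L").resize((128, 128))
        images.append(np.asarray(img, dtype="float32") / 255.0)
        labels.append(LABELS[code])
    # Shapes: (N, 128, 128, 1) image tensor and (N,) integer label vector.
    return np.stack(images)[..., None], np.array(labels)
```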
3.2 Convolutional Neural Networks (CNN)
We implemented a CNN to build the proposed model. CNNs are known to emulate the way the human brain processes visual input. In this research, we used a CNN to extract facial features and detect the emotions in the dataset. Figure 2 shows the network architecture used for FER. We used three convolution layers, three max pooling layers, one flatten layer and one output layer in our proposed network. The first layer is a convolution layer with 6 kernels, and the third layer is also a convolution layer, with 16 kernels; both use kernels of size 5 × 5. The second and fourth layers are max pooling layers with a pool size of 2 × 2. We used the rectified linear unit (ReLU) as the activation function and flattened the final max pooling layer to obtain a 14400-dimensional vector. This vector then acts as the input to the output layer, which has a soft-max activation function and a dropout of 0.5. The CNN uses a Glorot uniform initializer, the Adam optimizer and a cross-entropy loss function.
Figure 2: CNN Network Architecture
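The following is a minimal Keras-style sketch of this architecture, under stated assumptions: it wires up only the layers whose parameters the text specifies (a third convolution/pooling stage is mentioned but not parameterized, so it is omitted here), and the flattened dimension therefore depends on the input resolution rather than being exactly 14400.

```python
# A minimal sketch of the described CNN; hyper-parameters follow the text.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 128, 1), num_classes=7):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(6, (5, 5), activation="relu",
                      kernel_initializer="glorot_uniform"),   # layer 1: 6 kernels, 5x5
        layers.MaxPooling2D((2, 2)),                          # layer 2: 2x2 pooling
        layers.Conv2D(16, (5, 5), activation="relu",
                      kernel_initializer="glorot_uniform"),   # layer 3: 16 kernels, 5x5
        layers.MaxPooling2D((2, 2)),                          # layer 4: 2x2 pooling
        layers.Flatten(),              # size depends on the input resolution
        layers.Dropout(0.5),           # dropout before the soft-max output
        layers.Dense(num_classes, activation="softmax",
                     kernel_initializer="glorot_uniform"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```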
3.3 Landmark Detection on the JAFFE Dataset
Motivated by the fact that human observers pay close attention to the regions where expressions are most prevalent, we decided to focus only on the essential parts of the image and to ignore irrelevant parts such as background details, as they contribute little or no information. We used image cropping to retain the important features of the images and to discard irrelevant features such as hair, ears and neck, which contribute little or no information. Zadeh et al. [9] performed landmark detection using modified
Constrained Local Models (CLMs). They developed
an end-to-end framework that combines the benefits
of mixtures of experts and neural architectures. Their
proposed algorithm outperformed state-of-the-art
models by a large margin.
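As a rough illustration of this landmark-based cropping step, the following sketch uses dlib's 68-point landmark predictor as a stand-in for the OpenFace CE-CLM landmarks actually used in this work; the predictor model file path is an assumption.

```python
# A sketch of cropping a face image to the bounding box of its landmarks.
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
# Assumed path to dlib's pre-trained 68-point landmark model.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_to_landmarks(gray, margin=5):
    """Crop a grayscale uint8 image to the bounding box of its landmarks."""
    faces = detector(gray)
    if not faces:
        return gray                         # fall back to the full image
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y)
                    for i in range(shape.num_parts)])
    x0, y0 = np.maximum(pts.min(axis=0) - margin, 0)
    x1, y1 = pts.max(axis=0) + margin
    # Hair, ears, neck and background fall outside the landmark region.
    return gray[y0:y1, x0:x1]
```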
Figure 3: Proposed Approach
The designers of the tool have made its resources available for external use through the open-source software OpenFace. Machine learning researchers, groups keen on building interactive applications involving facial behaviour analysis, and organizations researching affective computing are the tool's main consumers. The tool offers the following capabilities: head pose estimation, eye-gaze estimation, landmark detection and facial action unit recognition. It is an open-source project, with source code available for both training the networks and the models. The advantages of this tool are two-fold: firstly, it offers real-time performance; secondly, it does not require any special hardware.
Figure 4: Original images (top three rows) and cropped images (bottom three rows) from the JAFFE dataset
Figure 4 shows nine original images from the JAFFE dataset (first three rows) and the set of images obtained after cropping (last three rows). It is visible that the cropped images do not contain irrelevant features like hair, neck, ears and background.

4. Results and Analysis
We tested the proposed model's performance on the JAFFE dataset. The proposed model is trained on a subset of the available dataset and validated on a validation set; a separate test set is used to measure accuracy. The architecture and hyper-parameters are kept the same for all experiments in the training procedure. For comparison purposes, each model is trained for 50 epochs. The models are trained on a CPU. We initialized the network weights with a Glorot uniform initializer and, for optimization, used the Adam optimizer with a learning rate of 0.001. As the JAFFE dataset contains a limited number of images, training the model took very little time. For training and testing, 10-fold cross-validation has been performed, taking care that every split has a balanced distribution of classes.
For the experiment, 141 images are used for training, 29 images for validation, and 43 images for testing. The proposed model is found to work better, with an accuracy increase of almost 9 percentage points: tested on the JAFFE dataset and validated over 50 epochs, it achieves an accuracy of 87.5%.
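A minimal sketch of this evaluation protocol is given below; it reuses the hypothetical load_jaffe() and build_model() helpers from the earlier sketches and assumes scikit-learn's StratifiedKFold for the class-balanced splits.

```python
# A sketch of stratified 10-fold cross-validation with 50-epoch training.
import numpy as np
from sklearn.model_selection import StratifiedKFold

x, y = load_jaffe()       # in the proposed method, the cropped images are used
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

accuracies = []
for train_idx, test_idx in skf.split(x, y):
    model = build_model()
    model.fit(x[train_idx], y[train_idx],
              epochs=50, batch_size=32,
              validation_split=0.17,   # roughly mirrors the 141/29 train/val split
              verbose=0)
    _, acc = model.evaluate(x[test_idx], y[test_idx], verbose=0)
    accuracies.append(acc)

print("mean 10-fold accuracy:", np.mean(accuracies))
```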
Table 1: Accuracies obtained for different models and datasets.

Method                     Optimizer   Learning Rate   Dataset Used   Accuracy (%)
CNN                        Adam        0.001           JAFFE          78.1
CNN + Landmark Detection   Adam        0.001           JAFFE          87.5
Figure 5: Comparison of the two approaches
Landmark detection was successfully applied on the JAFFE dataset, and the results are tabulated in Table 1. Figure 5 shows the comparison of the two approaches, i.e., the plain CNN and the CNN with landmark detection.
Figure 6: Model loss and model accuracy for
experiment I (first two graphs) and
experiment II (second two graphs)
respectively.
The four graphs in Figure 6 show the model loss and model accuracy as the epoch count increases. The figure in the top left corner shows that the validation loss stops improving after roughly 10 epochs and the model starts overfitting, which is clearly visible as the training loss continues to decrease while the validation loss stays the same or increases further. However, the figure in the bottom left corner shows that the gap between the validation loss and the training loss is much smaller than in the top-left figure. Also, the loss decreases steeply in the first 10 epochs, showing that the network certainly benefits from the image cropping methodology followed in this research.

5. Conclusion and Future Scope
We proposed an efficient approach for facial expression recognition. The proposed approach for FER uses convolutional neural networks and facial landmark detection. The use of a CNN makes the model more accurate in the feature extraction and classification process [14-17].
In our proposed work, we applied a CNN with a landmark detection approach to extract the important image features and to remove irrelevant ones such as ears, hair, neck and background. The idea behind using landmark detection is to remove features of the face that contribute little or no cue to the analysis. By applying landmark detection, the original face images are cropped, and the cropped images are clear enough to read the expression on the human face. The approach is validated on the JAFFE dataset. The accuracy achieved with the plain CNN model is 78.1%, and with the proposed method the accuracy rises to 87.5%. Researchers could certainly use more complex models and achieve higher accuracy, but the training time would also increase correspondingly; in this study, we instead used a simple model that can be trained quickly and concentrated on making the dataset efficient in terms of its feature set.
This work can be extended to real-world applications such as driver drowsiness detection, pain assessment and lie detection.

References
[1] Khorrami, P., Paine, T. L. and Huang, T. S., "Do deep neural networks learn facial action units when doing expression recognition?", IEEE International Conference on Computer Vision Workshop (ICCVW), Santiago, pp. 19-27, 2015. doi: 10.1109/ICCVW.2015.12.
[2] Han, S., Meng, Z., Khan, A. S. and Tong, Y., "Incremental boosting convolutional neural network for facial action unit recognition", International Conference on Neural Information Processing Systems (NeurIPS 2016), pp. 109-117, 2016.
[3] Minaee, S. and Abdolrashidi, A., "Deep-Emotion: Facial expression recognition using attentional convolutional network", arXiv preprint arXiv:1902.01019, 2019.
[4] Li, K., Yi, J., Akram, M. W., Han, R. and Chen, J., "Facial expression recognition with convolutional neural networks via a new face cropping and rotation strategy", The Visual Computer, 36(2), pp. 391-404, 2020.
[5] Mehrabian, A., "Communication without words", Communication Theory, 2nd Edition, pp. 193-200, Taylor & Francis, 2008.
[6] Ekman, P. and Friesen, W. V., "Constants across cultures in the face and emotion", Journal of Personality and Social Psychology, 17(2), pp. 124-129, 1971.
[7] Mayya, V., Pai, R. M. and Pai, M. M. M., "Automatic facial expression recognition using DCNN", Procedia Computer Science, 93, pp. 453-461, 2016.
[8] Zhang, K., Huang, Y., Wu, H. and Wang, L., "Facial smile detection based on deep learning features", 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 534-538, IEEE, 2015.
[9] Zadeh, A., Lim, Y. C., Baltrusaitis, T. and Morency, L. P., "Convolutional experts constrained local model for 3D facial landmark detection", IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 2519-2528, 2017.
[10] Lee, T. S., "Image representation using 2D Gabor wavelets", IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10), pp. 959-971, 1996.
[11] Nigam, S., Singh, R. and Misra, A. K., "Efficient facial expression recognition using histogram of oriented gradients in wavelet domain", Multimedia Tools and Applications, 77(21), pp. 28725-28747, 2018.
[12] Wilson, P. I. and Fernandez, J., "Facial feature detection using Haar classifiers", Journal of Computing Sciences in Colleges, 21(4), pp. 127-133, 2006.
[13] Nwosu, L., Wang, H., Lu, J., Unwala, I., Yang, X. and Zhang, T., "Deep convolutional neural network for facial expression recognition using facial parts", International Conference on Pervasive Intelligence and Computing, pp. 1318-1321, IEEE, 2017.
[14] Girdhar, P., Virmani, D. and Kumar, S. S., "A hybrid fuzzy framework for face detection and recognition using behavioral traits", Journal of Statistics and Management Systems, 22(2), pp. 271-287, 2019.
[15] Agrawal, P., Madaan, V., Kundu, N., Sethi, D. and Singh, S. K., "X-HuBIS: A fuzzy rule based human behavior identification system based on body gestures", Indian Journal of Science and Technology, 9(44), pp. 1-6, 2016.
[16] Kaur, G. and Agrawal, P., "Optimization of image fusion using feature matching based on SIFT and RANSAC", Indian Journal of Science and Technology, 9(47), pp. 1-7, 2016.
[17] Agrawal, P., Chaudhary, D., Madaan, V., Zabrovskiy, A., Prodan, R., Kimovski, D. and Timmerer, C., "Automated bank cheque verification using image processing and deep learning methods", Multimedia Tools and Applications, pp. 1-32, 2020. https://doi.org/10.1007/s11042-020-09818-1