Explaining Emotional Attitude Through the Task of Image-captioning

Oleg Bisikalo 1, Volodymyr Kovenko 1, Ilona Bogach 1 and Olha Chorna 2

1 Vinnytsia National Technical University, Khmelnytsky highway 95, Vinnytsya, 21021, Ukraine
2 Kremenchuk Mykhailo Ostrohradskyi National University, Pershotravneva Street, 20, Kremenchuk, 39600, Ukraine

Abstract
Deep learning algorithms trained on huge datasets containing visual and textual information have been shown to learn features useful for other downstream tasks. This implies that such models understand the data at different levels of the hierarchy. In this paper we study the ability of SOTA (state-of-the-art) models for both texts and images to understand the emotional attitude caused by a situation. For this purpose we gathered a small-size dataset based on the IMDB-WIKI one and annotated it specifically for the task. In order to investigate the ability of pretrained models to understand the data, a KNN clustering procedure over representations of texts and images is applied in parallel. It is shown that although the models used are not capable of understanding the task at hand, a transfer learning procedure based on them helps to improve results on tasks such as image-captioning and sentiment analysis. We then frame our problem as the task of image captioning and experiment with different architectures and approaches to training. Finally, we show that adding additional biometric features, such as emotion and gender probabilities, improves the results and leads to a better understanding of emotional attitude.

Keywords
Deep learning algorithms; Emotional attitude; SOTA models; Image-captioning; NLP; Transfer-learning

1. Introduction
Recent developments in hardware and access to big datasets have allowed researchers to train sophisticated deep-learning-based algorithms, which surpassed many other approaches.
The deep learning revolution affected many fields, with the most notable results obtained in NLP (natural language processing) [1] and CV (computer vision) [2]. It was shown that SOTA models trained on big datasets (ImageNet [3], Google News) tend to learn useful features that can be reused for other downstream tasks [4]. Building on that idea, we study how well such models understand the emotional attitude and its cause implicitly or explicitly introduced by visual and textual data. Understanding the emotional attitude and explaining it is a hard task even for a human, as the solution requires an exact understanding of cause and consequence, which are affected by the environment and biometric features. For the purposes of the experiments, a new small-size dataset of image-text pairs called "EmoAtCap" was collected. The overall contribution of our work is summarized below:
1. A small-size dataset, "EmoAtCap", based on the IMDB-WIKI one, that can be used for image-captioning and sentiment analysis. It is publicly available [5] to facilitate future research in this domain.
2. A set of experiments on the tasks of image-captioning and sentiment analysis, based on features extracted from the highlighted models. It is also shown that adding biometric features such as gender and emotion distributions improves the performance of image-captioning models.

COLINS-2022: 6th International Conference on Computational Linguistics and Intelligent Systems, May 12–13, 2022, Gliwice, Poland.
EMAIL: obisikalo@gmail.com (O. Bisikalo); urumipainblackreaper@gmail.com (V. Kovenko); ilona.bogach@gmail.com (I. Bogach); diolgan@gmail.com (O. Chorna)
ORCID: 0000-0002-7607-1943 (O. Bisikalo); 0000-0003-3825-1115 (V. Kovenko); 0000-0001-9398-8529 (I. Bogach); 0000-0002-7113-1572 (O. Chorna)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
The training procedure was conducted using tensorflow [6] and pytorch [7].

2. Data collection
The data needed to include both images and their captions. As the main intent was to capture the emotional attitude, the images had to contain people along with explicit or implicit information about the cause of their emotional state. The captions had to contain an exhaustive, unbiased description of the situation. Based on the highlighted requirements, the first idea was to build a dataset from a subset of existing image-captioning datasets. Image-captioning is the process of generating a textual description of an image. The task implies that the relevant dataset consists of image-text pairs. One of the most popular datasets for the discussed task is COCO [8], which consists of 330K images. We used only the subset related to image-captioning, namely the 2014 train split, which consisted of 29766 images with 5 captions per image. As it would be hard and cumbersome to filter the images manually, a YoloV3 [9] object-detection algorithm trained on the discussed dataset was used, and only images that contained objects of class "person" were kept. As a result, the COCO dataset was shrunk to 3731 images. However, the filtered images and captions only contained the actual plot of the image without any emotional attitude. The other analyzed dataset was the VizWiz [10] one. VizWiz is the first goal-oriented VQA (visual question answering) dataset arising from a natural VQA setting, consisting of over 31,000 visual questions originating from blind people. The needed data subset was found by filtering the captions using people-related words. As the resulting data was of poor quality, this variant was discarded. The last image-text dataset we experimented with was the SentiCap [11] one. SentiCap consists of 2360 images containing sentiments.
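The caption filtering by people-related words, applied to both the VizWiz and SentiCap candidates, can be sketched as follows. Note that the word list and the sample captions below are illustrative assumptions, not the exact ones used in the paper:

```python
# Sketch of filtering captions by people-related words, as done for the
# VizWiz and SentiCap candidate datasets. PEOPLE_WORDS is a hypothetical
# word list; a real one would be more exhaustive.
PEOPLE_WORDS = {"man", "woman", "person", "people", "boy", "girl", "child"}

def mentions_people(caption):
    """Return True if any token of the caption is a people-related word."""
    tokens = caption.lower().replace(".", "").split()
    return any(tok in PEOPLE_WORDS for tok in tokens)

captions = [
    "A man is riding a bicycle.",
    "A red car parked near a tree.",
    "Two people are talking at a table.",
]
kept = [c for c in captions if mentions_people(c)]
print(len(kept))  # 2 of the 3 sample captions mention people
```

In practice such keyword filtering is noisy (it misses synonyms and pronouns), which is consistent with the poor-quality subsets reported above.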
After filtering the dataset in the same way as for the VizWiz one, we were left with only 830 samples, which was not enough for our task. The remaining option was to gather a dataset from scratch and annotate it. The images were taken from the IMDB-WIKI [12] dataset for age and gender detection. Each image was annotated with a description of the emotional attitude of the person or people in it. As a result, we obtained a dataset of 3840 image-text pairs, where each image was resized to 224x224 pixels (Fig. 1).

Figure 1 (a-f): Dataset examples with corresponding captions

In order to categorize the dataset, sentiments related to the captions were added using Vader [13], a rule-based model for sentiment analysis. The sentiments were then checked by humans one more time to make them more meaningful. As a result of the analysis, the data turned out to be imbalanced in terms of the new category (Fig. 2).

Figure 2: Distribution of caption sentiments in the dataset

The new sentiment category was used for the analysis of clustering and for solving the task of sentiment analysis given the captions.

3. Pretrained models overview
In order to analyze the ability of pretrained models to understand such difficult information as emotional attitude, recent SOTA models trained on big datasets of textual and visual information were chosen.

3.1. ResNet
ResNet, introduced by Kaiming He et al., is a deep convolutional architecture that surpassed previous results on the ImageNet benchmark and proved successful for object detection, obtaining a 28% relative improvement on the COCO object detection dataset. The main advantage of this architecture is the addition of residual connections, which help to fight the problem of vanishing gradients typical for deep neural networks.
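The residual connection can be illustrated with a minimal sketch: the block's output is added back to its input, so gradients can always flow through the identity path. The scaling "layer" below is a hypothetical stand-in for a real convolutional transform:

```python
# Minimal sketch of a residual connection: y = x + F(x).
# `transform` stands in for the block's learned layers (in ResNet,
# convolutions with batch normalization and ReLU).
def residual_block(x, transform):
    return [xi + ti for xi, ti in zip(x, transform(x))]

# Illustrative "layer": scales each activation by 0.1, so the output
# stays close to the input and the identity path dominates.
out = residual_block([1.0, 2.0, 3.0], lambda v: [0.1 * vi for vi in v])
print(out)  # each element grows by 10%
```

Even if `transform` learned nothing (output near zero), the block would still pass its input through unchanged, which is why stacking many such blocks remains trainable.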
This advantage made it possible to train a very deep network, each layer of which learned different useful features. In our work, ResNet152V2 pretrained on the ImageNet dataset was used. We also experimented with ResNet50 trained on the FER [14] dataset.

3.2. EfficientNet
EfficientNet, introduced by Tan et al., is a deep convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient. It achieves a state-of-the-art 84.3% top-1 accuracy on ImageNet and transfers well to other tasks, reaching state-of-the-art accuracy on CIFAR-100 [15] (91.7%), Flowers [16] (98.8%), and 3 other transfer learning datasets. In our work, EfficientNet trained on the age-gender IMDB-WIKI dataset was used.

3.3. Word2Vec
Word2Vec, introduced by Mikolov et al. [17], is a neural-network-based approach to learning word embeddings. The approach offers two methods of learning: CBOW and skip-gram. With CBOW, the model is asked to predict the current word given its context, whereas skip-gram tries to predict words within a certain range before and after the current word. As a result of such training, the model learns meaningful word vectors that are often used for transfer learning. Word2Vec embeddings pretrained on Google News, with a vector dimensionality of 300, were used in this paper. The exact setup of the experiments, the description of the layers from which the data representations were derived, and the experimental results are discussed further in the paper.

4. Experiments
4.1. Image-captioning
Image understanding is the process of interpreting regions/objects to figure out what is happening in the image. This may include figuring out what the objects are, their spatial relationship to each other, etc. [18]. This statement implies that one of the definitions of scene understanding is the capability of describing its context.
Thus, we theorize that a model which can describe the emotional attitude based on an image is capable of understanding it. The task of describing an image is known as image-captioning, and it gained huge popularity with the development of deep neural networks [19]. Though there are many different approaches to the task [20], we exploit only the encoder-decoder architecture, where the encoder's goal is to encode the representation of the image into a feature vector and the decoder's is to generate captions based on this information. The theoretical foundations of constructing text messages/captions by modeling combinations of significant words are considered in [21]. The role of the encoder is often played by a convolutional neural network, and the role of the decoder by a recurrent one. In our work, we investigate different encoder-decoder architectures for solving the task of image-captioning. As stated by Kovenko et al. [22], by solving the problem of data reconstruction, autoencoders tend to learn low-level features that are useful for transfer learning. Based on this idea, we train a deep convolutional autoencoder on our dataset and use the latent code produced by the encoder part to encode images for the image-captioning task. The experiments also include the output of the 4th block of ResNet, as well as the logits of ResNet, as encoders. In order to compare these transfer learning approaches, we also experiment with a custom, non-pretrained convolutional encoder. The decoder part consists of an embedding layer and an LSTM (Long short-term memory) [23] network. LSTM is capable of learning long-term dependencies, which is especially useful when working with sequential data. For all the experiments, Word2Vec was used as the embedding layer, and layer normalization [24] was applied after the LSTM. As stated by Xu et al. [25], the attention mechanism applied to image-captioning tasks can greatly improve results.
Nezami et al. [26] showed that the use of additional emotion features helps to improve results on image-captioning datasets that include emotional aspects. Based on these ideas, we experimented with attention and with conditioning the LSTM on additional features. Differently from Nezami's approach, gender features were also used, and the emotional ones were encoded as a probability distribution. Specifically, YoloV3 is used to extract face regions from the images, and EfficientNet trained on the Age-Gender dataset along with ResNet trained on the FER one are used to predict gender and emotions. Gender features are produced using the predicted probabilities for each face present in the image (formula 1).

$$S_g = \frac{s_g}{\sum_{g'} s_{g'}}, \quad s_g = \sum_{i=1}^{N} \mathbb{1}[P_i = g], \quad P_i = \operatorname{argmax}(pred_i) \quad (1)$$

where G is the number of unique genders, g is a gender, S ∈ R^G is the normalized vector of gender probabilities, N is the number of faces present in the image, 1[P_i = g] is the indicator of P_i being equal to the specific gender g, and P_i is the result of an argmax operation over the prediction probability vector for the specific face i.

Emotion features are produced as the normalized probability distribution of the sum of the probability vectors over all faces present in the image (formula 2).

$$E = \sum_{i=1}^{N} pred_i, \quad \bar{E} = \frac{E}{\sum_{j=1}^{M} E_j} \quad (2)$$

where Ē ∈ R^M is the normalized vector of emotion probabilities, N is the number of faces present in the image, pred_i is the prediction probability vector for the specific face i, and M is the number of unique emotions.

The data was split in the same way as for sentiment analysis. The approaches were validated based on test set performance using the beam search technique with a beam size of 5. BLEU score along with perplexity were used as the main metrics. For all the experiments, the RMSprop optimizer was used with an initial learning rate of 0.0001. In order to avoid overfitting, a learning rate reduction technique was used: if there was no improvement in validation perplexity for two epochs, the learning rate was reduced by a factor of 10.
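The feature aggregation in formulas 1 and 2 can be sketched in plain code. The per-face probability vectors below are hypothetical; in the pipeline they come from the gender and emotion classifiers:

```python
# Sketch of formulas 1 and 2. Each inner list is the probability vector
# predicted for one detected face; the values are illustrative.

def gender_features(preds, n_genders=2):
    # Formula 1: count argmax votes per gender, then normalize the counts.
    counts = [0] * n_genders
    for p in preds:
        counts[p.index(max(p))] += 1
    total = sum(counts)
    return [c / total for c in counts]

def emotion_features(preds):
    # Formula 2: sum the probability vectors over faces, then normalize.
    summed = [sum(col) for col in zip(*preds)]
    total = sum(summed)
    return [v / total for v in summed]

# Two genders, three faces: argmax picks gender 0 twice and gender 1 once.
S = gender_features([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
# Three emotions, two faces.
E = emotion_features([[0.5, 0.3, 0.2], [0.1, 0.7, 0.2]])
print(S, E)  # both vectors sum to 1
```

Both outputs are fixed-size distributions regardless of how many faces are detected, which is what makes them usable as conditioning inputs for the LSTM.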
All the models were trained with a batch size of 64 for 30 epochs (Fig. 3).

Figure 3. Comparison of image-captioning models and approaches. For train and validation perplexity, the values are shown for the last epoch of training

Analyzing the results, it is evident that the transfer-learning procedure gives better results than training from scratch (ordinary) w.r.t. BLEU on the test set. It is also clear that the ResNet representation tends to give better results than the autoencoder's one, possibly because of the deeper architecture and better learned features. Attention did not work well for any of the approaches, probably because of the low number of samples in the dataset and the small number of epochs. The approach that utilized the logits output of ResNet for the encoder part of the network, along with Word2Vec embeddings and additional emotion features (resnet_logits_w2v_emotions), gave the best results on the test data w.r.t. the averaged BLEU score. The other approach worth paying attention to is the one that incorporates both emotion and gender features. Although resnet_logits_w2v_emotions_gender did not achieve the best performance on test BLEU, it reached the best balanced performance across all the data splits, and thus was chosen as the best one. The architecture of the overall prediction pipeline is shown in Fig. 4.

Figure 4. Architecture of the pipeline of the resnet_logits_w2v_emotions_gender approach

As can be seen from Fig. 4, the overall pipeline depends on the face pre-processing step along with the detection of emotions and gender. Obviously, if the performance of the highlighted steps is poor, the final output will be at least biased. An example of such bias is presented in Fig. 5.

Figure 5. Example of bias of additional features w.r.t. the image-captioning process. S - vector of gender features, E - vector of emotion features, T - true caption, greedy - result of greedy decoding, beam - result of beam search decoding.
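The beam search decoding used at validation time (beam size 5 in our experiments) can be illustrated with a toy sketch. The bigram scoring table is a hypothetical stand-in for the model: a real decoder would condition on the image features and the token prefix:

```python
import math

# Toy beam search. `next_scores` is a stub scorer returning next-token
# probabilities; in the real pipeline these come from the LSTM decoder.
def next_scores(prefix):
    table = {
        "<s>": {"a": 0.6, "the": 0.4},
        "a": {"man": 0.7, "woman": 0.3},
        "the": {"man": 0.2, "woman": 0.8},
    }
    return table[prefix[-1]]

def beam_search(steps=2, beam=2):
    beams = [(["<s>"], 0.0)]  # (token sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, lp in beams:
            # Expand every beam with every possible next token.
            for tok, p in next_scores(tokens).items():
                candidates.append((tokens + [tok], lp + math.log(p)))
        # Keep only the `beam` highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return [" ".join(tokens[1:]) for tokens, _ in beams]

print(beam_search())  # ['a man', 'the woman']
```

Unlike greedy decoding, which commits to the single best token at each step, beam search keeps several partial hypotheses alive, which is why the greedy and beam captions in Fig. 5 and Fig. 6 can differ.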
Changing the additional features changes the captions generated with the greedy decoding strategy. During error analysis, it was found that the model slightly overfits on the most frequent words and phrases (like "man is flirting with a woman" in Fig. 5), a problem caused by the small diversity of the dataset. Despite the noisiness of the collected data (each image was annotated by a different expert, which is not ideal for the task of image-captioning), the model manages to give adequate results on average (Fig. 6).

Figure 6 (a-e). Examples of generated captions. T - true caption, greedy - result of greedy decoding, beam - result of beam search decoding. Captions which are fully inappropriate are marked with blue.

It is important to note that longer training would probably give better results.

5. Conclusion and further work
In this paper we analyzed the ability of deep learning models to understand the emotional attitude driven by a situation. For this purpose, a new dataset with image-text pairs was presented. As a result of the analysis of pretrained SOTA models, it was concluded that some of them can be used in a transfer-learning process. Through the experiments it was shown that the dataset can be used to solve the problem of sentiment analysis. It was then theorized that the problem of understanding the emotional attitude can be transferred to the task of image-captioning. Empirical results have shown that the addition of emotion and gender features, along with transfer learning based on the ResNet network and Word2Vec embeddings, improves the overall captioning performance.
Our approach gives satisfactory results on average, confirming that deep learning models are able to understand emotional attitude if they are trained to. It is important to note that such an approach has downsides, as it depends on the performance of three additional models for face, emotion and gender detection. The other problem we faced is the noisy nature of the dataset and the small variation of phrases in it. In future work we plan to gather a bigger dataset, label each image with 5 captions, and fix the current problems.

6. Acknowledgements
We would like to thank Oleksii Abdullaiev, Dmytro Tarasovskyi and Dmytro Maliovanyi for their contribution to the creation of the dataset.

7. References
[1] Elizabeth D. Liddy. Natural Language Processing. In Encyclopaedia of Library and Information Science, 2nd Ed. NY. Marcel Dekker, Inc. https://surface.syr.edu/cgi/viewcontent.cgi?article=1043&context=istpub (accessed 12 December 2021).
[2] IBM. What is Computer Vision? https://www.ibm.com/topics/computer-vision (accessed 12 December 2021).
[3] Deng, Jia, et al., 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248-255.
[4] Yosinski, Jason, et al. How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792, 2014.
[5] Kovenko, Volodymyr; Abdullaiev, Oleksii; Maliovanyi, Dmytro; Tarasovskyi, Dmytro; Bogach, Ilona; Bisikalo, Oleh (2021), "EmoAtCap: Emotional attitude captioning dataset", Mendeley Data, V5, doi: 10.17632/dym6p2pvbt.
[6] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Zheng, X. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
[7] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... & Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library.
Advances in neural information processing systems, 32, 8026-8037. [8] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014, September). Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham. [9] Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767. [10] Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., ... & Bigham, J. P. (2018). Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3608-3617). [11] Mathews, A., Xie, L., & He, X. (2016, March). Senticap: Generating image descriptions with sentiments. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1). [12] Rothe, R., Timofte, R., & Van Gool, L. (2015). Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE international conference on computer vision workshops (pp. 10-15), doi: 10.1109/ICCVW.2015.41. [13] Hutto, C., & Gilbert, E. (2014, May). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 8, No. 1). [14] Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A., Mirza, M., Hamner, B., ... & Bengio, Y. Challenges in representation learning: A report on three machine learning contests. Neural Networks, 64:59--63, 2015. doi: 10.1016/j.neunet.2014.09.005. [15] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Tech Report. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed 12 December 2021). [16] Nilsback, M. E., & Zisserman, A. (2008, December). Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing (pp. 722-729). IEEE. 
[17] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[18] Bryan S. Morse. Image Understanding. http://www.sci.utah.edu/~gerig/CS6640-F2012/Materials/BMorse-BYU-iu-active-contours.pdf (accessed 12 December 2021).
[19] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3156-3164).
[20] Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6), 1-36.
[21] Bisikalo, O., Bogach, I. & Sholota, V. (2020). The Method of Modelling the Mechanism of Random Access Memory of System for Natural Language Processing. In 2020 IEEE 15th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET) (pp. 472-477). doi: 10.1109/TCSET49122.2020.235477.
[22] Kovenko, V., & Bogach, I. (2020). A Comprehensive Study of Autoencoders' Applications Related to Images. In IT&I Workshops (pp. 43-54).
[23] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
[24] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[25] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., ... & Bengio, Y. (2015, June). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (pp. 2048-2057). PMLR.
[26] Nezami, O. M., Dras, M., Anderson, P., & Hamey, L. (2018, September). Face-cap: Image captioning using facial expression analysis. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 226-240). Springer, Cham.