Visual Sentiment Analysis Multiplying Deep Learning and Vision Transformers

Tetsuya Asakawa, Riku Tsuneda, Masaki Aono
Toyohashi University of Technology, Japan
asakawa.tetsuya.um@tut.jp, tsuneda.riku.am@tut.jp, masaki.aono.ss@tut.jp

ABSTRACT

Visual sentiment analysis investigates sentiment estimation from images and has been an interesting and challenging research problem. Most studies have focused on estimating a few specific sentiments and their intensities using several complex CNN models. In this paper, we propose a method that combines CNN and Vision Transformer features for the MediaEval 2021 Visual Sentiment Analysis: A Natural Disaster Use-case task. Specifically, we first introduce the proposed model used in Subtask 1. We then introduce a median-based multi-label prediction algorithm used in Subtasks 2 and 3, in which we assume that each emotion has a probability distribution: after training the proposed model, we predict that an emotion is evoked by an unknown image if the intensity of that emotion is larger than the median of the corresponding emotion. Experimental results demonstrate that our model outperforms several baseline models in terms of Weighted F1-Score.

1 INTRODUCTION

With the spread of social networking services and the Internet, a vast number of images are widely available. As a result, there is an urgent need for image indexing and retrieval techniques. When viewing an image, we can feel several emotions simultaneously, and different visual images have different emotional triggers. For instance, an image with a snake or a spider is likely to trigger a negative feeling such as "disgust" or "fear," whereas an image with a flower is likely to trigger a positive feeling such as "amusement" or "excitement."

Visual sentiment prediction investigates sentiment estimation from images and has been an interesting and challenging research problem. The purpose of this paper is to accurately estimate sentiments, as single-label and multi-label multi-class problems, from images that evoke multiple different emotions simultaneously [1].

We introduce a new combined neural network model that accepts inputs from both ViT features and pre-trained CNN features. In addition, because existing deep learning models showed weak classification performance, we add two new fully connected layers. The contributions of this paper are as follows: (1) we propose a novel feature that considers both ViT and CNN features to predict the sentiment of images, unlike most recent research, which adopts only CNN features; and (2) we propose a combined feature method that combines the output of each feature, unlike previous work, which focuses on combining feature vectors.

2 APPROACH

We propose a single-label (Subtask 1) and multi-label (Subtasks 2 and 3) visual sentiment analysis system that predicts multiple emotions. Below we describe the deep neural network model that produces single-label and multi-label outputs for images that evoke emotions.

Figure 1: Calculated spatial distribution of the in-plane dynamic magnetization.

2.1 Subtask 1

This is a multi-class, single-label classification task in which the images are arranged in three different classes, namely positive, negative, and neutral. There is a strong imbalance towards the negative class, given the nature of the topic.

To solve this multi-class, single-label classification problem, we propose a new combined neural network model that accepts inputs from both end-to-end features (Vision Transformers: ViT and CNN). As illustrated in Figure 1, we adopt ViT-L/16 to extract the ViT features, while the CNN features are extracted from a pre-trained CNN, EfficientNetB0 [2].

- Vision Transformers (ViT)

The Vision Transformer is a model for image classification that employs a Transformer-like architecture over patches of the image. This includes the use of Multi-Head Attention, Scaled Dot-Product Attention, and other architectural features seen in the Transformer architecture traditionally used for NLP [3].

- CNN

In addition to the ViT features described above, our system incorporates CNN features, which are extracted from a deep convolutional neural network pre-trained as EfficientNetB0. Because of the lack of large datasets for visual sentiment analysis, we adopt transfer learning to prevent overfitting. We also decrease the dimensionality of the fully connected layers used in the CNN model. Specifically, for EfficientNetB0 we extract a 1280-dimensional vector from the 'Global Average Pooling 2D' layer (the layer preceding the final fully connected layer) and reduce it to 512 dimensions by applying a fully connected layer.
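To make the two branches concrete, the sketch below shows one plausible reading of the combined model. The paper does not name a deep learning framework, so PyTorch with torchvision (0.13 or later) is assumed here; the concatenation of the two 512-dimensional vectors into a 1024-dimensional combined feature and the sizes of the two added fully connected layers are likewise our assumptions (chosen to match the "Dim = 1024" entry for the proposed model in Table 1).

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class CombinedSentimentNet(nn.Module):
        """Two-branch model: ViT-L/16 and EfficientNetB0 features, each reduced to 512-d."""

        def __init__(self, num_classes=3):  # Subtask 1: positive / negative / neutral
            super().__init__()
            # ViT-L/16 branch: removing the classification head leaves the 1024-d class token.
            self.vit = models.vit_l_16(weights="IMAGENET1K_V1")
            self.vit.heads = nn.Identity()
            self.vit_fc = nn.Linear(1024, 512)   # reduce the ViT feature to 512 dimensions
            # EfficientNetB0 branch: removing the classifier leaves the 1280-d
            # global-average-pooled feature described in the text.
            self.cnn = models.efficientnet_b0(weights="IMAGENET1K_V1")
            self.cnn.classifier = nn.Identity()
            self.cnn_fc = nn.Linear(1280, 512)   # reduce the CNN feature to 512 dimensions
            # Two new fully connected layers on the combined feature (sizes are assumptions).
            self.head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(),
                                      nn.Linear(256, num_classes))

        def forward(self, x):                    # x: (N, 3, 224, 224) images
            vit_feat = self.vit_fc(self.vit(x))  # (N, 512)
            cnn_feat = self.cnn_fc(self.cnn(x))  # (N, 512)
            combined = torch.cat([vit_feat, cnn_feat], dim=1)  # (N, 1024) combined feature
            return self.head(combined)           # class logits for Subtask 1

For Subtasks 2 and 3, the same two branches can be reused with a K-dimensional output layer and per-emotion sigmoid activations instead of the 3-class softmax.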
2.2 Subtask 2 and 3

Subtasks 2 and 3 are a multi-class, multi-label image classification task in which participants are provided with multi-labeled images.

To solve this multi-class, multi-label classification problem, we again propose a combined neural network model that accepts inputs from both end-to-end features (ViT and CNN). We adopt ViT-L/16 to extract the ViT features, while the CNN features are extracted from a pre-trained EfficientNetB0.

To deal with the combined features, we propose a deep neural network architecture that allows multiple inputs and a multi-hot vector output. The combined feature is represented by the following formula:

    combined feature = average(ω1(ViT) + ω2(CNN))    (1)

Based on this formula, after the training process, our neural network system predicts the visual sentiment of unknown images as a multi-label, multi-class classification problem.

- Multi-label prediction

To obtain a multi-hot vector, we employ a method based on our previous research [4], illustrated in Algorithm 1. The input is a collection of features extracted from each image for K kinds of sentiments, and the output is a K-dimensional multi-hot vector.

Algorithm 1: Predicting a multi-hot vector for an image

In Algorithm 1, we assume that the extracted features (here, ViT and CNN) are represented by their probabilities. For each sentiment, we sum the features and average the result, which is denoted by T_ik in Algorithm 1. We use a fixed threshold for sentiment evocation and adopt 0.5 for each feature as the threshold of the corresponding emotion. After obtaining all the thresholds, the multi-hot vector of each image is generated such that S_ik = 1 if T_ik is equal to or greater than the threshold, and S_ik = 0 otherwise, where S_ik is the element for the k-th sentiment of the i-th image. In short, the vector S_i is the output multi-hot vector. We repeat this computation until all the test (unknown) images have been processed.
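The following is a minimal sketch of how Eq. (1) and the thresholding rule above can be realised, using NumPy only. Unit weights ω1 = ω2 = 1 and the reading of "average" as dividing the weighted sum by the number of branches are assumptions; the fixed 0.5 threshold follows the text, and the per-emotion median threshold of [4] is indicated as an alternative.

    import numpy as np

    def predict_multi_hot(vit_probs, cnn_probs, w_vit=1.0, w_cnn=1.0, thresholds=None):
        """vit_probs, cnn_probs: (num_images, K) per-sentiment probabilities from the two branches.
        Returns S, a (num_images, K) multi-hot matrix."""
        # Eq. (1): T = average(w1 * ViT + w2 * CNN), the weighted features averaged over the branches
        T = (w_vit * vit_probs + w_cnn * cnn_probs) / 2.0
        if thresholds is None:
            thresholds = np.full(T.shape[1], 0.5)   # fixed 0.5 threshold used in this paper
            # median-based variant of [4]: thresholds = np.median(T, axis=0)
        S = (T >= thresholds).astype(int)           # S_ik = 1 iff T_ik >= threshold of sentiment k
        return S

Each row S_i of the returned matrix is the multi-hot vector reported for the corresponding test image.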
3 EXPERIMENTAL RESULTS

Here we describe the experiments and evaluations. We divided the training dataset into training and validation data with an 8:2 ratio. We used the following hyper-parameters: a batch size of 256, SGD as the optimizer with a learning rate of 0.001 and a momentum of 0.9, and 200 epochs. For the evaluation of both single-label and multi-label classification, we employed the Weighted F1-Score.

Table 1 compares the models in terms of Weighted F1-Score. The table includes the baseline methods ViT and EfficientNet B0 as well as our proposed combined model; the "Dim" column gives the feature dimension. For the proposed combined model, we tested one variation, ViT + EfficientNet B0, which turns out to have the best score. We also observed that the proposed method correctly recognizes images whose emotions are falsely classified by the base CNN.

Table 1: Experimental results (Weighted F1-Score per subtask)

Model            Dim    Subtask 1   Subtask 2   Subtask 3
ViT-L/16         512    0.692       0.412       0.402
EfficientNet B0  512    0.715       0.534       0.392
Proposed model   1024   0.753       0.585       0.415

4 CONCLUSIONS

We proposed a model for the Visual Sentiment Analysis: A Natural Disaster Use-case task that accurately estimates single-label and multi-label multi-class problems from images evoking multiple different emotions simultaneously. Our proposed model is simple yet effective and outperforms the ViT and EfficientNet B0 baselines on all three subtasks.

ACKNOWLEDGMENTS

A part of this research was carried out with the support of the Grant for Toyohashi Heart Center Smart Hospital Joint Research Course and the Grant for Education and Research in Toyohashi University of Technology.

REFERENCES

[1] S. Z. Hassan, K. Ahmad, M. Riegler, S. Hicks, N. Conci, P. Halvorsen, and A. Al-Fuqaha. 2021. Visual Sentiment Analysis: A Natural Disaster Use-case Task at MediaEval 2021. In Proceedings of the MediaEval 2021 Workshop, Online.
[2] M. Tan and Q. V. Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In ICML 2019. https://arxiv.org/pdf/1905.11946.pdf
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, and N. Houlsby. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
[4] T. Asakawa and M. Aono. 2019. Median Based Multi-label Prediction by Inflating Emotions with Dyads for Visual Sentiment Analysis. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2008-2014. IEEE.