=Paper=
{{Paper
|id=Vol-3181/paper24
|storemode=property
|title=Visual Sentiment Analysis Multiplying Deep Learning and Vision Transformers
|pdfUrl=https://ceur-ws.org/Vol-3181/paper24.pdf
|volume=Vol-3181
|authors=Tetsuya Asakawa,Riku Tsuneda,Masaki Aono
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AsakawaTA21
}}
==Visual Sentiment Analysis Multiplying Deep Learning and Vision Transformers==
Tetsuya Asakawa1, Riku Tsuneda1, Masaki Aono1
1 Toyohashi University of Technology, Japan
asakawa.tetsuya.um@tut.jp, tsuneda.riku.am@tut.jp, masaki.aono.ss@tut.jp
ABSTRACT

Visual sentiment analysis investigates sentiment estimation from images and has been an interesting and challenging research problem. Most studies have focused on estimating a few specific sentiments and their intensities using several complex CNN models. In this paper, we propose a method multiplying CNN and Vision Transformer features for the MediaEval 2021 Visual Sentiment Analysis: A Natural Disaster Use-case task. Specifically, we first introduce the proposed model used in Subtask 1. We then introduce a median-based multi-label prediction algorithm used in Subtasks 2 and 3, in which we assume that each emotion has a probability distribution. In other words, after training our proposed model, we predict that an emotion is evoked by a given unknown image if the intensity of the emotion is larger than the median of the corresponding emotion. Experimental results demonstrate that our model outperforms several baseline models in terms of Weighted F1-Score.
1 INTRODUCTION

With the spread of SNS and the Internet, a vast number of images are widely available. As a result, there is an urgent requirement for image indexing and retrieval techniques. When viewing an image, we can feel several emotions simultaneously. Different visual images have different emotional triggers. For instance, an image with a snake or a spider will most likely trigger a bad feeling such as “disgust” or “fear,” whereas an image with a flower will most likely trigger a good feeling such as “amusement” or “excitement.”

Visual sentiment prediction investigates sentiment estimation from images and has been an interesting and challenging research problem. In this paper, the purpose is to accurately estimate sentiments as single-label and multi-label multi-class problems from given images that evoke multiple different emotions simultaneously [1].

We also introduce a new combined neural network model which allows inputs coming from both ViT features and pre-trained CNN features. In addition, existing deep learning models had weak classification layers; we therefore propose two new fully-connected layers. The contributions of this paper are that we (1) propose a novel feature that considers both ViT and CNN features to predict the sentiment of images, unlike most recent research, which only concerns adopting CNN features, and (2) propose a combined-feature method that combines the outputs of each feature extractor, unlike previous work, which focuses on combining feature vectors.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’21, December 13-15 2021, Online.

2 APPROACH

We propose a single-label (Subtask 1) and multi-label (Subtasks 2 and 3) visual sentiment analysis system to predict multiple emotions. We describe our deep neural network model that enables single-label and multi-label outputs, given images that evoke emotions.

Figure 1: Overview of the proposed model combining ViT and CNN features.

2.1 Subtask 1

Subtask 1 is a multi-class, single-label classification task, where the images are arranged in three different classes, namely positive, negative, and neutral. There is a strong imbalance towards the negative class, given the nature of the topic.

To solve our multi-class, single-label classification problem, we propose new combined neural network models which allow inputs coming from both end-to-end (Vision Transformer: ViT) and CNN features. As illustrated in Figure 1, we adopt ViT-L/16 as the ViT and extract its features. On the other hand, the CNN features are extracted from a pre-trained CNN-based neural network, EfficientNetB0 [2].

- Vision Transformer (ViT)

The Vision Transformer is a model for image classification that employs a Transformer-like architecture over patches of the image. This includes the use of Multi-Head Attention, Scaled Dot-Product Attention, and other architectural features seen in the Transformer architecture traditionally used for NLP [3].

- CNN

In addition to the ViT features described above, our system incorporates CNN features, which can be extracted from pre-trained deep convolutional neural networks, here EfficientNetB0. Because of the lack of datasets in visual sentiment analysis, we adopt transfer learning for this feature to prevent overfitting.

We decrease the dimensions of the fully-connected layers used in the CNN models. Specifically, for EfficientNetB0, we extract a 1280-dimensional vector from the ‘Global Average Pooling 2D’ layer (i.e., the second-to-last layer) and reduce it to 512 dimensions by applying a fully-connected layer.
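To make the combined architecture concrete, the following is a minimal sketch of the feature extraction and fusion described in this section. It assumes the timm library for the pre-trained ViT-L/16 and EfficientNetB0 backbones; the 512-dimensional projections and the 1024-dimensional combined feature follow the text and Table 1, while the concatenation and classifier head are our assumptions about details left to Figure 1.

```python
import torch
import torch.nn as nn
import timm  # pre-trained ViT and EfficientNet backbones


class CombinedSentimentModel(nn.Module):
    """Sketch of the combined ViT + CNN model (cf. Figure 1)."""

    def __init__(self, num_classes: int = 3):
        super().__init__()
        # Pre-trained backbones; num_classes=0 makes them return pooled features
        # (1024-d for ViT-L/16, 1280-d global-average-pooled for EfficientNetB0).
        self.vit = timm.create_model("vit_large_patch16_224",
                                     pretrained=True, num_classes=0)
        self.cnn = timm.create_model("efficientnet_b0",
                                     pretrained=True, num_classes=0)
        # Fully-connected layers reducing each feature to 512 dimensions.
        self.vit_fc = nn.Linear(self.vit.num_features, 512)
        self.cnn_fc = nn.Linear(self.cnn.num_features, 512)
        # Classifier head over the 1024-d combined feature (an assumption).
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(1024, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        vit_feat = self.vit_fc(self.vit(x))  # (B, 512)
        cnn_feat = self.cnn_fc(self.cnn(x))  # (B, 512)
        combined = torch.cat([vit_feat, cnn_feat], dim=1)  # (B, 1024)
        return self.head(combined)


model = CombinedSentimentModel(num_classes=3)  # positive / negative / neutral
logits = model(torch.randn(2, 3, 224, 224))    # dummy batch of two images
```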
2.2 Subtasks 2 and 3

Subtasks 2 and 3 are multi-class, multi-label image classification tasks, where the participants are provided with multi-labeled images.

To solve our multi-class, multi-label classification problem, we propose new combined neural network models which allow inputs coming from both end-to-end (ViT and CNN) features. We adopt ViT-L/16 as the ViT and extract its features. On the other hand, the CNN features are extracted from a pre-trained CNN-based neural network, EfficientNetB0.

To deal with the above combined features, we propose a deep neural network architecture that allows multiple inputs and a multi-hot vector output. The combined feature is represented by the following formula:

combined feature = average(ω₁(ViT) + ω₂(CNN))    (1)
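As a concrete reading of Equation (1), a minimal sketch follows; the paper does not state how the weights ω₁ and ω₂ are chosen, so treating them as scalar hyper-parameters (defaulting to equal weights) is our assumption.

```python
import torch


def combined_feature(vit_out: torch.Tensor, cnn_out: torch.Tensor,
                     w1: float = 1.0, w2: float = 1.0) -> torch.Tensor:
    """Equation (1): combined feature = average(w1 * ViT + w2 * CNN).

    vit_out, cnn_out: tensors of equal shape, e.g. (batch, K) model outputs.
    w1, w2: assumed scalar weights standing in for omega_1 and omega_2.
    """
    return (w1 * vit_out + w2 * cnn_out) / 2.0
```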
Based on this formula, after the training process, we allow our neural network system to predict the visual sentiment of unknown images as a multi-label, multi-class classification problem.

- Multi-label prediction

To detect a multi-hot vector, we employed a method based on our previous research [4]. We propose the method illustrated in Algorithm 1. The input is a collection of features extracted from each image with K kinds of sentiments, while the output is a K-dimensional multi-hot vector.

Algorithm 1: Predicting a multi-hot vector for an image.

In Algorithm 1, we assume that the extracted features (here, ViT and CNN) are represented by their probabilities. For each sentiment, we sum up the features and then average the result, which is denoted by Tᵢₖ in Algorithm 1.

We used a fixed threshold for sentiment and adopted “0.5” for each feature, which we employed as the threshold of the corresponding emotion evocation. After fixing this threshold for every sentiment, the multi-hot vector of each image is generated such that if Tᵢₖ is equal to or greater than the threshold, we set Sᵢₖ = 1; otherwise Sᵢₖ = 0, where Sᵢₖ is the element for the k-th sentiment of the i-th image. In short, the vector Sᵢ represents the output multi-hot vector. We repeated this computation until all the test (unknown) images were processed.
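Since the body of Algorithm 1 is given only as a figure, here is a minimal sketch of its prediction step under our reading of the text: the per-sentiment scores are summed and averaged to form Tᵢₖ, and the fixed threshold of 0.5 is applied per sentiment (the per-emotion median from [4] could be substituted).

```python
import numpy as np


def predict_multi_hot(vit_probs: np.ndarray, cnn_probs: np.ndarray,
                      threshold: float = 0.5) -> np.ndarray:
    """Multi-hot prediction step of Algorithm 1 (our reading of the text).

    vit_probs, cnn_probs: (N, K) per-sentiment probabilities for N images.
    Returns S, an (N, K) 0/1 matrix with S[i, k] = 1 iff T[i, k] >= threshold.
    """
    # Sum the extracted features per sentiment, then average: T_ik.
    T = (vit_probs + cnn_probs) / 2.0
    # Fixed threshold of 0.5 per sentiment builds the multi-hot vector S_i.
    return (T >= threshold).astype(np.int64)


# Example: three images, four sentiments.
vit_p = np.array([[0.7, 0.2, 0.6, 0.1],
                  [0.4, 0.9, 0.3, 0.5],
                  [0.1, 0.1, 0.8, 0.9]])
cnn_p = np.array([[0.6, 0.3, 0.4, 0.2],
                  [0.5, 0.8, 0.2, 0.6],
                  [0.2, 0.0, 0.7, 0.8]])
print(predict_multi_hot(vit_p, cnn_p))
# [[1 0 1 0]
#  [0 1 0 1]
#  [0 0 1 1]]
```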
3 EXPERIMENTAL RESULTS

Here we describe the experiments and evaluations. We divided the training dataset into training and validation data with an 8:2 ratio. We determined the following hyper-parameters: a batch size of 256, “SGD” as the optimization function with a learning rate of 0.001 and a momentum of 0.9, and 200 epochs. For the evaluation of both single-label and multi-label classification, we employed the Weighted F1-Score.
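A minimal sketch of this training and evaluation setup, assuming PyTorch and scikit-learn; `full_dataset` is a hypothetical placeholder for the task data, `model` stands for the combined model sketched in Section 2.1, and the cross-entropy loss is our assumption for the single-label subtask.

```python
import torch
from torch.utils.data import DataLoader, random_split
from sklearn.metrics import f1_score

# `full_dataset` is a hypothetical placeholder for the task's image dataset.
train_size = int(0.8 * len(full_dataset))            # 8:2 train/validation split
train_set, val_set = random_split(full_dataset,
                                  [train_size, len(full_dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=256, shuffle=True)

# Reported hyper-parameters: SGD with lr 0.001 and momentum 0.9, 200 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()  # assumed loss for the single-label task

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Weighted F1-Score, as used for both single-label and multi-label evaluation;
# y_true and y_pred are the validation labels and predictions.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
```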
Table 1 compares the models in terms of Weighted F1-Score. The table includes the baseline methods, ViT and EfficientNet B0, as well as our proposed combined model. The “Dim” column of the table represents the feature dimension. For our proposed combined model, we tested one variation, i.e., ViT + EfficientNet B0, and it turns out that this combination has the best score. We observed that the proposed method could correctly recognize images whose emotions were falsely classified by the base CNN.

Table 1: Experimental results (Weighted F1-Score for each subtask).

Model             Dim    Subtask 1   Subtask 2   Subtask 3
ViT-L/16          512    0.692       0.412       0.402
EfficientNet B0   512    0.715       0.534       0.392
Proposed model    1024   0.753       0.585       0.415
4 CONCLUSIONS

We proposed a model for the Visual Sentiment Analysis: A Natural Disaster Use-case task which accurately estimates single-label and multi-label multi-class problems from given images evoking multiple different emotions simultaneously. Our proposed model is simple yet effective and achieves new state-of-the-art performance on multiple datasets.

ACKNOWLEDGMENTS

A part of this research was carried out with the support of the Grant for Toyohashi Heart Center Smart Hospital Joint Research Course and the Grant for Education and Research at Toyohashi University of Technology.
REFERENCES
[1] Hassan, S. Z., Ahmad, K., Riegler, M., Hicks, S., Conci, N., Halvorsen, P., and Al-Fuqaha, A. 2021. Visual Sentiment Analysis: A Natural Disaster Use-case Task at MediaEval 2021. In Proceedings of the MediaEval 2021 Workshop, Online.
[2] Tan, M., and Le, Q. V. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In ICML 2019. https://arxiv.org/pdf/1905.11946.pdf.
[3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., and Houlsby, N. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
[4] Asakawa, T., and Aono, M. 2019. Median-based Multi-label Prediction by Inflating Emotions with Dyads for Visual Sentiment Analysis. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 2008-2014. IEEE.