Visual Sentiment Analysis Multiplying Deep Learning and Vision Transformers
Tetsuya Asakawa¹, Riku Tsuneda¹, Masaki Aono¹
¹Toyohashi University of Technology, Japan
asakawa.tetsuya.um@tut.jp, tsuneda.riku.am@tut.jp, masaki.aono.ss@tut.jp

ABSTRACT
   Visual sentiment analysis investigates sentiment estimation from images and has been an interesting and challenging research problem. Most studies have focused on estimating a few specific sentiments and their intensities using several complex CNN models. In this paper, we propose a method combining CNN and Vision Transformer features for the MediaEval 2021 Visual Sentiment Analysis: A Natural Disaster Use-case task. Specifically, we first introduce our proposed model used in Subtask 1. Then, we introduce a median-based multi-label prediction algorithm used in Subtasks 2 and 3, in which we assume that each emotion has a probability distribution. In other words, after training our proposed model, we predict that an evoked emotion exists for a given unknown image if the intensity of the emotion is larger than the median of the corresponding emotion. Experimental results demonstrate that our model outperforms several baseline models in terms of Weighted F1-Score.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online


1 INTRODUCTION
   With the spread of SNS and the Internet, a vast number of images are widely available. As a result, there is an urgent need for image indexing and retrieval techniques. When viewing an image, we can feel several emotions simultaneously, and different visual images have different emotional triggers. For instance, an image with a snake or a spider will most likely trigger a bad feeling such as "disgust" or "fear," whereas an image with a flower will most likely trigger a good feeling such as "amusement" or "excitement."
   Visual sentiment prediction investigates sentiment estimation from images and has been an interesting and challenging research problem. The purpose of this paper is to accurately estimate sentiments, as a single-label and multi-label multi-class problem, from given images that evoke multiple different emotions simultaneously [1].
   We introduce a new combined neural network model which accepts inputs coming from both ViT features and pre-trained CNN features. In addition, because existing deep learning models showed weak classification performance, we propose a new two-layer fully connected head. The contributions of this paper are that we (1) propose a novel feature considering both ViT and CNN features to predict the sentiment of images, unlike most recent research, which only concerns adopting CNN features, and (2) propose a combined feature method that combines the output of each feature, unlike previous work, which focuses on combining feature vectors.

2 APPROACH
   We propose a single-label (Subtask 1) and multi-label (Subtasks 2 and 3) visual sentiment analysis system to predict multiple emotions. Below, we describe the deep neural network model that produces single-label and multi-label outputs, given images that evoke emotions.

Figure 1: Overview of the proposed model combining ViT and CNN features.

2.1 Subtask 1
   This is a multi-class, single-label classification task in which the images are arranged in three different classes, namely positive, negative, and neutral. There is a strong imbalance towards the negative class, given the nature of the topic.
   To solve this multi-class, single-label classification problem, we propose new combined neural network models which accept inputs coming from both end-to-end feature extractors (Vision Transformers: ViT, and CNN). As illustrated in Figure 1, we adopt ViT-L/16 as the ViT and extract its features. CNN features, on the other hand, are extracted from a pre-trained CNN-based neural network, EfficientNetB0 [2].
   - Vision Transformers (ViT)
   The Vision Transformer is a model for image classification that employs a Transformer-like architecture over patches of the image. This includes the use of Multi-Head Attention, Scaled Dot-Product Attention, and other architectural features seen in the Transformer architecture traditionally used for NLP [3].
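   The following is a minimal sketch of this ViT feature extraction. The use of the timm library, the model name, and the fully-connected reduction to 512 dimensions are illustrative assumptions; the paper does not specify its implementation.

import timm
import torch
import torch.nn as nn

# Hedged sketch: extract a global image feature from a pre-trained
# ViT-L/16 (via timm, an assumption; the paper does not name a library).
vit = timm.create_model("vit_large_patch16_224", pretrained=True,
                        num_classes=0)    # num_classes=0 -> feature extractor
vit.eval()

x = torch.randn(1, 3, 224, 224)           # dummy normalized RGB image
with torch.no_grad():
    vit_feat = vit(x)                     # (1, 1024) pooled ViT-L/16 feature

reduce_fc = nn.Linear(1024, 512)          # reduction to 512-d (untrained here)
vit_feat = reduce_fc(vit_feat)            # (1, 512) ViT feature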

   - CNN
   In addition to the ViT features described above, our system incorporates CNN features, which can be extracted from pre-trained deep convolutional neural networks, here EfficientNetB0. Because of the lack of datasets in visual sentiment analysis, we adopt transfer learning to prevent overfitting.
   We decrease the dimensions of the fully-connected layers used in CNN models. Specifically, for EfficientNetB0, we extract a 1280-dimensional vector from the 'Global Average Pooling 2D' layer (i.e., the layer before the last fully-connected layer), and reduce the vector to 512 dimensions by applying a fully-connected layer.
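   A minimal sketch of this CNN branch follows. The paper names the Keras 'Global Average Pooling 2D' layer; timm is used here only for consistency with the ViT sketch above, so the library choice is an assumption.

import timm
import torch
import torch.nn as nn

# Hedged sketch: EfficientNetB0's global-average-pooled 1280-d feature,
# reduced to 512 dimensions by one fully-connected layer.
cnn = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)
cnn.eval()
reduce_fc = nn.Linear(1280, 512)          # fully-connected reduction layer

x = torch.randn(1, 3, 224, 224)           # dummy image batch
with torch.no_grad():
    pooled = cnn(x)                       # (1, 1280) pooled CNN feature
cnn_feat = reduce_fc(pooled)              # (1, 512) reduced CNN feature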
                                                                        repeated this computation until all the test (unknown) images
2.2 Subtasks 2 and 3
   Subtasks 2 and 3 are multi-class, multi-label image classification tasks, where the participants are provided with multi-labeled images.
   To solve this multi-class, multi-label classification problem, we propose new combined neural network models which accept inputs coming from both end-to-end feature extractors (ViT and CNN). As in Subtask 1, we adopt ViT-L/16 as the ViT and extract its features, while CNN features are extracted from a pre-trained CNN-based neural network, EfficientNetB0.
   To deal with the combined features, we propose a deep neural network architecture that accepts multiple inputs and produces a multi-hot vector as output. The combined feature is represented by the following formula:

   combined feature = average(ω₁(ViT) + ω₂(CNN))        (1)

   Based on this formula, after the training process, our neural network system predicts the visual sentiment of unknown images as a multi-label, multi-class classification problem.
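   A minimal sketch of Eq. (1), together with the two-layer fully connected head mentioned in the introduction, is shown below. The learnable scalar weights and layer sizes are assumptions, not the authors' released code.

import torch
import torch.nn as nn

# Hedged sketch of Eq. (1): average of the weighted ViT and CNN features,
# followed by a 2-layer fully connected classification head.
class CombinedModel(nn.Module):
    def __init__(self, dim: int = 512, num_classes: int = 3):
        super().__init__()
        self.w1 = nn.Parameter(torch.ones(1))   # weight on the ViT feature
        self.w2 = nn.Parameter(torch.ones(1))   # weight on the CNN feature
        self.head = nn.Sequential(              # assumed 2-layer FC head
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, vit_feat: torch.Tensor, cnn_feat: torch.Tensor):
        # Eq. (1): average of the weighted ViT and CNN features
        combined = (self.w1 * vit_feat + self.w2 * cnn_feat) / 2.0
        return self.head(combined)

logits = CombinedModel()(torch.randn(1, 512), torch.randn(1, 512))
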
   - Multi-label prediction
   To predict a multi-hot vector, we employ a method based on our previous research [4], illustrated in Algorithm 1. The input is a collection of features extracted from each image with K kinds of sentiments, while the output is a K-dimensional multi-hot vector.

Algorithm 1: Predicting a multi-hot vector for an image.

   In Algorithm 1, we assume that the extracted features (here ViT and CNN) are represented by their probabilities. For each sentiment, we sum up the features and average the result, which is denoted by Tᵢₖ in Algorithm 1.
   We used a fixed threshold of 0.5 for each sentiment as the threshold of the corresponding emotion evocation. The multi-hot vector of each image is then generated such that if Tᵢₖ is equal to or greater than the threshold, we set Sᵢₖ = 1; otherwise Sᵢₖ = 0, where Sᵢₖ is the element for the k-th sentiment of the i-th image. In short, the vector Sᵢ is the output multi-hot vector. We repeat this computation until all the test (unknown) images have been processed.
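   A minimal sketch of this thresholding step, with illustrative variable names (Algorithm 1's pseudocode itself is not reproduced in the text):

import numpy as np

# T[i, k] is the averaged probability of sentiment k for image i.
def predict_multi_hot(T: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return S with S[i, k] = 1 if T[i, k] >= threshold, else 0."""
    return (T >= threshold).astype(int)

# Example: averaged ViT/CNN probabilities for 2 images, K = 3 sentiments.
T = np.array([[0.70, 0.40, 0.55],
              [0.20, 0.90, 0.50]])
print(predict_multi_hot(T))   # [[1 0 1]
                              #  [0 1 1]]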

3 EXPERIMENTAL RESULTS
   Here we describe our experiments and evaluations. We divided the training dataset into training and validation data with an 8:2 ratio. We used the following hyper-parameters: a batch size of 256, SGD as the optimization function with a learning rate of 0.001 and a momentum of 0.9, and 200 epochs. For the evaluation of single-label and multi-label classification, we employed the Weighted F1-Score.
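   A minimal sketch of this training setup; the loss functions are assumptions, since the paper does not name them:

import torch

model = torch.nn.Linear(512, 3)      # stand-in for the combined model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
single_label_loss = torch.nn.CrossEntropyLoss()    # assumed for Subtask 1
multi_label_loss = torch.nn.BCEWithLogitsLoss()    # assumed for Subtasks 2, 3
BATCH_SIZE, EPOCHS = 256, 200
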
   Table 1 compares the models in terms of Weighted F1-Score. It includes two baseline methods, ViT-L/16 and EfficientNet B0, and our proposed combined model; the "Dim" column gives the feature dimension. For the proposed combined model, we tested one variation, ViT + EfficientNet B0, which turns out to have the best score on all three subtasks. We also observed that the proposed method could correctly recognize images whose emotions were falsely classified by the base CNN.

Table 1: Weighted F1-Scores of the baseline models and the proposed model on Subtasks 1-3

Model            Dim    Subtask 1   Subtask 2   Subtask 3
ViT-L/16         512    0.692       0.412       0.402
EfficientNet B0  512    0.715       0.534       0.392
Proposed model   1024   0.753       0.585       0.415
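   For reference, the Weighted F1-Score can be computed with scikit-learn as below; the labels are illustrative, not the MediaEval data.

from sklearn.metrics import f1_score

# Weighted F1: per-class F1 averaged, weighted by class support.
y_true = [0, 1, 1, 2, 1, 0]   # e.g., 0=positive, 1=negative, 2=neutral
y_pred = [0, 1, 2, 2, 1, 0]
print(f1_score(y_true, y_pred, average="weighted"))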

4 CONCLUSIONS
   We proposed a model for the Visual Sentiment Analysis: A Natural Disaster Use-case task which estimates single-label and multi-label multi-class problems from given images that evoke multiple different emotions simultaneously.
   Our proposed model is simple yet effective and outperforms the individual ViT and CNN baselines on all three subtasks.

ACKNOWLEDGMENTS
A part of this research was carried out with the support of the Grant for Toyohashi Heart Center Smart Hospital Joint Research Course and the Grant for Education and Research in Toyohashi University of Technology.

REFERENCES
[1] Hassan, S.Z., Ahmad, K., Riegler, M., Hicks, S., Conci, N., Halvorsen, P., and Al-Fuqaha, A. 2021. Visual Sentiment Analysis: A Natural Disaster Use-case Task at MediaEval 2021. In Proceedings of the MediaEval 2021 Workshop, Online.
[2] Tan, M. and Le, Q.V. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML 2019. https://arxiv.org/pdf/1905.11946.pdf.
[3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., and Houlsby, N. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[4] Asakawa, T. and Aono, M. 2019. Median-based Multi-label Prediction by Inflating Emotions with Dyads for Visual Sentiment Analysis. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 2008-2014. IEEE.



