=Paper=
{{Paper
|id=Vol-3181/paper15
|storemode=property
|title=Combining Multiple Deep-learning-based Image Features for Visual Sentiment Analysis
|pdfUrl=https://ceur-ws.org/Vol-3181/paper15.pdf
|volume=Vol-3181
|authors=Alexandros Pournaras,Nikolaos Gkalelis,Damianos Galanopoulos,Vasileios Mezaris
|dblpUrl=https://dblp.org/rec/conf/mediaeval/PournarasGGM21
}}
==Combining Multiple Deep-learning-based Image Features for Visual Sentiment Analysis==
Alexandros Pournaras, Nikolaos Gkalelis, Damianos Galanopoulos, Vasileios Mezaris
CERTH-ITI, Greece
{apournaras,gkalelis,dgalanop,bmezaris}@iti.gr
ABSTRACT

This paper presents our team's (IDT-ITI-CERTH) proposed method for the Visual Sentiment Analysis task of the MediaEval 2021 benchmarking activity. Visual sentiment analysis is a challenging task, as it involves a high level of subjectivity. The most recent works are based on deep convolutional neural networks and exploit transfer learning from other image classification tasks. However, transferring knowledge from tasks other than image classification has not been investigated in the literature. Motivated by this, in our approach we examine the potential of transferring knowledge from several pre-trained networks, some of which are out-of-domain. We concatenate these diverse feature vectors and construct an image representation that is used to train a classifier for each of the three subtasks of this MediaEval task. Due to a bug in the original submission file, the official scores we got are 0.595, 0.479 and 0.380 for subtasks 1, 2 and 3, respectively.

1 INTRODUCTION

Visual sentiment analysis is the problem of identifying the sentiment conveyed by an image. The problem has recently attracted significant attention due to the large-scale use of images in social media. This MediaEval task focuses on images from natural disasters, i.e. content that can often induce strongly negative sentiments. A human-labeled disaster-related dataset, as well as a deep-learning-based approach to this problem, was proposed in [17]. A detailed description of the task is presented in [5].

In general, visual sentiment analysis is challenging because it involves a higher level of human subjectivity in the classification process, compared to other image classification tasks. Similarly to such tasks, deep convolutional neural networks are widely used; many literature works, e.g. [2], [7], rely on transfer learning by fine-tuning pre-trained networks that most commonly have been originally trained on ImageNet [12]. However, little emphasis has been given to investigating the potential of transferring knowledge learned from neural networks trained on tasks other than image classification. For this reason, we employ several pre-trained networks, some of which are trained on out-of-domain datasets and tasks. We extract their encodings and concatenate them to create a rich image representation. Using this, we train a sentiment classifier that takes the form of either a dense 3-layer neural network or a Mixture of Experts (MoE) [8] classifier.

2 APPROACH

Our proposed method is closely based on [11], a method that achieves state-of-the-art performance on many benchmark image sentiment analysis datasets. We transfer knowledge from 5 trained neural networks. These networks have different architectures and are trained on different datasets, some for problems other than image classification. They were chosen for use in this task because they perform very well in their respective domains. A feature vector is extracted from each network. In the following subsections, we briefly describe each network and how each feature vector is extracted. We classify each feature vector into one of two categories: in-domain, for those coming from networks trained on image classification tasks, and out-of-domain, for those coming from networks trained on other tasks. A summary of the employed feature vectors is given in Table 1.

2.1 In-Domain Feature Vectors

2.1.1 EfficientNet features. EfficientNet [14] is a recently proposed deep convolutional neural network architecture that achieves state-of-the-art performance on image classification tasks. We used a "B2"-variation model pre-trained on the 1000-class ImageNet dataset. We remove the last fully connected layer, so the network outputs a 1408-element feature vector, E.

2.1.2 Resnet features. Resnet [6] is a family of convolutional neural networks based on residual blocks that have shown state-of-the-art performance on image classification tasks. We use the 152-layer-deep Resnet architecture trained on the 11k-class ImageNet dataset [12] and extract the 2048-element "pool5" layer as the feature vector, R.

2.2 Out-of-Domain Feature Vectors

2.2.1 YT8M features. YouTube-8M [1] is a large annotated video dataset containing approximately 6 million videos with a total duration of more than 500,000 hours, labeled with 3862 classes. For training a classifier on this dataset, we extract features at a 1 fps sampling rate using an Inception neural network [13] pre-trained on ImageNet [12]. The ReLU activation of the last hidden layer of this network is given as input to a rather simple CNN classifier, consisting of a 1D convolutional layer with 64 filters, a max-pooling layer, a dropout layer and a sigmoid output layer with 3862 outputs. This is the YouTube-8M-trained classifier that we ultimately use as a feature extractor for sentiment classification: the classifier's 3862-element output vector for each image is our feature vector, Y.
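The following is a minimal illustrative sketch (not our actual implementation) of how penultimate-layer features such as those of Sections 2.1.1 and 2.1.2 can be extracted, assuming torchvision's pre-trained efficientnet_b2 and resnet152 models; note that torchvision's ResNet-152 is the 1000-class ImageNet model rather than the 11k-class variant used here, and the preprocessing pipeline is an assumption.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Standard ImageNet preprocessing (an assumption; the exact pipeline is not stated above).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

effnet = models.efficientnet_b2(weights=models.EfficientNet_B2_Weights.IMAGENET1K_V1)
effnet.classifier = torch.nn.Identity()  # drop the last FC layer -> 1408-element pooled features (E)

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()          # keep the 2048-element pooled ("pool5") features (R)

effnet.eval()
resnet.eval()

@torch.no_grad()
def in_domain_features(image_path: str) -> torch.Tensor:
    """Return the concatenated in-domain representation [E; R] for a single image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    e = effnet(x).squeeze(0)  # 1408-element vector E
    r = resnet(x).squeeze(0)  # 2048-element vector R
    return torch.cat([e, r])  # 3456 elements; Y, S and G are appended analogously
```

Replacing the final classification layer with nn.Identity is simply a convenient way to expose the pooled penultimate activations without modifying the model definition.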
2.2.2 "Signature" features. To obtain the "signature" features,
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0). we utilize a cross-modal network designed for ad-hoc video search.
MediaEvalβ21, December 13-15 2021, Online More specifically, the attention-based dual encoding network pre-
sented in [3] is used. The network is trained to translate a media
MediaEvalβ21, December 13-15 2021, Online A. Pournaras, N. Gkalelis, D. Galanopoulos, V. Mezaris
Table 1: Summary of employed deep-learning-based image features.
Feature name Base network architecture Training datasets Original task
EfficientNet (E) EfficientNet-B2 Imagenet 1k concepts image classification
Resnet (R) Resnet-152 Imagenet 11k concepts image classification
YT8M (Y) Inception Youtube8M video classification
Signature (S) Dual encoding network MSR-VTT, TGIF, Vatex, ActivityNet ad-hoc video search
GCN (G) Resnet-152, Faster R-CNN, GCN ImageNet 1k, FCVID, YLI-MED video event recognition
The network is trained to translate a media item (i.e. an entire video) V or a textual item (i.e. a natural-language video caption or search query) T into a new joint feature space f(·), resulting in representations f(V) or f(T), respectively; such representations, despite being derived from different data modalities, are directly comparable. This network is trained using large datasets of video-caption pairs: MSR-VTT [16], TGIF [10], Vatex [15] and ActivityNet [9]. To leverage this pre-trained network as a feature generator in the image sentiment analysis task, we considered an image as a special type of video comprising only one keyframe. The image is used as input to the visual encoding branch of the network, fed forward through the multi-level encoding layers, and the global image representation f(V), a 2048-element vector, is used as our "signature" feature S.

2.2.3 Graph Convolutional Network (GCN) features. To obtain this feature vector, we employ a neural network trained for the task of video event recognition [4]. Following the application of an object detector on the frames of the video, a neural network is used to extract the objects' features, and graphs are used to model the relations between objects. Then, a graph convolutional network (GCN) is utilized to perform reasoning on the graphs. The resulting object-based frame-level features are then forwarded to a long short-term memory (LSTM) network for video event recognition. To extract the feature vector that we use for image sentiment analysis in this work, we fetch the output of the GCN, which is a 2048-element vector, G.

2.3 Sentiment Classifiers

We concatenate the 5 feature vectors described above, resulting in a final 11414-element feature vector that is used to train our classifiers for the 3 subtasks.

2.3.1 Subtask 1. For subtask 1 we employ a Mixture of Experts classifier. The first layer of this classifier is a fully connected layer which transforms the input vector to a 200-element vector. After passing through a Dropout and a ReLU block, this 200-element vector is the input I forwarded to the K = 2 experts, e_c^1(·), e_c^2(·), which are defined for each class c, as well as to the associated gates, g_c^1(·), g_c^2(·). For each class, an extra "dummy" expert is also defined to represent the rest-of-the-world class; it only participates in partitioning the feature space through the gate component of the Mixture of Experts classifier. The experts and the gates are implemented as fully connected layers with a sigmoid and a softmax nonlinearity, respectively. A confidence score for the c-th class is then computed by merging the experts' outputs into a single output o_c(I) according to the gate's decision (Eq. (1)). The whole network is trained end-to-end.

    o_c(I) = Σ_{i=1,2} σ(e_c^i(I)) ⊙ softmax(g_c^i(I))    (1)

2.3.2 Subtask 2. For subtask 2 we employ a dense 3-layer neural network classifier. This classifier comprises three fully-connected layers with 1000, 200 and, finally, 7 neurons, as there are 7 target classes in this subtask. Between consecutive layers there is a ReLU and a dropout layer with 0.4 probability. Finally, the output passes through a sigmoid nonlinearity.

2.3.3 Subtask 3. For subtask 3 we use the same classifier as in subtask 2, with the exception of the output layer, which in this case comprises 10 neurons, as there are 10 target classes. (Illustrative implementation sketches of the classifiers of Section 2.3 are provided after the acknowledgments.)

3 RESULTS

For optimizing the parameters and choosing the classifiers for each subtask, we randomly split the development set into a training and a validation set with 80% and 20% of the images, respectively. For subtask 1, we measured the cross-entropy loss. We employed the Adam optimizer and trained with a 10⁻⁵ learning rate. We trained the model for 300 epochs. For subtasks 2 and 3 we additionally performed augmentations on the images of the training set: random crop, blurring, change in brightness and random rotations. We measured the binary cross-entropy loss and optimized with Adam. For subtask 2 we trained for 300 epochs with a 3×10⁻⁶ learning rate, while for subtask 3 we trained for 200 epochs with a 5×10⁻⁶ learning rate. The learning rate is scheduled to drop by half every 70 epochs in all the subtasks. The batch size for training was set to 64 for all the subtasks. Following the selection of the above parameters, we used the entire development set provided by the task organizers to train our final models. The experimental results we got on the development set, as well as the official (with bug) and unofficial (corrected) test set results, are shown in Table 2.

Table 2: The results of our method (weighted F1 score) for the 3 subtasks.

Subtask   | dev set | test set (with bug) | test set (corrected)
Subtask 1 | 0.757   | 0.595               | 0.740
Subtask 2 | 0.612   | 0.479               | 0.604
Subtask 3 | 0.510   | 0.380               | 0.510

ACKNOWLEDGMENTS

This work was supported by the EU Horizon 2020 programme under grant agreement 832921 (MIRROR).
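Below is a minimal illustrative sketch (in PyTorch, not our actual implementation) of a per-class Mixture-of-Experts head in the spirit of Section 2.3.1 and Eq. (1): a 11414-to-200 projection with dropout and ReLU, K = 2 sigmoid experts per class, and a softmax gate with one extra "dummy" expert per class that contributes no expert output. The dropout probability and the class count used here are placeholders, not values stated in the paper.

```python
import torch
import torch.nn as nn

class MoEHead(nn.Module):
    """Per-class Mixture-of-Experts head, sketched in the spirit of Eq. (1)."""
    def __init__(self, in_dim=11414, hidden=200, num_classes=3, num_experts=2, p_drop=0.5):
        super().__init__()
        # FC layer to a 200-element vector, followed by Dropout and ReLU (Section 2.3.1).
        self.proj = nn.Sequential(nn.Linear(in_dim, hidden), nn.Dropout(p_drop), nn.ReLU())
        self.num_classes, self.num_experts = num_classes, num_experts
        # One sigmoid expert score per (class, expert) pair.
        self.experts = nn.Linear(hidden, num_classes * num_experts)
        # Gates include one extra "dummy" expert per class; it only takes part in the
        # softmax partitioning of the feature space and contributes no expert output.
        self.gates = nn.Linear(hidden, num_classes * (num_experts + 1))

    def forward(self, x):
        i = self.proj(x)  # the 200-element input I of Eq. (1)
        e = torch.sigmoid(self.experts(i)).view(-1, self.num_classes, self.num_experts)
        g = torch.softmax(self.gates(i).view(-1, self.num_classes, self.num_experts + 1), dim=-1)
        # Eq. (1): o_c(I) = sum over the two experts of sigmoid(expert) * softmax gate weight,
        # with the dummy expert's gate weight excluded from the sum.
        return (e * g[..., : self.num_experts]).sum(dim=-1)  # per-class confidence scores o_c

# Example with a batch of 4 concatenated 11414-element image representations;
# num_classes=3 is a placeholder, not a value taken from the task description.
scores = MoEHead(num_classes=3)(torch.randn(4, 11414))
print(scores.shape)  # torch.Size([4, 3])
```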
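Similarly, the following is a minimal sketch of the dense 3-layer classifier of Sections 2.3.2 and 2.3.3, wired to the optimization settings stated in Section 3 (Adam, binary cross-entropy, halving of the learning rate every 70 epochs, batch size 64); the data pipeline shown is dummy, and anything not stated above is an assumption.

```python
import torch
import torch.nn as nn

def dense_classifier(in_dim=11414, num_classes=7, p_drop=0.4):
    """Dense 3-layer head of Sections 2.3.2-2.3.3: 1000 -> 200 -> num_classes,
    with ReLU + dropout (p=0.4) between consecutive layers and a sigmoid output."""
    return nn.Sequential(
        nn.Linear(in_dim, 1000), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(1000, 200), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(200, num_classes), nn.Sigmoid(),
    )

model = dense_classifier(num_classes=7)                    # subtask 2; num_classes=10 for subtask 3
criterion = nn.BCELoss()                                   # binary cross-entropy on the sigmoid outputs
optimizer = torch.optim.Adam(model.parameters(), lr=3e-6)  # 3x10^-6 for subtask 2 (5x10^-6 for subtask 3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=70, gamma=0.5)  # halve the LR every 70 epochs

# One dummy training step on a batch of 64 concatenated feature vectors
# (random tensors stand in for real features and multi-label targets).
features = torch.randn(64, 11414)
targets = torch.randint(0, 2, (64, 7)).float()
optimizer.zero_grad()
loss = criterion(model(features), targets)
loss.backward()
optimizer.step()
scheduler.step()  # in a full run, called once per epoch for 300 (or 200) epochs
```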
REFERENCES

[1] S. Abu-El-Haija, N. Kothari, J. Lee, A. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv:1609.08675. https://arxiv.org/pdf/1609.08675v1.pdf
[2] V. Campos, B. Jou, and X. Giró-i-Nieto. 2017. From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction. Image and Vision Computing 65 (2017), 15–22.
[3] D. Galanopoulos and V. Mezaris. 2020. Attention Mechanisms, Signal Encodings and Fusion Strategies for Improved Ad-hoc Video Search with Dual Encoding Networks. In Proc. of the ACM Int. Conf. on Multimedia Retrieval (ICMR '20). ACM.
[4] N. Gkalelis, A. Goulas, D. Galanopoulos, and V. Mezaris. 2021. ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-Up Recognition and Explanation of Events in Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 3375–3383.
[5] S. Z. Hassan, K. Ahmad, M. Riegler, S. Hicks, N. Conci, P. Halvorsen, and A. Al-Fuqaha. 2021. Visual Sentiment Analysis: A Natural Disaster Use-case Task at MediaEval 2021. In Proceedings of the MediaEval 2021 Workshop, Online.
[6] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
[7] J. Islam and Y. Zhang. 2016. Visual sentiment analysis for social images using transfer learning approach. In IEEE Int. Conf. on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom). IEEE, 124–130.
[8] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3 (1991), 79–87. https://doi.org/10.1162/neco.1991.3.1.79
[9] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. 2017. Dense-Captioning Events in Videos. In Int. Conf. on Computer Vision (ICCV).
[10] Y. Li, Y. Song, et al. 2016. TGIF: A new dataset and benchmark on animated GIF description. In Proc. of IEEE CVPR. 4641–4650.
[11] A. Pournaras, N. Gkalelis, D. Galanopoulos, and V. Mezaris. 2021. Exploiting Out-of-Domain Datasets and Visual Representations for Image Sentiment Classification. In 2021 16th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP). 1–6. https://doi.org/10.1109/SMAP53521.2021.9610801
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, et al. 2015. ImageNet large scale visual recognition challenge. Int. Journal of Computer Vision 115, 3 (2015), 211–252.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1–9. https://doi.org/10.1109/CVPR.2015.7298594
[14] M. Tan and Q. Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Int. Conf. on Machine Learning. 6105–6114.
[15] X. Wang, J. Wu, et al. 2019. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proc. of the IEEE Int. Conf. on Computer Vision. 4581–4591.
[16] J. Xu, T. Mei, T. Yao, and Y. Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proc. of IEEE CVPR. 5288–5296.
[17] S. Zohaib, K. Ahmad, N. Conci, and A. Al-Fuqaha. 2019. Sentiment Analysis from Images of Natural Disasters. arXiv:cs.CV/1910.04416.