Combining Multiple Deep-learning-based Image Features for Visual Sentiment Analysis

Alexandros Pournaras, Nikolaos Gkalelis, Damianos Galanopoulos, Vasileios Mezaris
CERTH-ITI, Greece
{apournaras,gkalelis,dgalanop,bmezaris}@iti.gr

MediaEval'21, December 13-15 2021, Online. Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
This paper presents our team's (IDT-ITI-CERTH) proposed method for the Visual Sentiment Analysis task of the MediaEval 2021 benchmarking activity. Visual sentiment analysis is a challenging task, as it involves a high level of subjectivity. The most recent works are based on deep convolutional neural networks and exploit transfer learning from other image classification tasks; however, transferring knowledge from tasks other than image classification has not been investigated in the literature. Motivated by this, in our approach we examine the potential of transferring knowledge from several pre-trained networks, some of which are out-of-domain. We concatenate these diverse feature vectors and construct an image representation that is used to train a classifier for each of the three subtasks of this MediaEval task. Due to a bug in the original submission file, the official scores we got are 0.595, 0.479 and 0.380 for subtasks 1, 2 and 3, respectively.

1 INTRODUCTION
Visual sentiment analysis is the problem of identifying the sentiment conveyed by an image. The problem has recently attracted significant attention due to the large-scale use of images in social media. This MediaEval task focuses on images from natural disasters, i.e. content that can often induce strongly negative sentiments. A human-labeled disaster-related dataset, as well as a deep-learning-based approach to solve it, were proposed in [17]. A detailed description of the task is presented in [5].

In general, visual sentiment analysis is challenging because it involves a higher level of human subjectivity in the classification process, compared to other image classification tasks. Similarly to such tasks, deep convolutional neural networks are widely used; many works in the literature, e.g. [2, 7], rely on transfer learning by fine-tuning pre-trained networks that most commonly have been originally trained on ImageNet [12]. However, little emphasis has been given to investigating the potential of transferring knowledge learned from neural networks trained on tasks other than image classification. For this reason, we employ several pre-trained networks, some of which are trained on out-of-domain datasets and tasks. We extract their encodings and concatenate them to create a rich image representation. Using this, we train a sentiment classifier that takes the form of either a dense 3-layer neural network or a Mixture of Experts (MoE) [8] classifier.

2 APPROACH
Our proposed method is closely based on [11], a method that achieves state-of-the-art performance on many benchmark image sentiment analysis datasets. We transfer knowledge from five pre-trained neural networks. These networks have different architectures and are trained on different datasets, some for problems other than image classification. They were chosen for use in this task because they perform very well in their respective domains. A feature vector is extracted from each network. In the following subsections, we briefly describe each network and how each feature vector is extracted. We classify each feature vector into one of two categories: in-domain, for those coming from networks trained on image classification tasks, or out-of-domain, for those coming from networks trained on other tasks. A summary of the employed feature vectors is given in Table 1.

Table 1: Summary of employed deep-learning-based image features.

Feature name      Base network architecture      Training datasets                   Original task
EfficientNet (E)  EfficientNet-B2                ImageNet 1k concepts                image classification
ResNet (R)        ResNet-152                     ImageNet 11k concepts               image classification
YT8M (Y)          Inception                      YouTube-8M                          video classification
Signature (S)     Dual encoding network          MSR-VTT, TGIF, Vatex, ActivityNet   ad-hoc video search
GCN (G)           ResNet-152, Faster R-CNN, GCN  ImageNet 1k, FCVID, YLI-MED         video event recognition
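To make the pipeline of Table 1 concrete, the following minimal sketch concatenates five placeholder vectors with the dimensionalities that the subsections below report for E, R, Y, S and G; their sum yields the 11414-element representation that is fed to the sentiment classifiers of Sec. 2.3. The placeholder vectors are illustrative only.

```python
import numpy as np

# Placeholders standing in for the real network outputs E, R, Y, S, G;
# the dimensionalities are the ones reported in Sec. 2.1 and 2.2.
E, R, Y, S, G = (np.zeros(d, dtype=np.float32) for d in (1408, 2048, 3862, 2048, 2048))

x = np.concatenate([E, R, Y, S, G])  # combined image representation
print(x.shape)                       # (11414,) -- input to the classifiers of Sec. 2.3
```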
2.1 In-Domain Feature Vectors

2.1.1 EfficientNet features. EfficientNet [14] is a recently proposed deep convolutional neural network architecture that achieves state-of-the-art performance on image classification tasks. We use a "B2"-variant model pre-trained on the 1000-class ImageNet dataset. We remove the last fully connected layer, so the network outputs a 1408-element feature vector, E.

2.1.2 ResNet features. ResNet [6] is a family of convolutional neural networks based on residual blocks that has shown state-of-the-art performance in image classification tasks. We use the 152-layer-deep ResNet architecture trained on the 11k-class ImageNet dataset [12] and extract the 2048-element "pool5" layer as the feature vector, R.

2.2 Out-of-Domain Feature Vectors

2.2.1 YT8M features. YouTube-8M [1] is a large annotated video dataset containing approximately 6 million videos, with a total duration of more than 500,000 hours, labeled with 3862 classes. For training a classifier on this dataset, we extract features at a 1 fps sampling rate using an Inception neural network [13] pre-trained on ImageNet [12]. The ReLU activation of the last hidden layer of this network is given as input to a rather simple CNN classifier, consisting of a 1D convolutional layer with 64 filters, a max-pooling layer, a dropout layer and a sigmoid output layer with 3862 outputs. This YouTube-8M-trained classifier is what we ultimately use as a feature extractor for sentiment classification: the classifier's 3862-element output vector for each image is our feature vector, Y.

2.2.2 "Signature" features. To obtain the "signature" features, we utilize a cross-modal network designed for ad-hoc video search. More specifically, the attention-based dual encoding network presented in [3] is used. The network is trained to translate a media item (i.e. an entire video) V or a textual item (i.e. a natural-language video caption or search query) T into a new joint feature space f(·), resulting in representations f(V) or f(T), respectively; such representations, despite being derived from different data modalities, are directly comparable. This network is trained using large datasets of video-caption pairs: MSR-VTT [16], TGIF [10], Vatex [15] and ActivityNet [9]. For leveraging this pre-trained network as a feature generator in the image sentiment analysis task, we consider an image as a special type of video comprising only one keyframe. The image is used as input to the visual encoding branch of the network, fed forward through the multi-level encoding layers, and the global image representation f(V), a 2048-element vector, is used as our "signature" feature, S.

2.2.3 Graph Convolutional Network (GCN) features. To obtain this feature vector, we employ a neural network trained for the task of video event recognition [4]. Following the application of an object detector on the frames of the video, a neural network is used to extract the objects' features, and graphs are used to model the relations between objects. Then, a graph convolutional network (GCN) is utilized to perform reasoning on the graphs. The resulting object-based frame-level features are then forwarded to a long short-term memory (LSTM) network for video event recognition. To extract the feature vector that we use for image sentiment analysis in this work, we fetch the output of the GCN, which is a 2048-element vector, G.
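As a minimal sketch of the in-domain feature extraction of Sec. 2.1, the snippet below removes the classification heads of EfficientNet-B2 and ResNet-152 so that they return the 1408- and 2048-element vectors E and R. It assumes the publicly available torchvision ImageNet-1k checkpoints as stand-ins (the ResNet-152 we actually use is trained on the 11k-class ImageNet, for which no torchvision weights exist) and omits the exact image preprocessing.

```python
import torch
from torchvision import models

# EfficientNet-B2: drop the final classifier so the 1408-d embedding is returned.
effnet = models.efficientnet_b2(weights="IMAGENET1K_V1")
effnet.classifier = torch.nn.Identity()

# ResNet-152: drop the final fully connected layer to keep the 2048-d "pool5" output.
# (Stand-in: torchvision's ImageNet-1k weights, not the 11k-class model used here.)
resnet = models.resnet152(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()

effnet.eval(); resnet.eval()
with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)   # a preprocessed image tensor (dummy here)
    E = effnet(img)                     # shape (1, 1408)
    R = resnet(img)                     # shape (1, 2048)
# The out-of-domain vectors Y, S and G are produced analogously by their own
# pre-trained networks (Sec. 2.2) and concatenated with E and R.
```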
2.3 Sentiment Classifiers
We concatenate the five feature vectors described above, resulting in a final 11414-element feature vector that is used to train our classifiers for the three subtasks.

2.3.1 Subtask 1. For subtask 1 we employ a Mixture of Experts classifier. The first layer of this classifier is a fully connected layer which transforms the input vector into a 200-element vector. After passing through a dropout and a ReLU block, this 200-element vector is the input I that is forwarded to the two (i = 1, 2) experts, e_1^c(), e_2^c(), which are defined for each class c, as well as to the associated gates, g_1^c(), g_2^c(). For each class, an extra "dummy" expert is also defined to represent the rest-of-the-world class; it only participates in partitioning the feature space through the gate component of the Mixture of Experts classifier. The experts and the gate are implemented as fully connected layers with a sigmoid and a softmax nonlinearity, respectively. A confidence score for the c-th class is then computed by merging the experts' outputs into a single output o_c(I) according to the gate's decision, as in Eq. (1). The whole network is trained end-to-end.

    o_c(I) = Σ_{i=1,2} σ(e_i^c(I)) · Softmax(g_i^c(I))        (1)

2.3.2 Subtask 2. For subtask 2 we employ a dense 3-layer neural network classifier. This classifier comprises three fully connected layers with 1000, 200 and, finally, 7 neurons, as there are 7 target classes in this subtask. Between consecutive layers there is a ReLU and a dropout layer with 0.4 probability. Finally, the output passes through a sigmoid nonlinearity.

2.3.3 Subtask 3. For subtask 3 we use the same classifier as in subtask 2, with the exception of the output layer, which in this case comprises 10 neurons, as there are 10 target classes.
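A minimal PyTorch sketch of the subtask-1 Mixture of Experts head is given below. It assumes that each expert and each gate is a single linear layer on top of the shared 200-element embedding, and that the per-class "dummy" expert contributes a third gate logit but no expert output; the dropout rate and the number of target classes are likewise assumptions, as they are not stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEHead(nn.Module):
    """Sketch of the per-class Mixture of Experts classifier of Sec. 2.3.1 / Eq. (1)."""

    def __init__(self, num_classes, in_dim=11414, hidden=200, num_experts=2, p_drop=0.5):
        super().__init__()
        # shared embedding: FC -> Dropout -> ReLU, as described in Sec. 2.3.1
        self.embed = nn.Sequential(nn.Linear(in_dim, hidden), nn.Dropout(p_drop), nn.ReLU())
        # two expert scores e_i^c per class (sigmoid applied in forward)
        self.experts = nn.Linear(hidden, num_classes * num_experts)
        # gate logits g_i^c per class, plus one extra logit for the "dummy" expert
        self.gates = nn.Linear(hidden, num_classes * (num_experts + 1))
        self.num_classes, self.num_experts = num_classes, num_experts

    def forward(self, x):
        h = self.embed(x)                                               # (B, 200)
        e = torch.sigmoid(self.experts(h))                              # sigma(e_i^c(I))
        e = e.view(-1, self.num_classes, self.num_experts)
        g = self.gates(h).view(-1, self.num_classes, self.num_experts + 1)
        g = F.softmax(g, dim=-1)[..., : self.num_experts]               # drop the dummy weight
        return (e * g).sum(dim=-1)                                      # o_c(I), shape (B, C)
```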
3 RESULTS
For optimizing the parameters and choosing the classifiers for each subtask, we randomly split the development set into a training and a validation set with 80% and 20% of the images, respectively. For subtask 1, we used the cross-entropy loss, the Adam optimizer and a learning rate of 10^-5, and trained the model for 300 epochs. For subtasks 2 and 3 we additionally applied augmentations to the images of the training set: random crops, blurring, changes in brightness and random rotations. We used the binary cross-entropy loss and optimized with Adam. For subtask 2 we trained for 300 epochs with a 3×10^-6 learning rate, while for subtask 3 we trained for 200 epochs with a 5×10^-6 learning rate. The learning rate is scheduled to drop by half every 70 epochs, and the batch size was set to 64, in all subtasks. Following the selection of the above parameters, we used the entire development set provided by the task organizers to train our final models. The experimental results we got on the development set, as well as the official (with bug) and unofficial (corrected) test set results, are shown in Table 2.

Table 2: The results of our method (weighted F1 score) for the 3 subtasks.

Subtask     dev set   test set (with bug)   test set (corrected)
Subtask 1   0.757     0.595                 0.740
Subtask 2   0.612     0.479                 0.604
Subtask 3   0.510     0.380                 0.510
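For concreteness, the sketch below wires the training configuration described above around the dense 3-layer classifier of Sec. 2.3.2 for subtask 2 (Adam, 3×10^-6 learning rate halved every 70 epochs, batch size 64, binary cross-entropy). The data loader is a dummy stand-in for the real loader of concatenated feature vectors, and the augmentation magnitudes are assumptions; only the hyper-parameter values stated above are taken from our setup.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Image augmentations for subtasks 2 and 3 (applied before feature extraction);
# the magnitudes are assumed -- only the augmentation types come from the text.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.GaussianBlur(kernel_size=5),
    transforms.ColorJitter(brightness=0.3),
    transforms.RandomRotation(15),
])

# Dense 3-layer classifier of Sec. 2.3.2 (subtask 2: 7 target classes).
model = nn.Sequential(
    nn.Linear(11414, 1000), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(1000, 200), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(200, 7), nn.Sigmoid(),
)

criterion = nn.BCELoss()                             # binary cross-entropy; the model ends in a sigmoid
optimizer = optim.Adam(model.parameters(), lr=3e-6)  # 1e-5 / 3e-6 / 5e-6 for subtasks 1 / 2 / 3
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=70, gamma=0.5)   # halve every 70 epochs

# Dummy stand-in for the real loader of concatenated feature vectors (batch size 64).
train_loader = [(torch.randn(64, 11414), torch.randint(0, 2, (64, 7)).float())]

for epoch in range(300):                             # 300 epochs (200 for subtask 3)
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```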
ACKNOWLEDGMENTS
This work was supported by the EU Horizon 2020 programme under grant agreement 832921 (MIRROR).

REFERENCES
[1] S. Abu-El-Haija, N. Kothari, J. Lee, A. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv:1609.08675. https://arxiv.org/pdf/1609.08675v1.pdf
[2] V. Campos, B. Jou, and X. Giró-i-Nieto. 2017. From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction. Image and Vision Computing 65 (2017), 15–22.
[3] D. Galanopoulos and V. Mezaris. 2020. Attention Mechanisms, Signal Encodings and Fusion Strategies for Improved Ad-hoc Video Search with Dual Encoding Networks. In Proc. of the ACM Int. Conf. on Multimedia Retrieval (ICMR '20). ACM.
[4] N. Gkalelis, A. Goulas, D. Galanopoulos, and V. Mezaris. 2021. ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-Up Recognition and Explanation of Events in Video. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 3375–3383.
[5] S. Z. Hassan, K. Ahmad, M. Riegler, S. Hicks, N. Conci, P. Halvorsen, and A. Al-Fuqaha. 2021. Visual Sentiment Analysis: A Natural Disaster Use-case Task at MediaEval 2021. In Proc. of the MediaEval 2021 Workshop, Online.
[6] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
[7] J. Islam and Y. Zhang. 2016. Visual sentiment analysis for social images using transfer learning approach. In IEEE Int. Conf. on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom). IEEE, 124–130.
[8] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3, 1 (1991), 79–87. https://doi.org/10.1162/neco.1991.3.1.79
[9] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. 2017. Dense-Captioning Events in Videos. In Int. Conf. on Computer Vision (ICCV).
[10] Y. Li, Y. Song, et al. 2016. TGIF: A new dataset and benchmark on animated GIF description. In Proc. of IEEE CVPR. 4641–4650.
[11] A. Pournaras, N. Gkalelis, D. Galanopoulos, and V. Mezaris. 2021. Exploiting Out-of-Domain Datasets and Visual Representations for Image Sentiment Classification. In 2021 16th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP). 1–6. https://doi.org/10.1109/SMAP53521.2021.9610801
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, et al. 2015. ImageNet large scale visual recognition challenge. Int. Journal of Computer Vision 115, 3 (2015), 211–252.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1–9. https://doi.org/10.1109/CVPR.2015.7298594
[14] M. Tan and Q. Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Int. Conf. on Machine Learning. 6105–6114.
[15] X. Wang, J. Wu, et al. 2019. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proc. of the IEEE Int. Conf. on Computer Vision. 4581–4591.
[16] J. Xu, T. Mei, T. Yao, and Y. Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proc. of IEEE CVPR. 5288–5296.
[17] S. Zohaib, K. Ahmad, N. Conci, and A. Al-Fuqaha. 2019. Sentiment Analysis from Images of Natural Disasters. arXiv:1910.04416.