Leveraging CLIP for Image Emotion Recognition

Alessandro Bondielli1 and Lucia C. Passaro2

1 Department of Information Engineering, University of Pisa, Pisa, Italy
alessandro.bondielli@ing.unipi.it
2 Department of Computer Science, University of Pisa, Pisa, Italy
lucia.passaro@unipi.it

Abstract. Multi-modal neural models that can encode and process both visual and textual data have become increasingly common in the last few years. Such models enable new ways to learn the interaction between vision and text, and can thus be successfully applied to tasks of varying complexity in the domain of image and text classification. However, such models are traditionally oriented towards learning grounded properties of images and of the objects they depict, and are less suited to tasks involving subjective characteristics, such as the emotions images can evoke in viewers. In this paper, we provide some insights into the performance of the recently released OpenAI CLIP model on an emotion classification task. We compare the performance of CLIP, both in a zero-shot and in a fine-tuning setting, on (i) a standard benchmark dataset for object recognition and (ii) an image-emotion dataset. Moreover, we evaluate to what extent a CLIP model adapted to emotions is able to retain general knowledge and generalization capabilities.

Keywords: Affect · Emotion Classification · Computer Vision · Natural Language Processing · CLIP

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The ever-increasing production and spread of multi-modal content over the internet requires new analytical tools to deal with it. Although many issues related to the multi-modal analysis of text and images have already been addressed in the literature, it is still unclear whether and to what extent state-of-the-art multi-modal systems can be exploited to explore the affective characteristics of visual content.

Several multi-modal resources, systems and architectures have been proposed in the literature to approach a wide range of natively multi-modal tasks, such as Image Captioning [5, 12], Visual Question Answering [20, 21] and Image Generation [18]. However, the traditional literature in the field of Computer Vision, and specifically of Image Classification, typically focuses on the recognition of objects and concrete entities depicted in images. In this context, several large-scale resources useful to train neural models have been released [8, 10, 13]. On these benchmarks, the literature offers plenty of systems that have proven effective in solving tasks of various levels of complexity [2, 7, 19, 20, 16]. On the contrary, the field of Natural Language Processing has addressed problems related to the affective properties of texts for many years. The literature is filled with approaches dealing with sentiment, opinion and affect. For example, several studies have analyzed the sentiment and the emotions expressed and evoked by texts from several perspectives [4, 6, 14, 15].
The sentiment encoded in images has attracted a lot of interest due to its various applications [3, 9], ranging from human-robot interaction to social media analysis, but the results are not on par with either systems working only on text or computer vision systems focused on concrete aspects of visual content. This may be due to the fact that images convey rich semantic properties and can induce, as textual inputs can and possibly even more strongly, emotional reactions in the people who are exposed to them. Thus, it is important to develop new benchmarks to assess the ability of systems to classify images from an affective point of view. This aspect is also very relevant in the field of Industry 4.0. Companies are in fact expected to constantly communicate with their customers using new and effective forms of communication, such as visual ones. On the one hand, it is important to study the emotional content conveyed by an image. On the other hand, especially for web marketing purposes, it is crucial to analyse the emotions "elicited" by images in viewers.

To the best of our knowledge, a fully multi-modal dataset that includes real-world image samples addressing this issue is still lacking. However, a large-scale visual dataset labelled for the emotions evoked by images has been proposed in [23]. This dataset is suitable to challenge state-of-the-art multi-modal neural models in predicting subjective, abstract labels, such as emotions, for a given image. Thus, the analysis of the performances on this dataset may be seen as an early attempt to exploit pre-trained multi-modal systems to bridge the gap between computer vision and affect.

To study aspects related to the emotions evoked by images, we decided to base our experiments on the recently released and well-known OpenAI CLIP model [17], a multi-modal neural network trained on text-image pairs. CLIP adopts an architecture that includes an image encoder and a text encoder. The peculiarity of CLIP resides in its contrastive training strategy. CLIP is trained on a dataset of 400 million image-caption pairs collected from the web. Its training objective is to predict, given an image, which of the captions in a large training batch was actually paired with it. The goal of this pre-training is to provide the network with a wide array of visual concepts found in images and enable it to learn how to identify proper associations between visual content and its textual descriptions [17].

In this context, the motivation of our choice to adopt CLIP is twofold. On the one hand, the model has been trained to efficiently learn visual concepts by exploiting natural language supervision. We can argue that it may directly encode latent emotive concepts. On the other hand, the CLIP authors claim that it can be applied to nearly arbitrary visual classification tasks [17] under zero-shot settings. Moreover, from an implementation perspective, CLIP and CLIP-like models have a very interesting property that stems from their training approach: representations of images and texts (e.g. captions) can be easily compared in terms of the cosine similarity between their vectors. For example, classification on a 10-class dataset can be performed with CLIP by simply encoding the labels in the form of captions, and then identifying, for each image, the closest caption (i.e. label) in terms of cosine similarity. Representations of images can also be stored in memory and queried at inference time for their similarity with either another image or a piece of text, thus drastically reducing the computational cost of inference.
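As an illustration of this mechanism, the following minimal sketch performs zero-shot classification with the openai/clip library and the publicly released ViT-B/32 weights; the image path and the three candidate captions are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained ViT-B/32 CLIP model and its image preprocessing pipeline.
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate labels expressed as captions (illustrative 3-class example).
captions = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
text_tokens = clip.tokenize(captions).to(device)

# Encode one image; "example.jpg" is a placeholder path.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

# L2-normalize so that the dot product equals the cosine similarity.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

similarity = image_features @ text_features.T     # shape: [1, n_captions]
predicted_caption = captions[similarity.argmax(dim=-1).item()]
print(predicted_caption)
```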
We are conscious that analyzing the affect elicited by images is a very challenging task, because people with diverse social and cultural backgrounds may have different emotional reactions to the same image [22]. Moreover, we know that labelled datasets addressing this issue are scarce. However, for our preliminary studies we consider the Image-Emotion dataset [23] as suitable for drawing first insights into the emotion classification of images.

In this work, we propose to exploit and analyze the performances of CLIP for the task of image emotion recognition. CLIP can be leveraged either as a pre-trained model for zero-shot classification, as intended by the authors [17], or by further fine-tuning it on specific downstream tasks. Our goal is to explore how CLIP models perform on highly subjective tasks out-of-the-box and how they can be adapted to them via fine-tuning. Moreover, as the task of image emotion recognition is rather challenging, we aim to compare it also with a more standard classification task on a computer vision benchmark, namely the popular CIFAR100 dataset [11].

The contributions of this paper are the following:

– We evaluate the zero-shot performances of CLIP on two different benchmark datasets, namely (i) a dataset for image emotion recognition and (ii) a dataset for a more standard image classification problem;
– We evaluate CLIP in a fine-tuning setting on two different tasks, namely (i) image emotion recognition and (ii) image classification, and compare the obtained results;
– We evaluate to what extent CLIP is able to retain general knowledge and generalization capabilities on other tasks after being fine-tuned.

The rest of this paper is organized as follows. Section 2 thoroughly describes the performed experiments. In Section 3 the results of the experiments are presented and discussed in order to shed some light on the capabilities of CLIP for image emotion recognition. Finally, Section 4 draws some conclusions and discusses future work.

2 Experiments

In order to provide some insights into the capabilities of CLIP for emotion recognition, we perform several different experiments. The experiments are devised to fulfil two goals: first, we want to assess the performances of CLIP under zero-shot settings; second, we want to evaluate the impact of fine-tuning on the performances of the model, both for the specific task and for its generalization capabilities. In addition, we try to address the differences between the performances of the CLIP model on more abstract and more concrete tasks across all of the performed experiments.

For the image emotion recognition task we employ the dataset described in [23], which we refer to as the Image-Emotion dataset. The dataset includes 23,308 images, each labelled with an emotion among Amusement, Anger, Awe, Contentment, Disgust, Excitement, Fear, and Sadness. The images were collected and weakly labelled by searching for the emotion keywords on Instagram and Flickr. The weak labelling was then verified with a crowdsourcing experiment. Concerning the more concrete image classification task, we employ the popular CIFAR100 dataset [11]. It includes 60,000 images labelled with one of 100 object classes, such as Dolphin, Road, and Boy.
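Both datasets can be loaded in a few lines of code. The sketch below uses the standard torchvision loader for CIFAR100, while the Image-Emotion dataset is assumed here, purely for illustration, to be distributed as a simple index file of image paths and emotion labels (its actual format may differ).

```python
import pandas as pd
from torchvision.datasets import CIFAR100

# CIFAR100: 60,000 images and 100 object classes; torchvision downloads it automatically.
cifar_train = CIFAR100(root="./data", train=True, download=True)
cifar_test = CIFAR100(root="./data", train=False, download=True)
cifar_labels = cifar_train.classes        # 100 class names, e.g. "boy", "dolphin", "road"

# Image-Emotion dataset [23]: hypothetical index file with one image path and
# one of the eight emotion labels per row.
emotion_labels = ["amusement", "anger", "awe", "contentment",
                  "disgust", "excitement", "fear", "sadness"]
image_emotion = pd.read_csv("image_emotion.csv")   # hypothetical columns: "path", "emotion"
```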
Our experiments are organized as follows. First, we perform zero-shot classification on the two datasets using the pre-trained ViT-B/32 CLIP model. Second, we fine-tune the CLIP model on the two datasets and evaluate its performances in a cross-validation experiment. Third, we again perform zero-shot classification on each of the two datasets, this time using the model fine-tuned on the other one: the model fine-tuned on the Image-Emotion dataset is applied to CIFAR100, and vice versa. This aims to understand how fine-tuning affects zero-shot performances on other tasks. All the experiments are performed by exploiting the CLIP python library3 and the official pre-trained models. In the following, we thoroughly describe the experiments and show the obtained results.

3 https://github.com/openai/CLIP

2.1 Zero-shot classification

In the first set of experiments, we simply employ a pre-trained CLIP model to classify images under zero-shot settings. Following the original CLIP paper [17], we perform classification by means of cosine similarity between image representations and captions. Notably, since both of the employed datasets provide labels rather than captions, we first generate a caption for each label. For CIFAR100, the employed caption is "a photo of a <label>", where <label> is one of the 100 labels of the dataset. For the Image-Emotion dataset, the caption is "an image that evokes the emotion of <emotion>", where <emotion> is one of the eight emotion labels. We use a different wording for the two datasets (i.e., photo and image) because all the data in CIFAR100 consists of photos, while the Image-Emotion dataset also includes more abstract images.

For both experiments, we encode all the images and all the captions with the CLIP model. Specifically, we use the ViT-B/32 pre-trained model. Then, we compute the cosine similarity between the representations of each image and each caption. To obtain the final label, we simply assign to each image the caption (label) with the highest cosine similarity.
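A minimal sketch of the caption construction and encoding described above, assuming the openai/clip library and the `cifar_labels` list from the loading sketch above. Since the captions are fixed for a given dataset, their embeddings can be computed once and reused for every image.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

emotion_labels = ["amusement", "anger", "awe", "contentment",
                  "disgust", "excitement", "fear", "sadness"]

# Caption templates described in Section 2.1.
emotion_captions = [f"an image that evokes the emotion of {e}" for e in emotion_labels]
# cifar_captions = [f"a photo of a {label}" for label in cifar_labels]

# Encode the captions once; the same embeddings are reused for the whole dataset.
with torch.no_grad():
    caption_features = model.encode_text(clip.tokenize(emotion_captions).to(device))
caption_features = caption_features / caption_features.norm(dim=-1, keepdim=True)

def predict(image_features: torch.Tensor) -> torch.Tensor:
    """Index of the most similar caption for each L2-normalized image embedding."""
    return (image_features @ caption_features.T).argmax(dim=-1)
```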
2.2 Fine-Tuning CLIP

For the second set of experiments, our goal is to evaluate how much improvement can be obtained on downstream tasks by fine-tuning a base CLIP model. More specifically, we focus on the two downstream tasks of (i) image emotion classification on the Image-Emotion dataset and (ii) image classification on the CIFAR100 benchmark.

In order to obtain reliable and comparable estimates of the performances, we use 10-fold cross validation during training. However, since the two datasets differ in terms of size, number of classes, and class distribution, we also perform some hyperparameter tuning to obtain the best possible results on both datasets. For the sake of brevity, we leave out the details of parameter tuning. In this regard, however, it is very interesting to notice how the process of fine-tuning CLIP is extremely sensitive to different hyperparameters. For example, a slight change in learning rate or number of training epochs may lead to a decrease in performance of up to 0.20 in weighted and macro average F1-Score.

First, we experiment with the Image-Emotion dataset. We refer to the resulting model as Emotion-CLIP. As previously mentioned, we perform 10-fold cross validation on the whole dataset. Each fold is composed of 20,000 training examples and 3,500 test examples. The model is evaluated by predicting the most likely label for each image by means of cosine similarity with respect to the generated captions, as in the zero-shot setting described in Section 2.1. As for the hyperparameters, we train each fold for 3 epochs with a batch size of 256. We use the Adam optimizer with a learning rate of 2e-5 and a weight decay of 0.2. Training each epoch took roughly 3 minutes on an Nvidia Titan RTX GPU. To obtain the final classification results, we average performances over the folds.

In order to further evaluate how the process of fine-tuning can be helpful also for zero-shot capable models, we exploit the simpler and more grounded task of image classification on the CIFAR100 dataset. We refer to the resulting model as CIFAR100-CLIP. As for the previous experiment, we perform 10-fold cross validation on the entire dataset (i.e. the concatenation of the train and test sets), with the predictions obtained by means of cosine similarity between images and captions. Each fold is composed of 54,000 training samples and 6,000 test samples. Note that the class distribution over the whole dataset is perfectly balanced (i.e. each label is associated with exactly 600 images). After tuning the parameters, we chose to train the model on each fold for 1 epoch with a batch size of 256. The same learning rate and optimizer used for Emotion-CLIP are employed in this case as well.
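The training loop itself is not detailed in the text; the following is a minimal sketch of one plausible realization of the fine-tuning described in Section 2.2, using the stated hyperparameters and assuming a CLIP-style symmetric contrastive loss over image-caption pairs. The `train_loader` is a placeholder for a dataloader yielding preprocessed image tensors together with their generated captions.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model.float()  # the GPU checkpoint is stored in fp16; plain fp32 is simpler for fine-tuning

# Hyperparameters from Section 2.2 (Emotion-CLIP: 3 epochs, batch size 256).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=0.2)

for epoch in range(3):
    for images, captions in train_loader:           # placeholder dataloader
        images = images.to(device)                   # preprocessed image tensors
        texts = clip.tokenize(captions).to(device)   # e.g. "an image that evokes the emotion of fear"

        # CLIP's forward pass returns the scaled image-text similarity matrices.
        logits_per_image, logits_per_text = model(images, texts)
        targets = torch.arange(len(images), device=device)

        # Symmetric contrastive loss: each image should match its own caption and vice versa.
        loss = (F.cross_entropy(logits_per_image, targets) +
                F.cross_entropy(logits_per_text, targets)) / 2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```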
2.3 Evaluation of fine-tuning on generalization capabilities of CLIP

While fine-tuning is a viable strategy for applying CLIP to downstream classification tasks, the original goal of CLIP is to take advantage of the interaction between natural language and images to perform image classification without the need for direct optimization on the dataset at hand [17]. With this last set of experiments, our goal is twofold. On the one hand, we want to directly assess how, and how much, fine-tuning on a benchmark task actually affects the zero-shot capabilities of CLIP. On the other hand, the experiments also serve to assess the extent to which a specific kind of benchmark data may affect zero-shot performances. In the original paper, the authors clearly state that while zero-shot performances on simpler image classification tasks are very promising, the model encounters more difficulties when the task becomes more complex (e.g. counting specific objects in the image) or more abstract. In this context, we want to shed some light on how fine-tuning on a more challenging task such as emotion recognition affects performances on simpler tasks, and vice versa.

In order to pursue this goal, we propose the following experiments. We first fine-tune the Emotion-CLIP model on the whole Image-Emotion dataset and test it under zero-shot settings on the CIFAR100 dataset for image classification. Then, we do the opposite, i.e. we train CIFAR100-CLIP on the CIFAR100 dataset and test it for image emotion recognition on the Image-Emotion dataset. Both Emotion-CLIP and CIFAR100-CLIP are trained on their respective datasets with the same parameters employed for the cross-validation experiments described in Section 2.2. The only difference is that, in this case, the model is trained on the whole dataset. As for testing, the models are deployed in a zero-shot setting, and labels for both the CIFAR100 and Image-Emotion datasets are obtained by means of cosine similarity between images and generated captions.

3 Results and Discussion

In this section, we provide the results obtained for each of the performed experiments and discuss them to shed some light on the performances of CLIP with the different settings and datasets.

3.1 Zero-shot classification

First, we evaluate the performances of the CLIP model under zero-shot settings, both on the Image-Emotion dataset and on the CIFAR100 benchmark. As described in Section 2.1, for both experiments the original CLIP ViT-B/32 pre-trained model was used to compute the cosine similarity between the generated captions and the images. For the Image-Emotion dataset, we used captions of the form "an image that evokes the emotion of <emotion>", where <emotion> stands for one of the eight emotion classes in the dataset. For the CIFAR100 benchmark, the captions were of the form "a photo of a <label>", where <label> is one of the 100 labels in CIFAR100.

Table 1: Experiments under zero-shot settings.

                             CIFAR100   Image-Emotion
Accuracy                       0.62         0.49
Precision   Macro Avg.         0.69         0.46
            Weighted Avg.      0.69         0.52
Recall      Macro Avg.         0.62         0.44
            Weighted Avg.      0.62         0.49
F1-Score    Macro Avg.         0.61         0.42
            Weighted Avg.      0.61         0.48

Results for the two datasets are shown in Table 1. We report accuracy, as well as weighted-average and macro-average precision, recall, and F1-Score for each dataset. We can see that, as expected, despite the much higher number of classes in CIFAR100, the CLIP model under zero-shot settings is better at predicting its labels than at predicting the emotion elicited by the images in the Image-Emotion dataset. We can argue that this is because the training data of CLIP is much more akin to the CIFAR100 classification task. However, it is interesting to notice that the baseline model is nevertheless fairly able to face a more complex and more abstract task such as emotion recognition out-of-the-box. For the sake of completeness, we also report class-level performances on the Image-Emotion dataset in Table 2. We notice a high variance in performances among classes, which is however not directly related to the sample size of each class. In fact, it seems that some emotions, such as Disgust and Sadness, are harder for the base CLIP model to capture.

Table 2: Class-level zero-shot classification results of the base CLIP model on the Image-Emotion dataset.

               Precision   Recall   F1-Score
Amusement        0.80       0.45      0.58
Anger            0.46       0.37      0.41
Awe              0.38       0.75      0.50
Contentment      0.61       0.72      0.66
Disgust          0.36       0.10      0.16
Excitement       0.45       0.43      0.44
Fear             0.30       0.49      0.37
Sadness          0.30       0.20      0.24
Macro Avg.       0.46       0.44      0.42
Weighted Avg.    0.52       0.49      0.48
Accuracy                              0.49

3.2 Fine-Tuning CLIP

In the second set of experiments, we evaluated the performances obtained by fine-tuning the CLIP model for specific downstream tasks on the Image-Emotion dataset and on the CIFAR100 benchmark. The implementation details for the experiments are described in Section 2.2.

Table 3 reports the results of the Emotion-CLIP model. For completeness, we also report the performances for each class. It is interesting to notice how performances drastically improve by leveraging a fine-tuned model trained on images and short captions that describe and mention the emotion that is likely to be elicited when viewing the image.

Table 3: 10-fold cross validation results for Emotion-CLIP on the Image-Emotion dataset.

               Precision   Recall   F1-Score
Amusement        0.83       0.79      0.80
Anger            0.49       0.53      0.50
Awe              0.66       0.73      0.69
Contentment      0.80       0.64      0.70
Disgust          0.70       0.71      0.70
Excitement       0.68       0.62      0.65
Fear             0.37       0.55      0.44
Sadness          0.37       0.55      0.44
Macro Avg.       0.64       0.65      0.64
Weighted Avg.    0.70       0.67      0.68
Accuracy                              0.67

Interestingly, the model and the fine-tuning process are also rather sensitive to the input captions that describe the labels. During the experiments we noticed that captions using more complex words, such as "an image that elicit <emotion>", or captions that are more direct in describing the image (e.g. "this image is about <emotion>"), are consistently outperformed by models trained on a simpler yet specific and clear caption. While the differences in performance are in the order of a few percentage points, i.e. 3-5%, this is nonetheless an interesting issue that could be explored further and more in-depth.

Another interesting aspect to take into account is that performances vary rather widely across the different emotions. This may be due to the fact that describing (and thus recognizing) images eliciting certain emotions, such as Fear and Sadness, may be harder than for emotions such as Amusement and Disgust, which may have more prominent visual features in the images. In addition, the size of the dataset and the distribution of the labels must be taken into account as well. Interestingly, Disgust was the class with the worst performances in the zero-shot setting. Thus, in this case, it appears that fine-tuning was rather helpful in pinpointing the visual features of this emotion.

Figure 1 shows some examples that highlight the differences between the zero-shot and the fine-tuned model. Specifically, for each caption (i.e. emotion) we show the top-8 images associated with that caption in the dataset, extracted using zero-shot CLIP (top) and Emotion-CLIP (bottom). From the images, it is first and foremost clear that fine-tuning is very effective in learning better representations for the captions, which are thus closer to images that actually represent the emotional content. Second, it is also interesting to notice that while the performances for classes such as Fear and Sadness are sub-par with respect to the other emotions, the top-8 images actually represent them quite well. This may serve as an indication that fine-tuned CLIP models may also be extremely helpful for retrieval purposes.

Fig. 1: Top-8 images for each emotion (by cosine similarity with the caption) with zero-shot CLIP (top) and Emotion-CLIP (bottom). Panels: (a) Amusement, (b) Anger, (c) Awe, (d) Contentment, (e) Disgust, (f) Excitement, (g) Fear, (h) Sadness.

Table 4 reports instead the results of the CIFAR100-CLIP model. In this case, due to space constraints, we report only the overall average performances of the model.

Table 4: 10-fold cross validation results for CIFAR100-CLIP on the CIFAR100 dataset.

               Precision   Recall   F1-Score
Macro Avg.       0.82       0.81      0.81
Weighted Avg.    0.82       0.82      0.81
Accuracy                              0.81

It is clear from the results that, even after only 1 epoch of fine-tuning, the model comes much closer to solving the CIFAR100 dataset than the baseline CLIP model, with performances above 0.80 on all the considered metrics.
It is also very interesting to notice that if we compare the results of Emotion-CLIP with those of CIFAR100-CLIP, the differences in performances before and after fine-tuning are similar for both experiments, with an improvement of around 0.20 across all metrics. This is interesting considering that the original model is much better suited to image classification tasks similar to the one of CIFAR100. We could speculate that, given a zero-shot capable model such as CLIP, the improvements in performance on downstream tasks and benchmark data may be limited by the architecture of the model itself.

3.3 Evaluation of fine-tuning on generalization capabilities of CLIP

In the final experiments, we evaluated the zero-shot capabilities of CLIP after fine-tuning on a different dataset, i.e. the extent to which fine-tuning on specific data may affect the zero-shot performances on a different dataset. Recall that, in order to do so, we first trained Emotion-CLIP and CIFAR100-CLIP on their respective datasets, with the same settings described in Section 2.2. Then, we exploited the fine-tuned models to perform classification on the other considered dataset. The details of the experiments are described in Section 2.3. Results of the experiments are shown in Table 5.

Table 5: Results of applying the fine-tuned models to a different dataset under zero-shot settings.

                                  Precision          Recall           F1-Score
Model           Test Data      Macro  Weighted   Macro  Weighted   Macro  Weighted   Accuracy
Emotion-CLIP    CIFAR100        0.57    0.57      0.41    0.41      0.40    0.40       0.41
CIFAR100-CLIP   Image-Emotion   0.43    0.50      0.30    0.39      0.30    0.34       0.39

If we analyze the results of leveraging fine-tuned CLIP for different tasks, we can identify an interesting trend. We saw in Section 3.2 that fine-tuning for a specific task is effective in improving performances. In this case, both fine-tuned models perform worse than the pre-trained ViT-B/32 CLIP model on a task they are not fine-tuned on. This is clearly expected, as the models' weights are shifted towards the end goal of the downstream tasks. However, it is interesting to notice that both experiments show a rather similar degradation of the performances. In fact, both models lose between 0.15 and 0.20 of F1-Score when tested on a different benchmark. This is interesting if we consider the nature of the training set of CLIP and its performances on simpler tasks with respect to more complex and/or abstract ones. The CIFAR100 dataset is definitely more akin to the original training set than the Image-Emotion dataset, thus the resulting model should be more similar to the original one in terms of weights, i.e. it has to learn less about the classes. On the other hand, addressing image emotion classification starting from a pre-trained model requires a deeper adaptation of the model. This is also suggested by the fact that CIFAR100-CLIP needed only one training epoch to learn the dataset, while Emotion-CLIP needed three. However, the relative closeness between CIFAR100-CLIP and the original CLIP model does not prevent the performance degradation under zero-shot settings on image emotion classification. Notably, such a degradation is similar to the one observed when performing zero-shot classification on the CIFAR100 dataset starting from a model specialized in detecting emotions.
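As an illustration of this evaluation protocol, the sketch below loads a fine-tuned checkpoint and applies it under zero-shot settings to the other dataset, computing the macro- and weighted-averaged metrics reported in the tables. The checkpoint path, `cifar_labels`, and `test_loader` are placeholders, and the checkpoint is assumed to have been saved with `torch.save(model.state_dict(), ...)`.

```python
import torch
import clip
from sklearn.metrics import classification_report

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

# Load the fine-tuned weights (hypothetical Emotion-CLIP checkpoint).
model.load_state_dict(torch.load("emotion_clip.pt", map_location=device))
model.eval()

# Captions for the *other* dataset, e.g. CIFAR100 when evaluating Emotion-CLIP.
captions = [f"a photo of a {label}" for label in cifar_labels]   # cifar_labels: placeholder list
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(captions).to(device))
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

y_true, y_pred = [], []
with torch.no_grad():
    for images, labels in test_loader:        # placeholder dataloader of preprocessed images
        image_features = model.encode_image(images.to(device))
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.T).argmax(dim=-1)
        y_pred.extend(preds.cpu().tolist())
        y_true.extend(labels.tolist())

# Per-class report plus macro- and weighted-averaged precision, recall, and F1.
print(classification_report(y_true, y_pred, digits=2))
```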
4 Conclusions and Future Works

In this paper, we have provided an evaluation of CLIP for the detection of emotions elicited by images. We experimented with the model both under zero-shot settings and by leveraging a fine-tuning strategy, and evaluated the advantages and drawbacks of both, also in comparison with a more straightforward computer vision task. Exploiting CLIP as a zero-shot classifier provides good and rather inexpensive out-of-the-box performances on image classification, while for image emotion recognition the obtained results still show a wide margin for improvement. By leveraging fine-tuning, we saw a significant improvement, similar in both considered tasks, but at the cost of generalization: a model fine-tuned on a specific downstream task performs worse than the base CLIP model on a benchmark it was not trained on. The obtained results provide an early insight into exploiting state-of-the-art multi-modal models to characterize the emotions elicited by images, and thus into more abstract and subjective tasks.

In the future, we plan to extend this line of research by leveraging diverse models and datasets. To this end, we plan to create a new dataset in which the emotive labels associated with images are complemented with textual information describing the rationale behind the labelling, following the annotation schema adopted in the ArtEmis [1] dataset, which is focused on art. Moreover, we plan to frame the emotion recognition task as a multi-label problem, in order to better learn how emotional texts can be associated with images and vice versa. Finally, we plan to perform a more in-depth and systematic study of the impact of the generated captions on the final model quality.

References

1. Achlioptas, P., Ovsjanikov, M., Haydarov, K., Elhoseiny, M., Guibas, L.: ArtEmis: Affective language for visual art. CoRR abs/2101.07396 (2021)
2. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6077–6086 (2018)
3. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.F.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM International Conference on Multimedia. pp. 223–232 (2013)
4. Chatterjee, A., Narahari, K.N., Joshi, M., Agrawal, P.: SemEval-2019 task 3: EmoContext contextual emotion detection in text. In: Proceedings of the 13th International Workshop on Semantic Evaluation. pp. 39–48. Association for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019)
5. Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10578–10587 (2020)
6. Cortis, K., Freitas, A., Daudert, T., Huerlimann, M., Zarrouk, M., Handschuh, S., Davis, B.: SemEval-2017 task 5: Fine-grained sentiment analysis on financial microblogs and news. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 519–535. Association for Computational Linguistics, Vancouver, Canada (Aug 2017)
7. Hossain, M.Z., Sohel, F., Shiratuddin, M.F., Laga, H.: A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR) 51(6), 1–36 (2019)
8. Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6700–6709 (2019)
9. Jia, J., Wu, S., Wang, X., Hu, P., Cai, L., Tang, J.: Can we understand van Gogh's mood? Learning to infer affects from images in social networks. In: Proceedings of the 20th ACM International Conference on Multimedia. pp. 857–860 (2012)
10. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1), 32–73 (2017)
11. Krizhevsky, A.: Learning multiple layers of features from tiny images. pp. 32–33 (2009)
12. Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: European Conference on Computer Vision. pp. 121–137. Springer (2020)
13. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
14. Mohammad, S., Bravo-Marquez, F., Salameh, M., Kiritchenko, S.: SemEval-2018 task 1: Affect in tweets. In: Proceedings of the 12th International Workshop on Semantic Evaluation. pp. 1–17. Association for Computational Linguistics, New Orleans, Louisiana (Jun 2018)
15. Passaro, L.C., Lenci, A.: Evaluating context selection strategies to build emotive vector space models. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia (May 2016)
16. Passaro, L.C., Lenci, A.: Less is more: a multimodal system for tag refinement. In: Proceedings of the 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020). pp. 44–58 (2020)
17. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)
18. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8821–8831. PMLR (18–24 Jul 2021)
19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
20. Tan, H., Bansal, M.: LXMERT: Learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 5100–5111. Association for Computational Linguistics, Hong Kong, China (Nov 2019)
21. Teney, D., Anderson, P., He, X., Van Den Hengel, A.: Tips and tricks for visual question answering: Learnings from the 2017 challenge. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4223–4232 (2018)
22. Yang, J., She, D., Sun, M.: Joint image emotion classification and distribution learning via deep convolutional neural network. In: IJCAI. pp. 3266–3272 (2017)
23. You, Q., Luo, J., Jin, H., Yang, J.: Building a large scale dataset for image emotion recognition: The fine print and the benchmark. Proceedings of the AAAI Conference on Artificial Intelligence 30(1) (Feb 2016)