=Paper=
{{Paper
|id=Vol-3199/paper12
|storemode=property
|title=HCILab at Memotion 2.0 2022: Analysis of Sentiment, Emotion and Intensity of Emotion Classes from Meme Images using Single and Multi Modalities (short paper)
|pdfUrl=https://ceur-ws.org/Vol-3199/paper12.pdf
|volume=Vol-3199
|authors=Thanh Tin Nguyen,Nhat Truong Pham,Ngoc Duy Nguyen,Hai Nguyen,Long H. Nguyen,Yong-Guk Kim
|dblpUrl=https://dblp.org/rec/conf/aaai/NguyenPNNNK22
}}
==HCILab at Memotion 2.0 2022: Analysis of Sentiment, Emotion and Intensity of Emotion Classes from Meme Images using Single and Multi Modalities (short paper)==
HCILab at Memotion 2.0 2022: Analysis of Sentiment, Emotion and Intensity of Emotion Classes from Meme Images using Single and Multi Modalities

Thanh Tin Nguyen1, Nhat Truong Pham2,3, Ngoc Duy Nguyen4, Hai Nguyen5, Long H. Nguyen6 and Yong-Guk Kim1 (Corresponding author)

1 Human Computer Interaction Lab, Department of Computer Engineering, Sejong University, Seoul, Korea
2 Division of Computational Mechatronics, Institute for Computational Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam
3 Faculty of Electrical and Electronics Engineering, Ton Duc Thang University, Ho Chi Minh City, Vietnam
4 Institute for Intelligent Systems Research and Innovation, Deakin University, Victoria, Australia
5 Khoury College of Computer Sciences, Northeastern University, Boston, USA
6 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam

De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2022, Vancouver, Canada.
nttin@sju.ac.kr (T. T. Nguyen); phamnhattruong.st@tdtu.edu.vn (N. T. Pham); n.nguyen@deakin.edu.au (N. D. Nguyen); hainguyen@ccs.neu.edu (H. Nguyen); hoanglong.fruitai@gmail.com (L. H. Nguyen); ykim@sejong.ac.kr (Y. Kim)
ORCID: 0000-0002-6798-9808 (T. T. Nguyen); 0000-0002-8086-6722 (N. T. Pham); 0000-0002-4052-5819 (N. D. Nguyen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Nowadays, memes found on the internet are overwhelming in number. Although most are innocuous and sometimes entertaining, some memes carry sarcastic, offensive, or motivational content. In this study, several approaches are proposed to address the multi-modality problem in analysing the given meme dataset. The class-imbalance issue is addressed by using an Auto Augmentation method, and the weak correlation between modalities is mitigated by adopting deep Canonical Correlation Analysis to find the most correlated projections of the visual and textual feature embeddings. In addition, both a stacked attention network and a multi-hop attention network are employed to efficiently generate aggregated features. As a result, our team, HCILab, achieved weighted F1 scores of 0.4995 for sentiment analysis, 0.7414 for emotion classification, and 0.5301 for scale/intensity of emotion classes on the leaderboard. These results were obtained by concatenating the image and text models, and our code can be found at https://git.io/JMRa8.

Keywords: Meme analysis, attention models, correlation analysis, emotion classes, multimodality, vision and language

1. Introduction

The task of analyzing sentiments, emotions, and their intensity has attracted a great deal of attention in the research community, especially because it can help to prevent unnecessary harm. As the internet has spread worldwide, false information, hatred, and offensive language are also increasing tremendously. A common way to disseminate such content is through text embedded in meme images, which malicious actors can exploit as a means to incite arguments, disputes, and social conflict. To mitigate the harmful effects of toxic memes, machine learning [1] and deep learning [2, 3] methods are commonly employed. These techniques can detect and classify memes effectively, although they require humans to label the data.
Nevertheless, the results are promising, and such algorithms can be integrated into social media platforms such as Facebook or Twitter to automatically detect and remove these memes. Following the success of the SemEval 2020 challenge [4], the organizers of the Memotion 2 challenge provide a new dataset [5, 6] consisting of memes and their corresponding texts. The task includes three subtasks: (Subtask A) sentiment analysis, which is to classify negative, neutral, and positive content; (Subtask B) emotion classification, which is to classify the emotions of memes into four main categories, namely funny, sarcastic, offensive, and motivational; and (Subtask C) scale/intensity of emotion classes, which is to determine the intensity of each emotion, where the funny, sarcastic, and offensive emotions have four levels while the motivational emotion has only two. The weighted F1 score is used to evaluate each subtask, and the final score is the average of the three subscores. In addition, the task poses two important issues. Firstly, the data is imbalanced among the different classes. Secondly, images and their corresponding texts are not well correlated, because the text and the image of a meme often point to different meanings, so an effective fusion technique is needed to reduce the semantic gap between the two modalities. To address these problems, our team proposed several models and achieved good results on the private leaderboard.

The remainder of this paper is organized as follows. Section 2 briefly reviews the related literature. Section 3 describes our methodology, including the unimodal and bimodal models as well as the auxiliary techniques. Section 4 presents the dataset and experimental setup, and the results are summarized in Section 5. Finally, we conclude our study and outline future research in Section 6.

2. Background

In the previous competition, different approaches were developed to tackle the problem. For instance, in [1, 2, 3], the authors introduced machine learning and deep learning models such as Naive Bayes [7], BERT [8], the Multimodal Transformer [9], and ResNet [10]. These approaches are mainly divided into two types: unimodal and bimodal. A unimodal model uses only one modality as input, which can be either text or images. A bimodal model adopts fusion techniques to aggregate features from the different modalities in order to exploit related information and achieve a better classification rate. Previous studies employed state-of-the-art models for text and vision, but they did not consider the correlation between the two modalities or how to preprocess the data to obtain a cleaner version.

In this study, EfficientNet-v2 [11] is employed as the visual extractor, while LSTM [12] and RoBERTa [13] are used in the bimodal models to extract textual features. In addition, RoBERTa is also used for the text-only model. With respect to fusion techniques, we consider three methods to obtain aggregated features: traditional concatenation, multi-hop attention [2], and stacked attention [14]. These techniques are used to combine the visual features with the textual features from LSTM and RoBERTa. In total, we evaluated six different models during the competition. Besides, we also adopt several techniques to improve the classification rate, such as Auto Augmentation [15] and Deep Canonical Correlation Analysis [16]. Finally, to enhance the visual extraction, we remove the text on each meme by using EAST [17] to detect text regions within the image and then erasing them (a small sketch of this step is shown below).
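To make the text-removal step concrete, the following is a minimal sketch, not the authors' exact implementation. It assumes that axis-aligned text boxes have already been produced by a detector such as EAST, and it erases those regions with OpenCV inpainting; the function name remove_text_regions and the sample box coordinates are illustrative only, and whether the authors fill or inpaint the regions is not stated in the paper.

<pre>
# Hypothetical illustration: erasing detected text regions from a meme image.
# Boxes are assumed to come from a text detector such as EAST; running the
# detector itself is omitted here.
import cv2
import numpy as np

def remove_text_regions(image_bgr, boxes):
    """Erase the given (x, y, w, h) text boxes from the image via inpainting."""
    mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    for (x, y, w, h) in boxes:
        # Mark each detected text region in the mask.
        cv2.rectangle(mask, (x, y), (x + w, y + h), 255, -1)
    # Fill the masked regions using surrounding pixels.
    return cv2.inpaint(image_bgr, mask, 3, cv2.INPAINT_TELEA)

if __name__ == "__main__":
    img = cv2.imread("meme.jpg")                        # path to any meme image
    detected = [(40, 20, 300, 60), (50, 400, 280, 50)]  # placeholder detections
    if img is not None:
        cv2.imwrite("meme_no_text.png", remove_text_regions(img, detected))
</pre>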
3. Methodology

Figure 1: The proposed framework for multiple modalities, which has two inputs: (1) an image after removing the text in it; and (2) the text extracted from the image. The image is processed by EfficientNet-v2, while the text is processed by either RoBERTa or an LSTM. The extracted features are then aggregated by a fusion network, which includes a combination of a multi-hop attention network and/or a stacked attention network. The fused vector is used to predict the classes of the corresponding inputs, depending on each subtask.

We have evaluated many network architectures along with fusion techniques. In addition, we employ auxiliary methods such as Auto Augmentation and Canonical Correlation Analysis to enhance efficiency. The proposed models are divided into unimodal and multimodal variants, built on EfficientNet-v2 [11] as the vision backbone and on LSTM [12] and RoBERTa [13] for text processing. Figure 1 depicts the proposed framework for multiple modalities. It includes one branch for extracting features from the image and one for extracting features from the text. These features are combined by an attention-based fusion module before passing through a fully connected layer for final classification. The number of output nodes of this layer depends on the task. For the sentiment task, it has 3 nodes denoting the Negative, Neutral, and Positive classes. For the emotion task, there are 4 final linear classifiers, each with two output nodes, because this task involves 4 types of emotions and each emotion has two classes, 0 and 1. Lastly, for the intensity task, there are also 4 final linear classifiers, but each has a different number of output nodes; for example, the classifier for the intensity of the humour class has 4 nodes denoting 4 levels of intensity, while that of the motivation class has only 2 output nodes.

3.1. Unimodal for Text

BERT [8] and its variants, e.g., RoBERTa [13], are widely used in Natural Language Processing (NLP) tasks and have proven to be effective. In this competition, we employed them for the three subtasks. In subtasks A and B, a single RoBERTa [13] model is used, while in subtask C, four RoBERTa models are adopted so that each backbone is responsible for classifying the intensity of one emotion.

3.2. Unimodal for Image

As a vision-based approach, EfficientNet-v2 [11] is a well-known backbone offering fast inference and a small number of parameters. In the three subtasks, EfficientNet-v2 is used as an extractor to create embeddings. Subtask A has one classification branch to deal with the three sentiment types, while subtasks B and C have four branches, each responsible for classifying one of the four emotion types or its intensity.

3.3. Multimodal for Image and Text

Multimodal models aggregate vision and text to obtain correlated information. In this challenge, we build three different fusion models: concatenation, multi-hop attention [2], and a stacked attention network [14].

3.3.1. Concatenation

Traditionally, concatenating the two feature vectors, i.e., the two modalities, has been a typical solution to obtain aggregated features. However, this method does not take into account the importance of each word with respect to the corresponding regions of the image.

3.3.2. Multi-hop Attention

Multi-hop attention was initially proposed in [2]. It attends to parts of a given image together with the text within it. The technique emphasizes the interaction between image regions and textual utterances by defining a relevance matrix R, computed from the cosine distance between the textual and visual features.
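To illustrate the relevance-matrix idea, the following is a simplified PyTorch sketch rather than the exact architecture of [2]: the hop update rule, the pooling over words, the number of hops, and all tensor shapes are assumptions made for the example.

<pre>
# Simplified sketch of a relevance matrix and multi-hop pooling (not the exact model of [2]).
import torch
import torch.nn.functional as F

def relevance_matrix(text_feats, img_feats):
    """R[b, t, k]: cosine similarity between word t and image region k."""
    text_n = F.normalize(text_feats, dim=-1)   # (B, T, D)
    img_n = F.normalize(img_feats, dim=-1)     # (B, K, D)
    return torch.bmm(text_n, img_n.transpose(1, 2))  # (B, T, K)

def multi_hop_pool(text_feats, img_feats, hops=2):
    """Aggregate image regions over several attention hops guided by R."""
    R = relevance_matrix(text_feats, img_feats)        # (B, T, K)
    region_scores = R.mean(dim=1)                      # relevance of each region, (B, K)
    fused = []
    for _ in range(hops):
        attn = torch.softmax(region_scores, dim=-1)    # (B, K)
        fused.append(torch.bmm(attn.unsqueeze(1), img_feats).squeeze(1))  # (B, D)
        region_scores = region_scores - attn           # crude "look at new regions" update
    return torch.cat(fused, dim=-1)                    # (B, hops * D)

# Example: batch of 2 memes, 12 word features and 49 region features of size 256.
txt, img = torch.randn(2, 12, 256), torch.randn(2, 49, 256)
print(multi_hop_pool(txt, img).shape)  # torch.Size([2, 512])
</pre>

In the full model, such attended visual vectors would be fused with the textual representation before the final classifiers; the sketch only shows the relevance-guided pooling step.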
3.3.3. Stacked Attention

While the multi-hop attention network learns attention maps between an image and the text within it, the stacked attention network introduced in [14] is capable of learning an attention map multiple times. Through such stacked attention layers, the regions of interest are progressively refined with respect to the concept referred to in a given sentence.

3.4. Useful Techniques

3.4.1. Auto Augmentation

Augmentation is a simple but important technique to increase the size of a given dataset, leading to better generalization of the trained model. However, conventional data augmentation relies on a set of manually designed operations such as Crop, Rotation, and Resize. In our experiments, we adopt the Auto Augment technique [15], which uses reinforcement learning to automatically search for a better data augmentation strategy.

3.4.2. Canonical Correlation Analysis

Canonical correlation analysis (CCA) was proposed in [18]. It is a well-established statistical technique that searches for linear combinations of two sets of input vectors that maximize their correlation. Deep CCA [16] utilizes the power of both deep neural networks and CCA to overcome the linear projection constraint of CCA. In this study, the correlation score obtained from Deep CCA is added to our loss function to maximize the correlation between the two feature sets, leading to a higher classification rate.
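The following is a compact sketch of how a CCA-style correlation term could be combined with the classification loss, in the spirit of Deep CCA [16]. It is not the authors' exact implementation: the regularization constant eps, the projection size, and the trade-off weight lam are assumptions.

<pre>
# Sketch of a CCA-style correlation loss added to cross-entropy (assumptions noted above).
import torch

def cca_correlation(H1, H2, eps=1e-4):
    """Sum of canonical correlations between two batches of projected features (B, d)."""
    n = H1.size(0)
    H1 = H1 - H1.mean(dim=0, keepdim=True)
    H2 = H2 - H2.mean(dim=0, keepdim=True)
    # Regularized covariance estimates.
    S11 = H1.t() @ H1 / (n - 1) + eps * torch.eye(H1.size(1), device=H1.device)
    S22 = H2.t() @ H2 / (n - 1) + eps * torch.eye(H2.size(1), device=H2.device)
    S12 = H1.t() @ H2 / (n - 1)

    def inv_sqrt(S):
        # Inverse square root of a symmetric positive-definite matrix.
        vals, vecs = torch.linalg.eigh(S)
        return vecs @ torch.diag(vals.clamp_min(eps).rsqrt()) @ vecs.t()

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return torch.linalg.svdvals(T).sum()  # higher means more correlated

# Training objective sketch: cross-entropy minus a weighted correlation term.
img_proj, txt_proj = torch.randn(16, 32), torch.randn(16, 32)    # projected modality features
logits, labels = torch.randn(16, 3), torch.randint(0, 3, (16,))  # e.g. Subtask A (3 classes)
lam = 0.1  # assumed trade-off weight
loss = torch.nn.functional.cross_entropy(logits, labels) - lam * cca_correlation(img_proj, txt_proj)
print(loss.item())
</pre>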
4. Experiment

4.1. Dataset

In this shared task of the First Workshop on Multimodal Fact-Checking and Hate Speech Detection, the MEMOTION 2.0 [5] hate speech detection dataset was used. It includes 7,000 samples for the training set and 1,500 samples for the validation set. The dataset is used for the three subtasks of the MEMOTION 2.0 Challenge and is labeled as follows:

• Sentiment analysis:
– Negative and Very Negative are labeled 0;
– Neutral is labeled 1;
– Positive is labeled 2.
• Emotion classification:
– Not Humorous is labeled 0, while Humorous (funny, very funny, and hilarious) is labeled 1;
– Not Sarcastic is labeled 0, while Sarcastic (little sarcastic, very sarcastic, and extremely sarcastic) is labeled 1;
– Not Offensive is labeled 0, while Offensive (slight, very offensive, and hateful offensive) is labeled 1;
– Not Motivational is labeled 0 and Motivational is labeled 1.
• Scale/intensity of emotion classes:
– Humour: Not funny, funny, very funny, and hilarious are labeled 0, 1, 2, 3, respectively;
– Sarcasm: Not sarcastic, little sarcastic, very sarcastic, and extremely sarcastic are labeled 0, 1, 2, 3, respectively;
– Offense: Not offensive, slight, very offensive, and hateful offensive are labeled 0, 1, 2, 3, respectively;
– Motivation: Not motivational is labeled 0 and motivational is labeled 1.

4.2. Preprocessing

Although both textual and visual features are important for meme emotion analysis, there is little correlation between them in the MEMOTION 2.0 dataset. Besides, the caption is provided as part of the dataset. Therefore, in this study, the text is removed from the image before extracting visual features and training the proposed model. Based on previous work [19] that summarized both traditional and deep learning approaches for text detection and recognition, we design the preprocessing scheme as follows. First, we employ the EAST [17] module to detect all text regions in an image. Then, these regions are removed from the image, and the resulting image is used as the input for EfficientNet-v2 in the proposed framework. Figure 2 visualizes the steps of the preprocessing scheme.

Figure 2: Preprocessing scheme. Given an image as input, we use the EAST [17] detector to locate the text regions on the image and then remove them.

4.3. Experimental setup

All experiments were carried out on a Titan Xp GPU workstation. The batch size is 10, the input image size is 256×256, and the learning rate is 2e-5. The Adam [20] optimizer is used with a weight decay of 1e-5, and the Cosine Annealing Warm Restarts [21] scheduler is used for scheduling the learning rate. We also apply common augmentation techniques such as Resize, CenterCrop, and RandomFlip with a probability of 0.5, in addition to the Auto Augmentation mentioned above, and then normalize with mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225). Finally, our models use cross-entropy as the loss function, except the text-only models, which use binary cross-entropy instead.

5. Results

The evaluation metric of this competition is the weighted F1 score, and the final score is the average of the three weighted F1 scores of all subtasks. Table 1 summarizes our results in the public phase with different models. The results of the private phase are presented in Table 2. Among the best weighted F1 scores of the three subtasks, we achieved 0.5124 for sentiment analysis, 0.7423 for emotion classification, and 0.5296 for scale/intensity of emotion classes.

Table 1: The weighted F1 scores of the three subtasks, namely Sentiment, Emotion, and Intensity of Emotion, in the public phase. Note that these results are obtained with the validation data during the public phase, and SAN denotes the Stacked Attention Network.

Model                  Sentiment  Emotion  Intensity  Average
Only Text              0.5145     0.7140   0.5781     0.6025
Only Image             0.5176     0.7033   0.5628     0.5946
Multihop Image + Text  0.5316     0.7107   0.5590     0.6004
SAN Image + Text       0.5138     0.7140   0.5745     0.6008
Concat Image + Text    0.5253     0.7141   0.5823     0.6072
SAN Image + Text       0.5200     0.7083   0.584      0.6041

Table 2: The weighted F1 scores of the three subtasks Sentiment, Emotion, and Intensity of Emotion in the private phase.

Task               Sentiment  Emotion  Intensity  Average
Weighted F1 score  0.4995     0.7414   0.5301     0.5903
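To make the scoring procedure concrete, the sketch below computes a weighted F1 score per subtask with scikit-learn and averages them into the final score. The labels and predictions are toy placeholders, and the aggregation over the four emotion categories within subtasks B and C is omitted.

<pre>
# Toy illustration of the competition metric: weighted F1 per subtask, averaged.
from sklearn.metrics import f1_score

subtask_predictions = {
    "sentiment": ([0, 1, 2, 1, 0], [0, 2, 2, 1, 1]),  # (ground truth, predictions)
    "emotion":   ([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]),
    "intensity": ([3, 0, 2, 1, 0], [3, 1, 2, 1, 0]),
}

scores = {
    name: f1_score(y_true, y_pred, average="weighted")
    for name, (y_true, y_pred) in subtask_predictions.items()
}
final_score = sum(scores.values()) / len(scores)
print(scores, final_score)
</pre>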
6. Conclusion and Future Work

In this study, we have integrated several attention models and a correlation analysis technique for analysing the meme dataset. To handle the imbalanced dataset, Auto Augmentation [15] is adopted, and we find that it provides a richer dataset for the subsequent processing. The visual and textual features extracted by the attention models are projected into the most correlated directions using DCCA [16] for stable and generalized training. The best result of each subtask varies depending on the combination of models used. For the sentiment task, the multi-hop attention model with an LSTM performs best, whereas the concatenation of the CNN and BERT gives the highest result for the emotion task. The stacked attention network with the CNN and BERT achieves the best score for the intensity task. In the future, an in-depth analysis shall be carried out by collecting or synthesizing more data as well as further mitigating the semantic gap between the text and the image. The imbalance between classes remains a vitally important problem that can be tackled by data augmentation or by formulating a new loss function that puts more weight on classes with fewer samples. In addition, since feature fusion is not always necessary in vision-language tasks, designing a novel network that can choose whether or not to use fusion is to be investigated.

Acknowledgments

This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the Information Technology Research Center (ITRC) support program (IITP-2021-2016-0-00312) as well as a grant (IITP-2019-0-00231) supervised by the Institute for Information & Communications Technology Planning & Evaluation (IITP). In addition, the authors would like to thank the FruitLab team for useful ideas and discussion.

References

[1] V. Keswani, S. Singh, S. Agarwal, A. Modi, IITK at SemEval-2020 Task 8: Unimodal and bimodal sentiment analysis of internet memes, in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020, pp. 1135–1140.
[2] S. Pramanick, M. S. Akhtar, T. Chakraborty, Exercise? I thought you said 'extra fries': Leveraging sentence demarcations and multi-hop attention for meme affect analysis, arXiv preprint arXiv:2103.12377 (2021).
[3] Z. Li, Y. Zhang, B. Xu, T. Zhao, CN-HIT-MI.T at SemEval-2020 Task 8: Memotion analysis based on BERT, in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020, pp. 1100–1105.
[4] C. Sharma, D. Bhageria, W. Scott, S. PYKL, A. Das, T. Chakraborty, V. Pulabaigari, B. Gamback, SemEval-2020 Task 8: Memotion analysis – the visuo-lingual metaphor!, arXiv preprint arXiv:2008.03781 (2020).
[5] S. Ramamoorthy, N. Gunti, S. Mishra, S. Suryavardan, A. Reganti, P. Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Memotion 2: Dataset on sentiment and emotion analysis of memes, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[6] P. Patwa, S. Ramamoorthy, N. Gunti, S. Mishra, S. Suryavardan, A. Reganti, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Findings of Memotion 2: Sentiment and emotion analysis of memes, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[7] I. Rish, et al., An empirical study of the naive Bayes classifier, in: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, 2001, pp. 41–46.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] D. Kiela, S. Bhooshan, H. Firooz, E. Perez, D. Testuggine, Supervised multimodal bitransformers for classifying images and text, arXiv preprint arXiv:1909.02950 (2019).
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] M. Tan, Q. V. Le, EfficientNetV2: Smaller models and faster training, arXiv preprint arXiv:2104.00298 (2021).
[12] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[13] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692.
[14] Z. Yang, X. He, J. Gao, L. Deng, A. Smola, Stacked attention networks for image question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 21–29.
[15] E. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q. Le, AutoAugment: Learning augmentation policies from data, arXiv preprint arXiv:1805.09501 (2019).
[16] G. Andrew, R. Arora, J. Bilmes, K. Livescu, Deep canonical correlation analysis, in: International Conference on Machine Learning, PMLR, 2013, pp. 1247–1255.
[17] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, EAST: An efficient and accurate scene text detector, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5551–5560.
[18] H. Hotelling, Relations between two sets of variates, in: Breakthroughs in Statistics, Springer, 1992, pp. 162–190.
[19] S. Long, X. He, C. Yao, Scene text detection and recognition: The deep learning era, International Journal of Computer Vision 129 (2021) 161–184.
[20] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[21] I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, arXiv preprint arXiv:1608.03983 (2016).