
wentaorub at Memotion 3: Ensemble learning for Multi-modal MEME classification

Wentao Yu

Dorothea Kolossa

Electronic Systems of Medical Engineering, TU Berlin, Germany; Institute of Communication Acoustics, Ruhr University Bochum, Germany

Memes, as a new means of creative expression on social networks, provide an appealing multi-modal form of communication. However, some memes are being used to express hatred, which can take a toll on people's mental health and on societal cohesion. This year's Memotion 3.0 challenge provides an English and a mixed Hindi-English meme dataset for three classification tasks: Task A is sentiment analysis to classify a given meme as positive, negative, or neutral. In Task B, emotion classification, a meme should be identified as humorous, sarcastic, offensive, or motivational. Finally, Task C asks for a prediction of the intensity of the emotion classes in Task B. Both text and image data play a role in the identification and classification of hateful memes. While such multi-modality can be helpful in many contexts, here, it also increases the challenge of the classification tasks due to the nature of memes, which often achieve their humorous effects through juxtaposition and irony. To address this difficulty, we adopt a multi-headed self-attention mechanism to integrate the text and image information in a learned, task-adapted manner. The gradient blending algorithm prevents overfitting issues in the multi-modal model. Our uni-modal models, which feed into the attention mechanism, are based on the CLIP model due to its outstanding performance on zero-shot classification tasks. Ultimately, with an ensemble strategy of our two best-performing models, our submission only reaches a 0.3289 weighted F1 score on sub-task A, but it ranks 1st on the two final Tasks B and C, with respective scores of 0.7977 and 0.5982.¹

¹ Our code will be made available at: https://github.com/wentaoxandry/Memotion3.0_challenge.git

De-Factify 2: 2nd Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2023. 2023 Washington, DC, USA

wentao.yu@rub.de (W. Yu); dorothea.kolossa@tu-berlin.de (D. Kolossa)

https://cognitive-signal-processing.de/index.php/team/ (W. Yu); https://www.tu.berlin/en/mtec (D. Kolossa)

© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Keywords: Ensemble, CLIP, OSCAR, multimodal memes classification

1. Introduction

It is well-known that multi-modal machine learning can vastly outperform uni-modal learning, at least when the system is set up appropriately. For example, in audio-visual speech recognition, visual information can complement speech signals to significantly improve recognition rates [ 1, 2, 3 ]. However, memes often express opinions in an implied manner. The text and image may even have opposite meanings in isolation and can be combined ironically. This characteristic of memes leads to a new type of challenge in automatic classification tasks. In order to study this problem, the Memotion 3.0 challenge provides a Hinglish meme dataset for three meme classification tasks [ 4, 5, 6, 7, 8, 9, 10 ].

In this work, we consider transfer learning to customize two multi-modal models based on the Transformer model: the CLIP model [ 11 ] and the OSCAR model [ 12 ]. The text and image encoders from the CLIP model are optimized as two uni-modal (text and image) models. Ultimately, the ensemble strategy is applied for better performance.

The paper is organized as follows: Section 2 introduces the related solutions to the task. Our system framework is described in Section 3, followed by the experimental setup in Section 4. Finally, our results are shown and conclusions are drawn in Sections 5 and 6.

2. Related Work

The transformer model [ 13 ] is widely used in natural language processing tasks due to its outstanding performance. In recent years, a number of works have expanded the capability of the transformer model towards multi-modal tasks. For example, the OSCAR model [ 12 ] adopts the Faster R-CNN [ 14 ] to extract visual embeddings of the detected object regions. In addition, the Faster R-CNN model outputs the detected object tags, which are considered as additional anchor points to improve the learning performance of alignments. Subsequently, an attention mechanism walks through the combined text-image sequence embeddings.

Recently, contrastive learning has drawn much attention due to its outstanding performance on zero-shot prediction [ 15, 16 ]. In this work, we consider the CLIP model [ 11 ] to extract contrastive embeddings, since memes contain various image types, which causes difficulties in classification. The remarkable zero-shot prediction accuracy of the CLIP model can help us to alleviate this problem. The CLIP model uses a pre-trained BERT model to extract text context classification features and a Vision transformer [ 17 ] for obtaining image classification features. Contrastive learning is adopted to learn multi-modal embeddings without manual labels by teaching the CLIP model about the similarity of different data points. Assuming a mini training batch with meme OCR texts $T = \{t_1, t_2, \dots, t_B\}$ and images $I = \{i_1, i_2, \dots, i_B\}$, where $B$ is the batch size, the CLIP model learns to match the OCR text and image as follows: the extracted text classification features $F_{cls,t} \in \mathbb{R}^{B \times d_t}$ and image classification features $F_{cls,i} \in \mathbb{R}^{B \times d_i}$ are computed by

$$F_{cls,t} = \mathrm{encoder}_t(T), \quad (1)$$

$$F_{cls,i} = \mathrm{encoder}_i(I), \quad (2)$$

where $d_t$ and $d_i$ are the attention dimensions of the text and image encoder, respectively. The classification features are mapped to the same dimension and normalized to unit $\ell_2$ norm. The contrastive logits $x$ are derived as their scaled, pairwise cosine similarities:

$$x = \left( \| W_t \cdot F_{cls,t} \|_2 \cdot \| W_i \cdot F_{cls,i} \|_2^{\top} \right) \times s,$$

where $s$ is a learnable parameter. As in [ 11 ], the labels are the one-hot encoded labels of the set $y = [1, 2, \dots, B]$. The loss function is defined as:

$$\mathcal{L} = 0.5 \cdot \mathrm{CE}(\hat{y}_{\mathrm{axis}=0}, y) + 0.5 \cdot \mathrm{CE}(\hat{y}_{\mathrm{axis}=1}, y), \quad (3)$$

where CE is the cross-entropy, $\hat{y}_{\mathrm{axis}=0} = \mathrm{softmax}_{\mathrm{axis}=0}(x)$ and $\hat{y}_{\mathrm{axis}=1} = \mathrm{softmax}_{\mathrm{axis}=1}(x)$. Finally, the learned text and image classification features $F_{cls,t}$, $F_{cls,i}$ are used in our proposed CLIP-based text, image, and multi-modal models.
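The symmetric contrastive objective described above can be illustrated with a minimal sketch in plain Python. It assumes that the features have already been projected to a shared dimension (the projection matrices are omitted), and the function names and the fixed `scale` argument, which stands in for the learnable scaling parameter, are hypothetical choices for this illustration rather than the authors' implementation.

```python
import math


def l2_normalize(rows):
    """Scale each row vector to unit Euclidean length."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / norm for v in row])
    return normalized


def mean_cross_entropy(logit_rows):
    """Mean cross-entropy when the target class of row k is k."""
    total = 0.0
    for k, row in enumerate(logit_rows):
        peak = max(row)
        # log-sum-exp with the maximum subtracted for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[k]
    return total / len(logit_rows)


def contrastive_loss(text_feats, image_feats, scale=1.0):
    """Symmetric loss over scaled pairwise cosine similarities.

    text_feats, image_feats: B x d lists in a shared dimension d.
    scale: assumed stand-in for the learnable scaling parameter.
    """
    text = l2_normalize(text_feats)
    image = l2_normalize(image_feats)
    # logits[r][c]: scaled cosine similarity of text r and image c
    logits = [[scale * sum(a * b for a, b in zip(t, i)) for i in image]
              for t in text]
    # Matched pairs lie on the diagonal, so the softmax is taken once per
    # column (axis 0) and once per row (axis 1) with target index k.
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * mean_cross_entropy(transposed) + 0.5 * mean_cross_entropy(logits)
```

With matched text-image pairs on the diagonal, the loss is small when each text is most similar to its own image and grows when the pairing is permuted.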

3. System Overview

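[Figure 1: (a) Text model. Labels recoverable from the figure: input text T, CLIP text encoder (CLIPtext), sequence embeddings E_t, RNN with output h, classification features F_cls,t, Classifier 1, Classifier 2, logits x_t.]

… are utilized in the multi-modal model, where … is the number of patches of the image.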

Figure 2 depicts the proposed multi-modal model. The sequence embeddings from the text and image model in Figure 1 are concatenated along the sequence dimension as multi-modal embeddings

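$$E_{\mathrm{multi}} = [E_t; E_i],$$

where $E_{\mathrm{multi}} \in \mathbb{R}^{(N_t + N_i) \times d}$ stacks the $N_t$ text and $N_i$ image sequence embeddings of dimension $d$.

The complete embedding sequence $E_{\mathrm{multi}}$ is fed into six multi-head attention (MHA) blocks. We removed the MHA block's residual connection and dropout layer based on our experimental results. Finally, the classifier, which has the same structure as the text and image classifiers, uses the multi-modal classification embedding $F_{cls,\mathrm{multi}} \in \mathbb{R}^{2d}$,

$$F_{cls,\mathrm{multi}} = [\tilde{E}_t[0, :], \tilde{E}_i[0, :]],$$

to obtain the multi-modal logits $x_{\mathrm{multi}}$. Here, $\tilde{E}_t[0, :]$ and $\tilde{E}_i[0, :]$ are the first classification embeddings of the text and image parts after the six MHA blocks.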
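The fusion step described above (concatenation along the sequence dimension, six attention blocks, and selection of the first text and first image embedding) can be sketched as follows. This is only an illustration under strong simplifying assumptions: the attention block is a single-head, projection-free stand-in without residual connection or dropout, and the names `self_attention` and `fuse` are hypothetical. It also assumes that the first image embedding sits directly after the last text embedding in the fused sequence.

```python
import math


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Simplified stand-in for an MHA block: no learned projections,
    no residual connection, no dropout.
    """
    dim = len(seq[0])
    output = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in seq]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Each output vector is a convex combination of the input vectors.
        output.append([sum(w / total * value[j] for w, value in zip(weights, seq))
                       for j in range(dim)])
    return output


def fuse(text_seq, image_seq, num_blocks=6):
    """Return the 2d-dimensional multi-modal classification embedding."""
    n_text = len(text_seq)
    fused = text_seq + image_seq  # concatenate along the sequence dimension
    for _ in range(num_blocks):
        fused = self_attention(fused)
    # First embedding of the text part and first embedding of the image part.
    return fused[0] + fused[n_text]
```

For text and image sequences with embedding dimension 4, `fuse` returns a vector of length 8, which a classifier would then map to the multi-modal logits.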

4. Experimental Setup

The Memotion 3.0 challenge has three sub-tasks. Task A is to classify a meme as positive, negative, or neutral. In Task B, a given meme should be identified as humorous, sarcastic, offensive, or motivational. It is a multi-label classification task, so a meme can belong to more than one category. Finally, Task C asks for a prediction of the intensity of the emotion classes in Task B. In our work, we only optimized the models for Task A and Task C. Task B results are obtained from Task C. For example, we consider a meme as humorous if it is classified as F, VF, or H (detailed in Table 2), and as not humorous otherwise.

Since the Memotion3 dataset contains English as well as mixed Hindi-English memes, we perform back-translation to Hindi and then to English with the Python translators package. All models are trained using the PyTorch library [ 21 ]. The AdamW optimizer [22] is used for the parameter updates.

The CLIP model (detailed in Section 2) is first optimized using the pretrainCLIP dataset, where the contrastive loss is the objective function. The pre-trained CLIP model thus learns to match the meme image and text and is denoted as CLIPpre. For uni-modal training, the pretrainuni dataset is used to fine-tune the CLIP component models CLIPpre,t and CLIPpre,i within the two overall model structures as shown in Figure 1. The model parameters of the CLIP component models CLIP-text and CLIP-image are initialized by those of the respective CLIPpre model and optimized on the Memotion3 dataset. The multi-modal model CLIP-multi parameters are then initialized by the uni-modal models.

We train one model for Task A with an output dimension of 3 and four models for the four aspects of Task C with output dimensions 4, 4, 4, and 2. The dropout rate in all classifiers is 0.1. The attention dimensions of the text encoder and image encoder are 512 and 768, respectively. The dataset for Task A is balanced. Therefore, we simply use the cross-entropy (CE) as the loss function for Task A. In contrast, the dataset for Task C is quite imbalanced. The focal loss (F) function [23] is therefore selected as the loss function for training the respective classifiers.

¹ https://github.com/schesa/ImgFlip575K_Dataset
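To make the choice of loss for the imbalanced Task C data more concrete, the following minimal sketch implements the single-sample focal loss of [23] in plain Python. The focusing parameter `gamma=2.0` is the default suggested in [23] and is an assumption here, since the value used in this work is not stated; the class-balancing factor of [23] is omitted, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss -(1 - p_t)^gamma * log(p_t) for one sample.

    gamma is an assumed value; gamma = 0 recovers the cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(logits, target):
    """Plain cross-entropy, i.e. the focal loss with gamma = 0."""
    return focal_loss(logits, target, gamma=0.0)
```

Compared with the cross-entropy, the factor `(1 - p_t) ** gamma` shrinks the loss of confidently classified samples much more than that of misclassified ones, so that frequent, easy classes contribute less to the gradient.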

In this work, we adopt Gradient-Blending [24] (GB) to reduce the effect of overfitting. The multi-modal model (Figure 2) is based on the text and image models (Figure 1). Therefore, the text and image model logits $x_t$ and $x_i$ are also available in the multi-modal model. Taking the gradient of the blended loss

$$\mathcal{L}_{\mathrm{blend}} = \sum_{i} \mathrm{CE}_i, \quad (7)$$

where $i \in$ [text, image, multi-modal], produces the blended gradient. It should be emphasized that the multi-modal predictions are only obtained from the multi-modal logits $x_{\mathrm{multi}}$. Finally, Table 3 gives an overview of the use of the loss functions in training all models.
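As an illustration of the blended loss, the sketch below sums the cross-entropy of the text, image, and multi-modal logits for one sample and takes the prediction from the multi-modal logits only. The optional `weights` argument is an assumption of this sketch (Gradient-Blending as proposed in [24] uses per-branch weights); with the default of 1 for every branch, the function reduces to a plain sum over the three branches. All function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample, computed via a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def blended_loss(branch_logits, target, weights=None):
    """Sum of per-branch cross-entropy losses.

    branch_logits: dict with keys "text", "image", "multi-modal".
    weights: optional per-branch weights (assumed, default 1.0 each).
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * cross_entropy(logits, target)
               for name, logits in branch_logits.items())


def predict(branch_logits):
    """The prediction uses the multi-modal logits only."""
    multi = branch_logits["multi-modal"]
    return max(range(len(multi)), key=multi.__getitem__)
```

During training, all three branches thus receive a loss signal, while at inference time the uni-modal logits are ignored.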

We use the Python Ray² package to find the best-performing hyperparameters. The training process is carried out on NVIDIA's Volta-based DGX-1 multi-GPU system, using three Tesla V100 GPUs with 32 GB memory each.

² https://github.com/ray-project/ray

5. Results

This work considers the CLIP-text, CLIP-image (in Figure 1), CLIP-multi (in Figure 2), and OSCAR models. For better performance, majority voting is adopted to ensemble the decisions of different models. Ensemble-1 fuses the prediction decisions of the candidate models CLIP-text0, CLIP-image0, CLIP-multi0, and OSCAR0, while Ensemble-2 also takes CLIP-text1, CLIP-image1, CLIP-multi1, and OSCAR1 into consideration. We iterate over all possible model combinations and apply majority voting on the validation set to find the best-performing model combination. This combination is then used to fuse the test set predictions.
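The ensemble search described above can be sketched in plain Python as an exhaustive loop over model subsets, scored by the weighted F1 of the majority vote on the validation set. The function names are hypothetical, and the tie-breaking rule (the label voted for by the earliest model in the combination wins) is an assumption of this sketch, as the handling of ties is not specified.

```python
from collections import Counter
from itertools import combinations


def majority_vote(predictions):
    """Fuse per-model label lists sample by sample.

    Ties go to the label that appears first among the votes (assumed rule).
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights given by class support."""
    score = 0.0
    for label in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * (tp + fn) / len(y_true)
    return score


def best_combination(model_preds, y_true):
    """Try every non-empty subset of models and keep the best-scoring one."""
    best_names, best_score = None, -1.0
    names = sorted(model_preds)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            fused = majority_vote([model_preds[name] for name in combo])
            score = weighted_f1(y_true, fused)
            if score > best_score:
                best_names, best_score = combo, score
    return best_names, best_score
```

Here `model_preds` would map model names such as CLIP-text0 or OSCAR0 to their validation predictions; the returned combination is then reused to fuse the test set predictions. With eight candidate models, the search covers 255 subsets, which is cheap to enumerate.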

Table 4 lists the weighted F1 score on the validation set. For Task A ("Overall" column in Table 4), the CLIP-text model performs better than the CLIP-image model. The score of the CLIP-multi setup lies between those of the former two models. Ensemble-2 improves the weighted F1 score to 0.4453. The model for motivation classification has scores above 0.9, because the binary classification dataset is imbalanced. Comparing the best-performing text and image models (CLIP-text0 and CLIP-image0), the image model shows a slightly better performance in Task C. The CLIP-multi0 model without GB training performs far worse than its gradient-blending counterpart. Overall, Ensemble-2 shows the best performance in Task A and Task C. Ultimately, the strategy of ensembling the top two models yields a 0.3289 (5th) weighted F1 score on Task A, 0.7977 (1st rank) on Task B and 0.5982 (also 1st rank) on Task C.

6. Conclusion

This work proposes a multi-modal CLIP-based meme classification system, which owes its capabilities on this rather small dataset to the outstanding zero-shot performance of the CLIP model. The text model combines the CLIP model text encoder with 2 BiLSTM layers; the image model is fine-tuned on the Memotion 3.0 dataset. The proposed multi-modal model integrates the text and image embeddings from the text and image encoders in 6 multi-head self-attention blocks. Gradient blending prevents the fusion model from overfitting. The OSCAR model is used both as a baseline model and as a participant model in our ensemble strategy, which further serves to improve the system performance. Our ensemble of the top two models yields a clearly better accuracy than a single model, winning Tasks B and C in the Memotion 3.0 challenge. The experimental results of the challenge do indicate, however, that sentiment analysis in memes is difficult for machine learning. The next goal of our work is therefore to develop mechanisms for understanding multi-modal, contrasting information, e.g. conveying irony, to improve sentiment classification performance for memes and social media posts.

Acknowledgments

The work was supported by the PhD School "SecHuman - Security for Humans in Cyberspace" by the federal state of NRW, and partially funded by the Deutsche Forschungsgemeinschaft (DFG – German Research Foundation) [Project-ID 429873205] and by the German Federal Ministry of Education and Research ["noFake", Grant No: 16KIS1519]. The authors are responsible for the content of this publication.

References

[1] Yu, Zeiler, Kolossa, Multimodal integration for large-vocabulary audio-visual speech recognition, in: Proc. 28th European Signal Processing Conf. (EUSIPCO), IEEE, 2021, pp. 341–345.

[2] Yu, Zeiler, Kolossa, Fusing information streams in end-to-end audio-visual speech recognition, in: Proc. ICASSP, IEEE, 2021, pp. 3430–3434.

[3] Yu, Boenninghoff, Roehrig, Kolossa, Rubcsg at SemEval-2022 Task 5: Ensemble learning for identifying misogynous MEMEs, arXiv preprint arXiv:2204.03953 (2022).

[4] Patwa, Ramamoorthy, Gunti, Mishra, Suryavardan, Reganti, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Findings of Memotion 2: Sentiment and emotion analysis of memes, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.

[5] Mishra, Suryavardan, Patwa, Chakraborty, Rani, Reganti, Chadha, A. Das, A. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Memotion 3: Dataset on sentiment and emotion analysis of codemixed Hinglish memes, in: Proc. Defactify 2: 2nd Workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR, 2023.

[6] Mishra, Suryavardan, Chakraborty, Patwa, Rani, Chadha, Reganti, A. Das, A. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Overview of memotion 3: Sentiment and emotion analysis of codemixed hinglish memes, in: Proc. Defactify 2: 2nd Workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR, 2023.

[7] Suryavardan, Mishra, Patwa, Chakraborty, Rani, Reganti, Chadha, A. Das, A. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Factify 2: A multimodal fake news and satire news dataset, in: Proc. Defactify 2: 2nd Workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR, 2023.

[8] Suryavardan, Mishra, Chakraborty, Patwa, Rani, Chadha, Reganti, A. Das, A. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Findings of Factify 2: Multimodal fake news detection, in: Proc. Defactify 2: 2nd Workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR, 2023.

[9] Sharma, Bhageria, Scott, Pykl, A. Das, T. Chakraborty, V. Pulabaigari, B. Gamback, SemEval-2020 Task 8: Memotion analysis-the visuo-lingual metaphor!, arXiv preprint arXiv:2008.03781 (2020).

[10] Ramamoorthy, Gunti, Mishra, Suryavardan, Reganti, Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, et al., Memotion 2: Dataset on sentiment and emotion analysis of memes, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.

[11] Radford, J. W. Kim, Hallacy, Ramesh, G. Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al., Learning transferable visual models from natural language supervision, in: Proc. ICML, PMLR, 2021, pp. 8748–8763.

[12] Li, Yin, Li, Zhang, Hu, Zhang, Wang, Hu, Dong, Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: Proc. ECCV, Springer, 2020, pp. 121–137.

[13] Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

[14] Girshick, Fast r-cnn, in: Proc. ICCV, 2015, pp. 1440–1448.

[15] Han, Fu, Chen, Yang, Contrastive embedding for generalized zero-shot learning, in: Proc. CVPR, 2021, pp. 2371–2381.

[16] Jiang, Wang, Shan, Chen, Transferable contrastive network for generalized zero-shot learning, in: Proc. CVPR, 2019, pp. 9765–9774.

[17] Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, G. Heigold, Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).

[18] Gomez, Gibert, Gomez, Karatzas, Exploring hate speech detection in multimodal publications, in: Proc. IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1470–1478.

[19] Kiela, Firooz, Mohan, Goswami, Singh, Ringshia, Testuggine, The hateful memes challenge: Detecting hate speech in multimodal memes, Advances in Neural Information Processing Systems 33 (2020) 2611–2624.

[20] Fersini, Gasparini, Rizzi, Saibene, Chulvi, Rosso, Lees, Sorensen, SemEval-2022 Task 5: Multimedia automatic misogyny identification, in: Proc. SemEval2022, 2022, pp. 533–549.

[21] Paszke, Gross, Massa, Lerer, Bradbury, G. Chanan, Killeen, Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems (2019).

[22] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).

[23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proc. ICCV, 2017, pp. 2980–2988.

[24] W. Wang, D. Tran, M. Feiszli, What makes training multi-modal classification networks hard?, in: Proc. CVPR, 2020, pp. 12695–12705.