<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>De-Factify</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Memotion 3: Good Foundation, Good Teacher, then you have Good Meme Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yu-Chien Tang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kuang-Da Wang</string-name>
          <email>gdwang.cs10@nycu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ting-Yun Ou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wen-Chih Peng</string-name>
          <email>wcpeng@nctu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Yang Ming Chiao Tung University</institution>
          ,
          <addr-line>Hsinchu</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Washington</institution>
          ,
          <addr-line>DC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>2</volume>
      <issue>2</issue>
      <abstract>
        <p>This paper presents a robust solution to the Memotion 3.0 Shared Task. The goal of this task is to classify the emotion and the corresponding intensity expressed by memes, which are usually in the form of images with short captions on social media. Understanding the multi-modal features of the given memes will be the key to solving the task. In this work, we use CLIP[1] to extract aligned image-text features and propose a novel meme sentiment analysis framework, consisting of a Cooperative Teaching Model (CTM) for Task A and a Cascaded Emotion Classifier (CEC) for Tasks B&amp;C. CTM is based on the idea of knowledge distillation, and can better predict the sentiment of a given meme in Task A; CEC can leverage the emotion intensity suggestion from the prediction of Task C to classify the emotion more precisely in Task B. Experiments show that we achieved the 2nd place ranking for both Task A and Task B and the 4th place ranking for Task C, with weighted F1-scores of 0.342, 0.784, and 0.535 respectively. The results show the robustness and effectiveness of our framework. Our code is released on GitHub 1.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion classification</kwd>
        <kwd>meme</kwd>
        <kwd>multi-modal network</kwd>
        <kwd>multi-task learning</kwd>
        <kwd>foundation model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        There are two common definitions[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] of a meme: (1) an amusing or interesting item (such as a
captioned picture or video) or genre of items which spread widely online, especially through
social media; (2) an idea, behavior, style, or usage that spreads from person to person within a
culture. With careful analysis of the underlying sentiment of a widespread meme, people can get
a better understanding of the post content from social media. However, due to the multi-modal
nature of the meme, it is no easy task to understand its emotion and intensity only with the
image content or its caption alone, hindering potential applications such as detecting hateful
or harmful memes. Considering the strong correlation between the images and captions[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
downstream emotion classification tasks and sentiment analysis can benefit from high-quality
multi-modal representation. We take advantage of the CLIP[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] model, which is pre-trained with
contrastive loss and is able to align the multi-modal features in high-dimensional embedding
space, as a foundation to retrieve the rich information inside images and text.
      </p>
      <p>Besides, we observe that the sentiment labels and their scales are hierarchical (e.g., the emotion
humorous contains funny, very funny, and hilarious in Task C), and thus we introduce two different
models, CTM and CEC, for the different downstream tasks. In Task A, we observe that different
types of sentiment are composed of different proportions of positive and negative emotions.
Therefore, we propose CTM, which introduces the concept of knowledge distillation and uses the
teacher-student framework. The good teacher and the bad teacher cooperate
with each other and teach their own students to achieve better performance on Task A. CEC
considers the hierarchical characteristics of emotions in the model architecture, predicts the
emotion intensity for Task C, and leverages the prediction as a suggestion to classify the emotion
for Task B, so that Task B can achieve better performance compared to using a single
model.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Meme Understanding. People express themselves with memes in various templates on
social media as a way of communication. Modern memes are images with an embedded short
text. While sentiment analysis in memes needs to extract features from both modalities, some
researchers adopt multi-modal deep neural networks to analyze the sentiment of memes. In
previous competitions, many different deep learning approaches have been developed, such as
multi-task classification networks and multi-modal models [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Previous studies usually adopt
fusion techniques to aggregate features from text and images to obtain multi-modal information
for better sentiment classification performance[
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], but none of them has shed light on the
hierarchical features of sentiment labels.
      </p>
      <p>
        Vision-Language Pre-training. Recently, there have been plenty of multi-modal models
combining modules from different fields in various ways. They have achieved surprising
results, especially in the image-text field. ConVIRT[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] uses paired descriptive text to learn
medical visual representations successfully, while CLIP[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has impressive performance in
zero-shot transfer to downstream tasks by pre-training on huge amounts of image-text
pairs and modifying the ConVIRT[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] architecture. The Google research team proposed
CoCa[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], an image-text encoder-decoder foundation model pre-trained with contrastive loss
and captioning loss. It combines the abilities of contrastive approaches like CLIP[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and generative
methods like SimVLM[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In this challenge, we use CLIP as a multi-modal feature encoder to
extract rich vision-language information from the meme.
      </p>
      <p>
        Knowledge Distillation. Knowledge distillation is a technique used in model
compression[
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. The main concept is to transfer the knowledge from a complex model to another,
simpler model so that this small, simple model can also achieve the same effect as the complex
model. In the vanilla setting, it is usually implemented in the framework of the teacher-student
concept: a large deep neural network is regarded as a teacher training a smaller student neural
network from its logits. Even when the teacher model and student model are the same, it can
still improve the generalization and robustness of semi-supervised models. A framework in which
the teacher model and the student model share the same architecture is called self-distillation[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The
Cooperative Teaching Model (section 4.2) is based on self-distillation and provides the teacher
with additional information to make it easier to learn.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>
        The Memotion 3.0[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] shared task is the third iteration of the Memotion task which was first
conducted at SemEval 2020. The Memotion 3.0 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] dataset is made up of training, validation,
and testing sets at a ratio of 5:1:1. Each sample includes an image and
the corresponding caption extracted by an OCR system. In Tables 1-3, we show the details and
the label distributions for each of the different tasks:
• Task A: Sentiment analysis. Given a meme image and its caption, the goal is to classify
the sentiment into three labels, namely positive, neutral, and negative.
• Task B: Emotion classification. Given a meme image and its caption, the task aims
to identify the types of emotion the meme belongs to, including humorous, sarcastic,
offensive, and motivational. Each meme can express more than one emotion.
• Task C: Scales/Intensity of Emotion Classes. The goal of this task is to quantify the
intensity of each emotion. The scales of each emotion class are from 0 to 3 for humorous,
sarcastic, and offensive, but only 0 and 1 for motivational.
      </p>
      <p>[Table 1: Sentiment label distribution (negative/neutral/positive) for the overall, train, validation, and test splits of Task A.]</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodologies</title>
      <sec id="sec-4-1">
        <title>4.1. Meme Encoder</title>
        <p>
          Several powerful methods[
          <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
          ] have been proposed for feature extraction in the vision
and language domains. We decided to use two types of encoders to obtain better semantic
features for the multi-modal problems: (1) direct features from a Swin Transformer[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] which
is pre-trained on the ImageNet-21k dataset, and will then be fine-tuned on the Memotion task
dataset, and (2) a CLIP[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] model. CLIP is composed of an image encoder and a text encoder,
both jointly pre-trained to project the image and the caption onto the same embedding space in
a contrastive manner. In this way, the extracted image embeddings and the caption embeddings
are aligned, and the images will be near the captions with similar semantic features. We adopt
ViT[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] as the image encoder and DistilBERT[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] as the text encoder in our CLIP model.
        </p>
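        <p>To make the alignment idea concrete, the following minimal sketch (purely illustrative; the embedding dimension and the use of cosine similarity follow the general CLIP recipe and are assumptions rather than our exact setup) shows how an image embedding can be scored against several caption embeddings once both live in the same space.</p>
        <preformat>
import torch
import torch.nn.functional as F

def clip_similarity(image_emb, caption_embs):
    """Cosine similarity between one image embedding and several caption
    embeddings that live in the same CLIP embedding space."""
    image_emb = F.normalize(image_emb, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    return caption_embs @ image_emb  # higher score = semantically closer

image_emb = torch.randn(512)        # stand-in CLIP image embedding
caption_embs = torch.randn(3, 512)  # stand-in CLIP caption embeddings
print(clip_similarity(image_emb, caption_embs))
</preformat>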
        <p>Feature Extraction Pipeline. For each of the following downstream tasks, the first step
of computation is to extract the features of the meme images and their captions. The Swin
Transformer and the CLIP image encoder will encode the meme images into two vectors
respectively, and the CLIP text encoder will also be used to generate the caption embeddings.
The output multi-modal embedding tuple is made up of the above three embeddings.</p>
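        <p>As a rough illustration of this pipeline, the sketch below is a minimal, hypothetical PyTorch version of the Meme Encoder; the toy encoder stand-ins, embedding sizes, and tensor shapes are assumptions for illustration, not the exact configuration used in our experiments.</p>
        <preformat>
import torch
import torch.nn as nn

class MemeEncoder(nn.Module):
    """Hypothetical sketch of the Meme Encoder: Swin image features plus
    CLIP-aligned image/text features, returned as a three-part tuple."""
    def __init__(self, swin, clip_image, clip_text):
        super().__init__()
        self.swin = swin              # fine-tuned on the Memotion data
        self.clip_image = clip_image  # frozen CLIP image encoder (ViT)
        self.clip_text = clip_text    # frozen CLIP text encoder (DistilBERT)

    @torch.no_grad()
    def clip_features(self, image, caption_tokens):
        # The pre-trained CLIP encoders stay frozen in the downstream tasks.
        return self.clip_image(image), self.clip_text(caption_tokens)

    def forward(self, image, caption_tokens):
        swin_emb = self.swin(image)
        clip_img_emb, clip_txt_emb = self.clip_features(image, caption_tokens)
        return swin_emb, clip_img_emb, clip_txt_emb  # multi-modal tuple

# Toy stand-ins so the sketch runs without the real checkpoints.
swin = nn.Sequential(nn.Flatten(), nn.LazyLinear(768))
clip_image = nn.Sequential(nn.Flatten(), nn.LazyLinear(512))
clip_text = nn.Sequential(nn.Flatten(), nn.LazyLinear(512))
encoder = MemeEncoder(swin, clip_image, clip_text)

image = torch.randn(2, 3, 224, 224)      # batch of meme images
caption_tokens = torch.randn(2, 77, 64)  # stand-in for tokenized captions
print([e.shape for e in encoder(image, caption_tokens)])
</preformat>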
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task A: Cooperative Teaching Model (CTM)</title>
        <p>We present our proposed model for Task A, called the Cooperative Teaching Model (CTM). An
overview of the CTM is illustrated in Figure 2. Task A aims to classify the meme into three
categories based on the expressed sentiment. However, we believe that the three categories
should be regarded as different extents between positive and negative sentiment. That is, a
neutral meme actually belongs to either the positive or the negative class, but only implicitly. Based on this
idea, we introduce the concept of knowledge distillation to design the framework that has two
teacher models to teach their student models how to classify sentiment respectively. The two
teachers are a good teacher and a bad teacher. In the training period, the good teacher teaches
students how to judge the positive sentiment of memes, and vice versa. In the inference period,
we classify the meme into three classes according to the judgment of the student model.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Teacher Model</title>
<p>The difference between the teacher model and the student model is that in addition to the
features of the meme images and their captions, the input of the teacher model also includes
additional information to help meme sentiment classification. The reason is to make the teacher
model worth being imitated by the student model and to let the teacher model learn faster
than the student model.</p>
          <p>Since the neutral class actually has slight positive or negative sentiment, we regard it as
representing both positive and negative sentiment and merge the three categories into two (the
pre-label in Figure 2). This pre-label will be provided as additional input information to the
teacher model for training, helping the teacher model classify memes more easily.</p>
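          <p>As a minimal sketch of the pre-label construction described above (the label indices are an assumption), neutral is treated as carrying both slight positive and slight negative sentiment, so each three-way label is mapped to one binary target for the good teacher and one for the bad teacher.</p>
          <preformat>
# Hypothetical label indices: 0 = negative, 1 = neutral, 2 = positive.
def make_pre_labels(sentiment_label):
    """Merge the three sentiment classes into two binary pre-labels:
    one for the good (positive) teacher and one for the bad (negative)
    teacher. Neutral is regarded as implicitly both positive and negative."""
    positive_pre_label = 1 if sentiment_label in (1, 2) else 0
    negative_pre_label = 1 if sentiment_label in (0, 1) else 0
    return positive_pre_label, negative_pre_label

print(make_pre_labels(2))  # positive -> (1, 0)
print(make_pre_labels(1))  # neutral  -> (1, 1)
print(make_pre_labels(0))  # negative -> (0, 1)
</preformat>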
          <p>The goal of the teacher model is to learn how to classify whether the sentiment of the meme is
positive or negative, and the results are provided for the students to learn. We add a regularization
term that encourages the teacher model's predicted degree of positive or negative sentiment to
conform to a Gaussian distribution. Table 1 shows that the probability of extreme sentiment
should be small. Therefore, the output probability distribution of the two teachers should also
approach the Gaussian distribution, which will be more realistic.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Student Model</title>
<p>The goal of the student model is to approximate the output of the teacher model as much as
possible. During the training process of the student model, we record its confidence in the
sentiment classification. Just like a real student in the learning process, a slight change to a
difficult or unfamiliar question increases the uncertainty of the student's answer. We bring this
learning process into the student model and add Gaussian noise to the same meme embedding
as a disturbance. If the standard deviation of the resulting prediction distribution is small, the
student can be considered to have great confidence in the judgment; likewise, if the standard
deviation is large, the student can be considered to have little confidence in the judgment.
Therefore, we train the student models to predict with great confidence by minimizing the
standard deviation. We also record the mean of the student models' predictions on the disturbed
memes during the training phase as the threshold for determining whether a meme is negative
or positive during the inference phase. Compared with the common default threshold of 0.5,
such a threshold gives the student model stricter standards for classification and ensures a
certain amount of neutral predictions.</p>
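          <p>The sketch below illustrates this perturbation idea under stated assumptions (any scalar-probability student classifier, K noisy copies per meme, and a hypothetical noise scale): the standard deviation over the noisy copies measures confidence, and the recorded mean later serves as the classification threshold.</p>
          <preformat>
import torch

def student_confidence(student, meme_embedding, k=1000, noise_std=0.1):
    """Perturb the same meme embedding with K Gaussian noises, run the
    student, and return the mean (later used as the inference threshold)
    and the standard deviation (small std = high confidence)."""
    noisy = meme_embedding.unsqueeze(0) + noise_std * torch.randn(
        k, *meme_embedding.shape)
    probs = torch.sigmoid(student(noisy)).squeeze(-1)  # K predictions
    return probs.mean(), probs.std()

# Toy stand-in student and embedding so the sketch runs.
student = torch.nn.Linear(512, 1)
meme_embedding = torch.randn(512)
threshold, uncertainty = student_confidence(student, meme_embedding, k=100)
print(float(threshold), float(uncertainty))
</preformat>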
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Loss function</title>
          <p>We let $N$ be the number of samples. The ground truth is represented by a pre-label during
training, so there are only two categories of sentiment, namely positive and negative. We train
the Cooperative Teaching Model with the loss function
$\mathcal{L} = \mathcal{L}_{T} + \mathcal{L}_{KL} + \mathcal{L}_{S} + \mathcal{L}_{std}$, where:
• $\mathcal{L}_{T}$ is the binary cross-entropy loss between the predictions of the teacher model and the
corresponding pre-labels:
$\mathcal{L}_{T} = -\sum_{i=1}^{N} \big( y_i \log(p^{T}_i) + (1 - y_i) \log(1 - p^{T}_i) \big)$
• $\mathcal{L}_{KL}$ is the Kullback–Leibler divergence between the probability distribution of the teacher
model (denoted by $P_T$) and a Gaussian distribution $\mathcal{N}(\mu, \sigma^{2})$ with learnable mean and variance:
$\mathcal{L}_{KL} = \mathrm{KL}\big(P_T \,\|\, \mathcal{N}(\mu, \sigma^{2})\big)$
It is used to regularize the teacher models to output a more realistic distribution.
• $\mathcal{L}_{S}$ is the mean square error (MSE) between each prediction of the student model and the
prediction of the corresponding teacher model:
$\mathcal{L}_{S} = \frac{1}{N} \sum_{i=1}^{N} \big( p^{S}_i - p^{T}_i \big)^{2}$
• $\mathcal{L}_{std}$ is the standard deviation of the probability distribution from the student model for
the same meme with different Gaussian noises; the smaller the standard deviation, the
greater the confidence. For each meme, we generate $K$ different meme embeddings with
Gaussian noise, where $K = 1000$ by default.</p>
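          <p>Putting the four terms together, a minimal sketch of the total CTM loss for one teacher-student pair could look like the following; the tensor names are hypothetical, and the Gaussian regularization term is approximated here by moment-matching the teacher's outputs against a learnable Normal distribution.</p>
          <preformat>
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def ctm_loss(teacher_probs, student_probs, pre_labels,
             noisy_student_probs, mu, sigma):
    """Sketch of the CTM loss for one teacher-student pair:
    L = L_T + L_KL + L_S + L_std (see the equations above)."""
    # L_T: binary cross-entropy of teacher predictions vs. pre-labels
    l_t = F.binary_cross_entropy(teacher_probs, pre_labels)
    # L_KL: pull the teacher's output distribution toward a Gaussian with
    # learnable mean/variance (approximated by moment matching here)
    teacher_dist = Normal(teacher_probs.mean(), teacher_probs.std() + 1e-6)
    l_kl = kl_divergence(teacher_dist, Normal(mu, sigma)).mean()
    # L_S: MSE between student and teacher predictions
    l_s = F.mse_loss(student_probs, teacher_probs.detach())
    # L_std: std of the student's predictions over K noisy copies per meme
    l_std = noisy_student_probs.std(dim=0).mean()
    return l_t + l_kl + l_s + l_std

# Toy example with a batch of 8 memes and K = 16 noisy copies each.
teacher_probs = torch.rand(8)
student_probs = torch.rand(8)
pre_labels = torch.randint(0, 2, (8,)).float()
noisy_student_probs = torch.rand(16, 8)
mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(0.2, requires_grad=True)
print(float(ctm_loss(teacher_probs, student_probs, pre_labels,
                     noisy_student_probs, mu, sigma)))
</preformat>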
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Tasks B&amp;C: Cascaded Emotion Classifier (CEC)</title>
        <p>Tasks B and C are essentially related since we can get the prediction of Task B by a simple
transformation based on the prediction of Task C. For instance, if the classifier predicts very
offensive in Task C, the prediction of the class offensive in Task B can be 1. In light of this, we
propose a framework combining the two classification tasks by leveraging the prediction of
Task C as a suggestion for Task B. Specifically, given a meme image and its caption in Task C, a
fusion layer will first combine the multi-modal information extracted by the Meme Encoder
and generate a fused embedding. Then the fused embedding is fed, together with the
multi-modal embedding, to four MLPs to predict the corresponding scales for each emotion class. Task B, as
an extension of Task C here, will dynamically assess whether the scale prediction of Task C
is trustworthy. More precisely, the prediction output of Task C will be concatenated with the
multi-modal embedding and fed to an MLP classifier to predict the emotion expressed by the
meme.</p>
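        <p>A minimal sketch of the cascaded forward pass described above follows; the dimensions, the fusion layer, and the head sizes are assumptions rather than our exact architecture.</p>
        <preformat>
import torch
import torch.nn as nn

class CascadedEmotionClassifier(nn.Module):
    """Hypothetical sketch of the CEC: Task C scale predictions are reused
    as a suggestion for the Task B emotion classifier."""
    def __init__(self, emb_dim=1792, fusion_dim=512,
                 scales=(4, 4, 4, 2)):  # humorous, sarcastic, offensive, motivational
        super().__init__()
        self.fusion = nn.Linear(emb_dim, fusion_dim)
        # One MLP per emotion class predicts its intensity scale (Task C).
        self.scale_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(fusion_dim + emb_dim, 128),
                          nn.ReLU(), nn.Linear(128, n)) for n in scales)
        # The Task B classifier sees the multi-modal embedding plus the Task C outputs.
        self.emotion_head = nn.Linear(emb_dim + sum(scales), len(scales))

    def forward(self, multimodal_emb):
        fused = torch.relu(self.fusion(multimodal_emb))
        head_in = torch.cat([fused, multimodal_emb], dim=-1)
        scale_logits = [head(head_in) for head in self.scale_heads]  # Task C
        emotion_in = torch.cat([multimodal_emb, *scale_logits], dim=-1)
        return scale_logits, self.emotion_head(emotion_in)           # Task B

cec = CascadedEmotionClassifier()
multimodal_emb = torch.randn(2, 1792)  # concatenated Swin + CLIP embeddings
scale_logits, emotion_logits = cec(multimodal_emb)
print([s.shape for s in scale_logits], emotion_logits.shape)
</preformat>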
        <sec id="sec-4-3-1">
          <title>4.3.1. Loss function</title>
          <p>We optimize Task B with a binary cross-entropy loss $\mathcal{L}_{B}$ and Task C with a softmax cross-entropy
loss $\mathcal{L}_{C}$, and the total loss is the sum of the two: $\mathcal{L} = \mathcal{L}_{B} + \mathcal{L}_{C}$. It is worth noting that we simplify the
notation with a single loss term for each emotion class.</p>
          <p>$\mathcal{L}_{B} = -\sum_{i=1}^{N} \big( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \big)$</p>
          <p>$\mathcal{L}_{C} = -\sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log(p_{i,j})$</p>
          <p>Here, $M$ denotes the number of scales of each emotion class, and $p_{i,j}$ denotes the predicted
probability of the $j$-th scale for a sample $i$.</p>
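          <p>Under the same assumptions as the CEC sketch above, the combined loss could be computed as follows (the tensor shapes are hypothetical).</p>
          <preformat>
import torch
import torch.nn.functional as F

def cec_loss(scale_logits, emotion_logits, scale_targets, emotion_targets):
    """Sketch of the CEC loss: softmax cross-entropy per emotion scale
    (Task C) plus binary cross-entropy over the emotion labels (Task B)."""
    l_c = sum(F.cross_entropy(logits, target)
              for logits, target in zip(scale_logits, scale_targets))
    l_b = F.binary_cross_entropy_with_logits(emotion_logits, emotion_targets)
    return l_b + l_c

# Toy targets matching the CEC sketch above (4 emotions; scales 4/4/4/2).
scale_logits = [torch.randn(2, n) for n in (4, 4, 4, 2)]
emotion_logits = torch.randn(2, 4)
scale_targets = [torch.randint(0, n, (2,)) for n in (4, 4, 4, 2)]
emotion_targets = torch.randint(0, 2, (2, 4)).float()
print(float(cec_loss(scale_logits, emotion_logits,
                     scale_targets, emotion_targets)))
</preformat>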
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment and Discussion</title>
      <p>
        For the CLIP model, we pre-train it on three datasets, namely MET-Meme[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], Memotion 1.0[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ],
and Memotion 3.0[
        <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The Memotion 2.0 dataset[24, 25] was not available online,
so we did not use it. The pre-trained CLIP model is frozen and is not fine-tuned on the
downstream tasks. In contrast, the Swin Transformer is fine-tuned on the downstream tasks, as
we believe that it can capture different perspectives of features from the CLIP model. All of
our experiments were conducted on a machine with an Nvidia RTX 3060 12GB GPU. For Task
A, since neutral is implicitly positive or negative sentiment, a neutral prediction is made only
when the predictions of the good student and the bad student are both smaller than their
respective thresholds. However, during the inference phase, most of the bad student's predictions
cannot reach the threshold, resulting in many negative sentiment memes being recognized as
neutral. To correctly classify the negatives hidden in the neutral class, we add a judgment rule
in the inference phase: when the prediction of the bad student is greater than the prediction of
the good student, the meme is classified as negative.
      </p>
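      <p>A small sketch of this inference rule under the stated assumptions (the thresholds are the means recorded during training; the function and variable names are hypothetical):</p>
      <preformat>
def classify_sentiment(good_pred, bad_pred, good_threshold, bad_threshold):
    """Sketch of the Task A inference rule: the good student scores
    positivity, the bad student scores negativity, and the learned
    thresholds replace the usual 0.5 cut-off."""
    if bad_pred > good_pred:  # extra rule to recover negatives hidden in neutral
        return "negative"
    confident_good = good_pred >= good_threshold
    confident_bad = bad_pred >= bad_threshold
    if not confident_good and not confident_bad:
        return "neutral"      # neither student is confident enough
    return "positive" if confident_good else "negative"

print(classify_sentiment(0.8, 0.1, good_threshold=0.6, bad_threshold=0.55))  # positive
print(classify_sentiment(0.3, 0.2, good_threshold=0.6, bad_threshold=0.55))  # neutral
print(classify_sentiment(0.2, 0.4, good_threshold=0.6, bad_threshold=0.55))  # negative
</preformat>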
      <sec id="sec-5-1">
        <title>5.1. Competition Results</title>
        <p>Our final weighted F1-scores are 0.342 for Task A, 0.784 for Task B, and 0.535 for Task C.</p>
        <p>
• The text in the Memotion 3.0 dataset is in Hinglish, which affected the performance of the
foundation model pre-trained on English data. If we could pre-train CLIP with other
foundation model pre-trained on English data. If we could pre-train CLIP with other
Hinglish meme datasets, or if the task was in English, the performance may be improved.
• The CLIP model can make the images near the captions with similar semantic features
by aligning the extracted image embedding and the caption embedding. However, the
text in a meme does not simply describe the things in the meme image but has implicit
meanings. This means that to correctly classify the sentiment and emotion of a meme,
besides recognizing the object or event in the meme image, we need to have enough
understanding of culture and society to understand the implicit meaning of the meme
with the help of the caption.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Ablation studies</title>
        <p>An extensive ablation study was conducted to verify the design of the Cooperative Teaching
Model (CTM) and the Cascaded Emotion Classifier (CEC). The ablation study for the Meme
Encoder was not conducted as it provided the multi-modal embeddings for each downstream
task. For CTM, we developed four variants to investigate the relative contributions of different
components: 1) w/o TR, which is CTM without the teacher model, and only uses the student
model with pre-labels for training; 2) w/o TD, which is the student model of the CTM using the
default threshold of 0.5 for judging positive or negative during evaluation. We also implement a
simple classifier, instead of using a pre-label, connecting the features extracted from the Meme
Encoder to a linear layer to classify 3 categories (denoted by a simple classifier). For CEC,
we remove the cascaded architecture to analyze the contributions (denoted by w/o C). The
performance of all variant models is reported in Table 5. We summarize the observations as
follows.</p>
        <p>[Table 5: Weighted F1-scores of the variant models for each task (Task A: w/o TR, w/o TD, w/o TR &amp; w/o TD, simple classifier, and CTM; Tasks B and C: w/o C and CEC).]</p>
        <p>
          We observe that all the designs in the CTM and CEC contribute to the corresponding tasks.
For CTM, the teacher model and the student model with learned thresholds need to cooperate
with each other to further improve the performance. In addition, removing both of them causes
a performance decline of 26.6%, which is 13.37% lower than the simple classifier. This indicates
that the design of merging the three categories into a binary pre-label needs to cooperate with
the teacher model and the student model with learned thresholds, and can then greatly improve
the performance, by about 13.23% over the simple classifier. Finally,
as mentioned earlier, the text in the Memotion 3.0[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] dataset is Hinglish. If we use the same
language for pre-training, we may be able to improve the performance. However, we were not
able to find another Hinglish dataset for more appropriate pre-training, and so decided to use
the Memotion 1.0[
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] dataset for verification. The experimental results show that our method
indeed improved performance, reaching a weighted F1-score of 0.4774.
        </p>
        <p>For the CEC, the results in Table 5 illustrate that task-specific networks still outperform our
model cascading Task B and Task C. However, we believe that the CEC architecture can be a
reference for similar emotion classification tasks.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions &amp; Future Work</title>
      <p>This work presents Team NYCU_TWO’s approach to classifying the emotion and the
corresponding intensity of memes from social media. Besides a powerful multi-modal feature extraction
pipeline with the integration of CLIP, our framework incorporates two models, namely the
Cooperative Teaching Model and the Cascaded Emotion Classifier, for Task A and Tasks B&amp;C.
We achieved competitive performance at the end of the challenge, showing the effectiveness of
the framework.</p>
      <p>For our future work, we plan to improve the model in two different directions. The first one
is the low-resource Hinglish problem: since the pre-trained language model is not trained
on Hinglish data as much as it is on English data, the extracted caption embeddings cannot
fully reflect the rich semantic information, including sentiment. Aggregating state-of-the-art
methods[26, 27] for low-resource languages may be able to address the issue. The second one is
the alignment problem of the CLIP model on memes. We find that, unlike common image-text
datasets for the VQA problem, in which the text describes the image well, meme captions are
not supplementary to the meme images. The CLIP model can pull an image and a text with
similar semantic meaning closer, but this is not the case for the meme image-text pairs here. It
will be an interesting research topic to design a better contrastive learning objective for meme
image-text pre-training.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[2] Merriam-Webster.com Dictionary, meme, Accessed 7 Dec</source>
          .
          <year>2022</year>
          . URL: https://www.merriam-webster.com/dictionary/meme.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Flaxman</surname>
          </string-name>
          ,
          <article-title>Multimodal sentiment analysis to explore the structure of emotions</article-title>
          ,
          <source>in: proceedings of the 24th ACM SIGKDD international conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>350</fpage>
          -
          <lpage>358</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <source>Amazon pars at memotion 2</source>
          .
          <article-title>0 2022: Multi-modal multi-task learning for memotion 2.0 challenge</article-title>
          , Proceedings http://ceur-ws.
          <source>org ISSN 1613</source>
          (
          <year>2020</year>
          )
          <fpage>0073</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>A.-M. Bucur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Cosma</surname>
            ,
            <given-names>I.-B.</given-names>
          </string-name>
          <string-name>
            <surname>Iordache</surname>
          </string-name>
          ,
          <source>Blue at memotion 2</source>
          .
          <article-title>0 2022: You have my image, my text and my transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2202.07543</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Poria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Morency</surname>
          </string-name>
          ,
          <article-title>Tensor fusion network for multimodal sentiment analysis</article-title>
          ,
          <source>in: Empirical Methods in Natural Language Processing</source>
          , EMNLP,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.-H. H.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Morency</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <article-title>Multimodal transformer for unaligned multimodal language sequences, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics</article-title>
          , Florence, Italy,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , H. Jiang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Miura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Langlotz</surname>
          </string-name>
          ,
          <article-title>Contrastive learning of medical visual representations from paired images and text</article-title>
          , arXiv preprint arXiv:
          <year>2010</year>
          .
          <volume>00747</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seyedhosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          , Coca:
          <article-title>Contrastive captioners are image-text foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2205</source>
          .
          <year>01917</year>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsvetkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          , Simvlm:
          <article-title>Simple visual language model pretraining with weak supervision</article-title>
          ,
          <source>arXiv preprint arXiv:2108.10904</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bucila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Niculescu-Mizil</surname>
          </string-name>
          ,
          <article-title>Model compression</article-title>
          ,
          <source>in: Knowledge Discovery and Data Mining</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          , et al.,
          <article-title>Distilling the knowledge in a neural network</article-title>
          ,
          <source>arXiv preprint arXiv:1503.02531 2</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bao</surname>
          </string-name>
          , K. Ma,
          <article-title>Be your own teacher: Improve the performance of convolutional neural networks via self distillation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3713</fpage>
          -
          <lpage>3722</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shreyash</surname>
          </string-name>
          , S. S,
          <string-name>
            <given-names>C.</given-names>
            <surname>Megha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Parth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aishwarya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amitava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manoj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Asif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Overview of memotion 3: Sentiment and emotion analysis of codemixed hinglish memes</article-title>
          ,
          <source>in: proceedings of defactify 2: second workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shreyash</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. S</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Parth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Megha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aishwarya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amitava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manoj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Asif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srijan</surname>
          </string-name>
          ,
          <article-title>Memotion 3: Dataset on sentiment and emotion analysis of codemixed hinglish memes</article-title>
          ,
          <source>in: proceedings of defactify 2: second workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , Efficientnet:
          <article-title>Rethinking model scaling for convolutional neural networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT (1), Association for Computational Linguistics</article-title>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , R. Soricut,
          <string-name>
            <surname>ALBERT:</surname>
          </string-name>
          <article-title>A lite BERT for self-supervised learning of language representations, in: ICLR, OpenReview</article-title>
          .net,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , in: International Conference on Learning Representations,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , T. Wolf,
          <article-title>Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter</article-title>
          , ArXiv abs/
          <year>1910</year>
          .01108 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naseriparsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Met-meme: A multimodal meme dataset rich in metaphors</article-title>
          ,
          <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22</source>
          ,
          <year>2022</year>
          , p.
          <fpage>2887</fpage>
          -
          <lpage>2899</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Paka</surname>
          </string-name>
          , Scott,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bhageria</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Gambäck</surname>
          </string-name>
          ,
          <source>Task Report: Memotion Analysis 1.0 @SemEval</source>
          <year>2020</year>
          :
          <article-title>The Visuo-Lingual Metaphor!</article-title>
          ,
          <source>in: Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020)</source>
          , Association for Computational Linguistics, Barcelona, Spain,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] S. Ramamoorthy, N. Gunti, S. Mishra, S. Suryavardan, A. Reganti, P. Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, et al., Memotion 2: Dataset on sentiment and emotion analysis of memes, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] P. Patwa, S. Ramamoorthy, N. Gunti, S. Mishra, S. Suryavardan, A. Reganti, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Findings of memotion 2: Sentiment and emotion analysis of memes, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26] Z. Wang, S. Mayhew, D. Roth, et al., Extending multilingual BERT to low-resource languages, arXiv preprint arXiv:2004.13640 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27] K. Ogueji, Y. Zhu, J. J. Lin, Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages, in: Proceedings of the 1st Workshop on Multilingual Representation Learning, 2021.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>