<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>wentaorub at Memotion 3: Ensemble learning for Multi-modal MEME classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wentao Yu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dorothea Kolossa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Electronic Systems of Medical Engineering</institution>
          ,
          <addr-line>TU Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Communication Acoustics, Ruhr University Bochum</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Memes, as a new means of creative expression on social networks, provide an appealing multi-modal form of communication. However, some memes are being used to express hatred, which can take a toll on people's mental health and on societal cohesion. This year's Memotion 3.0 challenge provides an English and a mixed Hindi-English meme dataset for three classification tasks: Task A is sentiment analysis to classify a given meme as positive, negative, or neutral. In Task B, emotion classification, a meme should be identified as humorous, sarcastic, offensive, or motivational. Finally, Task C asks to predict the intensity of the emotion classes in Task B. Both text and image data play a role in the identification and classification of hateful memes. While such multi-modality can be helpful in many contexts, here, it also increases the challenge of the classification tasks due to the nature of memes, which often achieve their humorous effects through juxtaposition and irony. To address this difficulty, we adopt a multi-headed self-attention mechanism to integrate the text and image information in a learned, task-adapted manner. The gradient blending algorithm prevents overfitting issues in the multi-modal model. Our uni-modal models, which feed into the attention mechanism, are based on the CLIP model due to its outstanding performance on zero-shot classification tasks. Ultimately, with an ensemble strategy of our two best-performing models, our submission only reaches a 0.3289 weighted F1 score on sub-task A, but it ranks 1st on the two final Tasks B and C, with respective scores of 0.7977 and 0.5982. Our code will be made available at: https://github.com/wentaoxandry/Memotion3.0_challenge.git</p>
        <p>De-Factify 2: 2nd Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2023. 2023 Washington, DC, USA. Contact: wentao.yu@rub.de (W. Yu); dorothea.kolossa@tu-berlin.de (D. Kolossa). Web: https://cognitive-signal-processing.de/index.php/team/ (W. Yu); https://www.tu.berlin/en/mtec (D. Kolossa). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      </abstract>
      <kwd-group>
        <kwd>Ensemble</kwd>
        <kwd>CLIP</kwd>
        <kwd>OSCAR</kwd>
        <kwd>multimodal</kwd>
        <kwd>memes classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        It is well-known that multi-modal machine learning can vastly outperform uni-modal learning,
at least when the system is set up appropriately. For example, in audio-visual speech recognition,
visual information can complement speech signals to significantly improve recognition rates [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1,
2, 3</xref>
        ]. However, memes often express opinions in an implied manner. The text and image may
even have opposite meanings in isolation and can be combined ironically. This characteristic
of memes leads to a new type of challenge in automatic classification tasks. In order to study
this problem, the Memotion 3.0 challenge provides a Hinglish meme dataset for three meme
classification tasks [
        <xref ref-type="bibr" rid="ref10 ref4 ref5 ref6 ref7 ref8 ref9">4, 5, 6, 7, 8, 9, 10</xref>
        ].
      </p>
      <p>
        In this work, we consider transfer learning to customize two multi-modal models based
on the Transformer model: the CLIP model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and the OSCAR model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The text and
image encoders from the CLIP model are optimized as two uni-modal (text and image) models.
Ultimately, the ensemble strategy is applied for better performance.
      </p>
      <p>The paper is organized as follows: Section 2 reviews related approaches to the task. Our
system framework is described in Section 3, followed by the experimental setup in Section 4.
Finally, our results are presented in Section 5 and conclusions are drawn in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The transformer model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is widely used in natural language processing tasks due to its
outstanding performance. In recent years, a number of works have expanded the capability of
the transformer model towards multi-modal tasks. For example, the OSCAR model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] adopts
the Faster R-CNN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to extract visual embeddings of the detected object regions. In addition,
the Faster R-CNN model outputs the detected object tags, which are considered as additional
anchor points to improve the learning performance of alignments. Subsequently, an attention
mechanism walks through the combined text-image sequence embeddings.
      </p>
      <p>
        Recently, contrastive learning has drawn much attention due to its outstanding performance
on zero-shot prediction [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. In this work, we consider the CLIP model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to extract
contrastive embeddings, since memes contain various image types, which causes difficulties in
classification. The remarkable zero-shot prediction accuracy of the CLIP model can help us to
alleviate this problem. The CLIP model uses a pre-trained BERT model to extract text context
classification features and a Vision Transformer [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for obtaining image classification features.
Contrastive learning is adopted to learn multi-modal embeddings without manual labels by
teaching the CLIP model about the similarity of different data points. Assuming a mini training
batch with meme OCR texts $T = \{t_1, t_2, \dots, t_B\}$ and images $I = \{i_1, i_2, \dots, i_B\}$, where $B$ is
the batch size, the CLIP model learns to match the OCR text and image as follows: the extracted
text classification features $F_{\mathrm{cls},t} \in \mathbb{R}^{B \times d_t}$ and image classification features $F_{\mathrm{cls},i} \in \mathbb{R}^{B \times d_i}$ are
computed by
      </p>
      <p>$$F_{\mathrm{cls},t} = \mathrm{encoder}_t(T), \qquad F_{\mathrm{cls},i} = \mathrm{encoder}_i(I),$$</p>
      <p>
        where $d_t$ and $d_i$ are the attention dimensions of the text and image encoder, respectively. The
classification features are mapped to the same dimension $d$ and $\ell_2$-normalized.
The contrastive logits $x$ are derived as their scaled, pairwise cosine similarities:
$$x = \left( \lVert W_t \cdot F_{\mathrm{cls},t} \rVert_2 \cdot \lVert W_i \cdot F_{\mathrm{cls},i} \rVert_2^{\top} \right) \times \tau,$$
where $\tau$ is a learnable parameter. As in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the labels are the one-hot encoded labels of the set
$y = [1, 2, \dots, B]$. The loss function is defined as:
$$\mathcal{L} = 0.5 \cdot \mathrm{CE}(\hat{y}_{\mathrm{axis}=0}, y) + 0.5 \cdot \mathrm{CE}(\hat{y}_{\mathrm{axis}=1}, y),$$
      </p>
      <p>where CE is the cross-entropy, $\hat{y}_{\mathrm{axis}=0} = \mathrm{softmax}_{\mathrm{axis}=0}(x)$ and $\hat{y}_{\mathrm{axis}=1} = \mathrm{softmax}_{\mathrm{axis}=1}(x)$. Finally, the learned text
and image classification features $F_{\mathrm{cls},t}$, $F_{\mathrm{cls},i}$ are used in our proposed CLIP-based text, image,
and multi-modal models.</p>
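      <p>The contrastive objective described in this section can be sketched in a few lines. The following is a minimal, dependency-free Python illustration and not the authors' implementation: the function names, the plain-list representation of the features, and the assumption that both inputs are already projected to the shared dimension are choices made for this sketch. It normalizes both feature sets, forms the scaled pairwise cosine similarities, and averages the cross-entropy taken along the two axes of the similarity matrix, with the matching text-image pair of each sample on the diagonal.</p>
      <preformat>
```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(a - m) for a in row))
    return [a - log_z for a in row]


def contrastive_loss(text_feats, image_feats, scale):
    """Symmetric cross-entropy over scaled pairwise cosine similarities.

    text_feats, image_feats: one feature vector per sample, already in the
    shared dimension (assumption of this sketch).
    scale: scalar multiplier applied to the cosine similarities.
    """
    t = [l2_normalize(v) for v in text_feats]
    i = [l2_normalize(v) for v in image_feats]
    batch = len(t)
    # x[r][c] is the scaled cosine similarity between text r and image c.
    x = [[scale * sum(a * b for a, b in zip(t[r], i[c])) for c in range(batch)]
         for r in range(batch)]
    # The correct partner of sample k sits on the diagonal (label k).
    # Softmax along rows: each text is scored against all images.
    loss_rows = -sum(log_softmax(x[k])[k] for k in range(batch)) / batch
    # Softmax along columns: each image is scored against all texts.
    cols = [[x[r][c] for r in range(batch)] for c in range(batch)]
    loss_cols = -sum(log_softmax(cols[k])[k] for k in range(batch)) / batch
    return 0.5 * loss_rows + 0.5 * loss_cols
```
      </preformat>
      <p>For two perfectly aligned text-image pairs and a scale of 10, this loss is close to zero; swapping the two images raises it to roughly 10, because every text is then most similar to the wrong image.</p>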
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
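      <p>[Figure 1 (a) Text model. Block labels recoverable from the figure: input text $T$, CLIP text encoder, sequence embeddings $E_t$, RNN, $h$, $F_{\mathrm{cls},t}$, Classifier 1, Classifier 2, summation (+), text logits $x_t$.]</p>
      <p>… are utilized in the multi-modal model, where $N_i$ is the number of patches of the image.</p>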
      <p>Figure 2 depicts the proposed multi-modal model. The sequence embeddings from the text
and image model in Figure 1 are concatenated along the sequence dimension as multi-modal
embeddings</p>
      <p>$$E = [E_t; E_i], \quad \text{where } E \in \mathbb{R}^{(N_t + N_i) \times d}.$$</p>
      <p>The complete embedding sequence $E$ is fed into six multi-head attention (MHA) blocks. We
removed the MHA block's residual connection and dropout layer based on our experimental
results. Finally, the classifier, which has the same structure as the text and image classifiers,
uses the multi-modal classification embedding $F \in \mathbb{R}^{2d}$ to obtain the multi-modal logits
$x_{\mathrm{multi}}$, where</p>
      <p>$$F = [\tilde{E}_t[0, :], \tilde{E}_i[0, :]].$$</p>
      <p>$\tilde{E}_t[0, :]$ and $\tilde{E}_i[0, :]$ are the first classification embeddings after the six MHA blocks.</p>
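      <p>The shape bookkeeping behind the fusion step can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions: the function names are hypothetical, embeddings are plain nested lists, and the six MHA blocks are not modeled at all, so the fused sequence is assumed to keep the layout of the concatenated input. It shows that the text and image sequences are stacked along the sequence dimension and that the classification embedding joins the first text position with the first image position, which sits at row index equal to the number of text positions.</p>
      <preformat>
```python
def concat_sequences(e_text, e_image):
    """Concatenate text and image embeddings along the sequence dimension."""
    width = len(e_text[0])
    if any(len(row) != width for row in e_text + e_image):
        raise ValueError("all embeddings must share the same dimension")
    return e_text + e_image


def classification_embedding(e_fused, n_text):
    """Join the first text-position and first image-position vectors.

    e_fused: sequence after the attention blocks; assumed here to keep the
             same row layout as the concatenated input.
    n_text:  number of text positions, so the image part starts at row n_text.
    """
    return e_fused[0] + e_fused[n_text]
```
      </preformat>
      <p>With three text positions and two image patches of dimension 2, the concatenated sequence has five rows and the resulting classification embedding has length 4, matching the doubled embedding dimension stated above.</p>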
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>
        The Memotion 3.0 challenge has three sub-tasks. Task A is to classify a meme as positive,
negative, or neutral. In Task B, a given meme should be identified as humorous, sarcastic,
offensive, or motivational. It is a multi-label classification task, so a meme can belong to more
than one category. Finally, Task C asks to predict the intensity of the emotion classes in Task B.
In our work, we only optimized the models for Task A and Task C. Task B results are obtained
from Task C. For example, we consider a meme as humorous if it is classified as F, VF, or H
(detailed in Table 2), and as not humorous otherwise. Since the Memotion3 dataset contains English as well as
mixed Hindi-English memes, we back-translate the texts, first to Hindi and then to English, with
the Python translators package.
      </p>
      <p><sup>1</sup> https://github.com/schesa/ImgFlip575K_Dataset</p>
      <p>
        All models are trained using the PyTorch library [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The AdamW optimizer [22] is used
for optimization. The CLIP model (detailed in Section 2) is first optimized using the
pretrainCLIP dataset, where the contrastive loss is the objective function. The pre-trained
CLIP model thus learns to match the meme image and text and is denoted as $\mathrm{CLIP}_{\mathrm{pre}}$. For
uni-modal training, the pretrainuni dataset is used to fine-tune the CLIP component
models $\mathrm{CLIP}_{\mathrm{pre},t}$ and $\mathrm{CLIP}_{\mathrm{pre},i}$ within the two overall model structures as shown in Figure 1.
The model parameters of the CLIP component models CLIP-text and CLIP-image are
initialized by those of the respective $\mathrm{CLIP}_{\mathrm{pre}}$ model and optimized on the Memotion3 dataset. The
multi-modal model CLIP-multi parameters are then initialized by the uni-modal models. We
train one model for Task A with an output dimension of 3 and four models for the four aspects
of Task C with output dimensions 4, 4, 4, and 2. The dropout rate in all classifiers is 0.1. The
attention dimensions of the text encoder and image encoder are 512 and 768, respectively. The
dataset for Task A is balanced. Therefore, we simply use the cross-entropy (CE) as the loss
function for Task A. In contrast, the dataset for Task C is quite imbalanced. The focal loss (F)
function [23] is therefore selected as the loss function for training the respective classifiers.
      </p>
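      <p>To illustrate why the focal loss suits the imbalanced Task C data, the following minimal Python sketch implements its basic multi-class form for a single sample. It is not taken from the authors' code: the function names are hypothetical, the optional class-weighting term of [23] is omitted, and the focusing parameter gamma with a default of 2.0 follows the setting reported as effective in [23], since the value used in this work is not stated. With gamma set to 0 the expression reduces to the ordinary cross-entropy.</p>
      <preformat>
```python
import math


def softmax(logits):
    """Numerically stable softmax of a list of floats."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)**gamma * log(p_t).

    p_t is the predicted probability of the true class. gamma is an assumed
    focusing parameter; gamma=0 gives the plain cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```
      </preformat>
      <p>For the logits [4, 0, 0], a sample of class 0 is already classified confidently and contributes almost nothing to the loss, whereas a sample of class 1 keeps nearly its full cross-entropy. This shifts the training signal towards hard and rare examples.</p>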
      <p>In this work, we adopt Gradient-Blending (GB) [24] to reduce the effect of overfitting. The
multi-modal model (Figure 2) is based on the text and image model (Figure 1). Therefore, the
text and image model logits $x_t$ and $x_i$ are also available in the multi-modal model. Taking the
gradient of the blended loss</p>
      <p>$$\mathcal{L} = \sum_{m} \mathrm{CE}_m, \tag{7}$$</p>
      <p>where $m \in \{\text{text}, \text{image}, \text{multi-modal}\}$, produces the blended gradient. It should be emphasized
that the multi-modal predictions are only obtained from the multi-modal logits $x_{\mathrm{multi}}$. Finally,
Table 3 gives an overview of the use of the loss functions in training all models.</p>
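      <p>The blended loss can be sketched as a sum of per-head losses over the text, image, and multi-modal logits. The Python code below is a hedged illustration rather than the authors' training code: the function names and the dictionary layout are hypothetical, and the optional per-head weights are an assumption of this sketch (the Gradient-Blending method of [24] estimates such weights from overfitting behavior, while uniform weights give a plain sum). The prediction helper reflects the statement that only the multi-modal logits are used at inference time.</p>
      <preformat>
```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(a - m) for a in logits))
    return log_z - logits[target]


def blended_loss(logits_by_head, target, weights=None):
    """Weighted sum of the cross-entropy losses of all heads.

    logits_by_head: dict with keys such as "text", "image", "multi-modal".
    weights: optional dict of per-head weights (assumed; defaults to 1.0).
    """
    if weights is None:
        weights = {name: 1.0 for name in logits_by_head}
    return sum(weights[name] * cross_entropy(logits, target)
               for name, logits in logits_by_head.items())


def predict(logits_by_head):
    """Predicted class index, taken from the multi-modal head only."""
    x_multi = logits_by_head["multi-modal"]
    return max(range(len(x_multi)), key=x_multi.__getitem__)
```
      </preformat>
      <p>Because every head contributes its own loss term, the uni-modal branches keep receiving a direct training signal even when the fused head starts to overfit, while the final decision still depends on the multi-modal logits alone.</p>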
      <p>We use the Python Ray package<sup>2</sup> to find the best-performing hyperparameters. The training
process is carried out on NVIDIA's Volta-based DGX-1 multi-GPU system, using three Tesla V100
GPUs with 32 GB memory each.</p>
      <p><sup>2</sup> https://github.com/ray-project/ray</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>This work considers the CLIP-text, CLIP-image (in Figure 1), CLIP-multi (in Figure 2),
and OSCAR models. For better performance, majority voting is adopted to ensemble
different models' decisions. Ensemble-1 fuses the prediction decisions of the candidate
models CLIP-text0, CLIP-image0, CLIP-multi0, and OSCAR0, while Ensemble-2 also takes
CLIP-text1, CLIP-image1, CLIP-multi1, and OSCAR1 into consideration. We iterate over
all possible model combinations and apply majority voting on the validation set to find the
best-performing model combinations. Then, these combinations are used to fuse the test set
predictions.</p>
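      <p>The ensemble search can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the function names are hypothetical, ties in the vote are resolved in favor of the label seen first (the tie-breaking rule is not specified in the text), and plain accuracy stands in for the weighted F1 score that is used on the validation set. The search tries every non-empty subset of candidate models, fuses their predictions by majority vote, and keeps the best-scoring subset.</p>
      <preformat>
```python
from collections import Counter
from itertools import combinations


def majority_vote(predictions):
    """Fuse per-model label lists; ties go to the label seen first."""
    fused = []
    for votes in zip(*predictions):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


def accuracy(pred, gold):
    """Fraction of matching labels (stand-in for weighted F1)."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def best_combination(model_preds, gold, score=accuracy):
    """Evaluate every non-empty model subset on validation labels."""
    best_names, best_score = None, -1.0
    names = sorted(model_preds)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            fused = majority_vote([model_preds[n] for n in subset])
            s = score(fused, gold)
            if s > best_score:
                best_names, best_score = subset, s
    return best_names, best_score
```
      </preformat>
      <p>The subset selected on the validation set would then be reused unchanged to fuse the test set predictions, as described above. The exhaustive search grows exponentially with the number of candidates, which remains manageable for the eight candidate models considered here.</p>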
      <p>Table 4 lists the weighted F1 score on the validation set. For Task A ("Overall" column in
Table 4), the CLIP-text model performs better than the CLIP-image model. The score of
the CLIP-multi setup lies between those of the former two models. Ensemble-2 improves
the weighted F1 score to 0.4453. The model for motivation classification has scores above 0.9,
because the binary classification dataset is imbalanced. Comparing the best-performing text
and image models (CLIP-text0 and CLIP-image0), the image model shows a slightly better
performance in Task C. The CLIP-multi0 model without GB training performs far worse than
its gradient-blending counterpart. Overall, Ensemble-2 shows the best performance in Task
A and Task C. Ultimately, the strategy of ensembling the top two models yields a 0.3289 (5th)
weighted F1 score on Task A, 0.7977 (1st rank) on Task B and 0.5982 (also 1st rank) on Task C.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This work proposes a multi-modal CLIP-based meme classification system, which owes its
capabilities on this rather small dataset to the outstanding zero-shot performance of the CLIP
model. The text model combines the CLIP model text encoder with 2 BiLSTM layers; the image
model is fine-tuned on the Memotion 3.0 dataset. The proposed multi-modal model integrates
the text and image embeddings from the text and image encoders in 6 multi-head self-attention
blocks. Gradient blending prevents the fusion model from overfitting. The OSCAR model is
used both as a baseline model and as a participant model in our ensemble strategy, which further
serves to improve the system performance. Our ensemble of the top two models yields clearly
better results than any single model, winning Tasks B and C in the Memotion 3.0 challenge.
The experimental results of the challenge do indicate, however, that sentiment analysis in
memes is difficult for machine learning. The next goal of our work is therefore to develop
mechanisms for understanding multi-modal, contrasting information, e.g. conveying irony, to
improve sentiment classification performance for memes and social media posts.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The work was supported by the PhD School “SecHuman - Security for Humans in Cyberspace”
by the federal state of NRW, and partially funded by the Deutsche Forschungsgemeinschaft (DFG
– German Research Foundation) [Project-ID 429873205] and by the German Federal Ministry of
Education and Research [“noFake”, Grant No: 16KIS1519]. The authors are responsible for the
content of this publication.</p>
      <p>[21] … N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep
learning library, Advances in Neural Information Processing Systems (2019).</p>
      <p>[22] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint
arXiv:1711.05101 (2017).</p>
      <p>[23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in:
Proc. ICCV, 2017, pp. 2980–2988.</p>
      <p>[24] W. Wang, D. Tran, M. Feiszli, What makes training multi-modal classification networks
hard?, in: Proc. CVPR, 2020, pp. 12695–12705.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kolossa</surname>
          </string-name>
          ,
          <article-title>Multimodal integration for large-vocabulary audio-visual speech recognition</article-title>
          ,
          <source>in: Proc. 28th European Signal Processing Conf. (EUSIPCO)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>341</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kolossa</surname>
          </string-name>
          ,
          <article-title>Fusing information streams in end-to-end audio-visual speech recognition</article-title>
          ,
          <source>in: Proc. ICASSP</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>3430</fpage>
          -
          <lpage>3434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boenninghoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Roehrig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kolossa</surname>
          </string-name>
          ,
          <article-title>RubCSG at SemEval-2022 Task 5: Ensemble learning for identifying misogynous MEMEs</article-title>
          ,
          <source>arXiv preprint arXiv:2204.03953</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramamoorthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gunti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Findings of Memotion 2: Sentiment and emotion analysis of memes</article-title>
          ,
          <source>in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection</source>
          , CEUR,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chinnakotla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Memotion 3: Dataset on sentiment and emotion analysis of codemixed Hinglish memes</article-title>
          ,
          <source>in: Proc. Defactify 2: 2nd Workshop on Multimodal Fact-Checking and Hate Speech Detection</source>
          , CEUR
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chinnakotla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Overview of Memotion 3: Sentiment and emotion analysis of codemixed Hinglish memes</article-title>
          ,
          <source>in: Proc. Defactify 2: 2nd Workshop on Multimodal Fact-Checking and Hate Speech Detection</source>
          , CEUR
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chinnakotla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Factify 2: A multimodal fake news and satire news dataset</article-title>
          ,
          <source>in: Proc. Defactify 2: 2nd Workshop on Multimodal Fact-Checking and Hate Speech Detection</source>
          , CEUR
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chinnakotla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Findings of Factify 2: Multimodal fake news detection</article-title>
          ,
          <source>in: Proc. Defactify 2: 2nd Workshop on Multimodal Fact-Checking and Hate Speech Detection</source>
          , CEUR
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bhageria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pykl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pulabaigari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gamback</surname>
          </string-name>
          ,
          <article-title>SemEval-2020 Task 8: Memotion analysis-the visuo-lingual metaphor!</article-title>
          ,
          <source>arXiv preprint arXiv:2008.03781</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramamoorthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gunti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          , et al.,
          <article-title>Memotion 2: Dataset on sentiment and emotion analysis of memes</article-title>
          ,
          <source>in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection</source>
          ,
          <publisher-name>CEUR</publisher-name>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: Proc. ICML, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          , et al.,
          <article-title>Oscar: Object-semantics aligned pre-training for vision-language tasks</article-title>
          ,
          <source>in: Proc. ECCV</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
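          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>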
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Fast R-CNN</article-title>
          ,
          <source>in: Proc. ICCV</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1440</fpage>
          -
          <lpage>1448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Contrastive embedding for generalized zero-shot learning</article-title>
          ,
          <source>in: Proc. CVPR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2371</fpage>
          -
          <lpage>2381</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Transferable contrastive network for generalized zero-shot learning</article-title>
          ,
          <source>in: Proc. CVPR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>9765</fpage>
          -
          <lpage>9774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:2010.11929 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <article-title>Exploring hate speech detection in multimodal publications</article-title>
          ,
          <source>in: Proc. IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1470</fpage>
          -
          <lpage>1478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Firooz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ringshia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Testuggine</surname>
          </string-name>
          ,
          <article-title>The hateful memes challenge: Detecting hate speech in multimodal memes</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>2611</fpage>
          -
          <lpage>2624</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gasparini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saibene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sorensen</surname>
          </string-name>
          ,
          <article-title>SemEval-2022 Task 5: Multimedia automatic misogyny identification</article-title>
          ,
          <source>in: Proc. SemEval2022</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>533</fpage>
          -
          <lpage>549</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>