<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Amazon PARS at Memotion 2.0 2022: Multi-modal Multi-task Learning for Memotion 2.0 Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gwang Gook Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingwei Shen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon.com</institution>
          ,
          <addr-line>410 Terry Ave N., Seattle, 98109</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2022</volume>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Over the years, memes have become very popular as social media services have grown rapidly. Understanding meme images as humans do is very complicated because of their multi-modal nature (text on images). In this paper, we describe our approach to classifying the sentiment and emotion of memes for the Memotion 2.0 challenge. Assuming correlation among the three sub-tasks, we implemented and compared four multi-task network heads with different levels of interaction. Experiments showed that a multi-task classification network could perform better than individual networks for single tasks. We won 6th, 4th and 1st place for Task A, B and C, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion classification</kwd>
        <kwd>multi-modal</kwd>
        <kwd>multi-task</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Challenge Dataset</title>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <p>We chose VisualBERT [3] as our baseline because it showed the best performance among the baselines of the Hateful
Memes Challenge [4], which is in a similar domain to the Memotion 2.0 Challenge. ViLBERT
[5] showed performance comparable to VisualBERT in the same challenge. However, unlike
ViLBERT, VisualBERT employs a single transformer for both text and image modalities and hence
requires less memory and training time.</p>
      <p>For text features, the bert-base-uncased tokenizer is used and the maximum sequence length is fixed to 128.
For VisualBERT, image features can be extracted from grid maps or from regions produced by object
detection. It has been shown that grid-based features can perform on par with region-based features [6][7]
while keeping the entire pipeline simpler (enabling end-to-end pretraining). However, we chose
region-based image features for two reasons. First, most meme images have large empty background regions,
as in Fig. 1. For such images, region-based features, which are computed only for detected objects,
help the model focus on image content more than grid-based features, where all image
regions are treated equally. Second, the challenge dataset provides only a small amount of training data
(7,000 samples). Object detectors are trained on large datasets and kept frozen while extracting features,
which makes them more reliable on small datasets than grid configurations.</p>
      <p>We chose a multi-task model rather than dedicated models for each task, hypothesizing that the
subtasks are related. For example, memes with a funny emotion would have a higher probability of having
positive sentiment than offensive memes. Funny memes are also often sarcastic.
Fig. 2 shows the correlation among labels in Task A and Task B. We can also expect Task B
and Task C to be highly correlated, as Task C is a fine-grained version of Task B.</p>
      <p>We designed four multi-task classification heads, as illustrated in Fig. 3. PRD and FC stand for BERT
prediction head and fully connected layer, respectively. Numbers in parentheses give the number of
channels in each output. For example, heads for sentiment classification have three outputs: negative,
neutral and positive.</p>
      <p>[Figure 3: Four multi-task classification heads, each built on a Transformer over Text Feature and Image Feature inputs, using FC and PRD blocks. Panels: (a) Multi-task classification head 1; (b) Multi-task classification head 2; (c) Multi-task classification head 3; (d) Multi-task classification head 4. Output labels shown: Sentiment (3), Humour (4), Sarcastic (4), Offensive (4), Motivational (2), Is_humour (2), Is_sarcastic (2), Is_offensive (2), Emotion (4).]</p>
      <p>Classification head 1 has a separate PRD for each output to learn task-specific predictions. In contrast,
head 2 shares one PRD across all tasks to strengthen the benefits of multi-task learning. Head 3 first classifies
emotions from Task B as binary classes (is_humour, is_sarcastic and is_offensive). The outputs of these emotion
predictions (Task B) are then concatenated with the multimodal feature embeddings (from PRD) to
predict emotion intensities (Task C), on the expectation that knowing whether an emotion exists helps to classify how
strong it is. This is analogous to two-stage architectures in object detection [8]: the
first stage (region proposal network, RPN) only predicts whether an object exists, and the
following layer classifies which category the object belongs to. Head 4 is similar to head 3, but uses a
single multi-label classifier rather than the four binary classifiers of head 3.</p>
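      <p>The two-stage data flow described for head 3 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the embedding size, the random weights, and the helper names init_layer, linear and head3_forward are hypothetical, raw logits are assumed to be what gets concatenated, and the sentiment and motivational outputs are omitted for brevity.</p>

```python
import random

EMOTIONS = ["humour", "sarcastic", "offensive"]  # emotions with a binary existence output
NUM_INTENSITIES = 4  # channels per intensity output, as in "Humour (4)"
EMBED_DIM = 8        # hypothetical size of the PRD embedding (assumption)


def init_layer(in_dim, out_dim, rng):
    """Create a randomly initialised fully connected layer (weights, biases)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases


def linear(x, layer):
    """Apply a fully connected layer to a feature vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def head3_forward(embedding, existence_layers, intensity_layers):
    """Two-stage prediction: emotion existence first, then emotion intensity."""
    # Stage 1: one binary classifier (2 logits) per emotion.
    existence = {e: linear(embedding, existence_layers[e]) for e in EMOTIONS}
    # Concatenate the stage-1 logits with the multimodal embedding.
    concat = list(embedding)
    for e in EMOTIONS:
        concat.extend(existence[e])
    # Stage 2: intensity classifiers see both the embedding and the stage-1 outputs.
    intensity = {e: linear(concat, intensity_layers[e]) for e in EMOTIONS}
    return existence, intensity


rng = random.Random(0)
existence_layers = {e: init_layer(EMBED_DIM, 2, rng) for e in EMOTIONS}
concat_dim = EMBED_DIM + 2 * len(EMOTIONS)
intensity_layers = {e: init_layer(concat_dim, NUM_INTENSITIES, rng) for e in EMOTIONS}

embedding = [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
existence, intensity = head3_forward(embedding, existence_layers, intensity_layers)
```

      <p>In this sketch the intensity classifiers receive a longer input than the existence classifiers (the embedding plus two logits per emotion), which is the only structural difference from a head in which all outputs read the embedding independently.</p>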
      <p>The multi-task loss is defined as the weighted sum of the losses for each task. Cross entropy loss is used for
the multi-class tasks (Task A and C) and binary cross entropy with logits is used for the multi-label task
(Task B). The weights for Task A, B and C are set to 0.4, 0.3 and 0.3, respectively. When Task B predictions are
not explicitly produced by a head, they are derived from the emotion intensity outputs. For example,
slightly_funny, funny and hilarious predictions are all considered humorous.</p>
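      <p>The weighted loss and the derivation of Task B labels from intensity predictions can be sketched as follows. This is a minimal illustration in plain Python rather than the training code: the function names are hypothetical, a single intensity output stands in for Task C (how the losses of the several intensity outputs are combined is not specified here), and the intensity classes are assumed to be ordered with index 0 meaning the emotion is absent.</p>

```python
import math

TASK_WEIGHTS = {"A": 0.4, "B": 0.3, "C": 0.3}  # weights stated in the text


def cross_entropy(logits, target_index):
    """Softmax cross entropy for one multi-class prediction."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target_index]


def bce_with_logits(logits, targets):
    """Mean binary cross entropy over multi-label logits (numerically stable)."""
    losses = [max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
              for z, t in zip(logits, targets)]
    return sum(losses) / len(losses)


def multi_task_loss(loss_a, loss_b, loss_c):
    """Weighted sum of the per-task losses."""
    return (TASK_WEIGHTS["A"] * loss_a
            + TASK_WEIGHTS["B"] * loss_b
            + TASK_WEIGHTS["C"] * loss_c)


def emotion_present(intensity_index):
    """Derive a Task B label from a Task C prediction.

    Assumes index 0 is the "not present" class (e.g. not_funny) and indices
    1 to 3 are increasing intensities (slightly_funny, funny, hilarious).
    """
    return intensity_index > 0


# Toy logits and targets, made up for illustration only.
loss_a = cross_entropy([2.0, 0.5, -1.0], target_index=0)        # sentiment
loss_b = bce_with_logits([1.2, -0.7, 0.3, -2.0], [1, 0, 1, 0])  # emotion labels
loss_c = cross_entropy([0.1, 1.5, 0.3, -0.4], target_index=1)   # one intensity output
total = multi_task_loss(loss_a, loss_b, loss_c)
```

      <p>Because the three weights sum to 1, the total stays on the same scale as the individual losses, which keeps it comparable across heads that predict Task B explicitly and heads that derive it from the intensity outputs.</p>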
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
    </sec>
    <sec id="sec-5">
      <title>4.1 Pre-trained Weights</title>
      <p>We used MMF [9], a multimodal framework for vision and language research from Facebook AI Research, for
implementation. MMF provides pretrained weights for various models and task types. It is
well known that the similarity between the source domain and the target domain affects the
performance of a fine-tuned model. It has also been shown that when the target domain is very different from the source
domain, training the model directly on the target domain gives better performance than
fine-tuning from a pre-trained model [10].</p>
      <p>We tested several pretrained models, as summarized in Table 2. COCO, VQA2 and
Conceptual Captions (CC) are the most common datasets for pre-training multi-modal models. We also
trained the model directly on the training data without pre-training. Finally, because the datasets are
similar, we tested two models fine-tuned on the hateful memes dataset, one with pre-training (on COCO) and one
without. To compare the pre-trained weights, two classification networks, one for Task A and one for
Task C, were trained and the average of their F1 scores was measured on the validation set. Surprisingly, direct
training performed well, even better than some pre-trained weights. However, fine-tuning from
the weights trained directly on the hateful memes dataset gave much lower performance than all other
models. This suggests that a directly trained model generalizes poorly. Among
the models fine-tuned from pre-trained weights, the weights pre-trained on COCO and then fine-tuned on
hateful memes gave the best performance and were used for the following experiments.</p>
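      <p>The selection criterion, the average of the Task A and Task C F1 scores on the validation set, can be sketched as follows. The F1 variant is not stated here, so macro-averaged F1 is used purely as an assumption. The function names and the toy labels are hypothetical, and Task C is simplified to a single label per sample.</p>

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the classes that occur in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_f1(results):
    """Average the per-task F1 scores used to compare pre-trained weights."""
    scores = [macro_f1(y_true, y_pred) for y_true, y_pred in results.values()]
    return sum(scores) / len(scores)


# Toy validation labels (made up for illustration, not challenge data).
toy_results = {
    "task_a": ([0, 1, 2, 1], [0, 1, 1, 1]),
    "task_c": ([0, 1, 1, 3], [0, 1, 1, 3]),
}
score = average_f1(toy_results)
```

      <p>Running the same function on the validation predictions of each candidate set of weights gives one number per candidate, so the candidates can be ranked directly.</p>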
      <p>Table 2: Pre-trained weights compared. Name: Direct; Pretrained.coco; Pretrained.vqa2; Pretrained.cc; Finetuned.hateful_memes.from_coco; Finetuned.hateful_memes.from_coco.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <p>In this paper, we presented our approach to the Memotion 2.0 Challenge. We tackled the problem
as multi-task classification, considering the correlation among the individual tasks. Experiments showed that
multi-task models could outperform single-task models.</p>
      <p>Though we achieved competitive performance, there is much room for improvement. One direction
is end-to-end training of the multi-modal model. In this work, image features are extracted from
an object detector. As the object detector is not updated during training, the image features might not
be fully aligned with the downstream task. Exploring end-to-end training on grid image features
(Huang et al. [7]) would be interesting, and we leave it as future work.</p>
    </sec>
    <sec id="sec-7">
      <title>6. References</title>
      <p>[1] S. Ramamoorthy, N. Gunti, S. Mishra, S. Suryavardan, A. Reganti, P. Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal and C. Ahuja, Memotion 2: Dataset on Sentiment and Emotion Analysis of Memes, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.</p>
      <p>[2] P. Patwa, S. Ramamoorthy, N. Gunti, S. Mishra, S. Suryavardan, A. Reganti, A. Das, T. Chakraborty, A. Sheth, A. Ekbal and C. Ahuja, Findings of Memotion 2: Sentiment and Emotion Analysis of Memes, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.</p>
      <p>[3] L. Li, M. Yatskar, D. Yin, C.-J. Hsieh and K.-W. Chang, VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv preprint arXiv:1908.03557, 2019.</p>
      <p>[4] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia and D. Testuggine, The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes, in: Proceedings of Neural Information Processing Systems, 2020.</p>
      <p>[5] J. Lu, D. Batra, D. Parikh and S. Lee, ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, in: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, No. 2, pp. 13–23.</p>
      <p>[6] H. Jiang, I. Misra, M. Rohrbach, E. Learned-Miller and X. Chen, In Defense of Grid Features for Visual Question Answering, in: Proceedings of Computer Vision and Pattern Recognition (CVPR 2020), 2020.</p>
      <p>[7] Z. Huang, Z. Zeng, B. Liu, D. Fu and J. Fu, Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv preprint arXiv:2004.00849, 2020.</p>
      <p>[8] S. Ren, K. He, R. Girshick and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, in: Proceedings of Advances in Neural Information Processing Systems 28, 2015.</p>
      <p>[9] A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra and D. Parikh, MMF: A multimodal framework for vision and language research, 2020. URL: https://github.com/facebookresearch/mmf.</p>
      <p>[10] A. Singh, V. Goswami and D. Parikh, Are we pretraining it right? Digging deeper into visiolinguistic pretraining, arXiv preprint arXiv:2004.08744, 2020.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>