<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>De-Factify</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>NUAA-QMUL-AIIT at Memotion 3: Multi-modal Fusion with Squeeze-and-Excitation for Internet Meme Emotion Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaoyu Guo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Ma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arkaitz Zubiaga</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Advanced Institute of Information Technology (AIIT), Peking University</institution>
          ,
          <addr-line>Hangzhou, Zhejiang</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Economics and Management, Nanjing University of Aeronautics and Astronautics (NUAA)</institution>
          ,
          <addr-line>Nanjing, Jiangsu</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Electronic Engineering and Computer Science, Queen Mary University of London (QMUL)</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>2</volume>
      <issue>2</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>This paper describes the participation of our NUAA-QMUL-AIIT team in the Memotion 3 shared task on meme emotion analysis. We propose a novel multi-modal fusion method, Squeeze-and-Excitation Fusion (SEFusion), and embed it into our system for emotion classification in memes. SEFusion is a simple fusion method that employs fully connected layers, reshaping, and matrix multiplication. SEFusion learns a weight for each modality and then applies it to that modality's own features. We evaluate the performance of our system on the three Memotion 3 sub-tasks. Among all participating systems in the Memotion 3 shared task, our system ranked first on task A, fifth on task B, and second on task C. Our proposed SEFusion provides the flexibility to fuse any features from different modalities. The source code for our method is published at https://github.com/xxxxxxxxy/memotion3-SEFusion.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-modal fusion</kwd>
        <kwd>internet meme</kwd>
        <kwd>emotion analysis</kwd>
        <kwd>squeeze-and-excitation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the rapid increase in the amount of online information, automated processing of the
content can help alleviate the otherwise burdensome task of sifting through all the information.
One of the prevalent forms of online information is the internet meme. An
internet meme is a concise and often humorous means of sharing information online, generally
communicated as an image with text embedded [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In recent years, internet memes have
become prevalent as a means to share opinions through different internet platforms such as
social media [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Generally, internet memes combine two modalities: image and text. While the content of
memes can be useful and important to process through automated means, much of the
existing research has been limited to text, with less attention paid to the analysis of memes, as is
the case in our work focused on meme emotion analysis. The key challenge of meme emotion
analysis is achieving an effective combination of the text and image features extracted by
pre-trained models. Existing fusion methods mainly use an attention mechanism to map the
features of the different modalities (e.g. [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]). An important aspect to consider when
combining the modalities is determining the weight of each one, as it varies from case to
case whether the text or the image plays the more significant role. One can then combine
the modalities by multiplying the inferred weights with their associated embeddings and
subsequently aggregating them. The main objective of our work is to optimise the learning of the
weights of each modality through the use of neural network models.
      </p>
      <p>
        Our work builds on an approach introduced by Hu et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], who proposed a
squeeze-and-excitation block to learn the channel dependencies of an image, which can be applied to a variety
of deep neural networks, leading to improved classification performance. Inspired by this work,
we consider utilizing squeeze-and-excitation to learn the modal dependencies of multi-modal
data. The squeeze-and-excitation block cannot be applied directly to fuse features of different
modalities, and hence we adopt the framework and adapt it to our multi-modal fusion
task.
      </p>
      <p>In this article, we propose Squeeze-and-Excitation Fusion (SEFusion), a novel multi-modal
fusion method, and apply it to fuse text features and image features extracted from internet
memes. Tested on the Memotion 3 shared task, our SEFusion system achieved the
top rank on task A of the competition, with an F1 score of 0.3441. Our system also ranked
second on task C.</p>
      <p>The rest of this paper is organized as follows. In the next section, we describe the Memotion
3 task and prior work on emotion classification in memes. Then in Section 3, we propose
SEFusion, a novel multi-modal fusion method, and embed it into our system to classify memes.
We then employ the method to analyze memes in task A, task B, and task C in Section 4. In
Section 5, we discuss the results of our experiments. Finally, we conclude with the findings of
this research and suggest directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. The Memotion 3 Shared Task</title>
        <p>
          Memotion 3 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is the 3rd edition of the series of Memotion shared tasks focused on meme
emotion analysis. The previous editions, Memotion 1 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and Memotion 2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], provided annotated
datasets [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and brought attention to the analysis of memes. This edition consists of
three subtasks: (i) task A, classifying an internet meme according to its expressed emotion as
positive, negative, or neutral; (ii) task B, identifying whether an internet meme is sarcastic,
humorous, offensive, or motivational, as a multi-label classification task; and (iii) task C,
quantifying the scale of each type in task B.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Related Work on Meme Emotion Analysis</title>
        <p>
          Previous research on meme emotion analysis mainly focuses on emotion classification,
identifying the type of emotion expressed, and detecting hateful memes, which are all part of the
Memotion 3 task. Wu et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] focus on text memes and add slang and sentiment lexica as
extra information to improve the performance of meme emotion classification. Amalia et al.
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] first use Tesseract OCR to extract text from image memes and then classify the extracted
text as positive or negative using a Naive Bayes classifier, achieving a competitive
accuracy of 75%. As for identifying the type of emotion expressed, Costa et al. [13] propose
a Maximum Entropy classifier to recognize humorous text memes. Their model achieved
high performance for the negative class, with substantially lower performance for the positive
class. Sabat et al. [14] use BERT and VGG-16 to process texts and images for hateful meme
detection. They apply both early fusion and late fusion methods to combine text and image.
        </p>
        <p>Most recently, Nayak and Agrawal [15] employ various machine learning models to
automatically detect hate in internet memes. Ouaari et al. [16] use neural networks to extract features of
internet memes and train a classifier to identify the sentiment expressed in memes. Fersini et al.
[17] posit that hateful content is expressed through memes and, to support their detection,
they utilize unimodal and multimodal approaches to identify misogynous memes.</p>
        <p>In summary, previous research on meme emotion analysis has considered a wide range of
traditional machine learning models, more contemporary deep learning models, and well-established
feature fusion methods. To our knowledge, existing research does not combine features of
different modalities by learning the weight of each modality automatically. By studying
this combination in the context of memes, our study introduces a novel fusion method that
attempts to learn the weights of different modalities.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <sec id="sec-3-1">
        <title>3.1. Breaking Down the Task into Subtasks</title>
        <p>The shared task consists of three subtasks which, in our case, we envisaged as nine classification
sub-tasks (one for task A, four for task B, and four for task C). This is because tasks B and C
require making four predictions each, which we chose to tackle separately. Hence, we
assign specific names to each of these classification sub-tasks (A, B1-4, C1-4), as shown in Table
1. During experimentation, we observed that there was no difference between
B4 and C4, so we regard these two as the same sub-task, hence reducing the set to eight sub-tasks.</p>
        <p>We therefore approach the task as eight classification sub-tasks (A, B1-4, C1-3). All sub-tasks
use the same system framework, which is depicted in Figure 1. First, we extract text and image
features. Subsequently, we fuse these features with our proposed SEFusion. Finally, the
fused features are sent to dense layers with an appropriate activation to produce the category label.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature Extraction</title>
        <p>Internet memes are concise in form [20], and thus it is difficult to extract sufficient features from
them alone. Universal and semantic features can be learned from a large corpus during
pre-training. Therefore, we choose pre-trained models to extract features from internet memes.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>The classification sub-tasks derived from the three Memotion 3 tasks.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Task</th><th>Sub-task</th><th>Content</th></tr>
            </thead>
            <tbody>
              <tr><td>task A</td><td>task A</td><td>Classify a meme as positive, negative, or neutral.</td></tr>
              <tr><td rowspan="4">task B</td><td>task B1</td><td>Classify a meme as humorous or not.</td></tr>
              <tr><td>task B2</td><td>Classify a meme as sarcastic or not.</td></tr>
              <tr><td>task B3</td><td>Classify a meme as offensive or not.</td></tr>
              <tr><td>task B4</td><td>Classify a meme as motivational or not.</td></tr>
              <tr><td rowspan="4">task C</td><td>task C1</td><td>Quantify a meme as not funny, funny, very funny, or hilarious.</td></tr>
              <tr><td>task C2</td><td>Quantify a meme as not sarcastic, general, twisted meaning, or very twisted.</td></tr>
              <tr><td>task C3</td><td>Quantify a meme as not offensive, slight, very offensive, or hateful offensive.</td></tr>
              <tr><td>task C4</td><td>Quantify a meme as motivational or not.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-2-1">
          <title>3.2.1. Data Pre-processing</title>
          <p>Texts extracted from memes contain many user names: strings that start with “@” followed
by other characters representing the user name. Given that these characters are unlikely to
be meaningful for meme emotion analysis, we replace them with the generic token “@user”. In
addition, several memes have watermarks showing a link to their creator or origin. We replace
these links with the generic token “http”. For the images, we perform the default pre-processing
of the pre-trained model1.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Text Feature Extraction</title>
          <p>We use TweetEval [18] to extract the text features2. The pre-trained model we choose is
cardiffnlp/twitter-roberta-base. We take the average of the extracted features as the
representation of each item of meme text. The text features are denoted as X_t, X_t ∈ R^(1×768).</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Image Feature Extraction</title>
          <p>We use CLIP-ViT [19] to extract the image features. The pre-trained model we use is
laion/CLIP-ViT-B-32-laion2B-s34B-b79K. We then perform L2 normalization to obtain the final image features
X_i, X_i ∈ R^(1×512).</p>
        </sec>
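        <p>The text clean-up described above can be sketched with two regular expressions. This is a minimal illustration: the function name and the exact patterns are our own assumptions, not the authors' released code.</p>
        <preformat>
```python
import re

def preprocess_meme_text(text):
    # Sketch of the pre-processing in Section 3.2.1 (patterns are ours):
    # user mentions ("@" plus following characters) become "@user",
    # watermark links become the generic token "http".
    text = re.sub(r"@\w+", "@user", text)
    text = re.sub(r"https?://\S+|www\.\S+", "http", text)
    return text

print(preprocess_meme_text("@memelord check www.madeupmemes.example lol"))
# -> "@user check http lol"
```
        </preformat>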
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Squeeze-and-excitation Fusion</title>
        <p>SEFusion is a computational unit that can be built upon multi-modal features. Since the output of
the multi-modal model is produced by a summation through all modalities, modal dependencies
are significant for multi-modal feature fusion. Our proposed fusion method can learn the
relationships between modalities and explicitly model modal interdependencies. The procedure
of SEFusion is shown in the middle of Figure 1.</p>
        <p>We next describe the two components of SEFusion, squeeze and excitation.</p>
        <p>
          1https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K/blob/main/preprocessor_config.json
2https://github.com/cardiffnlp/tweeteval/blob/main/TweetEval_Tutorial.ipynb
        </p>
        <p>
          3.3.1. Squeeze. In order to exploit modal dependencies, we consider learning the weight
of each modality. We first perform dimension reduction on the text and image features. We
use a dense layer with a single unit to linearly squeeze the text features into a vector z_t,
z_t ∈ R^(1×1), which is different from the squeeze procedure in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] since the feature dimension
in our case is different from the dimension produced by the convolutional operator in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We
also obtain z_i with the same operation on the image features. Next, we concatenate z_t with z_i
to get z, z ∈ R^(1×2). The procedure is shown as:

z_t = F(X_t, W_1) = W_1 X_t,   (1)
z_i = F(X_i, W_2) = W_2 X_i,   (2)
z = Concat(z_t, z_i).          (3)
        </p>
        <p>
          3.3.2. Excitation. To make use of the information aggregated in the squeeze operation, we follow it with a second
operation that aims to fully capture modal-wise dependencies. Following Hu et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we opt
to employ a simple gating mechanism with a sigmoid activation:

s = F(z, W) = σ(g(z, W)) = σ(W_4 δ(W_3 z)),   (4)

where δ refers to the ReLU [21] function, σ is the sigmoid function, W_3 ∈ R^(2×1), W_4 ∈ R^(1×2), and s is the learned vector
of weights for the modalities.
        </p>
        <p>Next, we apply the weight vector to the multi-modal features to compute the fused features.
Considering that the dimensionalities of the text and image features are different, we concatenate
these features and directly reshape the concatenated features into X′ (X′ ∈ R^(2×640)) in order
to easily apply matrix multiplication to s and X′. The final fused feature is
calculated by:

X = Concat(X_t, X_i),   (5)
X′ = Reshape(X),        (6)
X_fusion = s X′,        (7)

where X ∈ R^(1×1280), X′ ∈ R^(2×640), and X_fusion ∈ R^(1×640).</p>
        <p>Discussion. After reshaping, we obtain X′, whose first row contains part of the features of the
text, while the second row contains the image features together with the remaining text
features. When applying matrix multiplication to s and X′, we therefore apply the image
weight to part of the text features, which may be questionable. It would also be acceptable to
unify the feature dimension of each modality with a subsequent dense layer.</p>
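        <p>As a minimal sketch, the whole SEFusion forward pass (Eqs. 1-7) can be written in a few lines of NumPy. The weight matrices follow the paper's notation but are random placeholders here, standing in for the learned dense-layer parameters; this is an illustration, not the released implementation.</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def sefusion(x_t, x_i, W1, W2, W3, W4):
    # Forward pass of SEFusion (Section 3.3); weights are placeholders.
    relu = lambda a: np.maximum(a, 0.0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z_t = x_t @ W1                           # Eq. (1): squeeze text to 1x1
    z_i = x_i @ W2                           # Eq. (2): squeeze image to 1x1
    z = np.concatenate([z_t, z_i], axis=1)   # Eq. (3): z in R^(1x2)
    s = sigmoid(relu(z @ W3) @ W4)           # Eq. (4): modality weights, 1x2
    X = np.concatenate([x_t, x_i], axis=1)   # Eq. (5): X in R^(1x1280)
    X_prime = X.reshape(2, 640)              # Eq. (6): X' in R^(2x640)
    return s @ X_prime                       # Eq. (7): fused feature, 1x640

x_t = rng.normal(size=(1, 768))              # text features (RoBERTa encoder)
x_i = rng.normal(size=(1, 512))              # image features (CLIP-ViT)
W1, W2 = rng.normal(size=(768, 1)), rng.normal(size=(512, 1))
W3, W4 = rng.normal(size=(2, 1)), rng.normal(size=(1, 2))
fused = sefusion(x_t, x_i, W1, W2, W3, W4)
print(fused.shape)  # (1, 640)
```
        </preformat>
        <p>Note how the reshape mixes the last 128 text dimensions into the image row, which is exactly the caveat raised in the discussion above.</p>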
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Classification</title>
        <p>The fused feature is used as the input to n fully connected layers, where n is a hyper-parameter
that needs to be adjusted for the different sub-tasks. The fully connected layers are followed by a
sigmoid (or softmax) activation to generate the probability of the meme pertaining to a
class.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The dataset used for our experiments was released by the organizers of the Memotion 3 task
[22]. Each entry in the dataset contains the following fields: image, text, and label. The label
field varies across the different tasks. The dataset contains a total of 10,000 samples, including
7,000 for training, 1,500 for validation, and 1,500 for test. For the experimentation, we rely on
the training, validation, and test data as split by the organizers. Tables 2-4 show the distribution
of labels of the different tasks across the training, validation, and test sets.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Label distribution of task A across the training and validation sets.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Label</th><th>Train</th><th>Validation</th></tr>
            </thead>
            <tbody>
              <tr><td>Positive</td><td>2,275 (33%)</td><td>341 (23%)</td></tr>
              <tr><td>Neutral</td><td>2,970 (42%)</td><td>579 (39%)</td></tr>
              <tr><td>Negative</td><td>1,755 (25%)</td><td>580 (39%)</td></tr>
              <tr><td>Total</td><td>7,000</td><td>1,500</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Parameter Setting</title>
        <p>Since the dataset is imbalanced, we employ the strategy of Logit Adjustment [23] to overcome
this problem. This strategy is implemented by changing the loss function and can be used
directly in Keras.3 Therefore, we use sparse_categorical_crossentropy_with_prior as the loss
function in our experiments. In addition, it is necessary to add the prior distribution of the labels
to the loss function. As the label distributions of the validation and test sets are not known
during training, we use the label distribution of the training set. From Table 2, we see that the
labels of task A are distributed into 2,275 positive (33%), 2,970 neutral (42%), and 1,755 negative
(25%) instances. The distributions of the other sub-tasks can be drawn from Tables 3 and 4.</p>
        <p>The batch size is set to 256, the learning rate is set to 1e-4, and Adam is used as the optimizer.
For task A and task B, we use 2 dense layers, while for task C, we use 5 dense layers. We monitor
the sparse categorical accuracy on the validation set to save the best model.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Implementation</title>
        <p>We utilize Keras4, the Python deep learning library, to build the whole model structure.
TweetEval [18] and CLIP-ViT [19] are used to acquire the text and image representations through the API
provided by Hugging Face5.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Evaluation Metrics</title>
        <p>We use weighted-F1 as the evaluation metric, which is the official metric proposed by the
organizers. The weighted-F1 score is calculated by taking the mean of all per-class F1 scores
weighted by each class's support, shown as:</p>
        <p>F1_l = (2 × P_l × R_l) / (P_l + R_l),   (8)

weighted-F1 = Σ_{l=1}^{L} w_l F1_l,   (9)

where P_l and R_l stand for the precision and recall of class l, respectively, w_l denotes the support
proportion of class l, and L is the total number of classes.</p>
        <p>For task A, weighted-F1 can be used to evaluate directly. For task B and task C, we calculate
the weighted-F1 score for each of the sub-tasks and then take an average of those scores to
obtain an average-weighted-F1 score.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Among all participating systems in the Memotion 3 task, our model achieved the best score on
task A and the second-best score on task C. The weighted-F1 scores
and average-weighted-F1 scores of our proposed SEFusion are shown in Table 5.</p>
      <p>From Table 5, we can conclude that our models are under-fitting on all sub-tasks except task C1 and task C3,
since the evaluation scores on the training set are lower than those on the validation set. To the
best of our knowledge, under-fitting rarely happens on the Memotion datasets, even with simple
machine learning models. Under-fitting indicates that the performance of our model could be
improved by training longer or adding extra layers to the network. The results of task C3 show
that the model is over-fitting, and we should reduce the number of layers. For task C1, although there is a little
over-fitting, its extent is acceptable.</p>
      <p>4https://keras.io/zh/
5https://huggingface.co/</p>
      <p>In Table 5, we also see that the performance on task B3 is lower than on the other sub-tasks of task B.
All sub-tasks in task B are binary classification tasks and should thus be easier than task A and task C.
However, the weighted-F1 score of task B3 is close to 0.5, which is the usual baseline for binary
classification. Given that the class proportions of task B3 are comparatively balanced, we conclude that identifying
offensive memes is very hard using our existing features, as is classifying memes as
positive, neutral, or negative.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we propose SEFusion, a novel multi-modal fusion method to combine text and
image features jointly for emotion classification in internet memes. Our method ranks first on
task A and second on task C of the Memotion 3 task.</p>
      <p>
        Given the features extracted from memes, our proposed SEFusion applies squeeze and
excitation, simple operations merely using fully connected layers with proper activations,
reshaping, and matrix multiplication, to fuse text and image features. Like the Squeeze-and-Excitation
block [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], our proposed SEFusion is flexible and can be used to fuse other sets of
features extracted by other models. In addition, SEFusion can fuse more than two types of
features as long as the dimensions are reshaped correctly.
      </p>
      <p>Our work has some limitations and opens up avenues for future research. First, our model
learned a weight for each modality, but the weights were not applied strictly to their corresponding
modalities, since we mixed the text and image features when reshaping the concatenated feature
vector. We will consider unifying the feature dimensions of the different modalities before performing
SEFusion. Second, internet meme emotion analysis is still in its infancy. Although our model
ranks first on task A, its performance is only slightly above the baseline model and has room for
improvement, which calls for more research, ideally working jointly with the adjacent tasks of
detecting sentiment and hateful content in memes.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>This study was supported by the National Natural Science Foundation of China under grant
number 72174086. Xiaoyu Guo conducted this work while doing an internship at the Advanced
Institute of Information Technology, Peking University.</p>
      <p>[12] … by using optical character recognition (OCR) and naïve Bayes algorithm, in: 2018 Third International Conference on Informatics and Computing (ICIC), IEEE, 2018, pp. 1-5.
[13] D. Costa, H. G. Oliveira, A. M. Pinto, In reality there are as many religions as there are papers - first steps towards the generation of internet memes, in: ICCC, 2015, pp. 300-307.
[14] B. O. Sabat, C. C. Ferrer, X. Giro-i Nieto, Hate speech in pixels: Detection of offensive memes towards automatic moderation, arXiv preprint arXiv:1910.02334 (2019).
[15] A. Nayak, A. Agrawal, Detection of hate speech in social media memes: A comparative analysis, in: 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT), IEEE, 2022, pp. 1179-1185.
[16] S. Ouaari, T. M. Tashu, T. Horváth, Multimodal feature extraction for memes sentiment classification, in: 2022 IEEE 2nd Conference on Information Technology and Data Science (CITDS), IEEE, 2022, pp. 285-290.
[17] E. Fersini, G. Rizzi, A. Saibene, F. Gasparini, Misogynous meme recognition: A preliminary study, in: International Conference of the Italian Association for Artificial Intelligence, Springer, 2022, pp. 279-293.
[18] F. Barbieri, J. Camacho-Collados, L. Espinosa-Anke, L. Neves, TweetEval: Unified benchmark and comparative evaluation for tweet classification, in: Findings of EMNLP, 2020.
[19] X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, M. Lucic, J. Djolonga, A. S. Pinto, M. Neumann, A. Dosovitskiy, et al., A large-scale study of representation learning with the visual task adaptation benchmark, arXiv preprint arXiv:1910.04867 (2019).
[20] E. Hakoköngäs, O. Halmesvaara, I. Sakki, Persuasion through bitter humor: Multimodal discourse analysis of rhetoric in internet memes of two far-right groups in Finland, Social Media + Society 6 (2020) 2056305120921575.
[21] V. Nair, G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: ICML, 2010.
[22] S. Mishra, S. Suryavardan, M. Chakraborty, P. Patwa, A. Rani, A. Reganti, A. Chadha, A. Das, A. Sheth, M. Chinnakotla, A. Ekbal, S. Kumar, Memotion 3: Dataset on sentiment and emotion analysis of code-mixed Hinglish memes, in: Proceedings of De-Factify 2: Second Workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR, 2023.
[23] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, S. Kumar, Long-tail learning via logit adjustment, arXiv preprint arXiv:2007.07314 (2020).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>A review of internet meme studies: State of the art and outlook</article-title>
          ,
          <source>Information Studies: Theory &amp; Application</source>
          <volume>44</volume>
          (
          <year>2021</year>
          )
          <fpage>199</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>NUAA-QMUL at SemEval-2020 task 8: Utilizing BERT and DenseNet for Internet meme emotion analysis</article-title>
          ,
          <source>in: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          , International Committee for Computational Linguistics,
          <source>Barcelona (online)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>901</fpage>
          -
          <lpage>907</lpage>
          . URL: https://aclanthology.org/2020.semeval-1.114.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Attention-based multi-modal fusion network for semantic scene completion</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>11402</fpage>
          -
          <lpage>11409</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Deep rgb-d saliency detection with depth-sensitive attention and automatic multi-modal fusion</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1407</fpage>
          -
          <lpage>1417</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Research on multi-modal emotion recognition based on dr-transformer model</article-title>
          ,
          <source>Information Science</source>
          <volume>40</volume>
          (
          <year>2022</year>
          )
          <fpage>117</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Squeeze-and-excitation networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>7132</fpage>
          -
          <lpage>7141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chinnakotla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Overview of memotion 3: Sentiment and emotion analysis of codemixed hinglish memes</article-title>
          ,
          <source>in: Proceedings of De-Factify 2: Second Workshop on Multimodal Fact-Checking and Hate Speech Detection, CEUR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bhageria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pykl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pulabaigari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gambäck</surname>
          </string-name>
          ,
          <article-title>Semeval-2020 task 8: Memotion analysis-the visuo-lingual metaphor!</article-title>
          ,
          <source>in: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>759</fpage>
          -
          <lpage>773</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramamoorthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gunti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          , et al.,
          <article-title>Findings of memotion 2: Sentiment and emotion analysis of memes</article-title>
          ,
          <source>in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramamoorthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gunti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          , et al.,
          <article-title>Memotion 2: Dataset on sentiment and emotion analysis of memes</article-title>
          ,
          <source>in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morstatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Slangsd: building, expanding and using a sentiment dictionary of slang words for short-text sentiment classification</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>52</volume>
          (
          <year>2018</year>
          )
          <fpage>839</fpage>
          -
          <lpage>852</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Amalia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haisar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunawan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Nasution</surname>
          </string-name>
          ,
          <article-title>Meme opinion categorization</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>