<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCILab at Memotion 2.0 2022: Analysis of Sentiment, Emotion and Intensity of Emotion Classes from Meme Images using Single and Multi Modalities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thanh Tin Nguyen</string-name>
          <email>nttin@sju.ac.kr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nhat Truong Pham</string-name>
          <email>phamnhattruong.st@tdtu.edu.vn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ngoc Duy Nguyen</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai Nguyen</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Long H. Nguyen</string-name>
          <email>hoanglong.fruitai@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong-Guk Kim</string-name>
          <email>ykim@sejong.ac.kr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>(Corresponding author)</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Division of Computational Mechatronics, Institute for Computational Science, Ton Duc Thang University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Electrical and Electronics Engineering, Ton Duc Thang University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Information Technology, Ton Duc Thang University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Human Computer Interaction Lab, Department of Computer Engineering, Sejong University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country country="KR">Korea</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Institute for Intelligent Systems Research and Innovation, Deakin University</institution>
          ,
          <addr-line>Victoria, Australia</addr-line>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Khoury College of Computer Sciences, Northeastern University</institution>
          ,
          <addr-line>Boston</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays, memes found on the internet are overwhelming in number. Although most are innocuous and sometimes entertaining, some memes convey sarcastic, offensive, or motivational feelings. In this study, several approaches are proposed to address the multimodal problem of analysing the given meme dataset. The class imbalance issue is addressed with the Auto Augmentation method, and the weak correlation between modalities is mitigated by adopting deep Canonical Correlation Analysis to find the most correlated projections of the visual and textual feature embeddings. In addition, both stacked attention and multi-hop attention networks are employed to efficiently generate aggregated features. As a result, our team, HCILab, achieved a weighted F1 score of 0.4995 for sentiment analysis, 0.7414 for emotion classification, and 0.5301 for scale/intensity of emotion classes on the leaderboard. These results were obtained by concatenating the features of the image and text models, and our code can be found at https://git.io/JMRa8.</p>
      </abstract>
      <kwd-group>
        <kwd>Meme analysis</kwd>
        <kwd>attention models</kwd>
        <kwd>correlation analysis</kwd>
        <kwd>emotion classes</kwd>
        <kwd>multimodality</kwd>
        <kwd>vision and language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The task of analyzing sentiments, emotions and their intensity has attracted a great deal of
attention in the research community, especially because it can help to limit unnecessary harm.
As the internet has spread worldwide, false information, hatred, and offensive language have also
increased tremendously. A common way to disseminate such threats is through texts in
meme images, which malicious people can exploit as a means to provoke arguments, disputes, and
social wars.</p>
      <p>
        To mitigate harmful effects of toxic memes, machine learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and deep learning [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]
are commonly employed to tackle the problem. These techniques can detect and classify memes
effectively, although they require humans to label the data. Nevertheless, the results are promising,
and such algorithms can be integrated into social media platforms such as Facebook or Twitter to
automatically detect and remove these memes.
      </p>
      <p>
        Following the success of the Semeval 2020 challenge [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the organizers of the Memotion 2 challenge
provide a new dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] including memes and their corresponding texts. The task
includes three subtasks: (Subtask A) sentiment analysis, which classifies content as negative, neutral,
or positive; (Subtask B) emotion classification, which classifies the emotions of memes into
four main categories: funny, sarcastic, offensive and motivational; and (Subtask C)
intensity classification, which determines the level of each emotion: the funny, sarcastic, and
offensive emotions have four levels, while the motivational emotion has only two. The weighted F1
score is used in this competition to evaluate each subtask, and the final score is the average of the three
subscores.
      </p>
      <p>In addition, the task raises two important issues. Firstly, the data is imbalanced among different
classes. Secondly, images and their corresponding texts are not well correlated with each other,
because the texts and the images in memes often point to different meanings, so there is
a need for an effective fusion technique that reduces the semantic gap between the two modalities.
To address these issues, our team proposed several models and achieved good results
on the private leaderboard.</p>
      <p>The remainder of this paper is organized as follows. Section 2 gives a brief review of the literature on
this problem. Section 3 describes our methodology, including unimodal and bimodal models as well as
auxiliary techniques. Section 4 describes the experiments, and the results are summarized in Section 5. Finally, we conclude our study and
outline future research in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        In the previous competition, different approaches were developed to tackle the problem. For
instance, in [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], the authors introduce machine learning and deep learning models including
Naive Bayes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Multimodal Transformer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and ResNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to tackle the problem.
These approaches fall into two main types of models: unimodal and bimodal.
A unimodal model uses only one modality as input, which can be texts or images. A bimodal model
adopts fusion techniques to aggregate features from different modalities to obtain related
information and achieve a better classification rate. Previous studies have employed
state-of-the-art models for text and vision, but they did not consider the correlation between the two modalities
or how to preprocess the data to obtain a clean dataset.
      </p>
      <p>
        In this study, EfficientNet-v2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is employed as a visual extractor, while LSTM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and
RoBERTa [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are used in the bimodal models to extract textual features. In addition, RoBERTa is
also used as the text-only model. With respect to fusion techniques, three methods are used to obtain
aggregated features: plain concatenation, multi-hop attention [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and
stacked attention [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. These techniques are used to combine visual features with textual
features from LSTM and RoBERTa. In total, we evaluated six different models during the
competition.
      </p>
      <p>
        Besides, we also adopt several techniques to improve the classification rate such as Auto
Augmentation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and Deep Canonical Correlation Analysis [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Finally, to improve visual feature
extraction, we use EAST [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to detect texts within images and
then remove them.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        We evaluated several network architectures along with fusion techniques. In addition,
we employ auxiliary methods such as Auto Augmentation and Canonical Correlation Analysis
to improve performance. The proposed models are divided into unimodal and multimodal models
built on a vision backbone, i.e., EfficientNet-v2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and LSTM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for text
processing.
      </p>
      <p>Figure 1 depicts the proposed framework for multiple modalities. It includes one branch for
extracting features from the image and one for extracting features from the text. These features
are combined by an attention-based fusion module before passing through a fully connected
layer for final classification. The number of output nodes of this layer depends on the task.
For the sentiment task, there are 3 nodes denoting the Negative, Neutral and Positive classes. For
the emotion task, there are 4 final linear classifiers, each with two output nodes,
because there are 4 types of emotions in this task and each emotion has two classes, 0 and 1. Lastly,
for the intensity task, there are also 4 final linear classifiers, but the number of
output nodes differs between them: for example, the humour classifier has 4 nodes denoting 4
levels of intensity, while the motivation classifier has only 2 output nodes.</p>
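      <p>The output layout described above can be summarized in a small configuration sketch. The dictionary structure, the key names, and the helper function are illustrative assumptions rather than part of the released code; only the node counts are taken from the text.</p>
      <preformat>
```python
# Output sizes of the final linear classifiers, as described in the text.
# The dictionary layout and the key names are illustrative assumptions.
HEADS = {
    "sentiment": {"sentiment": 3},  # negative, neutral, positive
    "emotion": {"humour": 2, "sarcasm": 2, "offense": 2, "motivation": 2},
    "intensity": {"humour": 4, "sarcasm": 4, "offense": 4, "motivation": 2},
}


def num_outputs(task: str) -> int:
    """Total number of output nodes across all classifiers of a task."""
    return sum(HEADS[task].values())


if __name__ == "__main__":
    for task in HEADS:
        print(task, len(HEADS[task]), "classifier(s),", num_outputs(task), "nodes")
```
      </preformat>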
      <sec id="sec-3-1">
        <title>3.1. Unimodal for Text</title>
        <p>
          BERT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and its variant, e.g., RoBERTa [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], are widely used in Natural Language Processing
(NLP) tasks and have proven to be effective. In this competition, we employed
them for the three subtasks. In subtasks A and B, a single RoBERTa model [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is used, while in subtask C, four
RoBERTa models are adopted so that each backbone classifies the intensity of
one emotion.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Unimodal for Image</title>
        <p>
          As a vision-based approach, EfficientNet-v2 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is a well-known backbone with
fast inference and a low number of parameters. In all three subtasks, EfficientNet-v2 is used
as an extractor to create embeddings. Subtask A has one classification branch to deal with the
three sentiment types, while subtasks B and C have four branches, each of which is responsible for
one of the four types of emotions (subtask B) or its intensity (subtask C).
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multimodal for Image and Text</title>
        <p>
          Multimodal models aggregate vision and text to obtain correlated information. In this challenge,
we build three different fusion models: concatenation, multi-hop attention [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and
stacked attention network [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Concatenation</title>
          <p>Traditionally, concatenating two feature vectors, one per modality, has been a typical
solution to obtain aggregated features. However, this method does not take into account the
importance of each word with respect to the corresponding regions of the image.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Multi-hop Attention</title>
          <p>
            Multi-hop attention was originally proposed in [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. It focuses on parts of a given image together with
the texts within it. The technique aims to emphasize dissimilar features between image regions and
textual utterances by defining a relevance matrix, which is the cosine distance between textual
and visual features.
          </p>
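          <p>As a rough illustration of the relevance matrix mentioned above, the following sketch computes the cosine distance (one minus the cosine similarity) for every pair of textual utterance and image region. The function names, the feature shapes, and the handling of all-zero vectors are assumptions made for this example; the exact formulation used in the cited work may differ.</p>
          <preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def relevance_matrix(text_feats, region_feats):
    """Cosine distance (1 - similarity) for every (utterance, region) pair.

    Rows correspond to textual utterances, columns to image regions.
    """
    return [
        [1.0 - cosine_similarity(t, r) for r in region_feats]
        for t in text_feats
    ]


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]                 # two toy utterance embeddings
    regions = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # three toy region embeddings
    for row in relevance_matrix(text, regions):
        print([round(x, 3) for x in row])
```
          </preformat>
          <p>Larger entries mark utterance-region pairs whose features point in different directions, which is the kind of dissimilarity the attention mechanism is meant to emphasize.</p>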
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Stacked Attention</title>
          <p>
            While a multi-hop attention network is used to learn attention maps between an image and
texts within it, a stacked attention network introduced in [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] is capable of refining an
attention map over multiple steps. Through these attention layers, the regions of interest are emphasized
according to the concepts referred to in a given sentence.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Useful Techniques</title>
        <sec id="sec-3-4-1">
          <title>3.4.1. Auto Augmentation</title>
          <p>
            Augmentation is a simple but important technique to increase the size of a given dataset, leading
to better generalization of the trained model. However, conventional data augmentation relies on
a set of manually designed operations such as Crop, Rotation, and Resize. In our experiment,
we adopt the Auto Augment technique [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] which uses reinforcement learning to automatically
search for a better data augmentation strategy.
          </p>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Canonical Correlation Analysis</title>
          <p>
            Canonical Correlation Analysis (CCA) was proposed in [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. It is a well-established
statistical technique that searches for linear combinations of two sets of input vectors that maximize their
correlation. Deep CCA [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] combines the power of deep neural networks with CCA to
overcome the linear projection constraint of CCA. In this study, the correlation score obtained from Deep
CCA is included in our loss function to maximize the correlation between the two feature sets, leading
to a higher classification rate.
          </p>
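          <p>To make the idea of adding a correlation score to the loss concrete, the sketch below shows a deliberately simplified case with a single projected dimension per modality, where the correlation reduces to the Pearson coefficient. Deep CCA itself works with multi-dimensional projections learned by neural networks, so this is not the objective used in our models; the function names and the weight parameter are hypothetical.</p>
          <preformat>
```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # undefined for constant inputs; 0 is chosen here
    return cov / math.sqrt(var_x * var_y)


def total_loss(classification_loss, visual_proj, textual_proj, weight=1.0):
    """Classification loss minus a weighted correlation score.

    Minimizing this value pushes the two projections towards higher
    correlation. `weight` is a hypothetical hyperparameter that is not
    taken from the paper.
    """
    corr = pearson_correlation(visual_proj, textual_proj)
    return classification_loss - weight * corr


if __name__ == "__main__":
    visual = [0.1, 0.4, 0.35, 0.8]   # toy 1-D visual projections of a batch
    textual = [0.2, 0.5, 0.3, 0.9]   # toy 1-D textual projections of a batch
    print(total_loss(0.7, visual, textual, weight=0.5))
```
          </preformat>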
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          In this shared task of the First Workshop on Multimodal Fact-Checking and Hate Speech
Detection, MEMOTION 2.0 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] was used, which is a dataset for sentiment and emotion analysis of memes. It includes
7,000 samples for the training set and 1,500 samples for the validation set. This dataset is used
for the three subtasks of the MEMOTION 2.0 Challenge and is labeled as follows:
• Sentiment analysis:
– Negative and Very Negative are labeled 0;
– Neutral is labeled 1;
– Positive is labeled 2.
• Emotion classification:
– Not Humorous is labeled 0, while Humorous, which includes funny, very
funny, and hilarious, is labeled 1;
– Not Sarcastic is labeled 0, while Sarcastic, which includes little sarcastic, very
sarcastic, and extremely sarcastic, is labeled 1;
– Not Offensive is labeled 0, while Offensive, which includes slight, very offensive,
and hateful offensive, is labeled 1;
– Not Motivational is labeled 0 and Motivational is labeled 1.
• Scale/intensity of emotion classes:
– Humour: Not funny, funny, very funny, and hilarious are labeled 0, 1, 2, 3,
respectively;
– Sarcasm: Not sarcastic, little sarcastic, very sarcastic, and extremely sarcastic are
labeled 0, 1, 2, 3, respectively;
– Offense: Not offensive, slight, very offensive, and hateful offensive are labeled 0, 1,
2, 3, respectively;
– Motivation: Not motivational is labeled 0 and motivational is labeled 1.
        </p>
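        <p>The labeling scheme above can be expressed as a small encoding routine. This is a minimal sketch: the exact spelling of the annotation strings in the dataset files is an assumption, and the function and variable names are hypothetical.</p>
        <preformat>
```python
# Class ids for Subtask A; the string spellings are assumed for illustration.
SENTIMENT = {"very negative": 0, "negative": 0, "neutral": 1, "positive": 2}

# Ordered intensity levels per emotion; the list index is the Subtask C label.
INTENSITY = {
    "humour": ["not funny", "funny", "very funny", "hilarious"],
    "sarcasm": ["not sarcastic", "little sarcastic", "very sarcastic",
                "extremely sarcastic"],
    "offense": ["not offensive", "slight", "very offensive", "hateful offensive"],
    "motivation": ["not motivational", "motivational"],
}


def encode(sentiment, emotions):
    """Return the (Subtask A, Subtask B, Subtask C) labels for one meme.

    `emotions` maps each emotion name to its annotated level string.
    """
    a = SENTIMENT[sentiment.lower()]
    c = {name: INTENSITY[name].index(level.lower())
         for name, level in emotions.items()}
    # Subtask B is binary: any non-zero intensity counts as class 1.
    b = {name: int(level > 0) for name, level in c.items()}
    return a, b, c


if __name__ == "__main__":
    print(encode("Very Negative", {
        "humour": "hilarious",
        "sarcasm": "not sarcastic",
        "offense": "slight",
        "motivation": "motivational",
    }))
```
        </preformat>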
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Preprocessing</title>
        <p>Although both textual and visual features are important for meme emotion analysis, there is
little correlation between them in the MEMOTION 2.0 dataset. Moreover, the caption is already
provided as part of the dataset. Therefore, in this study, the text is removed from the image
before feature extraction and training of the proposed model.</p>
        <p>
          Based on the previous work [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] that summarized both traditional and deep learning
approaches for text detection and recognition, we design a preprocessing scheme to remove texts
from images as follows. First, we employ the EAST [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] module to detect all text regions in
an image. Then these regions are removed from the image, and the resulting image is used as
the input for EfficientNet-v2 in the proposed framework. Figure 2 visualizes the steps of the
preprocessing scheme.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental setup</title>
        <p>
          All experiments were carried out on a workstation with Titan Xp GPUs. The batch size is 10, the input
image size is 256 × 256, the learning rate is 2e-5, and the Adam [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] optimizer is used in this model
with a weight decay of 1e-5. Moreover, the Cosine Annealing Warm Restarts [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] scheduler is
used for scheduling the learning rate. We also use common augmentation techniques such as
Resize, CenterCrop, and RandomFlip with a probability of 0.5, together with the Auto Augmentation
mentioned above, and then apply Normalize with a mean of (0.485, 0.456, 0.406) and a standard deviation of (0.229, 0.224,
0.225). Finally, our models use cross-entropy as the loss function, except for the single
models for text, which use binary cross-entropy instead.
        </p>
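        <p>For readers unfamiliar with the scheduler, the following sketch evaluates the cosine annealing with warm restarts rule for a given epoch. Only the base learning rate of 2e-5 comes from the setup above; the minimum learning rate, the length of the first cycle, and the cycle multiplier are assumed values chosen for illustration, and whether the schedule is stepped per epoch or per batch is not specified here.</p>
        <preformat>
```python
import math


def cosine_warm_restart_lr(epoch, base_lr=2e-5, eta_min=0.0, t_0=10, t_mult=1):
    """Learning rate at a given epoch under cosine annealing with warm restarts.

    base_lr matches the experimental setup; eta_min, t_0 and t_mult are
    assumed values that are not taken from the paper.
    """
    t_i = t_0      # length of the current cycle
    t_cur = epoch  # epochs elapsed inside the current cycle
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


if __name__ == "__main__":
    for epoch in range(0, 21, 5):
        print(epoch, cosine_warm_restart_lr(epoch))
```
        </preformat>
        <p>Within each cycle the rate decays from the base value towards the minimum and then jumps back to the base value at the start of the next cycle.</p>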
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The evaluation metric of this competition is the weighted F1 score, and the final score is
the average of the weighted F1 scores of the three subtasks. Table 1 summarizes our results
in the public phase with different models. The results of the private phase are presented in
Table 2. Our best weighted F1 scores for the three subtasks are 0.5124 for
sentiment analysis, 0.7423 for emotion classification, and 0.5296 for scale/intensity of emotion
classes.</p>
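      <p>The metric can be sketched as follows, assuming the usual definition in which the per-class F1 scores are averaged with weights proportional to the number of true samples in each class. The function names are hypothetical, and how the per-emotion scores inside Subtasks B and C are combined by the organizers is not covered by this sketch.</p>
      <preformat>
```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score


def final_score(subtask_scores):
    """Average of the subtask scores, as used for the final ranking."""
    return sum(subtask_scores) / len(subtask_scores)


if __name__ == "__main__":
    y_true = [0, 0, 1, 2]  # toy labels, not competition data
    y_pred = [0, 1, 1, 2]
    print(weighted_f1(y_true, y_pred))
```
      </preformat>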
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>
        In this study, we have integrated several attention models and a correlation analysis technique
for meme dataset analysis. To handle the imbalanced dataset, Auto Augmentation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is
adopted, and we found that it provides a richer dataset for further processing. The visual and
textual features extracted by attention models are projected into the most correlated directions
by using DCCA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] for stable and generalizable training. The best result for each subtask
depends on the combination of models used. For the sentiment task, the multi-hop
attention-based LSTM performs best, whereas the concatenation of CNN and BERT gives the
highest result for the emotion task. The stacked attention network with CNN and BERT achieves
the best result for the intensity task.
      </p>
      <p>In the future, an in-depth analysis should be carried out by collecting or synthesizing more data as
well as by mitigating the semantic gap between the text and the image. The imbalance between
classes remains an important problem that can be tackled by data augmentation or by
formulating a new loss function that puts more weight on classes with fewer samples. In
addition, since feature fusion is not always necessary in vision-language tasks, designing a
novel network that can choose whether or not to use fusion remains to be investigated.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the
Information Technology Research Center (ITRC) support program (IITP-2021-2016-0-00312) as well
as a grant (IITP-2019-0-00231) supervised by the Institute for Information &amp; Communications
Technology Planning &amp; Evaluation (IITP). In addition, the authors would like to thank the
FruitLab team for useful ideas and discussion.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Keswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Modi</surname>
          </string-name>
          ,
          <article-title>IITK at SemEval-2020 task 8: Unimodal and bimodal sentiment analysis of internet memes</article-title>
          ,
          <source>in: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pramanick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>Exercise? I thought you said 'extra fries': Leveraging sentence demarcations and multi-hop attention for meme affect analysis</article-title>
          ,
          <source>arXiv preprint arXiv:2103.12377</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>CN-HIT-MI.T at SemEval-2020 task 8: Memotion analysis based on BERT</article-title>
          ,
          <source>in: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1100</fpage>
          -
          <lpage>1105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bhageria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>PYKL</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pulabaigari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gamback</surname>
          </string-name>
          ,
          <article-title>SemEval-2020 task 8: Memotion analysis - the visuo-lingual metaphor!</article-title>
          ,
          <source>arXiv preprint arXiv:2008.03781</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramamoorthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gunti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Memotion 2: Dataset on sentiment and emotion analysis of memes</article-title>
          ,
          <source>in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection</source>
          , CEUR
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramamoorthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gunti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Findings of memotion 2: Sentiment and emotion analysis of memes</article-title>
          ,
          <source>in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection</source>
          , CEUR
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Rish</surname>
          </string-name>
          , et al.,
          <article-title>An empirical study of the naive bayes classifier</article-title>
          ,
          <source>in: IJCAI 2001 workshop on empirical methods in artificial intelligence,</source>
          volume
          <volume>3</volume>
          ,
          <year>2001</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhooshan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Firooz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Testuggine</surname>
          </string-name>
          ,
          <article-title>Supervised multimodal bitransformers for classifying images and text</article-title>
          ,
          <source>arXiv preprint arXiv:1909.02950</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Efficientnetv2: Smaller models and faster training</article-title>
          ,
          <source>arXiv preprint arXiv:2104.00298</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <source>CoRR abs/1907.11692</source>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>Stacked attention networks for image question answering</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Autoaugment: Learning augmentation policies from data</article-title>
          ,
          <source>arXiv preprint arXiv:1805.09501</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Andrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Livescu</surname>
          </string-name>
          ,
          <article-title>Deep canonical correlation analysis</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1247</fpage>
          -
          <lpage>1255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>East: an efficient and accurate scene text detector</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5551</fpage>
          -
          <lpage>5560</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hotelling</surname>
          </string-name>
          ,
          <article-title>Relations between two sets of variates</article-title>
          ,
          <source>in: Breakthroughs in statistics, Springer</source>
          ,
          <year>1992</year>
          , pp.
          <fpage>162</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>Scene text detection and recognition: The deep learning era</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>129</volume>
          (
          <year>2021</year>
          )
          <fpage>161</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Sgdr: Stochastic gradient descent with warm restarts</article-title>
          ,
          <source>arXiv preprint arXiv:1608.03983</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>