<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UPB @ DANKMEMES: Italian Memes Analysis - Employing Visual Models and Graph Convolutional Networks for Meme Identification and Hate Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>George-Alexandru Vlad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George-Eduard Zaharia</string-name>
          <email>george.zaharia0806@stud.acs.upb.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dumitru-Clementin Cercel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mihai Dascalu</string-name>
          <email>mihai.dascalu@upb.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University Politehnica of Bucharest, Faculty of Automatic Control and Computers</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Certain events or political situations drive online users to express themselves through different modalities. One of these is represented by Internet memes, which combine text with a representative image to convey a wide range of emotions, from humor to sarcasm and even hate. In this paper, we describe our approach for the DANKMEMES competition from EVALITA 2020, consisting of a multimodal multi-task learning architecture based on two main components. The first is a Graph Convolutional Network combined with an Italian BERT for text encoding, while the second is chosen from several image-based architectures (i.e., ResNet50, ResNet152, and VGG-16) for image representation. Our solution achieves good performance on the first two tasks of the current competition, ranking 3rd for both Task 1 (.8437 macro-F1 score) and Task 2 (.8169 macro-F1 score), while exceeding the official baselines by large margins.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>During the past two decades, the Internet has evolved
massively and the social web has become a hub where
people share their opinions, cooperate to solve
issues, or simply discuss various topics. There
are many ways in which users can express
themselves: plain text, videos, or images. The
latter option became widely used due to its
convenience; however, images are frequently
accompanied by a short text description to better
convey information. As the Internet and the online
social interactions evolved, certain image templates
emerged and gained global popularity,
contributing to a de facto standardization of joint
text-image usage, and thus leading to the creation of
memes. Memes can be humorous, satirical,
offensive, or hateful, therefore encapsulating a wide
range of emotions and beliefs. Properly
identifying memes from non-memes, and then analyzing
them to detect the users’ intentions, is becoming
a pressing task in online marketing campaigns,
which target the automated identification of opinions
pertaining to certain groups of users.</p>
      <p>
        The DANKMEMES competition [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] from
EVALITA 2020 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] challenged participants to
approach the previously mentioned issues by
creating systems that identify and analyze Internet
memes in Italian. The competition consists of
three tasks, out of which we tackled two. Task
1 - Meme Detection considers the identification
of memes from a collection of images, such that
a clear distinction can be made between memes
and ordinary images. Afterwards, Task 2 - Hate
Speech Identification targets the classification of
images in terms of their purpose, by analyzing
content and identifying whether images are
hateful or not.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <sec id="sec-2-1">
        <title>2.1 Multimodal Fake News Detection</title>
        <p>
          Singhal et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] employed
multimodal techniques for fake news detection. The
authors introduced SpotFake, an architecture divided
into three sub-parts: one for identifying textual
features using Bidirectional Encoder
Representations from Transformers (BERT) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], a second
for visual analysis based on VGG-19 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], while the
third combines the previously mentioned elements
into a single feature vector.
        </p>
        <p>
          Similarly, Shah and Kobti [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] performed
multimodal fake news detection by using two
separate channels, visual and textual, both of them
aiming to extract relevant features. Moreover, they
included a Cultural Algorithm that introduces
another dimension by employing situational
knowledge, i.e. information about the depicted event as
seen by a specific individual. Another approach
regarding fake news detection was introduced by
Khattar et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] who created MVAE, a
multimodal autoencoder including encoders (both
visual and textual), decoders, and a detection
module for classifying the inputs.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Multimodal Hate Speech Identification</title>
        <p>
          Kiela et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] created a new dataset specifically
designed for identifying hateful speech in memes.
At the same time, the authors also introduced a
series of baselines for further comparison, including
ResNet-152 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and ViLBERT [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for the visual
channel, and BERT for the textual counterpart.
        </p>
        <p>
          Furthermore, Sabat et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] tackled the
problem of hate speech identification in memes by also
employing a multimodal system. However, they
used an Optical Character Recognition system for
extracting the textual component from the inputs,
alongside visual features from a VGG-16
component and the text encoded with BERT.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Method</title>
      <p>
        Our approach for both tasks relies on a
multi-task learning technique [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and our architecture
consists of two main neural network components:
one for the text input and the other for the
image input. Thus, we combined the outputs of these
two components and used the learned features for
determining the required class, either for Task 1 or
Task 2.
      </p>
      <sec id="sec-3-1">
        <title>3.1 Corpus</title>
        <p>The dataset for the meme detection task is split
into two parts, train and test. The training dataset
contains 1,600 image entries, together with a CSV
file containing other useful metadata, such as: the
engagement (i.e., number of comments and likes),
date, and manipulation (i.e., binary coding
denoting the low/high level of image modifications),
alongside a transcript of the text present in the
image. We kept 85% of the entries for training, while
15% were used for validation; the same class
distribution was kept in both partitions. The test dataset
for the first task contains 400 entries with a
corresponding CSV file of a similar structure. The
second task offers a dataset containing 800 entries,
which was partitioned in a similar manner.</p>
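        <p>For illustration, the stratified 85%/15% split could be performed as in the following minimal sketch; the file name and the label column are our assumptions, not the official CSV schema:</p>
        <preformat>import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names for the Task 1 training metadata
df = pd.read_csv("dankmemes_task1_train.csv")
train_df, val_df = train_test_split(
    df,
    test_size=0.15,          # 15% of the entries for validation
    stratify=df["label"],    # keep the same class distribution in both partitions
    random_state=42,
)</preformat>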
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Image Component</title>
        <p>
          Several image-based neural networks were
considered for the first component of our final
architecture. First, we used VGG-16, which
consists of five stacks of Convolutional Neural
Networks [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] accompanied by max-pooling layers.
Pretrained weights on the ImageNet dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
were afterwards fine-tuned. Second, we also
experimented with ResNet in two variants, ResNet50
and ResNet152. ResNet introduced the concept
of skip connections as a solution to the vanishing
gradient problem; as such, the networks could be
further scaled in terms of depth, enabling more
abstract high-level features to be extracted from
the input images. Similar to the VGG-16 architecture,
weights pretrained on ImageNet were fine-tuned
for ResNet152, whereas weights pretrained on
VGGFace2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] were used for ResNet50.
        </p>
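        <p>A minimal TensorFlow sketch of such an image branch is shown below; the number of frozen layers is an illustrative assumption, not the exact setting used in our experiments:</p>
        <preformat>import tensorflow as tf

# ResNet152 with ImageNet-pretrained weights and no classification head;
# global average pooling yields one feature vector per image.
base = tf.keras.applications.ResNet152(
    include_top=False, weights="imagenet",
    input_shape=(448, 448, 3), pooling="avg",
)
# Freeze all but the last few layers so only high-level features are fine-tuned
for layer in base.layers[:-10]:
    layer.trainable = False</preformat>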
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Text Component</title>
        <p>
          A Graph Convolutional Network (GCN) [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]
for representing long-term dependencies between
tokens was selected, alongside a pretrained
version of BERT for Italian (ItalianBERT, available
at https://github.com/dbmdz/berts#italian-bert) to
model the contextual information at the sample level.
The underlying implementation of the textual
feature extractor follows the architectural design
of Vocabulary Graph Convolutional Network with
BERT (VGCN-BERT) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
        <p>
          The proposed architecture
(VGCN-ItalianBERT) uses a tight coupling between the
graph convolutional layers and the ItalianBERT
embeddings, enabling the model to better adjust
the GCN-extracted features through ItalianBERT’s
attention mechanism. The input to the VGCN
layer is represented by a matrix $X_{d,v}$, where $d$ is
the dimension of the ItalianBERT embeddings and
$v$ is the number of tokens in the dataset vocabulary.
A symmetric adjacency matrix $A_{v,v}$ is built to
preserve the prior global relationships between
tokens. The edge weight between two nodes $i$ and $j$,
denoted as $A_{i,j}$, is initialized with the normalized
point-wise mutual information (NPMI) value [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] between the two vocabulary tokens $i$ and $j$. The
mechanism of the VGCN layer is formally summarized
by the following equations:
        </p>
        <p>$H_{v,h} = \mathrm{Dropout}(\tilde{A}_{v,v} W_{v,h})$ (1)</p>
        <p>$H_{d,h} = \mathrm{ReLU}(X_{d,v} H_{v,h})$ (2)</p>
        <p>$H_{d,g} = H_{d,h} W_{h,g}$ (3)
where terms $W_{v,h}$ and $W_{h,g}$ represent the weights
of the two internal GCN layers, with $v$ the
vocabulary dimension, and $h$ and $g$ the output feature
dimensions. In Equation 1, we add the global
context by multiplying the normalized adjacency
matrix $\tilde{A}$ with the weight matrix of the first
GCN layer. We use the normalized adjacency
matrix $\tilde{A} = D^{-1/2} A D^{-1/2}$ to ensure numerical
stability. A convolution between the input matrix
$X_{d,v}$ and the result from the previous operation
(Equation 2) is performed to combine the global
information with the ItalianBERT embeddings.
Lastly, Equation 3 projects the features to the
dimensions required to fill in the reserved
VGCN-ItalianBERT embedding slots.</p>
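        <p>A minimal TensorFlow sketch of Equations 1-3 follows, assuming a precomputed normalized adjacency matrix; the function and variable names are ours, not the reference implementation:</p>
        <preformat>import tensorflow as tf

# X: (d, v) ItalianBERT embeddings; A_tilde: (v, v) normalized adjacency;
# W_vh: (v, h) and W_hg: (h, g) are the two trainable GCN weight matrices.
def vgcn_layer(X, A_tilde, W_vh, W_hg, dropout_rate=0.2, training=True):
    # Eq. 1: inject the global vocabulary context
    H_vh = tf.matmul(A_tilde, W_vh)
    if training:
        H_vh = tf.nn.dropout(H_vh, rate=dropout_rate)
    # Eq. 2: convolve the input embeddings with the graph features
    H_dh = tf.nn.relu(tf.matmul(X, H_vh))
    # Eq. 3: project to the dimension of the reserved embedding slots
    return tf.matmul(H_dh, W_hg)</preformat>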
        <p>Visual text features describing the actors of
a meme are added as the pair sentence to
ItalianBERT’s input. We cap the second sentence
containing the visual text features to $K$ tokens,
with overflowing tokens being dropped. Considering
$L$ the maximum number of input tokens, the
remaining $L - K$ tokens are split
between the text tokens associated with a meme
and $G$ VGCN reserved slots. Those slots are
kept empty, to be internally filled with VGCN
embeddings during training. Alongside the ordinary
inputs required by ItalianBERT (i.e., input_ids,
input_masks, and segment_ids), we build a
gcn_ids vector similar to input_ids, by mapping each
unique input token to the corresponding index
in the task vocabulary $V_{task}$; $V_{task}$ represents the
set of tokens available both in the task text corpus
and in ItalianBERT’s vocabulary. The second
additional input is represented by a binary mask
vector (gcn_masks), having the value of 1 for the VGCN
reserved tokens, and 0 otherwise. During training,
all ItalianBERT layers with the exception of the
last 4 encoder blocks were frozen.
</p>
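        <p>The construction of the two additional inputs is illustrated by the minimal sketch below; the helper name and the placement of the reserved slots at the end of the sequence are our assumptions, as the exact layout follows the VGCN-BERT implementation:</p>
        <preformat># Hypothetical helper for the two extra VGCN inputs
def build_gcn_inputs(tokens, task_vocab, max_len=100, num_gcn_slots=16):
    text_len = max_len - num_gcn_slots
    # map each input token to its index in the task vocabulary (0 if unknown)
    gcn_ids = [task_vocab.get(tok, 0) for tok in tokens[:text_len]]
    gcn_ids += [0] * (text_len - len(gcn_ids))   # pad the text part
    gcn_ids += [0] * num_gcn_slots               # reserved slots carry no ids
    # binary mask: 1 for the VGCN reserved slots, 0 otherwise
    gcn_mask = [0] * text_len + [1] * num_gcn_slots
    return gcn_ids, gcn_mask</preformat>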
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Multimodal Architecture</title>
        <p>The final solution consists of a multimodal
architecture with two main components, each
specialized in processing one informational
channel, namely text or image-based. The
dates are segmented and encoded by using
complementary sine and cosine functions to
preserve the cyclic characteristics of days (in a
month) and months. Equation 4 describes the
cyclical time encoding procedure, where $n$ represents
the day value decreased by 1 and divided
by the number of days in the corresponding
month. The same operations are applied for
the month encoding over the month index, but
the denominator is 12 in this case. Additional
metadata (i.e., manipulation and engagement) was
also encoded and used in the final prediction.
Values representing the year and engagement were
normalized to ensure the model’s stability during
training.</p>
        <p>$time_{sin} = \sin(2\pi n); \; time_{cos} = \cos(2\pi n)$ (4)</p>
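        <p>For clarity, a minimal sketch of this cyclical encoding follows; the function name is ours, and the $2\pi n$ argument is assumed from the standard formulation:</p>
        <preformat>import math

def encode_day_cyclically(day, days_in_month):
    # n: the day value decreased by 1 and divided by the days in that month
    n = (day - 1) / days_in_month
    return math.sin(2 * math.pi * n), math.cos(2 * math.pi * n)

# months use the same scheme with a denominator of 12: n = (month - 1) / 12</preformat>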
        <p>The two feature vectors from the image and text
components were fused together by concatenation
into a single vector and passed through two fully
connected layers, followed by a dropout layer with
a rate of 0.5. The output of the dropout layer is then
concatenated with the other extracted
features (i.e., time, engagement, and manipulation)
and fed to the output layer. A softmax activation
function is applied over the last fully connected layer
to compute the probability distribution over the
task classes. An L2 regularization kernel is used on
the two hidden layers before fusion to account
for large activations and to keep our output layer
sensitive to the encoded metadata features.</p>
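        <p>A minimal Keras sketch of this fusion head is given below; the hidden-layer sizes, input dimensions, and the L2 strength are illustrative assumptions:</p>
        <preformat>import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_fusion_head(text_dim=768, image_dim=2048, meta_dim=5, num_classes=2):
    text_in = tf.keras.Input(shape=(text_dim,), name="text_features")
    image_in = tf.keras.Input(shape=(image_dim,), name="image_features")
    meta_in = tf.keras.Input(shape=(meta_dim,), name="metadata_features")

    # concatenate text and image features, then two regularized dense layers
    x = layers.Concatenate()([text_in, image_in])
    x = layers.Dense(512, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01))(x)
    x = layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01))(x)
    x = layers.Dropout(0.5)(x)

    # fuse with the encoded metadata (time, engagement, manipulation)
    x = layers.Concatenate()([x, meta_in])
    out = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model([text_in, image_in, meta_in], out)</preformat>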
        <p>
          In addition, an ensemble-based architecture
using our ResNet50 + VGCN-ItalianBERT model
was also considered. First, the training dataset
was split into 5 folds, while preserving the class
distribution in each fold. The aforementioned
model was trained 5 times, using 4 folds for
training and the remaining fold for validation.
A weighted voting procedure is performed at
prediction time, in which the weights are
represented by the average confidence score of the
voters in the class receiving the highest probability
after softmax. Thus, we favor higher
confidence scores over the number of voters
when choosing the predicted class.
        </p>
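        <p>A minimal sketch of this weighted voting scheme follows, where probs stacks the softmax outputs of the 5 trained voters for one sample; the function name is ours:</p>
        <preformat>import numpy as np

def weighted_vote(probs):
    # probs: (num_models, num_classes) softmax outputs for a single sample
    votes = probs.argmax(axis=1)   # each voter picks its most probable class
    scores = probs.max(axis=1)     # the confidence of each vote
    class_weight = {}
    for cls in np.unique(votes):
        # weight = average confidence of the voters choosing this class
        class_weight[cls] = scores[votes == cls].mean()
    # favor higher confidence scores over the raw number of voters
    return max(class_weight, key=class_weight.get)</preformat>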
      </sec>
      <sec id="sec-3-5">
        <title>3.5 Experimental Setup</title>
        <p>Preprocessing steps were performed to feed the
datasets to our architecture. The texts were
tokenized using the ItalianBERT tokenizer, and
then the input_ids, input_masks, segment_ids,
gcn_ids, and gcn_masks were computed. Images were
resized to a uniform dimension (i.e., 448 x 448)
and were serialized alongside the text components
in a TFRecords file specific to TensorFlow [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. An
Adam optimizer with weight decay [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] with a learning
rate of 1e-5 and a weight decay rate of 0.01 was
used in all conducted experiments. Furthermore,
the warm-up proportion was set to 0.1.
        </p>
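        <p>For illustration, one training example could be serialized as in the following sketch; the feature names are our assumptions:</p>
        <preformat>import tensorflow as tf

# Hypothetical feature layout for one (image, text, label) training example
def serialize_example(image_bytes, input_ids, label):
    feature = {
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "input_ids": tf.train.Feature(
            int64_list=tf.train.Int64List(value=input_ids)),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# usage: with tf.io.TFRecordWriter("train.tfrecords") as writer:
#            writer.write(serialize_example(img_bytes, ids, lbl))</preformat>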
        <p>
          The maximum input length was limited to $L = 100$
tokens and the visual text features to $K = 20$
tokens, as the textual channel of memes is
represented by short sentences. Following the
experimental setup described in [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], we reserve
$G = 16$ slots to be filled with the resulting
VGCN-ItalianBERT embeddings. Moreover, only NPMI
values larger than 0.3 are kept in the adjacency
matrix $A$, corresponding to a higher semantic
correlation between words; all the other values
below this threshold are set to 0.
        </p>
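        <p>Below is a minimal sketch of building such an NPMI-thresholded adjacency from sliding-window co-occurrence counts; the window size and function name are our assumptions:</p>
        <preformat>import math
from collections import Counter
from itertools import combinations

def build_npmi_adjacency(texts, vocab, window=20, threshold=0.3):
    word_count, pair_count, num_windows = Counter(), Counter(), 0
    for tokens in texts:
        tokens = [t for t in tokens if t in vocab]
        for start in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[start:start + window])
            num_windows += 1
            word_count.update(win)
            pair_count.update(frozenset(p) for p in combinations(win, 2))
    adj = {}
    for pair, n_ij in pair_count.items():
        i, j = tuple(pair)
        p_ij = n_ij / num_windows
        if p_ij >= 1.0:
            continue  # NPMI is undefined when a pair occurs in every window
        p_i, p_j = word_count[i] / num_windows, word_count[j] / num_windows
        npmi = math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)
        if npmi > threshold:  # keep only higher semantic correlations
            adj[(i, j)] = adj[(j, i)] = npmi
    return adj</preformat>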
        <p>
          We empirically found 1e-5 to be a good learning
rate value, which is on par with the results of [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
Lastly, we chose to train all the models for 9
epochs with a batch size of 8 examples.
        </p>
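        <p>Putting the setup together, training could be configured as in the sketch below; the model and dataset objects are assumed to exist, the built-in AdamW requires TensorFlow 2.11 or newer, and the 0.1 warm-up proportion would additionally require a custom learning-rate schedule:</p>
        <preformat>import tensorflow as tf

# `model`, `train_ds`, and `val_ds` are assumed to already exist
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-5, weight_decay=0.01)
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds.batch(8), validation_data=val_ds.batch(8), epochs=9)</preformat>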
      </sec>
      <sec id="sec-3-5">
        <title>Results</title>
        <p>Table 1 contains the results obtained by
our models for the first two tasks of the
DANKMEMES competition. The components
that were frozen during the training process are
varied for the three main conducted experiments
(i.e., combining ItalianBERT with VGCN and
ResNet50, ResNet152, and VGG-16, respectively)
to identify proper adjustments for the weights of
the pretrained models. The best results among the
four evaluated sets (i.e., validation and test for Task
1, and validation and test for Task 2) are obtained
by either freezing only the VGCN-ItalianBERT
component or by freezing both textual and image
components. The necessity of freezing the text
branch of the architecture underlines the fact
that the pretrained weights for the ItalianBERT
model already properly capture specific traits
of Italian and prove to be a viable option, even
when analyzing short texts such as memes.
Furthermore, the last convolutional block of the
image component needs to be unfrozen because
training an architecture on potential meme
images is a more specific task when compared to
analyzing Italian text.</p>
        <p>The best results are obtained using variations of
the ResNet50 + VGCN-ItalianBERT model, with
a .9041 macro-F1 score on the custom validation
dataset used for Task 1, and .8745 and .8169
macro-F1 scores on the validation and test datasets
for Task 2. However, the best result on the Task 1
test set is yielded by the ResNet152 + VGCN-ItalianBERT
architecture, with a .8700 macro-F1
score.</p>
        <p>ItalianBERT, ResNet50, and ResNet50 +
ItalianBERT are used as baseline models to
explore the improvements made by adding VGCN
to the textual architecture while maintaining the
same experimental setup. As expected, the model
using only the textual channel (i.e., the ItalianBERT
baseline model) performs considerably
worse than the joint ResNet50 +
ItalianBERT architecture, thus arguing for the importance
of considering images in disambiguating
the textual input. The ResNet50 +
VGCN-ItalianBERT model performs consistently better
than its baseline counterpart (i.e., ResNet50
+ ItalianBERT), obtaining improvements
of 2.92% and 3.35% in macro-F1 score on the
validation sets for Task 1 and Task 2, respectively.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Error Analysis</title>
        <p>Although the models performed reasonably well
on both tasks, the identified misclassifications
represent a good starting point for further analysis
and improvement. Figure 1 depicts a series of
misclassified entries from both tasks.</p>
        <p>The short texts encountered in memes require,
in several situations, prior information on the
socio-political context, therefore making the
detection of memes an exceedingly difficult task.
In general, a few well-known and highly popular
image templates are reused by changing or
partially adjusting the text to expressively convey
an idea or a view on a certain subject. However,
the templates used in the current competition are
extensively customized and tailored specifically
to the political context of Italy. In addition, the
subjectivity of the annotators also plays a decisive
role, considering that the concept of the hateful
speech tag for the second task is not well defined
for all situations and can be interpreted differently.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion and Future Work</title>
      <p>This paper introduces our multimodal architecture
for the first two tasks of the DANKMEMES
competition from EVALITA 2020. Several
joint architectures were experimented with,
combining a textual component (a Vocabulary Graph
Convolutional Network alongside an Italian BERT
model) with image-based components (ResNet50,
ResNet152, and VGG-16). The
consideration of meme meta-information, such
as cyclic temporal characteristics and post
engagement, further boosted our F1-scores
when compared to the competition baselines.</p>
      <p>
        In terms of future work, we intend to
experiment with other visual architectures,
including VGG-19 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and EfficientNet [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and
also with multilingual neural networks, such as
mBERT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and XLM-RoBERTa [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], that will
empower transfer learning across meme datasets
in different languages.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Rich</given-names>
            <surname>Caruana</surname>
          </string-name>
          . “
          <article-title>Multitask learning”</article-title>
          .
          <source>In: Machine learning 28.1</source>
          (
          <year>1997</year>
          ), pp.
          <fpage>41</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Gerlof</given-names>
            <surname>Bouma</surname>
          </string-name>
          .
          <article-title>“Normalized (pointwise) mutual information in collocation extraction”</article-title>
          .
          <source>In: Proceedings of GSCL</source>
          (
          <year>2009</year>
          ), pp.
          <fpage>31</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jia</given-names>
            <surname>Deng</surname>
          </string-name>
          et al. “
          <article-title>Imagenet: A large-scale hierarchical image database”</article-title>
          .
          <source>In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE</source>
          .
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          . “
          <article-title>Convolutional neural networks for sentence classification”</article-title>
          .
          <source>In: arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          . “
          <article-title>Very deep convolutional networks for large-scale image recognition”</article-title>
          .
          <source>In: arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Abadi</surname>
          </string-name>
          et al. “
          <article-title>Tensorflow: Largescale machine learning on heterogeneous distributed systems”</article-title>
          .
          <source>In: arXiv preprint arXiv:1603.04467</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          et al. “
          <article-title>Deep residual learning for image recognition”</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hutter</surname>
          </string-name>
          . “
          <article-title>Decoupled weight decay regularization”</article-title>
          .
          <source>In: arXiv preprint arXiv:1711.05101</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Qiong</given-names>
            <surname>Cao</surname>
          </string-name>
          et al. “
          <article-title>Vggface2: A dataset for recognising faces across pose and age”</article-title>
          .
          <source>In: 2018 13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG</source>
          <year>2018</year>
          ). IEEE.
          <year>2018</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          et al. “
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding”</article-title>
          .
          <source>In: arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Conneau</surname>
          </string-name>
          et al. “
          <article-title>Unsupervised crosslingual representation learning at scale”</article-title>
          .
          <source>In: arXiv preprint arXiv:1911.02116</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Dhruv</given-names>
            <surname>Khattar</surname>
          </string-name>
          et al. “
          <article-title>Mvae: Multimodal variational autoencoder for fake news detection”</article-title>
          .
          <source>In: The World Wide Web Conference</source>
          .
          <year>2019</year>
          , pp.
          <fpage>2915</fpage>
          -
          <lpage>2921</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Jiasen</given-names>
            <surname>Lu</surname>
          </string-name>
          et al. “
          <article-title>Vilbert: Pretraining taskagnostic visiolinguistic representations for vision-and-language tasks”</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          .
          <year>2019</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Telmo</given-names>
            <surname>Pires</surname>
          </string-name>
          , Eva Schlinger, and Dan Garrette. “
          <article-title>How multilingual is Multilingual BERT?</article-title>
          ”.
          <source>In: arXiv preprint arXiv:1906.01502</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Benet</given-names>
            <surname>Oriol Sabat</surname>
          </string-name>
          , Cristian Canton Ferrer, and Xavier Giro-i-Nieto. “
          <article-title>Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation”</article-title>
          .
          <source>In: arXiv preprint arXiv:1910.02334</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Shivangi</given-names>
            <surname>Singhal</surname>
          </string-name>
          et al. “
          <article-title>SpotFake: A Multi-modal Framework for Fake News Detection”</article-title>
          .
          <source>In: 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM)</source>
          .
          <source>IEEE</source>
          .
          <year>2019</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Tan</surname>
          </string-name>
          and Quoc V. Le. “
          <article-title>Efficientnet: Rethinking model scaling for convolutional neural networks”</article-title>
          .
          <source>In: arXiv preprint arXiv:1905.11946</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Liang</given-names>
            <surname>Yao</surname>
          </string-name>
          , Chengsheng Mao, and Yuan Luo. “
          <article-title>Graph convolutional networks for text classification”</article-title>
          .
          <source>In: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          . Vol.
          <volume>33</volume>
          .
          <year>2019</year>
          , pp.
          <fpage>7370</fpage>
          -
          <lpage>7377</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          et al. “
          <article-title>EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian”</article-title>
          .
          <source>In: Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ). Ed. by Valerio Basile et al.
          <source>Online: CEUR.org</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Douwe</given-names>
            <surname>Kiela</surname>
          </string-name>
          et al. “
          <article-title>The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes”</article-title>
          .
          <source>In: arXiv preprint arXiv:2005.04790</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Zhibin</given-names>
            <surname>Lu</surname>
          </string-name>
          , Pan Du, and
          <string-name>
            <given-names>Jian-Yun</given-names>
            <surname>Nie</surname>
          </string-name>
          . “
          <article-title>VGCN-BERT: Augmenting BERT with Graph Embedding for Text Classification”</article-title>
          .
          <source>In: European Conference on Information Retrieval</source>
          . Springer.
          <year>2020</year>
          , pp.
          <fpage>369</fpage>
          -
          <lpage>382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Martina</given-names>
            <surname>Miliani</surname>
          </string-name>
          et al. “
          <article-title>DANKMEMES @ EVALITA2020: The memeing of life: memes, multimodality and politics”</article-title>
          .
          <source>In: Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ). Ed. by Valerio Basile et al.
          <source>Online: CEUR.org</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Priyanshi</given-names>
            <surname>Shah</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ziad</given-names>
            <surname>Kobti</surname>
          </string-name>
          . “
          <article-title>Multimodal fake news detection using a Cultural Algorithm with situational and normative knowledge”</article-title>
          .
          <source>In: 2020 IEEE Congress on Evolutionary Computation (CEC)</source>
          .
          <source>IEEE</source>
          .
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>