UPB @ DANKMEMES: Italian Memes Analysis - Employing Visual
 Models and Graph Convolutional Networks for Meme Identification and
                      Hate Speech Detection
                     George-Alexandru Vlad∗ , George-Eduard Zaharia∗ ,
                          Dumitru-Clementin Cercel, Mihai Dascalu
        University Politehnica of Bucharest, Faculty of Automatic Control and Computers
         {george.vlad0108, george.zaharia0806}@stud.acs.upb.ro
                    {dumitru.cercel, mihai.dascalu}@upb.ro

                        Abstract                                information. As the Internet and the online so-
                                                                cial interactions evolved, certain image templates
    Certain events or political situations deter-               emerged and gained global popularity, contribut-
    mine users from the online environment                      ing to a de facto standardization of joint text-
    to express themselves by using different                    image usage, and thus leading to the creation of
    modalities. One of them is represented by                   memes. Memes can be humorous, satirical, of-
    Internet memes, which combine text with                     fensive, or hateful, therefore encapsulating a wide
    a representative image to entail a wide                     range of emotions and beliefs. Properly identify-
    range of emotions, from humor to sarcasm                    ing memes from non-memes, and then analyzing
    and even hate. In this paper, we describe                   them to detect the users’ intentions is becoming
    our approach for the DANKMEMES com-                         a stringent task in online marketing campaigns by
    petition from EVALITA 2020 consisting                       targeting the automated identification of opinions
    of a multimodal multi-task learning archi-                  pertaining to certain groups of users.
    tecture based on two main components.                          The DANKMEMES competition [22] from
    The first one is a Graph Convolutional                      EVALITA 2020 [19] challenged participants to ap-
    Network combined with an Italian BERT                       proach the previously mentioned issues by cre-
    for text encoding, while the second is var-                 ating systems that identify and analyze Internet
    ied between different image-based archi-                    memes in Italian. The competition consists of
    tectures (i.e., ResNet50, ResNet152, and                    three tasks, out of which we tackled two. Task
    VGG-16) for image representation. Our                       1 - Meme Detection considers the identification
    solution achieves good performance on the                   of memes from a collection of images, such that
    first two tasks of the current competition,                 a clear distinction can be made between memes
    ranking 3rd for both Task 1 (.8437 macro-                   and ordinary images. Afterwards, Task 2 - Hate
    F1 score) and Task 2 (.8169 macro-F1                        Speech Identification targets the classification of
    score), while exceeding by high margins                     images in terms of their purpose, by analyzing
    the official baselines.                                     content and identifying whether images are hate-
                                                                ful or not.
1    Introduction
                                                                2     Related Work
During the past two decades, the Internet evolved
massively and the social web became a hub where                 2.1    Multimodal Fake News Detection
people share their opinions, cooperate to solve is-
                                                                Singhal et al. [16] employed the usage of multi-
sues, or simply discuss on various topics. There
                                                                modal techniques for fake news detection. The au-
are many ways in which users can express them-
                                                                thors introduced SpotFake, an architecture divided
selves: plain text, videos, or images. The lat-
                                                                into three sub-parts: one for identifying textual
ter option became widely used due to its conve-
                                                                features using Bidirectional Encoder Representa-
nience; however, images are frequently accompa-
                                                                tions from Transformers (BERT) [10], a second
nied by a short text description to better convey
                                                                for visual analysis based on VGG-19 [5], while the
    ∗
      These authors contributed equally.                        third combines the previously mentioned elements
     Copyright © 2020 for this paper by its authors. Use per-
mitted under Creative Commons License Attribution 4.0 In-
                                                                into a single feature vector.
ternational (CC BY 4.0).                                           Similarly, Shah and Priyanshi [23] performed
multimodal fake news detection by using two sep-        responding CSV file of a similar structure. The
arate channels, visual and textual, both of them        second task offers a dataset containing 800 entries
aiming to extract relevant features. Moreover, they     which was partitioned in a similar manner.
included a Cultural Algorithm that introduces an-
other dimension by employing situational knowl-         3.2   Image Component
edge, i.e. information about the depicted event as      Several image-based neural networks were con-
seen by a specific individual. Another approach         sidered for the first component of our final ar-
regarding fake news detection was introduced by         chitecture. First, we used VGG-16 which con-
Khattar et al. [12] who created MVAE, a multi-          sists of five stacks of Convolutional Neural Net-
modal autoencoder including encoders (both vi-          works [4] accompanied by max-pooling layers.
sual and textual), decoders, and a detection mod-       Pretrained weights on the ImageNet dataset [3]
ule for classifying the inputs.                         were afterwards fine-tuned. Second, we also ex-
                                                        perimented with ResNet in two variants, ResNet50
2.2   Multimodal Hate Speech Identification
                                                        and ResNet152. ResNet introduced the concept
Kiela et al. [20] created a new dataset specifically    of skip connections as a solution to the vanishing
designed for identifying hateful speech in memes.       gradient problem; as such, the networks could be
At the same time, the authors also introduced a se-     further scaled in terms of depth, enabling more
ries of baselines for further comparison, including     abstract high-level features to be extracted from
ResNet-152 [7] and VilBERT [13] for the visual          the input images. Similar VGG-16 architecture,
channel, and BERT for the textual counterpart.          pretrained weights on ImageNet were fine-tuned
   Furthermore, Sabat et al. [15] tackled the prob-     for ResNet152, whereas pretrained weights on
lem of hate speech identification in memes by also      VGGFace2 [9] were used for ResNet50.
employing a multimodal system. However, they
used an Optical Character Recognition system for        3.3   Text Component
extracting the textual component from the inputs,       A Graph Convolutional Network (GCN) [18]
alongside visual features from a VGG-16 compo-          for representing long-term dependencies between
nent and the text encoded with BERT.                    tokens was selected, alongside a pretrained
                                                        version of BERT for Italian (ItalianBERT)1 to
3     Method                                            model the contextual information at sample level.
Our approach for both tasks consists of a multi-        The underlying implementation of the textual
task learning technique [1] and our architecture        feature extractor follows the architectural design
consists of two main neural network components,         of Vocabulary Graph Convolutional Network with
one for the text input, while the other for the im-     BERT (VGCN-BERT) [21].
age input. Thus, we combined the outputs of these          The        proposed      architecture   (VGCN-
two components and used the learned features for        ItalianBERT) uses a tight coupling between the
determining the required class, either for Task 1 or    graph convolutional layers and the ItalianBERT
Task 2.                                                 embeddings, enabling the model to better adjust
                                                        the GCN extracted features through ItalianBERT’s
3.1   Corpus                                            attention mechanism. The input to the VGCN
The dataset for the meme detection task is split        layer is represented by a vector Xd,v , where d is
into two parts, train and test. The training dataset    the dimension of the ItalianBERT embedding and
contains 1,600 image entries, together with a CSV       v is the number of tokens in the dataset vocabulary.
file containing other useful metadata, such as: the     A symmetric adjacency matrix Av,v is built to
engagement (i.e., number of comments and likes),        preserve the prior global relationship between
date, and manipulation (i.e., binary coding denot-      tokens, where v is the vocabulary dimension. The
ing the low/high level of image modifications),         edge weight between two nodes i, j, denoted as
alongside a transcript of the text present in the im-   Ai,j , is initialized with the normalized point-wise
age. We kept 85% of the entries for training, while     mutual information (NPMI) value [2] between the
15% are used for validation; the same class distri-     two vocabulary tokens i, j. The mechanism of
bution is kept in both partition. The test dataset        1
                                                            https://github.com/dbmdz/berts#
for the first task contains 400 entries with a cor-     italian-bert
the VGCN layer is formally summarized by the          dates are segmented and encoded by using
following equations:                                  complementary sine and cosine functions to
                                                      preserve the cyclic characteristics of days (in a
          Hv,h = Dropout(A
                         ev,v Wv,h )            (1)   month) and months. Equation 4 describes the time
                                                      cyclical encoding procedure, where n represents
           Hd,h = ReLU (Xd,v Hv,h )             (2)   the day value subtracted by 1 and divided
                Hd,g = Hd,h Wh,g                (3)   by the number of days in the corresponding
                                                      month. The same operations are applied for
where terms Wv,h and Wh,g represent the weights       the months encoding over the month index, but
of the two GCN internal layers, with v the            the denominator is 12 in this case. Additional
vocabulary dimension, h and g the output feature      metadata (i.e., manipulation and engagement) was
dimensions. In Equation 1, we add the global          also encoded and used in the final prediction.
context by multiplying the normalized adjacency       Values representing the year and engagement were
matrix A  e with the weight matrix of the first       normalized to ensure the model’s stability during
GCN layer. We use the normalized adjacency            training.
matrix A e = D−1/2 AD−1/2 to ensure numerical
stability. A convolution between the input vector                                    θ =2∗π∗n
                                                                                                       (4)
Xd,v and the result from the previous operation             timesin = sin(θ); timecos = cos(θ)
(Equation 2) is performed to combine the global
information with the ItalianBERT embeddings.             The two feature vectors from the image and text
Lastly, Equation 3 projects the features to the       components were fused together by concatenation
dimensions required to fill in the reserved VGCN-     into a single vector and passed through two fully
ItalianBERT embedding slots.                          connected layers, followed by a dropout layer of
   Visual text features describing the actors of      0.5. The output of the dropout layer is then
a meme are added as the pair sentence to              concatenated together with the other extracted
ItalianBERT’s input. We cap the second sentence       features like time, engagement, manipulation,
containing the visual text features to K tokens,      and fed to the output layer. Softmax activation
overflowing tokens being dropped. Considering         function is used over the last fully connected layer
L the maximum number of input tokens, the             to compute the distribution probability over the
remainder of L − K tokens are being split             task classes. L2 regularization kernel is used on
between the text tokens associated with a meme        the two hidden layers before fusion to account
and G VGCN reserved slots. Those slots are            for large activations and to keep our output layer
kept empty to be internally filled with VGCN          sensible to the metadata encoded features.
embeddings during training. Alongside ordinary           In addition, an ensemble-based architecture
inputs required by ItalianBERT (i.e. input ids,       using our ResNet50 + VGCN-ItalianBERT model
input masks and segment ids ), we build a gcn         was also considered. First, the training dataset
ids vector similarly to input ids, by mapping each    was split into 5 sets, while preserving the class
unique input token to the corresponding index         distribution of each fold. The aforementioned
in the task vocabulary Vtask ; Vtask represents the   model was trained 5 times using 4/5 sets for
set of tokens available in the task text corpus       training, and the remainder set for validation.
and in the ItalianBERT’s vocabulary. The second       A weighted voting procedure is performed at
additional input is represented by a binary mask      prediction time, in which the weights are
vector having the value of 1 for the VGCN             represented by the average confidence score of the
reserved tokens, and 0 otherwise. During training,    voters in the class receiving the highest probability
all ItalianBERT layers with the exception of the      after softmax. Thus, we advocate for higher
last 4 encoder blocks were frozen.                    confidence scores over the number of voters in
                                                      choosing the predicted class.
3.4   Multimodal Architecture
The final solution consists of a multimodal           3.5   Experimental Setup
architecture with two main components, each           Preprocessing steps were performed to feed the
specialized on processing one informational           datasets to our architecture. The texts were
channel, namely text or image-based.    The           tokenized using the ItalianBERT tokenizer, and
then the input ids, input masks, segment ids, gcn       the ResNet50 + VGCN-ItalianBERT model, with
ids and gcn masks were computed. Images were            an .9041 macro-F1 score for the custom validation
resized to a uniform dimension (i.e., 448 x 448)        dataset used for Task 1, and .8745 and .8169
and were serialized alongside the text components       macro-F1 scores on the validation and test datasets
in a tfrecords file specific for Tensorflow [6]. An     for Task 2. However, the best result for the Task
Adam Weight decay optimizer [8] with a learning         1 test set is yielded by the ResNet152 + VGCN-
rate of 1e-5 and a weight decay rate of 0.01 were       ItalianBERT architecture, with an .8700 macro-F1
used in all conducted experiments. Furthermore,         score.
the warm up proportion was set to 0.1.                     ItalianBERT, ResNet50, and ResNet50 +
   The maximum input length was limited to L =          ItalianBERT are used as baseline models to
100 tokens and the Visual text features to K =          explore the improvements made by adding VGCN
20 tokens as the textual channel of memes is            to the textual architecture while maintaining the
represented by short sentences. Following the           same experimental setup. As expected, the model
experimental setup described in [21], we reserve        using only the textual channel (i.e. ItalianBERT
G = 16 slots to be filled with the resulted VGCN-       baseline model) is performing considerably
ItalianBERT embeddings. Moreover, only NPMI             worse than the joint architecture ResNet50 +
values larger than 0.3 are kept in the adjacency        ItalianBERT, thus arguing for the importance
matrix A, corresponding to a higher semantic            of considering images in disambiguating
correlation between words; all the other values         the textual input. The ResNet50 + VGCN-
below this threshold are set to 0.                      ItalianBERT model performs consistently better
   We empirically found 1e-5 to be a good learning      than its baseline counterpart (i.e., ResNet50
rate value, which is on par with the results of [21].   + ItalianBERT), by obtaining improvements
Lastly, we choose to train all the models for 9         of 2.92% and 3.35% macro-F1 score on the
epochs with a batch size of 8 examples.                 validation sets for Task 1 and Task 2, respectively.

3.6   Results                                           3.7    Error Analysis
                                                        Although the models performed arguably well
Table 1 contains the results obtained by
                                                        on both task, the identified misclassifications
our models for the first two tasks of the
                                                        represent a good starting point for further analysis
DANKMEMES competition. The components
                                                        and improvement. Figure 1 depicts a series of
that were frozen during the training process are
                                                        misclassified entries from both tasks.
varied for the three main conducted experiments
                                                           The short texts encountered in memes require
(i.e. combining ItalianBERT with VGCN and
                                                        in several situations prior information on the
ResNet50, ResNet152 and VGG-16, respectively)
                                                        sociopolitical context, therefore making the
to identify proper adjustments for the weights of
                                                        detection of memes an exceedingly difficult task.
the pretrained models. The best results among the
                                                        In general, a few well known and highly popular
four evaluated sets (i.e. validation, test for Task
                                                        image templates are reused, by changing or
1 and validation, test for Task 2) are obtained
                                                        partially adjusting the text to expressively convey
by either freezing only the VGCN-ItalianBERT
                                                        an idea or a view on a certain subject. However,
component or by freezing both textual and image
                                                        the used templates in the current competition are
components. The necessity of freezing the text
                                                        extensively customized and tailored specifically
branch of the architecture underlines the fact
                                                        to the political context of Italy. In addition, the
that the pretrained weights for the ItalianBERT
                                                        subjectivity of the annotators also plays a decisive
model already properly capture specific traits
                                                        role, considering that the concept of the hateful
of Italian and prove to be a viable option, even
                                                        speech tag for the second task is not well defined
when analyzing short texts such as memes.
                                                        for all situations and can be interpreted differently.
Furthermore, the last convolutional block of the
image component needs to be unfrozen because
                                                        4     Conclusion and Future Work
training an architecture on potential meme
images is a more specific task when compared to         This paper introduces our multimodal architecture
analyzing Italian text.                                 for the first two tasks of the DANKMEMES
   The best results are obtained using variations of    competition from EVALITA 2020.           Several
Table 1: Macro-F1 scores on the validation and test datasets, for both Task 1 and Task 2. Submitted
models are shown in italics.
                                    Frozen Component              Task 1              Task 2
       Neural Architecture
                                     Image      Text          Dev       Test       Dev      Test
            ItalianBERT                 -         -          0.7618    0.7546    0.8083    0.7996
              ResNet50                  -         -          0.8203    0.7899    0.5661    0.5598
      ResNet50 + ItalianBERT            -         X          0.8749    0.8499    0.8331    0.7949
  ResNet50 + VGCN-ItalianBERT           -         -          0.8666    0.8348    0.8413    0.8150
  ResNet50 + VGCN-ItalianBERT           -         X          0.9041    0.8235    0.8666    0.8169
  ResNet50 + VGCN-ItalianBERT          X          -          0.8874    0.8375    0.8493    0.7584
  ResNet50 + VGCN-ItalianBERT          X          X          0.8833    0.8499    0.8745    0.7992
 ResNet152 + VGCN-ItalianBERT           -         -          0.8458    0.8424    0.8331    0.7998
 ResNet152 + VGCN-ItalianBERT           -         X          0.8791    0.8700    0.8666    0.7994
 ResNet152 + VGCN-ItalianBERT          X          -          0.8246    0.8474    0.8310    0.8093
 ResNet152 + VGCN-ItalianBERT          X          X          0.8915    0.8273    0.8489    0.7490
  VGG-16 + VGCN-ItalianBERT             -         -          0.8124    0.7923    0.6906    0.5478
  VGG-16 + VGCN-ItalianBERT             -         X          0.8083    0.7620    0.5566    0.5469
  VGG-16 + VGCN-ItalianBERT            X          -          0.7485    0.7447    0.6414    0.5263
  VGG-16 + VGCN-ItalianBERT            X          X          0.7621    0.7248    0.6003    0.5388
       Ensemble Architecture            -         -          0.8916    0.8437    0.7874    0.7692
       Competition Baselines            -         -             -      0.5198       -      0.5621


                     Figure 1: Examples of misclassified samples for both tasks.


joint text - Vocabulary Graph Convolutional         in different languages.
Network alongside an Italian BERT model
- and image-based architectures - ResNet50,         References
ResNet152, VGG-16 - were experimented. The           [1] Rich Caruana. “Multitask learning”. In:
consideration of meme meta-information, such             Machine learning 28.1 (1997), pp. 41–75.
as cyclic temporal characteristics and post
                                                     [2] Gerlof Bouma. “Normalized (pointwise)
engagement, boosted even further our F1-scores
                                                         mutual     information   in    collocation
when compared to the competition baseline.
                                                         extraction”. In: Proceedings of GSCL
   In terms of future work, we intend to                 (2009), pp. 31–40.
experiment with other visual architectures,
                                                     [3] Jia Deng et al. “Imagenet: A large-scale
including VGG-19 [5] and EfficientNet [17], and
                                                         hierarchical image database”. In: 2009
also with multilingual neural networks, such as
                                                         IEEE Conference on Computer Vision and
mBERT [14] and XLM-RoBERTa [11], that will
empower transfer learning across meme datasets
       Pattern Recognition. Ieee. 2009, pp. 248–               towards Automatic Moderation”. In: arXiv
       255.                                                    preprint arXiv:1910.02334 (2019).
 [4]   Yoon Kim. “Convolutional neural networks         [16]   Shivangi Singhal et al. “SpotFake:
       for sentence classification”. In: arXiv                 A Multi-modal Framework for Fake
       preprint arXiv:1408.5882 (2014).                        News Detection”. In: 2019 IEEE Fifth
 [5]   Karen Simonyan and Andrew Zisserman.                    International Conference on Multimedia
       “Very deep convolutional networks for                   Big Data (BigMM). IEEE. 2019, pp. 39–47.
       large-scale image recognition”. In: arXiv        [17]   Mingxing Tan and Quoc V Le.
       preprint arXiv:1409.1556 (2014).                        “Efficientnet: Rethinking model scaling for
 [6]   Martin Abadi et al. “Tensorflow: Large-                 convolutional neural networks”. In: arXiv
       scale machine learning on heterogeneous                 preprint arXiv:1905.11946 (2019).
       distributed systems”. In: arXiv preprint         [18]   Liang Yao, Chengsheng Mao, and Yuan
       arXiv:1603.04467 (2016).                                Luo. “Graph convolutional networks for
 [7]   Kaiming He et al. “Deep residual learning               text classification”. In: Proceedings of the
       for image recognition”. In: Proceedings of              AAAI Conference on Artificial Intelligence.
       the IEEE Conference on Computer Vision                  Vol. 33. 2019, pp. 7370–7377.
       and Pattern Recognition. 2016, pp. 770–          [19]   Valerio Basile et al. “EVALITA 2020:
       778.                                                    Overview of the 7th Evaluation Campaign
 [8]   Ilya Loshchilov and Frank Hutter.                       of Natural Language Processing and Speech
       “Decoupled weight decay regularization”.                Tools for Italian”. In: Proceedings of
       In: arXiv preprint arXiv:1711.05101                     Seventh Evaluation Campaign of Natural
       (2017).                                                 Language Processing and Speech Tools
 [9]   Qiong Cao et al. “Vggface2: A dataset for               for Italian. Final Workshop (EVALITA
       recognising faces across pose and age”. In:             2020). Ed. by Valerio Basile et al. Online:
       2018 13th IEEE International Conference                 CEUR.org, 2020.
       on Automatic Face & Gesture Recognition          [20]   Douwe Kiela et al. “The Hateful Memes
       (FG 2018). IEEE. 2018, pp. 67–74.                       Challenge: Detecting Hate Speech in
[10]   Jacob Devlin et al. “Bert: Pre-training                 Multimodal Memes”. In: arXiv preprint
       of deep bidirectional transformers for                  arXiv:2005.04790 (2020).
       language understanding”. In: arXiv preprint      [21]   Zhibin Lu, Pan Du, and Jian-Yun Nie.
       arXiv:1810.04805 (2018).                                “VGCN-BERT: Augmenting BERT with
[11]   Alexis Conneau et al. “Unsupervised cross-              Graph Embedding for Text Classification”.
       lingual representation learning at scale”. In:          In: European Conference on Information
       arXiv preprint arXiv:1911.02116 (2019).                 Retrieval. Springer. 2020, pp. 369–382.
[12]   Dhruv Khattar et al. “Mvae: Multimodal           [22]   Martina Miliani et al. “DANKMEMES
       variational autoencoder for fake news                   @ EVALITA2020: The memeing of life:
       detection”. In: The World Wide Web                      memes, multimodality and politics”.
       Conference. 2019, pp. 2915–2921.                        In: Proceedings of Seventh Evaluation
                                                               Campaign of Natural Language Processing
[13]   Jiasen Lu et al. “Vilbert: Pretraining task-
                                                               and Speech Tools for Italian. Final
       agnostic visiolinguistic representations for
                                                               Workshop (EVALITA 2020). Ed. by Valerio
       vision-and-language tasks”. In: Advances
                                                               Basile et al. Online: CEUR.org, 2020.
       in Neural Information Processing Systems.
       2019, pp. 13–23.                                 [23]   Priyanshi Shah and Ziad Kobti.
                                                               “Multimodal fake news detection using a
[14]   Telmo Pires, Eva Schlinger, and
                                                               Cultural Algorithm with situational and
       Dan Garrette. “How multilingual is
                                                               normative knowledge”. In: 2020 IEEE
       Multilingual BERT?” In: arXiv preprint
                                                               Congress on Evolutionary Computation
       arXiv:1906.01502 (2019).
                                                               (CEC). IEEE. 2020, pp. 1–7.
[15]   Benet Oriol Sabat, Cristian Canton Ferrer,
       and Xavier Giro-i-Nieto. “Hate Speech
       in Pixels: Detection of Offensive Memes