UPB @ DANKMEMES: Italian Memes Analysis - Employing Visual Models and Graph Convolutional Networks for Meme Identification and Hate Speech Detection

George-Alexandru Vlad∗, George-Eduard Zaharia∗, Dumitru-Clementin Cercel, Mihai Dascalu
University Politehnica of Bucharest, Faculty of Automatic Control and Computers
{george.vlad0108, george.zaharia0806}@stud.acs.upb.ro
{dumitru.cercel, mihai.dascalu}@upb.ro

∗ These authors contributed equally.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Certain events or political situations prompt users in the online environment to express themselves through different modalities. One of them is represented by Internet memes, which combine text with a representative image to convey a wide range of emotions, from humor to sarcasm and even hate. In this paper, we describe our approach for the DANKMEMES competition from EVALITA 2020, consisting of a multimodal multi-task learning architecture based on two main components. The first is a Graph Convolutional Network combined with an Italian BERT for text encoding, while the second is varied between different image-based architectures (i.e., ResNet50, ResNet152, and VGG-16) for image representation. Our solution achieves good performance on the first two tasks of the competition, ranking 3rd on both Task 1 (.8437 macro-F1 score) and Task 2 (.8169 macro-F1 score), while exceeding the official baselines by large margins.

1 Introduction

During the past two decades, the Internet evolved massively and the social web became a hub where people share their opinions, cooperate to solve issues, or simply discuss various topics. There are many ways in which users can express themselves: plain text, videos, or images. The latter option became widely used due to its convenience; however, images are frequently accompanied by a short text description to better convey information. As the Internet and online social interactions evolved, certain image templates emerged and gained global popularity, contributing to a de facto standardization of joint text-image usage and thus leading to the creation of memes. Memes can be humorous, satirical, offensive, or hateful, therefore encapsulating a wide range of emotions and beliefs. Properly distinguishing memes from non-memes, and then analyzing them to detect users' intentions, is becoming a pressing task in online marketing campaigns that target the automated identification of opinions pertaining to certain groups of users.

The DANKMEMES competition [22] from EVALITA 2020 [19] challenged participants to approach the previously mentioned issues by creating systems that identify and analyze Internet memes in Italian. The competition consists of three tasks, out of which we tackled two. Task 1 - Meme Detection considers the identification of memes in a collection of images, such that a clear distinction can be made between memes and ordinary images. Afterwards, Task 2 - Hate Speech Identification targets the classification of images in terms of their purpose, by analyzing content and identifying whether images are hateful or not.

2 Related Work

2.1 Multimodal Fake News Detection

Singhal et al. [16] employed multimodal techniques for fake news detection. The authors introduced SpotFake, an architecture divided into three sub-parts: one for identifying textual features using Bidirectional Encoder Representations from Transformers (BERT) [10], a second for visual analysis based on VGG-19 [5], while the third combines the previously mentioned elements into a single feature vector.
Similarly, Shah and Kobti [23] performed multimodal fake news detection by using two separate channels, visual and textual, both aiming to extract relevant features. Moreover, they included a Cultural Algorithm that introduces another dimension by employing situational knowledge, i.e., information about the depicted event as seen by a specific individual. Another approach to fake news detection was introduced by Khattar et al. [12], who created MVAE, a multimodal autoencoder including encoders (both visual and textual), decoders, and a detection module for classifying the inputs.

2.2 Multimodal Hate Speech Identification

Kiela et al. [20] created a new dataset specifically designed for identifying hateful speech in memes. At the same time, the authors introduced a series of baselines for further comparison, including ResNet-152 [7] and ViLBERT [13] for the visual channel, and BERT for the textual counterpart. Furthermore, Sabat et al. [15] tackled the problem of hate speech identification in memes by also employing a multimodal system. However, they used an Optical Character Recognition system for extracting the textual component from the inputs, alongside visual features from a VGG-16 component and the text encoded with BERT.
3 Method

Our approach for both tasks relies on a multi-task learning technique [1], and our architecture consists of two main neural network components, one for the text input and one for the image input. We combine the outputs of these two components and use the learned features for determining the required class, for either Task 1 or Task 2.

3.1 Corpus

The dataset for the meme detection task is split into two parts, train and test. The training dataset contains 1,600 image entries, together with a CSV file containing other useful metadata, such as: the engagement (i.e., number of comments and likes), the date, and the manipulation flag (i.e., a binary code denoting the low/high level of image modifications), alongside a transcript of the text present in the image. We kept 85% of the entries for training, while 15% are used for validation; the same class distribution is kept in both partitions. The test dataset for the first task contains 400 entries with a corresponding CSV file of similar structure. The second task offers a dataset containing 800 entries, which was partitioned in a similar manner.

3.2 Image Component

Several image-based neural networks were considered for the first component of our final architecture. First, we used VGG-16, which consists of five stacks of Convolutional Neural Networks [4] accompanied by max-pooling layers. Weights pretrained on the ImageNet dataset [3] were afterwards fine-tuned. Second, we also experimented with ResNet in two variants, ResNet50 and ResNet152. ResNet introduced the concept of skip connections as a solution to the vanishing gradient problem; as such, the networks could be further scaled in depth, enabling more abstract high-level features to be extracted from the input images. Similarly to the VGG-16 architecture, weights pretrained on ImageNet were fine-tuned for ResNet152, whereas weights pretrained on VGGFace2 [9] were used for ResNet50.
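The skip connections mentioned above can be illustrated with a minimal NumPy sketch (illustrative only; the actual ResNet blocks use convolutions, batch normalization, and pretrained weights):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Minimal residual block: the untouched input is added back to the
    transformed signal, so gradients can always flow through the identity path."""
    out = relu(x @ w1)    # first transformation
    out = out @ w2        # second transformation (activation applied after the add)
    return relu(out + x)  # skip connection

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w1, w2)

# With zero weights the block degenerates to relu(x): the identity path survives,
# which is what prevents vanishing gradients in very deep stacks.
identity = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
assert np.allclose(identity, relu(x))
```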
The ing the low/high level of image modifications), edge weight between two nodes i, j, denoted as alongside a transcript of the text present in the im- Ai,j , is initialized with the normalized point-wise age. We kept 85% of the entries for training, while mutual information (NPMI) value [2] between the 15% are used for validation; the same class distri- two vocabulary tokens i, j. The mechanism of bution is kept in both partition. The test dataset 1 https://github.com/dbmdz/berts# for the first task contains 400 entries with a cor- italian-bert the VGCN layer is formally summarized by the dates are segmented and encoded by using following equations: complementary sine and cosine functions to preserve the cyclic characteristics of days (in a Hv,h = Dropout(A ev,v Wv,h ) (1) month) and months. Equation 4 describes the time cyclical encoding procedure, where n represents Hd,h = ReLU (Xd,v Hv,h ) (2) the day value subtracted by 1 and divided Hd,g = Hd,h Wh,g (3) by the number of days in the corresponding month. The same operations are applied for where terms Wv,h and Wh,g represent the weights the months encoding over the month index, but of the two GCN internal layers, with v the the denominator is 12 in this case. Additional vocabulary dimension, h and g the output feature metadata (i.e., manipulation and engagement) was dimensions. In Equation 1, we add the global also encoded and used in the final prediction. context by multiplying the normalized adjacency Values representing the year and engagement were matrix A e with the weight matrix of the first normalized to ensure the model’s stability during GCN layer. We use the normalized adjacency training. matrix A e = D−1/2 AD−1/2 to ensure numerical stability. A convolution between the input vector θ =2∗π∗n (4) Xd,v and the result from the previous operation timesin = sin(θ); timecos = cos(θ) (Equation 2) is performed to combine the global information with the ItalianBERT embeddings. 
Visual text features describing the actors of a meme are added as the pair sentence to ItalianBERT's input. We cap this second sentence containing the visual text features at K tokens, with overflowing tokens dropped. Considering L the maximum number of input tokens, the remaining L - K tokens are split between the text tokens associated with a meme and G reserved VGCN slots. Those slots are kept empty, to be internally filled with the VGCN embeddings during training. Alongside the ordinary inputs required by ItalianBERT (i.e., input ids, input masks, and segment ids), we build a gcn ids vector similarly to input ids, by mapping each unique input token to its corresponding index in the task vocabulary V_task; V_task represents the set of tokens available both in the task text corpus and in ItalianBERT's vocabulary. The second additional input is a binary mask vector (gcn mask) having the value 1 for the VGCN reserved tokens and 0 otherwise. During training, all ItalianBERT layers except the last 4 encoder blocks were frozen.
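The construction of the two additional input vectors can be sketched as below; the helper name, the padding convention, and the placement of the reserved slots at the sequence tail are illustrative assumptions, not the exact implementation:

```python
def build_gcn_inputs(tokens, task_vocab, max_len, num_gcn_slots):
    """Sketch: map each token to its index in the task vocabulary (gcn ids)
    and mark the G reserved VGCN slots with a binary mask (gcn mask)."""
    # tokens outside V_task fall back to index 0 (assumed padding/unknown index)
    ids = [task_vocab.get(tok, 0) for tok in tokens][: max_len - num_gcn_slots]
    ids += [0] * (max_len - num_gcn_slots - len(ids))  # pad the text region
    gcn_ids = ids + [0] * num_gcn_slots                # slots left empty for VGCN
    gcn_mask = [0] * (max_len - num_gcn_slots) + [1] * num_gcn_slots
    return gcn_ids, gcn_mask

# hypothetical task vocabulary and tokens, for illustration only
vocab = {"governo": 3, "meme": 7}
gcn_ids, gcn_mask = build_gcn_inputs(["governo", "meme"], vocab,
                                     max_len=10, num_gcn_slots=4)
assert len(gcn_ids) == len(gcn_mask) == 10
assert gcn_ids[:2] == [3, 7] and sum(gcn_mask) == 4
```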
3.4 Multimodal Architecture

The final solution consists of a multimodal architecture with two main components, each specialized in processing one informational channel, namely text or images. The dates are segmented and encoded using complementary sine and cosine functions to preserve the cyclic characteristics of days (within a month) and of months. Equation 4 describes the cyclical time encoding procedure, where n represents the day value minus 1, divided by the number of days in the corresponding month; the same operations are applied for encoding the months over the month index, but with a denominator of 12. Additional metadata (i.e., manipulation and engagement) was also encoded and used in the final prediction. Values representing the year and the engagement were normalized to ensure the model's stability during training.

θ = 2πn;  time_sin = sin(θ), time_cos = cos(θ)    (4)

The two feature vectors from the image and text components were fused by concatenation into a single vector and passed through two fully connected layers, followed by a dropout layer with rate 0.5. The output of the dropout layer is then concatenated with the other extracted features (i.e., time, engagement, and manipulation) and fed to the output layer. A softmax activation function is applied over the last fully connected layer to compute the probability distribution over the task classes. L2 kernel regularization is applied to the two hidden layers before fusion to account for large activations and to keep the output layer sensitive to the metadata-encoded features.

In addition, an ensemble-based architecture using our ResNet50 + VGCN-ItalianBERT model was also considered. First, the training dataset was split into 5 folds, while preserving the class distribution in each fold. The aforementioned model was trained 5 times, using 4 folds for training and the remaining fold for validation. A weighted voting procedure is performed at prediction time, in which the weights are represented by the average confidence score of the voters for the class receiving the highest probability after softmax. Thus, we favor higher confidence scores over the sheer number of voters when choosing the predicted class.
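The cyclical date encoding of Equation 4 can be sketched as follows (day and month values are assumed to be 1-based, hence the subtraction of 1 before normalization):

```python
import math

def cyclical_encode(value, period):
    """Encode a cyclic quantity (day-of-month or month) as a (sin, cos) pair.
    `value` is 0-based (e.g., day - 1) and `period` is the cycle length
    (the number of days in that month, or 12 for months)."""
    theta = 2 * math.pi * value / period
    return math.sin(theta), math.cos(theta)

# Day 1 and the last day of a 31-day month land next to each other on the
# unit circle, preserving the cyclic neighborhood that a raw integer loses.
first = cyclical_encode(0, 31)   # day 1  -> (sin, cos) = (0.0, 1.0)
last = cyclical_encode(30, 31)   # day 31 -> close to day 1 on the circle
assert first == (0.0, 1.0)
assert abs(first[0] - last[0]) < 0.3 and abs(first[1] - last[1]) < 0.3
```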
3.5 Experimental Setup

Preprocessing steps were performed to feed the datasets to our architecture. The texts were tokenized using the ItalianBERT tokenizer, and then the input ids, input masks, segment ids, gcn ids, and gcn masks were computed. Images were resized to a uniform dimension (i.e., 448 x 448) and were serialized alongside the text components in a tfrecords file specific to TensorFlow [6]. An Adam optimizer with decoupled weight decay [8], with a learning rate of 1e-5 and a weight decay rate of 0.01, was used in all conducted experiments. Furthermore, the warm-up proportion was set to 0.1.

The maximum input length was limited to L = 100 tokens and the visual text features to K = 20 tokens, as the textual channel of memes consists of short sentences. Following the experimental setup described in [21], we reserve G = 16 slots to be filled with the resulting VGCN-ItalianBERT embeddings. Moreover, only NPMI values larger than 0.3 are kept in the adjacency matrix A, corresponding to a higher semantic correlation between words; all values below this threshold are set to 0.

We empirically found 1e-5 to be a good learning rate, which is on par with the results of [21]. Lastly, we trained all models for 9 epochs with a batch size of 8 examples.
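The NPMI edge weights (Section 3.3) together with the 0.3 threshold above can be sketched as follows; the probability inputs are toy values and the token names are hypothetical:

```python
import math

def npmi(p_i, p_j, p_ij):
    """Normalized pointwise mutual information, bounded in [-1, 1]."""
    if p_ij == 0:
        return -1.0
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

def build_adjacency(pair_probs, token_probs, threshold=0.3):
    """Keep only edges whose NPMI exceeds the threshold (0.3 in our setup);
    all other entries are set to 0."""
    adj = {}
    for (i, j), p_ij in pair_probs.items():
        w = npmi(token_probs[i], token_probs[j], p_ij)
        adj[(i, j)] = w if w > threshold else 0.0
    return adj

# Toy corpus statistics (hypothetical tokens): strongly co-occurring words
# receive a positive edge; independent words are pruned from the graph.
token_probs = {"governo": 0.2, "salvini": 0.2, "gatto": 0.2}
pair_probs = {("governo", "salvini"): 0.15, ("governo", "gatto"): 0.04}
adj = build_adjacency(pair_probs, token_probs)
assert adj[("governo", "salvini")] > 0.3   # kept: high semantic correlation
assert adj[("governo", "gatto")] == 0.0    # pruned: NPMI below threshold
```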
3.6 Results

Table 1 contains the results obtained by our models for the first two tasks of the DANKMEMES competition. The components frozen during the training process are varied for the three main conducted experiments (i.e., combining VGCN-ItalianBERT with ResNet50, ResNet152, and VGG-16, respectively) to identify proper adjustments for the weights of the pretrained models. The best results among the four evaluated sets (i.e., validation and test for Task 1, and validation and test for Task 2) are obtained by either freezing only the VGCN-ItalianBERT component or by freezing both the textual and image components. The benefit of freezing the text branch of the architecture underlines that the pretrained ItalianBERT weights already capture specific traits of Italian and prove to be a viable option, even when analyzing short texts such as memes. Conversely, the last convolutional block of the image component needs to be unfrozen, because training on potential meme images is a more specific task than analyzing Italian text.

The best results are obtained using variations of the ResNet50 + VGCN-ItalianBERT model, with a .9041 macro-F1 score on the custom validation dataset used for Task 1, and .8745 and .8169 macro-F1 scores on the validation and test datasets for Task 2. However, the best result on the Task 1 test set is yielded by the ResNet152 + VGCN-ItalianBERT architecture, with a .8700 macro-F1 score.

ItalianBERT, ResNet50, and ResNet50 + ItalianBERT are used as baseline models to explore the improvements brought by adding the VGCN to the textual architecture while maintaining the same experimental setup. As expected, the model using only the textual channel (i.e., the ItalianBERT baseline) performs considerably worse than the joint ResNet50 + ItalianBERT architecture, arguing for the importance of images in disambiguating the textual input. The ResNet50 + VGCN-ItalianBERT model performs consistently better than its baseline counterpart (i.e., ResNet50 + ItalianBERT), obtaining improvements of 2.92% and 3.35% macro-F1 score on the validation sets for Task 1 and Task 2, respectively.
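The macro-F1 metric reported throughout averages per-class F1 scores with equal weight, so both the meme and non-meme (or hateful and non-hateful) classes count equally regardless of class imbalance. A self-contained sketch (the official evaluation may rely on a library implementation):

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Macro-averaged F1: compute F1 per class, then average with equal weight."""
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)

assert macro_f1([0, 0, 1, 1], [0, 0, 1, 1]) == 1.0
# one false positive for class 1: class F1s are 2/3 and 4/5, macro = 11/15
assert abs(macro_f1([0, 1, 0, 1], [0, 1, 1, 1]) - 11 / 15) < 1e-9
```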
Table 1: Macro-F1 scores on the validation and test datasets, for both Task 1 and Task 2. Submitted models are shown in italics.

Neural Architecture            | Frozen Image | Frozen Text | Task 1 Dev | Task 1 Test | Task 2 Dev | Task 2 Test
ItalianBERT                    | -            | -           | 0.7618     | 0.7546      | 0.8083     | 0.7996
ResNet50                       | -            | -           | 0.8203     | 0.7899      | 0.5661     | 0.5598
ResNet50 + ItalianBERT         | -            | X           | 0.8749     | 0.8499      | 0.8331     | 0.7949
ResNet50 + VGCN-ItalianBERT    | -            | -           | 0.8666     | 0.8348      | 0.8413     | 0.8150
ResNet50 + VGCN-ItalianBERT    | -            | X           | 0.9041     | 0.8235      | 0.8666     | 0.8169
ResNet50 + VGCN-ItalianBERT    | X            | -           | 0.8874     | 0.8375      | 0.8493     | 0.7584
ResNet50 + VGCN-ItalianBERT    | X            | X           | 0.8833     | 0.8499      | 0.8745     | 0.7992
ResNet152 + VGCN-ItalianBERT   | -            | -           | 0.8458     | 0.8424      | 0.8331     | 0.7998
ResNet152 + VGCN-ItalianBERT   | -            | X           | 0.8791     | 0.8700      | 0.8666     | 0.7994
ResNet152 + VGCN-ItalianBERT   | X            | -           | 0.8246     | 0.8474      | 0.8310     | 0.8093
ResNet152 + VGCN-ItalianBERT   | X            | X           | 0.8915     | 0.8273      | 0.8489     | 0.7490
VGG-16 + VGCN-ItalianBERT      | -            | -           | 0.8124     | 0.7923      | 0.6906     | 0.5478
VGG-16 + VGCN-ItalianBERT      | -            | X           | 0.8083     | 0.7620      | 0.5566     | 0.5469
VGG-16 + VGCN-ItalianBERT      | X            | -           | 0.7485     | 0.7447      | 0.6414     | 0.5263
VGG-16 + VGCN-ItalianBERT      | X            | X           | 0.7621     | 0.7248      | 0.6003     | 0.5388
Ensemble Architecture          | -            | -           | 0.8916     | 0.8437      | 0.7874     | 0.7692
Competition Baselines          | -            | -           | -          | 0.5198      | -          | 0.5621

3.7 Error Analysis

Although the models performed arguably well on both tasks, the identified misclassifications represent a good starting point for further analysis and improvement. Figure 1 depicts a series of misclassified entries from both tasks.

The short texts encountered in memes often require prior information on the sociopolitical context, making meme detection an exceedingly difficult task. In general, a few well-known and highly popular image templates are reused by changing or partially adjusting the text to expressively convey an idea or a view on a certain subject. However, the templates used in the current competition are extensively customized and tailored specifically to the Italian political context. In addition, the subjectivity of the annotators also plays a decisive role, considering that the concept of the hateful speech tag for the second task is not well defined for all situations and can be interpreted differently.

Figure 1: Examples of misclassified samples for both tasks.

4 Conclusion and Future Work

This paper introduced our multimodal architecture for the first two tasks of the DANKMEMES competition from EVALITA 2020. Several joint text-based (a Vocabulary Graph Convolutional Network alongside an Italian BERT model) and image-based (ResNet50, ResNet152, VGG-16) architectures were experimented with. The consideration of meme meta-information, such as cyclic temporal characteristics and post engagement, boosted our F1-scores even further when compared to the competition baselines.

In terms of future work, we intend to experiment with other visual architectures, including VGG-19 [5] and EfficientNet [17], as well as with multilingual neural networks, such as mBERT [14] and XLM-RoBERTa [11], which would empower transfer learning across meme datasets in different languages.

References

[1] Rich Caruana. "Multitask learning". In: Machine Learning 28.1 (1997), pp. 41-75.
[2] Gerlof Bouma. "Normalized (pointwise) mutual information in collocation extraction". In: Proceedings of GSCL (2009), pp. 31-40.
[3] Jia Deng et al. "ImageNet: A large-scale hierarchical image database". In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248-255.
[4] Yoon Kim. "Convolutional neural networks for sentence classification". In: arXiv preprint arXiv:1408.5882 (2014).
[5] Karen Simonyan and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).
[6] Martin Abadi et al. "TensorFlow: Large-scale machine learning on heterogeneous distributed systems". In: arXiv preprint arXiv:1603.04467 (2016).
[7] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770-778.
[8] Ilya Loshchilov and Frank Hutter. "Decoupled weight decay regularization". In: arXiv preprint arXiv:1711.05101 (2017).
[9] Qiong Cao et al. "VGGFace2: A dataset for recognising faces across pose and age". In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 67-74.
[10] Jacob Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: arXiv preprint arXiv:1810.04805 (2018).
[11] Alexis Conneau et al. "Unsupervised cross-lingual representation learning at scale". In: arXiv preprint arXiv:1911.02116 (2019).
[12] Dhruv Khattar et al. "MVAE: Multimodal variational autoencoder for fake news detection". In: The World Wide Web Conference. 2019, pp. 2915-2921.
[13] Jiasen Lu et al. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In: Advances in Neural Information Processing Systems. 2019, pp. 13-23.
[14] Telmo Pires, Eva Schlinger, and Dan Garrette. "How multilingual is Multilingual BERT?" In: arXiv preprint arXiv:1906.01502 (2019).
[15] Benet Oriol Sabat, Cristian Canton Ferrer, and Xavier Giro-i-Nieto. "Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation". In: arXiv preprint arXiv:1910.02334 (2019).
[16] Shivangi Singhal et al. "SpotFake: A Multi-modal Framework for Fake News Detection". In: 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). IEEE, 2019, pp. 39-47.
[17] Mingxing Tan and Quoc V. Le. "EfficientNet: Rethinking model scaling for convolutional neural networks". In: arXiv preprint arXiv:1905.11946 (2019).
[18] Liang Yao, Chengsheng Mao, and Yuan Luo. "Graph convolutional networks for text classification". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 7370-7377.
[19] Valerio Basile et al. "EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian". In: Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). Ed. by Valerio Basile et al. Online: CEUR.org, 2020.
[20] Douwe Kiela et al. "The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes". In: arXiv preprint arXiv:2005.04790 (2020).
[21] Zhibin Lu, Pan Du, and Jian-Yun Nie. "VGCN-BERT: Augmenting BERT with Graph Embedding for Text Classification". In: European Conference on Information Retrieval. Springer, 2020, pp. 369-382.
[22] Martina Miliani et al. "DANKMEMES @ EVALITA2020: The memeing of life: memes, multimodality and politics". In: Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). Ed. by Valerio Basile et al. Online: CEUR.org, 2020.
[23] Priyanshi Shah and Ziad Kobti. "Multimodal fake news detection using a Cultural Algorithm with situational and normative knowledge". In: 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2020, pp. 1-7.