ArchiMeDe @ DANKMEMES: A New Model Architecture for Meme Detection

Jinen Setpal
RN Podar School
Mumbai, India
jinens8@gmail.com
jinen.setpal@rnpodarschool.com

Gabriele Sarti
Department of Mathematics and Geosciences
University of Trieste & SISSA
Trieste, Italy
gsarti@sissa.it

Abstract

We introduce ArchiMeDe, a multimodal neural network-based architecture used to solve the DANKMEMES meme detection subtask at the 2020 EVALITA campaign. The system incorporates information from visual and textual sources through a multimodal neural ensemble to predict whether input images and their respective metadata are memes or not. Each pre-trained neural network in the ensemble is first fine-tuned individually on the training dataset to perform domain adaptation. Learned text and visual representations are then concatenated to obtain a single multimodal embedding, and the final prediction is performed through majority voting by all networks in the ensemble.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In recent years, the democratization of data collection procedures through web scraping and crowdsourcing has led to the broad availability of public datasets spanning modalities like language and vision. Contemporary state-of-the-art machine learning models can leverage those resources to achieve highly accurate and often superhuman performance using millions or even billions of parameters (Brown et al., 2020), but they rely heavily on an abundance of computational resources. Consequently, training such architectures is often out of reach for smaller research centers, let alone individual users. To counter this tendency, the availability of pre-trained open-source models has dramatically reduced the computational threshold required to obtain state-of-the-art results in multiple language and vision tasks (Devlin et al., 2019; He et al., 2016). Pre-trained systems are often leveraged in a two-step framework: first, they undergo unsupervised or semi-supervised pre-training to learn general knowledge representations; then, they are fine-tuned in a supervised way to adapt their parameters to downstream tasks. This transfer learning approach stems from the computer vision literature (He et al., 2019) but has recently been adopted for natural language processing tasks with positive results (Howard and Ruder, 2018; Devlin et al., 2019; Liu et al., 2019).

In this paper, we present ArchiMeDe, a multimodal system leveraging pre-trained language and vision models to compete in the DANKMEMES (Miliani et al., 2020) shared task at the EVALITA 2020 campaign (Basile et al., 2020). Following recent transfer learning approaches, our system leverages pre-trained visual and word embeddings in a multimodal setup, obtaining strong results on the meme detection subtask. Specifically, we participated in the first subtask of DANKMEMES, aimed at discriminating memes from standard images containing actors from the Italian political scene. Task organizers extracted a total of 1600 training images from the Instagram platform, and the data available for each dataset entry (text, actors and user engagement, among others) were leveraged to train an ensemble of multimodal models performing meme detection through majority vote. The following sections present our approach in detail, first showing our preliminary evaluation of multiple modeling approaches and then focusing on the final system's main modules and the features we leverage from the dataset. Finally, results are presented, and we conclude by discussing the problems we faced with some inconsistencies in the data. Our code is made available at https://github.com/jinensetpal/ArchiMeDe

2 System Description

ArchiMeDe is composed of a multimodal learning ensemble, with the final output being the result of a majority vote. Figure 1 visualizes our approach. First, the transcript associated with each image is fed to UmBERTo (Francia et al., 2020), a neural language model (NLM) pre-trained on the Italian language, to produce sentence embeddings. Then, we leverage three popular pre-trained vision architectures, namely ResNet (He et al., 2016), DenseNet (Huang et al., 2017a) and AlexNet (Krizhevsky et al., 2017), to produce three independent image embeddings for each input image. These embeddings can be considered as different views over an image that may provide us with complementary information about its content. Each image embedding is then concatenated with the sentence embedding and the raw image metadata and fed as input to an 8-layer feed-forward neural network to predict an image's meme status. The feed-forward network also includes a single dropout layer to prevent overfitting and improve generalization. Lastly, the three predictions are combined through majority voting to obtain the final prediction of the ensemble. Simpler strategies using a single vision model to produce image embeddings were initially envisaged as candidates for our submission but were dismissed in light of the promising performance of the ArchiMeDe ensembling approach. We discuss those alternatives in Section 4.

[Figure 1 diagram: text, images and metadata (visual actors, date, etc.) from the DankMemes dataset; UmBERTo sentence embeddings are concatenated with ResNet, AlexNet and DenseNet image embeddings and raw metadata, passed through dense layers, and combined by majority vote to answer "Is this a meme?"]

Figure 1: The ArchiMeDe system architecture. Sentence embeddings produced by the UmBERTo NLM are concatenated to metadata and image embeddings produced by three popular pre-trained vision models. The three resulting multimodal embeddings are fed separately to feed-forward networks, and the final outcome is selected through majority voting.

The remaining part of this section contains an in-depth description of our ensemble's components, focusing on the input features that were used and how those were preprocessed to best suit learning. We also include transfer learning specifications with some details about their impact on the overall system accuracy.

2.1 Metadata

Engagement. User engagement per post is expressed as an integer value. We scale and standardize engagement values to obtain a distribution centered at 0 with σ = 1. This procedure is standard practice to avoid passing extreme absolute values as inputs to the neural network.

Date. We decided to leverage temporal information in our system, building upon the intuition that memes often rely on a small set of templates whose popularity varies significantly through time. Temporal information may thus provide our system with additional cues about an image's meme status in a specific time frame. In the training dataset, the date of each post is given in the yyyy-mm-dd format. This date is compared with a predetermined reference date, 1 January 2015, to derive the number of days elapsed since the reference. Min-max scaling is then applied to these day counts, yielding float values in the range [0,1] that are fed into each model.

Manipulation. The manipulation field provides boolean information about whether an image has been manipulated before being added to the dataset. We found this information noisy and a weak predictor of meme status; therefore, it was dropped as input.

Visual Actors. Each entry is additionally provided with a list of names of the visual actors present in the frame. In the specific case of the DANKMEMES shared task, visual actors can be especially useful to identify meme images. For example, we can hypothesize that politicians who maintain a strong public presence by making claims that produce a high level of public engagement are more likely to be the subject of meme images. Moreover, some combinations of actors may be particularly likely for memes, e.g. politicians belonging to parties at opposite ends of the political compass. In order to produce a unified representation of visual actors for our system, we perform a one-hot-style binary encoding over all the actors occurring in the training set: if a specific politician is present in an image, the corresponding field is true; otherwise, it is set to false. Actors that were not present in the training set are disregarded during evaluation: while this step is required given the context, we assume that it may significantly impact the outcome for images in which new actors were introduced.

2.2 Textual input

The analysis of textual content in meme images is critical to the success of the overall system. Indeed, ironic or satirical comments may deeply affect the users' interpretation of an image that would otherwise be classified as normal. We note that this problem cannot be approached like standard text analysis, since memes are expressed in short, concise phrases that do not necessarily comply with standard grammatical rules. They also tend to contain slang and vernacular expressions, which, albeit conveying the intended meaning to the reader, greatly increase the need for high model capacity and ad-hoc training data. For this reason, we selected UmBERTo (Francia et al., 2020), a RoBERTa-based (Liu et al., 2019) neural language model pre-trained on Italian texts extracted from the OSCAR corpus (Ortiz Suárez et al., 2020), for producing text representations (specifically, umberto-commoncrawl-cased-v1 from the HuggingFace model hub; Wolf et al., 2019). In a recent study by Miaschi et al. (2020), the model was highlighted as one of the top Italian NLMs for encoding linguistic information about social media excerpts taken from the TWITTIRÒ and PoSTWITA Twitter corpora (Cignarella et al., 2019; Sanguinetti et al., 2018). UmBERTo has a high model capacity with 125M trainable parameters and was trained on data crawled from the web, making it suitable for processing meme language.

SentenceTransformers. We use the SentenceTransformers framework (Reimers and Gurevych, 2019) to produce sentence embeddings by averaging all word embeddings produced by the original UmBERTo model, since Miaschi and Dell'Orletta (2020) showed that averaged embeddings are usually much more informative than the default [CLS] sentence embedding. We fine-tune the representations over the available meme textual data and use them as components of our end-to-end system.

2.3 Visual input

So far we have discussed metadata and textual inputs, but it is essential to address the core of a meme: the image itself. We can intuitively distinguish a meme from a standard image through the aforementioned broken sentence structure, meme templates, and quick and messy edits, among other aspects. As previously mentioned, memes can be very difficult to identify when they look like standard images but gain meme status through real-world knowledge grounding.

Due to the inherently large variance in the styles and contents of meme images, it is impractical to expect a single framework to effectively describe each distinguishable feature and use it to classify an entry. Hence, we split the representational burden across multiple pre-trained model architectures. Each of them uses a fundamentally different approach to extract image embeddings, making the resulting ensemble predictions more flexible in general settings. The three networks we used for producing image embeddings are:

ResNet. Residual Networks, or ResNets (He et al., 2016), learn residual functions in relation to layer inputs. If H(x) is the standard underlying target mapping, ResNet layers are instead trained to fit another mapping F(x) = H(x) − x. The original mapping is thus recast as F(x) + x. This approach makes the optimization process easier, allowing for deeper architectures. The default vector representation provided by task organizers is produced by a ResNet-50, a 50-layer residual network. We use those image embeddings of size 2048 without further adjustments.

AlexNet. AlexNet (Krizhevsky et al., 2017) is a vision architecture built with 5 convolutional layers and 3 fully-connected layers. AlexNet specializes in identifying depth; the network architecture effectively classifies objects such as keyboards and a large subset of animals. This makes AlexNet embeddings good predictors for features such as depth, which are generally problematic in memes due to image subsections (e.g. text boxes). We use an embedding size of 4096 in our experiments.

DenseNet. Deep pre-trained models such as ResNet use a large number of hidden layers. While the increase in depth allows for better feature abstraction, it often leads to vanishing-gradient problems during training. DenseNet (Huang et al., 2017b) introduces dense blocks in which each layer takes the feature maps of all preceding layers as input, and its own feature maps are used as inputs to all subsequent layers. This approach encourages feature reuse and may lead to more generalizable image embeddings. Each DenseNet image embedding has a size of 1000.

The aim of using multiple vector embeddings was to cumulatively cover a significant portion of possible meme combinations and templates. In Section 4 we show how the ensemble of systems using different image embeddings leads to higher validation scores.

3 Results

Table 1 presents the system ranking for the meme detection subtask. Our system placed 7th out of 9 submitted runs in terms of F1 score (the harmonic mean of precision and recall, commonly used to evaluate classification systems), impeded primarily by inconsistent recall but well above the random baseline (+0.2466 F1).

Team        Run   Precision   Recall   F1
Baseline    -     0.525       0.5147   0.5198
UniTor      1     0.839       0.8431   0.8411
UniTor      2     0.8522      0.848*   0.8501*
SNK         1     0.8515      0.8431   0.8473
SNK         2     0.8317      0.848*   0.8398
UPB         1     0.861*      0.7892   0.8235
UPB         2     0.8543      0.8333   0.8437
ArchiMeDe   1     0.8249      0.7157   0.7664
Keila       1     0.8121      0.6569   0.7263
Keila       2     0.7389      0.652    0.6927

Table 1: System ranking for the DANKMEMES meme detection subtask. The top score in each column is marked with an asterisk.

Results suggest that ArchiMeDe has developed inductive biases for specific image features that strongly influence the classification outcome. By inspecting validation folds over training data, we observe that most false negatives produced by the system involve distinct facial characteristics of scene actors. Conversely, ArchiMeDe effectively classifies images containing text bubbles and evident manual edits. Another notable failure case we identified is due to face-swapping. This failure is especially relevant since face-swapping is commonly used to add an ironic component to meme images, but it is hard to detect without real-world context.

4 Other Embedding Approaches

As a complementary perspective on our experiments, in this section we present other approaches that were tested for meme detection and finally discarded in favor of the ArchiMeDe approach presented in Section 2.

CNN without Metadata. Preliminary runs on the DANKMEMES dataset relied solely on standard convolutional neural networks. The target architecture was fed the image itself without associated metadata, so as to isolate the standalone impact of the architecture. The system performed poorly, scoring only slightly better than the baseline. We did not take additional measures to optimize this network since we assumed that this naive approach would not lead to substantial gains over the baseline.

Single Pre-trained Image Encoder. Before working with an ensemble, we estimated the performance of its components on meme detection. Besides the three models that we finally included in ArchiMeDe, we also tested ResNeSt (Zhang et al., 2020). Table 2 presents the performance of the individual image encoders and the final ensemble over a validation split containing 320 examples equally distributed over the meme and non-meme classes. Results show that the DenseNet model appears to be better in terms of precision, while ResNet is worse but compensates with a higher recall. We found that misclassified observations were different across models, suggesting that each model could capture different properties of the input. The only exception was the ResNeSt model, whose errors closely matched those of ResNet-50; it was therefore dropped from further experiments.

Encoder     Precision   Recall    F1
AlexNet     .83/.77     .75/.85   .79/.81
DenseNet    .87/.83     .82/.87   .84/.85
ResNet      .83/.79     .87/.86   .85/.83
ResNeSt     .80/.84     .84/.76   .82/.79
ArchiMeDe   .87/.85     .84/.87   .86/.86

Table 2: Performance of ArchiMeDe variants with single image encoders over a validation split of the DANKMEMES training set. Scores are presented for non-meme/meme classes.

Multimodal Ensemble. Given the complementary viewpoints of the different encoders, we decided to evaluate the performance of an ensemble. Table 2 shows that our ArchiMeDe ensemble achieves the best F1 score on both classes (.86/.86): it matches or exceeds every single-encoder system in precision and trails only ResNet in non-meme recall, compensating for the weaknesses of individual systems. The resulting majority-vote ensemble was optimized and used as the final system for our submission. Multiple experimental iterations showed that an increase in depth, followed by a reduction in the layers' width, led to increased accuracy scores. Each model was trained with a batch size of 64 for up to 100 epochs, with callbacks monitoring test accuracy and an early stopping strategy with a patience of five epochs. Each model used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 and was trained with a binary cross-entropy loss over the two categories.

4.1 Data Augmentation

Given the relatively small size of the available training dataset, and since popular classification models are often trained using thousands if not millions of images, we tested some data augmentation strategies to improve our system's generalization. To augment the data, we applied random changes to each image, modifying its brightness, rotation, and zoom within a reasonable margin to keep it recognizable. Nine augmented images were produced for every original image. As a result, the training dataset grew from 1280 to 12800 images.

Every augmented image is associated with the same metadata as the original, varying only in the visual embedding itself. We expected this to improve generalization by helping the model fit the general rule for recognizing memes. However, our results showed the opposite behavior: the system easily overfitted individual observations when data augmentation was used. We think this was partly due to augmentations not pertinent to the general meme template and partly because of the significant increase in the number of entries having the same associated metadata.

An extensive set of augmentation strategies was tested over the dataset, varying the augmentation factors, ranges, and count. No configuration significantly and consistently improved the system's performance, so the augmentation process was judged noisy and inconclusive and was dropped from the training procedure.

5 Discussion and Conclusion

In this paper, we presented ArchiMeDe, our multimodal system used for participating in the DANKMEMES task at EVALITA 2020. The results produced by the system are promising, even though the system does not encode inductive biases specific to multimodal artifact recognition or to meme detection in particular. The entry is not far behind the best-performing systems in terms of precision (0.8249 against 0.861 for the top-scoring run), and several paths display considerable potential for improving its performance. The paper highlights the crucial impact of transfer learning on the success of this system. Notably, ArchiMeDe can be easily trained with standard consumer-level GPUs.

One direction for improving the current system would be to adjust the decision threshold of the classifiers to obtain a better precision-recall balance. Another possibility involves introducing an aggregator network on top of the ensemble instead of using majority vote: in this way, the network can learn whether the predictions of a single subnetwork are reliable, regardless of whether it is part of the majority. The ensemble could also include more varied models with differing architectures to further accentuate differences in feature representations. Above all, we believe that leveraging additional data (not necessarily in Italian) could significantly improve the system's performance at the cost of increased training time and computation.

Memes today are one of the most formidable modes of portraying one's ideas while building a strong interpersonal connection between creators and users. The informality of memes, combined with the ease of making and distributing them, has greatly accelerated their growth in the last few years. Interpreting memes effectively is a far deeper task than intuition suggests. As humans continue to unravel their minds and derive ingenious computational methods, we realize the importance of slang and how it relates directly to the core human principle of community belonging. A piece of our culture, memes are the best represented and documented cultural artifacts we have today, and interpreting them effectively would mean crossing a significant milestone for the field of NLP, with lasting impacts on our society as a whole.

References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.
T. Brown, B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, G. Krüger, Tom Henighan, R. Child, Aditya Ramesh, D. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, E. Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J. Clark, Christopher Berner, Sam McCandlish, A. Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. ArXiv, abs/2005.14165.
Alessandra Teresa Cignarella, Cristina Bosco, and Paolo Rosso. 2019. Presenting TWITTIRÒ-UD: An Italian Twitter treebank in universal dependencies. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.
Simone Francia, Loreto Parisi, and Paolo Magnani. 2020. UmBERTo: an Italian language model trained with whole word masking.
Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
Kaiming He, Ross B. Girshick, and P. Dollár. 2019. Rethinking ImageNet pre-training. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4917–4926.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia, July. Association for Computational Linguistics.
Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2017a. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269.
Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2017b. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, May.
Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.
Alessio Miaschi and Felice Dell'Orletta. 2020. Contextual and non-contextual word embeddings: an in-depth linguistic investigation. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 110–119, Online, July. Association for Computational Linguistics.
Alessio Miaschi, Gabriele Sarti, Dominique Brunato, Felice Dell'Orletta, and Giulia Venturi. 2020. Italian transformers under the linguistic lens. In Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it).
Martina Miliani, Giulia Giorgi, Ilir Rama, Guido Anselmi, and Gianluca E. Lebani. 2020. DANKMEMES @ EVALITA2020: The memeing of life: memes, multimodality and politics. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.
Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online, July. Association for Computational Linguistics.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, November. Association for Computational Linguistics.
Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in universal dependencies. In Proceedings of the Eleventh Language Resources and Evaluation Conference (LREC 2018).
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi-Li Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, M. Li, and Alex Smola. 2020. ResNeSt: Split-attention networks. ArXiv, abs/2004.08955.
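
As a rough illustration of the metadata preprocessing described in the Metadata subsection (standardized engagement, min-max scaled day counts from 1 January 2015, binary actor fields) and of the concatenation and majority-vote steps of the system description, the following Python sketch may help. It is not taken from the authors' repository: every function name, the toy values, and the use of the population standard deviation are assumptions made only for this example.

```python
from datetime import date
from statistics import mean, pstdev

REFERENCE_DATE = date(2015, 1, 1)  # reference date stated in the paper


def standardize(values):
    """Scale values to zero mean and unit standard deviation.

    Assumption: population standard deviation; needs >= 2 distinct values.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def days_since_reference(iso_dates):
    """Convert yyyy-mm-dd strings to day counts from the reference date."""
    return [(date.fromisoformat(d) - REFERENCE_DATE).days for d in iso_dates]


def min_max(values):
    """Rescale values linearly to [0, 1]; needs >= 2 distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def encode_actors(actors, vocabulary):
    """One binary field per training-set actor; unseen actors are ignored."""
    present = set(actors)
    return [1.0 if name in present else 0.0 for name in vocabulary]


def multimodal_input(image_emb, sentence_emb, metadata):
    """Concatenate one image embedding, the text embedding and metadata."""
    return list(image_emb) + list(sentence_emb) + list(metadata)


def majority_vote(predictions):
    """Return 1 (meme) if more than half of the binary predictions are 1."""
    return int(sum(predictions) > len(predictions) / 2)


# Toy data (hypothetical values, not taken from the DANKMEMES dataset).
engagement = standardize([10, 20, 30])
dates = min_max(days_since_reference(["2015-01-01", "2015-01-11", "2015-01-21"]))
actors = encode_actors(["A", "Z"], vocabulary=["A", "B", "C"])  # "Z" is unseen
features = multimodal_input([0.1, 0.2], [0.3], [engagement[0], dates[1]] + actors)
vote = majority_vote([1, 0, 1])  # one prediction per image-encoder branch
```

In the system described above, each of the three image encoders would yield its own concatenated vector and its own prediction; the vote here simply counts those three binary outputs, so with an odd number of branches no tie can occur.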