=Paper=
{{Paper
|id=Vol-2765/139
|storemode=property
|title=UNITOR @ DANKMEMES: Combining Convolutional Models and Transformer-based architectures for accurate MEME management
|pdfUrl=https://ceur-ws.org/Vol-2765/paper139.pdf
|volume=Vol-2765
|authors=Claudia Breazzano,Edoardo Rubino,Danilo Croce,Roberto Basili
|dblpUrl=https://dblp.org/rec/conf/evalita/BreazzanoRC020
}}
==UNITOR @ DANKMEMES: Combining Convolutional Models and Transformer-based architectures for accurate MEME management==
Claudia Breazzano, Edoardo Rubino, Danilo Croce and Roberto Basili
University of Roma, Tor Vergata
Via del Politecnico 1, Rome, 00133, Italy
claudiabreazzano@outlook.it, edoardo.ru94@libero.it, {croce,basili}@info.uniroma2.it

Abstract

This paper describes the UNITOR system that participated in the "multimoDal Artefacts recogNition Knowledge for MEMES" (DANKMEMES) task within the context of EVALITA 2020. UNITOR implements a neural model which combines a Deep Convolutional Neural Network to encode the visual information of input images and a Transformer-based architecture to encode the meaning of the attached texts. UNITOR ranked first in all subtasks, clearly confirming the robustness of the investigated neural architectures and suggesting the beneficial impact of the proposed combination strategy.

1 Introduction

In social networks, the ways to express opinions have evolved from simply writing a post to publishing more complex contents, e.g., compositions of images and texts. These multi-modal objects, when adhering to specific social conventions and visual specifications, are called MEMEs. In particular, a MEME is a multi-modal artifact, manipulated by users, which combines intertextual elements to convey a message. Characterized by a visual format that includes images, text, or a combination of them, MEMEs combine references to current events or related situations with pop-cultural references to music, comics and films (Ross and Rivers, 2017). In this context, the multimoDal Artefacts recogNition Knowledge for MEMES (DANKMEMES) task is the first EVALITA (Basile et al., 2020) task for MEME recognition and hate speech/event identification in MEMEs (Miliani et al., 2020). The task is divided into three subtasks: in MEME Detection, the system is required to determine whether an image is a MEME, according to the definition of (Shifman, 2013); in Hate Speech Identification the aim is to recognize whether a MEME expresses an offensive message; finally, in Event Clustering the aim is to cluster MEMEs according to the topics they refer to.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In this work, we present the UNITOR system, which participated in all three subtasks. Since MEMEs convey their content through the multi-modal combination of an image and a text, UNITOR implements a neural network which combines state-of-the-art architectures for Computer Vision and Natural Language Processing. In particular, Deep Convolutional Neural Networks, such as (He et al., 2016; Tan and Le, 2019), are used to encode visual information into dense embeddings, and Transformer-based architectures, such as (Devlin et al., 2019; Liu et al., 2019), encode the meaning of the overlaid captions. UNITOR then stacks a multi-layered network in order to effectively combine the evidence captured by both encoders in the final classification. The UNITOR system ranked first in each subtask, clearly confirming the robustness of the investigated neural architectures and suggesting the beneficial contribution of the proposed combination strategy. In the rest of the paper, Section 2 describes the UNITOR system, while Section 3 reports the experimental results.

2 UNITOR Description

CNNs for Image classification. Recent years have demonstrated that Convolutional Neural Networks (CNNs) achieve state-of-the-art results in image processing (Jiao and Zhao, 2019) by implementing deep and complex stackings of convolutional layers, which capture different aspects of input images at different levels of the network. Among the investigated architectures, we first considered ResNET (He et al., 2016): this network was the first to introduce Residual Learning to define very deep and effective CNNs. Several ResNET architectures are defined by stacking 50, 101, 152 and up to 1001 convolutional layers and skip connections: as a result, deeper networks achieved significant improvements over the previous state of the art in a wide plethora of image processing tasks. Moreover, we investigated the recently proposed EfficientNet (Tan and Le, 2019): unlike ResNET, this is not a single architecture, but an automatic methodology to improve the performance of an existing CNN (such as ResNET) by tuning its depth, width and resolution dimensions. The adoption of this methodology led to the definition of 8 CNNs (namely EfficientNet-B0, EfficientNet-B1 up to EfficientNet-B7), each characterized by an increasing depth and width. They achieve impressive results by efficiently balancing the number of parameters of the network. The tuning process of (Tan and Le, 2019) demonstrated that a network such as EfficientNet-B3 achieves higher accuracy than ResNeXt101 (Xie et al., 2016) while using 18x fewer neural operations.

Regardless of the adopted network, these models are pre-trained on a classification task involving the recognition of thousands of object types in several millions of images, i.e. the ImageNet dataset (Deng et al., 2009). This pre-training step enables the network to recognize many "basic entities" (such as people or animals) before being applied to a new task, e.g., MEME Detection. The customization to a new task is obtained simply by replacing the last classification layer with a new one (sized according to the number of targeted classes) and by fine-tuning the entire architecture. It is worth noticing that, once the architecture is fine-tuned on the new down-stream task, it can also be used as an Image Encoder: the embeddings generated by the layer preceding the classification one can be used as low-dimensional representations of the input images. Most importantly, these embeddings are correlated with the down-stream task, as they are expected to lie in linearly separable sub-spaces (Goodfellow et al., 2016), where the final classifier is applied. In UNITOR these vectors are used to combine visual information with other evidence: in practice, they will be combined with the embeddings produced by the Transformer-based architectures (applied to texts) before being used as input to the final classifier.

Transformer-based Architectures for text classification. A MEME is a combination of visual information and an overlaid caption. In this work, we thus also investigated classifiers based on the text made available via OCR to the participants by the DANKMEMES organizers. In particular, we adopt the approach proposed in (Devlin et al., 2019), namely Bidirectional Encoder Representations from Transformers (BERT). It provides an effective way to pre-train a neural network over large-scale collections of raw texts and apply it to a large variety of supervised NLP tasks, here text classification. The building block of BERT is the Transformer element (Vaswani et al., 2017), an attention-based mechanism that learns contextual relations between the words in a text. The pre-training stage is based on two auxiliary tasks whose aim is the acquisition of an expressive and robust language and text model: the Masked Language Model acquires a meaningful and context-sensitive representation of words, while the Next Sentence Prediction task captures discourse-level information. In particular, this last task operates on text pairs to capture relational information between them, e.g. between consecutive sentences in a text. The straightforward application of BERT has shown better results than previous state-of-the-art models on a wide spectrum of natural language processing tasks. In (Liu et al., 2019) RoBERTa is proposed as a variant of BERT which modifies some key hyperparameters, including removing the next-sentence pre-training objective, and trains on more data, with much larger mini-batches and learning rates. This allows RoBERTa to improve on the masked language modelling objective compared with BERT and leads to better down-stream task performances.

We adopt here the fine-tuning process for sequence classification, where sequences correspond to the texts extracted from images. The special token [CLS] is added as the first element of each input sentence, so that BERT associates a specific embedding to it. This dense vector represents the entire sentence and is used as input to a linear classifier customized for the target classification task: in MEME Detection and Hate Speech Identification, two classes are considered, while in Event Clustering five classes reflect the target topics. During training, all the network parameters are fine-tuned. BERT and RoBERTa are pre-trained over English text, and they capture language models specific to this language. In order to apply these architectures to Italian, we investigated several alternative models, pre-trained using document collections in Italian or in multiple languages.
Among these models, AlBERTo (Polignano et al., 2019) is a BERT-based model pre-trained over the Twita corpus (Basile and Nissim, 2013), made of millions of Italian tweets, while GilBERTo (https://huggingface.co/idb-ita/gilberto-uncased-from-camembert) and UmBERTo (https://huggingface.co/Musixmatch/umberto-wikipedia-uncased-v1) are RoBERTa-based models pre-trained over the OSCAR corpus and the Italian version of Wikipedia, respectively. Among the multi-lingual models, we investigated multilingual BERT (mBERT) (Pires et al., 2019) and XLM-RoBERTa (Conneau et al., 2020), which extend the corresponding pre-training to texts in more than 100 languages.

Regardless of the adopted Transformer-based architecture, we also investigated the adoption of additional annotated material to support the training of complex networks over the very short texts extracted from MEMEs. In particular, in Hate Speech Identification, we used an external dataset which addresses the same task, but over a different source. We thus adopted the dataset made available within the Hate Speech Detection (HaSpeeDe) task (Bosco et al., 2018), which involves the automatic recognition of hateful contents in Twitter (HaSpeeDe-TW) and Facebook (HaSpeeDe-FB) posts. Each investigated architecture is trained for a few epochs on the HaSpeeDe dataset before the real training is applied to the DANKMEMES material. In this way, the neural model, which is not specifically pre-trained to detect hate speech, is expected to improve its "expertise" in handling such a phenomenon (even though using material derived from a different source) before being specialized on the final DANKMEMES task (an alternative approach, consisting in adding the messages from HaSpeeDe to the training set, led to lower results, not reported here due to lack of space). We trained UmBERTo on HaSpeeDe-TW, on HaSpeeDe-FB, and on their merge. Initial experiments suggested that higher accuracy can be achieved by considering only the material from Facebook (HaSpeeDe-FB). We suppose this is mainly due to the fact that messages from HaSpeeDe-FB and DANKMEMES share similar political topics.

As for a CNN, once the Transformer-based architecture is fine-tuned on the new task, it can be used as a text encoder, by removing the final linear classifier and selecting the embedding associated with the [CLS] token. These vectors will be used in UNITOR in combination with the embeddings derived from the CNN architecture, as described hereafter.

Combining visual and semantic evidence. UNITOR adopts an approach similar to the Feature Concatenation Model (FCM) already seen in (Oriol et al., 2019; Gomez et al., 2020) to combine visual and textual information. For each subtask, the specific CNN achieving the best results on the development set is selected among the investigated ones; the same happens for the Transformer-based architectures. Once the "best" architectures are selected and fine-tuned for visual and textual analysis, they are used to encode the entire dataset. This allows training a new classifier which draws on the evidence from both aspects. In UNITOR these encodings are concatenated, so that the final classifier is a Multi-layered Perceptron (we also investigated more complex combinations, such as the weighted sum or the point-wise product of the embeddings, but they obtained lower results). Only this final classifier is trained, as the remaining parameters are supposed to be already optimized for the task. Future work will consider the fine-tuning of all the parameters of this combined network, ignored here for the (too) high computational cost required by this more elegant approach. It must be said that other information is available in the competition: for example, each MEME is accompanied by its publication date or by the list of politicians appearing in the picture. We investigated the manual definition of feature vectors to be added to the concatenation described above. Unfortunately, these vectors did not provide any significant impact during our experiments, so we only relied on visual and textual information. We suppose this additional information is too sparse (given the dataset size) to provide any valuable evidence.

Modelling Event Clustering as a Classification task. While Event Clustering may suggest a straightforward application of unsupervised algorithms, we adopted a supervised setting, under the hypothesis that the training and test datasets share the same topics. We modelled this subtask as a classification problem, where each MEME is to be assigned to one of five classes reflecting the underlying topic. UNITOR implements two different approaches. In the first model, the same setting adopted in the other subtasks is used: a CNN and a Transformer-based architecture are optimized on Task 3 and used as encoders to train the final MLP classifier. Unfortunately, most of the texts are too short to be valuable in the final classification. We thus adopted a second model, inspired by the capability of BERT-based models to effectively operate over text pairs, achieving state-of-the-art results in tasks such as Textual Entailment and Natural Language Inference (Devlin et al., 2019). In this second setting, each input MEME generates five pairs (one for each topic) in the form ⟨topic definition, text⟩. Let us consider the example "ma come chi sono? presidé só io senza fotoscioppe!" (in simplified English: "Are you seriously asking who I am? Mr President, it's me without Photoshop effects!"), associated with topic #2, defined as "L'inizio delle consultazioni con i partiti politici e il discorso al Senato di Conte". It generates new inputs in the form "[CLS] ma come chi ... fotoscioppe! [SEP] L'inizio delle ... Senato di Conte. [SEP]", which defines sentence pairs in BERT-like architectures. The same approach is applied with respect to each topic. In other words, the original five-class classification problem is mapped to a binary one: each pair is a positive example when the text is associated with the correct topic, and a negative one otherwise. In this way, we expect to detect a possible "semantic connection" between the extracted text and the paired (correct topic) description. At classification time, for each MEME, five new examples are derived (one per topic) and classified; the topic generating the pair with the highest softmax score is selected as output.

3 Experimental evaluation and results

UNITOR participated in all subtasks within DANKMEMES. For parameter tuning, we adopted a 10-fold cross validation, so that the training material is divided into 10 folds, each split according to a 90%-10% proportion. The model is trained using a standard Cross-entropy Loss and an ADAM optimizer initialized with a learning rate of 2·10^-5. We trained the model for 5 epochs, using a batch size of 32 elements. When combining the networks, the number of hidden layers in the MLP classifier is tuned between 1 and 3. At test time, for each task, an Ensemble of such classifiers is used: each image is in fact classified using all 10 models trained on the different folds, and the label suggested by the highest number of classifiers is selected. UNITOR is implemented using pytorch (https://pytorch.org/).

Task 1 - MEME Detection. For subtask 1, the training dataset counts 1,600 examples, equally labelled as "MEME" and "NotMEME". The results of UNITOR are reported in Table 1, where systems are evaluated in terms of Precision, Recall and F1-measure over the binary classification task (the F1 is used to rank systems). The last row reports a baseline model which randomly assigns labels to images.

System      Precision  Recall  F1      Rank
UNITOR-R2   0.8522     0.8480  0.8501  1
SNK-R1      0.8515     0.8431  0.8473  2
UNITOR-R1   0.8390     0.8431  0.8411  4
Baseline    0.5250     0.5147  0.5198  -

Table 1: UNITOR Results in Task 1.

MEMEs generally adhere to specific visual conventions, where the meaning of the text is secondary: as a consequence, our first model (UNITOR-R1) relies only on an image classifier. In particular, it corresponds to the fine-tuning of EfficientNet-B3 over the official dataset. In order to improve the robustness of this CNN, we adopted a simple data augmentation technique, duplicating the training material and horizontally mirroring it. UNITOR-R1 ranked fourth (over 10 submissions) in the competition. This clearly confirms the effectiveness of EfficientNet, combined with the adopted Ensemble technique. We also investigated larger variants of EfficientNet, but they did not outperform the B3 variant: we suppose these larger architectures are more exposed to over-fitting, also considering the dataset size. Moreover, we adopted a model that combines the output of EfficientNet-B3 with a Transformer-based architecture. Among all the investigated architectures, AlBERTo achieved the highest classification accuracy. Once tuned (in the same 10-fold evaluation schema), it is used to encode the entire dataset, and its embeddings are concatenated to the ones from EfficientNet-B3. This enables the training of 10 MLPs (one per fold) whose Ensemble defines UNITOR-R2, which ranked first in the task, with an F1 of 0.8501. The overall results thus also confirm the beneficial (although limited) impact of textual information in this subtask.

Task 2 - Hate Speech Identification. The training dataset available for subtask 2 contains 800 training examples, labelled as "Hate" and "NotHate", while the test dataset counts 200 examples.
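The pair-based reduction used for Event Clustering can be sketched as follows. The topic definitions and the word-overlap scorer below are illustrative stand-ins: a real system would score each pair with the positive-class softmax output of the fine-tuned BERT-like model.

```python
# Sketch of the pair-based Event Clustering reduction: each MEME text
# is paired with every topic definition, each pair is scored as a
# binary "semantic connection" problem, and the argmax topic wins.

def build_pairs(text, topic_definitions):
    """Render one BERT-style input string per candidate topic."""
    return [f"[CLS] {text} [SEP] {definition} [SEP]"
            for definition in topic_definitions]

def predict_topic(text, topic_definitions, score_pair):
    """Score every <text, topic definition> pair and keep the argmax."""
    scores = [score_pair(pair) for pair in build_pairs(text, topic_definitions)]
    return max(range(len(scores)), key=scores.__getitem__)

def overlap_score(pair):
    """Toy scorer: word overlap between the two [SEP]-separated segments."""
    left, right = pair.split(" [SEP] ")[0], pair.split(" [SEP] ")[1]
    return len(set(left.lower().split()) & set(right.lower().split()))

# Hypothetical topic definitions, for illustration only.
topics = ["elezioni e campagna elettorale",
          "consultazioni con i partiti politici"]
print(predict_topic("le consultazioni con i partiti", topics, overlap_score))  # 1
```

The reduction keeps the model binary regardless of how many topics exist, which is also why richer topic descriptions (as discussed for future work) can help without changing the classifier.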
In Table 2 the results obtained by UNITOR are reported, according to the same metrics adopted in Task 1. Unlike the first subtask, Hate Speech is more related to the textual information. Even the baseline reflects this: it is given by a classifier labelling a MEME as offensive whenever it includes at least one swear word (resulting in a system with a high Precision and a very low Recall).

System      Precision  Recall  F1      Rank
UNITOR-R2   0.7845     0.8667  0.8235  1
UNITOR-R1   0.7686     0.8857  0.8230  2
UPB         0.8056     0.8286  0.8169  3
Baseline    0.8958     0.4095  0.5621  -

Table 2: UNITOR Results in Task 2.

In this task, we adopted UmBERTo (pre-trained over Wikipedia), fine-tuned for 3 epochs over the HaSpeeDe dataset and then for 3 epochs over the DANKMEMES dataset. Again, the 10-fold schema is adopted, and the final Ensemble of such UmBERTo models originated UNITOR-R1, which ranked 2nd over 5 submissions. The improvement with respect to the first competitive system confirms the robustness of the adopted Transformer-based architecture combined with the adopted auxiliary training step. We then combined this model with a CNN (here ResNET152) to also exploit visual information, as in the previous subtask. This combination originated UNITOR-R2, which again provided the best results in the competition, even though a very small margin is obtained w.r.t. UNITOR-R1.

Task 3 - Event Clustering. The training dataset available for subtask 3 contains 800 training examples for the 5 targeted topics, and the test dataset is made of 200 examples. In Table 3 the performances of UNITOR are reported, as for the previous subtask. Since this is a multi-class classification task, each system is evaluated with respect to each of the 5 labels in a binary setting, and the macro-average is then applied to Precision, Recall and F1. Here, the baseline is given by a classifier labelling every MEME as belonging to the most represented class (i.e. topic 0, containing miscellaneous examples).

System      Precision  Recall  F1      Rank
UNITOR-R1   0.2683     0.2851  0.2657  1
UNITOR-R2   0.2096     0.2548  0.2183  2
Baseline    0.0960     0.2000  0.1297  -

Table 3: UNITOR Results in Task 3.

Its results, i.e. an F1 of 0.1297, suggest that this is a very challenging task, where the dataset is quite limited, especially considering the overlap that exists among the political topics. In the first row, the run UNITOR-R1 is reported: it corresponds to a model that combines the embeddings from ResNET152 and those obtained by AlBERTo, both achieving the best accuracy in our initial tuning within this subtask. UNITOR-R1 ranked first (among three submissions) in this competition with an F1 of 0.2657, which doubles the result obtained by the baseline. It must be said that the Transformer achieves significantly better results than the CNN, suggesting that visual information is negligible also in this subtask (these results are not reported for lack of space). We thus evaluated a model which considers only text, by fine-tuning an AlBERTo model adopting the pair-based approach presented in Section 2, where each text is associated with the description of the topic. Unfortunately, this model, namely UNITOR-R2, under-performed the first submission, with an F1 of 0.2183.

Figure 1: Distribution of labels and classifications in Task 3 (per-topic counts, 0-4, for UNITOR-R1, UNITOR-R2 and the Gold Standard; plot not reproduced here).

For an error analysis, we compared the assignments provided in the test set with those derived by UNITOR, as shown in Figure 1. First, it is clear that the dataset is highly unbalanced, with half of the examples assigned to the class with uncertain topics. Moreover, it can be seen that the combination of textual and visual information makes UNITOR-R1 more robust in detecting topic 2 and, most importantly, topic 1, which is ignored by UNITOR-R2. Topics 3 and 4 are ignored by UNITOR, but they are also under-represented in the training material. UNITOR-R2 seems more conservative with respect to the largest class (topic 0): it is clear that the repetition of the same topic over many examples introduced a bias. Future work will consider the adoption of more expressive and varied topic descriptions to be paired with the texts: for example, we will select headline news retrieved using Retrieval Engines (e.g., by querying with the topic description) to obtain a more expressive representation of the topics.

4 Conclusions

This work presented the UNITOR system participating in the DANKMEMES task at EVALITA 2020. UNITOR merges visual and textual evidence by combining state-of-the-art deep neural architectures, and it ranked first in all subtasks defined in the competition. These results confirm the beneficial impact of the adopted Convolutional and Transformer-based architectures in the automatic recognition of MEMEs as well as in Hate Speech Identification and Event Clustering. Future work will investigate multi-task learning approaches to combine the adopted architectures in a more principled way.

References

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In Proceedings of EVALITA 2018, Turin, Italy, December 12-13, 2018, volume 2263 of CEUR Workshop Proceedings.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL 2020, Online, July 5-10, 2020, pages 8440–8451.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL 2019, pages 4171–4186, Minneapolis, Minnesota, June.

R. Gomez, J. Gibert, L. Gomez, and D. Karatzas. 2020. Exploring hate speech detection in multimodal publications. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1459–1467.

Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press, Cambridge, MA, USA.

K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

L. Jiao and J. Zhao. 2019. A survey on the new generation of deep learning in image processing. IEEE Access, 7:172231–172263.

Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Martina Miliani, Giulia Giorgi, Ilir Rama, Guido Anselmi, and Gianluca E. Lebani. 2020. DANKMEMES @ EVALITA2020: The memeing of life: memes, multimodality and politics. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Benet Oriol, Cristian Canton-Ferrer, and Xavier Giró i Nieto. 2019. Hate speech in pixels: Detection of offensive memes towards automatic moderation. In NeurIPS 2019 Workshop on AI for Social Good, Vancouver, Canada, 09/2019.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy, July.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481. CEUR.

Andrew Ross and Damian J. Rivers. 2017. Digital cultures of political participation: Internet memes and the discursive delegitimization of the 2016 U.S. presidential candidates. Discourse, Context and Media, 16:1–11, 01.

Limor Shifman. 2013. Memes in a digital world: Reconciling with a conceptual troublemaker. J. Comput. Mediat. Commun., 18:362–377.

Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv e-prints, arXiv:1905.11946, May.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2016. Aggregated residual transformations for deep neural networks. arXiv e-prints, arXiv:1611.05431, November.