     UNITOR @ DANKMEMES: Combining Convolutional Models and
     Transformer-based architectures for accurate MEME management
     Claudia Breazzano and Edoardo Rubino and Danilo Croce and Roberto Basili
                            University of Roma, Tor Vergata
                        Via del Politecnico 1, Rome, 00133, Italy
                         claudiabreazzano@outlook.it, edoardo.ru94@libero.it
                                        {croce,basili}@info.uniroma2.it

Abstract

This paper describes the UNITOR system that participated in the "multimoDal Artefacts recogNition Knowledge for MEMES" (DANKMEMES) task within the context of EVALITA 2020. UNITOR implements a neural model which combines a Deep Convolutional Neural Network, to encode the visual information of input images, with a Transformer-based architecture, to encode the meaning of the attached texts. UNITOR ranked first in all subtasks, clearly confirming the robustness of the investigated neural architectures and suggesting the beneficial impact of the proposed combination strategy.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In social networks, the ways to express opinions have evolved from simply writing a post to publishing more complex contents, e.g., compositions of images and texts. These multi-modal objects, if adhering to specific social conventions and visual specifications, are called MEMEs. In particular, a MEME is a multi-modal artifact, manipulated by users, who combine intertextual elements to convey a message. Characterized by a visual format that includes images, text, or a combination of them, MEMEs combine references to current events or related situations with pop-cultural references to music, comics and films (Ross and Rivers, 2017). In this context, the multimoDal Artefacts recogNition Knowledge for MEMES (DANKMEMES) task is the first EVALITA (Basile et al., 2020) task for MEME recognition and hate speech/event identification in MEMEs (Miliani et al., 2020). The task is divided into three subtasks: in MEME Detection, a system is required to determine whether an image is a MEME, according to the definition of (Shifman, 2013); in Hate Speech Identification the aim is to recognize whether a MEME expresses an offensive message; finally, in Event Clustering the aim is to cluster MEMEs according to the topics they refer to.

In this work, we present the UNITOR system, which participated in all three subtasks. Since MEMEs convey their content through the multi-modal combination of an image and a text, UNITOR implements a neural network which combines state-of-the-art architectures for Computer Vision and Natural Language Processing. In particular, Deep Convolutional Neural Networks, such as (He et al., 2016; Tan and Le, 2019), are used to encode visual information into dense embeddings, while Transformer-based architectures, such as (Devlin et al., 2019; Liu et al., 2019), encode the meaning of the overlaid captions. UNITOR then stacks a multi-layered network on top of both encoders in order to effectively combine, in the final classification, the evidence they capture.

The UNITOR system ranked first in each subtask, clearly confirming the robustness of the investigated neural architectures and suggesting the beneficial contribution of the proposed combination strategy. In the rest of the paper, Section 2 describes the UNITOR system, while Section 3 reports the experimental results.

2 UNITOR Description

CNNs for Image classification. Recent years have demonstrated that Convolutional Neural Networks (CNNs) achieve state-of-the-art results in image processing (Jiao and Zhao, 2019) by implementing deep and complex stackings of convolutional layers, which capture different aspects of the input images at different levels of the network.

Among the investigated architectures, we first considered ResNET (He et al., 2016): this network was the first to introduce Residual Learning to define very deep and effective CNNs. Several ResNET architectures are defined by stacking 50, 101, 152 and up to 1001 convolutional layers with skip connections: as a result, deeper networks achieved significant improvements over the previous state of the art in a wide plethora of image processing tasks. Moreover, we investigated the recently proposed EfficientNet (Tan and Le, 2019): unlike ResNET, this is not a single architecture, but an automatic methodology to improve the performance of an existing CNN (such as ResNET) by tuning its depth, width and resolution dimensions. The adoption of this methodology led to the definition of 8 CNNs (namely EfficientNET-B0, EfficientNET-B1 and up to EfficientNET-B7), each characterized by increasing depth and width. They achieve impressive results by efficiently balancing the number of parameters of the network. The tuning process of (Tan and Le, 2019) demonstrated that a network such as EfficientNet-B3 achieves higher accuracy than ResNeXt101 (Xie et al., 2016) while using 18x fewer neural operations. Regardless of the adopted network, these models come already trained on a classification task involving the recognition of thousands of object types in several millions of images, i.e., the ImageNet dataset (Deng et al., 2009). This pre-training step enables the network to recognize many "basic entities" (such as people or animals) before being applied to a new task, e.g., MEME Detection. The customization to a new task is obtained simply by replacing the last classification layer with a new one (sized according to the number of targeted classes) and by fine-tuning the entire architecture. It is worth noticing that, once the architecture is fine-tuned on the new downstream task, it can also be used as an Image Encoder: the embeddings generated at the layer preceding the classification one can be used as low-dimensional representations of the input images. Most importantly, these embeddings are correlated with the downstream task, as they are expected to lie in linearly separable sub-spaces (Goodfellow et al., 2016), where the final classifier is applied. In UNITOR these vectors are used to combine visual information with other evidence: in practice, they are used in combination with the embeddings produced by the Transformer-based architectures (applied to texts) before being fed to the final classifier.

Transformer-based Architectures for text classification. A MEME is a combination of visual information and an overlaid caption. In this work, we thus also investigated classifiers based on the text made available via OCR to the participants by the DANKMEMES organizers. In particular, we adopted the approach proposed in (Devlin et al., 2019), namely Bidirectional Encoder Representations from Transformers (BERT). It provides an effective way to pre-train a neural network over large-scale collections of raw texts and to apply it to a large variety of supervised NLP tasks, here text classification. The building block of BERT is the Transformer (Vaswani et al., 2017), an attention-based mechanism that learns contextual relations between the words in a text. The pre-training stage is based on two auxiliary tasks, whose aim is the acquisition of an expressive and robust language and text model: the Masked Language Model acquires a meaningful and context-sensitive representation of words, while the Next Sentence Prediction task captures discourse-level information. In particular, this last task operates on text pairs to capture the relational information between them, e.g., between consecutive sentences in a text. The straightforward application of BERT has shown better results than previous state-of-the-art models on a wide spectrum of natural language processing tasks. In (Liu et al., 2019) RoBERTa is proposed as a variant of BERT which modifies some key hyperparameters, including removing the next-sentence pre-training objective, and trains on more data, with much larger mini-batches and learning rates. This allows RoBERTa to improve on the masked language modelling objective compared with BERT and leads to better downstream task performances. We adopt here the fine-tuning process for sequence classification, where sequences correspond to the texts extracted from the images. The special token [CLS] is added as the first element of each input sentence, so that BERT associates a specific embedding to it. This dense vector represents the entire sentence and is used as input to a linear classifier customized for the target classification task: in MEME Detection and Hate Speech Identification, two classes are considered, while in Event Clustering five classes reflect the target topics. During training, all the network parameters are fine-tuned. BERT and RoBERTa are pre-trained over text in English, and they capture language models specific to this language. In order to apply these architectures to Italian, we investigated several alternative models, pre-trained using document collections in Italian or in multiple languages. Among these models, AlBERTo (Polignano et al., 2019) is a BERT-based model pre-trained over the Twita corpus (Basile and Nissim, 2013), made of millions of Italian tweets, while GilBERTo (https://huggingface.co/idb-ita/gilberto-uncased-from-camembert) and UmBERTo (https://huggingface.co/Musixmatch/umberto-wikipedia-uncased-v1) are RoBERTa-based models pre-trained over the OSCAR corpus and the Italian version of Wikipedia, respectively. Among the multi-lingual models, we investigated multilingual BERT (mBERT) (Pires et al., 2019) and XLM-RoBERTa (Conneau et al., 2020), which extends the corresponding pre-training to texts in more than 100 languages.

Regardless of the adopted Transformer-based architecture, we also investigated the adoption of additional annotated material to support the training of such complex networks over the very short texts extracted from MEMEs. In particular, in Hate Speech Identification, we used an external dataset which addresses the same task, but over a different source. We thus adopted the dataset made available within the Hate Speech Detection (HaSpeeDe) task (Bosco et al., 2018), which involves the automatic recognition of hateful contents in Twitter (HaSpeeDe-TW) and Facebook posts (HaSpeeDe-FB). Each investigated architecture is trained for a few epochs over the HaSpeeDe dataset before the real training is applied to the DANKMEMES material. In this way, the neural model, which is not specifically pre-trained to detect hate speech, is expected to improve its "expertise" in handling such a phenomenon (even though using material derived from a different source) before being specialized on the final DANKMEMES task. (An alternative approach consists in adding the messages from HaSpeeDe to the training set: this approach led to lower results, not reported here due to lack of space.)

We trained UmBERTo on HaSpeeDe-TW, on HaSpeeDe-FB, and on their merge. Initial experiments suggested that higher accuracy can be achieved by considering only the material from Facebook (HaSpeeDe-FB). We suppose this is mainly due to the fact that messages from HaSpeeDe-FB and DANKMEMES share similar political topics. As for a CNN, once the Transformer-based architecture is fine-tuned on the new task, it can be used as a text encoder, by removing the final linear classifier and selecting the embedding associated to the [CLS] token. These vectors are used in UNITOR in combination with the embeddings derived from the CNN architecture, as described hereafter.

Combining visual and semantic evidence. UNITOR adopts an approach similar to the Feature Concatenation Model (FCM) already seen in (Oriol et al., 2019; Gomez et al., 2020) to combine visual and textual information. For each subtask, the specific CNN achieving the best results on the development set is selected among the investigated ones. The same happens for the Transformer-based architectures. Once the "best" architectures for visual and textual analysis are selected and fine-tuned, they are used to encode the entire dataset. This allows training a new classifier which accounts for the evidence from both aspects. In UNITOR these encodings are concatenated, so that the final classifier is a Multi-layered Perceptron. (We also investigated more complex combinations, such as the weighted sum or the point-wise product of the embeddings, but they obtained lower results.) Only this final classifier is fine-tuned, as the remaining parameters are supposed to be already optimized for the task. Future work will consider the fine-tuning of all the parameters of this combined network, ignored here for the (too) high computational cost required by this more elegant approach. It must be said that other information is available in the competition: for example, each MEME was supported with its publication date or the list of politicians appearing in the picture. We investigated the manual definition of feature vectors to be added to the concatenation described above. Unfortunately, these vectors did not provide any significant impact during our experiments, so we only relied on visual and textual information. We suppose this additional information is too sparse (given the dataset size) to provide any valuable evidence.

Modelling Event Clustering as a Classification task. While Event Clustering may suggest a straightforward application of unsupervised algorithms, we adopted a supervised setting, under the hypothesis that train and test datasets share the same topics. We modelled this subtask as a classification problem, where each MEME is to be assigned to one of the five classes reflecting the underlying topics. UNITOR implements two different approaches. In a first model, the same setting adopted in the other subtasks is used: a CNN and a Transformer-based architecture are optimized on Task 3 and used as encoders.
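The feature-concatenation strategy described above (two frozen encoders, concatenated embeddings, and a small trainable MLP on top) can be sketched as follows. This is a minimal PyTorch illustration under our own assumptions: the dimensions, class names and random inputs are illustrative, not the actual UNITOR implementation.

```python
# Sketch of a Feature Concatenation Model head: image and text embeddings
# (produced by frozen, already fine-tuned encoders) are concatenated and
# classified by a small MLP, the only trainable component.
import torch
import torch.nn as nn

class FeatureConcatClassifier(nn.Module):
    # Illustrative dimensions: e.g. a CNN penultimate layer and a BERT [CLS] vector.
    def __init__(self, img_dim=1536, txt_dim=768, hidden=512, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_emb, txt_emb):
        # img_emb: [batch, img_dim] from the CNN; txt_emb: [batch, txt_dim] from BERT.
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))

model = FeatureConcatClassifier()
logits = model(torch.randn(4, 1536), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

Training only this head keeps the number of trainable parameters small, which matters given the limited dataset sizes reported in Section 3.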
These encoders then train the final MLP classifier. Unfortunately, most of the texts are too short to be valuable in the final classification. We thus adopted a second model, inspired by the capability of BERT-based models to effectively operate over text pairs, achieving state-of-the-art results in tasks such as Textual Entailment and Natural Language Inference (Devlin et al., 2019). In this second setting, each input MEME generates five pairs (one for each topic) in the form ⟨topic definition, text⟩. Let us consider the example "ma come chi sono? presidé só io senza fotoscioppe!" (in simplified English: "Are you seriously asking who I am? Mr President, it's me without Photoshop effects!"), associated to topic #2, defined as "L'inizio delle consultazioni con i partiti politici e il discorso al Senato di Conte" ("The beginning of the consultations with the political parties and Conte's speech to the Senate"). It generates new inputs in the form "[CLS] ma come chi ... fotoscioppe! [SEP] L'inizio delle ... Senato di Conte. [SEP]", which defines sentence pairs in BERT-like architectures. The same approach is applied with respect to each topic. In other words, the original five-class classification problem is mapped to a binary one: each pair is a positive example when the text is associated with the correct topic, and a negative one otherwise. In this way, we expect to detect a possible "semantic connection" between the extracted text and the paired (correct topic) description. At classification time, for each MEME, five new examples are derived (one per topic) and classified. The topic receiving the highest softmax score is selected as output.

3 Experimental evaluation and results

UNITOR participated in all subtasks within DANKMEMES. For parameter tuning, we adopted a 10-fold cross validation, so that the training material is divided into 10 folds, each split according to a 90%-10% proportion. The model is trained using a standard cross-entropy loss and an ADAM optimizer initialized with a learning rate of 2·10−5. We trained the model for 5 epochs, using a batch size of 32 elements. When combining the networks, the number of hidden layers in the MLP classifier is tuned between 1 and 3. At test time, for each task, an Ensemble of such classifiers is used: each image is in fact classified using all 10 models trained on the different folds, and the label suggested by the highest number of classifiers is selected. UNITOR is implemented using PyTorch (https://pytorch.org/).

Task 1 - MEME Detection. For subtask 1, the training dataset counts 1,600 examples, equally labelled as "MEME" and "NotMEME". The results of UNITOR are reported in Table 1, in terms of Precision, Recall and F1-measure, calculated over the binary classification task (the latter being used to rank systems). The last row reports a baseline model which randomly assigns labels to the images.

  System       Precision   Recall     F1      Rank
  UNITOR-R2     0.8522     0.8480    0.8501    1
  SNK-R1        0.8515     0.8431    0.8473    2
  UNITOR-R1     0.8390     0.8431    0.8411    4
  Baseline      0.5250     0.5147    0.5198    -

  Table 1: UNITOR Results in Task 1.

MEMEs generally adhere to specific visual conventions, where the meaning of the text is secondary: as a consequence, our first model (UNITOR-R1) only relies on an image classifier. In particular, it corresponds to the fine-tuning of EfficientNet-B3 over the official dataset. In order to improve the robustness of this CNN, we adopted a simple data augmentation technique, duplicating the training material and horizontally mirroring it. UNITOR-R1 ranked fourth (over 10 submissions) in the competition. This clearly confirms the effectiveness of EfficientNet, combined with the adopted Ensemble technique. We also investigated larger variants of EfficientNet, but they did not outperform the B3 variant: we suppose these larger architectures are more exposed to over-fitting, also considering the dataset size.

Moreover, we adopted a model that combines the output of EfficientNet-B3 with a Transformer-based architecture. Among all the investigated architectures, AlBERTo achieved the highest classification accuracy. Once tuned (in the same 10-fold cross validation schema), it is used to encode the entire dataset, and its embeddings are concatenated to the ones from EfficientNet-B3. This enables the training of 10 MLPs (one per fold), whose Ensemble defines UNITOR-R2, which ranked first in the task, with an F1 of 0.8501. The overall results thus also confirm the beneficial (although limited) impact of textual information in this subtask.

Task 2 - Hate Speech Identification. The training dataset available for subtask 2 contains 800 training examples, labelled as "Hate" and "NotHate", while the test dataset counts 200 examples. In Table 2 the results obtained by UNITOR are reported, according to the same metrics adopted in Task 1. Unlike the first subtask, Hate Speech is more related to the textual information. Even the baseline is given by the performance of a classifier labelling a MEME as offensive whenever it includes at least one swear word (resulting in a system with high Precision and very low Recall).

  System       Precision   Recall     F1      Rank
  UNITOR-R2     0.7845     0.8667    0.8235    1
  UNITOR-R1     0.7686     0.8857    0.8230    2
  UPB           0.8056     0.8286    0.8169    3
  Baseline      0.8958     0.4095    0.5621    -

  Table 2: UNITOR Results in Task 2.

In this task, we adopted UmBERTo (pre-trained over Wikipedia), fine-tuned for 3 epochs over the HaSpeeDe dataset and then for 3 epochs over the DANKMEMES dataset. Again, a 10-fold cross validation schema is adopted, and the final ensemble of such UmBERTo models originated UNITOR-R1, which ranked 2nd over 5 submissions. The improvement with respect to the first competitive system confirms the robustness of the adopted Transformer-based architecture combined with the adopted auxiliary training step. We then combined this model with a CNN (here ResNET152) to also exploit visual information, as in the previous subtask. This combination originated UNITOR-R2, which again provided the best results in the competition, even though with a very small margin w.r.t. UNITOR-R1.

Task 3 - Event Clustering. The training dataset available for subtask 3 contains 800 training examples for the 5 targeted topics, and the test dataset is made of 200 examples. In Table 3 the performances of UNITOR are reported, as for the previous subtasks. Since this is a multi-class classification task, each system is evaluated with respect to each of the 5 labels in a binary setting, and the macro-average is then applied to Precision, Recall and F1. Here, the baseline is given by a classifier labelling every MEME as belonging to the most represented class (i.e., topic 0, containing miscellaneous examples). Its results, i.e., an F1 of 0.1297, suggest this is a very challenging task, where the dataset is quite limited, especially considering the overlap that exists among all the political topics.

  System       Precision   Recall     F1      Rank
  UNITOR-R1     0.2683     0.2851    0.2657    1
  UNITOR-R2     0.2096     0.2548    0.2183    2
  Baseline      0.0960     0.2000    0.1297    -

  Table 3: UNITOR Results in Task 3.

In the first row, the run UNITOR-R1 is reported: it corresponds to a model that combines the embeddings from ResNET152 and those obtained by AlBERTo, both achieving the best accuracy in our initial tuning within this subtask. UNITOR-R1 ranked first (among three submissions) in this competition with an F1 of 0.2657, which doubles the result obtained by the baseline. It must be said that the Transformer achieves significantly better results than the CNN, suggesting that the visual information is negligible in this subtask too (these results are not reported for lack of space). We thus evaluated a model which considers only text, by fine-tuning an AlBERTo model adopting the pair-based approach presented in Section 2, where each text is associated with the description of the topic. Unfortunately, this model, namely UNITOR-R2, under-performed the first submission, with an F1 of 0.2183.

Figure 1: Distribution of labels and classifications in Task 3 (number of examples per topic 0–4, for UNITOR-R1, UNITOR-R2 and the gold standard).

For an error analysis, we compared the assignments provided in the test set and the ones derived from UNITOR, as shown in Figure 1. First, it is clear that the dataset is highly unbalanced, with half of the examples assigned to the class with uncertain topics. Moreover, it can be seen that the combination of textual and visual information makes UNITOR-R1 more robust in detecting topic 2 and, most importantly, topic 1, which is ignored by UNITOR-R2. Topics 3 and 4 are ignored by UNITOR, but they are also under-represented in the training material. UNITOR-R2 seems more conservative with respect to the largest class (topic 0): it is clear that the repetition of the same topic description over many examples introduced a bias. Future work will consider the adoption of more expressive and varied topic descriptions to be paired with the texts: for example, we will select headline news retrieved through standard retrieval engines.
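The pair-based reformulation adopted by UNITOR-R2 can be sketched as follows. This is a pure-Python illustration with our own function names; the toy word-overlap scorer merely stands in for the fine-tuned AlBERTo pair classifier.

```python
# Sketch of the text-pair reformulation: each MEME text is paired with every
# candidate topic description, each pair is scored by a binary classifier,
# and the highest-scoring topic is returned.
def build_pairs(text, topic_descriptions):
    """One <text, topic definition> pair per candidate topic."""
    return [(text, desc) for desc in topic_descriptions]

def predict_topic(text, topic_descriptions, score_pair):
    """score_pair stands in for the fine-tuned BERT-style pair classifier:
    it returns the (softmax) score that the pair is a correct match."""
    scores = [score_pair(t, d) for t, d in build_pairs(text, topic_descriptions)]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy scorer: word overlap between the text and the topic description.
def toy_score(text, desc):
    return len(set(text.lower().split()) & set(desc.lower().split()))

topics = ["elezioni e voto",
          "consultazioni con i partiti",
          "discorso al senato"]
print(predict_topic("il discorso di Conte al Senato", topics, toy_score))  # 2
```

In the real system, score_pair would be the softmax output of the fine-tuned AlBERTo classifier over "[CLS] text [SEP] topic definition [SEP]" inputs.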
Querying such engines with the topic description would provide a more expressive representation of the topics.

4 Conclusions

This work presented the UNITOR system participating in the DANKMEMES task at EVALITA 2020. UNITOR merges visual and textual evidence by combining state-of-the-art deep neural architectures, and it ranked first in all the subtasks defined in the competition. These results confirm the beneficial impact of the adopted Convolutional and Transformer-based architectures in the automatic recognition of MEMEs as well as in Hate Speech Identification and Event Clustering. Future work will investigate multi-task learning approaches to combine the adopted architectures in a more principled way.

References

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In Proceedings of EVALITA 2018, Turin, Italy, December 12-13, 2018, volume 2263 of CEUR Workshop Proceedings.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL 2020, Online, July 5-10, 2020, pages 8440–8451.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR 2009.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL 2019, pages 4171–4186, Minneapolis, Minnesota, June.

R. Gomez, J. Gibert, L. Gomez, and D. Karatzas. 2020. Exploring hate speech detection in multimodal publications. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1459–1467.

Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press, Cambridge, MA, USA.

K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

L. Jiao and J. Zhao. 2019. A survey on the new generation of deep learning in image processing. IEEE Access, 7:172231–172263.

Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Martina Miliani, Giulia Giorgi, Ilir Rama, Guido Anselmi, and Gianluca E. Lebani. 2020. DANKMEMES @ EVALITA 2020: The memeing of life: memes, multimodality and politics. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Benet Oriol, Cristian Canton-Ferrer, and Xavier Giró i Nieto. 2019. Hate speech in pixels: Detection of offensive memes towards automatic moderation. In NeurIPS 2019 Workshop on AI for Social Good, Vancouver, Canada, September 2019.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy, July.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481. CEUR.

Andrew Ross and Damian J. Rivers. 2017. Digital cultures of political participation: Internet memes and the discursive delegitimization of the 2016 U.S. presidential candidates. Discourse, Context and Media, 16:1–11, January.

Limor Shifman. 2013. Memes in a digital world: Reconciling with a conceptual troublemaker. Journal of Computer-Mediated Communication, 18:362–377.
Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv e-prints, arXiv:1905.11946, May.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2016. Aggregated residual transformations for deep neural networks. arXiv e-prints, arXiv:1611.05431, November.