<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UNITOR @ DANKMEMES: Combining Convolutional Models and Transformer-based architectures for accurate MEME management</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudia Breazzano</string-name>
          <email>claudiabreazzano@outlook.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edoardo Rubino</string-name>
          <email>edoardo.ru94@libero.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Croce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Basili</string-name>
          <email>basilig@info.uniroma2.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Roma, Tor Vergata</institution>
          ,
          <addr-line>Via del Politecnico 1, Rome, 00133</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the UNITOR system that participated in the “multimoDal Artefacts recogNition Knowledge for MEMES” (DANKMEMES) task within the context of EVALITA 2020. UNITOR implements a neural model which combines a Deep Convolutional Neural Network to encode the visual information of input images and a Transformer-based architecture to encode the meaning of the attached texts. UNITOR ranked first in all subtasks, clearly confirming the robustness of the investigated neural architectures and suggesting the beneficial impact of the proposed combination strategy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
In social networks, the ways of expressing opinions
have evolved from simply writing a post to
publishing more complex content, e.g., the
composition of images and texts. These multi-modal
objects, if adhering to some specific social
conventions and visual specifications, are called MEMEs.
In particular, a MEME is a multi-modal
artifact, manipulated by users, that combines
intertextual elements to convey a message.
Characterized by a visual format that includes images,
text, or a combination of them, MEMEs combine
references to current events or related situations
and pop-cultural references to music, comics and
films
        <xref ref-type="bibr" rid="ref16">(Ross and Rivers, 2017)</xref>
        . In this context,
the multimoDal Artefacts recogNition Knowledge
for MEMES (DANKMEMES) task is the first
EVALITA
        <xref ref-type="bibr" rid="ref2">(Basile et al., 2020)</xref>
        task for MEME
recognition and hate speech/event identification in
MEMEs
        <xref ref-type="bibr" rid="ref12">(Miliani et al., 2020)</xref>
        . This task is
divided into three subtasks: in MEME Detection, a
system is required to determine whether an image
is a MEME, according to the definition of
        <xref ref-type="bibr" rid="ref17">(Shifman, 2013)</xref>
        ; in Hate Speech Identification the aim
is to recognize whether a MEME expresses an offensive
message; finally, in Event Clustering the aim is to
cluster MEMEs according to the topics they refer to.
      </p>
      <p>
        In this work, we present the UNITOR
system participating in all three subtasks. Since
MEMEs convey their content through the
multimodal combination of an image and a text,
UNITOR implements a neural network which
combines state-of-the-art architectures for Computer
Vision and Natural Language Processing. In
particular, Deep Convolutional Neural Networks,
such as
        <xref ref-type="bibr" rid="ref10 ref13 ref14 ref18 ref9">(He et al., 2016; Tan and Le, 2019)</xref>
        are used to encode visual information into dense
embeddings and Transformer-based architectures,
such as
        <xref ref-type="bibr" rid="ref11 ref6">(Devlin et al., 2019; Liu et al., 2019)</xref>
        encode the meaning of the overlaid captions.
UNITOR then stacks a multi-layered network in
order to effectively combine the evidence
captured by both encoders in the final classification.
      </p>
      <p>The UNITOR system ranked first in each
subtask, clearly confirming the robustness of the
investigated neural architectures and suggesting the
beneficial contribution of the proposed
combination strategy. In the rest of the paper,
Section 2 describes the UNITOR system, while Section 3
reports the experimental results.</p>
    </sec>
    <sec id="sec-2">
      <title>2 UNITOR Description</title>
      <p>
        CNNs for Image classification. Recent years
have demonstrated that Convolutional Neural Networks
(CNNs) are able to achieve state-of-the-art results
in image processing
        <xref ref-type="bibr" rid="ref10 ref13 ref14 ref18">(Jiao and Zhao, 2019)</xref>
        , by
implementing deep and complex stackings of
Convolutional layers, which capture different aspects of
input images at different levels of the networks.
      </p>
      <p>
        Among the investigated architectures, we first
considered ResNET
        <xref ref-type="bibr" rid="ref9">(He et al., 2016)</xref>
        : this
network was the first to introduce Residual Learning
to define very deep and effective CNNs.
Several ResNET architectures are defined by
stacking 50, 101, 152, up to 1,001
convolutional layers with skip connections: as a result, deeper
networks achieved significant improvements over the
previous state of the art in a wide range of
image processing tasks. Moreover, we investigated
the recently proposed EfficientNet
        <xref ref-type="bibr" rid="ref10 ref13 ref14 ref18">(Tan and Le,
2019)</xref>
        : unlike ResNET, this is not a single
architecture, but an automatic
methodology to improve the performance of an existing
CNN (such as ResNET) by tuning its depth, width
and resolution. The adoption of this
methodology led to the definition of 8 CNNs
(namely EfficientNET-B0, EfficientNET-B1 up to
EfficientNET-B7), each characterized by an
increasing depth and width. They achieve
impressive results by efficiently balancing the number
of parameters of the network. The tuning
process of
        <xref ref-type="bibr" rid="ref10 ref13 ref14 ref18">(Tan and Le, 2019)</xref>
        demonstrated that a
network such as EfficientNet-B3 achieves higher
accuracy than ResNeXt101
        <xref ref-type="bibr" rid="ref20">(Xie et al., 2016)</xref>
        while
using 18x fewer neural operations. Regardless of
the adopted network, all of them are pre-trained
on a classification task involving the recognition
of thousands of object types in several million
images, i.e., the ImageNet dataset
        <xref ref-type="bibr" rid="ref5">(Deng et
al., 2009)</xref>
        . This pre-training step enables the
network to recognize many “basic entities” (such as
people or animals) before being applied to a new
task, e.g., MEME Detection. The customization
to a new task is obtained just by replacing the last
classification layer with a new one (sized based
on the number of targeted classes) and by
fine-tuning the entire architecture. It is worth
noting that, once the architecture is fine-tuned on the
new down-stream task, it can also be used as an
Image Encoder: the embeddings generated at the
layer preceding the classification one can be used as
low-dimensional representations of input images.
Most importantly, these embeddings are correlated
with the down-stream task, as they are expected to
lie in linearly separable sub-spaces
        <xref ref-type="bibr" rid="ref8">(Goodfellow
et al., 2016)</xref>
        , where the final classifier is applied.
In UNITOR these vectors are used to combine
visual information with other evidence: in practice,
they will be used in combination with the
embeddings produced by the Transformer-based
architectures (applied to texts) before being fed
to the final classifier.
      </p>
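      <p>To make this encoder usage concrete, the following minimal PyTorch sketch shows how a pre-trained CNN can be customized to a new task and then used as an Image Encoder. It assumes torchvision and illustrative sizes; it is not the exact UNITOR configuration.</p>
      <preformat>import torch
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet (illustrative choice of ResNet-152).
cnn = models.resnet152(pretrained=True)

# Customization to a new task: replace the last classification layer with
# a new one sized on the number of targeted classes (e.g., MEME/NotMEME),
# then fine-tune the entire architecture.
num_classes = 2
cnn.fc = nn.Linear(cnn.fc.in_features, num_classes)

# After fine-tuning, drop the classification layer to obtain an Image
# Encoder: the activations of the preceding layer act as low-dimensional
# representations of the input images.
encoder = nn.Sequential(*list(cnn.children())[:-1])

images = torch.randn(4, 3, 224, 224)        # a dummy batch of images
embeddings = encoder(images).flatten(1)     # shape: (4, 2048)</preformat>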
      <p>
        Transformer-based Architectures for text
classification. A MEME is a combination of visual
information and the overlaid caption. In this work,
we thus also investigated classifiers based on the
text made available via OCR to the participants
by the DANKMEMES organizers. In particular,
we adopt the approach proposed in
        <xref ref-type="bibr" rid="ref6">(Devlin et al.,
2019)</xref>
        , namely Bidirectional Encoder
Representations from Transformers (BERT). It provides an
effective way to pre-train a neural network over
large-scale collections of raw texts, and apply it
to a large variety of supervised NLP tasks, here
text classification. The building block of BERT
is the Transformer element
        <xref ref-type="bibr" rid="ref19">(Vaswani et al., 2017)</xref>
        ,
an attention-based mechanism that learns
contextual relations between words in a text. The
pre-training stage is based on two auxiliary tasks,
whose aim is the acquisition of an expressive and
robust language and text model: the Masked
Language Model acquires a meaningful and
context-sensitive representation of words, while the Next
Sentence Prediction task captures discourse level
information. In particular, this last task operates
on text-pairs to capture relational information
between them, e.g. between the consecutive
sentences in a text. The straightforward application of
BERT has shown better results than previous
state-of-the-art models on a wide spectrum of natural
language processing tasks. In
        <xref ref-type="bibr" rid="ref11">(Liu et al., 2019)</xref>
        RoBERTa is proposed as a variant of BERT which
modifies some key hyperparameters, including
removing the next-sentence pre-training objective,
and training on more data, with much larger
mini-batches and learning rates. This allows RoBERTa
to improve on the masked language modelling
objective compared with BERT and leads to
better down-stream task performance. We adopt
here the fine-tuning process for sequence
classification, where sequences correspond to texts
extracted from images. The special token [CLS]
is added as the first element of each input
sequence, so that BERT associates a specific
embedding with it. This dense vector represents the entire
sentence and is used as input to a linear
classifier customized for the target classification task: in
MEME Detection and Hate Speech Identification,
two classes are considered, while in Event
Clustering five classes reflect the target topics.
During training, all the network parameters are
fine-tuned. BERT and RoBERTa are pre-trained over
text in English, and they are able to capture
language models specific for this language. In order
to apply these architectures to Italian, we
investigate several alternative models, pre-trained
using document collections in Italian or in
multiple languages. Among these models, AlBERTo
        <xref ref-type="bibr" rid="ref15">(Polignano et al., 2019)</xref>
        is a BERT-based model
pre-trained over the Twita corpus
        <xref ref-type="bibr" rid="ref1">(Basile and
Nissim, 2013)</xref>
        (made of millions of Italian tweets),
while GilBERTo (https://huggingface.co/idb-ita/gilberto-uncased-from-camembert)
and UmBERTo (https://huggingface.co/Musixmatch/umberto-wikipedia-uncased-v1) are
RoBERTa-based models pre-trained over the OSCAR corpus
and the Italian version of Wikipedia, respectively.
Among the multi-lingual models, we investigate
multilingual BERT (mBERT)
        <xref ref-type="bibr" rid="ref14">(Pires et al., 2019)</xref>
        and XLM-RoBERTa
        <xref ref-type="bibr" rid="ref4">(Conneau et al., 2020)</xref>
        which
extends the corresponding pre-training over texts
in more than 100 languages.
      </p>
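      <p>A minimal sketch of this fine-tuning setting for sequence classification, assuming the HuggingFace transformers library and the UmBERTo checkpoint mentioned above; the input text is purely illustrative.</p>
      <preformat>import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Musixmatch/umberto-wikipedia-uncased-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# A linear classifier is stacked on top of the sentence embedding;
# num_labels=2 fits MEME Detection or Hate Speech Identification.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

# The tokenizer prepends the special [CLS] token (or its equivalent):
# its embedding represents the whole sequence and feeds the classifier.
batch = tokenizer(["testo estratto dal MEME"],
                  padding=True, truncation=True, return_tensors="pt")
logits = model(**batch).logits              # shape: (1, 2)</preformat>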
      <p>
        Regardless of the adopted Transformer-based
architecture, we also investigated the adoption
of additional annotated material to support the
training of complex networks over very short
texts extracted from MEMEs. In particular, in
Hate Speech Identification, we used an external
dataset which addressed the same task, but from
a different source. We thus adopted a dataset
made available within the Hate Speech Detection
(HaSpeeDe) task
        <xref ref-type="bibr" rid="ref3">(Bosco et al., 2018)</xref>
        which
involves the automatic recognition of hateful
contents in Twitter (HaSpeeDe-TW) and Facebook
posts (HaSpeeDe-FB). Each investigated
architecture is first trained for a few epochs on the
HaSpeeDe dataset before the real training is
applied to the DANKMEMES material. In this
way, the neural model, which is not specifically
pre-trained to detect hate speech, is expected to
improve its “expertise” in handling such a
phenomenon (even though using material derived
from a different source) before being specialized
on the final DANKMEMES task. An alternative
approach consists in adding the messages from
HaSpeeDe to the training set: this approach led to
lower results, not reported here due to lack of space.
      </p>
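      <p>This auxiliary training step can be sketched as two sequential fine-tuning stages, under the assumption of standard PyTorch data loaders yielding tokenized batches with gold labels (haspeede_fb_loader and dankmemes_loader are hypothetical names).</p>
      <preformat>from torch.optim import AdamW

def fine_tune(model, loader, epochs, lr=2e-5):
    # Standard fine-tuning loop: the model returns the cross-entropy
    # loss when the batch contains the gold labels.
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# First a few epochs on the external hate-speech material, ...
fine_tune(model, haspeede_fb_loader, epochs=3)
# ... then the "real" training on the DANKMEMES dataset.
fine_tune(model, dankmemes_loader, epochs=3)</preformat>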
      <p>We trained UmBERTo both on HaSpeeDe-TW
and on HaSpeeDe-FB, as well as on their union.
Initial experiments suggested that higher
accuracy is achieved by only considering the
material from Facebook (HaSpeeDe-FB). We suppose
this is mainly due to the fact that messages from
HaSpeeDe-FB and DANKMEMES share
similar political topics. As for a CNN, once the
Transformer-based architecture is fine-tuned on
the new task, it can be used as a text encoder, by
removing the final linear classifier and selecting the
embedding associated to the [CLS] token. These
vectors will be used in UNITOR in combination
with the embeddings derived from the CNN
architecture, as described hereafter.</p>
    </sec>
    <sec id="sec-3">
      <title>Combining visual and semantic evidences</title>
      <p>
        UNITOR adopts an approach similar to the
Feature Concatenation Model (FCM) already seen in
        <xref ref-type="bibr" rid="ref13 ref7">(Oriol et al., 2019; Gomez et al., 2020)</xref>
        to combine
visual and textual information. For each subtask,
the specific CNN achieving best results on the
development set is selected, among the investigated
ones. The same happens for the
Transformer-based architectures. Once the “best”
architectures are selected and fine-tuned for visual and
textual analysis, they are used to encode the
entire dataset. This allows training a new classifier
which accounts for the evidence from both
aspects. In UNITOR these encodings are
concatenated, so that the final classifier is a Multi-Layer
Perceptron (we also investigated more complex
combinations, such as the weighted sum or the
point-wise product of embeddings, but they led to
lower results). Only this final classifier is trained,
as the remaining parameters are supposed to be
already optimized for the task. Future work will
consider the fine-tuning of all the parameters of
this combined network, here avoided due to the (too)
high computational cost required by this more
elegant approach. It must be said that other
information is available in the competition: for
example, each MEME was supported with its
publication date or the list of politicians appearing in the
picture. We investigated the manual definition of
feature vectors to be added in the concatenation
described above. Unfortunately, these vectors did
not provide any significant impact during our
experiments, so we only relied on visual and textual
information. We suppose this additional
information is too sparse (given the dataset size) to
provide any valuable evidence.
      </p>
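      <p>A minimal sketch of this Feature Concatenation strategy: the (frozen) fine-tuned encoders produce the two embeddings, which are concatenated and classified by an MLP, the only trained component. Dimensions are illustrative.</p>
      <preformat>import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_emb, txt_emb):
        # Concatenate visual and textual evidence, then classify.
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))

classifier = FusionMLP()
logits = classifier(torch.randn(4, 2048), torch.randn(4, 768))</preformat>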
      <p>Modelling Event Clustering as a Classification
task. While Event Clustering may suggest a
straightforward application of unsupervised
algorithms, we adopted a supervised setting, by
imposing the hypothesis that train and test datasets
share the same topics. We modelled this subtask
as a classification problem, where each MEME is
to be assigned to one of the five classes reflecting
the underlying topic. UNITOR implements two
different approaches. In a first model, the same
setting adopted in the other subtasks is used: a
CNN and a Transformer-based architecture are optimized on
Task 3 and used as encoders to train the final
MLP classifier. Unfortunately, most of the texts
are too short to be valuable in the final
classification. We thus adopted a second model which
is inspired by the capability of BERT-based
models to effectively operate over text pairs,
achieving state-of-the-art results in tasks such as
Textual Entailment and Natural Language Inference
        <xref ref-type="bibr" rid="ref6">(Devlin et al., 2019)</xref>
        . In this second
setting, each input MEME generates five pairs (one
for each topic) in the form ⟨topic
definition, text⟩. Let us consider the example “ma
come chi sono? presidé so’ io senza fotoscioppe!”
(in simplified English: “Are you seriously asking who I
am? Mr President, it’s me without Photoshop effects!”),
associated with topic #2, defined as “L’inizio
delle consultazioni con i partiti politici e il
discorso al Senato di Conte” (“The beginning of the
consultations with the political parties and Conte’s
speech to the Senate”). It generates a new
input in the form “[CLS] ma come chi …
fotoscioppe! [SEP] L’inizio delle … Senato di Conte.
[SEP]”, which defines a sentence pair in
BERT-like architectures. The same approach is applied
with respect to each topic. In other words, the
original classification problem over five classes is
mapped to a binary classification one: each pair is
a positive example when the text is associated with
the correct topic, negative otherwise. In this way,
we expected to detect a possible “semantic
connection” between the extracted text and the paired
(correct topic) description. At classification time,
for each MEME, five new examples are derived
(one per topic) and classified. The one generated
by the topic receiving the highest softmax score is
selected as output.
      </p>
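      <p>At classification time, the pair-based model can be sketched as follows: the text is paired with all five topic definitions, each pair is scored by the binary classifier, and the topic receiving the highest positive softmax score is returned. The fine-tuned model and tokenizer are assumed to be available.</p>
      <preformat>import torch

def predict_topic(text, topic_definitions, model, tokenizer):
    # Build the five [CLS] text [SEP] topic-definition [SEP] pairs.
    batch = tokenizer([text] * len(topic_definitions),
                      topic_definitions,
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits            # shape: (5, 2)
    scores = torch.softmax(logits, dim=-1)[:, 1]  # positive-class scores
    return int(scores.argmax())                   # index of the chosen topic</preformat>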
    </sec>
    <sec id="sec-4">
      <title>3 Experimental evaluation and results</title>
      <p>UNITOR participated in all subtasks within
DANKMEMES. For parameter tuning, we
adopted a 10-fold cross validation, so that the
training material is divided into 10 folds, each split
according to a 90%-10% proportion. The model is
trained using a standard Cross-entropy Loss and
an ADAM optimizer initialized with a learning
rate set to 2e-5. We trained the model for 5
epochs, using a batch size of 32 elements. When
combining the networks, the number of hidden
layers in the MLP classifier is tuned between 1 and
3. At test time, for each task, an Ensemble of such
classifiers is used: each image is in fact classified
using all 10 models trained on the different folds
and the label suggested by the highest number
of classifiers is selected. UNITOR is implemented
using PyTorch (https://pytorch.org/).</p>
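      <p>The Ensemble step amounts to simple majority voting over the 10 fold-specific classifiers, as in the following sketch (predict is a hypothetical helper returning the label assigned by one model to one example).</p>
      <preformat>from collections import Counter

def ensemble_predict(models, example):
    # Each of the 10 fold-specific models casts one vote; the label
    # suggested by the highest number of classifiers is selected.
    votes = [predict(m, example) for m in models]
    return Counter(votes).most_common(1)[0][0]</preformat>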
      <p>[Table 1: Task 1 results in terms of Precision, Recall and F1 for UNITOR-R2, SNK-R1, UNITOR-R1 and the Baseline.]</p>
    </sec>
    <sec id="sec-5">
      <title>Task 1 - MEME Detection</title>
      <p>For subtask 1, the training dataset counts 1,600 examples, equally
labelled as “MEME” and “NotMEME”. Results of
UNITOR are reported in Table 1, where results are
evaluated in terms of Precision, Recall and
F1-measure, calculated over the binary classification
task (the latter used to rank systems). The last
row reports a baseline model which randomly
assigns labels to images. MEMEs generally adhere
to specific visual conventions, where the
meaning of text is secondary: as a consequence, our
first model (UNITOR-R1) only relies on an image
classifier. In particular, it corresponds to the
fine-tuning of EfficientNet-B3 over the official dataset.
In order to improve the robustness of such a CNN,
we adopted a simple data augmentation technique,
by duplicating the training material and
horizontally mirroring it. UNITOR-R1 ranked forth (over
10 submissions) in the competition. This clearly
confirms the effectiveness of EfficientNet,
combined with the adopted Ensemble technique. We
also investigated larger variants of EfficientNet but
they did not outperform the B3 variant: we
suppose these larger architectures are more exposed
to over-fitting, also considering the dataset size.</p>
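      <p>The augmentation described above amounts to duplicating the training images and horizontally mirroring the copies, e.g. with torchvision (a sketch; images are assumed to be PIL images or tensors):</p>
      <preformat>from torchvision.transforms import functional as F

def augment(images):
    # Duplicate the training material and horizontally mirror the copy.
    mirrored = [F.hflip(img) for img in images]
    return images + mirrored</preformat>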
      <p>Moreover, we adopted a model that combines
the output of EfficientNet-B3 with a
Transformer-based architecture. Among all the investigated
architectures, AlBERTo achieved the highest
classification accuracy. Once tuned (in the same 10-fold
cross validation schema), it is used to encode the
entire dataset and the embeddings are concatenated
to the ones from EfficientNet-B3. This enables the
training of 10 MLPs (one per fold) whose
Ensemble defines UNITOR-R2, which ranked first in the
task, with an F1 of 0.8501. The overall results thus
also confirm the beneficial (although limited)
impact of textual information in this subtask.</p>
    </sec>
    <sec id="sec-6">
      <title>Task 2 - Hate Speech Identification</title>
      <p>The training dataset available for subtask 2 contains
800 training examples, labelled as “Hate” and
“NotHate”, while the test dataset counts 200
examples. In Table 2 the results obtained by
UNITOR are reported, according to the same metrics
adopted in Task 1. Unlike the first subtask, Hate
Speech is more related to the textual information.
Even the baseline is given by the performance of
a classifier labelling a MEME as offensive
whenever it includes at least a swear word (resulting in
a system with a high Precision and a very low
Recall).</p>
      <p>[Table 2: Task 2 results in terms of Precision, Recall and F1 for UNITOR-R2, UNITOR-R1, UPB and the Baseline.]</p>
      <p>In this task, we adopted UmBERTo (pre-trained
over Wikipedia), fine-tuned for 3 epochs over the
HaSpeeDe dataset and then for 3 epochs over
the DANKMEMES dataset. Again, a 10-fold
cross validation schema is adopted and the final ensemble
of such UmBERTo models originated
UNITOR-R1, which ranked 2nd out of 5 submissions. The
improvement with respect to the other
competing systems confirms the robustness of the adopted
Transformer-based architecture combined with the
adopted auxiliary training step. We thus combined
this model with a CNN (here ResNET152) to
exploit also visual information as for the previous
subtask. This combination originated
UNITOR-R2, which again provided the best results in the
competition, even though only a very small margin is
obtained w.r.t. UNITOR-R1.</p>
      <p>Task 3 - Event Clustering. The training dataset
available for subtask 3 contains 800 training
examples for the 5 targeted topics and a test dataset
made of 200 examples. In Table 3 the
performance of UNITOR is reported, as for the
previous subtask. Since it is a multi-class
classification task, each system is evaluated with respect to
each of the 5 labels in a binary setting and then
the macro-average is applied to Precision, Recall
and F1. Here, the baseline is given by a classifier
labelling every MEME as belonging to the most
represented class (i.e. topic 0, containing
miscellaneous examples). Its results, i.e., an F1 of 0.1297,
suggest this is a very challenging task, where the
dataset is quite limited, especially considering the
overlap that exists among all political topics. In
the first row, the run UNITOR-R1 is reported: it
corresponds to a model that combines the
embeddings from ResNET152 and those obtained by
AlBERTo, both achieving best accuracy in our initial
tuning within this subtask. UNITOR-R1 ranked
first (among three submissions) in this
competition with an F1 of 0.2657, which doubles the result
obtained by the baseline. It must be said that the
Transformer achieves significantly better results
with respect to the CNN, suggesting that the
visual information is negligible also in this subtask
(these results are not reported for lack of space).
We thus evaluated a model which considers only
text, by fine-tuning an AlBERTo model adopting
the pair-based approach presented in Section 2,
where each text is associated with the description
of the topic. Unfortunately, this model, namely
UNITOR-R2, under-performed the first
submission, with an F1 of 0.2183.</p>
      <p>[Figure 1: Number of MEMEs assigned to each topic by UNITOR-R1, UNITOR-R2 and the Gold Standard.]</p>
      <preformat>Table 3 - Task 3 results (macro-averaged):
System      Precision   Recall   F1
UNITOR-R1   0.2683      0.2851   0.2657
UNITOR-R2   0.2096      0.2548   0.2183
Baseline    0.0960      0.2000   0.1297</preformat>
      <p>For an error analysis, we compared the
assignments provided in the test set and the ones derived
from UNITOR, as shown in Figure 1. First, it is
clear that the dataset is highly unbalanced, with
half of the examples assigned to the class with
uncertain topics. Moreover, it can be seen that
the combination of textual and visual information
makes UNITOR-R1 more robust in detecting topic
2, and most importantly, topic 1, which is ignored
by UNITOR-R2. Topics 3 and 4 are ignored
by UNITOR but they are also under-represented
in the training material. UNITOR-R2 seems more
conservative with respect to the largest class (topic
0): it is clear that the repetition of the same topic
over many examples introduced a bias. Future
work will consider the adoption of more
expressive and varied topic descriptions to be paired with
texts: for example, we will select headline news
that can be retrieved using Retrieval Engines (e.g.,
by querying with the topic description) to have a
more expressive representation of the topics.</p>
    </sec>
    <sec id="sec-7">
      <title>4 Conclusions</title>
      <p>This work presented the UNITOR system
participating in the DANKMEMES task at EVALITA 2020.
UNITOR merges visual and textual evidence by
combining state-of-the-art deep neural
architectures and ranked first in all subtasks defined in
the competition. These results confirm the
beneficial impact of the adopted Convolutional and
Transformer-based architecture in the automatic
recognition of MEMEs as well as in Hate Speech
Identification or Event Clustering. Future work
will investigate multi-task learning approaches to
combine the adopted architectures in a more
principled way.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Sentiment analysis on italian tweets</article-title>
          .
          <source>In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</source>
          , pages
          <fpage>100</fpage>
          -
          <lpage>107</lpage>
          , Atlanta.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Danilo Croce, Maria Di Maro, and
          <string-name>
            <given-names>Lucia C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for italian</article-title>
          .
          <source>In Valerio Basile</source>
          , Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online</article-title>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and
          <string-name>
            <given-names>Maurizio</given-names>
            <surname>Tesconi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 hate speech detection task</article-title>
          .
          <source>In Proceedings of EVALITA</source>
          <year>2018</year>
          , Turin, Italy,
          <source>December 12-13</source>
          ,
          <year>2018</year>
          , volume
          <volume>2263</volume>
          <source>of CEUR Workshop Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Conneau</surname>
          </string-name>
          , Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          .
          <source>In Proceedings of ACL 2020, Online, July</source>
          <volume>5</volume>
          -
          <issue>10</issue>
          ,
          <year>2020</year>
          , pages
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>In CVPR09.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of NAACL 2019</source>
          , pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota, June.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Exploring hate speech detection in multimodal publications</article-title>
          .
          <source>In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)</source>
          , pages
          <fpage>1459</fpage>
          -
          <lpage>1467</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Ian J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Yoshua Bengio, and
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Courville</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep Learning</article-title>
          . MIT Press, Cambridge, MA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiao</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A survey on the new generation of deep learning in image processing</article-title>
          .
          <source>IEEE Access</source>
          ,
          <volume>7</volume>
          :
          <fpage>172231</fpage>
          -
          <lpage>172263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          . ArXiv, abs/1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Martina</given-names>
            <surname>Miliani</surname>
          </string-name>
          , Giulia Giorgi, Ilir Rama, Guido Anselmi, and
          <string-name>
            <given-names>Gianluca E.</given-names>
            <surname>Lebani</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>DANKMEMES @ EVALITA2020: The memeing of life: memes, multimodality and politics</article-title>
          .
          <source>In Valerio Basile</source>
          , Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online</article-title>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Benet</given-names>
            <surname>Oriol</surname>
          </string-name>
          ,
          Cristian Canton-Ferrer, and Xavier Giró i Nieto
          .
          <year>2019</year>
          .
          <article-title>Hate speech in pixels: Detection of offensive memes towards automatic moderation</article-title>
          .
          <source>In NeurIPS 2019 Workshop on AI for Social Good</source>
          , Vancouver, Canada, September 2019.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Telmo</given-names>
            <surname>Pires</surname>
          </string-name>
          , Eva Schlinger, and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Garrette</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>How multilingual is multilingual BERT?</article-title>
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>4996</fpage>
          -
          <lpage>5001</lpage>
          , Florence, Italy, July.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Polignano</surname>
          </string-name>
          , Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets</article-title>
          .
          <source>In Proceedings of the Sixth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2019</year>
          ), volume
          <volume>2481</volume>
          . CEUR.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Ross</surname>
          </string-name>
          and
          <string-name>
            <given-names>Damian J.</given-names>
            <surname>Rivers</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Digital cultures of political participation: Internet memes and the discursive delegitimization of the 2016 U.S. presidential candidates</article-title>
          .
          <source>Discourse, Context and Media</source>
          ,
          <volume>16</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Limor</given-names>
            <surname>Shifman</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Memes in a digital world: Reconciling with a conceptual troublemaker</article-title>
          .
          <source>J. Comput. Mediat. Commun.</source>
          ,
          <volume>18</volume>
          :
          <fpage>362</fpage>
          -
          <lpage>377</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Tan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</article-title>
          . arXiv e-prints, arXiv:1905.11946, May.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          Łukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          , pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Saining</given-names>
            <surname>Xie</surname>
          </string-name>
          , Ross Girshick, Piotr Dollár, Zhuowen Tu, and
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Aggregated Residual Transformations for Deep Neural Networks</article-title>
          . arXiv e-prints, arXiv:1611.05431, November.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>