<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ArchiMeDe @ DANKMEMES: A New Model Architecture for Meme Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jinen Setpal</string-name>
          <email>jinen.setpal@rnpodarschool.com</email>
          <email>jinens8@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Sarti</string-name>
          <email>gsarti@sissa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Geosciences, University of Trieste &amp; SISSA</institution>
          ,
          <addr-line>Trieste</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RN Podar School</institution>
          ,
          <addr-line>Mumbai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>English. We introduce ArchiMeDe, a multimodal neural network-based architecture used to solve the DANKMEMES meme detection subtask at the 2020 EVALITA campaign. The system incorporates information from visual and textual sources through a multimodal neural ensemble to predict whether input images and their respective metadata are memes or not. Each pre-trained neural network in the ensemble is first fine-tuned individually on the training dataset to perform domain adaptation. Learned text and visual representations are then concatenated to obtain a single multimodal embedding, and the final prediction is performed through majority voting by all networks in the ensemble.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>Italian. We present ArchiMeDe, a multimodal
neural network-based architecture for solving the
"meme detection" subtask of DANKMEMES at
EVALITA 2020. The system combines visual and
textual information through a multimodal ensemble
of neural networks to predict whether images and
their respective metadata correspond to memes or
not. Each pre-trained neural network in the
ensemble is first adapted to the specific domain of
the training dataset. The image and text
representations of each network are then concatenated
into a single multimodal embedding, and the final
prediction is made through a majority vote cast by
all networks in the ensemble.</p>
      <p>Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>
        In recent years, the democratization of data
collection procedures through web scraping and
crowdsourcing has led to the broad availability of
public datasets spanning modalities like language and
vision. Contemporary state-of-the-art machine
learning models can leverage those resources to
achieve highly accurate and often superhuman
performances using millions or even billions of
parameters
        <xref ref-type="bibr" rid="ref2">(Brown et al., 2020)</xref>
        , but are heavily
reliant on an abundance of computational resources
to work properly. Consequently, training such
architectures is often inaccessible to smaller
research centers – let alone individual users. To
counter this tendency, the availability of
pretrained open-source models has dramatically
reduced the computational threshold required to
obtain state-of-the-art results in multiple languages
and vision tasks
        <xref ref-type="bibr" rid="ref4 ref6">(Devlin et al., 2019; He et al.,
2016)</xref>
        . Pre-trained systems are often leveraged in a
two-step framework: first, they undergo an
unsupervised or semi-supervised pre-training to learn
general knowledge representations, then they are
fine-tuned in a supervised way to adapt their
parameters in the context of downstream tasks. This
transfer learning approach stems from the
computer vision literature
        <xref ref-type="bibr" rid="ref7">(He et al., 2019)</xref>
        but has
been recently adopted for natural language
processing tasks with positive results
        <xref ref-type="bibr" rid="ref13 ref4 ref8">(Howard and
Ruder, 2018; Devlin et al., 2019; Liu et al., 2019)</xref>
        .
      </p>
      <p>
        In this paper, we present ArchiMeDe, a
multimodal system leveraging pre-trained
language and vision models to compete in the
DANKMEMES
        <xref ref-type="bibr" rid="ref16">(Miliani et al., 2020)</xref>
        shared task
at the EVALITA 2020 campaign
        <xref ref-type="bibr" rid="ref1">(Basile et al.,
2020)</xref>
        . Following recent transfer learning
approaches, our system leverages pre-trained visual
and word embeddings in a multimodal setup,
obtaining strong results on the meme detection
subtask. Specifically, we participated in the first
subtask of DANKMEMES, aimed at discriminating
memes from standard images containing actors
from the Italian political scene.
      </p>
      <p>[Figure 1: Overview of the ArchiMeDe architecture. Text and images are encoded into sentence and image embeddings, concatenated with raw metadata, and passed through dense layers; the final "Is this a meme?" prediction is obtained by majority vote.]</p>
      <p>Task organizers
extracted a total of 1600 training images from the
Instagram platform, and data available from each
dataset entry – text, actors and user engagement,
among others – were leveraged to train an
ensemble of multimodal models performing meme
detection through majority-vote. The following
sections present our approach in detail, first showing
our preliminary evaluation of multiple modeling
approaches and then focusing on the final system’s
main modules and the features we leverage from
the dataset. Finally, results are presented, and
we conclude by discussing the problems we faced
with some inconsistencies in the data. Our code
is made available at
https://github.com/jinensetpal/ArchiMeDe.</p>
    </sec>
    <sec id="sec-3">
      <title>2 System Description</title>
      <p>
        ArchiMeDe is composed of a multimodal
learning ensemble, with the final output being the
result of a majority vote. Figure 1 visualizes our
approach. First, the transcript associated with
each image is fed to an UmBERTo
        <xref ref-type="bibr" rid="ref5">(Francia et al.,
2020)</xref>
        neural language model (NLM) pre-trained
on the Italian language to produce sentence
embeddings. Then, we leverage three popular
pretrained vision architectures, namely ResNet
        <xref ref-type="bibr" rid="ref6">(He
et al., 2016)</xref>
        , DenseNet
        <xref ref-type="bibr" rid="ref10 ref9">(Huang et al., 2017a)</xref>
        and
AlexNet
        <xref ref-type="bibr" rid="ref12">(Krizhevsky et al., 2017)</xref>
        , to produce
three independent image embeddings for each
input image. These embeddings can be considered
as different views over an image that may
provide us with complementary information about its
content. Then, each image embedding is
concatenated with the sentence embedding and the raw
image metadata and fed as input to an 8-layer
feed-forward neural network to predict an image’s
meme status. The feed-forward network also
includes a single dropout layer to prevent overfitting
and improve generalization. Lastly, the three
predictions are weighted through majority voting to
obtain the final prediction of the ensemble. Other
simpler strategies using a single vision model to
produce image embeddings were initially
envisaged as potential candidates for our submission
but were finally dismissed in light of the
promising performances of the ArchiMeDe ensembling
approach. We discuss those perspectives in
Section 4.
      </p>
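      <p>To make the data flow concrete, the following is a minimal PyTorch sketch of one ensemble member and of the majority vote; the layer widths, metadata dimension, and dropout position are illustrative assumptions rather than the exact released configuration.</p>
      <preformat>
import torch
import torch.nn as nn

class MemeHead(nn.Module):
    """One ensemble member: concatenates one image embedding with the
    sentence embedding and raw metadata, then scores the result with an
    8-layer feed-forward network (hypothetical widths)."""
    def __init__(self, img_dim, text_dim=768, meta_dim=10):
        super().__init__()
        dims = [img_dim + text_dim + meta_dim, 512, 256, 128, 64, 32, 16, 8]
        layers = []
        for d_in, d_out in zip(dims, dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers.insert(2, nn.Dropout(0.5))       # the single dropout layer
        layers.append(nn.Linear(dims[-1], 1))   # 8 linear layers in total
        self.net = nn.Sequential(*layers)

    def forward(self, img_emb, text_emb, meta):
        x = torch.cat([img_emb, text_emb, meta], dim=-1)
        return torch.sigmoid(self.net(x))       # probability of "meme"

def majority_vote(probs):
    """probs: list of three (N, 1) probability tensors, one per head."""
    votes = torch.stack([(p > 0.5).long() for p in probs])
    return (votes.sum(dim=0) >= 2).long()       # meme if 2 of 3 heads agree
      </preformat>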
      <p>The remaining part of this section contains an
in-depth description of our ensemble’s
components, focusing on the input features that were
used and how those were preprocessed to best
suit learning. Moreover, we also include
transfer learning specifications with some details about
their impact on the overall system accuracy.
</p>
      <sec id="sec-3-1">
        <title>2.1 Metadata</title>
        <p>Engagement User engagement per post is
expressed as a numeric integer value. We scale and
standardize engagement values to obtain a
distribution centered at 0 with standard deviation
σ = 1. This procedure is standard practice to avoid
passing extreme absolute values as inputs to the
neural network.
Date We decided to leverage temporal
information in our system, building upon the intuition
that memes often rely on a small set of templates
that undergo a significant variation in popularity
through time. Temporal information may thus
provide our system with additional cues about an
image’s meme status in a specific time-frame. In the
training dataset, the date of each post is given in
the yyyy-mm-dd format. Each date is compared with
a fixed reference date, 1 January 2015, to derive
the number of days elapsed since that reference.
Min-max scaling is then applied to these values,
yielding float values in the range [0, 1] that are
subsequently fed into each training model.</p>
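        <p>As a hedged sketch of the two transformations just described, assuming the dataset is loaded as a pandas DataFrame with hypothetical engagement and date columns:</p>
        <preformat>
import pandas as pd

REFERENCE = pd.Timestamp("2015-01-01")   # fixed reference date

def preprocess_metadata(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Engagement: standardize to zero mean and unit standard deviation.
    eng = df["engagement"].astype(float)
    out["engagement"] = (eng - eng.mean()) / eng.std()
    # Date: days elapsed since 1 January 2015, min-max scaled to [0, 1].
    days = (pd.to_datetime(df["date"]) - REFERENCE).dt.days
    out["date"] = (days - days.min()) / (days.max() - days.min())
    return out
        </preformat>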
        <p>Manipulation The manipulation field provides
boolean information about whether an image
has been manipulated before being added to the
dataset. We found this information noisy and a
weak predictor of meme status; therefore, it was
dropped as input.</p>
        <p>Visual Actors Each entry was additionally
provided with a list of names of the visual actors
present in the frame. In the specific case of
the DANKMEMES shared task, visual actors can
be especially useful to identify meme images.
For example, we can hypothesize that politicians
who maintain a strong public presence by making
claims that produce a high level of public
engagement are more likely to be the subject of meme
images. Moreover, some combinations of actors may
be particularly likely in memes, e.g. politicians
belonging to parties at opposite ends of the
political compass. In order to produce a unified
representation of visual actors for our system, we perform a
one-hot encoding of all the actors occurring in the
training set: if a specific politician is present in an
image, the corresponding entry is true; conversely,
if no such actor is present, the binary field is set to
false. Actors that were not present in the training
set are disregarded during evaluation: while this
step is required given the context, we assume that
this may significantly impact the outcome in
images for which new actors were introduced.
</p>
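        <p>A minimal sketch of this encoding (the construction of the training-set actor index is assumed):</p>
        <preformat>
def encode_actors(post_actors, train_actor_index):
    """train_actor_index: {actor_name: position}, built from the
    training set only; actors unseen at training time are skipped."""
    vec = [0] * len(train_actor_index)
    for actor in post_actors:
        if actor in train_actor_index:
            vec[train_actor_index[actor]] = 1
    return vec
        </preformat>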
      </sec>
      <sec id="sec-3-2">
        <title>2.2 Textual input</title>
        <p>
          The analysis of textual content in meme images is
critical to the success of the overall system.
Indeed, ironic or satirical comments may deeply
affect the users’ interpretation of an image that
would otherwise be classified as normal. We
note that this problem cannot be approached
similarly to standard textual analytic frameworks since
memes are expressed in short, concise phrases and
do not necessarily comply with standard
grammatical rules. They also tend to contain slang
and vernacular expressions, which, albeit
conveying the intended meaning to the reader, greatly
increase the need for high model capacity and
ad-hoc training data. For this reason, we selected
UmBERTo
          <xref ref-type="bibr" rid="ref5">(Francia et al., 2020)</xref>
          , a
RoBERTa-based
          <xref ref-type="bibr" rid="ref13">(Liu et al., 2019)</xref>
          neural language model
pre-trained on Italian texts extracted from the
OSCAR corpus
          <xref ref-type="bibr" rid="ref17">(Ortiz Sua´rez et al., 2020)</xref>
          , for
producing text representations.1 In a recent study by
Miaschi et al. (2020), the model was highlighted
as one of the top Italian NLMs for encoding
linguistic information about social media excerpts
taken from the TWITTIRÒ and PoSTWITA
Twitter corpora
          <xref ref-type="bibr" rid="ref19 ref3">(Cignarella et al., 2019; Sanguinetti et
al., 2018)</xref>
          . UmBERTo has a high model capacity
with 125M trainable parameters and was trained
on online crawled data, making it suitable for
processing meme language.
        </p>
        <p>
          SentenceTransformers We use the SentenceTransformers framework
          <xref ref-type="bibr" rid="ref18">(Reimers and Gurevych, 2019)</xref>
          to produce sentence embeddings by
averaging all word embeddings produced by the
original UmBERTo model since Miaschi and
Dell’Orletta (2020) showed that those are usually
much more informative than the default [CLS]
sentence embedding. We fine-tune representations
over the available meme textual data and use them
as components of our end-to-end system.
        </p>
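        <p>As a sketch of this setup with the SentenceTransformers API; the fine-tuning loop is omitted, and the organization prefix of the hub identifier is an assumption:</p>
        <preformat>
from sentence_transformers import SentenceTransformer, models

# UmBERTo as the word-embedding module (checkpoint from the footnote).
word = models.Transformer("Musixmatch/umberto-commoncrawl-cased-v1")
# Mean pooling: average all word embeddings instead of taking [CLS].
pool = models.Pooling(word.get_word_embedding_dimension(),
                      pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word, pool])
embeddings = model.encode(["Testo estratto dal meme"])  # shape (1, 768)
        </preformat>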
        <p>
          1: umberto-commoncrawl-cased-v1 in the
HuggingFace model hub
          <xref ref-type="bibr" rid="ref20">(Wolf et al., 2019)</xref>
          .
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>2.3 Visual input</title>
        <p>
          While we have so far discussed only using
metadata and text to predict our results, it is essential to
address the core of a meme: the image itself. We
can internally distinguish a meme from a
standard image through the aforementioned broken
sentence structure, meme templates, and quick and
messy edits, among other aspects. As previously
mentioned, memes can be very difficult to identify
when they look like standard images but
gain meme status through real-world knowledge
grounding.
        </p>
        <p>
          Due to the inherently large variance in meme
images’ styles and contents, it is impractical to
expect a single framework to effectively describe
each distinguishable feature and utilize it to
classify an entry. Hence, we split the representational
burden across multiple pre-trained model
architectures. Each of them uses a fundamentally
different approach to extract image embeddings,
making the resulting ensemble predictions more
flexible in general settings. The three networks we
used for producing image embeddings are:
ResNet Residual Networks, or ResNets
          <xref ref-type="bibr" rid="ref6">(He et
al., 2016)</xref>
          , learn residual functions in relation to
layer inputs. If H(x) is the standard underlying
target mapping, ResNet layers are instead trained
to fit another mapping F(x) = H(x) − x. The
original mapping is thus recast as F(x) + x. This
approach makes the optimization process easier,
allowing for deeper architectures. The default
vector representation provided by task organizers is
produced by a ResNet-50, a residual network with
fifty layers. We use those image embeddings
of size 2048 without further adjustments.
AlexNet AlexNet
          <xref ref-type="bibr" rid="ref12">(Krizhevsky et al., 2017)</xref>
          is a
vision architecture built with 5 layers of
convolution and 3 fully-connected layers. AlexNet
specializes in identifying depth; the network
architecture effectively classifies objects such as
keyboards and a large subset of animals. This fact
makes AlexNet embeddings good predictors for
features such as depth that are generally
problematic in memes due to image subsections (e.g. text
boxes). We use an embedding size of 4096 in the
context of our experiments.
        </p>
        <p>
          DenseNet Pre-trained models such as ResNet
and AlexNet use a large number of hidden
layers. While the increase in depth allows
for better feature abstraction, it often leads
to vanishing-gradient problems during training.
DenseNet
          <xref ref-type="bibr" rid="ref10 ref9">(Huang et al., 2017b)</xref>
          introduces dense
blocks where the feature-maps of all preceding
layers are used as inputs to the layer, and its
feature-maps are used as inputs into all subsequent
layers. This approach encourages feature reuse
and may lead to more generalizable image
embeddings. Each DenseNet image embedding has
1000 dimensions.
        </p>
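        <p>A hedged sketch of how the three embeddings can be extracted with torchvision's pre-trained models; the DenseNet variant (densenet121 here) and the preprocessing pipeline are assumptions:</p>
        <preformat>
import torch
import torch.nn as nn
import torchvision.models as tvm

resnet = tvm.resnet50(pretrained=True)
resnet.fc = nn.Identity()                     # expose 2048-dim features
alexnet = tvm.alexnet(pretrained=True)
alexnet.classifier = alexnet.classifier[:-1]  # keep the 4096-dim layer
densenet = tvm.densenet121(pretrained=True)   # 1000-dim output, used as-is

@torch.no_grad()
def embed(batch):
    """batch: normalized image tensor of shape (N, 3, 224, 224)."""
    for m in (resnet, alexnet, densenet):
        m.eval()
    return resnet(batch), alexnet(batch), densenet(batch)
        </preformat>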
        <p>The aim of using multiple vector embeddings
was to cumulatively cover a significant portion of
possible meme combinations and templates. As a
result, in Section 4 we show how the ensemble of
systems using different image embeddings leads
to significant increases in validation accuracy.
</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Results</title>
      <p>Table 1 presents the system ranking for the meme
detection subtask. Our system placed 7th in terms
of F1 score,2 impeded primarily by inconsistent
recall performances but significantly better than
the random baseline (+0.2466 F1).</p>
      <p>Results suggest that ArchiMeDe has developed
inductive biases for specific image features that
strongly influence the classification outcome. By
inspecting validation folds over training data, we
observe that most false negatives produced by the
system involve distinct facial characteristics of
scene actors. Inversely, ArchiMeDe effectively
classifies images containing text bubbles and
evident manual edits. Another notable failure case
we identified is due to face-swapping. This failure
is especially relevant since face-swapping is
commonly used to add an ironic component to meme
images, but it is hardly detectable due to missing
real-world context.</p>
      <p>2: The F1 score is the harmonic mean between precision
and recall, commonly used to evaluate classification systems.</p>
      <p>[Table values lost in extraction: the precision/recall figures could not be mapped back to their rows and are omitted here.]</p>
    </sec>
    <sec id="sec-5">
      <title>4 Other Embedding Approaches</title>
      <p>As a complementary perspective on our
experiments, in this section we present other
approaches that we tested for meme detection
but finally disregarded in favor of the
ArchiMeDe approach presented in the
previous section.</p>
      <p>CNN without Metadata Preliminary runs on
the DANKMEMES dataset relied solely on the use
of standard convolutional neural networks. The
target architecture was fed the image itself without
associated metadata to ensure that the standalone
impact of the architecture was shown. The system
performed poorly, scoring only slightly better
than the baseline. Additional measures to
optimize this network were not taken since we
assumed that this naive approach would not lead to
substantial gains in performances over the
baseline.</p>
      <sec id="sec-5-1">
        <title>Single Pre-trained Image Encoder</title>
        <p>
          Before working with an ensemble, we estimated the
performances of its components in performing meme
detection. Besides the three models that we
finally included in ArchiMeDe, we also tested
ResNeSt
          <xref ref-type="bibr" rid="ref21">(Zhang et al., 2020)</xref>
          , which was finally
dropped due to the similarity of its predictions
to those of ResNet-50. Table 2 presents the
performances of the individual image encoders
and the final ensemble over a validation split
containing 320 examples equally distributed over
(meme, non-meme) classes. Results show how
the DenseNet model appears to be better in terms
of precision, while ResNet is worse but
compensates with a higher recall. We found that
misclassified observations were different across models,
suggesting that each model could capture different
properties of the input. The only exception was the
ResNeSt model, which produced errors very close
to the ResNet ones and was henceforth dropped
for further experiments.
        </p>
        <p>
          Multimodal Ensemble Following the
complementary viewpoints of different encoders, we
decided to evaluate the performances of an
ensemble. Table 2 shows that our ArchiMeDe ensemble
outperforms single systems in terms of both
precision and recall when considering both classes,
compensating the weaknesses of individual
systems. The resulting majority-vote ensemble was
optimized and used as the final system for our
submission. Multiple experimental iterations showed
that an increase in depth, followed by a
reduction in layers’ width, led to increased accuracy
scores. Each model was trained with a batch size
of 64, for up to 100 epochs with validation-accuracy
callbacks, and an early stopping strategy with a
patience of five epochs. Each model utilized
the Adam optimizer
          <xref ref-type="bibr" rid="ref11">(Kingma and Ba, 2015)</xref>
          with
a learning rate of 0.001 and was trained using a
binary cross-entropy loss over the two categories.
        </p>
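          <p>A minimal training-loop sketch matching the stated hyperparameters; the evaluation helper and loop structure are illustrative, not the released code:</p>
          <preformat>
import torch

@torch.no_grad()
def accuracy(model, loader):
    model.eval()
    correct = total = 0
    for inputs, y in loader:
        pred = (model(*inputs).squeeze(-1) > 0.5).long()
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

def train(model, train_loader, val_loader, epochs=100, patience=5):
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = torch.nn.BCELoss()            # binary cross-entropy
    best_acc, wait = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for inputs, y in train_loader:      # batches of 64
            opt.zero_grad()
            loss = loss_fn(model(*inputs).squeeze(-1), y.float())
            loss.backward()
            opt.step()
        acc = accuracy(model, val_loader)   # validation-accuracy callback
        if acc > best_acc:
            best_acc, wait = acc, 0
        else:
            wait += 1
        if wait == patience:                # early stopping, patience 5
            break
          </preformat>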
      </sec>
      <sec id="sec-5-2">
        <title>4.1 Data Augmentation</title>
        <p>Given the relatively small size of the available
training dataset and since popular classification
models are often trained using thousands if not
millions of images, we tested some data
augmentation strategies to improve our system’s
generalization performances. We applied random changes
for each image to augment data, modifying it with
random brightness, rotation, and zoom in a
reasonable margin to keep it distinguishable. 9
augmented images were produced for every initial
image entry. As a result, the training dataset is
increased from 1280 to 12800 images.</p>
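        <p>As a sketch of this policy with torchvision transforms; the specific ranges are assumptions within the "reasonable margin" described above:</p>
        <preformat>
import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(brightness=0.2),               # random brightness
    T.RandomRotation(degrees=15),                # random rotation
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random zoom
])

def expand(image):
    """PIL image -> the original plus 9 random variants (1 -> 10)."""
    return [image] + [augment(image) for _ in range(9)]
        </preformat>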
        <p>Every augmented image is associated with the
same metadata as the original, varying only in the
visual embedding itself. The result we aimed for
was an increase in generalization performances, as
the model fits better to the general rule of
recognizing memes. However, our results showed the
opposite behavior: the system would easily
overfit individual observation when data augmentation
was used. We think this was partly due to
augmentations not pertinent to the general meme template
and partly because of the significant increase in
the number of entries having the same associated
metadata.</p>
        <p>An extensive set of augmentation strategies was
tested over the dataset, modifying factors, ranges,
and augmentation count. No iteration significantly
and consistently improved the system’s
performance, and thus the augmentation process was
determined noisy, relatively inconclusive, and
therefore dropped from the training procedure.
5</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussion and Conclusion</title>
      <p>In this paper, we presented ArchiMeDe, our
multimodal system used for participating in the
DANKMEMES task at EVALITA 2020. The
results produced by the system are promising, even
though the system does not encode inductive biases
specific to multimodal artifact recognition
or to meme detection in particular. The
entry is not far behind the best-performing systems
in terms of precision, and several paths
display considerable potential for improving its
performances. Our results highlight the
crucial impact of transfer learning on the success
of this system. Notably, ArchiMeDe can be easily
trained with standard consumer-level GPUs.</p>
      <p>A direction that can be explored to improve
the current system would be to modify the recall
threshold, obtaining a better precision-recall
balance for predictions. Another possibility involves
introducing an aggregator network on top of the
ensemble instead of using majority vote: in this
way, the network can learn whether the predictions
of a single subnetwork are reliable, regardless of it
being part of the majority. The ensemble could
also include more varied models with differing
architecture to further accentuate differences in
feature representations. Above all, we believe that
leveraging additional data (not necessarily in
Italian) could significantly improve the system’s
performance at the cost of increased time and
computational costs.</p>
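      <p>As an illustration of the first direction, a small sketch of validation-based threshold tuning; this helper is hypothetical and was not part of the submitted system:</p>
      <preformat>
import numpy as np

def tune_threshold(probs, labels, grid=np.linspace(0.1, 0.9, 81)):
    """Pick the decision threshold that maximizes F1 on validation data."""
    def f1(t):
        pred = (probs >= t).astype(int)
        tp = int(np.sum(pred * labels))
        fp = int(np.sum(pred * (1 - labels)))
        fn = int(np.sum((1 - pred) * labels))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(grid, key=f1)
      </preformat>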
      <p>Memes today are one of the most formidable
modes of portraying one’s idea while building a
strong interpersonal connection between creators
and users. The informality of memes, combined
with their ease of making and distribution, has
greatly accentuated their growth in the last few
years. Interpreting memes effectively is a far
deeper task than it may intuitively seem. As we
devise increasingly ingenious computational
methods, we come to appreciate the importance of
slang and how directly it relates to the core human
principle of community belonging. Memes are among
the best-represented and best-documented cultural
artifacts we have today, and interpreting them
effectively would mark a significant milestone for
the field of NLP, with lasting impacts on society
as a whole.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Danilo Croce, Maria Di Maro, and
          <string-name>
            <surname>Lucia</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Passaro</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for italian</article-title>
          .
          <source>In Valerio Basile</source>
          , Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online</article-title>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          , B. Mann, Nick Ryder, Melanie Subbiah,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          , Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, G. Krueger, Tom Henighan,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          , Jeffrey Wu, Clemens Winter, Christopher Hesse,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Chen</surname>
          </string-name>
          , E. Sigler,
          <string-name>
            <given-names>Mateusz</given-names>
            <surname>Litwin</surname>
          </string-name>
          , Scott Gray, Benjamin Chess,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , Christopher Berner,
          <string-name>
            <surname>Sam McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            , Ilya Sutskever, and
            <given-names>Dario</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Language models are few-shot learners</article-title>
          .
          <source>ArXiv</source>
          , abs/2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Alessandra Teresa</given-names>
            <surname>Cignarella</surname>
          </string-name>
          , Cristina Bosco, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Presenting TWITTIRÒ-UD: An Italian Twitter treebank in Universal Dependencies</article-title>
          .
          <source>In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling</source>
          , SyntaxFest
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Simone</given-names>
            <surname>Francia</surname>
          </string-name>
          , Loreto Parisi, and
          <string-name>
            <given-names>Magnani</given-names>
            <surname>Paolo</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>UmBERTo: an italian language model trained with whole word maskings</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ross B. Girshick</surname>
          </string-name>
          , and P. Dollár.
          <year>2019</year>
          .
          <article-title>Rethinking ImageNet pre-training</article-title>
          .
          <source>2019 IEEE/CVF International Conference on Computer Vision</source>
          (ICCV), pages
          <fpage>4917</fpage>
          -
          <lpage>4926</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Universal language model fine-tuning for text classification</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>328</fpage>
          -
          <lpage>339</lpage>
          , Melbourne, Australia, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Gao</given-names>
            <surname>Huang</surname>
          </string-name>
          , Zhuang Liu, and
          <string-name>
            <surname>Kilian</surname>
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          . 2017a.
          <article-title>Densely connected convolutional networks</article-title>
          .
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>2261</fpage>
          -
          <lpage>2269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Gao</given-names>
            <surname>Huang</surname>
          </string-name>
          , Zhuang Liu, and
          <string-name>
            <surname>Kilian</surname>
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          . 2017b.
          <article-title>Densely connected convolutional networks</article-title>
          .
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>2261</fpage>
          -
          <lpage>2269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>CoRR</source>
          , abs/1412.6980.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>60</volume>
          (
          <issue>6</issue>
          ):
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          , May.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . ArXiv, abs/1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Alessio</given-names>
            <surname>Miaschi</surname>
          </string-name>
          and Felice Dell'Orletta
          .
          <year>2020</year>
          .
          <article-title>Contextual and non-contextual word embeddings: an indepth linguistic investigation</article-title>
          .
          <source>In Proceedings of the 5th Workshop on Representation Learning for NLP</source>
          , pages
          <fpage>110</fpage>
          -
          <lpage>119</lpage>
          , Online, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Alessio</given-names>
            <surname>Miaschi</surname>
          </string-name>
          , Gabriele Sarti, Dominique Brunato, Felice Dell'Orletta,
          <string-name>
            <given-names>and Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Italian transformers under the linguistic lens</article-title>
          .
          <source>In Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it).</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Martina</given-names>
            <surname>Miliani</surname>
          </string-name>
          , Giulia Giorgi, Ilir Rama, Guido Anselmi, and
          <string-name>
            <given-names>Gianluca E.</given-names>
            <surname>Lebani</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>DANKMEMES @ EVALITA2020: The memeing of life: memes, multimodality and politics</article-title>
          . In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online</article-title>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Pedro Javier</given-names>
            <surname>Ortiz Suárez</surname>
          </string-name>
          , Laurent Romary, and
          <string-name>
            <given-names>Benoît</given-names>
            <surname>Sagot</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A monolingual approach to contextualized word embeddings for mid-resource languages</article-title>
          .
          <source>In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>1703</fpage>
          -
          <lpage>1714</lpage>
          , Online, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          , Hong Kong, China, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, and
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Tamburini</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>PoSTWITA-UD: an Italian Twitter Treebank in universal dependencies</article-title>
          .
          <source>In Proceedings of the Eleventh Language Resources and Evaluation Conference (LREC</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Brew</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>HuggingFace's Transformers: State-of-the-art natural language processing</article-title>
          . ArXiv, abs/1910.03771.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Hang</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Chongruo Wu, Zhongyue Zhang, Yi Zhu,
          <string-name>
            <surname>Zhi-Li</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Haibin Lin, Yu e Sun, Tong He,
          <string-name>
            <surname>Jonas Mueller</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Manmatha</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>and Alex</given-names>
          </string-name>
          <string-name>
            <surname>Smola</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>ResNeSt: Split-attention networks</article-title>
          .
          <source>ArXiv</source>
          , abs/2004.08955.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>