ArchiMeDe @ DANKMEMES: A New Model Architecture for Meme Detection

Jinen Setpal
RN Podar School, Mumbai, India
jinens8@gmail.com, jinen.setpal@rnpodarschool.com

Gabriele Sarti
Department of Mathematics and Geosciences, University of Trieste & SISSA, Trieste, Italy
gsarti@sissa.it


Abstract

English. We introduce ArchiMeDe, a multimodal neural network-based architecture
used to solve the DANKMEMES meme detection subtask at the 2020 EVALITA campaign.
The system incorporates information from visual and textual sources through a
multimodal neural ensemble to predict whether input images and their respective
metadata are memes or not. Each pre-trained neural network in the ensemble is
first fine-tuned individually on the training dataset to perform domain
adaptation. Learned text and visual representations are then concatenated to
obtain a single multimodal embedding, and the final prediction is performed
through majority voting by all networks in the ensemble.

Italiano. Presentiamo ArchiMeDe, un'architettura multimodale basata su reti
neurali per la risoluzione del subtask di "meme detection" per DANKMEMES a
EVALITA 2020. Il sistema unisce informazione visiva e testuale attraverso un
insieme multimodale di reti neurali per prevedere se immagini e rispettivi
metadati corrispondano a meme o meno. Ogni rete neurale pre-allenata all'interno
dell'insieme è inizialmente adattata al dominio specifico del dataset di
training. In seguito, le rappresentazioni di ogni rete per immagini e testo
vengono concatenate in un unico embedding multimodale, e la previsione finale è
effettuata tramite un voto di maggioranza effettuato da tutte le reti
nell'insieme.

Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction

In recent years, the democratization of data collection procedures through web
scraping and crowdsourcing has led to the broad availability of public datasets
spanning modalities like language and vision. Contemporary state-of-the-art
machine learning models can leverage those resources to achieve highly accurate
and often superhuman performances using millions or even billions of parameters
(Brown et al., 2020), but are heavily reliant on an abundance of computational
resources to work properly. Consequently, training such architectures is often
out of reach for smaller research centers, let alone individual users. To
counter this tendency, the availability of pre-trained open-source models has
dramatically reduced the computational threshold required to obtain
state-of-the-art results in multiple language and vision tasks (Devlin et al.,
2019; He et al., 2016). Pre-trained systems are often leveraged in a two-step
framework: first, they undergo an unsupervised or semi-supervised pre-training
to learn general knowledge representations, then they are fine-tuned in a
supervised way to adapt their parameters in the context of downstream tasks.
This transfer learning approach stems from the computer vision literature (He et
al., 2019) but has been recently adopted for natural language processing tasks
with positive results (Howard and Ruder, 2018; Devlin et al., 2019; Liu et al.,
2019).

In this paper, we present ArchiMeDe, a multimodal system leveraging pre-trained
language and vision models to compete in the DANKMEMES (Miliani et al., 2020)
shared task at the EVALITA 2020 campaign (Basile et al., 2020). Following recent
transfer learning approaches, our system leverages pre-trained visual and word
embeddings in a multimodal setup, obtaining strong results on the meme detection
subtask. Specifically, we participated in the first subtask of DANKMEMES, aimed
at discriminating memes from standard images containing actors from the Italian
political scene. Task organizers extracted a total of 1600 training images from
the Instagram platform, and data available from each dataset entry (text, actors
and user engagement, among others) were leveraged to train an ensemble of
multimodal models performing meme detection through majority vote. The following
sections present our approach in detail, first showing our preliminary
evaluation of multiple modeling approaches and then focusing on the final
system's main modules and the features we leverage from the dataset. Finally,
results are presented, and we conclude by discussing the problems we faced with
some inconsistencies in the data. Our code is made available at
https://github.com/jinensetpal/ArchiMeDe

[Figure 1 diagram: DankMemes dataset -> text (UmBERTo), images (ResNet, AlexNet,
DenseNet), metadata (visual actors, date, etc.) -> concatenation of sentence
embeddings, image embeddings and raw metadata -> dense layers (one stack per
vision model) -> majority vote -> "Is this a meme?"]

Figure 1: The ArchiMeDe system architecture. Sentence embeddings produced by the
UmBERTo NLM are concatenated to metadata and image embeddings produced by three
popular pre-trained vision models. The three resulting multimodal embeddings are
fed separately to feedforward networks, and the final outcome is selected
through majority voting.

2   System Description

ArchiMeDe is composed of a multimodal learning ensemble, with the final output
being the result of a majority vote. Figure 1 visualizes our approach. First,
the transcript associated with each image is fed to an UmBERTo (Francia et al.,
2020) neural language model (NLM) pre-trained on the Italian language to produce
sentence embeddings. Then, we leverage three popular pre-trained vision
architectures, namely ResNet (He et al., 2016), DenseNet (Huang et al., 2017a)
and AlexNet (Krizhevsky et al., 2017), to produce three independent image
embeddings for each input image. These embeddings can be considered as different
views over an image that may provide us with complementary information about its
content. Then, each image embedding is concatenated with the sentence embedding
and the raw image metadata and fed as input to an 8-layer feed-forward neural
network to predict an image's meme status. The feed-forward network also
includes a single dropout layer to prevent overfitting and improve
generalization. Lastly, the three predictions are combined through majority
voting to obtain the final prediction of the ensemble. Other simpler strategies
using a single vision model to produce image embeddings were initially envisaged
as potential candidates for our submission but were finally dismissed in light
of the promising performances of the ArchiMeDe ensembling approach. We discuss
those perspectives in Section 4.

The remaining part of this section contains an
in-depth description of our ensemble's components, focusing on the input
features that were used and how those were preprocessed to best suit learning.
Moreover, we also include transfer learning specifications with some details
about their impact on the overall system accuracy.

2.1   Metadata

Engagement. User engagement per post is expressed as an integer value. We scale
and standardize engagement values to obtain a distribution centered in 0 with
σ = 1. This procedure is a standard practice to avoid passing extreme absolute
values as inputs for the neural network.

Date. We decided to leverage temporal information in our system, building upon
the intuition that memes often rely on a small set of templates that undergo a
significant variation in popularity through time. Temporal information may thus
provide our system with additional cues about an image's meme status in a
specific time-frame. In the training dataset, the date of each post is given in
the yyyy-mm-dd format. Each date is compared with a fixed reference date, 1
January 2015, to derive the number of days elapsed since the reference. Min-max
scaling is then applied to these values, yielding floating-point values in the
range [0, 1] that are fed into each model.

Manipulation. The manipulation field provides boolean information about whether
an image has been manipulated before being added to the dataset. We found this
information noisy and a weak predictor of meme status; therefore, it was dropped
as input.

Visual Actors. Each entry was additionally provided with a list of names of the
visual actors present in the frame. In the specific case of the DANKMEMES shared
task, visual actors can be especially useful to identify meme images. For
example, we can hypothesize that politicians who maintain a strong public
presence by making claims that produce a high level of public engagement are
more likely to be the subject of meme images. Moreover, some combinations of
actors may be particularly likely for memes, e.g. politicians belonging to
parties at the political compass's antipodes. In order to produce a unified
representation of visual actors for our system, we perform a one-hot encoding of
all the actors occurring in the training set: if a specific politician is
present in an image, the corresponding entry is true; conversely, if no such
actor is present, the binary field is set to false. Actors that were not present
in the training set are disregarded during evaluation: while this step is
required given the context, we assume that this may significantly impact the
outcome in images for which new actors were introduced.

2.2   Textual input

The analysis of textual content in meme images is critical to the success of the
overall system. Indeed, ironic or satirical comments may deeply affect the
users' interpretation of an image that would otherwise be classified as normal.
We note that this problem cannot be approached similarly to standard textual
analytic frameworks since memes are written in short, concise phrases and do not
necessarily comply with standard grammatical rules. They also tend to contain
slang and vernacular expressions, which, albeit conveying the intended meaning
to the reader, greatly increase the need for high model capacity and ad-hoc
training data. For this reason, we selected UmBERTo (Francia et al., 2020), a
RoBERTa-based (Liu et al., 2019) neural language model pre-trained on Italian
texts extracted from the OSCAR corpus (Ortiz Suárez et al., 2020), for producing
text representations [1]. In a recent study by Miaschi et al. (2020), the model
was highlighted as one of the top Italian NLMs for encoding linguistic
information about social media excerpts taken from the TWITTIRÒ and PoSTWITA
Twitter corpora (Cignarella et al., 2019; Sanguinetti et al., 2018). UmBERTo has
a high model capacity with 125M trainable parameters and was trained on online
crawled data, making it suitable for processing meme language.

   [1] umberto-commoncrawl-cased-v1 in the HuggingFace model hub (Wolf et al.,
   2019)

SentenceTransformers. We use the SentenceTransformers framework (Reimers and
Gurevych, 2019) to produce sentence embeddings by averaging all word embeddings
produced by the original UmBERTo model, since Miaschi and Dell'Orletta (2020)
showed that those are usually much more informative than the default [CLS]
sentence embedding. We fine-tune representations over the available meme textual
data and use them as components of our end-to-end system.
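As an illustration of the preprocessing described in Sections 2.1 and 2.2, the
following Python sketch builds the non-visual part of a model input:
standardized engagement, min-max scaled day counts from the reference date, a
binary actor encoding, and a mean-pooled sentence embedding. It is a minimal
sketch under stated assumptions, not the code used for the submission: all
function names, the toy values and the random array standing in for UmBERTo
token embeddings are hypothetical, and only the processing steps themselves come
from the text.

```python
from datetime import date

import numpy as np

REFERENCE_DATE = date(2015, 1, 1)  # reference date stated in Section 2.1


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def days_since_reference(iso_dates):
    """Convert yyyy-mm-dd strings to day counts from REFERENCE_DATE."""
    return np.array(
        [(date.fromisoformat(d) - REFERENCE_DATE).days for d in iso_dates],
        dtype=float,
    )


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


def one_hot_actors(actor_lists, vocabulary):
    """Binary matrix: entry (i, j) is 1 if actor j appears in image i.

    Actors that are not in the training vocabulary are ignored.
    """
    index = {name: j for j, name in enumerate(vocabulary)}
    encoded = np.zeros((len(actor_lists), len(vocabulary)))
    for i, actors in enumerate(actor_lists):
        for name in actors:
            if name in index:
                encoded[i, index[name]] = 1.0
    return encoded


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) tokens only."""
    mask = attention_mask[..., None].astype(float)
    return (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)


# Toy data (hypothetical values, three posts).
engagement = [120, 4500, 87]
dates = ["2015-01-11", "2018-06-30", "2020-03-15"]
actors = [["A"], ["A", "B"], ["C"]]  # "C" is unseen in the vocabulary
vocabulary = ["A", "B"]

# Random stand-in for token embeddings: (posts, tokens, hidden size).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 5, 8))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]])

features = np.hstack([
    standardize(engagement)[:, None],
    min_max_scale(days_since_reference(dates))[:, None],
    one_hot_actors(actors, vocabulary),
    mean_pool(tokens, mask),
])
print(features.shape)  # (3, 12)
```

In the full system, a vector of this kind would be concatenated with one image
embedding per vision model before being passed to the feed-forward classifier.
Masking padding positions before averaging is an assumption about how the
pooling is carried out; it keeps padding tokens from diluting the sentence
embedding.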
2.3   Visual input

While we have so far discussed metadata and text, it is essential to address the
core of a meme: the image itself. A meme can be distinguished from a standard
image through the aforementioned broken sentence structure, meme templates, and
quick and messy edits, among other aspects. As previously mentioned, memes can
be very difficult to identify when they look like standard images but gain meme
status through real-world knowledge grounding.

Due to the inherently large variance in meme images' styles and contents, it is
impractical to expect a single framework to effectively describe each
distinguishable feature and utilize it to classify an entry. Hence, we split the
representational burden across multiple pre-trained model architectures. Each of
them uses a fundamentally different approach to extract image embeddings, making
the resulting ensemble predictions more flexible in general settings. The three
networks we used for producing image embeddings are:

ResNet. Residual Networks, or ResNets (He et al., 2016), learn residual
functions in relation to layer inputs. If H(x) is the standard underlying target
mapping, ResNet layers are instead trained to fit another mapping
F(x) = H(x) − x. The original mapping is thus recast into F(x) + x. This
approach makes the optimization process easier, allowing for deeper
architectures. The default vector representation provided by task organizers is
produced by a ResNet-50, a 50-layer residual network. We use those image
embeddings of size 2048 without further adjustments.

AlexNet. AlexNet (Krizhevsky et al., 2017) is a vision architecture built with 5
convolutional layers and 3 fully-connected layers. AlexNet specializes in
identifying depth; the network architecture effectively classifies objects such
as keyboards and a large subset of animals. This fact makes AlexNet embeddings
good predictors for features such as depth that are generally problematic in
memes due to image subsections (e.g. text boxes). We use an embedding size of
4096 in the context of our experiments.

DenseNet. Pre-trained models such as ResNet and AlexNet use a large number of
hidden layers. While the increase in depth allows for better feature
abstraction, it often leads to vanishing-gradient problems during training.
DenseNet (Huang et al., 2017b) introduces dense blocks where the feature maps of
all preceding layers are used as inputs to each layer, and its own feature maps
are used as inputs into all subsequent layers. This approach encourages feature
reuse and may lead to more generalizable image embeddings. Each DenseNet image
embedding has a size of 1000.

The aim of using multiple vector embeddings was to cumulatively cover a
significant portion of possible meme combinations and templates. In Section 4 we
show how the ensemble of systems using different image embeddings leads to
significant increases in validation accuracy.

3   Results

Table 1 presents the system ranking for the meme detection subtask. Our system
placed 7th in terms of F1 score [2], impeded primarily by inconsistent recall
performances but significantly better than the random baseline (+0.2466 F1).

   [2] The F1 score is the harmonic mean of precision and recall, commonly used
   to evaluate classification systems.

Team              Run #   Precision   Recall    F1
Baseline          -       0.525       0.5147    0.5198
UniTor            1       0.839       0.8431    0.8411
UniTor            2       0.8522      0.848*    0.8501*
SNK               1       0.8515      0.8431    0.8473
SNK               2       0.8317      0.848*    0.8398
UPB               1       0.861*      0.7892    0.8235
UPB               2       0.8543      0.8333    0.8437
ArchiMeDe (ours)  1       0.8249      0.7157    0.7664
Keila             1       0.8121      0.6569    0.7263
Keila             2       0.7389      0.652     0.6927

Table 1: System ranking for the DANKMEMES meme detection subtask. Top scores are
marked with *, our system with (ours).

Results suggest that ArchiMeDe has developed inductive biases for specific image
features that strongly influence the classification outcome. By inspecting
validation folds over training data, we observe that most false negatives
produced by the system involve distinct facial characteristics of scene actors.
Conversely, ArchiMeDe effectively classifies images containing text bubbles and
evident manual edits. Another notable failure case we identified is due to
face-swapping. This failure is especially relevant since face-swapping is
commonly used to add an ironic component to meme images, but it is hardly
detectable due to missing real-world context.

4   Other Embedding Approaches

As a complementary perspective on our experiments, in this section we present
other approaches that were tested in the context of meme detection and finally
discarded in favor of the ArchiMeDe approach presented in the previous sections.

CNN without Metadata. Preliminary runs on the DANKMEMES dataset relied solely on
the use of standard convolutional neural networks. The target architecture was
fed the image itself without associated metadata to isolate the standalone
impact of the architecture. The system performed poorly, scoring only slightly
better than the baseline. Additional measures to optimize this network were not
taken since we assumed that this naive approach would not lead to substantial
gains in performances over the baseline.

Single Pre-trained Image Encoder. Before working with an ensemble, we estimated
the performances of its components in performing meme detection. Besides the
three models that we finally included in ArchiMeDe, we also tested ResNeSt
(Zhang et al., 2020), which was finally dropped due to the similarity of its
predictions to those of ResNet-50. Table 2 presents the performances of the
individual image encoders and the final ensemble over a validation split
containing 320 examples equally distributed over (meme, non-meme) classes.
Results show how the DenseNet model appears to be better in terms of precision,
while ResNet is worse but compensates with a higher recall. We found that
misclassified observations were different across models, suggesting that each
model could capture different properties of the input. The only exception was
the ResNeSt model, which produced errors very close to the ResNet ones and was
therefore dropped from further experiments.

Encoder     Precision   Recall    F1
AlexNet     .83/.77     .75/.85   .79/.81
DenseNet    .87/.83     .82/.87   .84/.85
ResNet      .83/.79     .87/.86   .85/.83
ResNeSt     .80/.84     .84/.76   .82/.79
ArchiMeDe   .87/.85     .84/.87   .86/.86

Table 2: Performances of ArchiMeDe variants with single image encoders over a
validation split of the DANKMEMES training set. Scores are presented for
non-meme/meme classes.

Multimodal Ensemble. Following the complementary viewpoints of different
encoders, we decided to evaluate the performances of an ensemble. Table 2 shows
that our ArchiMeDe ensemble achieves the highest F1 score on both classes,
balancing precision and recall better than any single-encoder system and
compensating for the weaknesses of individual systems. The resulting
majority-vote ensemble was optimized and used as the final system for our
submission. Multiple experimental iterations showed that an increase in depth,
followed by a reduction in layers' width, led to increased accuracy scores. Each
model was trained with a batch size of 64 for up to 100 epochs with test
accuracy callbacks and an early stopping strategy with a patience of five
epochs. Each model utilized the Adam optimizer (Kingma and Ba, 2015) with a
learning rate of 0.001 and was trained using a binary cross-entropy loss over
the two categories.

4.1   Data Augmentation

Given the relatively small size of the available training dataset and since
popular classification models are often trained using thousands if not millions
of images, we tested some data augmentation strategies to improve our system's
generalization performances. We applied random changes to each image to augment
data, modifying it with random brightness, rotation, and zoom within a
reasonable margin to keep it distinguishable. Nine augmented images were
produced for every initial image entry. As a result, the training dataset is
increased from 1280 to 12800 images.

Every augmented image is associated with the same metadata as the original,
varying only in the visual embedding itself. The result we aimed for was an
increase in generalization performances, with the model fitting better to the
general rule of recognizing memes. However, our results showed the opposite
behavior: the system would easily overfit individual observations when data
augmentation was used. We think this was partly due to augmentations not
pertinent to the general meme template and partly because of the significant
increase in
the number of entries having the same associated metadata.

An extensive set of augmentation strategies was tested over the dataset,
modifying factors, ranges, and augmentation count. No iteration significantly
and consistently improved the system's performance, and thus the augmentation
process was deemed noisy and inconclusive, and was dropped from the training
procedure.

5   Discussion and Conclusion

In this paper, we presented ArchiMeDe, our multimodal system used for
participating in the DANKMEMES task at EVALITA 2020. The results produced by the
system are promising, even though the system does not encode inductive biases
specific to either multimodal artifact recognition or meme detection in
particular. The entry is not far behind the best-performing systems in terms of
precision, and several paths display considerable potential for improving its
performances. The paper highlights the crucial impact of transfer learning on
the success of this system. Notably, ArchiMeDe can be easily trained with
standard consumer-level GPUs.

A direction that can be explored to improve the current system would be to
modify the decision threshold, obtaining a better precision-recall balance for
predictions. Another possibility involves introducing an aggregator network on
top of the ensemble instead of using majority vote: in this way, the network can
learn whether the predictions of a single subnetwork are reliable, regardless of
it being part of the majority. The ensemble could also include more varied
models with differing architectures to further accentuate differences in feature
representations. Above all, we believe that leveraging additional data (not
necessarily in Italian) could significantly improve the system's performance at
the cost of increased time and computational costs.

Memes today are one of the most formidable modes of portraying one's idea while
building a strong interpersonal connection between creators and users. The
informality of memes, combined with their ease of making and distribution, has
greatly accentuated their growth in the last few years. Interpreting memes
effectively is a far deeper task than intuition suggests. As humans continue to
unravel their minds and derive ingenious computational methods, we realize the
importance of slang and how it relates directly to the core human principle of
community belonging. A piece of our culture, memes are the best represented and
documented cultural artifacts we have today, and to effectively interpret them
would mean to cross a significant milestone for the field of NLP, with lasting
impacts on our society as a whole.

References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA
  2020: Overview of the 7th evaluation campaign of natural language processing
  and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro,
  and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of
  Natural Language Processing and Speech Tools for Italian. Final Workshop
  (EVALITA 2020), Online. CEUR.org.

T. Brown, B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal,
  Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini
  Agarwal, Ariel Herbert-Voss, G. Krüger, Tom Henighan, R. Child, Aditya Ramesh,
  D. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, E.
  Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J. Clark, Christopher
  Berner, Sam McCandlish, A. Radford, Ilya Sutskever, and Dario Amodei. 2020.
  Language models are few-shot learners. ArXiv, abs/2005.14165.

Alessandra Teresa Cignarella, Cristina Bosco, and Paolo Rosso. 2019. Presenting
  TWITTIRÒ-UD: An Italian Twitter treebank in universal dependencies. In
  Proceedings of the Fifth International Conference on Dependency Linguistics
  (Depling, SyntaxFest 2019).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
  Pre-training of deep bidirectional transformers for language understanding. In
  Proceedings of the 2019 Conference of the North American Chapter of the
  Association for Computational Linguistics: Human Language Technologies, Volume
  1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June.
  Association for Computational Linguistics.

Simone Francia, Loreto Parisi, and Magnani Paolo. 2020. UmBERTo: an Italian
  language model trained with whole word maskings.

Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning
  for image recognition. 2016 IEEE Conference on Computer Vision and Pattern
  Recognition (CVPR), pages 770–778.

Kaiming He, Ross B. Girshick, and P. Dollár. 2019. Rethinking ImageNet
  pre-training. 2019 IEEE/CVF International Conference on Computer Vision
  (ICCV), pages 4917–4926.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model
  fine-tuning for text classification. In Proceedings of the 56th
  Annual Meeting of the Association for Computational Linguistics
  (Volume 1: Long Papers), pages 328–339, Melbourne, Australia, July.
  Association for Computational Linguistics.

Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2017a. Densely
  connected convolutional networks. 2017 IEEE Conference on Computer
  Vision and Pattern Recognition (CVPR), pages 2261–2269.

Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2017b. Densely
  connected convolutional networks. 2017 IEEE Conference on Computer
  Vision and Pattern Recognition (CVPR), pages 2261–2269.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic
  optimization. CoRR, abs/1412.6980.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet
  classification with deep convolutional neural networks.
  Communications of the ACM, 60(6):84–90, May.

Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
  Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
  RoBERTa: A robustly optimized BERT pretraining approach. ArXiv,
  abs/1907.11692.

Alessio Miaschi and Felice Dell’Orletta. 2020. Contextual and
  non-contextual word embeddings: an in-depth linguistic investigation.
  In Proceedings of the 5th Workshop on Representation Learning for
  NLP, pages 110–119, Online, July. Association for Computational
  Linguistics.

Alessio Miaschi, Gabriele Sarti, Dominique Brunato, Felice
  Dell’Orletta, and Giulia Venturi. 2020. Italian transformers under
  the linguistic lens. In Proceedings of the Seventh Italian Conference
  on Computational Linguistics (CLiC-it).

Martina Miliani, Giulia Giorgi, Ilir Rama, Guido Anselmi, and Gianluca
  E. Lebani. 2020. DANKMEMES @ EVALITA2020: The memeing of life: memes,
  multimodality and politics. In Valerio Basile, Danilo Croce, Maria Di
  Maro, and Lucia C. Passaro, editors, Proceedings of Seventh
  Evaluation Campaign of Natural Language Processing and Speech Tools
  for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A
  monolingual approach to contextualized word embeddings for
  mid-resource languages. In Proceedings of the 58th Annual Meeting of
  the Association for Computational Linguistics, pages 1703–1714,
  Online, July. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence
  embeddings using Siamese BERT-networks. In Proceedings of the 2019
  Conference on Empirical Methods in Natural Language Processing and
  the 9th International Joint Conference on Natural Language Processing
  (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, November.
  Association for Computational Linguistics.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro
  Mazzei, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter
  Treebank in universal dependencies. In Proceedings of the Eleventh
  Language Resources and Evaluation Conference (LREC 2018).

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement
  Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan
  Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers:
  State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi-Li Zhang, Haibin
  Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, M. Li, and Alex
  Smola. 2020. ResNeSt: Split-attention networks. ArXiv,
  abs/2004.08955.
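
The concluding discussion suggests replacing majority voting with an aggregator that learns how far to trust each subnetwork. The following is a minimal sketch of that idea, not the system described in this paper: it assumes each subnetwork outputs a meme probability, and it stands in a plain logistic regression for the aggregator network. All function names, the toy data, and the hyperparameters are hypothetical.

```python
import math
import random


def majority_vote(probs, threshold=0.5):
    """Predict meme (1) if more than half of the subnetworks vote meme."""
    votes = sum(p >= threshold for p in probs)
    return int(votes * 2 > len(probs))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_aggregator(samples, labels, epochs=200, lr=0.1):
    """Fit logistic-regression weights over subnetwork probabilities (SGD)."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for probs, y in zip(samples, labels):
            pred = sigmoid(sum(w * p for w, p in zip(weights, probs)) + bias)
            err = pred - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * p for w, p in zip(weights, probs)]
            bias -= lr * err
    return weights, bias


def aggregate(probs, weights, bias):
    """Predict meme (1) if the learned weighted combination favours it."""
    logit = sum(w * p for w, p in zip(weights, probs)) + bias
    return int(sigmoid(logit) >= 0.5)


# Toy data (assumption): subnetwork 0 is reliable, subnetworks 1 and 2
# output pure noise that is unrelated to the true label.
random.seed(0)
labels = [random.randint(0, 1) for _ in range(200)]
samples = [[0.9 if y else 0.1, random.random(), random.random()] for y in labels]

weights, bias = train_aggregator(samples, labels)

outvoted = [0.9, 0.4, 0.4]  # the reliable subnetwork is in the minority
print("majority vote:", majority_vote(outvoted))
print("aggregator:   ", aggregate(outvoted, weights, bias))
```

In this toy setting the aggregator can follow the reliable subnetwork even when the two noisy ones outvote it, which is the behaviour the discussion attributes to a learned aggregator. A real aggregator would be trained on held-out predictions of the fine-tuned subnetworks rather than on synthetic probabilities.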