<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SNK @ DANKMEMES: Leveraging Pretrained Embeddings for Multimodal Meme Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Fiorucci Machine learning engineer ETI</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>stefano.fiorucci@virgilio.it</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe a meme detection system, specifically developed for our participation in the first subtask of DANKMEMES (EVALITA 2020), and present its results. We built simple classifiers, consisting of feed-forward neural networks that leverage existing pretrained embeddings for both text and image representation. Our best system (SNK1) achieves good results in meme detection (F1 = 0.8473), ranking 2nd in the competition, 0.0028 behind the first-place system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1.1</p>
    </sec>
    <sec id="sec-2">
      <title>System description</title>
      <sec id="sec-2-1">
        <title>General approach and tools</title>
        <p>
          DANKMEMES
          <xref ref-type="bibr" rid="ref8">(Miliani et al., 2020)</xref>
is a shared task on meme recognition and on hate
speech/event identification in memes, part of the
EVALITA 2020 evaluation campaign
          <xref ref-type="bibr" rid="ref1">(Basile et al., 2020)</xref>
          .
        </p>
        <p>Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0)</p>
        <p>For our participation in the first subtask of
DANKMEMES, we built simple classification
models for meme detection.</p>
        <p>The main challenge is to effectively combine
textual and image inputs. We tried to exploit the
ability of pretrained embeddings to represent the
information present in text and images, at a
limited computational cost.</p>
        <p>
          To quickly build various prototypes of
neural networks, we used the Uber Ludwig framework
          <xref ref-type="bibr" rid="ref9">(Molino et al., 2019)</xref>
          : a toolbox built on top of
TensorFlow, which facilitates and speeds up the
training and testing of various models.
        </p>
        <p>We trained our models using Google
Colaboratory, a hosted Jupyter notebook service which
provides free access to GPUs, with some resource
and time limitations.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Features</title>
      </sec>
      <sec id="sec-2-3">
        <title>DANKMEMES dataset</title>
        <p>The dataset provided for the first subtask has the
following features:
• File: the name of the .jpg image file.
• Date: when the image was first posted
on Instagram.
• Picture manipulation: the degree of visual
modification of the image. Non-manipulated
images or low-impact changes are labeled 0;
heavily manipulated, impactful
changes are labeled 1.
• Visual actors: the political actors (i.e.
politicians, parties’ logos) portrayed visually,
regardless of whether they were edited into the
picture or appear in the original image.
• Engagement: the number of comments and
likes of the image.
• Text: the textual content of the image.
• Meme: a binary feature, where 0 represents
non-meme images and 1 meme images. This
is the target label.</p>
        <p>The dataset also includes image embeddings.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Feature selection and preprocessing</title>
        <p>We discarded the Date feature, as it seems
irrelevant for meme detection.</p>
        <p>Picture manipulation and Meme are simple
binary features and do not require preprocessing.</p>
        <p>We chose to scale the Engagement feature using
min-max normalization.</p>
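        <p>Min-max normalization linearly rescales the Engagement counts into the [0, 1] range. A minimal sketch (the example values are illustrative, not taken from the dataset):</p>

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array linearly into the [0, 1] range."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

# Illustrative engagement counts (comments + likes).
engagement = [120, 4500, 30, 980]
scaled = min_max_scale(engagement)
print(scaled)  # minimum (30) maps to 0.0, maximum (4500) to 1.0
```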
        <p>The Visual actors feature was preprocessed using
the Ludwig approach for sets. We report an extract
of the official framework documentation
(https://ludwig-ai.github.io/ludwig-docs/user_guide/#set-featurespreprocessing):
“Set features are expected to be provided as a
string of elements separated by whitespace.</p>
        <p>The string values are transformed into a binary
valued matrix of size n x l (where n is the size of
the dataset and l is the minimum of the size of the
biggest set and a max_size parameter) [...]</p>
        <p>The way sets are mapped into integers consists
in first using a tokenizer to map from strings to
sequences of set items. Then a dictionary of all
the different set item strings present in the column
of the dataset is collected, then they are ranked by
frequency and an increasing integer ID is assigned
to them from the most frequent to the most rare
(with 0 being assigned to PAD used for padding
and 1 assigned to UNK item).”</p>
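        <p>The mapping described in the quoted documentation can be sketched as follows (a simplified illustration, not Ludwig's actual implementation; it ignores the max_size parameter, and the example set strings are invented):</p>

```python
from collections import Counter
import numpy as np

def encode_sets(column):
    """Whitespace-separated set strings -> binary matrix, Ludwig-style.

    IDs 0 and 1 are reserved for PAD and UNK; the remaining items are
    ranked by frequency, most frequent first.
    """
    tokenized = [row.split() for row in column]
    counts = Counter(item for row in tokenized for item in row)
    # Most frequent item gets the lowest ID, starting after PAD (0) and UNK (1).
    vocab = {item: i + 2 for i, (item, _) in enumerate(counts.most_common())}
    matrix = np.zeros((len(column), len(vocab) + 2), dtype=int)
    for r, row in enumerate(tokenized):
        for item in row:
            matrix[r, vocab.get(item, 1)] = 1  # unseen items fall into UNK
    return matrix, vocab

# Invented example rows of the Visual actors column.
mat, vocab = encode_sets(["salvini lega", "salvini", "renzi pd"])
```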
      </sec>
      <sec id="sec-2-5">
        <title>Text representation</title>
        <p>For text representation, we chose to use pretrained
word embeddings for the Italian language.</p>
        <p>
          Our first model used fastText word
representations
          <xref ref-type="bibr" rid="ref3">(Bojanowski et al., 2016)</xref>
          : non-contextual
word embeddings. fastText word embeddings
rely on subword information (bags of character
n-grams) and thus provide valid representations for
rare, misspelled or out-of-vocabulary words.
In particular, we used the word vectors for the Italian
language officially distributed in 2018
          <xref ref-type="bibr" rid="ref6">(Grave et al.,
2018)</xref>
          . These word embeddings are trained on Common
Crawl and Wikipedia, using CBOW with
position-weights, in dimension 300, with character n-grams
of length 5, a window of size 5 and 10 negatives.
We computed the sentence vectors from the word
vectors using the get_sentence_vector
method of the fastText Python wrapper: each word
vector is divided by its L2 norm and the results are averaged.
The resulting sentence vector has dimension 300.
        </p>
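        <p>The normalize-and-average computation described above can be sketched as follows (a minimal reimplementation for illustration; in practice the vectors come from the pretrained fastText model and the computation is done by get_sentence_vector):</p>

```python
import numpy as np

def sentence_vector(word_vectors):
    """Average of L2-normalized word vectors, mirroring the behaviour
    described for fastText's get_sentence_vector."""
    normed = []
    for v in word_vectors:
        n = np.linalg.norm(v)
        normed.append(v / n if n > 0 else v)
    return np.mean(normed, axis=0)

# Illustrative 300-d word vectors for a 4-word sentence.
rng = np.random.default_rng(0)
words = [rng.normal(size=300) for _ in range(4)]
sent = sentence_vector(words)
print(sent.shape)  # (300,)
```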
        <p>
          Our second classifier used BERT word
representations
          <xref ref-type="bibr" rid="ref5">(Devlin et al., 2018)</xref>
          : context-based
word embeddings. The BERT model uses word-piece
tokenization, so it too provides
embeddings for unseen words. In particular, we used
GilBERTo (https://github.com/idb-ita/GilBERTo), an Italian pretrained language model
based on the Facebook RoBERTa architecture and
the CamemBERT text tokenization approach; it was
trained with the subword masking technique for
100k steps on 71 GB of Italian text with
more than 11 billion words. As an interface to
this language model, we used the Python library
HuggingFace’s Transformers
          <xref ref-type="bibr" rid="ref12">(Wolf et al., 2019)</xref>
          . To
obtain sentence vectors, we took the output for
the [CLS] token, which is prepended to the
sentence during the preprocessing phase and is
typically used for classification tasks; admittedly,
there are other methods for extracting
sentence embeddings from BERT models that may
prove more effective. The resulting sentence vector
has dimension 768.
        </p>
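        <p>The [CLS] extraction step amounts to taking the first position of the model's last hidden layer. A minimal sketch with a stand-in array in place of the real GilBERTo output (with HuggingFace's Transformers, this tensor would come from the first element of the model's output after tokenizing and encoding the sentence):</p>

```python
import numpy as np

# Stand-in for the model output: a batch of 1 sentence, 12 word pieces,
# hidden size 768 (as in RoBERTa-base-style models such as GilBERTo).
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(1, 12, 768))

# The [CLS] token is prepended at position 0, so its embedding is the
# first vector of the sequence; we use it as the sentence vector.
sentence_vec = last_hidden_state[0, 0, :]
print(sentence_vec.shape)  # (768,)
```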
      </sec>
      <sec id="sec-2-6">
        <title>Image representation</title>
        <p>
          For image representation, we used the embeddings
provided in the DANKMEMES dataset. The vector
representations are computed with ResNet
          <xref ref-type="bibr" rid="ref7">(He et al., 2016)</xref>
          , a state-of-the-art model for
image recognition based on Deep Residual Learning.
Every image vector has dimension 2048.
        </p>
      </sec>
      <sec id="sec-2-7">
        <title>System architecture</title>
        <p>Figure 1 shows a block diagram of the system
architecture, which is very simple. The Picture
manipulation, Visual actors, Engagement, Image vector and
Sentence vector (obtained from word embeddings)
features were combined by concatenation. The resulting
multimodal feature vector was fed into
a feed-forward neural network with two hidden
layers of 256 and 16 neurons respectively, with
ReLU activation functions. A final single neuron
predicts whether the image is a meme or not.</p>
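        <p>The forward pass of this architecture can be sketched as follows (random, untrained weights purely for illustration; the width chosen for the Visual actors block is an assumption):</p>

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Illustrative multimodal feature vector: image vector (2048) + sentence
# vector (300) + Engagement (1) + Picture manipulation (1) + a binary
# Visual actors block (width 10 here, purely illustrative).
x = np.concatenate([rng.normal(size=2048), rng.normal(size=300),
                    [0.3], [1.0], rng.integers(0, 2, size=10)])

# Two hidden layers (256 and 16 units, ReLU) and a single output neuron.
w1 = rng.normal(scale=0.01, size=(x.size, 256)); b1 = np.zeros(256)
w2 = rng.normal(scale=0.01, size=(256, 16));     b2 = np.zeros(16)
w3 = rng.normal(scale=0.01, size=(16, 1));       b3 = np.zeros(1)

h1 = relu(x @ w1 + b1)
h2 = relu(h1 @ w2 + b2)
p_meme = sigmoid(h2 @ w3 + b3)[0]  # probability that the image is a meme
```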
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and results</title>
      <sec id="sec-3-1">
        <title>Experimental settings</title>
        <p>To train our neural networks, we chose
cross-entropy loss as the objective function. As defined
in the subtask, the metrics of interest are precision,
recall and F1 score. In the following, all reported
metrics were calculated using the officially
provided evaluation script
(https://github.com/gianlucalebani/dankmemes2020).</p>
        <p>We used the Adam optimizer with the following
parameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. We
set an early stop of 5 epochs, in order to avoid
overfitting.</p>
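        <p>The early stop of 5 epochs can be read as: halt training once the validation loss has not improved for 5 consecutive epochs. A minimal sketch (the exact bookkeeping is an assumption; Ludwig's implementation may differ):</p>

```python
def should_stop(val_losses, patience=5):
    """Early stopping: stop when the validation loss has not improved
    for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# The last 5 epochs never beat the earlier best loss (0.6) -> stop.
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.67, 0.66]
print(should_stop(losses))  # True
```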
        <p>Hyperparameter optimization was conducted
manually; we tried various combinations of
learning rate and batch size. Our final models have a
learning rate of 10⁻⁵ and a batch size of 10.</p>
        <p>During our experiments, we studied the impact
of a multimodal analysis, compared to using
language or vision only.</p>
        <p>We trained various models, including different
combinations of basic features (Picture
manipulation, Visual actors and Engagement), text
representations (fastText or GilBERTo) and image
representations (ResNet).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
      </sec>
      <sec id="sec-3-3">
        <title>Model</title>
        <p>random baseline
Basic Features
BF+fastText
BF+GilBERTo
BF+ResNet</p>
        <p>BF+fastText+
ResNet (SNK1)
BF+GilBERTo+
ResNet (SNK2)</p>
        <p>Pr
0.525
0.8732
0.8253
0.7685
0.8341
0.8515</p>
        <p>Re
0.5147
0.6078
0.6716
0.7647
0.8382
0.8431
0.8317
0.848</p>
        <p>We observe that the basic features are quite
informative: the model based only on them far
outperforms the random baseline.</p>
        <p>Models based on basic features and visual
representations perform meme detection well. It
should be noted that unimodal vision models
perform significantly better than textual models. As
Sabat et al. (2019) pointed out, an obvious reason
is that the dimensionality of the image
representation (2048) is much larger than that of the
linguistic one (fastText: 300; GilBERTo: 768), so it has
the capacity to encode more information. It would be
interesting to conduct further experiments to
investigate less obvious explanations and to understand
whether the image representation actually conveys
features of the visual scene that are specific and
distinctive of a meme.</p>
        <p>As shown by Beskow et al. (2019),
multimodal classifiers are considerably better than
textual models and provide some improvement over
unimodal vision models, which nevertheless
provide solid performance in meme detection.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Team + Run</title>
        <p>A2
SNK1
B2
A1
SNK2
B1
...
baseline</p>
        <p>With reference to the competition, model SNK1
(Basic features + fastText + ResNet) ranked 2nd,
a short distance behind the first-place system. Model
SNK2 (Basic features + GilBERTo + ResNet)
ranked 5th.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and conclusion</title>
      <p>In this paper, we have presented simple
multimodal systems for meme detection, based on
a neural network classifier; they leverage
existing pretrained embeddings to represent both text
and image. Our systems achieve good
performance, providing improvements over unimodal
classifiers. In the first subtask of DANKMEMES
(EVALITA 2020), our models ranked 2nd and 5th.</p>
      <p>Our experiments suggest that
pretrained embeddings can be used effectively
and with little effort to represent the information
conveyed by visual and textual components. While
we did not explicitly include irony or other
distinctive aspects derived from text or image among
the features, the vectors generated by the
embeddings presumably express them implicitly.</p>
      <p>
        Starting from the simple model used, it would be
interesting to conduct in-depth analyses to
understand which of the basic features are most
important. Furthermore, we could build saliency maps
        <xref ref-type="bibr" rid="ref11">(Simonyan et al., 2013)</xref>
        to understand which
areas of the images are most relevant for the meme
detection task.
      </p>
      <p>The proposed model could be improved. With
more time and computational resources, a broader
experimentation campaign could be conducted,
using Bayesian hyperparameter optimization; we
could try different numbers of neurons in
hidden layers and other neural network architectures.
To improve the classifier without much effort, we
could also build an ensemble of our best
performing models.</p>
      <p>In our classifier, we used the powerful BERT
language model to obtain text vectors. We could
fine-tune BERT in order to obtain textual
embeddings better suited to the meme detection task.</p>
      <p>
        Finally, to overcome the limits of this simple
model, we could look for a more explicit way to
encode the irony present in the text, drawing
inspiration from IronITA
        <xref ref-type="bibr" rid="ref4">(Cignarella et al., 2018)</xref>
        .
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Danilo Croce, Maria Di Maro and Lucia C.
          <article-title>Passaro 2020</article-title>
          .
          <article-title>EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Valerio Basile</source>
          , Danilo Croce, Maria Di Maro and Lucia C. Passaro (eds.).
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Beskow</surname>
          </string-name>
          ,
          <source>Sumeet Kumar and Kathleen Carley</source>
          <year>2019</year>
          .
          <article-title>The Evolution of Political Memes: Detecting and Characterizing Internet Memes with Multimodal Deep Learning Information Processing</article-title>
          &amp; Management, volume
          <volume>57</volume>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave,
          <source>Armand Joulin and Tomas Mikolov 2016. Enriching Word Vectors with Subword Information arXiv:1607.04606</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Alessandra</given-names>
            <surname>Teresa</surname>
          </string-name>
          <string-name>
            <surname>Cignarella</surname>
          </string-name>
          , Simona Frenda, Valerio Basile, Cristina Bosco,
          <source>Viviana Patti and Paolo Rosso</source>
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 Task on Irony Detection in Italian Tweets (IronITA) Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <source>Kenton Lee and Kristina Toutanova</source>
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language arXiv:</article-title>
          <year>1810</year>
          .04805
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          , Piotr Bojanowski, Prakhar Gupta,
          <source>Armand Joulin and Tomas Mikolov 2018. Learning Word Vectors for 157 Languages Proceedings of the International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang,
          <source>Shaoqing Ren and Jian Sun</source>
          <year>2016</year>
          .
          <article-title>Deep Residual Learning for Image Recognition 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Martina</given-names>
            <surname>Miliani</surname>
          </string-name>
          , Giulia Giorgi, Ilir Rama, Guido Anselmi and
          <string-name>
            <surname>Gianluca E. Lebani</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>DANKMEMES @ EVALITA2020: The memeing of life: memes, multimodality and politics</article-title>
          .
          <source>Valerio Basile</source>
          , Danilo Croce, Maria Di Maro and Lucia C. Passaro (eds.).
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Piero</given-names>
            <surname>Molino</surname>
          </string-name>
          ,
          <source>Yaroslav Dudin and Sai Sumanth</source>
          <year>2019</year>
          .
          <article-title>Ludwig: a type-based declarative deep learning toolbox arXiv:</article-title>
          <year>1909</year>
          .07930
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Benet</given-names>
            <surname>Oriol</surname>
          </string-name>
          <string-name>
            <surname>Sabat</surname>
          </string-name>
          ,
          <source>Cristian Canton Ferrer and Xavier Giro-i-Nieto</source>
          <year>2019</year>
          .
          <article-title>Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation arXiv:</article-title>
          <year>1910</year>
          .02334
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <source>Andrea Vedaldi and Andrew Zisserman</source>
          <year>2013</year>
          .
          <article-title>Deep inside convolutional networks: Visualising image classification models and saliency maps</article-title>
          <source>arXiv:1312.6034</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest and
          <string-name>
            <surname>Alexander M. Rush</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing arXiv:</article-title>
          <year>1910</year>
          .03771
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>