Multimodal Hate Speech Detection in Memes from Mexico using BLIP

Fariha Maqbool, Elisabetta Fersini

Dipartimento di informatica, sistemistica e comunicazione
University of Milano-Bicocca
Viale Sarca 336, 20126 Milan, Italy


                                      Abstract
                                      The proliferation of online platforms has introduced a novel challenge in identifying inappropriate and hateful
                                      content in digital discourse. This paper describes our approach to detecting such content on social media platforms
                                      for Task 1 of the DIMEMEX challenge at IberLEF 2024 [1]. We employed the pre-trained vision-language model
                                      BLIP to extract combined image-text embeddings. Subsequently, a Gradient Boosting Classifier was employed
                                      for sample classification. Our findings highlight the potential for further enhancements in multi-modal analysis
                                      and classification frameworks.

                                      Keywords
                                      Hate Speech, Inappropriate Content, BLIP


                         1. Introduction
                         The emergence of online social media platforms has transformed communication by enabling people to
                         instantly connect with each other worldwide. Despite all of their advantages, these platforms also bring
                         with them new challenges. They have evolved into major channels for the spread of fake news, hate
                         speech, cyberbullying and harassment, playing a crucial role in the recent rise of cyber-hate crimes
                         [2]. The instantaneous and viral nature of content dissemination on social media enables this harmful
                         content to reach vast audiences rapidly. Consequently, it becomes extremely difficult to monitor this
                         content effectively due to the sheer volume of information available on these platforms.
                            Hate speech has persistently been a social problem, and its forms have evolved significantly over time.
                         It encompasses any kind of expression that targets individuals or groups based on their gender, sexual
                         orientation, race, religion, ethnicity, or nationality [3]. This type of speech can incite violence, promote
                         prejudice, and cause various other harmful effects on individuals and communities. In addition to hate
                         speech, social media platforms also encourage the spread of other types of inappropriate content, such
                         as profane, obscene, offensive, and macabre humor. All of these types of content spread on social media
                         through various means, such as text, images, multimedia, and other forms of digital communication.
                         Despite the negative nature of this content, it unfortunately possesses certain qualities that contribute
                         to its rapid dissemination.
                            Memes are a ubiquitous form of multimedia created by overlaying text onto images. These
                         humorous or satirical messages have gained immense popularity as a means of communication, spreading
                         rapidly among individuals. Although the majority of internet memes are harmless and amusing,
                         some of them spread inappropriate or hateful content. It is extremely challenging
                         to manually identify and stop the propagation of such harmful memes due to the enormous amount of
                         data. Furthermore, automated detection methods face additional challenges due to the complex and
                         multimodal nature of the problem, which necessitates a thorough understanding of the image, text, and
                         context of both modalities. While humans possess an inherent capability to comprehend the meaning
                         conveyed by the fusion of text and images in memes, machines struggle to perform this type of complex
                         task. Detection of Inappropriate Memes from Mexico (DIMEMEX) [4] proposed shared tasks in IberLEF
2024 [1] to detect hate speech in memes written in the Spanish language. The DIMEMEX shared tasks
aim to foster advancements in the field of meme analysis and contribute to the creation of safer and
more inclusive online spaces.

IberLEF 2024, September 2024, Valladolid, Spain
Email: f.maqbool@campus.unimib.it (F. Maqbool); elisabetta.fersini@unimib.it (E. Fersini)
ORCID: 0009-0008-2587-9417 (F. Maqbool); 0000-0002-8987-100X (E. Fersini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

  In this paper, we give an overview of the approach we adopted to detect hateful content in memes.
We used a pre-trained BLIP model to predict whether each meme was hateful, inappropriate, or
harmless. The paper is structured as follows: Section 2 reviews the literature on hateful memes and
multilingual tasks. Section 3 describes the task and the dataset used. Section 4 details the proposed
approach, and Section 5 presents the results obtained with the proposed method.


2. Related Work
Hate speech detection in memes is a challenging task that has garnered significant attention in research
and academia. Numerous studies have explored various approaches to identify hate speech in memes.
One of the most notable efforts in this area was the challenge proposed by Facebook AI, which focused
on multimodal classification to identify hate speech in memes [5]. A preliminary study related to
misogyny, a type of hate against women, was conducted by E. Fersini et al. [6] using unimodal (visual
or textual) and multimodal (visual-textual) approaches on a dataset consisting of misogynous content.
   The majority of published methodologies and resources for detecting offensive language and hate
speech were designed for the English language [7]. Therefore, researchers have sought to create resources
from cross-lingual and cross-cultural perspectives. E. Hossain et al. [8] introduced a novel dataset
of 7,148 memes in the Bengali language. They proposed a multimodal deep neural
network called DORA (Dual cO-attention fRAmework) to address the challenge of detecting hate speech
in memes, and performed experiments on both binary hate speech classification and
identification of the social entities targeted within the memes. To address the issue of multilingual resources,
the authors in [9] expanded Spanish resources with a new dataset of 9,834 tweets. They also developed
a comparative framework for evaluating models and organized a repository to make it easier to access
multilingual datasets.
   Other notable efforts to promote research in the Spanish language related to hateful content are
DA-VINCIS [10] and HOMO-MEX [11] tasks from the shared evaluation campaign of Natural Language
Processing systems in Spanish and other Iberian languages (IberLEF 2023) [12]. These tasks primarily
focused on detecting harmful content using textual data. This year, the DIMEMEX [4] challenge introduced
tasks based on Spanish-language memes, aiming to categorize memes as either hateful, inappropriate,
or harmless. These initiatives demonstrate the ongoing advancements and the critical need for improved
methods in detecting and moderating harmful content across different languages and modalities.


3. Task Description and Dataset
Our team participated in the first task of the DIMEMEX challenge, which consists of classifying
memes into three categories: hateful, inappropriate, or harmless. Memes that display a clear bias or
prejudice against a particular group of individuals are labeled as hateful. On the other hand, memes
that do not promote hate but contain vulgar, obscene, or morbid humor are considered inappropriate.
Finally, memes without any hateful or inappropriate content are deemed harmless.

    Table 1
    Dataset description for Task 1 of DIMEMEX
                       Data Type    Total Size   hateful   inappropriate   harmless
                          Train        2263       386          472          1405
                          Test          648        –            –            –

  The dataset for the first shared task consists of 2263 memes in the training set and 648 in the test set. All
the text included in the memes is written in Spanish, and each meme is assigned one of three labels:
hateful, inappropriate, or harmless. For development purposes, we split the training set in an 80/10/10
ratio into training, validation, and test subsets. Table 1 shows the details of the dataset for Task 1 of this
challenge.
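
A minimal sketch of such an 80/10/10 development split is given below. It is an illustration only: stratifying by class, the random seed, and the helper name `stratified_split` are assumptions not stated in the text, and the label list is rebuilt from the class counts in Table 1 rather than loaded from the actual data.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, split separately within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the held-out subset
    return train, val, test


# Placeholder labels matching the class counts of the training set in Table 1.
labels = ["hateful"] * 386 + ["inappropriate"] * 472 + ["harmless"] * 1405
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting within each class keeps the roughly 17/21/62 percent class proportions similar across the three subsets, which matters for a macro-averaged evaluation on an imbalanced dataset.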


4. Proposed Approach
We propose a BLIP-based approach to perform multiclass classification of memes. We used the
BLIP model to extract embeddings and then used a classifier to predict the class of each meme. The flow
of the approach is shown in Figure 1.

4.1. Model
We utilized a vision-language based pre-trained model named BLIP [13] for this task. BLIP exploits
noisy web data by generating synthetic captions, filtering out noisy ones and pre-training a multimodal
mixture of encoder-decoder model. It integrates both language and image modalities into a unified
model, aiming to enhance the performance of tasks that require multimodal reasoning. We adopted BLIP
for its ability to encode both vision and text. Since it was designed for vision-language
tasks such as Visual Question Answering (VQA) and zero-shot retrieval, whereas our goal was to classify
memes as hateful, inappropriate, or harmless, we included straightforward fine-tuning.
   BLIP is a multimodal mixture of encoder-decoder that consists of a unimodal encoder, an image-grounded
text encoder, and an image-grounded text decoder. The unimodal encoder employs Image-Text Contrastive
Loss (ITC) to encourage positive image-text pairs to have similar representations, as opposed to negative pairs.
Using Image-Text Matching Loss (ITM), the image-grounded text encoder seeks to capture the fine-grained
alignment between language and vision. In ITM, the model predicts, as a binary classification task, whether
an image-text pair is matched (positive) or unmatched (negative). The last part of the model is an image-grounded text
decoder that employs causal self-attention layers rather than bi-directional self-attention. It is equipped
with Language Modeling Loss (LM) to facilitate the generation of textual descriptions based on an
input image. The purpose of this loss is to train the decoder in an autoregressive manner, maximizing
the likelihood of the generated text. The architecture and training strategy of the BLIP model enable
remarkable performance across a wide range of vision-language tasks, demonstrating the effectiveness
of its integrated framework.
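
For reference, the three pre-training objectives described above are commonly written as follows. The notation is an assumption introduced here for illustration: $I$ is an image, $T = (t_1, \dots, t_N)$ a text, $s(\cdot,\cdot)$ a similarity between unimodal image and text features, $\tau$ a temperature, $M$ the number of candidates in the contrast set, $p$ the model predictions, $y$ the targets, and $H$ the cross-entropy. The sketch uses one-hot targets, which simplifies the original formulation (BLIP uses soft ITC targets from a momentum encoder).

```latex
\begin{align}
p^{\mathrm{i2t}}_m(I) &= \frac{\exp\big(s(I, T_m)/\tau\big)}{\sum_{j=1}^{M} \exp\big(s(I, T_j)/\tau\big)},
\qquad
p^{\mathrm{t2i}}_m(T) = \frac{\exp\big(s(T, I_m)/\tau\big)}{\sum_{j=1}^{M} \exp\big(s(T, I_j)/\tau\big)}, \\
\mathcal{L}_{\mathrm{itc}} &= \tfrac{1}{2}\,\mathbb{E}_{(I,T)}\Big[ H\big(y^{\mathrm{i2t}}(I),\, p^{\mathrm{i2t}}(I)\big) + H\big(y^{\mathrm{t2i}}(T),\, p^{\mathrm{t2i}}(T)\big) \Big], \\
\mathcal{L}_{\mathrm{itm}} &= \mathbb{E}_{(I,T)}\, H\big(y^{\mathrm{itm}},\, p^{\mathrm{itm}}(I, T)\big), \\
\mathcal{L}_{\mathrm{lm}} &= -\,\mathbb{E}_{(I,T)} \sum_{k=1}^{N} \log p\big(t_k \mid t_{<k},\, I\big), \\
\mathcal{L} &= \mathcal{L}_{\mathrm{itc}} + \mathcal{L}_{\mathrm{itm}} + \mathcal{L}_{\mathrm{lm}}.
\end{align}
```

Here $p^{\mathrm{itm}}(I, T)$ is a two-class probability (matched or unmatched), and the sum in $\mathcal{L}_{\mathrm{lm}}$ reflects the autoregressive, left-to-right factorization enforced by the causal self-attention of the decoder.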


Figure 1: Workflow of the Proposed Approach


4.2. Preprocessing
In the preprocessing step of our model, we utilized the transformers library to handle the initial
processing of our data. The associated text is translated to English using the GoogleTranslator API
to ensure uniform language processing. Both the image and translated text are processed using a
pre-defined processor that tokenizes the text and resizes the image as needed. The encoders of BLIP
leverage pre-trained image encoders and frozen large language models (LLMs) to train a lightweight,
12-layer Transformer encoder. The processor converts the data into tensors, applying padding and
truncation to ensure consistent input lengths, with a maximum of 128 tokens. The processed inputs are
stored in a dictionary and squeezed to remove any unnecessary dimensions, ensuring compatibility
with the input requirements of the model.
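
The preprocessing steps above can be sketched as follows. This is a hedged illustration rather than the exact pipeline: the checkpoint name, the use of the `deep_translator` package for `GoogleTranslator`, and the helper names `preprocess_meme` and `load_components` are assumptions. The translator and processor are passed in as arguments so that the core function does not depend on network access.

```python
def preprocess_meme(image, spanish_text, processor, translate, max_length=128):
    """Translate the meme text to English and build model inputs for one sample."""
    english_text = translate(spanish_text)
    encoded = processor(
        images=image,
        text=english_text,
        return_tensors="pt",
        padding="max_length",   # pad every text to the same length
        truncation=True,        # cut texts longer than max_length
        max_length=max_length,
    )
    # The processor adds a batch dimension of size 1; drop it so that a
    # DataLoader can stack individual samples into batches later.
    return {name: tensor.squeeze(0) for name, tensor in encoded.items()}


def load_components(checkpoint="Salesforce/blip-image-captioning-base"):
    """Load a BLIP processor and a Spanish-to-English translator.

    The checkpoint name is an assumption; the text does not state which
    BLIP checkpoint was used. Requires `transformers` and `deep_translator`.
    """
    from deep_translator import GoogleTranslator
    from transformers import BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    translator = GoogleTranslator(source="es", target="en")
    return processor, translator.translate
```

With the real components, a call would look like `processor, translate = load_components()` followed by `preprocess_meme(pil_image, text, processor, translate)`, where `pil_image` is an image opened with PIL.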

4.3. Feature Extraction
The dataset is divided into training and validation sets, and DataLoader objects are created for each
set. To obtain feature embeddings, the preprocessed input is forward passed through the model in
batches. The model processes the batch inputs to generate the last hidden state, which contains the
combined embeddings. Subsequently, these embeddings are appended to a list for later concatenation.
After processing all batches, the collected embeddings and labels are concatenated into single tensors.
These tensors contain the feature representations and corresponding labels for the entire dataset, ready
for further analysis or training downstream models.
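
The batched extraction loop described above can be sketched in PyTorch as follows. To keep the example self-contained, a small hypothetical `ToyEncoder` stands in for BLIP and random token ids stand in for the preprocessed memes; only the loop structure (forward pass per batch, collect, concatenate) reflects the text.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for BLIP: maps token ids to per-token hidden states."""

    def __init__(self, vocab_size=100, hidden_size=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        return self.embed(input_ids)  # shape: (batch, tokens, hidden)


@torch.no_grad()  # no gradients are needed when the model is only a feature extractor
def extract_embeddings(model, loader):
    model.eval()
    all_embeddings, all_labels = [], []
    for input_ids, labels in loader:
        hidden = model(input_ids)          # last hidden state for this batch
        all_embeddings.append(hidden.cpu())
        all_labels.append(labels.cpu())
    # Concatenate the per-batch results into single tensors.
    return torch.cat(all_embeddings, dim=0), torch.cat(all_labels, dim=0)


torch.manual_seed(0)
input_ids = torch.randint(0, 100, (40, 128))  # 40 fake samples of 128 tokens
labels = torch.randint(0, 3, (40,))           # 3 classes
loader = DataLoader(TensorDataset(input_ids, labels), batch_size=16)

embeddings, gathered_labels = extract_embeddings(ToyEncoder(), loader)
print(embeddings.shape, gathered_labels.shape)
```

Because the loader is not shuffled, the concatenated labels stay aligned with the original sample order, which makes it easy to join the embeddings back to meme identifiers.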

4.4. Classification
Following the preprocessing and feature extraction stages, we proceeded with the classification of the
extracted embeddings. To reduce the dimensionality of the embeddings and create a more manageable
feature set, we employed mean pooling across the token dimension. This step computes the mean of
the embeddings for each instance. The labels for the test and training datasets were converted from
one-hot encoded format to their corresponding class indices. This step is essential for compatibility
with the classifier, which requires labels in integer format. We utilized the Gradient Boosting Classifier
from scikit-learn, a powerful ensemble method that builds an additive model in a forward stage-wise
manner. Each iteration fits a new base-learner to the residual errors made by the previous model. Once
the classifier was trained, we used it to predict the labels of the test embeddings.
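
A compact sketch of this classification stage is shown below, using NumPy and scikit-learn. The embeddings are synthetic (random values shifted per class so that the toy problem is learnable), and the array sizes, `random_state`, and helper name `make_split` are assumptions for illustration; only the mean pooling, the one-hot to index conversion, and the `GradientBoostingClassifier` with 100 estimators follow the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_tokens, hidden, n_classes = 16, 8, 3


def make_split(n_samples):
    """Create synthetic token-level embeddings and one-hot labels."""
    class_ids = np.arange(n_samples) % n_classes
    # Shift each class so that the classes are separable in this toy setting.
    emb = rng.normal(size=(n_samples, n_tokens, hidden)) + 2.0 * class_ids[:, None, None]
    one_hot = np.eye(n_classes)[class_ids]
    return emb, one_hot


train_emb, train_one_hot = make_split(90)
test_emb, test_one_hot = make_split(30)

# Mean pooling across the token dimension: (samples, tokens, hidden) -> (samples, hidden).
X_train = train_emb.mean(axis=1)
X_test = test_emb.mean(axis=1)

# One-hot labels -> integer class indices, as required by the classifier.
y_train = train_one_hot.argmax(axis=1)
y_test = test_one_hot.argmax(axis=1)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy = (predictions == y_test).mean()
print(X_train.shape, predictions.shape, accuracy)
```

Mean pooling discards token order but yields one fixed-length vector per meme, which is the input shape tree-based classifiers expect.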


5. Experimentation and Results
We implemented our approach in Python using the PyTorch library. After data preprocessing, the combined
image-text embeddings are extracted with the BLIP model using a batch size of 16. The Gradient Boosting
Classifier is trained on these embeddings using 100 estimators. This classifier is then used to predict the
labels for the unseen test dataset.
   The challenge uses macro-averaged precision, recall, and F1-score as evaluation measures. Table 2
shows the results of our approach based on the labels produced by our model. The model
achieved a macro-averaged F1-score of 0.47 on the test dataset.
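
To make the evaluation measures concrete, the sketch below computes macro-averaged precision, recall, and F1-score in plain Python. The function name `macro_scores` and the toy labels are illustrative assumptions, and the macro F1 is taken as the unweighted mean of the per-class F1 values (the convention used, for example, by scikit-learn), which need not be what the official scorer does.

```python
def macro_scores(y_true, y_pred, classes):
    """Return macro-averaged (precision, recall, F1) over the given classes."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    # Every class contributes equally, regardless of how many samples it has.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with three classes (0 = hateful, 1 = inappropriate, 2 = harmless).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred, classes=[0, 1, 2])
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

Because each class is weighted equally, errors on the smaller hateful and inappropriate classes affect the score as much as errors on the much larger harmless class.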

    Table 2
    Official results of the proposed approach
                             Evaluation Metric   Best Score   Proposed Model Score   Proposed Model Rank
                             Precision              0.63              0.52                    3
                             Recall                 0.56              0.50                    3
                             F1-score               0.58              0.47                    5


6. Conclusion
In this paper, we present our approach for Task 1 of the DIMEMEX challenge. The task is a multi-class
classification problem to categorize memes as hateful, inappropriate, or harmless. We used the
vision-language model BLIP to extract the combined image and text embeddings. These embeddings were
then used to train a Gradient Boosting Classifier with 100 estimators. The performance of the model was
evaluated on the test data provided by the task organizers, using macro-averaged precision, recall, and F1-score.
Our model achieved macro-averaged precision, recall, and F1 scores of 0.52, 0.50, and 0.47, respectively. In
conclusion, our approach demonstrates the potential of visual language models and ensemble learning
techniques in addressing complex multi-modal classification tasks. By using BLIP to extract image-text
embeddings, the complex relationships between visual and textual content in memes can be captured.
In the future, classification performance can be improved by using an ensemble of multiple models,
such as combining BLIP with other vision-language models or classifiers.


7. Acknowledgments
We acknowledge the support of the PNRR ICSC National Research Centre for High Performance
Computing, Big Data and Quantum Computing (CN00000013), under the NRRP MUR program funded
by the NextGenerationEU.


References
 [1] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Process-
     ing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages
     Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for
     Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
 [2] N. S. Mullah, W. M. N. W. Zainon, Advances in machine learning algorithms for hate speech
     detection in social media: A review, IEEE Access 9 (2021) 88364–88376. URL: https://doi.org/10.1109/ACCESS.2021.3089515.
     doi:10.1109/ACCESS.2021.3089515.
 [3] A. Rawat, S. Kumar, S. S. Samant, Hate speech detection in social media: Techniques, recent trends,
     and future challenges, WIREs Computational Statistics 16 (2024). doi:10.1002/wics.1648.
 [4] H. J. Vásquez, I. Tlelo-Coyotecatl, I. H. Farías, M. Casavantes, H. J. Escalante, L. Villaseñor-Pineda,
     M. M. y Gómez, Overview of DIMEMEX at IberLEF 2024: Detection of Inappropriate Memes from
     Mexico, Procesamiento del Lenguaje Natural (2024).
 [5] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, D. Testuggine, The hateful
     memes challenge: Detecting hate speech in multimodal memes, in: H. Larochelle, M. Ranzato,
     R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems 33:
     Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December
     6-12, 2020, virtual, 2020.
 [6] E. Fersini, G. Rizzi, A. Saibene, F. Gasparini, Misogynous MEME recognition: A preliminary study,
     in: S. Bandini, F. Gasparini, V. Mascardi, M. Palmonari, G. Vizzari (Eds.), AIxIA 2021 - Advances
     in Artificial Intelligence - 20th International Conference of the Italian Association for Artificial
     Intelligence, Virtual Event, December 1-3, 2021, Revised Selected Papers, volume 13196 of Lecture
     Notes in Computer Science, Springer, 2021, pp. 279–293. doi:10.1007/978-3-031-08421-8_19.
 [7] E. Fersini, F. Gasparini, G. Rizzi, A. Saibene, B. Chulvi, P. Rosso, A. Lees, J. Sorensen, Semeval-2022
     task 5: Multimedia automatic misogyny identification, in: G. Emerson, N. Schluter, G. Stanovsky,
     R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.), Proceedings of the 16th International
     Workshop on Semantic Evaluation, SemEval@NAACL 2022, Seattle, Washington, United States,
     July 14-15, 2022, Association for Computational Linguistics, 2022, pp. 533–549.
     doi:10.18653/V1/2022.SEMEVAL-1.74.
 [8] E. Hossain, O. Sharif, M. M. Hoque, S. M. Preum, Deciphering hate: Identifying hateful
     memes and their targets, CoRR abs/2403.10829 (2024). doi:10.48550/ARXIV.2403.10829.
     arXiv:2403.10829.
 [9] A. A. Monnar, J. Perez, B. Poblete, M. Saldaña, V. Proust, Resources for multilingual hate speech
     detection, in: K. Narang, A. M. Davani, L. Mathias, B. Vidgen, Z. Talat (Eds.), Proceedings of the
     Sixth Workshop on Online Abuse and Harms (WOAH), Association for Computational Linguistics,
     Seattle, Washington (Hybrid), 2022, pp. 122–130. doi:10.18653/v1/2022.woah-1.12.
[10] H. J. Jarquín-Vásquez, D. I. H. Farías, L. J. Arellano, H. J. Escalante, L. V. Pineda, M. Montes-y-
     Gómez, F. Sánchez-Vega, Overview of DA-VINCIS at iberlef 2023: Detection of aggressive and
     violent incidents from social media in spanish, Proces. del Leng. Natural 71 (2023) 351–360.
[11] G. Bel-Enguix, H. Gómez-Adorno, G. Sierra, J. Vásquez, S. T. Andersen, S. Ojeda-Trueba, Overview
     of HOMO-MEX at iberlef 2023: Hate speech detection in online messages directed towards the
     mexican spanish speaking LGBTQ+ population, Proces. del Leng. Natural 71 (2023) 361–370.
[12] M.-y.-G. Jiménez-Zafra, Francisco Rangel, Overview of IberLEF 2023: Natural Language Processing
     Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages
     Evaluation Forum (IberLEF 2023), co-located with the 39th Conference of the Spanish Society for
     Natural Language Processing (SEPLN 2023), CEUR-WS.org, 2023.
[13] J. Li, D. Li, C. Xiong, S. C. H. Hoi, BLIP: bootstrapping language-image pre-training for unified
     vision-language understanding and generation, in: K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári,
     G. Niu, S. Sabato (Eds.), International Conference on Machine Learning, ICML 2022, 17-23 July
     2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, PMLR,
     2022, pp. 12888–12900.