<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UCM's Participation to the 2024 DIMEMEX Task: Automatic Detection of Inappropriate Memes in Mexico</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aisha Aman Parveen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Complutense de Madrid (UCM)</institution>
          ,
          <addr-line>Av. Complutense, s/n, Moncloa - Aravaca, 28040 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social media has a huge impact on our world and has transformed how we share and communicate. Nevertheless, the detection of offensive and malicious messages is still an area that must be improved. The IberLEF 2024 conference has organized a competition aimed at detecting abusive content. In this paper, we describe the process of analysing and classifying different Facebook memes. We describe our datasets and the process we followed to pre-process and prepare the data used to train the Machine Learning (ML) model. Our main focus is on different techniques to process the data and on finding the model with the best combination of hyperparameters in order to classify each meme properly.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Hate Speech</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Social networks are online platforms that enable people, companies and governments to connect,
communicate and share information. These networks promote virtual communities where users can
establish personal and professional relationships, share information and collaborate on diverse activities.
Different social media platforms are created for purposes such as marketing, business communication
and education. Some of the most popular platforms include Facebook, Instagram, Twitter and
TikTok. Nevertheless, inappropriate content and hate speech are significant challenges in today’s digital
world. Inappropriate content includes offensive or harmful material unsuitable for certain audiences,
while hate speech involves discriminatory communication against individuals or groups based on
attributes like race, religion, or gender. These issues can harm mental health, foster division, and incite
violence. Research on detecting this type of content has therefore become critical to preventing the
growth of online harassment. However, despite notable advances, challenges remain in reaching a
deeper understanding of such content.</p>
      <p>
        In this study, we aim to address these challenges by analysing Facebook memes as part of the
DIMEMEX task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] of the IberLEF 2024 conference [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. First, we preprocess the text and the images. To
prepare the text dataset, we apply several transformations to obtain a text embedding. The same
process is carried out for the images, using a Transformer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Once we have both text and image
embeddings, we merge them into a single embedding and train different models on it. We also
run the starting kit provided by the organization and analyse the results. For this submission,
we participated in Task 1. The prediction labels are: hate speech, inappropriate and neither.
Finally, we analyse the predictions and compare them with the final results of the competition.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The dataset is organized into three blocks: train, validation and test. First, we train our model with
all the data available in the train block. Then, we run our model on the validation dataset, submit our
predictions and receive feedback on our performance. Finally, after making the adjustments we consider necessary, we
run our model on the test dataset and submit the final results.</p>
      <p>As we can observe in Table 1, the train dataset is the only block for which we know not only the data but
also the output, in other words, the label of each meme. From Figure 1, we can observe that 65% of
the train dataset is labelled none, 13% hate speech and 22% inappropriate content. This information is revealing,
as our model will be better trained to predict memes with the label "none" than those with "inappropriate
content". It is important to remark that predicting the hate speech category will be much more
challenging, as we have less data to train our model.</p>
      <p>Each post includes a meme and a sentence. A meme is a picture with text that emphasizes the
idea of the written sentence. Figure 2 presents an example post for each label with its corresponding image.
The image order from left to right matches the following labels: hate-speech, inappropriate content and
none. The text associated with hate-speech is “Cuando te das cuenta que no es mugre @alexesmu”; for
the inappropriate content is “[RA ESTE ESTÁ MAS PENDEJO QUE TU” and for the label none is “Ya
estaríamos en octavos; si alguien no fallaba su 00 MEXICO Zona peTexnal”.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Text mining approach</title>
      <p>
        We will now describe the models we implemented to prepare and analyse our datasets. In the first approach,
we applied different classification models, adjusting their hyperparameters to obtain the
best combination. In the second approach, we use the starting kit provided by the shared task organizers
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is a combination of an image and text neural network.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Classification models</title>
        <p>As our dataset combines text and images, we process each modality individually and then concatenate
the results into a single embedding to train our Machine Learning (ML) model and make predictions. Figure
3 describes the process schematically. The processing differs in each case:
for the images we used a pretrained Transformer, and for the text we applied
several techniques. We detail both processes in the following sections.</p>
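        <p>As a minimal sketch of the concatenation step, assuming each modality has already been turned into a numeric matrix with one row per meme, the fusion reduces to stacking columns. The array names, the random placeholder values and the feature widths are illustrative assumptions; 768 is used only because it is the hidden size of ViT-Base models, and the width actually obtained depends on how the image features are extracted.</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
n_memes = 4

# Placeholder features: random values stand in for the real ones (assumption).
text_features = rng.random((n_memes, 10))                # e.g. TF-IDF columns
count_features = rng.integers(0, 3, size=(n_memes, 3))   # mentions, URLs, numbers
image_embeddings = rng.random((n_memes, 768))            # e.g. ViT-Base hidden size

# One row per meme: text columns first, then counts, then image columns.
fused = np.hstack([text_features, count_features, image_embeddings])
print(fused.shape)
```
        </preformat>
        <p>The fused matrix can then be passed to any classifier that accepts a dense feature matrix.</p>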
        <sec id="sec-3-1-1">
          <title>3.1.1. Text approach</title>
          <p>In order to process the data properly, we made some adjustments to the dataset and then applied
the spaCy and NLTK libraries. The steps followed are:</p>
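          <list list-type="order">
            <list-item><p>We remove spaces and transform the text to lower case.</p></list-item>
            <list-item><p>With the TweetTokenizer we split the document into words and replace URLs, mentions and
numbers by tokens. Then, we merge the words into a new document.</p></list-item>
            <list-item><p>With the spaCy and NLTK libraries, we search for text lemmas, punctuation symbols and stopwords
to remove them.</p></list-item>
            <list-item><p>We merge the modified tokens into a new text.</p></list-item>
            <list-item><p>We create 3 new columns where we will count the number of mentions, URLs and numbers in
each post.</p></list-item>
          </list>
          <p>Once the steps are finished, we will have the text dataset ready to join it with the
embedding of the images.</p>
          <table-wrap>
            <label>Table 2</label>
            <caption><p>Best combination of hyperparameters for each model.</p></caption>
            <table>
              <thead>
                <tr><th>Model</th><th>Hyperparameters</th></tr>
              </thead>
              <tbody>
                <tr><td>Logistic Regression</td><td>'C': 0.1, 'class_weight': 'balanced', 'penalty': 'l2', 'solver': 'liblinear'</td></tr>
                <tr><td>Random Forest</td><td>'class_weight': 'balanced', 'max_depth': 23, 'max_features': 5, 'min_samples_leaf': 10, 'n_estimators': 100</td></tr>
                <tr><td>MLP</td><td>'alpha': 1e-05, 'batch_size': 500, 'hidden_layer_sizes': 100, 'learning_rate_init': 0.05, 'random_state': 1, 'solver': 'lbfgs'</td></tr>
              </tbody>
            </table>
          </table-wrap>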
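          <p>The text preprocessing steps listed in this section can be sketched as follows. This is a simplified stand-in rather than the pipeline used for the submission: regular expressions from the Python standard library replace the TweetTokenizer, lemmatisation is omitted, and the placeholder tokens, the tiny stopword set and the function name preprocess are assumptions made for illustration.</p>
          <preformat>
```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
TOKEN_RE = re.compile(r"\[(?:URL|MENTION|NUMBER)\]|\w+")

# Tiny illustrative stopword set (assumption, not the list used in the paper).
STOPWORDS = {"que", "no", "es", "te", "de", "la", "el", "en", "su", "si"}


def preprocess(text):
    """Return (clean_text, counts) for one post."""
    # Step 1: collapse whitespace and lowercase.
    text = " ".join(text.split()).lower()
    # Step 2: replace URLs, mentions and numbers by placeholder tokens,
    # keeping the number of replacements for the count columns (step 5).
    text, n_urls = URL_RE.subn("[URL]", text)
    text, n_mentions = MENTION_RE.subn("[MENTION]", text)
    text, n_numbers = NUMBER_RE.subn("[NUMBER]", text)
    # Step 3: tokenising with TOKEN_RE drops punctuation; stopwords are filtered out.
    tokens = [t for t in TOKEN_RE.findall(text) if t not in STOPWORDS]
    # Step 4: merge the remaining tokens into a new text.
    counts = {"n_mentions": n_mentions, "n_urls": n_urls, "n_numbers": n_numbers}
    return " ".join(tokens), counts


clean, counts = preprocess("Cuando te das cuenta que no es mugre @alexesmu")
print(clean)   # cuando das cuenta mugre [MENTION]
print(counts)  # {'n_mentions': 1, 'n_urls': 0, 'n_numbers': 0}
```
          </preformat>
          <p>URLs are replaced before mentions and numbers so that digits or "@" signs inside a link are not counted twice.</p>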
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Image approach</title>
          <p>
            In this section, we extract features from each image with a pre-trained Hugging Face pipeline. This
pipeline applies the principles of Transformers that have been so successful in NLP to computer vision.
The pre-trained model is called “google/vit-base-patch16-384” [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. No preprocessing has been applied
apart from the one implemented in the “google/vit-base-patch16-384” pipeline. With this model
we obtain the image embeddings, which are later merged with the text features.
          </p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Modelling</title>
          <p>Once we have prepared the dataset, we transform the normalised text into a feature matrix
using the TF-IDF technique. The parameters we set are:</p>
          <list list-type="bullet">
            <list-item><p>We convert the text to lowercase.</p></list-item>
            <list-item><p>We use n-grams of length 1 to 4, i.e. we take into account unigrams, bigrams, trigrams and
4-grams.</p></list-item>
            <list-item><p>We set the minimum frequency for a term to be kept at 0.1%: any term with a frequency below
0.1% in the dataset is ignored, since terms that appear very few times do not contribute
significantly to the training and learning of the model.</p></list-item>
          </list>
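          <p>The three TF-IDF settings described here map naturally onto scikit-learn's TfidfVectorizer. The sketch below assumes that this class was the implementation, which the text does not state, and uses a three-sentence toy corpus invented for illustration. Note that scikit-learn interprets a float min_df as a fraction of documents.</p>
          <preformat>
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption): already-normalised post texts.
corpus = [
    "cuando das cuenta mugre",
    "estar pendejo",
    "estar octavos alguien fallar",
]

vectorizer = TfidfVectorizer(
    lowercase=True,       # convert the text to lowercase
    ngram_range=(1, 4),   # unigrams up to 4-grams
    min_df=0.001,         # ignore terms below the 0.1% threshold
)
features = vectorizer.fit_transform(corpus)

# One row per post, one column per retained n-gram.
print(features.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```
          </preformat>
          <p>On a corpus this small the 0.1% threshold removes nothing; it only starts to prune rare n-grams once the collection contains more than a thousand posts.</p>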
          <p>Then, we split the train dataset into train and dev, with train being 80% of the dataset and dev 20%.
During training, we look for the model with the highest performance. Once we know which model
gives the best result, we train it on the whole dataset. In the next step, we import the
LogisticRegression, RandomForestClassifier and MLPClassifier models from scikit-learn. We
set up a dictionary with a number of hyperparameter values to test for each model in order to obtain the
best combination for each one. To do so, we use a 5-fold cross-validation scheme.</p>
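          <p>A minimal sketch of this selection procedure for one of the three models is given below, using scikit-learn's train_test_split and GridSearchCV. The synthetic data, the search grid and the macro F1 selection metric are assumptions, since the text lists neither the grids that were searched nor the metric used to rank the combinations. The same pattern would apply to RandomForestClassifier and MLPClassifier with their own grids.</p>
          <preformat>
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data (assumption) standing in for the fused embeddings.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# 80% train / 20% dev, keeping the class proportions in both parts.
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical search grid: one list of candidate values per hyperparameter.
param_grid = {
    "C": [0.01, 0.1, 1.0],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="f1_macro",  # assumption: the selection metric is not stated
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_dev, y_dev))  # macro F1 of the best model on the dev split
```
          </preformat>
          <p>By default GridSearchCV refits the best combination on all the data passed to fit, so the held-out dev split gives an independent check before retraining on the whole dataset.</p>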
          <p>Table 2 shows the best combination of hyperparameters for each model. As the first model is the
one with the best performance, the Logistic Regression model will be trained with the whole data for the
final submission.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Starting kit</title>
        <p>The multimodal (MM) starting kit provided by the organization is a model that combines both image
and text Transformers. For the image, the “google/vit-base-patch16-224-in21k” is used, whereas the
“dccuchile/bert-base-spanish-wwm-uncased” is used for the text part. Both encodings are then merged
together, followed by a linear layer. In this case, the weights of all models are updated using gradient
descent. In contrast, in our previous approach, the weights of the image Transformer were frozen.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
        <p>As we can observe in Table 3, the model correctly predicted class 1 (hate-speech) for 35 examples, class
2 (inappropriate content) for 31 examples and class 3 (none) for 205 examples of our internal dev set. In
other words, these counts lie on the diagonal, where the predicted class matches the true class. The
values outside the diagonal are examples for which the model predicted the wrong
class.</p>
        <p>As Table 4 shows, the model has an accuracy of 48%. In this case the model was evaluated on 97
samples for class 0 (hate speech), 118 samples for class 1 (inappropriate content) and 351 samples for
class 2 (none). The support values reinforce the fact that there is less data available for classes 0 and 1
than for class 2. Class 2 (none) obtains the best F1 score, followed by class 1
(inappropriate content) and finally class 0 (hate-speech). One explanation for this is that the
model has more samples to train on for class 2 (1405 samples) and class 1 (472 samples) than for class 0 (386 samples).
We observed this at an early stage when we analysed our initial dataset.</p>
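        <p>The reported accuracy can be checked from the figures quoted in the text. The short calculation below assumes that the diagonal counts of Table 3 and the supports of Table 4 refer to the same internal dev set, which the matching 80/20 split sizes suggest. It also derives the per-class recall, which is simply the number of correct predictions divided by the support.</p>
        <preformat>
```python
# Correct predictions per class (diagonal of Table 3) and supports (Table 4).
correct = {"hate speech": 35, "inappropriate content": 31, "none": 205}
support = {"hate speech": 97, "inappropriate content": 118, "none": 351}

# Accuracy: all correct predictions over all evaluated samples.
accuracy = sum(correct.values()) / sum(support.values())
# Recall per class: correct predictions over the samples of that class.
recall = {label: correct[label] / support[label] for label in correct}

print(f"accuracy = {accuracy:.3f}")   # accuracy = 0.479
for label, value in recall.items():
    print(f"recall[{label}] = {value:.3f}")
```
        </preformat>
        <p>The result, 271 correct predictions out of 566 samples, agrees with the 48% accuracy reported above.</p>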
        <p>In contrast, the starting kit model obtained an accuracy of 0.59 and an F1 score of 0.47. Although we
made a distinction between train and dev, we always worked on the train dataset. The model with
the best results was trained with the train dataset, and the labels it predicted for each image/text pair
were uploaded to the conference platform, since we do not have the output (Y) of the test set to check the
accuracy of our model. We submitted both our proposed approach and the starting kit.</p>
        <p>Once this process has been carried out, we observe in Table 5 that our accuracy, precision and recall are
49%, obtained with the starting kit. The Logistic Regression model obtained a significantly lower score,
27%. Among the participants, CLTL has the highest F1 score (0.58), precision (0.61) and recall (0.56).
Garcia Rodriguez Mario has the lowest F1 score (0.36), which is lower than the competition baselines
TXT, IMG and MM. All participants obtain their highest score in precision, which means that, when they
predict a class, the prediction tends to be correct. Regarding the baseline results, we can observe that the same pattern is
repeated as for the competitors: precision is higher than F1. Aaman, GarciaRodriguezMario and Baseline
have obtained the same score in all metrics. Michaelibrahim, Vickbat, fariha32 and mashd3v have higher
precision than recall.</p>
        <p>During our internal experimentation, there was not a large difference between the two approaches. Nevertheless, if we compare
the results of the submissions, the Logistic Regression model is not competitive
with the starting kit.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>We have presented our participation in the IberLEF 2024 DIMEMEX Task 1, which consists of detecting
hate speech or inappropriate content in social media. We have tested two multimodal approaches that
process both text and image. Our proposed text mining approach obtained a lower accuracy than the
starting kit: 27% and 49%, respectively.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Vásquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tlelo-Coyotecatl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Farías</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Casavantes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Villaseñor-Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>M. y Gómez</surname>
          </string-name>
          ,
          <article-title>Overview of DIMEMEX at IberLEF 2024: Detection of inappropriate memes from Mexico</article-title>
          ,
          <source>in: Procesamiento del Lenguaje Natural</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          , in:
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024)</source>
          , CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon, U. von Luxburg, S. Bengio,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</source>
          , December 4-9, 2017, Long Beach, CA, USA,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          in:
          <source>9th International Conference on Learning Representations, ICLR 2021</source>
          , Virtual Event, Austria, May 3-7, 2021, OpenReview.net,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=YicbFdNTTy.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>