<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>DIMEMEX 2025: Solution based on Open-Clip of the INFOTEC+CentroGeo team</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniela Moctezuma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tania Ramirez-delreal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Tellez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Graff</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillermo Ruiz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Investigación en Ciencias de Información Geoespacial (CentroGeo)</institution>
          ,
          <addr-line>Aguascalientes, Ags., 20213</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>INFOTEC Centro de Investigación en Tecnologías de la Información y Comunicación, 112 Circuito Tecnopolo Sur, Parque Industrial Tecnopolo 2</institution>
          ,
          <addr-line>Aguascalientes, 20326</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SECIHTI, Secretaría de Ciencia, Humanidades, Tecnología e Innovación</institution>
          ,
          <addr-line>Benito Juárez, Mexico City, 03940</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>This manuscript presents the INFOTEC+CentroGeo team's solution for Task 1 of DIMEMEX@IberLEF 2025, which poses a three-way classification problem over a meme and its accompanying text. The categories are hate speech, inappropriate content, and harmless content. Our approaches include CLIP models, our EvoMSA framework, and traditional machine learning models with different encodings for the text and image data. Our internal evaluation reached an F1-score of 0.75, yet our result on the gold standard was relatively low, with an F1-score of 0.42. Nevertheless, we draw a set of useful recipes and conclusions to help select models for multimodal tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>DIMEMEX</kwd>
        <kwd>Meme classification</kwd>
        <kwd>Text and image representation</kwd>
        <kwd>CLIP models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Memes are a well-known way to communicate ideas, especially on social media platforms. A meme is
an image that includes a short text; its intent is usually humorous, satirical, or ironic.
Using these multimodal data (image and text) to tackle the detection of hate speech, inappropriate, and
harmless content has become increasingly frequent in the natural language processing research community.
Nevertheless, this kind of problem is also addressed through text alone; unfortunately, hate speech and
harmful content are very common phenomena on social networks, particularly Twitter (now X),
Facebook, and YouTube, and also on more professional networks such as LinkedIn.</p>
      <p>Chetty and Alathur [1] define hate speech as “any speech that attacks an individual or a group with
the intention of hurting or disrespecting based on identity.” Paz et al. [2] present a systematic review
surveying different proposals dealing with the recognition of hate speech, using the Web of Science as
a source of information. This task has also been studied in depth with social media data; for instance,
Chetty and Alathur [1] reviewed several methodologies applied to data from Twitter and
Facebook. The main conclusions of this review are that hate speech laws differ across
countries and that a variety of communities deal with this kind of discrimination on social
networks.</p>
      <p>The community has also made some efforts in the harmful content detection task. Arora et al. [3] survey
the existing methods and suggest future research lines for this topic. Kirk et al. [4] developed an
analytical scheme to categorize harms in text along three aspects: (1) the type of harm (e.g., misinformation, racial
stereotypes), (2) whether the harm is sought or intentionally addressed, and (3) who the affected people are.</p>
      <p>Also, some research has classified memes containing text and images as hateful or not hateful,
depending on whether the message is funny. Some works rely on feature extraction, in some cases
with attention mechanisms and in others with classic machine learning methods [5]. Another
approach dealing with multimodal data is presented in [6], which compares and fine-tunes VisualBERT
pretrained on several datasets, such as the Conceptual Captions dataset. For the images, the
ResNeXt-152 (Aggregated Residual Transformations) based Mask R-CNN (Masked Regions with Convolutional Neural
Networks) was used, and for the text representation, the well-known uncased BERT (Bidirectional
Encoder Representations from Transformers) model.</p>
      <p>In addition, Burbi et al. [7] present a method called ISSUES, which is based on a pretrained CLIP model.
Outstanding results were reached on two datasets, the Hateful Memes Challenge and HarMeme.
After reviewing different approaches, one can conclude that image and text are not homogeneous
and that Transformer-enhanced ensemble models perform better; see [8]. Nevertheless, in our
case, we tested both a more traditional approach (EvoMSA) and CLIP to represent the combination of
each paired image and text, trying to explore which one achieved better results in this particular
problem.</p>
      <p>Furthermore, this is a very complex task because the same image with different texts can sometimes
belong to two different and opposite classes; this is not the case in the dataset provided by the DIMEMEX
2025 organizers, since each meme has a unique text and classification.</p>
      <p>This manuscript is organized as follows. Section 2 describes the target task of our solution. Section 3
describes the dataset used, provided by the competition organizers. Section 4 presents our efforts to
solve Task 1, while Section 5 is dedicated to presenting and discussing our results. Finally, Section 6 concludes
with some general ideas and future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task description</title>
      <p>In general, DIMEMEX 2025 [9] comprises three different tasks; our solution addresses only Task 1,
Detection of Hate Speech, Inappropriate, and Harmless Memes. It is worth noting that DIMEMEX is part
of a set of text classification contests hosted at IberLEF 2025 [10].</p>
      <p>This task poses a three-way classification problem in which each meme belongs exclusively to one of
the following classes: hate speech, inappropriate content, and harmless. These classes are defined as
follows.</p>
      <p>Hate Speech: The meme presents "Any kind of communication in speech, writing, or behavior that
attacks or uses pejorative or discriminatory language with reference to a person or a group based on
who they are, in other words, based on their religion, ethnicity, nationality, race, color, descent, gender,
or other identity factors" (United Nations, n.d.).</p>
      <p>Inappropriate Content: The meme presents any kind of manifestation of offensive, vulgar (profane,
obscene, sexually charged), and/or morbid humor content.</p>
      <p>Harmless: The meme presents neither hate speech nor inappropriate content.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset provided for DIMEMEX 2025 consists of approximately 3,000 memes extracted from
public Facebook groups based in Mexico and manually annotated for the presence of hate speech,
inappropriate content, and harmless content.</p>
      <p>Figure 1 shows the distribution of the three classes of Task 1. It can be seen that the most
frequent class is hate speech, which has a higher number of positive labels (1’s) than inappropriate and
harmless content, with the latter two being roughly balanced.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>For internal purposes, we split the training data into training and testing partitions (70% for training and 30% for testing);
all our proposals were evaluated on the latter, and the best-performing one was submitted to the Codalab
platform. In this section, all our efforts are explained.</p>
      <p>Figure 2 shows the general methodology, which consists of obtaining features from the
memes of the dataset; depending on the model, these features come from the text, the image, or both. Then, a machine learning
model is trained with these features, and finally, the classification is obtained. In our case, there
are two general ways to extract features, from text or image, or both. As for the machine learning
approaches, two were tested: Support Vector Machine and Random Forest.</p>
      <sec id="sec-4-1">
        <title>4.1. Text and image representation with CLIP</title>
        <p>We used CLIP for the representation of text and images; specifically, we used the publicly available OpenAI version.
All technical details can be found in [11]. We tested several CLIP models, but
ViT-B/32 reached the best result, so it was used for our final submission. Three configurations were
tested: the first uses only the image embeddings provided by CLIP, which are then classified
with an SVM classifier with default parameters (Radial Basis Function, RBF, kernel with gamma selected
automatically). The second is the same but uses only the text data (in this case, only the text
inside the meme), and the last one uses both embeddings, text and image.</p>
        <p>All of the approaches used the same configuration of the SVM algorithm. The vector size for both
image and text was 512, so in the case of using both types of data, the vector size was 1024.</p>
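        <p>The following is a minimal sketch of this pipeline, not the exact submission code: it assumes the open_clip package with the ViT-B/32 weights released by OpenAI, and the variables train_paths, train_texts, and train_labels are hypothetical lists holding the meme image paths, their internal texts, and their labels.</p>
        <preformat>
# Minimal sketch of the CLIP + SVM pipeline described above (not the submission code).
# train_paths, train_texts and train_labels are hypothetical lists.
import numpy as np
import torch
import open_clip
from PIL import Image
from sklearn.svm import SVC

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

def encode(paths, texts):
    """Return 512-d CLIP image and text embeddings for each meme."""
    img_vecs, txt_vecs = [], []
    with torch.no_grad():
        for path, text in zip(paths, texts):
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
            tok = tokenizer([text]).to(device)
            img_vecs.append(model.encode_image(img).cpu().numpy()[0])
            txt_vecs.append(model.encode_text(tok).cpu().numpy()[0])
    return np.array(img_vecs), np.array(txt_vecs)

# Image-only, text-only, or concatenated (1024-d) representations,
# each classified with an RBF SVM using the default gamma selection ("scale").
X_img, X_txt = encode(train_paths, train_texts)
X_both = np.concatenate([X_img, X_txt], axis=1)
clf = SVC(kernel="rbf", gamma="scale").fit(X_both, train_labels)
</preformat>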
      </sec>
      <sec id="sec-4-2">
        <title>4.2. EvoMSA for text representation</title>
        <p>The other kind of representation used was based on EvoMSA [12], which relies on a
bag-of-words scheme. This scheme combines traditional supervised learning estimators as well as their
decision functions.</p>
        <p>EvoMSA has been tested in a variety of international text classification competitions, achieving
outstanding results compared to Deep Learning solutions in terms of computational resources, dataset
sizes, and explainability. As mentioned before, this representation was tested only on text, using both the
text inside the meme and the textual description provided by the organizers; in contrast, the other
representations used only the text inside the meme. Similarly to the CLIP representations, the EvoMSA
vectors were classified using an SVM with the same configuration (Radial Basis Function, RBF, kernel with
gamma selected automatically).</p>
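        <p>As a rough sketch only, assuming the EvoMSA 2.x interface in which EvoMSA.BoW provides a pretrained Spanish bag-of-words representation with a transform method (the exact call names should be verified against the installed version), the resulting vectors can be fed to the same RBF SVM; combined_texts and train_labels are hypothetical variables.</p>
        <preformat>
# Rough sketch only: assumes the EvoMSA 2.x API (EvoMSA.BoW) and that
# transform() returns one vector per input text; verify against the installed version.
# combined_texts is a hypothetical list joining each meme's internal text with the
# description provided by the organizers; train_labels holds the class labels.
from EvoMSA import BoW
from sklearn.svm import SVC

bow = BoW(lang="es")                     # pretrained Spanish bag-of-words model
X_train = bow.transform(combined_texts)  # one sparse vector per text
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, train_labels)
</preformat>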
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Count-Vectorizer for text representation</title>
        <p>This approach uses the count vectorizer [13, 14] from scikit-learn. The meme text is normalized
and transformed into a numerical representation that can be used by
machine learning algorithms.</p>
        <p>The count vectorizer converts each text into a vector whose dimension corresponds to the total
vocabulary size, where each value represents the frequency of a specific word in the text. Given the simplicity
of this representation, in contrast to the other text representations tested, some elements
are filtered out when necessary (e.g., punctuation marks, very common or very rare words). Each
unique word is assigned a numerical index in the vocabulary. This process is carried out for both the
training and test data, ensuring that both sets share the same representational structure.</p>
        <p>The classification step uses a model based on the frequency of the text features.
The model is trained on the transformed vectors and their corresponding labels; the machine learning
method is a random forest classifier (RF) with n_estimators=100 and criterion="gini" [15].</p>
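        <p>A minimal sketch of this configuration with scikit-learn is shown below; train_texts, train_labels, and test_texts are hypothetical lists of meme texts and labels.</p>
        <preformat>
# Minimal sketch of the Count-Vectorizer + Random Forest configuration.
# train_texts, train_labels and test_texts are hypothetical lists.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

vectorizer = CountVectorizer()                   # word-frequency vectors over the training vocabulary
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary on the training split
X_test = vectorizer.transform(test_texts)        # reuse the same vocabulary for the test split

rf = RandomForestClassifier(n_estimators=100, criterion="gini")
rf.fit(X_train, train_labels)
predictions = rf.predict(X_test)
</preformat>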
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and analysis</title>
      <p>As mentioned above, we used the training set to evaluate our solutions internally. The training set consists of
2263 memes with their associated text, as well as a general description of each image provided by the
competition organizers. For our internal evaluation, we split this dataset into training and test partitions: 70% for
training (1584 memes) and 30% for testing (679 memes).</p>
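      <p>This internal split and evaluation can be reproduced with scikit-learn as sketched below, assuming X and y are hypothetical arrays of feature vectors and labels; the use of a stratified split and of the macro-averaged F1-score are assumptions about the exact protocol.</p>
      <preformat>
# Sketch of the internal 70/30 evaluation split (2263 memes, 1584 train / 679 test).
# X and y are hypothetical feature vectors and labels; stratification and macro
# averaging of the F1-score are assumptions about the exact protocol used.
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print("F1 (macro):", f1_score(y_test, clf.predict(X_test), average="macro"))
</preformat>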
      <p>Table 1 shows our internal results with this dataset split. It can be seen that the best internal
result was achieved by CLIP using only the image embeddings; nevertheless, very similar results were
achieved using only text and using both text and image. Larger differences were observed with the Count-Vectorizer
on text (F1-score of 0.38) and with EvoMSA (F1-score of 0.63).</p>
      <p>Our official results with the test data were an F1-score of 0.422, a precision of 0.422, and a recall of 0.423.
Table 2 shows the results and the team names as shown on the Codalab platform
(https://codalab.lisn.upsaclay.fr/competitions/22012#results). In this table, it can be
seen that the best F1-score, 0.58, was achieved by the Ryuan team, which also obtained the highest
precision and recall, both 0.58 as well.
Despite achieving an equal value for the three metrics, our score remains low compared to the winning
team, which places us in seventh position overall, tied with VeronicaNeriMendoza.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>The results obtained in the DIMEMEX 2025 challenge show the superiority of models based on the
combination of visual and textual features, especially those using the CLIP model in conjunction
with an SVM. The CLIP-IMG+SVM model achieved the best F1-score, indicating a solid balance between
precision and recall. Although the CLIP-IMG+TEXT+SVM model showed a slight decrease in this
metric, it achieved better recall, suggesting that including text can help improve classification. Moreover,
the CLIP-TEXT+SVM model also maintained a competitive F1-score, demonstrating that even using
only the textual part of the CLIP model, fairly high performance can be achieved.</p>
      <p>In contrast, more traditional approaches such as Count-Vectorizer+TEXT+RF showed inferior
performance. This may reflect the limitations of classical methods in capturing multimodal and semantic aspects.</p>
      <p>The EvoMSA (TEXT)+SVM model, while outperforming the Count-Vectorizer-based approach, still
lagged behind the CLIP models, indicating that it cannot match the multimodal understanding
offered by CLIP.</p>
      <p>In summary, the results reinforce the importance of integrating both visual and textual features to
effectively classify memes; nevertheless, more effort must be made to achieve better results, which in
our case were low.</p>
    </sec>
    <sec id="sec-7">
      <title>Generative AI Declaration</title>
      <p>The authors declare that they did not use any kind of generative model to write this manuscript.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] N. Chetty, S. Alathur, Hate speech review in the context of online social networks, Aggression and Violent Behavior 40 (2018) 108–118.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] M. A. Paz, J. Montero-Díaz, A. Moreno-Delgado, Hate speech: A systematized review, Sage Open 10 (2020) 2158244020973022.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] A. Arora, P. Nakov, M. Hardalov, S. M. Sarwar, V. Nayak, Y. Dinkov, D. Zlatkova, K. Dent, A. Bhatawdekar, G. Bouchard, I. Augenstein, Detecting harmful content on online platforms: What platforms need vs. where research efforts go, ACM Comput. Surv. 56 (2023). URL: https://doi.org/10.1145/3603399. doi:10.1145/3603399.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] H. R. Kirk, A. Birhane, B. Vidgen, L. Derczynski, Handling and presenting harmful text in NLP research, arXiv preprint arXiv:2204.14256 (2022).</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] P. C. d. Q. Hermida, E. M. d. Santos, Detecting hate speech in memes: a review, Artificial Intelligence Review 56 (2023) 12833–12851.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] A. Hamza, A. R. Javed, F. Iqbal, A. Yasin, G. Srivastava, D. Połap, T. R. Gadekallu, Z. Jalil, Multimodal religiously hateful social media memes classification based on textual and image data, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 23 (2024). URL: https://doi.org/10.1145/3623396. doi:10.1145/3623396.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] G. Burbi, A. Baldrati, L. Agnolucci, M. Bertini, A. Del Bimbo, Mapping memes to words for multimodal hateful meme classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2832–2836.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] F. Yu, J. Tang, W. Yin, Y. Sun, H. Tian, H. Wu, H. Wang, ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 3208–3216.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] H. Jarquín-Vásquez, I. Tlelo-Coyotecatl, D. I. Hernández-Farías, H. J. Escalante, L. Villaseñor-Pineda, M. Montes-y-Gómez, Overview of DIMEMEX at IberLEF 2025: Detection of Inappropriate Memes from Mexico, Procesamiento del Lenguaje Natural 75 (2025).</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] J. Á. González-Barba, L. Chiruzzo, S. M. Jiménez-Zafra, Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org, 2025.</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] M. Graff, D. Moctezuma, E. S. Téllez, Bag-of-word approach is not dead: A performance analysis on a myriad of text classification challenges, Natural Language Processing Journal (2025) 100154.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] J. Brownlee, Deep Learning for Natural Language Processing, Machine Learning Mastery, Vermont, Australia, 2017.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] S. S. Vel, Pre-processing techniques of text mining using computational linguistics and Python libraries, in: 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS), IEEE, 2021, pp. 879–884.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] S. K. Madhav, P. J. HimanshuRawat, A brief study on random forest using Python (2021).</mixed-citation></ref>
    </ref-list>
  </back>
</article>