<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UC-UCO-CICESE_UT3-Plenitas Team at Exploring the Detection of Inappropriate Memes from Mexico Using Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yoan Martínez-López</string-name>
          <email>yoan.martinez@plenitas.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanaima Jauriga</string-name>
          <email>yanaima.jauriga@reduc.edu.cu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mayte Guerra Saborit</string-name>
          <email>mayte.guerra@reduc.edu.cu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Madera</string-name>
          <email>julio.madera@reduc.edu.cu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ansel Rodríguez-González</string-name>
          <email>ansel@cicese.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos de Castro Lozano</string-name>
          <email>carlosdecastrolozano@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Miguel Ramirez Uceda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <addr-line>CICESE-UT3</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Plénitas</institution>
          ,
          <addr-line>C/ Le Corbusier s/n, 14005 Córdoba</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad de Camaguey</institution>
          ,
          <addr-line>Circunvalación Norte, Camino Viejo Km 5 y 1/2, Camaguey</addr-line>
          ,
          <country country="CU">Cuba</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universidad de Córdoba</institution>
          ,
          <addr-line>Córdoba</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Detecting abusive or hateful content in multimodal social-media posts is a non-trivial problem that requires modelling both textual and visual cues as well as their interaction. We describe the system developed by the UC-UCO-CICESE_UT3-Plenitas team for the DIMEMEX 2025 shared task on the detection of inappropriate Mexican memes. Our approach combines (i) a late-fusion BETO-ViT architecture for joint image-text modelling, (ii) task-aware fine-tuning of Spanish BERT variants for purely textual settings, and (iii) prompt engineering of a distilled 7B-parameter Qwen model (DeepSeek-R1-Distill-Qwen-7B) for the Large-Language-Model-only track. The resulting system reached third place in ternary classification (F1 = 0.55), first place in fine-grained hate-speech classification (F1 = 0.37), and second place in the restricted-LLM setting (F1 = 0.51). Experiments show that (1) no single model dominates all scenarios, (2) visual signals help general meme detection but hurt minority-class recall, and (3) compact LLMs can rival multimodal models when visual evidence is scarce. We release code, trained checkpoints, and hyper-parameters for reproducibility.</p>
      </abstract>
      <kwd-group>
        <kwd>Memes</kwd>
        <kwd>Hate-speech detection</kwd>
        <kwd>Multimodal</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Spanish NLP</kwd>
        <kwd>Mexican Spanish</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Social networks have become pivotal in modern life, reshaping how we communicate and share
information. Studying the content that flows through these platforms is now a major focus for the
computational linguistics community. Despite notable progress in recent years, several challenges
still demand deeper investigation to improve processing and understanding. Chief among them is
identifying abusive content—a category that includes hate speech, aggression, offensive language, and
related phenomena [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>
        Because social media is inherently multimodal, our goal is to push forward research and
development of multimodal computational models capable of detecting abusive material in Mexican
Spanish—especially memes that are hateful, offensive, or vulgar. Memes typically rely on the interplay of text
and image to convey humor or irony; removing either component can completely change their meaning.
By encouraging work that tackles these multimodality challenges, we aim to inspire innovative, socially
impactful solutions[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this paper, we describe our participation in the DIMEMEX 2025 shared task: Detection of
Inappropriate Memes from Mexico.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. DIMEMEX subtasks</title>
        <p>
          DIMEMEX comprises three subtasks [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]: (a) a three-way classification into hate speech, inappropriate
content, and neither, in which participants are free to use any approach of their choice; (b) a finer-grained
classification distinguishing instances containing hate speech into different categories such as classism,
sexism, racism, and others; and (c) the same three-way classification as in (a), but in which
participants are restricted to focusing exclusively on leveraging LLMs to detect the specified categories.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. DIMEMEX 2025 Corpus</title>
        <p>
          DIMEMEX 2025 [
          <xref ref-type="bibr" rid="ref1 ref2">2, 1</xref>
          ] uses a curated version of the corpus from the DIMEMEX 2024 edition [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The dataset
consists of around 3,000 memes, compiled from public Facebook groups rooted in Mexico and manually
annotated for the presence of hate speech, inappropriate content, and harmful content. Each meme in
the dataset has been labeled by at least three annotators. For the 2025 edition, the organizing DIMEMEX
team revised potentially noisy annotations and improved the extraction of text from memes.
Additional information is also provided with the dataset. The following categories were considered for the
labeling process:
• Classism. Any manifestation that promotes an attitude or tendency to discriminate against or minimize
someone based on differences in social status.
• Racism. Any manifestation that promotes an attitude or tendency to discriminate against or minimize
someone based on ethnic characteristics or that promotes the superiority of a group.
• Sexism. Any manifestation that promotes an attitude or tendency to discriminate against or minimize
someone based on gender characteristics. This includes misogynistic, misandrist, and LGBTQ+-related
content.
• Other. Any manifestation that promotes an attitude or tendency to discriminate against or minimize
someone based on characteristics that do not belong to the previously defined ones, such as
profession, religion, place of origin, political affiliation, and interests.
• Inappropriate content. The meme presents any kind of manifestation of offensive, vulgar (profane,
obscene, sexually charged), or morbid humor content.
        </p>
        <p>• None of the above.
The categories hate speech, inappropriate content, and neither comprise the classes for Tasks 1 and 3,
while a fine-grained classification considering the five categories above is adopted for Task 2.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Deep Learning</title>
        <p>
          Deep learning is a type of machine learning that uses artificial neural networks with many layers (hence
"deep") to model complex patterns in data [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. It’s especially good at handling unstructured data like
images, text, and audio. Deep learning enables computers to learn from data much like the human brain
does. Instead of being programmed with specific rules, a deep learning system figures out the rules on
its own by training on large datasets.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Hugging Face</title>
        <p>
          Hugging Face is a widely recognized platform and community in the field of artificial intelligence
and machine learning. It is often described as the "GitHub of Machine Learning" due to its focus on
open-source development and collaboration. Its Hub hosts a vast collection of models, datasets,
and Spaces that anyone can use. Hugging Face is particularly famous for its open-source transformers
library [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which offers thousands of pre-trained models for various tasks across different modalities, including
natural language processing (NLP), computer vision, audio, and more. These models are often based on
Transformer architectures. The Hub also provides a wide range of datasets for training and evaluating AI
models, and a platform for building and showcasing interactive machine learning demos and applications
directly in a web browser. In essence, Hugging Face serves as a central ecosystem that empowers the
AI community with open tools, resources, and a collaborative environment to build, share, and advance
machine learning technologies. Most of the models used in this competition were obtained from the Hub.
        </p>
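        <p>As an illustration, a Hub checkpoint can be loaded in a few lines. The sketch below
instantiates BETO (discussed in Section 2.4.1) with a three-label classification head; the checkpoint
identifier and label count mirror our setting, but the snippet itself is illustrative:</p>
        <preformat>
# Loading a pre-trained checkpoint from the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "dccuchile/bert-base-spanish-wwm-uncased"  # BETO
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,  # hate speech / inappropriate content / neither
)
        </preformat>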
        <sec id="sec-2-4-1">
          <title>2.4.1. BERT: Bidirectional Encoder Representations from Transformers</title>
          <p>
            BERT is a Transformer-based language model that reads text bidirectionally: its representation of
each word is conditioned on both the preceding and the following words [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. As a result,
the pre-trained BERT model can be fine-tuned with just one additional output layer to create
state-of-the-art models for a wide range of tasks, such as question answering and language inference,
without substantial task-specific architecture modifications. This is achieved through the self-attention
mechanism, a layer incorporated in every encoder block, whose goal is to capture the contextual
relationships existing between different words in the input sentence.
Nowadays, there are many versions of pre-trained BERT, but in the original paper, Google trained two
versions with different neural architectures: BERTbase, with 12 Transformer layers, 12 attention heads,
and 110 million parameters, and BERTlarge, with 24 Transformer layers, 16 attention heads, and 340
million parameters. As expected, BERTlarge outperformed its smaller sibling in accuracy tests.
          </p>
          <p>
There are two steps in the framework: pre-training and fine-tuning. During pre-training, the model is
trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first
initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data
from the downstream tasks. Input text is tokenized into subword units, converted into vectors via
embeddings, and then passed through the encoder layers to produce contextualized token representations.
BERT is pre-trained on large unlabeled text corpora (English Wikipedia and the Toronto BookCorpus)
using two main objectives. Masked Language Model (MLM): randomly masks about 15% of input tokens
and trains the model to predict these masked words based on context. Next Sentence Prediction (NSP):
trains the model to predict whether one sentence logically follows another, helping it understand
sentence relationships.
          </p>
          <p>
BETO (BERT for Spanish) is a language model based on the BERT architecture. The key characteristic of
BETO is that it has been pre-trained specifically on a large corpus of Spanish text. This specialization
allows BETO to achieve a deep understanding of the nuances, grammar, and vocabulary of the Spanish
language, making it highly effective for various natural language processing (NLP) tasks in Spanish,
such as text classification, sentiment analysis, named entity recognition, and question answering.
          </p>
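          <p>A minimal sketch of the MLM objective with BETO via the transformers fill-mask pipeline; the
example sentence is purely illustrative:</p>
          <preformat>
# BERT predicts a masked token from both left and right context.
from transformers import pipeline

fill = pipeline("fill-mask", model="dccuchile/bert-base-spanish-wwm-uncased")
for pred in fill("Los memes combinan texto e [MASK] para transmitir humor."):
    print(pred["token_str"], round(pred["score"], 3))
          </preformat>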
          <p>
            ViT (Vision Transformer): ViT is a model designed for computer vision tasks. Unlike traditional
convolutional neural networks (CNNs) that were dominant in image processing [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], ViT applies the
Transformer architecture (originally developed for NLP) to images. It works by splitting an image
into smaller patches, treating these patches as a sequence of tokens, and then processing them using
a standard Transformer encoder. ViT has demonstrated impressive performance on tasks like image
classification, object detection, and image segmentation, particularly when trained on large datasets.
          </p>
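          <p>As a worked example of the patching step (our own illustration), a 224×224 RGB image split
into 16×16 patches yields 14 × 14 = 196 patch tokens, each flattened to a 16 × 16 × 3 = 768-dimensional
vector before linear projection:</p>
          <preformat>
# Splitting an image into ViT-style patch tokens with plain PyTorch.
import torch

image = torch.randn(1, 3, 224, 224)                   # (batch, channels, H, W)
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)   # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * 16 * 16)
print(patches.shape)   # torch.Size([1, 196, 768]): the Transformer's input tokens
          </preformat>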
          <p>Vision-language models (VLMs) integrate computer vision and natural language processing.
Transformers have been adapted to handle multimodal inputs, enabling VLMs to capture the
complex relationships between visual and linguistic data. A typical VLM architecture involves two main
components:
• Image encoder: It is responsible for processing visual data, extracting features (objects, colors,
textures, etc.), and transforming them into a format that can be understood by the model.
• Text decoder: It processes textual data and generates output based on the encoded visual features.</p>
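          <p>The sketch below is a conceptual illustration of these two components (not our submitted
system): a ViT image encoder whose features are projected into the embedding space of a text decoder.
The checkpoint name and projection dimension are illustrative assumptions.</p>
          <preformat>
# Conceptual VLM skeleton: vision features become "tokens" a text decoder can
# attend to. The checkpoint and dimensions are illustrative.
import torch.nn as nn
from transformers import ViTModel

class TinyVLM(nn.Module):
    def __init__(self, decoder_dim=768):
        super().__init__()
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        # A linear projection bridges vision features into the decoder's space.
        self.projection = nn.Linear(self.image_encoder.config.hidden_size, decoder_dim)

    def encode_image(self, pixel_values):
        feats = self.image_encoder(pixel_values=pixel_values).last_hidden_state
        return self.projection(feats)  # sequence of visual tokens for the decoder
          </preformat>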
        </sec>
        <sec id="sec-2-4-2">
          <title>2.4.2. DeepSeek-R1-Distill-Qwen-7B</title>
          <p>
            "DeepSeek-Qwen-7B" most likely refers to the DeepSeek-R1-Distill-Qwen-7B model. This is not a new
model architecture built from the ground up by merging DeepSeek and Qwen designs. Instead, it’s a
product of model distillation, leveraging the strengths of both [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. Qwen 2.5 7B: This is the base model
architecture from Alibaba Cloud’s Qwen series, with 7 billion parameters. Qwen models are known
for their strong performance across various language tasks and multilingual capabilities.
DeepSeekR1: This refers to a family of powerful reasoning models developed by DeepSeek AI. DeepSeek-R1
models are noted for their enhanced reasoning abilities, often achieved through advanced training
techniques like large-scale reinforcement learning. Distillation: In this context, distillation means
that the knowledge and reasoning patterns from the larger, more powerful DeepSeek-R1 model are
transferred to the smaller Qwen 2.5 7B model [
            <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
            ]. This allows the 7B model to gain some of the
reasoning capabilities of DeepSeek-R1 without being as large or computationally intensive. Therefore,
DeepSeek-R1-Distill-Qwen-7B is essentially a refined version of the Qwen 2.5 7B model that has been
specifically fine-tuned using reasoning data generated by DeepSeek-R1. The goal of this distillation
process is to enhance the reasoning performance of the 7B Qwen model, making it more capable in tasks
requiring logical inference and problem-solving, while retaining the eficiency of a 7-billion-parameter
model.
          </p>
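          <p>Classic distillation trains the student to match the teacher's softened output distribution;
the DeepSeek release instead fine-tunes the student on teacher-generated reasoning traces, but the
sketch below illustrates the underlying knowledge-transfer idea:</p>
          <preformat>
# Generic logit-based knowledge distillation (one common variant, not
# DeepSeek's exact recipe): minimize KL between softened distributions.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2  # scale so gradients match the hard-label loss
          </preformat>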
        </sec>
        <sec id="sec-2-4-3">
          <title>2.4.3. Nous-Hermes</title>
          <p>
            Nous-Hermes-13b is a state-of-the-art language model fine-tuned on over 300,000 instructions [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
This model was fine-tuned by Nous Research, with Teknium and Karan4D leading the fine-tuning
process and dataset curation, Redmond AI sponsoring the compute, and several other contributors.
The result is an enhanced Llama 13b model that rivals GPT-3.5-turbo in performance across a variety
of tasks. This model stands out for its long responses, low hallucination rate, and absence of OpenAI
censorship mechanisms. The fine-tuning process was performed with a 2000 sequence length on an 8x
A100 80GB DGX machine for over 50 hours. The model was trained almost entirely on synthetic GPT-4
outputs, including data from diverse sources such as the GPTeacher general, roleplay, and code
instruct datasets, Nous Instruct, PDACTL (unpublished), and CodeAlpaca. Additional data input came from
Camel-AI's Biology/Physics/Chemistry and Math datasets and Airoboros' GPT-4 dataset. The total
volume of data encompassed more than 300,000 instructions.
          </p>
        </sec>
        <sec id="sec-2-4-4">
          <title>2.4.4. Mistral</title>
          <p>
Mistral 7B is a 7-billion-parameter Transformer language model engineered for superior
performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) on all evaluated
benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation
[
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. The model leverages grouped-query attention (GQA) for faster inference, coupled with
sliding-window attention (SWA) to effectively handle sequences of arbitrary length with a reduced
inference cost, and a byte-fallback BPE tokenizer. An instruction-tuned variant, Mistral
7B – Instruct, surpasses the Llama 2 13B – Chat model on both human and automated benchmarks [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
          </p>
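          <p>A minimal sketch of the sliding-window attention mask (our illustration of the general
mechanism, not Mistral's implementation), in which each token attends only to itself and the previous
window - 1 positions:</p>
          <preformat>
# Banded causal mask: True where attention is allowed.
import torch

def sliding_window_mask(seq_len, window):
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Drop attention to positions more than (window - 1) steps back.
    return torch.triu(causal, diagonal=-(window - 1))

print(sliding_window_mask(6, 3).int())  # each row has at most 3 ones
          </preformat>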
        </sec>
        <sec id="sec-2-4-4">
          <title>2.4.6. Tiiuae/falcon-7b</title>
          <p>
            Falcon-7B is a 7B-parameter causal decoder-only model built by the Technology Innovation Institute (TII)
and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. This is a raw, pre-trained
model that should be further fine-tuned for most use cases [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. It outperforms comparable open-source
models such as MPT-7B, StableLM, and RedPajama, and it features an architecture optimized for inference,
with a decoder block that runs attention and the MLP in parallel over a single layer norm.
          </p>
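          <p>The sketch below is one interpretation of such a parallel block (not Falcon's actual code):
attention and the MLP both read the output of a single layer norm, and their results are added to the
residual stream:</p>
          <preformat>
# Parallel attention/MLP decoder block with a single layer norm.
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # one layer norm shared by both branches
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)  # branches computed in parallel
          </preformat>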
        </sec>
        <sec id="sec-2-4-6">
          <title>2.4.6. Metrics</title>
          <p>
For all three tasks, the evaluation metrics include macro-averaged precision, recall, and F1 score, with
the macro-averaged F1 score serving as the main evaluation criterion. These metrics are computed per
class based on the number of true positives (TP), false positives (FP), and false negatives (FN), and then
averaged across all classes. The corresponding formulas are given below.
          </p>
          <p>Per class c: Precision_c = TP_c / (TP_c + FP_c); Recall_c = TP_c / (TP_c + FN_c);
F1_c = 2 · Precision_c · Recall_c / (Precision_c + Recall_c).</p>
          <p>Macro-average over N classes: Precision = (1/N) Σ_{c=1}^{N} Precision_c;
Recall = (1/N) Σ_{c=1}^{N} Recall_c; F1 = (1/N) Σ_{c=1}^{N} F1_c.</p>
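          <p>In practice these metrics can be computed with scikit-learn; the label vectors below are
illustrative (0 = neither, 1 = inappropriate content, 2 = hate speech):</p>
          <preformat>
# Macro-averaged precision, recall, and F1 as used for the official ranking.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 2, 1, 1, 0, 2, 0]
y_pred = [0, 2, 1, 0, 0, 1, 0]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"macro P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
          </preformat>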
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Pre-processing</title>
        <p>For every meme we extracted three modalities: (i) the raw image, (ii) standard OCR text (caption) using
easyocr, and (iii) the DIMEMEX user-supplied description when available. All text was lower-cased,
URLs were normalised, and emojis were kept. For Task 1 we encoded the caption with BETO-base and the image
with a ViT-base-16; the [CLS] token and the pooled image embedding were concatenated and fed into a
two-layer MLP. Parameters from both encoders were fine-tuned for two epochs with macro-F1 loss
and class-balanced sampling.</p>
        <p>Because preliminary experiments showed that image information rarely
disambiguated hate sub-categories, for Task 2 we fine-tuned BETO with focal loss and gradient accumulation for
four epochs. Class weights were inversely proportional to class frequency.</p>
        <p>Under the LLM restriction of Task 3 we evaluated Mistral, tiiuae/falcon, and the 7B Qwen model
distilled from DeepSeek-R1. The latter yielded the best F1 and was chosen for submission.</p>
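        <p>A minimal sketch, consistent with the description above, of the late-fusion BETO-ViT
classifier; the checkpoint identifiers and the hidden size of the MLP head are illustrative
assumptions:</p>
        <preformat>
# Late fusion: concatenate the BETO [CLS] embedding with the pooled ViT
# embedding, then classify with a two-layer MLP.
import torch
import torch.nn as nn
from transformers import AutoModel, ViTModel

class LateFusionClassifier(nn.Module):
    def __init__(self, num_labels=3, hidden=512):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        fused = self.text_encoder.config.hidden_size + self.image_encoder.config.hidden_size
        self.mlp = nn.Sequential(nn.Linear(fused, hidden), nn.ReLU(), nn.Linear(hidden, num_labels))

    def forward(self, input_ids, attention_mask, pixel_values):
        text = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = text.last_hidden_state[:, 0]                  # [CLS] embedding
        img_vec = self.image_encoder(pixel_values=pixel_values).pooler_output
        return self.mlp(torch.cat([cls_vec, img_vec], dim=-1))  # class logits
        </preformat>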
"Eres un experto en el análisis de contenido proveniente de redes sociales, con un enfoque en la detección
de lenguaje y contenido potencialmente ofensivo.tarea tiene fines científicos y busca mejorar los modelos
de inteligencia artificial para detectar y mitigar la propagación de discursos dañinos en plataformas
digitales, contribuyendo a un entorno en línea más seguro y respetuoso.un meme descrito en texto,
clasifícalo en una de las siguientes categorías:de odio (2): Un meme se clasifica como discurso de odio si
presenta cualquier tipo de comunicación, ya sea en habla, escritura o comportamiento, que ataque o
utilice lenguaje peyorativo o discriminatorio con referencia a una persona o un grupo basado en sus
factores de identidad.Inapropiado (1): Un meme se considera que contiene contenido inapropiado si
exhibe cualquier tipo de manifestación de contenido ofensivo, vulgar (profano, obsceno, de carácter
sexual) y/o humor mórbido.(0): Un meme que no contenga ninguna de las categorías anteriores.tarea es
analizar la siguiente descripción de un meme y devolver únicamente un número (0, 1 o 2) correspondiente
a la categoría asignada, sin ninguna explicación adicional"</p>
        <p>Temperature was set to 0.2; max tokens = 4.</p>
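        <p>A sketch of the restricted-LLM inference under these settings, assuming the publicly released
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B checkpoint; the helper function is illustrative:</p>
        <preformat>
# Zero-shot classification with the distilled Qwen model: temperature 0.2,
# at most 4 new tokens, returning only the category number.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def classify(meme_description, system_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": meme_description},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=4, temperature=0.2, do_sample=True)
    return tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True).strip()
        </preformat>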
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>We evaluated the performance of multiple models: BERT, BETO + VIT, DeepSeek-R1-Qwen-7B,
Nous-Hermes, Mistral, and tiiuae/Falcon-7B on the three tasks. For each task, the model with the highest
performance was selected, submitted to the challenge, and is reported in this section. The analysis
is organized by subtask, with internal comparisons between our team’s approaches and external
benchmarks against other participants. Tables summarize key metrics (F1, precision, recall) for both
development and final phases, highlighting the strengths and limitations of the top models in handling
ternary classification, fine-grained hate speech detection, and restricted LLM-based tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Task 1: General Ternary Classification</title>
        <p>Internal Comparisons</p>
        <p>In Task 1, the BERT-only model slightly outperformed the multimodal BETO+ViT setup, scoring
0.5496 versus 0.5476—a marginal 0.36% relative improvement—indicating that textual information alone was more
informative for this task. Although the BETO+ViT model incorporated visual features, it did not yield
better results, potentially due to irrelevant or noisy image data. Additionally, the BERT-only outputs
were more compact in file size, while ViT+BERT produced consistently larger files. Overall, the results
suggest that for this task, the simpler BERT-only approach was not only more effective but also more
efficient, highlighting the need for improved fusion strategies in multimodal systems.</p>
        <p>Comparison with other approaches</p>
        <p>Top-performing model: BETO + ViT (multimodal). Performance: F1 = 0.55, precision = 0.57,
recall = 0.55. The leading approach combined BETO (a Spanish-optimized BERT variant) with the Vision
Transformer (ViT) to analyze both textual and visual elements in memes. This multimodal integration
proved optimal for distinguishing among the three primary categories: hate speech, inappropriate
content, and neutral memes. Its strength lay in generalization, effectively handling the diverse linguistic
and visual expressions characteristic of Mexican memes. However, performance dipped in cases
involving irony or sarcasm, where contextual ambiguity between text and imagery posed challenges.
Tables 2 and 3 present the results from both the development phase and the final stage of the research.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 2: Fine-Grained Hate Speech Classification</title>
        <p>Internal Comparisons</p>
        <p>In Task 2, the BERT-only model outperformed the multimodal BETO+ViT approach, achieving a
higher score of 0.3704 compared to 0.3266, indicating that text alone was more effective than combining
text and image data. The lower performance of BETO+ViT suggests that the addition of visual features
may have introduced noise or irrelevant information. File sizes also reflected this difference, with
BERT-only outputs varying more widely due to preprocessing, while BETO+ViT files were consistently
larger due to image embeddings. Overall, the text-only approach proved more effective for this task,
and any future use of multimodal models would require improved fusion strategies or better visual
feature extraction.</p>
        <p>Comparison with other approaches</p>
        <p>Top-performing model: BERT-only. Performance: F1 = 0.37, precision = 0.40, recall = 0.36. For
this subtask, the team employed BERT to detect hate speech subcategories (e.g., classism, racism,
sexism). While the F1 score was lower than in Task 1, the model achieved a precision of 0.40 in sensitive
categories like racism and sexism, underscoring its efficacy in identifying specific discriminatory content.
Limitations emerged in minority classes (e.g., "others"), where lower recall suggested a need for data
augmentation or class-balancing techniques to improve coverage. Tables 4 and 5 present the results
from the development phase and the final stage.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Task 3: Ternary Classification Using Restricted LLMs</title>
        <p>Internal Comparisons</p>
        <p>In Task 3, DeepSeek-R1-Distill-Qwen-7B emerged as the top-performing language model with a
score of 0.5069, clearly outperforming all other tested models. Nous-Hermes followed with
a solid, though lower, score of 0.4774. Mid-tier models such as Mistral and tiiuae/falcon performed similarly but
significantly worse, scoring around 0.29, and the lowest-scoring configuration achieved only 0.1745,
indicating it may be unsuitable for the task without substantial optimization.
File sizes ranged from approximately 16.8 KB to 23.8 KB, showing no clear link between size and
effectiveness. Overall, DeepSeek-R1-Distill-Qwen-7B proved most effective, while models like Mistral
and tiiuae/falcon may benefit from fine-tuning or further adaptation.</p>
        <p>Comparison with other approaches</p>
        <p>Top-performing model: DeepSeek-R1-Distill-Qwen-7B. Performance: F1 = 0.51, precision = 0.54,
recall = 0.50. In this task, which restricted solutions to large language models (LLMs), the
distilled DeepSeek-Qwen 7B model (enhanced with DeepSeek-R1's reasoning techniques) outperformed
competitors like Mistral 7B and Falcon-7B. Its advantages included reduced hallucination tendencies,
computational efficiency, and adaptability to Mexican Spanish. Nevertheless, its slightly lower F1
compared to Task 1's multimodal approach suggests that LLMs could benefit from localized fine-tuning
or hybrid architectures incorporating visual analysis. See Tables 6 and 7.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>DIMEMEX 2025 experiments confirm that aligning model strategy with task demands is essential.
Multimodal fusion (text + image) produced the strongest general-purpose result—Task 1: 3rd place, F1 =
0.55—while task-specific fine-tuning enhanced category-level precision—Task 2: 1st place, F1 = 0.37,
despite its more granular hate speech classification. In restricted settings, lightweight LLMs proved
competitive—Task 3: 2nd place, F1 = 0.51—demonstrating their effectiveness when visual cues are
limited. However, challenges such as irony, sarcasm, and minority-class imbalance persist, indicating
that future improvements may stem from hybrid LLM-plus-vision architectures, culturally tailored data
augmentation, and adversarial training. No single model dominates across all scenarios, but a flexible,
task-aware approach consistently yields the best outcomes. Our results demonstrate that:
• Multimodality is critical for generalization. The fusion of text (BETO) and image (ViT) features
in Task 1 yielded the highest F1 score (0.55), underscoring the importance of leveraging both
modalities for broad meme classification.
• Task-specific tuning enhances precision. While Task 2’s F1 score was lower (0.37), fine-tuning for
hate speech subcategories improved the model’s ability to detect racism and sexism, revealing
trade-offs between generalization and granularity.
• LLMs show promise but require refinement. DeepSeek-Qwen 7B’s strong performance in Task
3 (F1 = 0.51) indicates that LLMs are viable in restricted settings; however, integrating visual
components could help close the performance gap with fully multimodal models.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 and Grammarly for grammar and
spelling checking. After using these services, the authors reviewed and edited the content as needed and
take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jarquín-Vásquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tlelo-Coyotecatl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Casavantes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Hernández-Farías</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Villaseñor-Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes</surname>
          </string-name>
          , et al., Panorama de dimemex en iberlef2025: Detección de memes inapropiados de méxico,
          <source>in: Actas del Foro</source>
          Ibérico de Evaluación de Lenguas (IberLEF
          <year>2025</year>
          ),
          <article-title>celebrado conjuntamente con la 41ª Conferencia de la Sociedad Española de Procesamiento del Lenguaje Natural (SEPLN 2025), CEUR-WS</article-title>
          .org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          , Panorama general de IberLEF 2025:
          <article-title>Desafíos del procesamiento del lenguaje natural para el español y otras lenguas ibéricas</article-title>
          ,
          <source>in: Actas del Foro</source>
          Ibérico de Evaluación de Lenguas (IberLEF
          <year>2025</year>
          ),
          <article-title>celebrado conjuntamente con la 41ª Conferencia de la Sociedad Española de Procesamiento del Lenguaje Natural (SEPLN 2025), CEUR-WS</article-title>
          .org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jarquín-Vásquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tlelo-Coyotecatl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Casavantes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Hernández-Farías</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Villaseñor-Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes</surname>
          </string-name>
          , et al.,
          <source>Overview of dimemex at iberlef</source>
          <year>2024</year>
          :
          <article-title>Detection of inappropriate memes from mexico</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>73</volume>
          (
          <year>2024</year>
          )
          <fpage>335</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          ,
          <source>Nature</source>
          <volume>521</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Hugging face, in: Introduction to transformers for NLP: With the hugging face library and models to solve problems</article-title>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          , G. Chaperon,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Spanish pre-trained bert model and evaluation data</article-title>
          ,
          <source>arXiv preprint arXiv:2308.02976</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Song,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , S. Ma,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bi</surname>
          </string-name>
          , et al.,
          <article-title>DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:2501.12948</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Datta</surname>
          </string-name>
          , I. Kabir,
          <string-name>
            <given-names>N. U. R.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haque</surname>
          </string-name>
          ,
          <article-title>Qwen 2.5: A comprehensive review of the leading resource-efficient LLM with potential to surpass all competitors</article-title>
          ,
          <source>Authorea Preprints</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lv</surname>
          </string-name>
          , et al.,
          <article-title>Qwen3 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2505.09388</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kotonya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaimes</surname>
          </string-name>
          ,
          <article-title>Little giants: Exploring the potential of small llms as evaluation metrics in summarization in the eval4nlp 2023 shared task</article-title>
          ,
          <source>arXiv preprint arXiv:2311.00686</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          ,
          <source>arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Basyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanghvi</surname>
          </string-name>
          ,
          <article-title>Text summarization using large language models: a comparative study of mpt-7b-instruct, falcon-7b-instruct, and openai chat-gpt models</article-title>
          ,
          <source>arXiv preprint arXiv:2310.10449</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>