<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELiRF-UPV at SatiSPeech-IberLEF 2025: Multimodal Speech-text Satire Recognition in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alejandro Joaquín Barceló Milkova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreu Casamayor Segarra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vicent Ahuir</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María José Castro-Bleda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Systems and Computation, Universitat Politècnica de València</institution>
          ,
          <addr-line>Camí de Vera s/n, València, 46020</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>VRAIN: Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València</institution>
          ,
          <addr-line>Camí de Vera s/n, València, 46020</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ValgrAI: Valencian Graduate School and Research Network of Artificial Intelligence, Universitat Politècnica de València</institution>
          ,
          <addr-line>Camí de Vera s/n, València, 46020</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper describes our participation in the SatiSPeech shared task at IberLEF 2025, which focuses on the automatic detection of satirical content in Spanish using both textual and acoustic modalities. The task is divided into two subtasks: satire detection from text alone, and a more challenging multimodal satire detection task combining speech and its transcription. We developed several systems leveraging pre-trained transformer-based language models and instruction-tuned large language models, applying both fine-tuning and few-shot prompting strategies. For the multimodal task, we designed an approach that integrates acoustic and textual features to capture the nuanced cues characteristic of satirical discourse. Our models were trained and evaluated using the newly introduced SatirA dataset, which includes approximately 25 hours of labeled speech and corresponding transcriptions. The results demonstrate the effectiveness of our methods, achieving first place in the text-only task and second place in the multimodal task. These findings highlight the feasibility of applying multimodal learning to complex language understanding tasks such as satire detection and underscore the value of combining linguistic and prosodic cues for improved performance.</p>
      </abstract>
      <kwd-group>
<kwd>Natural Language Processing</kwd>
        <kwd>Transformer-based Models</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Multimodal Satire Detection</kwd>
        <kwd>Spanish Satirical Content</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Satire is a sophisticated and multifaceted form of communication that intertwines humor, irony, and
criticism to expose or mock social, political, or cultural phenomena. Unlike direct or straightforward
humor, satire often hinges on subtle linguistic and contextual cues (such as tone, exaggeration, and
implied meaning), making it difficult to detect, even for humans. This complexity is further magnified
in multimodal scenarios, where meaning is distributed across text, speech, and prosodic elements [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Misinterpreting satirical content can result in confusion or unintended consequences, particularly when
the subject matter is sensitive or highly critical. At the same time, developing systems capable of understanding
satire is increasingly relevant for applications such as media monitoring [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], misinformation detection,
or political discourse analysis. Leveraging multimodal approaches that integrate textual, acoustic, and
contextual features offers a promising path toward modeling the nuanced, context-dependent nature of
satirical discourse.
      </p>
      <p>
        Although several previous works have addressed the detection of satire in text [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], fewer have
explored multimodal approaches that incorporate audiovisual information along with text [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This
gap is especially evident in low-resource languages such as Spanish, where annotated datasets and
multimodal benchmarks remain scarce [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In this work, we address the task of automatic satire detection in Spanish as a supervised classification
problem. We have explored a range of models, including traditional machine learning approaches
such as Support Vector Machines (SVMs) and transformer-based deep learning architectures [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In addition, we evaluate several large language models (LLMs) under few-shot prompting scenarios to
assess their generalization capabilities. Our study is grounded in a novel multimodal dataset that
contains spoken content and its corresponding transcription, labeled as satirical or non-satirical.
We investigate how textual, acoustic, and contextual features can be effectively combined to improve
classification performance.
      </p>
      <p>The rest of the paper is structured as follows. In Sections 2 and 3, we introduce the task and describe
the dataset and evaluation metrics. Section 4 presents our multimodal satire classification system. In
Section 5, we report experimental results and conduct an error analysis. Finally, Section 6 concludes
the paper and discusses avenues for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task description</title>
      <p>The SatiSPeech 2025 shared task [8] at IberLEF 2025 [9] aims to address these challenges by investigating
the detection of satire in Spanish through a multimodal lens, combining textual and audio information.
Satirical communication often draws on irony, double meanings, and culturally grounded references,
which pose significant challenges for automatic systems [10]. The task is framed as a binary classification
problem: distinguishing between satirical and non-satirical content, using both textual and audio-based
cues. Central challenges include identifying which features most reliably indicate satire (e.g., linguistic
structures, prosody, intonation) and overcoming the scarcity of multimodal datasets that reflect the
diversity and authenticity of real-world satire. Through this initiative, the task seeks to advance the
capabilities of multimodal classification models and promote novel approaches to satire recognition
across languages and modalities.</p>
      <p>
        Despite recent progress, multimodal satire detection remains relatively underexplored, and most
existing research focuses on text or visual content. For example, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] introduced a system that combines
text and images to detect satire in news headlines, showing that multimodal models outperform unimodal
counterparts. Similarly, [11] evaluated the effectiveness of large language models in identifying satirical
news in Brazilian Portuguese, reporting promising results and valuable insights into the processing of
satirical language characteristics. In the Spanish-language context, research has only recently begun
to gain momentum. The development of the SatiCorpus 2021 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] marked a significant step, offering a
labeled dataset of satirical and non-satirical texts evaluated using deep learning models and linguistic
features.
      </p>
      <p>The shared task is organized into two subtasks: Text-based Satire Detection and Multimodal Satire
Detection. The first task focuses on determining whether a given text expresses satirical or non-satirical
content, relying solely on linguistic and semantic cues present in the written modality. The second task
extends this framework by incorporating audio data alongside text. Multimodal Satire Detection aims
to leverage both spoken language and its textual transcription to assess whether an audio-text pair
conveys satire, capturing prosodic, acoustic, and contextual signals that may be critical to identifying
satirical intent.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The dataset and the evaluation metrics</title>
      <p>To address the challenges of satire detection in a multimodal framework, a dedicated dataset was
curated, combining both textual and audio information. The data were sourced from a wide selection of
YouTube channels, including satirical programs such as El Intermedio, Zapeando, Homo-Zapping, and
El Mundo Today, as well as non-satirical news programs like Antena 3 Noticias, El Mundo, and BBC
News. This variety ensures a rich representation of regional Spanish dialects and stylistic diversity,
capturing the linguistic and cultural nuances necessary for the task.</p>
      <p>The dataset construction process involved video extraction, followed by segmentation into
manageable short audio clips using a speaker diarization tool [12, 13]. To maintain consistency, only segments
of no longer than 25 seconds were retained. These clips were then transcribed using Whisper [14],
allowing the generation of high-quality textual data aligned with the audio.</p>
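      <p>As a rough illustration of this pipeline (not the organizers' exact code), the segmentation and transcription steps could be reproduced with pyannote.audio [12, 13] and Whisper [14] along the following lines; the model names and the 25-second threshold come from the description above, while the file handling and clip export are assumptions:</p>
      <preformat>
# Sketch of the segmentation + transcription pipeline described above.
# Assumes pyannote.audio, openai-whisper, and pydub are installed; newer
# pyannote pipelines also require an authentication token.
import whisper
from pyannote.audio import Pipeline
from pydub import AudioSegment

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization")
asr = whisper.load_model("base")

audio = AudioSegment.from_file("video_audio.wav")
for turn, _, speaker in diarizer("video_audio.wav").itertracks(yield_label=True):
    if turn.duration > 25.0:  # keep only segments no longer than 25 seconds
        continue
    clip = audio[int(turn.start * 1000):int(turn.end * 1000)]  # ms slicing
    clip.export("clip.wav", format="wav")
    text = asr.transcribe("clip.wav", language="es")["text"]
      </preformat>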
      <p>Annotation was carried out through a semi-supervised strategy, combining automatic classification
techniques with manual validation by three expert annotators. This hybrid approach ensured both
efficiency and accuracy. A subsequent manual review by task organizers was conducted to refine
the labels and ensure the quality of the annotation. The final dataset features a broad spectrum of
Spanish-speaking regions, promoting linguistic diversity and minimizing potential regional biases.
Some examples from the dataset are shown in Table 7 in Section A.</p>
      <p>The SatirA dataset consists of approximately 25 hours of labeled audio and the corresponding
transcriptions. For the shared task, a curated subset of 8000 multimodal samples was selected and
divided into Training (6000 samples) and Test (2000 samples) sets. Each sample contains a unique
identifier, the speech signal, its transcription, and a binary label indicating whether the content is
satirical or non-satirical.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Developed Systems</title>
      <sec id="sec-4-1">
        <title>4.1. Working partitions</title>
        <p>To develop, tune, and validate our systems before submitting them to the competition, we divided the
original Training set into two stratified partitions: 5100 samples for system training (Train) and 900
samples for tuning and validation of the different approaches (Dev). Depending on the task, we used
only the text in the samples (Task 1: Text Satire Detection) or text+audio (Task 2: Multimodal Satire
Detection) to develop the different systems. Table 2 shows the distribution of satirical and non-satirical
samples within the two partitions. The partitions are the same for Tasks 1 and 2.</p>
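        <p>A minimal sketch of this stratified split, assuming the data are loaded in a pandas DataFrame with a label column (the random seed is an illustrative assumption):</p>
        <preformat>
# Stratified 5100/900 Train/Dev split of the official Training set.
from sklearn.model_selection import train_test_split

train_df, dev_df = train_test_split(
    training_df,                    # 6000-sample official Training set
    test_size=900,                  # 900 for Dev, 5100 for Train
    stratify=training_df["label"],  # preserve the satire/no-satire ratio
    random_state=42,
)
        </preformat>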
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task 1: Text Satire Detection</title>
        <p>For the first downstream task, we developed several text satire detection models using different
transformer-based language models and the Low-Rank Adaptation technique (LoRA) [15]. We employed
LoRA because the number of training samples is scarce, and LoRA reduces the risk of overfitting
compared to regular fine-tuning. We applied LoRA to the query and value modules, with the dropout
fixed at 0.05. We selected different base models, monolingual and multilingual, publicly available on the
Hugging Face Hub [16], that were pre-trained for the Spanish language.</p>
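        <p>A minimal sketch of this LoRA setup using the PEFT library; the query/value target modules and the 0.05 dropout are stated above, while the rank and scaling factor shown here are placeholders that the hyperparameter search later selects:</p>
        <preformat>
# Illustrative LoRA configuration for sequence classification with PEFT.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased", num_labels=2)  # e.g., System T1-1

lora_cfg = LoraConfig(
    task_type="SEQ_CLS",
    target_modules=["query", "value"],  # attention query/value projections
    lora_dropout=0.05,
    r=16,            # rank: searched between 8 and 32
    lora_alpha=32,   # scaling factor: searched between 16 and 64
)
model = get_peft_model(base, lora_cfg)  # only LoRA weights are trainable
        </preformat>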
        <p>For this approach, we developed the following four systems:
• System T1-1: The first system was based on the monolingual Spanish model BETO [17] (https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased). BETO is a BERT model pre-trained on large-scale Spanish corpora.
• System T1-2: The second system was based on the monolingual Spanish model BERTIN [18] (https://huggingface.co/bertin-project/bertin-roberta-base-spanish). BERTIN is a RoBERTa model pre-trained on Spanish texts.
• System T1-3: This system was based on the multilingual model XLM-RoBERTa [19] in its base version (https://huggingface.co/FacebookAI/xlm-roberta-base). XLM-RoBERTa is a multilingual version of RoBERTa, pre-trained on data covering 100 languages, including Spanish.
• System T1-4: This system was based on the large version of the monolingual Spanish model RoBERTa-BNE [20] (https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne). RoBERTa-BNE was pre-trained on data from the web crawls performed by the National Library of Spain from 2009 to 2019.</p>
        <p>For the fine-tuned systems, we performed a hyperparameter search using the Optuna library [21].
Hyperparameters such as the learning rate, the batch size, and the LoRA rank and scaling factor were
set through cross-validation, dividing the training set into five folds. For each model, ten trials were
performed.</p>
        <p>The search for the learning rate was log-uniform between 1e-5 and 1e-3, while the batch size was
chosen among the values 8, 16, 32, and 64. For the LoRA hyperparameters, an integer search was
performed for both: the rank over values from 8 to 32 with steps of 8, and the scaling factor over values
from 16 to 64 with steps of 16. The hyperparameter values found by Optuna for each system are
reported in Table 3.</p>
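        <p>A minimal sketch of this search space in Optuna; train_and_eval_cv is a hypothetical helper that runs the five-fold cross-validation with the sampled hyperparameters and returns the mean validation score:</p>
        <preformat>
# Optuna search space matching the ranges described above (ten trials).
import optuna

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [8, 16, 32, 64]),
        "lora_r": trial.suggest_int("lora_r", 8, 32, step=8),
        "lora_alpha": trial.suggest_int("lora_alpha", 16, 64, step=16),
    }
    return train_and_eval_cv(params, n_folds=5)  # hypothetical CV helper

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
        </preformat>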
        <p>We also developed systems based on LLMs and in-context learning prompting. We followed a 6-shot
approach and tried different sets of random examples from the training set. Fig. 1 shows the prompt
template and an example of how the template would look when instantiated.</p>
        <p>[Figure 1. Prompt template (in Spanish). The prompt tells the model that its task is to classify a Spanish text as 'satire' or 'no-satire'; defines satire as a type of discourse that uses humor, irony, exaggeration, or absurdity to criticize or ridicule people, institutions, or social situations, often appearing serious while actually seeking to provoke reflection or mockery; instructs the model to respond ONLY with 'satire' or 'no-satire', to analyze the tone, content, and purpose of the text, and to consider signs of exaggeration, irony, or disguised criticism; lists six labeled example texts; and ends with the text to classify followed by 'Clasificación:'.]</p>
        <p>We explored various LLMs for satire detection in text. All models were selected based on their support
for Spanish and/or their few-shot reasoning capabilities. We developed three systems based on LLMs
and prompting (a minimal inference sketch is shown after the list):
• System T1-5: The fifth system was based on the Qwen2.5 family of LLMs [22], specifically on the 7-billion-parameter instruct version (https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). Qwen2.5 is a versatile family of LLMs developed by Alibaba Cloud, designed to handle a wide range of natural language processing tasks.</p>
        <p>• System T1-6: In this system, we employed the 8-billion-parameter instruct version of the Llama-3.1 LLMs [23] (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). Llama 3.1 was developed by Meta and represented a significant leap in capability, scale, and multilingual support compared to previous Llama versions.
• System T1-7: In the seventh system, we employed the 7-billion-parameter instruct version of Mistral [24] (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3). The Mistral model was developed by the Mistral AI team, and it offers a balance between performance and efficiency, making it suitable for a wide range of applications, including those with limited computational resources.</p>
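        <p>A minimal inference sketch for these prompting-based systems, assuming a hypothetical build_prompt helper that fills the Fig. 1 template with the six labeled examples and the text to classify:</p>
        <preformat>
# 6-shot classification with an instruct LLM via greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # T1-5; swap for Llama-3.1 or Mistral
tok = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto",
                                           device_map="auto")

def classify(text, examples):
    messages = [{"role": "user", "content": build_prompt(text, examples)}]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt").to(llm.device)
    with torch.no_grad():
        out = llm.generate(inputs, max_new_tokens=5, do_sample=False)
    answer = tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)
    return "no-satire" if "no-satire" in answer.lower() else "satire"
        </preformat>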
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Task 2: Multimodal Satire Detection</title>
        <p>A multimodal approach combining both audio and transcription analysis was pursued. The goal
was to obtain, for each sample, an embedding representative of its audio and a different embedding
representative of its text. Both embeddings were then concatenated and used as the input of a
classifier that performed the classification task.</p>
        <p>To obtain the audio embeddings, we employed three different models: (1) OpenAI's Whisper [25]
in its base version (https://openai.com/index/whisper/), (2) the base version of Meta's Wav2Vec2 [26]
(https://huggingface.co/facebook/wav2vec2-base), and (3) the base version of Meta's HuBERT [27]
(https://huggingface.co/facebook/hubert-base-ls960). Although Wav2Vec2 and HuBERT directly
output the desired embeddings, Whisper does not, so we had to extract the embeddings from the output
of its encoder. For the text embeddings, we used mean pooling over the last hidden state of
XLM-RoBERTa to extract the contextual embeddings of the text; we employed the base model version.</p>
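        <p>A minimal sketch of this embedding extraction, assuming 16 kHz mono waveforms; the attention-masked mean pooling details are our assumption:</p>
        <preformat>
# Mean-pooled text (XLM-RoBERTa) and audio (Wav2Vec2) embeddings.
import torch
from transformers import (AutoModel, AutoTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

tok = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
txt_model = AutoModel.from_pretrained("FacebookAI/xlm-roberta-base").eval()
fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

@torch.no_grad()
def text_embedding(text):
    enc = tok(text, return_tensors="pt", truncation=True)
    hidden = txt_model(**enc).last_hidden_state        # (1, T, 768)
    mask = enc["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1))[0]   # (768,)

@torch.no_grad()
def audio_embedding(waveform):
    inputs = fe(waveform, sampling_rate=16000, return_tensors="pt")
    frames = wav_model(inputs.input_values).last_hidden_state  # (1, T', 768)
    return frames.mean(dim=1)[0]                               # (768,)

# Multimodal feature vector: concatenation of both embeddings.
# x = torch.cat([audio_embedding(wave), text_embedding(text)])
        </preformat>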
        <p>Then, different types of classifiers, such as Multi-Layer Perceptron (MLP), Support Vector Classifier
(SVC), Linear SVC (lSVC), Gradient Boosting classifier (GB), and k-Nearest Neighbors classifier (k-NN), were
trained and evaluated; all classifiers were trained using the Scikit-learn library for machine learning
in Python [28]. A grid search using 5-fold cross-validation was performed to find optimal
values for the parameters of each of these classifiers.</p>
        <p>For Task 2, we developed several multimodal classification systems by varying both the audio
embedding extraction methods and the classification techniques. These systems were trained and fine-tuned
using the Train and Dev partitions. In total, we implemented seven distinct multimodal classifiers (a minimal classifier sketch follows the list):
• System T2-1: This system extracts the audio embedding from the encoder module of Whisper, and the text embedding is extracted using XLM-RoBERTa. These embeddings are concatenated and passed to an MLP classifier.
• System T2-2: This system employs Wav2Vec2 for the audio embedding and XLM-RoBERTa for the text. These embeddings are concatenated and passed to an MLP classifier.
• System T2-3: This system employs the HuBERT model for the audio embedding and XLM-RoBERTa for the text. These embeddings are concatenated and passed to an MLP classifier.
• System T2-4: Same as T2-2 (Wav2Vec2 and XLM-RoBERTa), but it uses an SVC for the classification task.
• System T2-5: Same as T2-2, but it uses a Linear SVC for the classification task.
• System T2-6: Same as T2-2, but it uses Gradient Boosting for the classification task.
• System T2-7: Same as T2-2, but it uses k-Nearest Neighbors for the classification task.</p>
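        <p>A minimal classifier sketch for systems T2-1 to T2-3; the hidden layer size is an assumption, since no hyperparameter search was performed for the MLP-based systems:</p>
        <preformat>
# Concatenated audio+text embeddings fed to a Scikit-learn MLP.
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.hstack([audio_embs_train, text_embs_train])  # (n, 768 + 768)
X_dev = np.hstack([audio_embs_dev, text_embs_dev])

clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
dev_preds = clf.predict(X_dev)
        </preformat>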
        <p>We also explored the possibility of increasing the amount of data, changing the way the audio
embedding was extracted, and applying noise on the audio side. We obtained three more systems (the augmentation step of T2-10 is sketched after this list):
• System T2-8: This system is identical to T2-4 (Wav2Vec2 and XLM-RoBERTa with SVC for classification). However, it was trained with the 6000 samples available in the Training set provided by the organizers of the shared task.
• System T2-9: This system uses the same components as T2-4, but it uses attention pooling for audio embedding extraction. This system was also trained with the Training set of the shared task.
• System T2-10: This system is identical to T2-9. In addition to training it with the 6000 available samples, we introduce a new synthetic sample per original sample by randomly introducing volume variation, white noise, frequency masking, and time masking in the audio to improve the robustness of the system. We trained the system with 12,000 samples.</p>
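        <p>A minimal sketch of the augmentation step used in T2-10; the gain range, noise level, and mask widths are assumptions, and frequency/time masking is shown on an STFT so the result can be inverted back to a waveform:</p>
        <preformat>
# One synthetic copy per original clip: volume variation, white noise,
# and SpecAugment-style frequency/time masking.
import torch

def augment(wave, n_fft=400):
    win = torch.hann_window(n_fft)
    wave = wave * torch.empty(1).uniform_(0.7, 1.3)        # volume variation
    wave = wave + 0.005 * torch.randn_like(wave)           # white noise
    spec = torch.stft(wave, n_fft, window=win, return_complex=True)
    f0 = int(torch.randint(0, spec.shape[0] - 10, (1,)))   # frequency masking
    spec[f0:f0 + 10, :] = 0
    t0 = int(torch.randint(0, spec.shape[1] - 5, (1,)))    # time masking
    spec[:, t0:t0 + 5] = 0
    return torch.istft(spec, n_fft, window=win, length=wave.shape[-1])
        </preformat>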
        <p>Although there was no hyperparameter optimization for the MLP-based systems, all other systems
were optimized by performing a grid search over their parameters using 5-fold cross-validation.
For SVCs, the penalty parameter (C) is chosen among [0.1, 1, 10], and the gamma parameter is chosen
between 'scale' and 'auto'. For linear SVCs, the penalty parameter is chosen among the same values,
the tolerance among [0.1, 0.01, 0.001], and the loss function between 'hinge' and 'squared_hinge'.
For Gradient Boosting classifiers, the number of estimators is chosen between 100 and 300, the learning
rate between 0.1 and 0.05, and the maximum depth between 3 and 5. For k-Nearest Neighbors classifiers,
the number of neighbors is chosen among [3, 5, 7], the weights can be uniform or distance-based, and
the distance metric can be either Euclidean or Manhattan. The hyperparameter values found by the
grid search for each system are reported in Table 4.</p>
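        <p>A minimal sketch of this grid search for the SVC-based systems; the parameter grid mirrors the values above, while the scoring metric and inputs are assumptions:</p>
        <preformat>
# 5-fold grid search over the SVC hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)  # concatenated audio+text embeddings
print(search.best_params_, search.best_score_)
        </preformat>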
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental results and discussion</title>
      <p>This section presents the results of our systems on both the development (Dev) partition (defined in
Section 4.1) and the official Test set provided for the shared task.</p>
      <sec id="sec-5-1">
        <title>5.1. Task 1: Text Satire Detection</title>
        <p>Compared to the baseline system T1-Bl, all of our systems except T1-6 outperformed it,
indicating that modern transformer models are effective for satire detection.</p>
        <p>Analyzing the results of the models obtained with LoRA fine-tuning, we observe that these systems
obtained the best performance on Dev compared to the prompting-based systems. However,
their performance dropped significantly on the Test set compared to the prompt-based systems. This
suggests that LoRA-based models may generalize less effectively than larger, prompt-based models.
Despite the drop, the best overall result was achieved by a LoRA system (T1-3), closely followed by a
prompt-based system (T1-5), which interestingly performed better on the Test set than on Dev. These
results led us to achieve first place in the text-based satire detection task.</p>
        <p>Notably, although this task is conducted on Spanish texts, the monolingual models T1-1, T1-2, and
T1-4 presented lower performance on the Test set than some multilingual models (T1-3 and T1-5).
This may suggest that the Test samples included foreign expressions or multilingual cues that were
better handled by multilingual models.</p>
        <p>Fig. 2 visually summarizes the results from Table 5, with systems ordered by Test performance. The
performance gap between Dev and Test is more pronounced in monolingual-based systems than in
multilingual ones. As stated before, the multilingual model T1-5 even improves from Dev to Test.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Task 2: Multimodal Satire Detection</title>
        <p>Table 6 shows the results for the second task. T2-Bl is the baseline provided by the organizers. The
baseline relies on MFCC features extracted from the audio, concatenated with the textual features from
Task 1, and uses a Support Vector Classifier (SVC) for classification [8]. Systems T2-1 to T2-7 were
trained using our internal Train partition (as defined in Section 4.1), while T2-8 to T2-10 were trained
on the official Training set provided by the shared task organizers. Consequently, no Dev results are
available for T2-8 to T2-10.</p>
        <p>All our developed systems outperform the baseline system on the Test set. Among the systems
trained on our internal Train partition, the combination of Wav2Vec2 with XLM-RoBERTa (system
T2-2) performed better than the other two audio models: Whisper (T2-1) and HuBERT (T2-3). This suggests
that Wav2Vec2 provides more robust acoustic representations in this context.</p>
        <p>When comparing classification strategies, T2-4 (SVC) outperformed T2-2 (MLP), even with the same
audio and text embeddings, indicating that SVC may offer better generalization or be more stable for
smaller training sets. Other classical classifiers, such as Linear SVC (T2-5), Gradient Boosting (T2-6),
and k-Nearest Neighbors (T2-7), showed lower performance than SVC and MLP, especially on the Test set,
suggesting limited suitability for this multimodal setup.</p>
        <p>The highest overall performance was obtained with system T2-8, trained on the full 6000 samples
of the official Training set. This confirms that increasing the training data size leads to significant
performance improvements, especially for high-capacity models and multimodal architectures.</p>
        <p>Regarding the last two systems (T2-9 and T2-10), both of which build upon T2-8, we
experimented with audio attention pooling (AAP) and data augmentation. However, applying AAP in T2-9
slightly decreased performance, and combining AAP with audio data augmentation in T2-10 led to a
further drop. We hypothesize that attention pooling may reduce generalization due to the limited size
of the training data, and that the added noise from augmentation did not introduce sufficient variability to
be beneficial. Further investigation is needed to better understand these results and refine the use of
AAP in future iterations.</p>
        <p>As can be observed in Table 6, with system T2-8 we achieved the second position in the
competition. This system achieved a relative performance of 97.85% with respect to the best result
in Task 2 (an F1-score of 0.8834), i.e., an F1-score of approximately 0.864, demonstrating the
competitiveness of our approach.</p>
        <p>Fig. 3 visually summarizes the results of Table 6, with systems sorted by Test performance. For
the systems evaluated on both Dev and Test, there is a noticeable performance reduction from Dev
to Test. Systems T2-5 and T2-6 stand out: they were promising on the development results, but they
did not generalize properly on Test. Therefore, Dev results should always be taken with caution when
making decisions during system development.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future work</title>
      <p>This study proposes various solutions to both tasks of the SatiSPeech 2025 contest [8], analyzing
samples containing both text and audio in order to classify them into a "satire" or a "no-satire" category.
Task 1 required analyzing only the transcriptions of the samples, while Task 2 allowed analyzing both
the audio and the transcription of each sample.</p>
      <p>For Task 1, we adapted various pre-trained language models, both Spanish-specialized and
multilingual, to the downstream task using LoRA, given the relatively small number of samples. We also explored
LLMs and prompting techniques to study whether these models could reach the performance of the
fine-tuned ones. For Task 2, we used both audio and text deep learning models to extract
representative embeddings. These embeddings were then used to train various types of classifiers, such
as Multi-Layer Perceptrons or Support Vector Classifiers.</p>
      <p>These methodologies have helped us achieve first place in Task 1 by fine-tuning a multilingual
classification model and second place in Task 2 by using a support vector machine to classify samples,
which was a very favorable outcome for our approaches to the proposed problem.</p>
      <p>For future research, it would be worthwhile to try different forms of data augmentation in order
to reach better results with larger models. The close results between prompting and fine-tuning also
make a compelling case for expanding the hyperparameter search for the fine-tuned systems in Task 1,
especially considering the differences between fine-tuned models observed on the test set. Additionally,
tuning the prompts using alternative samples could further improve performance. Another future line of
research is to test ensemble methods, that is, to combine the strengths of fine-tuned and
prompting-based models, or to integrate multiple classifiers in Task 2. Finally, given the superior performance of
multilingual models, exploring how these systems handle cross-lingual or code-switched satire could
provide useful insights for broader applications.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Ethics Statement</title>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We did not use any data beyond those provided by the competition. The pre-trained models used were
obtained from the Hugging Face model hub, under the Apache License 2.0, except for OpenAI's Whisper,
which was obtained using its own official Python package.</p>
      <p>This work is partially supported by MCIN/AEI/10.13039/501100011033 and "ERDF A way of making
Europe" under grant PID2021-126061OB-C41. Partially supported by the Vicerrectorado de Investigación
de la Universitat Politècnica de València PAID-01-23. It is also partially supported by the Spanish
Ministerio de Universidades under the grant FPU21/05288 for university teacher training.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling checking. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Examples from the SatirA dataset</title>
      <p>Table 7 shows some example transcriptions from the SatirA dataset:</p>
      <p>"Por las dos cosas se te saltan las lágrimas. Tú dirás. ¿Para qué te quieres ir, hombre? Si aquí estamos genial. Lo que dices es que esta mañana casi me pica un alacrán. ¿Un alacrán? ¡Me río yo! Si te pica un alacrán, puedes morir después de agonizar durante tres días entre dolores insoportables. Pero conozco bichos peores."</p>
      <p>"Tras un estudio con 18 adultos, científicos de la Universidad McGill en Canadá descubrieron que un par de juegos en Tetris puede resultar útil a la hora de tratar la ambliopía. Según los investigadores, esto demostró ser más eficaz que el tradicional método de cubrir el ojo sano con un parche."</p>
      <p>"En mis 40 años de política no he dejado una promesa sin cumplir. Dije que cenábamos anchoucas y cenamos anchoucas. ¡No hombre, no hombre, no! Pero Miguel Ángel, por el amor de Dios, ¿qué haces? Si hasta tenía el palo ya para hacer el espeto. ¡Déjate de doradas! No me explico que no pique ni una anchoa."</p>
      <p>"No, me parece estupendo, me parece estupendo. Yo creo que cuanto más empleos creemos, mejor. Lo que me parece es... Señor Rodríguez, usted hace las preguntas de una manera curiosísima, que es, yo le escucho, entonces yo quiero responder y usted pasa de mi respuesta. Pero si yo le pregunto en siete segundos. Miguel Ángel, no le interrumpas."</p>
      <p>"Usted tarda cinco minutos y medio en contestar algo que todavía no ha entendido lo del papel. Pero es que no le dejas, es cierto que no le dejas hablar. Lo del papel espero leerlo en Twitter, porque aquí no lo he entendido."</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <article-title>Cultural differences in humor perception, usage, and implications</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>10</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Conroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cornwell</surname>
          </string-name>
          ,
          <article-title>Fake News or Truth? Using Satirical Cues to Detect Potentially Misleading News</article-title>
          , in: Proceedings of the Second Workshop on Computational Approaches to Deception Detection, Association for Computational Linguistics, San Diego, California,
          <year>2016</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>17</lpage>
          . URL: https://aclanthology.org/W16-0802/. doi:10.18653/v1/W16-0802.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Burfoot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <article-title>Automatic Satire Detection: Are You Having a Laugh?</article-title>
          ,
          <source>in: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Association for Computational Linguistics</source>
          , Suntec, Singapore,
          <year>2009</year>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>164</lpage>
          . URL: https://aclanthology.org/P09-2041/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Reyes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Veale</surname>
          </string-name>
          ,
          <article-title>A multidimensional approach for detecting irony in Twitter</article-title>
          ,
          <source>Lang Resources &amp; Evaluation</source>
          <volume>47</volume>
          (
          <year>2013</year>
          )
          <fpage>239</fpage>
          -
          <lpage>268</lpage>
          . URL: https://doi.org/10.1007/s10579-012-9196-x.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Broniatowski</surname>
          </string-name>
          ,
          <article-title>A multi-modal method for satire detection using textual and visual cues</article-title>
          ,
          <source>in: Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship</source>
          , Disinformation, and Propaganda,
          <source>International Committee on Computational Linguistics (ICCL)</source>
          , Barcelona,
          <source>Spain (Online)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>García-Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Valencia-García</surname>
          </string-name>
          ,
          <article-title>Compilation and evaluation of the Spanish SATICorpus 2021 for satire identification using linguistic features and transformers</article-title>
          ,
          <source>Complex &amp; Intelligent Systems</source>
          <volume>8</volume>
          (
          <year>2022</year>
          )
          <fpage>1723</fpage>
          -
          <lpage>1736</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Neural Information Processing Systems</source>
          , NIPS'17, Curran Associates Inc., Red Hook, NY, USA,
          <year>2017</year>
          , p.
          <fpage>6000</fpage>
          -
          <lpage>6010</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Examples from the SatirA dataset 2025</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>