<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UTP at SatiSPeech-IberLEF 2025: FastText and Wav2Vec2 for Satire Detection in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denis Cedeño-Moreno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Vargas-Lombardo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alan Delgado-Herrera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Tecnológica de Panamá</institution>
          ,
          <addr-line>Ciudad de Panamá</addr-line>
          ,
          <country country="PA">Panamá</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper describes the UTP team's participation in the SatiSPeech 2025 shared task, which addresses the challenge of detecting satirical content in Spanish using both textual and multimodal (text + audio) data. We implemented two lightweight and interpretable classification pipelines. For Task 1 (text-only satire detection), we combined FastText embeddings with a Random Forest classifier. For Task 2 (multimodal satire detection), we extended this architecture by adding acoustic embeddings extracted via Wav2Vec2. Our systems were evaluated on the official test set. The text-based model achieved a macro F1 score of 77.93%, while the multimodal approach scored slightly lower at 76.48%. Our team ranked 11th out of 11 participants in both tasks and did not outperform the baseline. Nevertheless, the results highlight the potential of simple architectures to provide stable performance and interpretability. We analyze class-wise behavior and the contribution of audio features to satire detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Speech Recognition</kwd>
        <kwd>Automatic Satire Recognition</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>FastText</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Satire is a nuanced and context-sensitive communicative genre that presents significant challenges
for automatic classification systems. Unlike conventional humor, it often delivers implicit social or
political critique through rhetorical strategies such as irony, parody, and exaggeration. Accurately
interpreting satire demands not only linguistic analysis but also an understanding of sociocultural
context and speaker intent. These elements are embedded both in the textual content and in prosodic
features of speech—such as intonation, pitch, and rhythm—which shape how messages are perceived.
When satire is expressed through multimodal formats that combine text and audio, the interpretive
complexity increases, requiring models capable of jointly processing both modalities to detect subtle
cues effectively [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In complex multimodal scenarios where textual, auditory, and visual elements are combined, the
challenge of interpreting satire becomes more pronounced. Satirical communication depends not only
on linguistic content, but also on paralinguistic and visual cues, such as intonation, facial expressions,
body language, and temporal synchronization. These elements play a critical role in conveying ironic
intent and guiding audience interpretation. Therefore, satire detection systems must be capable of
coherently fusing and reasoning over heterogeneous data sources. Recent advances in multimodal
deep learning demonstrate that models integrating textual and auditory modalities through joint or
coordinated representation learning are effective at capturing the nuances of nonliteral language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In recent years, interest in automatic satire detection has increased due to its relevance in various
applications, including mitigating misinformation, moderating content, and analyzing media. On digital
platforms, satirical content is often confused with factual information, which increases the risk of
misinterpretation and unintentionally spreading misleading narratives. Reliable satire detection systems
are especially essential in multilingual and culturally diverse environments, where linguistic nuances
and socio-political contexts further complicate interpretation.</p>
      <p>
        Traditionally, satire detection has focused primarily on textual analysis, often using transformer-based
language models trained on datasets from news articles and social media. However, a significant
amount of satirical content is communicated through spoken media, such as television shows, podcasts,
and video sketches, where vocal delivery and prosodic cues are integral to meaning. Despite this,
benchmarks for multimodal satire classification remain scarce. Previous research in related areas
such as sarcasm and irony detection has shown that combining textual input with audio or visual
features tends to outperform text-only models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In the Spanish language context, recent work has
explored satire detection from text alone, providing foundational datasets and baselines that highlight
the linguistic complexity of the task [4].
      </p>
      <p>The SatiSPeech shared task [5] at IberLEF 2025 [6] introduces a novel benchmark for multimodal
satire recognition in Spanish. The task is divided into two subtasks: (1) satire classification based on text
alone, and (2) multimodal classification combining aligned text and audio input. The dataset includes
content from well-known Spanish satirical programs.</p>
      <p>To address these subtasks, we developed models based on the approach used in the EmoSpeech task at IberLEF 2024
[7], which also featured a multimodal setup involving audio and text for emotion detection. In this case,
we applied similar lightweight and interpretable machine learning pipelines for both tasks. For Task
1 (text-only classification), we generated sentence embeddings using FastText [8] and fed them into
a Random Forest classifier with hyperparameters tuned by cross-validation. For Task 2 (multimodal
classification), we extended this architecture by incorporating acoustic features extracted with Wav2Vec2
[9], a pre-trained speech representation model. Each instance was represented as a concatenation of
textual and audio embeddings, and a new Random Forest classifier was trained on these multimodal
vectors. Both models were evaluated using the official validation splits, and performance was measured
using standard classification metrics such as precision, recall, and F1-score.</p>
      <p>The remainder of this paper is organized as follows: Section 2 reviews previous research on satire
detection, multimodal classification, and the use of pre-trained embeddings such as FastText and
Wav2Vec2. Section 3 presents an overview of the SatiSPeech shared task and describes the dataset used
in both tasks. Section 4 outlines our modeling strategies for unimodal (text-only) and multimodal (text
+ audio) satire detection. Section 5 reports and analyzes the experimental results obtained from the
validation set. Finally, Section 6 concludes the paper and discusses potential directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>Satire detection is a subfield of computational humor and figurative language processing that has
attracted growing interest in recent years. Early work focused predominantly on textual data, using
handcrafted linguistic features and classical classifiers [10, 11]. With the advent of deep learning, neural
models such as CNNs, LSTMs, and, more recently, transformers like BERT and RoBERTa have been
applied with improved results.</p>
      <p>In the Spanish language domain, datasets such as SatiCorpus [4] have supported progress in automatic
satire detection. These corpora typically involve news and social media sources and have been used to
benchmark monomodal text-based approaches.</p>
      <p>
        The use of pre-trained word embeddings such as FastText [8] has become standard due to their ability
to capture subword-level information, which is particularly useful for morphologically rich languages
like Spanish. Our work builds upon this by incorporating FastText to encode textual input efficiently
and interpretably. Beyond text, recent research has explored multimodal strategies, combining visual or
acoustic data with textual features to capture subtleties in tone, prosody, or facial expression [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In
related tasks such as sarcasm and emotion detection, models like Wav2Vec2 [9] have demonstrated
the value of acoustic representations, even in low-resource settings. Emotion recognition tasks such
as EmoSpeech [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have also shown that simple audio-text fusion pipelines can yield competitive
performance with relatively low computational cost.
      </p>
      <p>While there is limited prior work on multimodal satire detection in Spanish, the SatiSPeech shared
task [5] introduces an important benchmark for evaluating such systems. Our approach contributes
to this area by offering an interpretable, lightweight baseline that combines FastText and Wav2Vec2
without relying on deep fine-tuning.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Description</title>
      <p>For training and evaluation, participants were provided with the SatirA dataset, a curated collection
of Spanish-language audio segments sourced from online media. Satirical content was selected from
various comedic programs, while non-satirical data was drawn from broadcast news sources. The
dataset spans multiple dialects and speaking styles to enhance linguistic diversity and reduce potential
regional bias.</p>
      <p>Segments were generated using automatic diarization, excluding clips longer than 25 seconds.
Transcriptions were produced using Whisper ASR, and labels were refined through a semi-supervised
annotation process involving expert verification to ensure high-quality ground truth.</p>
      <p>The full dataset comprises approximately 25 hours of labeled content. Predefined training and
validation splits were provided by the organizers, ensuring a stratified distribution of labels. All
development and evaluation adhered to the official task guidelines. Table 1 shows the distribution of
training and validation samples by class.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Distribution of training and validation samples by class.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Split</th><th>Class</th></tr>
          </thead>
          <tbody>
            <tr><td>Train</td><td>Non-satirical</td></tr>
            <tr><td>Train</td><td>Satirical</td></tr>
            <tr><td>Validation</td><td>Non-satirical</td></tr>
            <tr><td>Validation</td><td>Satirical</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Task 1: Text-Based Satire Detection</title>
        <p>For Task 1, we designed a pipeline that uses the FastText model to generate fixed-length sentence
embeddings from each transcript. FastText was chosen for its computational efficiency and its ability
to capture subword information, an advantage in morphologically rich languages such as Spanish.
The resulting embeddings were used as input features for a Random Forest (RF) classifier. We trained
the RF model using stratified cross-validation and optimized its hyperparameters based on validation
performance. This approach focuses on extracting semantic representations from the text, while
benefiting from the interpretability and robustness offered by ensemble learning methods.</p>
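        <p>The following Python sketch illustrates this pipeline. It is a minimal sketch, not the exact competition code: the FastText model file, data variables, and the hyperparameter grid are illustrative assumptions.</p>
        <preformat>
# Minimal sketch of the Task 1 pipeline: FastText sentence embeddings
# fed to a Random Forest tuned with stratified cross-validation.
# Model path, data variables, and grid values are assumptions.
import fasttext
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

ft = fasttext.load_model("cc.es.300.bin")  # pre-trained Spanish vectors (assumed path)

def embed_texts(texts):
    # get_sentence_vector averages subword-aware word vectors into a
    # fixed-length (300-d) sentence representation per transcript.
    return np.vstack([ft.get_sentence_vector(t.replace("\n", " ")) for t in texts])

X_train = embed_texts(train_texts)  # train_texts, train_labels assumed loaded
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_depth": [None, 20]},
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, train_labels)
clf = search.best_estimator_
        </preformat>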
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task 2: Multimodal Satire Detection</title>
        <p>For Task 2, we extended the pipeline developed for Task 1 by incorporating prosodic and acoustic
information from the audio stream. To do this, we used the pre-trained Wav2Vec2 model
facebook/wav2vec2-large-xlsr-53-spanish (https://huggingface.co/facebook/wav2vec2-large-xlsr-53-spanish),
a transformer-based architecture designed for self-supervised learning of speech representations. From
each audio file, we extracted a fixed-length embedding by taking the output of the model’s first hidden
layer. This audio
representation was then concatenated with the corresponding FastText-based text embedding to form a
unified multimodal feature vector for each instance. A second Random Forest classifier was trained on
these combined vectors, with parameters adjusted to account for the increased input dimensionality.
This architecture allows the system to capture both linguistic and acoustic cues relevant to satirical
expression, such as irony conveyed through tone, rhythm, or timing.</p>
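        <p>A minimal sketch of the audio branch and the concatenation step follows. The checkpoint and the use of the first hidden layer follow the description above; the mean-pooling over time, file handling, and variable names are illustrative assumptions.</p>
        <preformat>
# Sketch of the Task 2 audio embedding and naive feature concatenation.
import librosa
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_ID = "facebook/wav2vec2-large-xlsr-53-spanish"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
w2v = Wav2Vec2Model.from_pretrained(MODEL_ID).eval()

def audio_embedding(path):
    # Wav2Vec2 expects 16 kHz mono input; librosa resamples on load.
    wave, _ = librosa.load(path, sr=16000, mono=True)
    inputs = extractor(wave, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = w2v(**inputs, output_hidden_states=True)
    # hidden_states[1] is the first transformer layer; mean-pooling over
    # time yields one fixed-length (1024-d) vector per clip (pooling
    # choice is our assumption).
    return out.hidden_states[1].mean(dim=1).squeeze(0).numpy()

# Fuse modalities by simple concatenation, as described above
# (text_vec is the FastText embedding of the matching transcript).
x_multimodal = np.concatenate([text_vec, audio_embedding("clip.wav")])
        </preformat>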
        <p>All models were implemented using scikit-learn, and feature extraction was performed with
fastText, librosa, and the transformers library. Both tasks were evaluated using standard
classification metrics on the predefined validation split.</p>
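        <p>Evaluation on the predefined validation split can then be sketched as follows, assuming X_val and y_val hold the validation features and gold labels; the label ordering in target_names must match the label encoding used during training.</p>
        <preformat>
# Report precision, recall, and F1 per class plus the macro F1 used
# for ranking. Variable names are illustrative assumptions.
from sklearn.metrics import classification_report, f1_score

y_pred = clf.predict(X_val)
print(classification_report(y_val, y_pred, target_names=["no-satire", "satire"]))
print("macro F1:", f1_score(y_val, y_pred, average="macro"))
        </preformat>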
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We evaluated our models using the official validation split released by the SatiSPeech organizers.
Performance was assessed using standard classification metrics, including precision, recall, and F1-score,
for both tasks. Our approach combines pre-trained embeddings—FastText for textual data and Wav2Vec2
for acoustic signals—with RF classifiers to perform the final binary classification.</p>
      <sec id="sec-5-1">
        <title>5.1. Task 1: Text-Based Satire Detection</title>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Task 2: Multimodal Satire Detection</title>
        <p>As shown in Table 3, our approach for Task 2 achieved a macro-averaged F1-score of 80.81%. The
integration of audio information via Wav2Vec2 embeddings provided a moderate performance boost
compared to Task 1, particularly improving recall balance across classes.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Leaderboard Ranking</title>
        <p>According to the official SatiSPeech 2025 rankings, the UTP system ranked 11th out of 11 in both
tasks, as shown in Tables 4 and 5. In Task 1 (text-only), our macro F1-score of 77.93% was slightly lower
than the baseline system (79.37%), while in Task 2 (multimodal), our score of 76.48% was also lower
than the baseline (79.92%).</p>
        <p>Interestingly, our multimodal approach yielded lower results than the text-only model, suggesting
that the integration of modalities introduced inconsistencies in classification. Despite this, our models
demonstrated stable behavior and competitive performance, thanks to their simplicity and lack of
task-specific fine-tuning. These findings highlight the viability of interpretable, low-complexity architectures
in exploratory or resource-constrained scenarios.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Result Analysis</title>
        <p>Our findings indicate that both text-only and multimodal configurations can detect satirical content
with reasonable accuracy. However, several patterns emerge upon closer inspection of validation results.</p>
        <p>In both tasks, the model performs better on the non-satirical class. For Task 1, recall for non-satirical
content reached 0.854, compared to 0.738 for satirical content. This disparity likely reflects the more
consistent lexical and structural patterns of journalistic language, which are easier to model. Satirical
expressions, in contrast, tend to be more diverse and nuanced.</p>
        <p>The addition of Wav2Vec2-based audio features led to a modest improvement in macro F1-score
(from 79.8% to 80.8%). This underscores the value of prosodic elements such as tone and intonation in
identifying satirical intent. However, the overall gain was limited, possibly due to:
• The use of naive feature concatenation instead of interaction-aware fusion methods.
• The general-purpose nature of the Wav2Vec2 pretraining, which may not capture satire-specific
prosody.
• The Random Forest classifier’s limitations in handling high-dimensional, heterogeneous feature
spaces.</p>
        <p>Validation results aligned with development observations, suggesting good generalization and
minimal overfitting—likely due to the use of pre-trained embeddings and ensemble methods.</p>
        <p>Although our final ranking was lower than that of top-performing deep learning systems, the UTP
pipeline represents a strong, interpretable baseline. Its simplicity, modularity, and resource efficiency
make it suitable for early-stage experimentation and comparative evaluations. Future work will focus
on improving multimodal fusion, leveraging task-adaptive fine-tuning, and expanding the feature space
to better capture the complexity of satirical discourse.</p>
        <p>To gain a deeper understanding of the system’s limitations, we analyzed several misclassified examples
from both tasks. Table 6 shows representative cases where the model failed to match the true label.
These errors illustrate some of the subtle linguistic and acoustic challenges that arise in satire detection.</p>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption>
            <p>Representative misclassified examples from both tasks (English glosses in parentheses).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Predicted (T1)</th><th>Predicted (T2)</th><th>True Label</th><th>Transcription</th></tr>
            </thead>
            <tbody>
              <tr><td>no-satire</td><td>no-satire</td><td>satire</td><td>Tan solo tres días después de que esta información saliera a la luz, ya teníamos un plan alternativo en marcha. (“Just three days after this information came to light, we already had an alternative plan underway.”)</td></tr>
              <tr><td>no-satire</td><td>no-satire</td><td>satire</td><td>Bueno, pues con que hemos sacado el mejor resultado posible, lo cual demuestra la eficiencia de nuestra estrategia revolucionaria. (“Well, we obtained the best possible result, which demonstrates the efficiency of our revolutionary strategy.”)</td></tr>
              <tr><td>satire</td><td>satire</td><td>no-satire</td><td>Comienzo cada clase enseñándoles cómo dejar una huella digital significativa en el ecosistema virtual moderno. (“I begin every class by teaching them how to leave a meaningful digital footprint in the modern virtual ecosystem.”)</td></tr>
              <tr><td>no-satire</td><td>no-satire</td><td>satire</td><td>Intelectual, llevado a la cárcel por un primer ministro corrupto que asegura que todo es parte de una obra teatral. (“An intellectual, jailed by a corrupt prime minister who claims it is all part of a theatrical play.”)</td></tr>
              <tr><td>no-satire</td><td>satire</td><td>no-satire</td><td>Mucha presencia policial también en las principales avenidas, donde se registraron aplausos por parte de los manifestantes. (“A heavy police presence also on the main avenues, where applause from the demonstrators was recorded.”)</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>• Subtle satire: many false negatives involve mild or indirect satire, with expressions that require
contextual or cultural background to interpret correctly.
• False positives due to figurative language: some non-satirical sentences include metaphorical
or exaggerated language, which the classifier may misinterpret as satire.
• Limited contribution of audio: audio cues alone were not always reliable, as certain tonal
variations were insufficient to distinguish satire from factual reporting without deeper semantic
grounding.</p>
        <p>These findings highlight the need for more advanced context modeling and fusion mechanisms that
go beyond feature concatenation. Incorporating pragmatic, discourse-level features and leveraging
fine-tuned acoustic embeddings may help reduce such misclassifications in future work.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented our participation in the SatiSPeech 2025 shared task, focusing on satire
detection in Spanish using both text-only and multimodal (text + audio) inputs. Our approach relied on
lightweight, interpretable pipelines that combined FastText and Wav2Vec2 embeddings with Random
Forest classifiers.</p>
      <p>Although our models did not outperform the baseline on the official leaderboard, they demonstrated
consistent and stable performance with minimal computational cost. Notably, the inclusion of acoustic
features led to a modest improvement in recall balance across classes, confirming the relevance of
prosodic information in capturing satirical intent.</p>
      <p>Our results support the hypothesis that simple, well-structured architectures can provide robust
baselines in complex language tasks, especially under resource constraints. However, the limitations
observed—particularly in the multimodal setup—suggest that more sophisticated modeling strategies
are needed to fully exploit the available data.</p>
      <p>As future work, we plan to explore deeper multimodal fusion strategies beyond simple feature
concatenation, including attention-based or tensor fusion methods. We also aim to fine-tune Wav2Vec2
on satire-specific audio data, and to incorporate discourse-level and pragmatic features that may better
reflect the nuanced nature of satire.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to thank the National Secretariat of Science, Technology and Innovation of
Panama (SENACYT-SNI) for the support given in the development of this research, as well as the Faculty
of Computer Systems Engineering of the Technological University of Panama (FISC-UTP).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used DeepL for grammar and spelling checking.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <article-title>Cultural diferences in humor perception, usage, and implications</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>10</volume>
          (
          <year>2019</year>
          ). URL: https://api.semanticscholar.org/CorpusID:59307773.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>García-Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Á</surname>
          </string-name>
          .
          <string-name>
            <surname>Rodríguez-García</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Valencia-García</surname>
          </string-name>
          ,
          <source>Spanish meacorpus</source>
          <year>2023</year>
          :
          <article-title>A multimodal speech-text corpus for emotion analysis in spanish from natural environments</article-title>
          ,
          <source>Computer Standards &amp; Interfaces</source>
          <volume>90</volume>
          (
          <year>2024</year>
          )
          <article-title>103856</article-title>
          . URL: https://www.sciencedirect.com/science/ article/pii/S0920548924000254. doi:https://doi.org/10.1016/j.csi.
          <year>2024</year>
          .
          <volume>103856</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Broniatowski</surname>
          </string-name>
          ,
          <article-title>A multi-modal method for satire detection using textual and visual cues</article-title>
          , in: G. Da San Martino, C. Brew,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Ciampaglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leberknight</surname>
          </string-name>
          , P. Nakov (Eds.),
          <source>Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship</source>
          , Disinformation, and Propaganda,
          <source>International Committee on Computational Linguistics (ICCL)</source>
          , Barcelona,
          <source>Spain (Online)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>38</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .nlp4if-
          <fpage>1</fpage>
          .4/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. A. García-Díaz, R. Valencia-García, Compilation and evaluation of the Spanish SatiCorpus 2021 for satire identification using linguistic features and transformers, Complex &amp; Intelligent Systems 8 (2022) 1723–1736.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] R. Pan, J. A. García-Díaz, T. Bernal-Beltrán, F. García-Sánchez, R. Valencia-García, Overview of SatiSPeech at IberLEF 2025: Multimodal Audio-Text Satire Classification in Spanish, Procesamiento del Lenguaje Natural 75 (2025).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] J. Á. González-Barba, L. Chiruzzo, S. M. Jiménez-Zafra, Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org, 2025.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, F. García-Sánchez, R. Valencia-García, Overview of EmoSpeech at IberLEF 2024: Multimodal speech-text emotion recognition in Spanish, Procesamiento del Lenguaje Natural 73 (2024) 359–368.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020. URL: https://arxiv.org/abs/2006.11477. arXiv:2006.11477.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] W. Chen, F. Lin, G. Li, B. Liu, A survey of automatic sarcasm detection: Fundamental theories, formulation, datasets, detection methods, and opportunities, Neurocomputing 578 (2024) 127428. URL: https://www.sciencedirect.com/science/article/pii/S0925231224001991. doi:10.1016/j.neucom.2024.127428.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. A. García-Díaz, R. Valencia-García, Compilation and evaluation of the Spanish SatiCorpus 2021 for satire identification using linguistic features and transformers, Complex &amp; Intelligent Systems 8 (2022) 1723–1736.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>