<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>UMU-Ev at SatiSPeech-IberLEF 2025: Exploring Multimodal Satire Detection with Efficient Audio-Text Representations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eduardo Valero-Vilella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Facultad de Informática, Universidad de Murcia, Campus de Espinardo</institution>
          ,
          <addr-line>30100</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents the system developed by the UMU-Ev team for the SatiSPeech 2025 challenge organized within IberLEF, focused on satire detection in Spanish. The proposed approach integrates textual and acoustic modalities using computationally lightweight representations and classifiers, prioritizing efficiency in resource-constrained environments while maintaining competitive performance. For the text modality, models such as RoBERTa-bne and FastText were evaluated, while for audio, HuBERT-based representations were used in combination with MFCCs and prosodic features. The modalities were integrated through different fusion strategies: concatenation, averaging, weighted sum, and attention. The system attained top-tier rankings in both tasks. In the monomodal text task, it reached a Macro F1-Score of 0.8445, ranking third. In the multimodal task, it achieved a Macro F1-Score of 0.8834, obtaining the first position in the official ranking. These results show that it is possible to achieve strong performance using lightweight architectures and systematic experimentation.</p>
      </abstract>
      <kwd-group>
        <kwd>Satire Detection</kwd>
        <kwd>Multimodal Classification</kwd>
        <kwd>Transformers</kwd>
        <kwd>Spanish NLP</kwd>
        <kwd>MFCC</kwd>
        <kwd>HuBERT</kwd>
        <kwd>RoBERTa</kwd>
        <kwd>FastText</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In today’s digital media landscape, disinformation and satire increasingly intermingle. This convergence
is especially prominent on social networks and audiovisual platforms. Developing technologies that
can properly interpret user-consumed and shared messages has thus become crucial. Distinguishing
satirical from ironic content is particularly relevant for key tasks. These include intent detection,
discourse analysis, and automatic content classification in both informative and humorous contexts [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Multimodal analysis combining textual and acoustic signals offers a promising solution. This approach
effectively addresses language complexity and its expressive nuances.
      </p>
      <p>
        Automated satire detection presents major challenges for NLP systems and audio analysis. This
discourse type frequently employs irony, exaggeration, and ambiguity, along with linguistic devices
that resist formalization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Real-world contexts like interviews, talk shows, or online humor introduce
additional complexity, where subtle prosodic cues often accompany satirical elements. These include
intonation, pauses, and emphasis patterns that escape text-only models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Consequently, exploring
approaches that integrate textual and acoustic information becomes essential. Such integration would
enhance satire detection models’ performance.
      </p>
      <p>
        Traditionally, satire detection relied on textual analysis. Statistical models or transformer architectures
like BETO and Spanish-trained RoBERTa variants were commonly used. These approaches succeeded
in related tasks like emotion and irony detection. Several recent studies in the EmoSPeech framework
demonstrate this effectiveness [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, performance declines in ambiguous scenarios. This is
especially true when acoustic signals contribute meaningful nuances. Multimodal approaches are now
gaining momentum. They combine text and audio representations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Despite progress in recent
competitions, robust modality integration remains challenging. The issue persists particularly for
languages like Spanish. Although increasingly represented, Spanish resources remain incomparable to
those in English.
      </p>
      <p>
        We participated in the SatiSPeech 2025 competition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Organized within the
IberLEF [7] forum, the competition enabled us to develop and evaluate monomodal and multimodal
configurations on a realistically annotated corpus. The final system employed a late fusion architecture
combining two data streams: audio embeddings from HuBERT [8] enriched with prosodic features
(MFCCs [9], pauses, intensity, and pitch), and textual representations from RoBERTa-bne [10]. Final
classification relied on a Support Vector Machine (SVM). Our system achieved top results in the
competition’s multimodal task, ranking first on the official leaderboard. This outcome supports the
hypothesis that competitive performance on complex tasks remains achievable even with limited
computational resources and base models. Furthermore, our simplified approach opens new research
pathways that future work can extend with more complex models.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>Automated satire detection remains a significant challenge in natural language processing. This
discursive phenomenon systematically employs irony, exaggeration, and double meanings to critique social,
political, or cultural behaviors [11]. Unlike more straightforward linguistic tasks, satire requires
interpreting not only explicit content but also communicative intent, often contrary to literal meaning [12].
This semantic ambiguity complicates automated system design, as even humans show inconsistencies
when identifying satirical or sarcastic expressions [13]. Current state-of-the-art models also struggle
with figurative language. Their strong dependency on data patterns limits their ability to represent
pragmatic and sociocultural context [12, 11].</p>
      <p>These difficulties are amplified for Spanish, a language underrepresented in international resources
and benchmarks compared to English. Scientific literature on Spanish satire and irony detection
remains scarce. Available datasets typically suffer from size limitations, quality issues, or inadequate
dialectal coverage [11]. Spanish also exhibits significant regional diversity affecting lexicon, syntactic
constructions, and ironic usage patterns. This diversity prevents simple generalization of models trained
on single domains or varieties [14]. Additional challenges arise when working with texts from social
media or audiovisual sources. These often contain grammatical errors, abbreviations, emojis, and
implicit prosodic markers that evade conventional textual analysis [15]. Consequently, researchers
must develop more robust linguistic representations and classification strategies adaptable to satire’s
diverse manifestations across Spanish-speaking contexts [11].</p>
      <p>The scenario grows more complex when addressing satire from a multimodal perspective integrating
textual and acoustic information. While prosody, tone, rhythm, and speech pauses provide crucial clues
about satirical intent, their correct interpretation requires effective integration with textual content [16].
Detecting inconsistencies between verbal content and delivery, such as a seemingly serious comment
uttered with mocking intonation, demands models capable of establishing cross-modal relationships.
These models must identify ironic patterns not explicitly expressed [17]. However, multimodal satire
research in Spanish remains incipient. Existing approaches focus primarily on related tasks like emotion
or irony detection. Few annotated corpora combine text and audio with specific satire labels. Effective
system development thus requires overcoming both technical barriers and structural limitations from
data scarcity [14, 16].</p>
      <p>Early satire and irony detection approaches used traditional machine learning methods with manually
extracted linguistic features. These works explored lexical and stylistic traits (n-grams, punctuation,
PoS tags), semantic indicators (polysemy, polarity contrast), and emotional markers (polarity,
categorical/dimensional emotions) [12, 11]. They also examined contextual and extralinguistic properties using
tools like the LIWC dictionary [18]. Some studies proposed satire-specific features including jargon
and offensive language usage [11, 12]. Classifiers like SVMs, decision trees, and ensemble methods
showed competitive performance, especially for Spanish social media corpora [16]. While effective
in specific domains, these approaches require intensive feature engineering and generalize poorly to
heterogeneous or informal contexts [15, 11].</p>
      <p>The deep learning revolution brought models that learn distributed representations directly from
text. Notable architectures include BiLSTM with attention mechanisms [15], hierarchical networks
combining phrase- and document-level analysis [11, 15], and CNN-based models for detecting ironic
tones [17]. Researchers also explored Transformer models [17] like BERT and RoBERTa, plus multi-view
systems such as MvAttLSTM [11]. The latter integrate multiple information sources (linguistic, dense,
contextualized) via multi-head attention. These models demonstrate a marked capacity to encode
complex semantic and pragmatic subtleties inherent in satirical discourse. Some frameworks even
unify sarcasm detection in multimodal environments incorporating text, audio, and visual signals.
However, their effectiveness heavily depends on corpus quality/diversity and modality fusion strategies
[11, 17, 15].</p>
      <p>Fusion strategy represents a key dimension in multimodal systems. Integration approaches for text,
audio, and image vectors significantly impact model performance [16]. Common techniques include
Early Fusion (direct concatenation of raw representations/embeddings) and Late Fusion (separate
modality processing with later output combination) [19, 17]. In satire detection, researchers observe
interesting interactions between semantic and prosodic cues: When ironic meaning is textually clear,
tone/intonation influence diminishes. Conversely, in ambiguous cases, prosodic features (F0 modulation,
duration, amplitude) become decisive for correct intent interpretation [20, 19].</p>
      <p>
        Recent competitions have catalyzed advances in this field. Notable examples include the EmoSPeech
challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] at IberLEF 2024, focused on Spanish emotion recognition from real-world data. This
competition introduced the Spanish MEACorpus 2023 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], over 13 hours of manually annotated audio
using Ekman’s taxonomy. This multimodal corpus from spontaneous YouTube situations has gained wide
adoption. The BSC-UPC team [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] achieved top results by integrating pre-trained text (RoBERTa-bne)
and audio (XLSR-wav2vec 2.0) representations. Their architecture used Attention Pooling and
dense classification networks in a voting ensemble. The attention mechanism reduced embedding
dimensionality efficiently, particularly valuable in resource-limited contexts.
      </p>
      <p>Collectively, these studies formed the foundation for our experimental design. They guided our
selection of promising textual/acoustic representations and fusion techniques compatible with modest
computational infrastructure. We prioritized simplicity, efficiency, and performance balance.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Objectives</title>
      <p>
        The primary objective of this work is to develop a system capable of detecting satire in Spanish using
multimodal signals. This system will combine textual and acoustic information within the framework
of the SatiSPeech 2025 challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. To achieve this overarching goal, we define the following specific
objectives:
• Evaluate various representation strategies for Spanish text and audio. This includes both
pretrained models and classical techniques, with emphasis on solutions offering an optimal
balance between performance and computational cost.
• Compare different monomodal configurations (processing text and audio separately). This
comparison will establish a robust baseline for developing our multimodal solution.
• Design an effective multimodal fusion strategy. The solution must be compatible with modest
training infrastructure constraints.
• Implement and evaluate diverse classification approaches. We will consider both simple neural
networks and traditional classifiers including SVM and Random Forest.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section describes the proposed system for the task of satire detection in Spanish using multimodal
signals. The system comprises several independent modules for processing textual and acoustic
modalities, whose outputs are integrated in a fusion stage prior to classification. The strategies employed in
each component are detailed below.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          The official data provided by the organizers of the multimodal task in the SatiSPeech 2025 challenge
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] were used for system development and evaluation. The corpus, named SatirA, consists of Spanish
audio clips extracted from YouTube videos of satirical programs such as El Intermedio, Zapeando,
HomoZapping, and El Mundo Today, as well as non-satirical news sources like Antena 3 Noticias, El Mundo, and
BBC News. The dataset includes content from various Spanish-speaking regions, ensuring significant
linguistic and cultural diversity while minimizing regional bias.
        </p>
        <p>Videos were segmented automatically with diarization tools [21, 22], and clips exceeding 25 seconds
in duration were discarded. Automatic transcriptions were generated using Whisper [23]. Content
annotation as satirical or non-satirical followed a semi-automatic approach: first, automatic classification
techniques were applied, followed by manual validation by three experts.</p>
        <p>The dataset includes approximately 25 hours of recordings, divided into a training set with 6,000
samples and a test set with 2,000 samples. The test set contains hidden labels as it was used for the
official evaluation of participant submissions. The final system evaluation was conducted through the
Codalab platform, used by the organizers to host and manage the competition.</p>
        <p>Analysis of class distribution in the training set reveals a slight imbalance: approximately 52.8% of
samples are labeled as non-satirical and 47.2% as satirical.</p>
        <p>Figure 1 shows the distribution analysis of transcription lengths and audio durations, differentiated
by class in the training set (satire / no-satire) and aggregated in the test set. Regarding text,
non-satirical transcriptions tend to be slightly longer with a more concentrated distribution centered
around 55 words. Satirical transcriptions, conversely, show greater dispersion.</p>
        <p>For audio, non-satirical segments consistently exhibit longer durations. These differences may
reflect distinct structural patterns in satirical versus informative or descriptive discourse. The test set
distributions align with those of the training set, suggesting good data coherence, though models should
avoid relying exclusively on these formal differences for class inference.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Text Representation</title>
        <p>Various text representation strategies were explored, ranging from classical approaches to
pretrained language models. First, traditional vectorization techniques like CountVectorizer and
TfidfVectorizer were applied, with vocabulary limited to the 1,000 most frequent words and
removal of Spanish stop-words. Basic preprocessing included lowercase conversion and removal of
URLs, mentions, and punctuation marks.</p>
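        <p>As an illustration of this classical pipeline, the following is a minimal sketch; the cleaning regular expressions and the NLTK Spanish stop-word list are illustrative assumptions rather than the exact configuration used.</p>
        <preformat>
# Sketch of the classical text vectorization described above (details are illustrative).
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

def clean(text):
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)    # drop URLs and mentions
    text = re.sub(r"[^\w\sáéíóúüñ]", " ", text)        # drop punctuation marks
    return text

vectorizer = TfidfVectorizer(max_features=1000, stop_words=stopwords.words("spanish"))
texts = ["Esto es una noticia seria.", "Menuda noticia tan absurda."]
X = vectorizer.fit_transform([clean(t) for t in texts])
print(X.shape)  # (n_samples, up to 1000 features)
        </preformat>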
        <p>Concurrently, dense representations from pre-trained models were evaluated. For Word2Vec and
FastText, average vectors were generated from words in each text. For Transformer-based models
like RoBERTa-bne and XLM-RoBERTa, Hugging Face implementations with AutoTokenizer and
AutoModel were used. Table 1 summarizes the dimensions of the most effective representations
identified during experimentation for both textual and acoustic modalities.</p>
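        <p>A hedged sketch of how frozen sentence embeddings can be extracted with AutoTokenizer and AutoModel is shown below; both the first-token (CLS) vector and mean pooling are computed, but the snippet is an illustration rather than the exact extraction script.</p>
        <preformat>
# Sketch: frozen sentence embeddings from RoBERTa-bne via Hugging Face (feature extraction only).
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "PlanTL-GOB-ES/roberta-base-bne"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (batch, tokens, 768)
    cls_vectors = hidden[:, 0, :]                            # first-token ("CLS") vector
    mask = batch["attention_mask"].unsqueeze(-1)
    mean_vectors = (hidden * mask).sum(1) / mask.sum(1)      # mean pooling over real tokens
    return cls_vectors.numpy(), mean_vectors.numpy()

cls_emb, mean_emb = embed(["El gobierno anuncia nuevas medidas."])
        </preformat>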
        <sec id="sec-4-2-1">
          <title>4.2.1. Experimental Procedure</title>
          <p>Internal testing employed cross-validation on diferent subsets of the training set. The primary
evaluation metric was Macro F1-Score, following challenge guidelines. Macro F1-Score is the unweighted
average of F1-scores across all classes, calculated individually per class and then averaged.</p>
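          <p>For clarity, the metric corresponds directly to scikit-learn’s macro-averaged F1, as in the short example below (the labels are illustrative).</p>
          <preformat>
# Macro F1-Score: per-class F1 values averaged without weighting by class frequency.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]
print(f1_score(y_true, y_pred, average="macro"))
          </preformat>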
          <p>Experimentation was divided into two phases. An initial exploratory phase used reduced training
subsets (1,000 samples for training, 200 for validation) to rapidly test multiple representation-classifier
combinations. The most promising configurations were selected from these tests.</p>
          <p>In the second phase, complete embeddings of selected models were precomputed and stored in .npy
format to accelerate training and evaluation. Subsequently, hyperparameter search techniques were
used to optimize classifiers:
• GridSearchCV from sklearn for logistic regression and support vector machines (a minimal
sketch is shown after this list).
• Keras Tuner for dense neural networks, testing different architectures, layer sizes, and dropout
rates.</p>
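          <p>A minimal sketch of the GridSearchCV-based search over precomputed embeddings is given below; the file names and the parameter grid are illustrative assumptions (C=10, an RBF kernel, and gamma=scale are the optima reported later).</p>
          <preformat>
# Sketch: SVM hyperparameter search on precomputed embeddings stored as .npy files.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.load("embeddings_train.npy")   # hypothetical file of precomputed embeddings
y = np.load("labels_train.npy")       # hypothetical label file

param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"], "svc__gamma": ["scale"]}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(random_state=420)),
    param_grid, scoring="f1_macro", cv=5, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
          </preformat>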
          <p>To ensure experiment reproducibility, the parameter random_state=420 (or equivalents) was
fixed for all applicable procedures, including data splitting, cross-validation strategies, and classifier
initialization. Evaluation results along with optimal hyperparameter combinations were saved in .json
files for subsequent analysis.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Notable Representations and Classifiers</title>
          <p>Three representations demonstrated particularly competitive performance during experimentation:
• RoBERTa-base-bne: A RoBERTa version trained on 570 GB of peninsular Spanish text compiled
by the National Library of Spain. This model is specifically adapted for Spanish tasks and has
shown outstanding performance in text classification, entity recognition, and question answering
[24].
• XLM-RoBERTa: A multilingual RoBERTa-based model trained on data from over 100 languages,
demonstrating strong contextual text handling capabilities [25].
• FastText: Dense vector representations based on sub-words, proving especially robust for
informal or diverse linguistic contexts [26].
• Word2Vec: An unsupervised learning model generating dense word representations in continuous
vector space, capturing semantic and syntactic relationships through contextual occurrence in
large text corpora [27].</p>
          <p>Among evaluated classifiers, the following stood out:
• Logistic regression: Linear model for binary classification that fits a logistic function to estimate
probabilities and assign classes [28].
• Support vector machines (SVM): Classifier identifying the hyperplane that maximizes the
margin between classes, using only the closest points (support vectors) to define boundaries [28].
• Fully connected dense neural networks (DNN): Model composed of interconnected neuron
layers where each node applies a nonlinear transformation to a linear combination of its inputs
[28].</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Audio Representation</title>
        <p>Two audio representation strategies were explored: traditional MFCC-based feature extraction, and
pre-trained self-supervised representation models:
• MFCCs (Mel-Frequency Cepstral Coefficients): Coefficients representing the short-term power
spectrum of an audio signal, modeling human sound perception via mel scales [29] (an extraction
sketch is shown after this list). Additional
features were incorporated:
– Prosodic features: Reflect suprasegmental speech aspects like intonation, rhythm, energy,
and duration, associated with emotional expression or emphasis [30].
– Deltas and delta-deltas: First and second-order temporal derivatives of acoustic features
(MFCCs in our case), capturing dynamic changes over time such as spectral parameter
velocity and acceleration [29].
• Wav2Vec2: Self-supervised model based on contrastive learning that acquires speech
representations directly from unlabeled audio. It uses a convolutional network to encode signals and a
transformer network for contextualization, trained to distinguish true from negative
representations [31].
• HuBERT: Self-supervised model predicting hidden phonetic units obtained through clustering on
acoustic representations, combining segmentation learning and content prediction. This enables
learning hierarchical structures without manual transcriptions [8].</p>
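        <p>The sketch below illustrates the MFCC-based feature extraction with deltas and a few simple prosodic descriptors; the concrete choices (20 MFCCs, RMS energy, F0 via pyin, a pause proxy from unvoiced frames) are assumptions for illustration, not the exact feature set used.</p>
        <preformat>
# Sketch: MFCCs with deltas plus simple prosodic descriptors (intensity, pitch, pauses).
import numpy as np
import librosa

def acoustic_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    delta = librosa.feature.delta(mfcc)                     # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)           # second-order derivatives

    rms = librosa.feature.rms(y=y)[0]                       # intensity / energy
    f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = np.nan_to_num(f0)
    pause_ratio = 1.0 - float(np.mean(voiced_flag))         # rough proxy for pauses

    # Summarize frame-level features with per-coefficient means and standard deviations.
    frames = np.vstack([mfcc, delta, delta2, rms[np.newaxis, :]])
    stats = np.concatenate([frames.mean(axis=1), frames.std(axis=1)])
    prosody = np.array([f0.mean(), f0.std(), pause_ratio])
    return np.concatenate([stats, prosody])
        </preformat>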
        <p>For both self-supervised models, base versions were used, extracting two representation types:
mean pooling and the special CLS token (the output vector’s first position, considered a condensed
representation of processed audio clip content) [31, 8].</p>
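        <p>A corresponding sketch for the mean-pooling extraction from a base HuBERT checkpoint is shown below; the checkpoint name (the English base model, as discussed in the limitations) and resampling details are assumptions.</p>
        <preformat>
# Sketch: mean-pooled utterance embeddings from a base HuBERT checkpoint (frozen).
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModel

checkpoint = "facebook/hubert-base-ls960"   # assumed base checkpoint
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def hubert_mean_embedding(path):
    waveform, sr = torchaudio.load(path)
    waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state           # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()             # mean pooling over frames
        </preformat>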
        <sec id="sec-4-3-1">
          <title>4.3.1. Experimental Procedure</title>
          <p>The experimentation process mirrored the textual case, divided into two main phases:</p>
          <p>An initial exploratory phase used a reduced training subset (1,000 training samples, 200 validation
samples), enabling rapid evaluation of multiple acoustic representation-classifier combinations without
hyperparameter tuning. The most promising combinations were selected from these experiments.</p>
          <p>In the second phase, complete representations of the training set were precomputed and stored in
.npy format for reuse. Using these representations, more complex models were explored alongside
identical hyperparameter search techniques: GridSearchCV and Keras Tuner.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Notable Representations and Classifiers</title>
          <p>After completing experimentation, the following acoustic representations proved most effective:
• MFCCs with prosodic features
• MFCCs with prosodic features and deltas
• HuBERT with mean pooling extraction</p>
          <p>Notably, while performance differences between MFCCs with and without deltas (always including
prosodic features) were modest, including deltas provided consistent slight improvements across most
tested models. Thus, this variant was retained for final experimentation.</p>
          <p>Regarding classifiers used with these representations, support vector machines and dense neural
networks achieved the best results, mirroring their effectiveness in text classification, unlike logistic
regression, which underperformed in this modality.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Multimodal Fusion</title>
        <p>Multimodal experimentation was designed using classifiers and representations that performed best in
unimodal text and audio tasks. Specifically, SVM and DNN classifiers were selected, excluding logistic
regression due to its inferior audio performance. The following representations were used:
• Text: RoBERTa-bne (CLS) and FastText (mean).
• Audio: HuBERT (mean), MFCCs with prosodic features, and MFCCs with prosodic features and
delta-deltas.</p>
        <p>Combinations were generated using different vector fusion methods for each text-audio representation
pair, as illustrated in Figure 2. The goal was to evaluate cross-modal complementarity and analyze its
impact on classifier performance. Concatenation was used for same-modality representations due to its
simplicity and speed.</p>
        <p>The experimental procedure replicated earlier phases: first, reduced training subsets (1,000/200) were
used for exploratory testing of combinations, fusion methods, and classifiers. Subsequently, the most
promising configurations were retrained on the full training set, applying hyperparameter searches to
optimize DNNs and SVMs.</p>
        <sec id="sec-4-4-1">
          <title>4.4.1. Fusion Methods</title>
          <p>The following fusion methods were evaluated to combine text and audio representations:
• Concatenation: Baseline method joining both vectors into a single extended vector by appending
the audio vector to the text vector (or vice versa). Preserves all information from both modalities
but increases classifier input dimensionality.</p>
          <p>Two independent vectors (one per modality) are pre-normalized with StandardScaler to ensure
comparable scales. The NumPy concatenate function is then applied, forming an input vector
with total dimensionality equal to the sum of both representations.
• Weighted sum: Vectors are combined through weighted summation, multiplying each vector
by a weight (between 0 and 1) before summing. This method adjusts each modality’s relative
influence but requires identical dimensions. For mismatched dimensions, the longer vector was
truncated.</p>
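          <p>The two simplest operations can be sketched as follows; variable names are illustrative and the truncation rule follows the description above.</p>
          <preformat>
# Sketch: concatenation and weighted-sum fusion of pre-normalized text and audio vectors.
import numpy as np
from sklearn.preprocessing import StandardScaler

def fuse(text_vecs, audio_vecs, method="concat", w_text=0.5):
    # Normalize each modality independently so both live on comparable scales.
    t = StandardScaler().fit_transform(text_vecs)
    a = StandardScaler().fit_transform(audio_vecs)
    if method == "concat":
        return np.concatenate([t, a], axis=1)
    # Weighted sum (the mean is the 0.5/0.5 special case); the longer vector is
    # truncated when dimensions differ, as described above.
    d = min(t.shape[1], a.shape[1])
    return w_text * t[:, :d] + (1.0 - w_text) * a[:, :d]
          </preformat>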
          <p>A logistic regression model validated weight searches due to its speed and low computational cost.
Weights were tested in 0.1 increments using a validation subset for each weight combination.
Note that weights were optimized for this lightweight model and may not maximize performance
for subsequent SVM/DNN classifiers.
• Mean: Equivalent to weighted fusion with equal weights (0.5/0.5). This specific variant was
included for its simplicity and balanced integration without additional tuning.
• Attention: This method adapts the attention architecture by Vaswani et al. [32] for text-audio
fusion. Specifically, the text vector serves as query, while the audio vector simultaneously acts as
key and value. This configuration allows the attention mechanism to identify the audio aspects
most relevant to textual content.</p>
          <p>Implementation used TensorFlow’s MultiHeadAttention class [33], which computes attention
as a weighted combination of value vectors using weights derived from query-key similarity.
Since original vectors don’t represent temporal sequences, an extra dimension was added to
simulate this, a technical requirement for this layer. However, this simplification may limit the
mechanism’s capacity to model complex relationships observed in real sequences. Attention
output is normalized and fed to the classifier.</p>
          <p>For parameterization, num_heads=4 was selected to balance expressiveness and computational
cost. This allows attention space division into multiple subspaces without excessive overfitting
or parameter growth, particularly relevant given the embeddings’ relatively low dimensionality.
The key_dim parameter (key dimension per head) was defined as dim_model//num_heads,
ensuring concatenated attention output matches input embedding dimensionality, a standard
practice guaranteeing model coherence.</p>
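          <p>A sketch of this attention fusion under the parameterization just described is given below; the classifier head and the handling of different text/audio dimensionalities are illustrative assumptions.</p>
          <preformat>
# Sketch: text-as-query, audio-as-key/value fusion with MultiHeadAttention (num_heads=4).
import tensorflow as tf

def attention_fusion(text_dim, audio_dim, num_heads=4):
    text_in = tf.keras.Input(shape=(text_dim,))
    audio_in = tf.keras.Input(shape=(audio_dim,))
    # Add a length-1 "sequence" axis, the technical requirement mentioned above.
    q = tf.keras.layers.Reshape((1, text_dim))(text_in)
    kv = tf.keras.layers.Reshape((1, audio_dim))(audio_in)
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=text_dim // num_heads
    )(query=q, value=kv, key=kv)
    fused = tf.keras.layers.LayerNormalization()(tf.keras.layers.Flatten()(attn))
    output = tf.keras.layers.Dense(1, activation="sigmoid")(fused)
    return tf.keras.Model(inputs=[text_in, audio_in], outputs=output)
          </preformat>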
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Experimental Environment</title>
        <p>All experiments were conducted on a personal ASUS ROG Strix G531GT laptop with an Intel Core
i7-9750H processor (2.60GHz), 16 GB RAM, and an NVIDIA GeForce GTX 1650 Mobile GPU (4 GB
VRAM). The operating system was Ubuntu 24.04 LTS, and development used a Python 3.12.6 virtual
environment. Notebooks and scripts were primarily coded and executed in Visual Studio Code.</p>
        <p>Key libraries included: TensorFlow 2, NumPy, torch, torchaudio, HuggingFace
Transformers, scikit-learn, librosa, keras_tuner and FastText.</p>
        <p>System computational limitations influenced certain experimental decisions. Thus, embedding
models were used exclusively in feature extraction mode, with generated embeddings saved to disk
to accelerate subsequent processes. Computing audio representations for the 6,000 training samples
required approximately one hour per configuration.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>This section presents the results obtained during the second phase of the experimental process. We begin
with monomodal models applied separately to text and audio, followed by multimodal experiments
combining both modalities. Finally, we analyze the results obtained in the official competition.</p>
      <p>Experiments were conducted on the training set using five-fold cross-validation, with a split of 5,500
samples for training and 500 for validation in each fold. The primary evaluation metric was Macro
F1-Score, following the competition’s established criteria. Precision and recall were also calculated but
were consulted only occasionally as auxiliary metrics.</p>
      <sec id="sec-5-1">
        <title>5.1. Monomodal Results: Text</title>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Monomodal Results: Audio</title>
        <p>Acoustic representations HuBERT and Wav2Vec2 (base versions) were evaluated with two extraction
strategies (mean pooling and CLS token), both isolated and combined with MFCCs. Results are illustrated
in Figure 4. Hyperparameter search configurations and optimal settings are shown in Table 4.</p>
        <p>HuBERT consistently showed superior performance when using mean pooling rather than the CLS
token, with an average difference of 3%. This gap was particularly pronounced with DNN and SVM
classifiers. The best configuration combined HuBERT-mean and prosodic MFCCs with SVM, as shown
in Table 5.</p>
        <p>Similar to HuBERT, the mean-based representation systematically outperformed CLS for Wav2Vec2,
with a much larger difference. The best result used Wav2Vec2-mean and MFCCs with prosodic features
and delta-deltas with DNN (Macro F1-Score = 0.9258). Logistic regression again performed best with the
CLS token representation, except when Wav2Vec2 was concatenated with full MFCCs.</p>
        <p>In all configurations, HuBERT clearly outperformed Wav2Vec2, both in standalone versions and
when combined with MFCCs. The average performance gap was 5% for mean representations and 8%
for the CLS token. Consequently, HuBERT-mean was selected as the audio representation model for
multimodal combinations, both standalone and combined with MFCCs.</p>
        <p>SVM and DNN were the most competitive classifiers for this task. Logistic regression, which showed
competitive performance in text and other audio embeddings, lagged notably with HuBERT-mean. This
limitation motivated its use as an auxiliary method for weight estimation in weighted fusion, where its
low computational cost enables rapid combination exploration.
The optimal SVM hyperparameters in both searches were C=10, an RBF kernel, and gamma=scale, with
per-iteration search times of 61.77 and 44.94 seconds, respectively.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Multimodal Results</title>
        <p>This final phase analyzed multiple text-acoustic representation combinations using different fusion
methods and classifiers, evaluated via cross-validation on the training set. Results shown in Figures 5, 6,
and 7 provide an overview, with key configurations summarized in Table 6.</p>
        <p>Overall multimodal combination performance was high and relatively homogeneous, with most
Macro F1-Scores between 0.92 and 0.97. Regarding representations, best combinations used HuBERT
acoustic embeddings with prosodic features (Figures 5a and 6b). All text variants showed similar
performance when combined with audio representations.</p>
        <p>Fusion methods yielded consistent results across concatenation, attention, mean, and weighted
sum, with no dominant strategy. Attention and weighting achieved the highest scores in several
combinations (Figure 5b). However, Figure 7a shows no clearly dominant method, with all displaying
similar distributions and slight advantages for attention at the upper end.</p>
        <p>Regarding classifier performance (Figure 6), SVM was the most robust option across most
configurations, systematically outperforming dense networks. The exception occurred when HuBERT was
combined with full MFCCs, where DNN achieved better results. DNN also showed lower score dispersion
(Figure 7b), suggesting greater stability across runs despite lower average scores, compared to SVM’s
wider dispersion.</p>
        <p>Notably, while top multimodal configurations did not significantly outperform standalone acoustic
models as they did with text models, they clearly improved result consistency (Figure 8). Monomodal
audio models showed greater dispersion across runs and combinations, while multimodal systems
sustained high performance across configurations. Hyperparameter search configurations and optimal
settings are shown in Table 7.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Competition Results</title>
        <p>We present results from evaluation on the competition’s hidden-label test set and analyze submitted
configurations for both the monomodal text task and the multimodal task.</p>
        <sec id="sec-5-4-1">
          <title>5.4.1. Monomodal Task: Text</title>
          <p>For the monomodal text task, five configurations combining RoBERTa-bne and FastText with different
classifiers were tested. Results are shown in Table 9.</p>
          <p>The best result used RoBERTa-bne with SVM, achieving a Macro F1-Score of 0.8446 (Experiment 2).
This aligns with cross-validation results, confirming the classifier’s consistency across experimental
conditions. The second-best configuration combined RoBERTa-bne and FastText with SVM (Experiment
4, 0.8417), indicating that adding FastText did not improve prediction performance.</p>
          <p>The RoBERTa-bne + FastText with logistic regression configuration (Experiment 5) achieved
a Macro F1-Score of 0.8191, notable for its simplicity and low computational cost. Dense neural
networks scored lower with both RoBERTa (Experiment 1, 0.8122) and FastText (Experiment 3, 0.8058),
suggesting reduced generalization capacity compared to traditional classifiers.</p>
        </sec>
        <sec id="sec-5-4-2">
          <title>5.4.2. Multimodal Task</title>
        </sec>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Error Analysis</title>
        <p>For error evaluation, we took the best-performing model on the training set, defining a 500-sample
holdout set. We identified 9 misclassifications: 7 false positives and 2 false negatives, shown in Table 11.
Errors fall into these groups:
• False positives: Latin American news</p>
        <p>Errors in non-satirical samples (3db0b886, eebdb176) correspond to segments from Latin
American news channels, presented in neutral accents with background music and sound efects
characteristic of this format. Despite serious content and monotone delivery, these were
misclassified as satire.
• False negatives: satire mimicking news</p>
        <p>A second group includes satirical samples misclassified as non-satirical, such as segments from El
Mundo Today (8f7e5f21, e29b87d0) and El Intermedio (607a93b9, 5b97747c). Speakers use
completely serious tones mimicking traditional news/report formats. Satire is conveyed primarily
through semantic content requiring contextual and pragmatic interpretation beyond acoustic
cues and literal text. When present, prosodic cues appear subtly or late, potentially escaping
model detection.
• False negatives: satire cues only in audio</p>
        <p>Some samples contain typical satirical elements (laughter, onomatopoeia, exaggerated tone) not
reflected in transcriptions, as in ab2e18c4 or ca6bda3a. These highlight limitations of text
models and certain acoustic representations in capturing paralinguistic information essential for
satire detection.
• Contextual ambiguity and labeling errors</p>
        <p>Sample e1a1a563 is particularly ambiguous: a social media-style intervention with reel/short
background music where the tone is informal but content isn’t clearly satirical without
speaker/context knowledge. This ambiguity poses challenges even for human annotators and suggests
some errors may stem from label quality issues or lack of explicit contextual information.
Applying the same procedure to other top-performing multimodal models, we observed that fragments
ab2e18c4 and 5b97747c were misclassified by all evaluated models. In both cases, satire manifests
primarily in the acoustic channel while transcriptions contain no clear clues.</p>
        <p>• ab2e18c4 contains laughter and onomatopoeia not reflected in text.
• 5b97747c relies on ironic intonation for satire interpretation, especially in the opening phrase.</p>
        <p>Additionally, knowledge of the broadcast format (political comedy programs) is crucial for correct
interpretation.</p>
        <p>These errors highlight models’ difficulty capturing implicit satire, which depends on both subtle
acoustic elements and external contextual information. From this analysis, we derive:
• Errors are systematic rather than random, especially when satire:
– Is conveyed through subtle intonation/prosody.
– Is absent from transcriptions.</p>
        <p>– Requires external knowledge.
• Models struggle with implicit satire lacking explicit textual cues.
• Certain sound effects (e.g., background music) may induce false positives if erroneously correlated
with satire during training.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Interpretation and Analysis</title>
        <sec id="sec-5-6-1">
          <title>5.6.1. Classifiers</title>
          <p>Comparative analysis shows relevant performance differences across classifiers, representations, and
fusion strategies. Generally, support vector machines (SVM) demonstrated more robust behavior,
systematically outperforming dense neural networks (DNN) in most configurations. This may stem
from SVMs’ capacity to maximize inter-class separation margins in high-dimensional spaces, particularly
effective with our representations. DNNs, though slightly less accurate on average, showed lower
variance between runs, indicating greater result stability.</p>
          <p>Logistic regression performed notably worse in demanding scenarios, especially with HuBERT-based
tasks, due to its linear nature. However, its low computational cost made it useful as an auxiliary classifier
for tasks like weight estimation in weighted fusion, prioritizing efficiency over final performance.</p>
        </sec>
        <sec id="sec-5-6-2">
          <title>5.6.2. Text Representations</title>
          <p>For text representations, results confirm contextualized models’ superiority over static alternatives.
Specifically, RoBERTa-bne delivered the best representations for satire detection, significantly
outperforming classical models like FastText or Word2Vec. This was expected since RoBERTa captures
full utterance context, including ironic structures and pragmatic nuances difficult to represent with
non-contextual models.</p>
          <p>FastText showed competitive standalone performance, especially with SVM or DNN classifiers,
benefiting from its ability to handle rare/out-of-vocabulary words via sublexical components.
However, its combination with RoBERTa yielded no substantial improvements, suggesting contextualized
representations already contained most useful signals.</p>
          <p>Conversely, Word2Vec showed the worst performance both standalone and combined, reinforcing
that static representations, even enriched through aggregation techniques, are inadequate for capturing
satire’s linguistic mechanisms.</p>
        </sec>
        <sec id="sec-5-6-3">
          <title>5.6.3. Audio Representations</title>
          <p>For audio, a similar pattern emerged with clearly superior performance from advanced model
representations. HuBERT consistently outperformed Wav2Vec2 by average margins of 5-8% in Macro
F1-Score depending on extraction strategy. This advantage may stem from its masked phonetic unit
prediction training, favoring capture of prosodic/phonological structures relevant to this task.</p>
          <p>In both cases, mean temporal representation (mean pooling) outperformed the CLS vector, suggesting
global aggregation preserves information distributed throughout the signal, including intonational and
temporal patterns key to identifying ironic/satirical content.</p>
          <p>Moreover, combining these representations with traditional acoustic descriptors like MFCCs and
prosodic features (pauses, energy, pitch) further improved performance. The optimal configuration
fused HuBERT-mean with prosodic MFCCs using an SVM classifier. This demonstrates effective
complementarity: self-supervised models provide rich contextual information while handcrafted descriptors
incorporate timbre/prosody nuances enriching the joint representation. Attributes like voice melody,
pause duration, or intensity modulation, fundamental for oral satire expression, are thus better reflected
in the final feature space.</p>
        </sec>
        <sec id="sec-5-6-4">
          <title>5.6.4. Fusion Methods</title>
          <p>Regarding multimodal fusion strategies, the evaluated methods (concatenation, mean, weighted sum,
attention) yielded similar overall performance, suggesting the key lies in joint modality exposure rather
than specific integration techniques. No clearly superior technique emerged, though concatenation and
attention slightly led at the distribution’s upper end.</p>
          <p>Concatenation achieved the highest Macro F1-Score in official evaluation when used with SVM.
This strategy, preserving all information from both modalities in an extended vector, allowed SVM to
effectively leverage generated feature richness, showing particular robustness in high-dimensional,
well-balanced input spaces.</p>
          <p>Attention fusion showed equally competitive performance. Its design dynamically prioritized the
most relevant acoustic components based on textual content, without requiring explicit sequential
structure. This mechanism, implemented via multi-head attention with text as query, proved capable of
modeling useful cross-modal relationships, emphasizing intonational/rhythmic aspects associated with
verbal irony.</p>
          <p>The parity between attention and concatenation indicates both strategies effectively exploit modal
complementarity, either letting the classifier discover relevant interactions or guiding integration
through the fusion mechanism. Even simpler approaches like arithmetic mean yielded competitive
results, reinforcing that both modalities provide coherent complementary information usable with
minimally parametrized techniques.</p>
          <p>Overall, results demonstrate multimodal combination not only improves average scores but
significantly reduces run-to-run dispersion, providing greater robustness and stability than monomodal
configurations regardless of the specific fusion method applied.</p>
        </sec>
        <sec id="sec-5-6-5">
          <title>5.6.5. System Limitations</title>
          <p>Beyond aggregate results, error analysis reveals recurring patterns explaining current system limitations.</p>
          <p>First, errors aren’t randomly distributed but concentrate in specific sample types with particular
characteristics. Among false negatives, prominent cases involve satire expressed through subtle prosodic
elements or requiring implicit contextual knowledge for correct interpretation. Segments with formal
tone or news structure from sources like El Mundo Today or El Intermedio were misclassified as
non-satirical despite clearly satirical discourse content. These examples highlight models’ difficulty capturing
communicative intent when explicit markers are absent from textual or acoustic signals.</p>
          <p>Additionally, false positives occurred in fragments from Latin American news, where background
music, sound effects, or marked accents may have induced misclassification through associations learned
during training. Such errors suggest deceptive correlations in data, where secondary elements like
delivery style, background music, or certain acoustic features may have been learned as indirect satire
indicators without being inherently so.</p>
          <p>These errors systematically arise in underrepresented genres or contexts where satire relies on
implicit prosodic cues or cultural knowledge absent from training data. Overcoming these may require
incorporating external context information and improving acoustic representations’ sensitivity to fine
prosodic nuances still partially missed by current models.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <sec id="sec-6-1">
        <title>6.1. Proposed System</title>
        <p>The proposed system addresses automatic satire detection in Spanish through a multimodal approach,
integrating textual and acoustic information via dense representations and efficient classifiers.</p>
        <p>For the textual modality, various pre-trained models were evaluated, with RoBERTa-bne, specifically
trained for Spanish, standing out, alongside representations based on FastText and Word2Vec
constructed through average aggregation. For the audio signal, both traditional acoustic features (MFCCs
and prosodic traits like pauses, intonation, or energy) and embeddings from self-supervised models
(HuBERT and Wav2Vec2) were explored, using both CLS token extraction and mean pooling.</p>
        <p>To fuse representations from both modalities, we evaluated several strategies: concatenation,
arithmetic mean, weighted sum, and multi-head attention mechanisms. For classification, support vector
machines (SVM) and dense neural networks (DNN) were primarily used, with logistic regression
discarded in advanced stages due to inferior performance.</p>
        <p>Systematic exploration of these combinations identified highly competitive configurations during
internal validation, particularly attention-based fusion between FastText and HuBERT+MFCCs with
prosodic features using SVM, though diferences among top models were minimal. However, in the
official evaluation on the blind test set, the highest score was achieved by combining RoBERTa-bne and
HuBERT+MFCCs with prosodic features, using concatenation fusion and SVM classification, securing
first place in the multimodal task of the SatiSPeech 2025 competition. In the monomodal text task, the
system also performed exceptionally, ranking third using SVM on RoBERTa-bne.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. System Limitations</title>
        <p>Despite its performance, the developed approach has limitations stemming from computational
constraints and architectural decisions. Hardware restrictions and processing time limited the use of
pre-trained models to feature extraction mode only, preventing fine-tuning on architectures like
RoBERTa-bne or HuBERT. This hindered adaptation to the satire detection domain, likely affecting
the capture of subtle pragmatic or prosodic patterns. The inability to train task or context-specific
representations also precluded exploring larger, deeper, or specialized architectures.</p>
        <p>Architecturally, we opted for late multimodal integration based on concatenation or direct
combination of independently extracted embeddings, followed by standalone classification. Though
computationally efficient, this approach cannot explicitly model cross-modal interactions during training,
potentially limiting the system’s ability to capture complex multimodal signals or dependencies between
textual content and acoustic signals. More sophisticated alternatives, like end-to-end architectures or
cross-modal attention, were excluded due to these constraints.</p>
        <p>Finally, some components exhibit limited linguistic coverage. The acoustic model HuBERT, while
high-performing, was not originally trained on Spanish data, potentially hindering its ability to capture
phonological nuances, regional accents, or language-specific prosodic patterns. This lack of linguistic
specialization, both textual and acoustic, may have reduced the system’s sensitivity to culturally marked
or subtle satire.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Impact on Results</title>
        <p>These limitations directly impacted system behavior and results. Without fine-tuning, pre-trained
models could not adapt to task-specific nuances of text/audio representations, likely affecting the
capture of subtle semantic or prosodic patterns. For text, this lack of specialization may explain why
performance, though competitive, did not surpass other solutions likely using fine-tuning on similar
models.</p>
        <p>Similarly, for audio and multimodal configurations, using generic embeddings combined with late
fusion strategies like concatenation may have limited the system’s ability to distinguish ambiguous
or complex examples. Qualitative analysis of mispredictions suggests that without explicit context
modeling, the system relies on superficial audio cues: certain voice tones, paralinguistic elements (e.g.,
laughter), or background sound effects that, due to their recurrence in satirical training examples, may
have been misinterpreted as reliable satire indicators.</p>
        <p>This dependence on deceptive correlations or unintended artifacts compromised robustness for
out-of-distribution cases. Additionally, the absence of joint fusion architectures prevented full exploitation
of text-audio complementarity. In examples where irony manifests only through intonation or content
form contrast, the system lacks mechanisms to effectively integrate and enhance these signals.</p>
        <p>In summary, while the approach delivered solid performance in both challenge tasks, computational
and architectural constraints likely limited its potential. Incorporating fine-tuned models and deeper,
jointly trained fusion strategies would better capture pragmatic and multimodal satire nuances, reduce
errors, and improve generalization to new domains or discourse styles.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Ethical Considerations</title>
        <p>Developing automated systems for complex linguistic phenomena like satire entails ethical implications
beyond technical dimensions. Unlike objective tasks (e.g., topic classification or basic sentiment analysis),
satire involves pragmatic and culturally loaded components that vary significantly across contexts,
speakers, and communities [13, 12]. Integrating such tasks into sensitive applications thus requires
critical reflection on potential risks, especially when deployed in real-world scenarios where decisions
may have tangible impact [34, 35].</p>
        <p>A representative risk emerges when using satire detectors as auxiliary tools in misinformation
detection systems [36, 37, 14, 35]. While distinguishing humor from deliberate deception, and avoiding
the penalization of legitimate satire, are legitimate goals, ambiguities persist. A system might mislabel
satirical content as fake news, censoring or discrediting humor that serves critical public discourse functions.
Similarly, disinformation agents could exploit the inverse bias by superficially incorporating ironic elements
to evade detection. Both scenarios illustrate how overreliance on satirical indicators as misinformation
filters may yield consequential false positives and negatives.</p>
        <p>Moreover, the strong contextual dependence of humor and irony must be considered. Satire relies
on cultural conventions, shared references, and linguistic codes that are not always equivalent across
communities [11]. Systems trained on specific corpora may reflect a partial view of the phenomenon,
erratically classifying expressions deviating from dominant patterns. Such biases could invisibilize
legitimate satire from underrepresented groups or perpetuate inequalities through systematically skewed
classifications [35, 13].</p>
        <p>Finally, satire detection must not be equated with falsity detection or malicious intent attribution. Not
all satire contains false claims, nor do all false claims adopt satirical framing [14, 35]. The line between
humor, critique, and manipulation is often blurred, requiring contextual knowledge that current models
cannot reliably emulate. As noted in prior work [34, 36, 37], deploying automated systems in sensitive
domains like misinformation detection necessitates human oversight, continuous auditing, and training
data updates to prevent both uncritical decision automation and the imposition of implicit biases.</p>
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Future Work</title>
        <p>Although the proposed system demonstrated competitive performance, several improvements warrant
exploration. First, fusion methods could be expanded to include early attention mechanisms alongside the
late-attention approaches used here [13, 17]. This variant could model cross-modal interactions from
earlier architectural stages, particularly beneficial when acoustic signals reinforce textual content from
sequence onset.</p>
        <p>Additionally, since all models remained frozen, partial or full fine-tuning of pre-trained models [38]
would allow adapting learned representations to Spanish satire nuances [39, 14], potentially enhancing
discriminative capacity. Larger models (e.g., RoBERTa-large-bne [24] or HuBERT-large [8]) could
capture more complex patterns, albeit at higher computational cost [11].</p>
        <p>To balance performance and efficiency, efficient adaptation techniques like adapter layers [40] or LoRA
(Low-Rank Adaptation) [41] are especially promising. These methods fine-tune pre-trained models
by training only a small parameter subset while freezing the original weights. Adapter layers insert
low-dimensional modules between original model layers, learning task-specific transformations without
altering core weights, proven effective across NLP tasks while matching full fine-tuning performance
with fewer trained parameters [40].</p>
        <p>Similarly, LoRA [41] injects low-rank matrices into linear projections, representing necessary updates
via efficient decomposition. This adds adaptability without significant inference latency, crucial for
computationally constrained applications, and has been validated for text and audio tasks, matching full
fine-tuning results at reduced training cost. Integrating these techniques would enable future iterations
to incorporate models specifically adapted to Spanish satire without compromising efficiency in modest
computing environments.</p>
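        <p>As an illustration of how such an adaptation could be plugged into a future iteration, the following is a minimal LoRA sketch using the PEFT library; the target modules, rank, and other hyperparameters are assumptions, not settings validated in this work.</p>
        <preformat>
# Sketch: wrapping RoBERTa-bne with LoRA adapters via PEFT; hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "PlanTL-GOB-ES/roberta-base-bne", num_labels=2
)
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],   # attention projections in RoBERTa layers
    task_type="SEQ_CLS",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()       # only a small fraction of weights is trained
        </preformat>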
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author used GPT-4o for translation and spell checking. After
using this tool, the author reviewed and edited the content as needed and takes full responsibility for
the final version of the publication.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>García-Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Valencia-García</surname>
          </string-name>
          ,
          <article-title>Compilation and evaluation of the Spanish SatiCorpus 2021 for satire identification using linguistic features and transformers</article-title>
          ,
          <source>Complex &amp; Intelligent Systems</source>
          <volume>8</volume>
          (
          <year>2022</year>
          )
          <fpage>1723</fpage>
          -
          <lpage>1736</lpage>
          . doi:
          <volume>10</volume>
          .1007/s40747-021-00625-1.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>García-Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Á</surname>
          </string-name>
          .
          <string-name>
            <surname>Rodríguez-García</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Valencia-García</surname>
          </string-name>
          ,
          <source>Spanish MEACorpus</source>
          <year>2023</year>
          :
          <article-title>A multimodal speech-text corpus for emotion analysis in Spanish from natural environments</article-title>
          ,
          <source>Computer Standards &amp; Interfaces</source>
          <volume>90</volume>
          (
          <year>2024</year>
          )
          <article-title>103856</article-title>
          . doi:https://doi.org/10.1016/j.csi.
          <year>2024</year>
          .
          <volume>103856</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>61</volume>
          (
          <year>2018</year>
          )
          <fpage>90</fpage>
          -
          <lpage>99</lpage>
          . doi:
          <volume>10</volume>
          .1145/3129340.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Casals-Salvador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>India</surname>
          </string-name>
          , J. Hernando,
          <string-name>
            <surname>BSC-UPC at</surname>
          </string-name>
          EmoSPeech-IberLEF2024:
          <article-title>Attention Pooling for Emotion Recognition, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th SEPLN Conference</article-title>
          , CEUR Workshop Proceedings, CEURWS.org, Valladolid, Spain,
          <year>2024</year>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3756</volume>
          /EmoSPeech2024_paper1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>García-Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Á</surname>
          </string-name>
          .
          <string-name>
            <surname>Rodríguez-García</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>García-Sánchez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Valencia-García</surname>
          </string-name>
          , Overview of EmoSPeech at IberLEF 2024:
          <article-title>Multimodal Speech-text Emotion Recognition in Spanish</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>73</volume>
          (
          <year>2024</year>
          )
          <fpage>359</fpage>
          -
          <lpage>368</lpage>
          . doi:
          <volume>10</volume>
          .26342/2024-73-27.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>García-Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bernal-Beltrán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>García-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Valencia-García</surname>
          </string-name>
          , Overview of SatiSPeech at IberLEF 2025:
          <article-title>Multimodal Audio-Text Satire Classification in Spanish</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>75</volume>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>