<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ferrara at SatiSPeech-IberLEF 2025: Leveraging BETO and HuBERT for Multimodal Speech-Text Satire Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Bortolotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Ferrara</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Satire is a complex and subtle form of communication that blends humor, irony, and criticism to address social, political, or cultural issues. Its interpretation often depends on nuanced linguistic and prosodic cues, making satire particularly difficult to detect, especially in multimodal settings that involve both textual and acoustic signals. This paper presents a system developed for the SatiSPeech@IberLEF 2025 shared task, which focuses on the binary classification of Spanish content as satirical or non-satirical using multimodal data. The proposed approach explores the interplay of linguistic patterns, vocal intonation, and rhythm to identify the features most indicative of satire. Key challenges include the scarcity of rich multimodal satire datasets and the complexity of designing robust fusion strategies for heterogeneous modalities. By leveraging recent advances in deep learning and multimodal integration, this work aims to contribute to the development of more accurate and culturally aware satire detection systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Satire Detection</kwd>
        <kwd>Multimodal Classification</kwd>
        <kwd>NLP</kwd>
        <kwd>Speech Processing</kwd>
        <kwd>BETO</kwd>
        <kwd>HuBERT</kwd>
        <kwd>Multi-Head Attention</kwd>
        <kwd>Text-Audio Fusion</kwd>
        <kwd>Spanish</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Satire is a sophisticated and multifaceted form of expression that employs irony, sarcasm, and
exaggeration to critique social, political, or cultural phenomena. Unlike straightforward humor, satire often
relies on implicit cues and contextual knowledge, making it inherently difficult to detect and interpret,
even for human readers. In computational contexts, this complexity increases further, particularly when
dealing with content that spans multiple modalities.</p>
      <p>
        Recent advances in multimodal learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have opened new avenues for satire detection by
enabling the fusion of textual and acoustic information. Prosodic features such as intonation, rhythm,
and speech rate can provide essential signals for detecting the tone and intent behind a message.
However, capturing the interplay between linguistic structure and vocal delivery remains a challenging
task, especially in languages like Spanish, where cultural and regional nuances play a critical role in
humorous expression.
      </p>
      <p>
        The SatiSPeech@IberLEF 2025 shared task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], held at IberLEF 2025 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], addresses this research gap
by promoting the development of systems capable of identifying satire in Spanish through a multimodal
approach. The task consists of a binary classification challenge in which participants are required to
determine whether a given audio-text pair is satirical or not. This setting enables the exploration of
novel fusion techniques and the evaluation of various deep learning architectures suited for handling
both sequential and acoustic inputs.
      </p>
      <p>To address this challenge, the proposed system leverages two state-of-the-art pretrained models:
BETO, a BERT-based language model trained on Spanish texts, for the textual modality, and HuBERT,
a self-supervised speech representation model, for the audio modality. These representations are
subsequently integrated through a late fusion architecture designed to combine complementary cues
from both modalities.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The dataset for this task was compiled to address the challenge of detecting satire in a multimodal
context, combining text and audio. The data were obtained from a wide range of YouTube channels,
including satirical programs such as El Intermedio, Zapeando, Homo-Zapping, and El Mundo Today, as
well as non-satirical news programs such as Antena 3 Noticias, El Mundo, and BBC News. This ensures
a broad representation of the varieties of Spanish spoken in different regions.</p>
      <p>
        The compilation process involved extracting videos from these channels and segmenting them into
manageable audio units using a diarization tool [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Segments longer than 25 seconds were discarded
to maintain a consistent length. The audio segments were transcribed using Whisper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to ensure
high-quality textual representations.
      </p>
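      <p>To make the segmentation and transcription steps concrete, the following minimal sketch illustrates a pipeline of this kind, using pyannote.audio for diarization and Whisper for transcription; the checkpoint names, model size, and language setting are illustrative assumptions rather than the organizers' exact configuration.</p>
      <preformat>
# Illustrative sketch only: checkpoints and thresholds are assumptions.
import librosa
import whisper
from pyannote.audio import Pipeline

MAX_SEGMENT_SECONDS = 25          # segments longer than this are discarded
SAMPLE_RATE = 16000               # Whisper expects 16 kHz mono audio

diarization = Pipeline.from_pretrained("pyannote/speaker-diarization")  # may require a Hugging Face token
asr = whisper.load_model("small")  # model size is an assumption

def segment_and_transcribe(audio_path):
    """Yield (start, end, transcript) for speech turns no longer than 25 seconds."""
    waveform, _ = librosa.load(audio_path, sr=SAMPLE_RATE, mono=True)
    for turn, _, _ in diarization(audio_path).itertracks(yield_label=True):
        if turn.end - turn.start > MAX_SEGMENT_SECONDS:
            continue                                    # keep segment lengths consistent
        clip = waveform[int(turn.start * SAMPLE_RATE):int(turn.end * SAMPLE_RATE)]
        result = asr.transcribe(clip, language="es", fp16=False)
        yield turn.start, turn.end, result["text"].strip()
      </preformat>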
      <p>A semi-supervised approach was used to annotate the segments as either satirical or non-satirical,
combining manual annotation by three experts and automatic classification techniques to increase
efficiency and reliability. A manual annotation pass was then conducted by the organizers to ensure
high-quality labels for the dataset. It is worth noting that the dataset includes content from diverse
Spanish-speaking regions, ensuring linguistic and cultural diversity while minimizing regional bias.</p>
      <p>The final dataset consists of approximately 25 hours of annotated audio segments and their
corresponding transcriptions. For the purposes of this competition, around 5,000–6,000 audio-text pairs were
selected and divided into training and test sets with an 80%-20% split.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Description</title>
      <p>The system developed for the SatiSPeech@IberLEF 2025 shared task addresses two distinct subtasks:
the first (Task 1) focuses on satire detection using only the textual modality, while the second (Task 2)
employs a multimodal approach that combines both textual and acoustic information. Each task is
tackled using a specific preprocessing pipeline and tailored classification models.</p>
      <p>For the textual modality, the BETO model, a Spanish pre-trained BERT variant, is used to extract deep
contextual embeddings. For the audio modality, HuBERT is employed to capture prosodic and phonetic
representations. In the multimodal task, these two modalities are fused to enhance classification
performance by integrating information from both text and audio.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Text Modality (Task 1)</title>
      <p>In this task, the aim is to classify satirical content based solely on textual transcriptions. The system relies
on the BETO model, a Spanish BERT-based model, which is fine-tuned using the SentenceTransformers
framework.</p>
      <sec id="sec-4-1">
        <title>4.1. BETO: Spanish BERT-based Model</title>
        <p>
          BETO [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is a transformer-based language model pre-trained on large Spanish corpora. It follows the
same architecture as BERT-Base [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (12 layers, 768 hidden units, 12 attention heads, and 110 million
parameters), but it is trained exclusively on Spanish data, such as Wikipedia and news articles.
        </p>
        <p>By focusing on the Spanish language, BETO captures language-specific nuances, making it more
suitable than multilingual models for tasks such as satire detection, where understanding idiomatic
expressions, tone, and cultural references is crucial.</p>
        <p>Tokenization Process. The BETO model uses the WordPiece tokenization algorithm, which splits
the input text into subword units, handling out-of-vocabulary words and morphological variations
effectively. During preprocessing, the input text is tokenized, with special tokens like [CLS] and [SEP]
added at the beginning and end of the sequence. The tokenized input is then transformed into token
IDs, attention masks, and segment embeddings, which are used in the downstream fine-tuning process.</p>
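        <p>As an illustration of this preprocessing step, the snippet below tokenizes a Spanish sentence with the BETO tokenizer from Hugging Face; the example sentence and maximum sequence length are arbitrary choices.</p>
        <preformat>
from transformers import AutoTokenizer

# WordPiece tokenizer associated with the uncased BETO checkpoint.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

encoded = tokenizer(
    "Menuda sorpresa: otro anuncio que lo arregla todo.",  # arbitrary example sentence
    padding="max_length",
    truncation=True,
    max_length=128,          # illustrative maximum sequence length
    return_tensors="pt",
)

# [CLS] and [SEP] are added automatically; the encoding exposes the tensors used
# downstream: input_ids, attention_mask, and token_type_ids.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])[:10])
print(encoded["attention_mask"].shape)
        </preformat>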
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Architecture</title>
        <p>The core of the system is based on the pre-trained BETO model
(dccuchile/bert-base-spanish-wwm-uncased). For this task, the BETO model was used directly for classification.</p>
        <p>The model was integrated into the SentenceTransformers framework, which efficiently converts text
into sentence embeddings. The architecture consists of three components: the transformer encoder
(BETO), a pooling layer that computes sentence embeddings, and a dense layer with a Tanh activation
function to project the embeddings into a 768-dimensional space.</p>
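        <p>A minimal sketch of this three-component stack, assembled with the SentenceTransformers API, is shown below; the example transcriptions are invented, and the classification fine-tuning applied on top of these embeddings is not reproduced here.</p>
        <preformat>
from torch import nn
from sentence_transformers import SentenceTransformer, models

# 1) Transformer encoder: the pre-trained BETO checkpoint.
word_embedding = models.Transformer("dccuchile/bert-base-spanish-wwm-uncased")

# 2) Pooling layer: mean pooling over token embeddings yields a sentence embedding.
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())

# 3) Dense layer with Tanh activation projecting into a 768-dimensional space.
dense = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=768,
    activation_function=nn.Tanh(),
)

model = SentenceTransformer(modules=[word_embedding, pooling, dense])

# Hypothetical transcriptions; encode() returns one 768-dimensional vector each.
embeddings = model.encode(
    ["esto es claramente una parodia", "el parlamento aprueba los presupuestos"]
)
print(embeddings.shape)  # (2, 768)
        </preformat>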
        <p>Tokenization and preprocessing were performed using the tokenizer associated with the BETO model.
The dataset was tokenized and prepared without manual text preprocessing, as BETO’s tokenizer
handles typical normalization operations such as lowercasing, punctuation splitting, and subword tokenization.</p>
        <p>The model was trained for three epochs on the training dataset. This training process enabled the
model to adjust its weights to improve its ability to classify satirical content in Spanish. Despite the
simplicity of the configuration, the pre-trained BETO model yielded strong performance, highlighting
the effectiveness of transfer learning.</p>
        <p>This architecture leverages the advantages of a pre-trained language model, providing a robust and
data-efficient solution for the classification task.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Multimodal Task (Task 2)</title>
      <p>The goal of the second task is to identify satire by combining both textual and vocal features in a
multimodal approach, leveraging the strengths of both data types. First, the audio features are extracted
and used to train an independent audio classifier. Then, a fusion strategy is applied to integrate
predictions from both modalities.</p>
      <sec id="sec-5-1">
        <title>5.1. Audio Modality</title>
        <p>The objective of the audio subtask is to perform satire classification using only vocal characteristics,
assuming that acoustic patterns such as tone, rhythm, or prosody may signal satirical intent.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Feature Extraction with HuBERT</title>
        <p>
          To extract high-level speech features, we employed the pre-trained HuBERT model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. HuBERT
(Hidden-Unit BERT) is a self-supervised model that learns acoustic representations by predicting masked audio
segments based on unsupervised cluster assignments.
        </p>
        <p>Raw audio files were resampled to 16 kHz using librosa. We used the Wav2Vec2Processor and
the pre-trained HuBERT model from Hugging Face to compute hidden representations. The final feature
vector for each audio file was obtained by averaging the last hidden states across the time dimension,
resulting in a fixed-length 1024-dimensional embedding that captures prosodic, rhythmic, and timbral
features.</p>
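        <p>A condensed sketch of this extraction step is shown below; the paper does not name the exact HuBERT checkpoint, so a HuBERT-large model is assumed here to match the 1024-dimensional embeddings.</p>
        <preformat>
import librosa
import torch
from transformers import HubertModel, Wav2Vec2Processor

# Assumed checkpoint: a HuBERT-large model, consistent with 1024-dim hidden states.
CHECKPOINT = "facebook/hubert-large-ls960-ft"
processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
hubert = HubertModel.from_pretrained(CHECKPOINT)
hubert.eval()

def extract_embedding(audio_path):
    """Return a fixed-length 1024-dimensional embedding for one audio file."""
    # Resample to 16 kHz mono, as expected by HuBERT.
    speech, _ = librosa.load(audio_path, sr=16000, mono=True)
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(inputs.input_values).last_hidden_state  # (1, time, 1024)
    # Average over the time dimension to obtain a single vector per file.
    return hidden.mean(dim=1).squeeze(0)                        # (1024,)
        </preformat>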
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Audio Classification Architecture</title>
        <p>The extracted embeddings were subsequently processed by a feed-forward neural network designed
for classification. The architecture begins with an input layer that receives the 1024-dimensional
embeddings generated by the HuBERT model.</p>
        <p>This is followed by two hidden layers. The first hidden layer consists of a fully connected layer with
1024 units, followed by Batch Normalization, a ReLU activation function, and a Dropout layer with a
dropout rate of 0.4. The second hidden layer applies a similar structure, with a fully connected layer
reduced to 256 units, again followed by Batch Normalization, ReLU activation, and Dropout with the
same rate.</p>
        <p>Finally, the network concludes with a fully connected output layer that produces the class scores
used for binary classification.</p>
        <p>Training Details. The model was trained using CrossEntropyLoss and optimized with the Adam
optimizer (learning rate = 0.001). A StepLR scheduler decreased the learning rate by a factor of 0.1
every 10 epochs. Training was performed for 40 epochs.</p>
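        <p>The sketch below mirrors this configuration in PyTorch; the layer sizes, dropout rate, loss, optimizer, scheduler, and epoch count follow the description above, while the batch size and the placeholder training tensors are assumptions.</p>
        <preformat>
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

class AudioSatireClassifier(nn.Module):
    """Feed-forward classifier over 1024-dimensional HuBERT embeddings."""
    def __init__(self, input_dim=1024, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.BatchNorm1d(1024), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(1024, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(256, num_classes),  # output layer for binary classification
        )

    def forward(self, x):
        return self.net(x)

# Placeholder training tensors standing in for HuBERT embeddings and labels.
embeddings = torch.randn(256, 1024)
labels = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(embeddings, labels), batch_size=32, shuffle=True)

model = AudioSatireClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(40):                     # 40 training epochs
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
    scheduler.step()                        # decays the learning rate every 10 epochs
        </preformat>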
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Multimodal Fusion Strategy</title>
      <p>To exploit both textual and acoustic information, we adopted a late fusion strategy, which combines the
class probability distributions generated independently by the text and audio classifiers.</p>
      <sec id="sec-6-1">
        <title>6.1. Fusion Mechanism</title>
        <p>Each classifier outputs a probability vector over the target classes. These vectors are combined via
an element-wise average (arithmetic mean), giving equal weight to each modality. The final prediction
corresponds to the class with the highest combined score.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. System Overview</title>
        <p>• Text Classifier: A transformer-based model (BETO) generates semantic representations from
text transcriptions and outputs class probabilities.</p>
        <p>• Audio Classifier: A HuBERT-based pipeline generates vocal embeddings, which are classified
using a feed-forward network.</p>
        <p>• Fusion Step: Class probabilities from both classifiers are averaged to produce the final decision.</p>
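        <p>A minimal sketch of this late-fusion step, assuming NumPy arrays of per-class probabilities produced by the two classifiers, is shown below; the example values are hypothetical.</p>
        <preformat>
import numpy as np

# Hypothetical per-class probabilities for three examples (columns: non-satire, satire).
text_probs = np.array([[0.20, 0.80], [0.70, 0.30], [0.55, 0.45]])
audio_probs = np.array([[0.40, 0.60], [0.60, 0.40], [0.35, 0.65]])

# Late fusion: the element-wise arithmetic mean gives each modality equal weight.
fused = (text_probs + audio_probs) / 2.0

# Final prediction: the class with the highest combined score.
predictions = fused.argmax(axis=1)
print(predictions)  # [1 0 1]
        </preformat>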
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Final Results for Competition</title>
      <p>The system was initially evaluated during the SatiSPeech@IberLEF 2025 competition. The model
achieved notable performance on the test set, with the following results:
• F1 Score (Textual Task): 0.832
• F1 Score (Multimodal Task): 0.837</p>
      <sec id="sec-7-1">
        <title>7.1. Results Overview</title>
        <p>In this subsection, we present the performance rankings for each subtask of the competition. The results
are evaluated using the MACRO F1-Score, which reflects the average performance across all classes.</p>
        <sec id="sec-7-1-1">
          <title>7.1.1. Task 1: Textual Task Ranking Results</title>
          <p>Below is the ranking of teams for the Textual Task, based on the MACRO F1-Score.</p>
        </sec>
        <sec id="sec-7-1-2">
          <title>7.1.2. Task 2: Multimodal Task Ranking Results</title>
          <p>Below is the ranking of teams for the Multimodal Task, based on the MACRO F1-Score.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Results and Discussion</title>
      <p>The results obtained in the SatiSPeech@IberLEF 2025 competition demonstrate the effectiveness of
the developed model. The model achieved an F1-Score of 0.832 for the Textual Task and 0.837 for
the Multimodal Task, highlighting its capability to effectively handle both text classification and the multimodal
integration of text and audio. The model ranked third in the Multimodal Task and fifth in the Textual
Task.</p>
      <p>The competitive performance in the rankings, with a score of 83.7 for the Multimodal Task and 83.2
for the Textual Task, indicates a good balance between model complexity and accuracy, showing strong
generalization capabilities on the challenging task of satire detection.</p>
      <sec id="sec-8-1">
        <title>8.1. Discussion of the Results</title>
        <p>The results obtained show that the model has been successfully applied to the task of satire detection in
both textual and multimodal formats. The performance is competitive, ranking in the top positions,
reflecting a well-optimized and balanced model. These results suggest that the approach used is
capable of addressing complex problems involving both natural language processing and audio analysis,
achieving a high level of generalization. It can be observed that the inclusion of the audio modality has
had a positive impact on performance, as evidenced by a significant improvement in the ranking of a
participant who moved from a macro F1 score of 84.4 in the textual task to 88.3 in the multimodal task.
This indicates that audio plays a beneficial role in enhancing the model’s ability to detect satirical content.
Nevertheless, there is still room for improvement in both tasks, particularly in the audio component,
where further refinements to the proposed models in this paper could lead to better classification
accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>9. Improvements After the Competition</title>
      <p>Following the conclusion of the competition, several enhancements were introduced to improve overall
model performance by refining both the textual and audio modalities. Modifications to the text pipeline
included training multiple BETO models with varied splits and epochs, while the audio component was
enhanced through different fusion strategies and attention-based mechanisms.</p>
      <sec id="sec-9-1">
        <title>9.1. Improvements After the Competition - Task 1</title>
        <p>After the competition, several strategies were applied to enhance the model’s performance in the
Textual Task. These improvements focused on both refining the model architecture and enhancing the
training dataset. The main approaches explored were the use of multiple BETO models with varying
training-validation splits and data augmentation techniques to enrich the textual data.</p>
        <p>The first approach involved training multiple models on different subsets of the data, each with
varying numbers of epochs. This method aimed to increase the model’s generalization capability by
exposing it to diferent data splits and training settings. Additionally, the incorporation of soft labeling
and max labeling strategies was evaluated to determine which method would yield better performance.</p>
        <p>The second improvement focused on augmenting the textual dataset to increase its size and
variability. By employing various techniques such as synonym replacement, back-translation, and random
deletion/insertion, the training data was made more diverse, which was expected to help the model
generalize better to unseen data.</p>
        <sec id="sec-9-1-1">
          <title>9.1.1. Multiple BETO Models with Different Training-Validation Splits</title>
          <p>Several BETO models were trained with different training-validation splits. For each configuration, the
number of epochs was varied, and the F1 scores were calculated for each model. Below are the results
for the models trained with 3, 4, 5, and 6 epochs.</p>
          <p>[Table: F1 scores for 1, 2, 3, 4, and 5 BETO models trained with varying numbers of epochs.]</p>
          <p>From the table above, it can be observed that the highest F1 score was achieved using 4 models
trained for 5 epochs, outperforming configurations with 3 and 6 epochs. Additionally, using multiple
models led to an improvement in performance compared to using a single model, with the F1 score
increasing as more models were included. Between soft labeling and max labeling, the soft labeling
approach was found to slightly outperform the max labeling technique, which is why it was chosen for
the experiments.</p>
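          <p>A small sketch of the two ensembling strategies is given below, interpreting soft labeling as averaging the probability distributions of the individual models and max labeling as majority voting over their hard predictions; this interpretation and the example outputs are assumptions, since the paper does not define the two terms formally.</p>
          <preformat>
import numpy as np

def soft_labeling(model_probs):
    """Average per-class probabilities across models, then take the arg-max."""
    return np.mean(model_probs, axis=0).argmax(axis=1)

def max_labeling(model_probs):
    """Majority vote over each model's hard (arg-max) predictions."""
    votes = np.stack([p.argmax(axis=1) for p in model_probs])  # (n_models, n_examples)
    counts = votes.sum(axis=0)                                  # votes for the satire class
    return (counts * 2 > votes.shape[0]).astype(int)            # strict majority wins

# Hypothetical outputs of four BETO models on two examples.
model_probs = [np.array([[0.3, 0.7], [0.6, 0.4]]) for _ in range(4)]
print(soft_labeling(model_probs))  # [1 0]
print(max_labeling(model_probs))   # [1 0]
          </preformat>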
        </sec>
        <sec id="sec-9-1-2">
          <title>9.1.2. Data Augmentation for Textual Data</title>
          <p>To further improve performance, data augmentation techniques were explored to increase the diversity
of the training data. The previously identified best-performing configuration (four models trained for
five epochs) served as the baseline for these experiments. The following augmentation techniques were
applied:
• Synonym Replacement: Words were randomly replaced with their synonyms to increase
vocabulary diversity.
• Back-Translation: Sentences were translated into another language (e.g., English) and then
translated back to Spanish, creating paraphrased versions of the original text.
• Random Deletion/Insertion: Words were randomly removed or added to sentences to create
variations in the textual structure and lexical content.</p>
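          <p>The sketch below illustrates two of these operations, synonym replacement and random deletion, on tokenized Spanish text; the miniature synonym dictionary is purely hypothetical, and back-translation is omitted because it additionally requires a translation model.</p>
          <preformat>
import random

# Hypothetical miniature synonym dictionary, for illustration only.
SYNONYMS = {
    "gran": ["enorme", "tremenda"],
    "noticia": ["novedad", "primicia"],
}

def synonym_replacement(tokens, p=0.2):
    """Randomly swap tokens that have an entry in the synonym dictionary."""
    return [
        random.choice(SYNONYMS[tok]) if tok in SYNONYMS and p > random.random() else tok
        for tok in tokens
    ]

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [tok for tok in tokens if random.random() >= p]
    return kept if kept else [random.choice(tokens)]

tokens = "menuda gran noticia nos espera hoy".split()
print(" ".join(synonym_replacement(tokens)))
print(" ".join(random_deletion(tokens)))
          </preformat>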
          <p>Despite the application of these augmentation techniques, a slight decrease in performance was
observed on the official test set when compared to the original configuration. As a result, the data
augmentation approach was not included in the final submission. While the performance differences
were not large, the original training setup without augmentation was found to be more reliable for
generalization, avoiding the risk of overfitting introduced by artificial noise in the data.</p>
        </sec>
      </sec>
      <sec id="sec-9-2">
        <title>9.2. Improvements After the Competition for Task 2</title>
        <p>After the competition, further improvements were made to the multimodal classification model by
exploring two main strategies for combining the predictions from the audio and text classifiers.</p>
        <p>The first approach involved a weighted averaging strategy, where various combinations of weights
were applied to the predictions of the two modalities. The objective was to identify the most effective
weight distribution capable of enhancing classification performance.</p>
        <p>The second approach consisted of developing a neural network based on a multimodal
attention mechanism. In this model, the class probabilities generated independently by the audio and
text classifiers are used as input and fused through an attention-based mechanism to obtain the final
prediction. The use of attention allows the network to selectively focus on the most informative aspects
of each modality, thereby improving the accuracy of the resulting label prediction.</p>
        <p>These enhancements were aimed at boosting overall model performance by fully exploiting the
complementary nature of audio and text information.</p>
      </sec>
      <sec id="sec-9-3">
        <title>9.3. Weighted Averaging Results</title>
        <p>In this section, the results of the weighted averaging approach for combining the predictions from
the audio and text classifiers are presented. The goal of this approach was to determine the optimal
balance between the two modalities by experimenting with different weight distributions for the class
probabilities generated by the models.</p>
        <p>For the weighted averaging, a variety of combinations of weights for the text and audio classifiers
were used, and the corresponding performance was evaluated using the F1 Score. Below are the results
for several weight combinations:</p>
        <p>[Table: F1 scores for the tested text/audio weight combinations.]</p>
        <p>From these results, it is clear that a balanced combination of the text and audio modalities generally
leads to better performance compared to using extreme values for either modality. The best performing
configuration was obtained when the text modality was weighted slightly more heavily (text = 0.65, audio =
0.35), achieving an F1 score of 0.858, outperforming the text-only baseline (0.853) and demonstrating
that a weighted fusion strategy can enhance classification accuracy by effectively leveraging the
complementary strengths of both modalities.</p>
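        <p>A compact sketch of this weight search is shown below; the validation probabilities and labels are hypothetical placeholders, the grid granularity is an assumption, and the macro F1 is computed with scikit-learn.</p>
        <preformat>
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical validation outputs: per-class probabilities and gold labels.
text_probs = np.array([[0.2, 0.8], [0.7, 0.3], [0.6, 0.4], [0.1, 0.9]])
audio_probs = np.array([[0.4, 0.6], [0.55, 0.45], [0.3, 0.7], [0.2, 0.8]])
labels = np.array([1, 0, 1, 1])

best_weight, best_f1 = None, -1.0
for w_text in np.linspace(0.0, 1.0, 21):              # grid over text weights
    fused = w_text * text_probs + (1.0 - w_text) * audio_probs
    score = f1_score(labels, fused.argmax(axis=1), average="macro")
    if score > best_f1:
        best_weight, best_f1 = w_text, score

print(f"best text weight: {best_weight:.2f}, macro F1: {best_f1:.3f}")
        </preformat>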
      </sec>
      <sec id="sec-9-4">
        <title>9.4. Multimodal Attention Network</title>
        <p>
          To further improve the multimodal classification model, a Multimodal Attention Neural Network
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] was employed, which combines the probability distributions obtained from the audio and text
models. The approach aims to better capture the relationships between the two modalities by leveraging
attention mechanisms.
        </p>
        <sec id="sec-9-4-1">
          <title>9.4.1. Data Preparation</title>
          <p>The dataset was split into training and validation sets with an 80%-20% ratio. The audio model previously
described was retrained exclusively on the training portion (4,800 examples) to ensure that the data
used for training the subsequent Multi-Head Attention fusion mechanism remained unseen during this
phase. This prevented any label leakage and ensured a fair evaluation when learning optimal fusion
weights.</p>
        </sec>
      </sec>
      <sec id="sec-9-5">
        <title>9.5. Multimodal Attention Network Model</title>
        <p>In order to improve the multimodal classification performance, a Multimodal Attention Network that
combines the probability distributions output by the audio and text models has been adopted. The core
idea behind this approach is to leverage the attention mechanism to learn the optimal fusion strategy
between the two modalities, enhancing the model’s ability to capture and integrate complementary
information from both audio and text.</p>
        <p>The architecture is structured around several interconnected components. First, the model takes as
input the probability distributions generated independently by the audio and text classifiers. These
distributions, which express the models’ confidence in the classification task, constitute the foundation
for the multimodal fusion.</p>
        <p>Subsequently, the input features are projected into three separate spaces—Queries, Keys, and Values—
through fully connected layers. These projections enable the model to build rich internal representations
and effectively guide the attention mechanism in focusing on the most relevant aspects of each modality.
The dimensionality of these projections is governed by the number of attention heads (4) and the hidden
dimension size.</p>
        <p>The core attention mechanism is based on the scaled dot-product attention, which computes similarity
scores between Queries and Keys. These scores are normalized using a softmax function, producing
attention weights that modulate the relative importance assigned to each modality. The resulting
weighted sum of the Values captures the most salient information from both audio and text sources.</p>
        <p>To preserve the original input features while enabling the learning of useful interactions, the attention
output is combined with the input through a residual connection and passed through a linear projection
layer. This ensures that the learned patterns do not override essential input characteristics.</p>
        <p>The fusion of modalities is then performed through a learnable weighted combination of the audio and
text outputs, governed by two scalar parameters, one per modality, which determine their respective contributions to
the final decision. This dynamic fusion mechanism allows the model to adaptively weigh the importance
of each modality based on the context.</p>
        <p>Finally, Layer Normalization is applied to promote training stability, and dropout is used with a
probability of 0.1 on the attention weights to mitigate overfitting. This combination ensures that the
model maintains robustness and generalizes well across different inputs.</p>
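        <p>A minimal sketch of such a fusion module is given below; the hidden dimension, the use of PyTorch's built-in multi-head attention, and the initialization of the two fusion weights are assumptions, while the four attention heads, the dropout rate of 0.1 on the attention weights, the residual connection, and the layer normalization follow the description above.</p>
        <preformat>
import torch
from torch import nn

class MultimodalAttentionFusion(nn.Module):
    """Attention-based fusion of the class-probability vectors produced by the
    text and audio classifiers (layer sizes are illustrative assumptions)."""

    def __init__(self, num_classes=2, hidden_dim=64, num_heads=4):
        super().__init__()
        # Project each modality's probability vector into a shared hidden space.
        self.text_proj = nn.Linear(num_classes, hidden_dim)
        self.audio_proj = nn.Linear(num_classes, hidden_dim)
        # Scaled dot-product multi-head attention with dropout on the attention weights.
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=num_heads,
                                               dropout=0.1, batch_first=True)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)
        # Learnable scalars weighting the contribution of each modality.
        self.text_weight = nn.Parameter(torch.tensor(0.5))
        self.audio_weight = nn.Parameter(torch.tensor(0.5))
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_probs, audio_probs):
        # Build a two-token "sequence": one token per modality.
        tokens = torch.stack(
            [self.text_proj(text_probs), self.audio_proj(audio_probs)], dim=1
        )                                                   # (batch, 2, hidden_dim)
        attn_out, _ = self.attention(tokens, tokens, tokens)
        # Residual connection plus linear projection, then layer normalization.
        fused = self.norm(tokens + self.out_proj(attn_out))
        # Learnable weighted combination of the text and audio representations.
        combined = self.text_weight * fused[:, 0, :] + self.audio_weight * fused[:, 1, :]
        return self.classifier(combined)

# Usage on hypothetical probability vectors from the two unimodal classifiers.
model = MultimodalAttentionFusion()
logits = model(torch.tensor([[0.2, 0.8]]), torch.tensor([[0.4, 0.6]]))
print(logits.shape)  # torch.Size([1, 2])
        </preformat>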
      </sec>
      <sec id="sec-9-6">
        <title>9.6. Multimodal Attention Network Results</title>
        <p>The resulting F1 score was 0.859, slightly higher than the best weighted average score of 0.858.</p>
        <p>This result suggests the potential effectiveness of the attention mechanism in combining probabilities
from both modalities, which could lead to improved overall classification performance. The model’s
ability to learn optimal fusion strategies appears promising and warrants further investigation.</p>
      </sec>
      <sec id="sec-9-7">
        <title>9.7. Post-Competition Results</title>
        <p>After the competition, the multimodal system was further refined and improved by testing different
techniques to optimize its performance.</p>
        <p>For Task 1, the main improvements came from:
• Increased Training Epochs for the BETO Models: The number of training epochs for the
BETO models was extended, allowing them to converge better and improve performance.
• Training Multiple BETO Models: Multiple BETO models were trained on different validation
and training splits, which helped increase the robustness and generalization capability of the
system.</p>
        <p>For Task 2, several fusion strategies were explored to improve performance. The following techniques
were implemented:
• Weighted Averaging: A weighted averaging strategy was experimented with, where the weights
of the predictions from the audio and text classifiers were adjusted. This helped find the optimal
balance between the modalities.
• Multimodal Attention Network: A neural network using a multimodal attention mechanism
was developed. The model takes class probabilities from both the audio and text classifiers and
combines them through attention-based fusion. This attention mechanism allows the model to
focus on the most relevant parts of each modality, improving prediction accuracy.</p>
        <p>The main performance improvements were achieved by training four BETO models for five epochs
each for Task 1. For Task 2, the implementation of the Multi-Head Attention strategy showed promising
results and contributed to a small but measurable performance increase.</p>
        <p>Post-competition Results:
• Task 1 (Binary Satire Detection - Text): F1-score of 0.853
• Task 2 (Multimodal Satire Detection): F1-score of 0.859</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>10. Conclusions and Future Improvements</title>
      <p>The results obtained clearly indicate that the textual information, captured through the BETO language
model, plays a predominant role in the classification task, offering the highest individual performance
among the modalities. However, the inclusion of acoustic features extracted using HuBERT contributed
to measurable improvements, reinforcing the hypothesis that audio and text provide complementary
information for the detection of satirical content.</p>
        <p>Despite the competitive results achieved, several avenues for future work remain open to further
enhance the system. A first line of improvement concerns the audio classifier itself, which in the current
approach is based on a relatively simple architecture. Employing more sophisticated neural architectures
or fine-tuning pre-trained speech models could potentially lead to better acoustic representations and
thus improved classification accuracy.</p>
        <p>A second promising direction involves refining the multimodal attention mechanism. In the present
work, attention is applied to the class probability distributions generated by the unimodal classifiers.
Future research could explore the design of a more expressive Multi-Head Attention module that
operates directly on the intermediate features extracted from both modalities. This would enable the
model to capture richer cross-modal interactions and possibly uncover deeper patterns relevant to satire
detection.</p>
        <p>Moreover, the current attention-based fusion approach was trained using a limited set of
approximately 1,200 multimodal examples. Increasing the amount of training data would likely enhance the
stability and robustness of the fusion model, especially in complex or ambiguous cases. Additionally,
systematically tuning the hyperparameters of the attention mechanism—such as the number of heads
and the size of the hidden layers—may yield configurations that better support the integration of
multimodal information and further improve the overall performance.</p>
        <p>In summary, this study shows that combining textual and acoustic cues is a viable strategy for satire
detection. Further improvements in fusion strategies, acoustic modeling, and training data augmentation
are expected to yield even better results in future research.</p>
      <p>During the preparation of this work, the author(s) used DeepL for grammar and spelling checking.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chaperon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poblete</surname>
          </string-name>
          ,
          <article-title>BETO: Spanish BERT pretrained model</article-title>
          , https://github.com/dccuchile/beto,
          <year>2020</year>
          . Accessed June 2025.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.-N.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bolte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H. H.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <article-title>HuBERT: Self-supervised speech representation learning by masked prediction of hidden units</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>29</volume>
          (
          <year>2021</year>
          )
          <fpage>3451</fpage>
          -
          <lpage>3463</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>García-Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bernal-Beltrán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>García-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Valencia-García</surname>
          </string-name>
          ,
          <article-title>Overview of SatiSPeech at IberLEF 2025: Multimodal Audio-Text Satire Classification in Spanish</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>75</volume>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025)</article-title>
          , CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Coria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Korshunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lavechin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fustes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Titeux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bouaziz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-P.</given-names>
            <surname>Gill</surname>
          </string-name>
          ,
          <article-title>pyannote.audio: neural building blocks for speaker diarization</article-title>
          ,
          <source>in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>7124</fpage>
          -
          <lpage>7128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          , et al.,
          <article-title>Whisper: OpenAI's speech recognition system</article-title>
          , https://github.com/openai/whisper,
          <year>2023</year>
          . Accessed June 2025.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>