Multimodal Fusion Techniques to Enhance Voice Disorder Diagnoses

Qingqing Liu1,*, Gabriele Ciravegna1,*, Alkis Koudounas1, Tania Cerquitelli1 and Elena Baralis1
1 Politecnico di Torino, Turin, Italy

Abstract
Voice disorders constitute a significant health concern, with an annual prevalence of approximately 7% among the adult population, adversely affecting patients' quality of life, encompassing both social and occupational functioning. Moreover, the majority of diagnostic methodologies still depend on invasive techniques, whereas non-invasive automated diagnostic approaches have not been extensively investigated yet. This study introduces a transformer-based method for detecting voice disorders, aimed at enhancing detection efficacy through a multimodal fusion strategy. Specifically, addressing two distinct types of voice recordings – extracted from sentence reading and sustained vowel emission – we devised and assessed five multimodal fusion strategies across three stages: early, mid, and late. Our experimental findings indicate that the cross-attention mid-fusion method harnesses the benefits of both data types, achieving a detection accuracy of 0.885 and a macro F1 score of 0.843 on an internal dataset. These results represent an improvement of +.03 to +.06 in accuracy and +.02 to +.05 in macro F1 score when compared to unimodal models (trained on sentence or vowel data only). This study represents an advancement towards effective non-invasive detection of voice disorders and provides insights for clinical practice.

Keywords
Medical AI, Pathological voice disorder, Transformers, Modality Fusion, Multimodal learning

Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 Joint Conference (March 25-28, 2025), Barcelona, Spain, as part of the DARLI-AP workshop held in conjunction with the EDBT/ICDT 2025 conference.
* Corresponding author.
s315203@studenti.polito.it (Q. Liu); gabriele.ciravegna@polito.it (G. Ciravegna); alkis.koudounas@polito.it (A. Koudounas)

1. Introduction

Voice disorders have a significant impact on people's lives, with 7% of adults suffering from them each year. They can lead to communication difficulties, reduced work productivity (7.4 work days lost per year on average), and even career changes, with 4% of patients reporting a career change due to voice problems [1, 2]. There are many types of voice disorders, including but not limited to murmurs, vocal cord dysfunction, and other voice problems caused by neurological diseases, and their early and accurate diagnosis is crucial for effective treatment [3].

Although traditional diagnostic techniques such as laryngoscopy and speech assessment are widely used clinically, they have significant limitations [4]. First, these diagnostic methods are very invasive and may cause discomfort to the patient, thus affecting the experience, particularly for patients requiring several investigations and recurrent controls (e.g., cancer patients) [5]. Second, these technologies often rely on expensive equipment and highly specialized operators, which limits their accessibility in resource-poor settings. Third, traditional methods rely on doctors' subjective judgments and suffer from subjective bias in the evaluation results. Finally, these methods are mostly used for diagnosis when symptoms are already evident rather than as proactive screening tools, limiting their role in the early detection of voice disorders [6].

The development of artificial intelligence [7], and especially the application of deep learning to audio and sound processing, provides new possibilities for overcoming the above challenges [8]. By enabling automated, non-invasive, and efficient diagnostics, deep learning methods can lower diagnostic costs and reduce the need for specialized professionals, making detection more accessible and accurate. In addition, these technologies can be integrated into portable devices or mobile applications for active screening and real-time monitoring, providing a new solution for the early detection and intervention of voice disorders.
Very recently, Transformer-based models [9, 10, 11] have been shown to be effective tools for the automatic detection of voice disorders. Their core advantage is the ability to capture long-term dependencies in time-series data, which is crucial for analyzing complex speech patterns. Through the self-attention mechanism, transformers can not only efficiently process large-scale datasets but also extract the complex patterns that determine voice characteristics. However, this area remains under-researched, with several questions still open due to the complexity and diversity of pathological voice features. Indeed, doctors perform different voice assessments to evaluate different voice properties, such as asking the patient to read pre-defined sentences and to emit sustained vowels.

This study addresses this challenge by proposing a multimodal approach to voice disorder detection. We leverage the strengths of the transformer architecture to analyze multimodal pathological speech data. Specifically, dealing with two different types of data, namely sentence readings and sustained vowels, we design a unified model to process them together. Three fusion strategies – early fusion, mid-level fusion, and late fusion – are investigated to effectively integrate cross-modal information.

We empirically demonstrate that mid-level fusion techniques are particularly suited to this task, demonstrating their ability to capture complementary features and improve detection performance. The cross-attention technique, in particular, achieves performance gains of +.03-.06 in accuracy and +.04-.05 in macro F1 compared to single-modality models, highlighting the potential of multimodal integration in enhancing detection performance. These findings highlight the feasibility of multimodal transformer-based models in clinical applications and lay a solid foundation for further advancement of automatic voice disorder detection.

The rest of the paper is organized as follows. In Section 2 we review the relevant research on voice disorder detection and analyze the main challenges of existing methods. Section 3 describes the proposed Transformer-based method in detail, focusing on the different multimodal fusion strategies. In Section 4 we introduce the experimental setup and evaluation methods, while in Section 5 we provide a comprehensive analysis and interpretation of the experimental results. Finally, in Section 6 we summarize the significance of our research, discuss potential limitations, and provide suggestions for future research directions.
2. Related work

This section reviews the relevant research on voice disorder detection and provides a theoretical basis for the tools and methods used in the subsequent sections. The discussion focuses on the evolution of voice feature analysis techniques, the application of classifiers in detecting voice disorders, and the latest progress in data augmentation and fusion models.

2.1. Automatic Voice Disorder Detection Methods

Traditional voice disorder detection methods rely on manual feature engineering, that is, extracting acoustic features such as Mel-frequency cepstral coefficients (MFCC), pitch jitter, and amplitude shimmer from speech signals [12, 13]. These features, rooted in digital signal processing and speech science, have long been the cornerstone of voice analysis. Based on these manual features, researchers have relied on shallow learning models such as support vector machines (SVMs) and multi-layer perceptrons (MLPs), which perform well in voice disorder detection problems in relatively simple or well-controlled environments [14, 15, 16]. However, the complexity of pathological voice features and the diversity of real-world scenarios have revealed the limitations of these traditional methods, particularly in terms of adaptability and generalization [17].

The advent of deep learning has transformed voice disorder detection, as it can automatically extract features from raw speech signals. Unlike traditional methods that rely on handcrafted features, deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can learn more abstract and comprehensive feature representations directly from data [12, 16, 18, 19, 20]. CNNs excel at capturing local patterns, while RNNs excel at modeling temporally related patterns, making them well suited for voice pathology analysis, particularly when employed together.

Recently, transformer-based architectures have made breakthroughs in automatic speech recognition and related tasks [21, 22, 23, 24, 25, 26]. These models use self-attention mechanisms to capture short- and long-range dependencies at the same time, thus performing well on complex speech patterns [8, 11, 27]. Among them, Wav2Vec2 [27] combines a convolutional encoder for extracting latent speech representations, a transformer-based context network for capturing long-distance dependencies, and a quantization module for self-supervised learning, further simplifying the feature extraction process. This end-to-end architecture enables the efficient and accurate analysis of voices under various conditions.

2.2. Multimodal fusion

In voice analysis, multi-modality refers to input data extracted from different data sources or forms of information [28]. For example, people chatting, singing, reading, or performing particular sound patterns are all typical modalities. The information provided by each modality may be different and complementary, and a single modality often cannot fully capture pathological features. Therefore, by fusing data from different modalities, we can obtain a more comprehensive picture of the pathological condition, thus improving the accuracy and robustness of detection.

In multimodal fusion, there are three main strategies: early fusion, mid-level fusion, and late fusion [28, 29]. Early fusion combines features from different modalities into a single vector before feeding them into a model. Mid-level fusion integrates data at an intermediate stage, allowing for more flexibility in capturing deeper correlations while maintaining some distinction between modalities. Late fusion trains separate models for each modality and combines their predictions via an aggregation function such as average voting, weighted voting, or a meta-classifier.
2.3. Shallow approach to Multi-modal Fusion for Voice Disorder detection

The research by Koudounas et al. [9] proposed an end-to-end method based on a transformer, which directly processes the original audio signal. To address the challenges posed by different recording types (such as sentence reading and sustained vowel utterances), they used a shallow mixture of experts (MoE) [30] framework to optimize the prediction alignment across recording types. Experimental results show that the method improves over single-modality approaches in speech pathology detection and classification tasks, and achieves good performance on public and private datasets. However, that study mainly focuses on synthetic data and the MoE framework, and lacks an in-depth exploration of multimodal fusion strategies.

Building on this, our study introduces a systematic investigation of multimodal fusion strategies in voice pathology detection. We focus on early, mid, and late fusion methods, especially mid and late fusion, because these two methods have greater flexibility and can capture deeper correlations between modalities. Compared with the method of [9], our study explores fusion strategies in more detail and demonstrates that mid-fusion strategies are the best multimodal approach in this domain to improve model generalization.

3. Method

This section outlines our contributions to multimodal fusion strategies, emphasizing the mathematical formulation of the problem and the model architecture. Specifically, we introduce early, mid-level, and late fusion strategies in a transformer architecture that integrates multiple modalities for robust prediction.

In this study, we use two speech-based modalities to solve the voice pathology detection task, each capturing different voice characteristics. The first modality x_1 represents the raw features extracted from the sentence-reading recording, while x_2 represents the features extracted from the second modality, the sustained-vowel recording. Given a multimodal architecture f, we input the raw audio waveforms into the Wav2Vec2 model [27], which performs the feature extraction for the different modalities. The model then outputs the probabilities ŷ, which are used to produce the final classification result.
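As a reference for this formulation, the following minimal sketch shows how frame-level representations can be obtained for the two modalities with the Hugging Face transformers implementation of Wav2Vec2. The checkpoint name and the dummy waveforms are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint: a base model pre-trained on LibriSpeech 960h.
CKPT = "facebook/wav2vec2-base-960h"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
backbone = Wav2Vec2Model.from_pretrained(CKPT)

def encode(waveform_16khz):
    """Map a raw 16 kHz waveform to frame-level contextual embeddings."""
    inputs = feature_extractor(waveform_16khz, sampling_rate=16_000,
                               return_tensors="pt", padding=True)
    with torch.no_grad():
        return backbone(**inputs).last_hidden_state  # (batch, n_frames, hidden_size)

# Dummy stand-ins for x_1 (sentence reading) and x_2 (sustained vowel).
x1 = np.random.randn(19 * 16_000).astype(np.float32)
x2 = np.random.randn(18 * 16_000).astype(np.float32)
h1, h2 = encode(x1), encode(x2)   # inputs to the fusion strategies described below
```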
3.1. Early fusion strategies

The early fusion strategy connects the raw features of the two modalities into a unified input representation. In this stage, we first truncate all audio samples to a uniform length to standardize the input and eliminate the bias caused by differences in sample length. Then, we directly concatenate the raw audio features from the two modalities. Specifically, the raw audio features are concatenated in a fixed order: modality x_1 first, then modality x_2. To further distinguish the two, a 1-second silence (s) padding is inserted between them, providing a clear boundary for the model (Eq. 1). After concatenation, the resulting unified features are fed into the pre-trained Wav2Vec2 model for prediction (Eq. 2). Figure 1 visually depicts this process.

Figure 1: Diagram of the early fusion.

x_early = [x_1; s; x_2]   (1)
ŷ = softmax(f(x_early))   (2)

where:
• s: a 1-second silence padding between the two modalities.
• x_early: the concatenated feature vector after early fusion.
• [;]: the concatenation operation.
• ŷ: the predicted output probabilities, produced by a softmax.

This method effectively captures modality-specific patterns from distinct modalities through a simple and direct feature combination.
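A minimal sketch of this waveform-level concatenation (Eqs. 1-2) is shown below, assuming 16 kHz audio stored as NumPy arrays; the sequence-classification head used here is a convenient stand-in and not necessarily the exact classification layer of our implementation.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

SR = 16_000
silence = np.zeros(SR, dtype=np.float32)          # 1-second silence separator s

def early_fuse(x1, x2):
    """Eq. 1: x_early = [x_1; s; x_2] at the raw-waveform level."""
    return np.concatenate([x1, silence, x2])

# Assumed checkpoint and a 2-class head (healthy vs. pathological).
CKPT = "facebook/wav2vec2-base-960h"
fe = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
clf = Wav2Vec2ForSequenceClassification.from_pretrained(CKPT, num_labels=2)

# Dummy waveforms standing in for one CS and one SV recording of the same subject.
x1 = np.random.randn(19 * SR).astype(np.float32)
x2 = np.random.randn(18 * SR).astype(np.float32)

inputs = fe(early_fuse(x1, x2), sampling_rate=SR, return_tensors="pt")
probs = torch.softmax(clf(**inputs).logits, dim=-1)   # Eq. 2: predicted probabilities
```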
3.2. Mid-level fusion strategies

In the mid-level fusion strategy, feature fusion is performed after CNN encoding but before the features are fed into the transformer encoder. This approach combines modality-specific features in a shared representation space, allowing the model to leverage interactions between modalities for more robust predictions. We analyze two different fusion strategies: concatenated embeddings and cross-attention.

3.2.1. Concatenated embeddings

In the concatenated embedding strategy, features are first extracted from each modality using a separate CNN layer and mapped to the same vector space (Eq. 3). We thus decompose the network into the composition of two modules f = g ∘ e, where e is the CNN-based feature extractor and g represents the transformer encoder layers. After feature extraction, the extracted embeddings are normalized to ensure consistency across modalities and then concatenated as in the early fusion approach (Eq. 4), but at a deeper feature level, as shown in Figure 2. Finally, the combined feature vector goes through a dimensionality reduction layer to fit the input size of the subsequent transformer encoder.

Figure 2: Mid-level fusion using concatenated embeddings.

h_1 = e_1(x_1),  h_2 = e_2(x_2)   (3)
x_mid = [h_1; h_2]   (4)
ŷ = softmax(g(x_mid))   (5)

where:
• h_1 and h_2: high-dimensional embeddings extracted from modalities x_1 and x_2 using the CNN extractor, respectively.
• x_mid: the concatenated feature embeddings from both modalities.
• g: the transformer encoder layers.

3.2.2. Cross-Attention

The cross-attention mechanism [11] dynamically captures interactions between modalities by computing attention weights based on the relationship between the Query (Q), Key (K), and Value (V) matrices. This allows the model to focus on important features across modalities.

First, given the input feature matrices h_1 and h_2 of the two modalities, we generate Q, K, and V through linear transformations,

Q = h_1 W_Q,  K = h_2 W_K,  V = h_2 W_V   (6)

Here, W_Q, W_K, and W_V are learnable weight matrices for the query, key, and value, respectively. Next, we calculate the attention matrix A between the Query (Q) and the Key (K) by measuring their similarity and normalizing it with a softmax. The attention weights are then used to perform a weighted sum of the Value V, generating the output features O:

A = softmax(Q K^T / √d_k)   (7)
O = A V   (8)

where:
• A is the attention matrix.
• d_k is the dimension of the key, and √d_k is the normalization factor used for scaling.

As illustrated in Figure 3, cross-attention is computed in both directions to effectively capture interactions between the two modalities:

1. We use h_1 as the query and h_2 as the key and value to compute the attention (Eq. 9).
2. We reverse the roles of the modalities and use h_2 as the query and h_1 as the key and value (Eq. 10).

Finally, the outputs of the cross-attention from both directions, O_1→2 and O_2→1, are concatenated to form a unified representation x_fused, as shown in Eq. 11. This concatenation merges the information from both modalities into a unified feature space. The fused features are then processed through a shared fusion layer before being passed to a transformer encoder for deeper feature extraction and, ultimately, classification.

O_1→2 = A_1→2 V_2 = CrossAttention(h_1, h_2)   (9)
O_2→1 = A_2→1 V_1 = CrossAttention(h_2, h_1)   (10)
x_fused = [O_1→2; O_2→1]   (11)
ŷ = softmax(g(x_fused))   (12)

where the arrows denote the direction of attention.

Figure 3: Mid fusion using cross-attention.
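The bidirectional cross-attention fusion (Eqs. 9-12) can be sketched in PyTorch as follows, using 4 attention heads as in our experimental setup (Section 4.3). The embedding dimension, the choice of concatenating the two outputs along the time axis, and the linear fusion layer are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttentionFusion(nn.Module):
    """Cross-attention in both directions, followed by concatenation and a fusion layer."""
    def __init__(self, dim=768, n_heads=4):
        super().__init__()
        # Each direction owns its learnable W_Q, W_K, W_V (inside MultiheadAttention).
        self.attn_1to2 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn_2to1 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.fusion = nn.Linear(dim, dim)   # shared fusion layer before the encoder g

    def forward(self, h1, h2):
        # h1, h2: (batch, frames, dim) embeddings of the two modalities.
        o_1to2, _ = self.attn_1to2(query=h1, key=h2, value=h2)   # Eq. 9
        o_2to1, _ = self.attn_2to1(query=h2, key=h1, value=h1)   # Eq. 10
        x_fused = torch.cat([o_1to2, o_2to1], dim=1)             # Eq. 11 (along time, an assumption)
        return self.fusion(x_fused)                              # passed on to the encoder g

# Example with dummy embeddings for the two modalities.
h1, h2 = torch.randn(2, 120, 768), torch.randn(2, 90, 768)
fused = BidirectionalCrossAttentionFusion()(h1, h2)   # shape (2, 210, 768)
```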
3.3. Late fusion strategies

While mid-level fusion captures fine-grained interactions between modality-specific embeddings, late fusion is performed at the decision level, allowing each modality to be optimized independently before being integrated into a unified prediction. This approach allows each model to focus on its specific modality, although it doubles the size of the final model. Two late fusion techniques are employed in our study.

Simple average. In this approach, the outputs of the two models, ŷ_1 and ŷ_2, are combined by taking their simple average, as illustrated in the top part of Figure 4. This strategy assumes that both models contribute equally to the final prediction. The combined output ŷ_late1 is computed as follows:

ŷ_late1 = (ŷ_1 + ŷ_2) / 2   (13)

where ŷ_1 and ŷ_2 are the probability distributions produced by the two individual models. This fusion method is simple and computationally efficient, as it avoids any extra parameters.

Mixture of Experts. As a second late fusion strategy, we employ a shallow mixture of experts (MoE) to combine the outputs of the two independent models and improve the overall performance of the system. Unlike the simple averaging method, this approach assigns weights to each model's predictions based on how relevant they are to the final output. As shown in Figure 4, we use a simple multi-layer perceptron (MLP) with a single hidden layer to predict the weights that combine the outputs of the two models. The input layer of the MLP is the concatenation of the probabilities of the two modalities (ŷ_1, ŷ_2), and the output layer applies a softmax function to ensure that the model weights sum to 1 (Eq. 14). During inference, the final prediction is computed using these weights to combine the contributions of both models (Eq. 15). This approach improves the system's performance on unseen data while maintaining a simple architecture.

[w_1, w_2] = softmax(MLP([ŷ_1; ŷ_2]))   (14)
ŷ_late2 = w_1 · ŷ_1,test + w_2 · ŷ_2,test   (15)

Here:
• [w_1, w_2]: weights learned from the concatenated outputs ŷ_1 and ŷ_2 on the validation set.
• ŷ_1,test, ŷ_2,test: predicted probabilities from the two models on the test set.

Figure 4: Two late fusion strategies. The upper right part is the simple average method, and the lower right part is the MoE method.
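Both decision-level combinations (Eqs. 13-15) can be sketched as follows; the 10-unit hidden layer mirrors the shallow MoE described in Section 4.3, while the remaining details (activation, training loop) are assumptions.

```python
import torch
import torch.nn as nn

def simple_average(p1, p2):
    """Eq. 13: average the per-class probabilities of the two unimodal models."""
    return 0.5 * (p1 + p2)

class MoEGate(nn.Module):
    """Shallow MoE: an MLP maps the concatenated probabilities to two modality weights."""
    def __init__(self, n_classes=2, hidden=10):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * n_classes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, p1, p2):
        w = torch.softmax(self.mlp(torch.cat([p1, p2], dim=-1)), dim=-1)   # Eq. 14
        return w[..., :1] * p1 + w[..., 1:] * p2                           # Eq. 15

# Example with dummy probability vectors from the CS and SV models.
p1, p2 = torch.tensor([[0.7, 0.3]]), torch.tensor([[0.4, 0.6]])
print(simple_average(p1, p2), MoEGate()(p1, p2))
```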
4. Results

This section provides an overview of the dataset and pre-processing methods used in our experiments, followed by a detailed description of the training setup to ensure reproducibility.

All experiments were conducted in a cloud-based environment equipped with a Tesla P100-PCIE-16GB GPU¹. Details of the software environment can be found in the project repository².

¹ We gratefully acknowledge the computational resources provided by Kaggle (https://www.kaggle.com/) for this research. We also appreciate the early-stage support from HPC@Polito (http://www.hpc.polito.it).
² GitHub repository: github.com/multimodal_pathologies_prediction

4.1. Dataset

IPV. The Italian Pathological Voice (IPV) dataset is a novel and diverse resource designed specifically for voice pathology research, currently unpublished and introduced in [9]. Collected from participants in Italian otolaryngology and voice therapy clinics, the dataset includes both healthy individuals and patients with varying degrees of voice disorders. All recordings were conducted under strict standardization protocols in quiet environments, ensuring high-quality samples with a signal-to-noise ratio exceeding 30 dB and a fixed microphone distance of 30 cm.

The dataset comprises two modalities: sustained phonation of the vowel /a/ (SV) and reading of five phonetically balanced sentences (CS) adapted from the Italian version of CAPE-V [31]. Each sample includes detailed health condition notes and diagnoses from experienced physicians. Table 1 provides a detailed summary of the dataset characteristics, including sample distribution, record length, and modality information.

Table 1: Characteristics of the dataset. Healthy and Pathological report the number of healthy and diseased samples, respectively. CS indicates the number of sentence reading samples, and SV the number of sustained vowel samples. T(s) denotes the average duration of the audio samples in seconds.

        Healthy   Pathological   CS    SV    T(s)
IPV     362       672            517   517   12.95

Audio Preprocessing. To ensure the consistency of audio duration and facilitate comparison, we cropped the samples in the dataset to fixed lengths: CS samples were cropped to 19 seconds, and SV samples to 18 seconds. These lengths are designed to cover approximately 90% of the samples in each modality, ensuring that most voice information is preserved for effective model training while reducing the impact of outlier samples that are too long. Samples shorter than the fixed lengths were zero-padded to the required duration.

The audio data was then standardized using the predefined processor provided by the Wav2Vec2 framework. The processor first resamples the audio to 16 kHz to ensure compatibility with the framework and reduce computational overhead. The resulting feature representation not only effectively captures the key information in the speech signal, but also provides consistent and efficient input features for the model to support the subsequent training tasks.

To address the imbalance of the pathological voice data (healthy samples are fewer than pathological samples), a stratified sampling method was used when splitting the data to ensure proportional representation of healthy and pathological samples across all splits. We divided the data into training, validation, and test sets in a ratio of 8:1:1 to ensure fair and reproducible evaluations. The test set was first separated using a fixed random seed. Subsequently, the training and validation sets were further split using three different random seeds to create multiple splits. The final results are calculated by averaging the performance metrics over these splits to ensure the robustness and reliability of the evaluation.
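A sketch of the cropping, zero-padding, and Wav2Vec2 standardization described above is given below, assuming librosa is used for loading; the exact I/O code of our pipeline may differ.

```python
import numpy as np
import librosa
from transformers import Wav2Vec2FeatureExtractor

SR = 16_000
MAX_SEC = {"CS": 19, "SV": 18}   # fixed lengths covering ~90% of each modality

def load_fixed_length(path, modality):
    """Load a recording at 16 kHz, crop it to the fixed length, zero-pad if shorter."""
    wav, _ = librosa.load(path, sr=SR)
    n = MAX_SEC[modality] * SR
    wav = wav[:n]
    if len(wav) < n:
        wav = np.pad(wav, (0, n - len(wav)))
    return wav

fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
# inputs = fe(load_fixed_length("sample_cs.wav", "CS"), sampling_rate=SR, return_tensors="pt")
```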
4.2. Baselines

To verify the effectiveness of our proposed method and provide a comparison, we designed a series of traditional baseline models, including a classic multi-layer perceptron (MLP) and a lightweight convolutional neural network (MobileNetV2 [32]) based on transfer learning. These baseline models are trained on traditional audio features to evaluate the performance of different model architectures. In contrast, the unimodal model based on the Wav2Vec2 processor directly processes the audio waveform to extract features, reflecting the advantages of end-to-end methods.

In the feature extraction process of the baseline models, the audio data is uniformly resampled to 16 kHz and truncated to a fixed maximum duration to ensure sample consistency. We extract 40-dimensional MFCC features through librosa, transpose them into a time-step sequence, and uniformly zero-fill the feature sequences. At the same time, a padding mask is generated to distinguish between real data and padding. The following is the specific design of the two baseline models.

MLP. The MLP is designed with two fully connected layers containing 50 hidden units each, using the ReLU activation function to extract high-dimensional features, aggregating the time-dimension information through a global average pooling layer, and finally performing binary classification through a Softmax output layer. The training process uses the Adam optimizer with a learning rate of 0.01, a batch size of 16, and an early stopping strategy to prevent overfitting.

2D-CNN. The audio features are converted to 2D images by repeating the single channel into three RGB channels to fit the input requirements of the pre-trained model. We load the ImageNet [33] pre-trained weights of MobileNetV2 [32], remove the top classification head, and add a global average pooling layer, a 512-unit fully connected layer, and a Softmax classification layer. Dropout is added to the top network to improve generalization, and the pre-trained feature extraction part is fine-tuned. Two fine-tuning strategies are used: full fine-tuning and head-only fine-tuning. In full fine-tuning, all layers of MobileNetV2 are updated during training to maximize performance; in head-only fine-tuning, only the newly added classification head is trained, while the pre-trained feature extraction layers are frozen to retain the common features learned from ImageNet. The training hyperparameters of both strategies are consistent with the MLP model.
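The baseline feature pipeline and the MLP baseline can be sketched as follows: 40-dimensional MFCCs from librosa, transposed to a (time, features) sequence with a padding mask, followed by two 50-unit layers and masked average pooling over time. The maximum number of frames and the PyTorch realization are assumptions for illustration.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def mfcc_sequence(path, sr=16_000, n_mfcc=40, max_frames=600):
    """40-dim MFCCs as a zero-padded (time, features) sequence plus a padding mask."""
    wav, _ = librosa.load(path, sr=sr)
    feats = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc).T[:max_frames]  # (frames, 40)
    mask = np.zeros(max_frames, dtype=bool)
    mask[:len(feats)] = True                          # True = real frames, False = padding
    padded = np.zeros((max_frames, n_mfcc), dtype=np.float32)
    padded[:len(feats)] = feats
    return padded, mask

class MLPBaseline(nn.Module):
    """Two 50-unit ReLU layers, masked average pooling over time, softmax output."""
    def __init__(self, n_mfcc=40, hidden=50, n_classes=2):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(n_mfcc, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x, mask):          # x: (batch, frames, 40), mask: (batch, frames)
        h = self.ff(x) * mask.unsqueeze(-1)
        pooled = h.sum(dim=1) / mask.sum(dim=1, keepdim=True)   # average over real frames
        return torch.softmax(self.head(pooled), dim=-1)
```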
4.3. Training Procedure

Our method is based on a pre-trained Wav2Vec2.0 model (trained on the LibriSpeech 960-hour dataset) and evaluates three fusion strategies on the IPV dataset: early fusion, mid-level fusion, and late fusion.

Early fusion is accomplished by directly concatenating the original CS and SV audio of the same individual and inserting 1 second of silence between them (for a total audio length of 38 seconds) to avoid feature loss. The concatenated audio signals are uniformly processed by the Wav2Vec2.0 processor to ensure consistency in feature extraction. Mid-level fusion is based on two fine-tuned Wav2Vec2.0 models, and global feature modeling is achieved through a shared Transformer encoder (initialized with pre-trained Wav2Vec2 parameters). The first method directly concatenates the extracted CS and SV features, while the second achieves feature interaction through a bidirectional cross-modal attention mechanism. The number of attention heads is set to 4. Late fusion utilizes the fine-tuned CS and SV models and generates the final classification results by combining the probabilities from both modalities, either through simple averaging or through a shallow MoE (an MLP with 10 hidden nodes) that determines the modality weighting based on the probabilities from the training and validation sets.

All experiments were completed within 50 training epochs, using fixed random seeds to ensure the reproducibility of the results. The AdamW optimizer (weight decay = 0.01) was used for all experiments. A linear learning rate scheduler was used, reducing the learning rate linearly over the total number of training steps, with no warm-up steps. Initial learning rates were tuned manually, using 1e-5 for single-modality and concatenated-fusion models and 6e-6 for cross-attention fusion. To address class imbalance, a weighted cross-entropy loss function was applied, with class weights computed from the training dataset's label distribution. The batch size was set to 8, and an early stopping strategy with a patience of 10 epochs was used to terminate training when the validation performance plateaued. More experimental details and hyperparameter configurations can be found in the GitHub repository of the article.
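The optimization setup described above can be sketched as follows (AdamW with weight decay 0.01, linear decay without warm-up, class-weighted cross-entropy). The model and label tensors are placeholders, and the inverse-frequency weighting formula is one common choice assumed here for illustration.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Placeholders standing in for the fusion model and the training labels.
model = torch.nn.Linear(768, 2)
train_labels = torch.randint(0, 2, (827,))

EPOCHS, BATCH_SIZE, LR = 50, 8, 1e-5                 # 6e-6 was used for cross-attention fusion

# Class weights from the training label distribution to counter class imbalance.
counts = torch.bincount(train_labels).float()
weights = counts.sum() / (len(counts) * counts)
criterion = torch.nn.CrossEntropyLoss(weight=weights)

optimizer = AdamW(model.parameters(), lr=LR, weight_decay=0.01)
total_steps = EPOCHS * (len(train_labels) // BATCH_SIZE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=total_steps)

# Per batch: loss = criterion(logits, labels); loss.backward(); optimizer.step();
# scheduler.step(); optimizer.zero_grad() -- with early stopping (patience 10) on validation.
```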
4.4. Evaluation Metrics

To evaluate the performance of the model in the voice disorder detection task, we used two key metrics.

Accuracy. Accuracy measures the proportion of correctly predicted samples over the total number of samples, providing an overall assessment of classification performance:

Accuracy = Number of Correct Predictions / Total Number of Samples   (16)

While accuracy is a useful general metric, it can be less informative on imbalanced datasets.

Macro F1-Score. To better evaluate performance across imbalanced classes, we adopted the macro-averaged F1 score, which calculates the F1 score for each class and then averages them:

F1 = 2 × (Precision · Recall) / (Precision + Recall)   (17)
Macro F1-Score = (1/N) Σ_{i=1}^{N} F1_i   (18)

Here, N is the total number of categories. The macro F1-score gives equal importance to all classes, making it particularly suitable for tasks with class imbalance, such as voice disorder detection.
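Both metrics (Eqs. 16-18) correspond to standard scikit-learn calls; the sketch below uses dummy labels for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Dummy ground truth and predictions (0 = healthy, 1 = pathological).
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)                # Eq. 16
macro_f1 = f1_score(y_true, y_pred, average="macro")     # Eqs. 17-18
print(f"accuracy={accuracy:.3f}, macro-F1={macro_f1:.3f}")
```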
Table 2: Performance comparison of single-modality baselines and dual-modality fusion methods. CS refers to a single modality with sentence reading, SV to a single modality with vowel pronunciation. Fusion methods report a single value per metric, as they use both modalities.

Modality   Method                              Accuracy (CS / SV)        Macro F1 (CS / SV)
Single     MLP                                 .801±.011 / .750±.057     .767±.022 / .686±.053
Single     2D-CNN (Train all layers)           .667±.011 / .673±.000     .400±.004 / .402±.000
Single     2D-CNN (Fine-tune classify head)    .789±.019 / .782±.048     .765±.021 / .723±.063
Single     Wav2Vec2                            .859±.029 / .827±.000     .837±.038 / .793±.000
Multi      Early Fusion                        .859±.011                 .829±.016
Multi      Mid (Concatenated Embeddings)       .878±.011                 .838±.014
Multi      Mid (Cross Attention)               .885±.000                 .843±.005
Multi      Late (Simple Average)               .852±.022                 .824±.027
Multi      Late (MoE)                          .872±.011                 .857±.012

5. Discussion

In this section, we analyze and interpret the experimental results by focusing on two key aspects: comparing the baseline models to assess their effectiveness as references, and evaluating the different fusion models to explore their ability to integrate multimodal information and improve generalization to unseen data. By systematically studying these factors, we aim to highlight the strengths and limitations of the proposed approach and provide insights for future improvements.

Benchmark comparison. Table 2 presents a comparison of the performance of unimodal baseline models for voice disorder detection on the IPV dataset.

As expected, Wav2Vec2 achieved the best results among the four baseline models, with accuracies of .859 and .827 in the CS and SV modes, respectively, and .837 and .793 macro F1. The superior performance of Wav2Vec2 underscores the benefits of self-supervised pre-training on large-scale audio data. This means that the model does not need to be trained from scratch: through pre-training and transfer learning, it provides audio features with good generalization capabilities, even with a small amount of labeled data. Moreover, it benefits from the attention mechanism, which better extracts relevant features from long sequences of data.

The MLP model performs well in CS mode, with an accuracy of .801 and macro F1 of .767, but drops to .750 and .686 in SV mode, highlighting its limitations in capturing complex audio features with limited contextual information. Compared to the MLP, our method improves macro F1 by +.07 in CS mode and +.10-.11 in SV mode, with corresponding accuracy improvements of +.05-.06 and +.07-.08.

For the 2D-CNN, fully fine-tuning all layers leads to poor performance (.667 and .673 accuracy in CS and SV modes, respectively; .400 and .402 macro F1), likely due to the disruption of the pre-trained features. Fine-tuning only the classification head improves the performance to .789 and .782 accuracy in CS and SV modes, and .765 and .723 macro F1, respectively. However, our method performs better than the fine-tuned 2D-CNN, with a +.07-.08 improvement in macro F1 and +.07-.08 and +.04-.05 improvements in accuracy in CS and SV modes.

The above results show that fine-tuning the pre-trained Wav2Vec2 model is an effective solution for small-dataset tasks, and they highlight the necessity of carefully designed optimization methods.

Fusion strategy vs. single-modality performance. The fusion methods show an advantage over the single modalities by effectively combining the complementary information of the CS and SV inputs. In particular, as shown in Table 2, the proposed mid-level fusion pipeline shows significant improvements over the single-modality models. Concatenated Embeddings improves accuracy by +.02 and macro F1 by +.001 over the CS model, and by +.05 and +.04 over the SV model, respectively. Cross Attention performs even better, with accuracy and F1 gains of +.03 and +.006 over the CS model, and +.06 and +.05 over the SV model, respectively. These results highlight the benefits of leveraging complementary information from multiple modalities.

When compared to the other fusion strategies, mid-level fusion consistently outperforms both early and late fusion methods. The cross-attention method achieves the best results with .885 accuracy and .843 macro F1, which is +.02-.03 in accuracy and +.01-.02 in macro F1 compared with early fusion. Similarly, it achieves a +.01-.04 improvement in accuracy and a +.01-.03 improvement in macro F1 compared to late fusion strategies such as the Mixture of Experts (MoE). These results demonstrate the effectiveness of dynamically capturing inter-modality dependencies during feature integration.

Compared with early fusion, which concatenates raw features, the proposed mid-level fusion method can model complex inter-dependencies, leading to robust feature representations. In contrast, late fusion methods, while simpler to implement, operate at the decision level and cannot fully exploit the interactions between modalities.

In summary, our proposed mid-level fusion strategy, especially the cross-attention variant, achieves the best performance among all methods. The results show that it is able to dynamically integrate complementary modality information, leading to significant improvements in accuracy and macro F1 performance.
Baralis, doi:10.21437/Interspeech.2024- 522 . P. Garza, L. Cagliero, S. M. Siniscalchi, Benchmarking [11] A. Vaswani, Attention is all you need, Advances in representations for speech, music, and acoustic events, Neural Information Processing Systems (2017). in: 2024 IEEE International Conference on Acoustics, [12] X. Peng, H. Xu, J. Liu, J. Wang, C. He, Voice disor- Speech, and Signal Processing Workshops (ICASSPW), der classification using convolutional neural network 2024. based on deep transfer learning, Scientific Reports 13 [26] S. wen Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, (2023) 7264. K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. [13] L. W. Lopes, L. B. Simões, J. D. da Silva, D. da Silva Evan- Lin, T.-H. Huang, W.-C. Tseng, K. tik Lee, D.-R. Liu, gelista, A. C. d. N. e Ugulino, P. O. C. Silva, V. J. D. Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, Vieira, Accuracy of acoustic analysis measurements H. yi Lee, Superb: Speech processing universal per- in the evaluation of patients with different laryngeal formance benchmark, in: Interspeech 2021, 2021, pp. diagnoses, Journal of voice 31 (2017) 382–e15. 1194–1198. doi:10.21437/Interspeech.2021- 1775 . [14] M. Alhussein, G. Muhammad, Automatic voice pathol- [27] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec ogy monitoring using parallel deep models for smart 2.0: A framework for self-supervised learning of healthcare, Ieee Access 7 (2019) 46474–46479. speech representations, Advances in neural informa- [15] P. H. Leung, K. T. Chui, K. Lo, P. O. de Pablos, A support tion processing systems 33 (2020) 12449–12460. vector machine–based voice disorders detection using [28] S. R. Stahlschmidt, B. Ulfenborg, J. Synnergren, Mul- human voice signal, in: Artificial Intelligence and Big timodal deep learning for biomedical data fusion: a Data Analytics for Smart Healthcare, Elsevier, 2021, review, Briefings in Bioinformatics 23 (2022) bbab569. pp. 197–208. [29] L. Ilias, D. Askounis, J. Psarras, Detecting dementia [16] X. Peng, H. Xu, J. Liu, J. Wang, C. He, Voice disor- from speech and transcripts using transformers, Com- der classification using convolutional neural network puter Speech & Language 79 (2023) 101485. based on deep transfer learning, Scientific Reports 13 [30] R. Gupta, K. Audhkhasi, S. Narayanan, A mixture of (2023) 7264. experts approach towards intelligibility classification [17] U. K. Lilhore, S. Dalal, N. Faujdar, M. Margala, of pathological speech, in: 2015 IEEE international P. Chakrabarti, T. Chakrabarti, S. Simaiya, P. Kumar, conference on acoustics, speech and signal processing P. Thangaraju, H. Velmurugan, Hybrid cnn-lstm model (ICASSP), IEEE, 2015, pp. 1986–1990. with efficient hyperparameter tuning for prediction of [31] G. B. Kempster, B. R. Gerratt, K. V. Abbott, J. Barkmeier- parkinson’s disease, Scientific Reports 13 (2023) 14605. Kraemer, R. E. Hillman, Consensus auditory- [18] A. S. Almasoud, T. A. E. Eisa, F. N. Al-Wesabi, A. Elsafi, perceptual evaluation of voice: development of a stan- M. Al Duhayyim, I. Yaseen, M. A. Hamza, A. Motwakel, dardized clinical protocol (2009). Parkinson’s detection using rnn-graph-lstm with op- [32] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, timization based on speech signals, Comput. Mater. L. Chen, Inverted residuals and linear bottlenecks: Contin 72 (2022) 872–886. Mobile networks for classification, detection and seg- [19] R. Islam, E. Abdel-Raheem, M. Tarique, Voice pathol- mentation, CoRR abs/1801.04381 (2018). 
References

[1] N. Bhattacharyya, The prevalence of voice problems among adults in the United States, The Laryngoscope 124 (2014) 2359-2362.
[2] N. Roy, R. M. Merrill, S. D. Gray, E. M. Smith, Voice disorders in the general population: prevalence, risk factors, and occupational impact, The Laryngoscope 115 (2005) 1988-1995.
[3] C. L. Payten, G. Chiapello, K. A. Weir, C. J. Madill, Frameworks, terminology and definitions used for the classification of voice disorders: a scoping review, Journal of Voice (2022).
[4] P. Daraei, C. R. Villari, A. D. Rubin, A. T. Hillel, E. R. Hapner, A. M. Klein, M. M. Johns, The role of laryngoscopy in the diagnosis of spasmodic dysphonia, JAMA Otolaryngology-Head & Neck Surgery 140 (2014) 228-232.
[5] G. Ciravegna, A. Koudounas, M. Fantini, T. Cerquitelli, E. Baralis, E. Crosetti, G. Succo, Non-invasive AI-powered diagnostics: The case of voice-disorder detection - vision paper, EDBT/ICDT Workshop 2348 (2024).
[6] M. Fantini, G. Ciravegna, A. Koudounas, T. Cerquitelli, E. Baralis, G. Succo, E. Crosetti, The rapidly evolving scenario of acoustic voice analysis in otolaryngology, Cureus 16 (2024) e73491.
[7] P. Rajpurkar, E. Chen, O. Banerjee, E. J. Topol, AI in health and medicine, Nature Medicine 28 (2022) 31-38.
[8] S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: Unsupervised pre-training for speech recognition, arXiv preprint arXiv:1904.05862 (2019).
[9] A. Koudounas, G. Ciravegna, M. Fantini, E. Crosetti, G. Succo, T. Cerquitelli, E. Baralis, Voice disorder analysis: a transformer-based approach, in: Interspeech 2024, 2024, pp. 3040-3044. doi:10.21437/Interspeech.2024-1122.
[10] M. La Quatra, M. F. Turco, T. Svendsen, G. Salvi, J. R. Orozco-Arroyave, S. M. Siniscalchi, Exploiting foundation models and speech enhancement for Parkinson's disease detection from speech in real-world operative conditions, in: Interspeech 2024, 2024, pp. 1405-1409. doi:10.21437/Interspeech.2024-522.
[11] A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017).
[12] X. Peng, H. Xu, J. Liu, J. Wang, C. He, Voice disorder classification using convolutional neural network based on deep transfer learning, Scientific Reports 13 (2023) 7264.
[13] L. W. Lopes, L. B. Simões, J. D. da Silva, D. da Silva Evangelista, A. C. d. N. e Ugulino, P. O. C. Silva, V. J. D. Vieira, Accuracy of acoustic analysis measurements in the evaluation of patients with different laryngeal diagnoses, Journal of Voice 31 (2017) 382.e15.
[14] M. Alhussein, G. Muhammad, Automatic voice pathology monitoring using parallel deep models for smart healthcare, IEEE Access 7 (2019) 46474-46479.
[15] P. H. Leung, K. T. Chui, K. Lo, P. O. de Pablos, A support vector machine-based voice disorders detection using human voice signal, in: Artificial Intelligence and Big Data Analytics for Smart Healthcare, Elsevier, 2021, pp. 197-208.
[16] X. Peng, H. Xu, J. Liu, J. Wang, C. He, Voice disorder classification using convolutional neural network based on deep transfer learning, Scientific Reports 13 (2023) 7264.
[17] U. K. Lilhore, S. Dalal, N. Faujdar, M. Margala, P. Chakrabarti, T. Chakrabarti, S. Simaiya, P. Kumar, P. Thangaraju, H. Velmurugan, Hybrid CNN-LSTM model with efficient hyperparameter tuning for prediction of Parkinson's disease, Scientific Reports 13 (2023) 14605.
[18] A. S. Almasoud, T. A. E. Eisa, F. N. Al-Wesabi, A. Elsafi, M. Al Duhayyim, I. Yaseen, M. A. Hamza, A. Motwakel, Parkinson's detection using RNN-graph-LSTM with optimization based on speech signals, Comput. Mater. Contin. 72 (2022) 872-886.
[19] R. Islam, E. Abdel-Raheem, M. Tarique, Voice pathology detection using convolutional neural networks with electroglottographic (EGG) and speech signals, Computer Methods and Programs in Biomedicine Update 2 (2022) 100074.
[20] X. Xie, H. Cai, C. Li, Y. Wu, F. Ding, A voice disease detection method based on MFCCs and shallow CNN, Journal of Voice (2023).
[21] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505-1518.
[22] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, HuBERT: Self-supervised speech representation learning by masked prediction of hidden units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451-3460.
[23] A. Koudounas, E. Pastor, G. Attanasio, L. de Alfaro, E. Baralis, Prioritizing data acquisition for end-to-end speech model improvement, in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 7000-7004. doi:10.1109/ICASSP48485.2024.10446326.
[24] A. Koudounas, M. La Quatra, S. M. Siniscalchi, E. Baralis, voc2vec: A foundation model for non-verbal vocalization, in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[25] M. La Quatra, A. Koudounas, L. Vaiani, E. Baralis, P. Garza, L. Cagliero, S. M. Siniscalchi, Benchmarking representations for speech, music, and acoustic events, in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2024.
[26] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, H.-y. Lee, SUPERB: Speech processing universal performance benchmark, in: Interspeech 2021, 2021, pp. 1194-1198. doi:10.21437/Interspeech.2021-1775.
[27] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449-12460.
[28] S. R. Stahlschmidt, B. Ulfenborg, J. Synnergren, Multimodal deep learning for biomedical data fusion: a review, Briefings in Bioinformatics 23 (2022) bbab569.
[29] L. Ilias, D. Askounis, J. Psarras, Detecting dementia from speech and transcripts using transformers, Computer Speech & Language 79 (2023) 101485.
[30] R. Gupta, K. Audhkhasi, S. Narayanan, A mixture of experts approach towards intelligibility classification of pathological speech, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 1986-1990.
[31] G. B. Kempster, B. R. Gerratt, K. V. Abbott, J. Barkmeier-Kraemer, R. E. Hillman, Consensus auditory-perceptual evaluation of voice: development of a standardized clinical protocol (2009).
[32] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, L. Chen, Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation, CoRR abs/1801.04381 (2018). URL: http://arxiv.org/abs/1801.04381. arXiv:1801.04381.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248-255.