Multimodal Fusion Techniques to Enhance Voice Disorder Diagnoses

Qingqing Liu1,*, Gabriele Ciravegna1,*, Alkis Koudounas1, Tania Cerquitelli1 and Elena Baralis1
1 Politecnico di Torino, Turin, Italy

Abstract
Voice disorders constitute a significant health concern, with an annual prevalence of approximately 7% among the adult population, adversely affecting patients' quality of life, encompassing both social and occupational functioning. Moreover, the majority of diagnostic methodologies still depend on invasive techniques, whereas non-invasive automated diagnostic approaches have not been extensively investigated yet. This study introduces a transformer-based method for detecting voice disorders, aimed at enhancing detection efficacy through a multimodal fusion strategy. Specifically, addressing two distinct types of voice recordings – extracted from sentence reading and sustained vowel emission – we devised and assessed five multimodal fusion strategies across three stages: early, mid, and late. Our experimental findings indicate that the cross-attention mid-fusion method harnesses the benefits of both data types, achieving a detection accuracy of 0.885 and a macro F1 score of 0.843 on an internal dataset. These results represent an improvement of +.03 to +.06 in accuracy and +.02 to +.05 in macro F1 score when compared to unimodal models (trained on sentence or vowel data only). This study represents an advancement towards effective non-invasive detection of voice disorders and provides insights for clinical practice.

Keywords
Medical AI, Pathological voice disorder, Transformers, Modality Fusion, Multimodal learning

Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 Joint Conference (March 25-28, 2025), Barcelona, Spain, as part of the DARLI-AP workshop held in conjunction with the EDBT/ICDT 2025 conference.
* Corresponding author.
s315203@studenti.polito.it (Q. Liu); gabriele.ciravegna@polito.it (G. Ciravegna); alkis.koudounas@polito.it (A. Koudounas)

1. Introduction

Voice disorders have a significant impact on people's lives, with 7% of adults suffering from them each year. They can lead to communication difficulties, reduced work productivity (7.4 work days lost per year on average), and even career changes, with 4% of patients reporting a career change due to voice problems [1, 2]. There are many types of voice disorders, including but not limited to murmurs, vocal cord dysfunction, and other voice problems caused by neurological diseases, and their early and accurate diagnosis is crucial for effective treatment [3].

Although traditional diagnostic techniques such as laryngoscopy and speech assessment are widely used clinically, they have significant limitations [4]. First, these diagnostic methods are very invasive and may cause discomfort to the patient, thus affecting the experience, particularly for patients requiring several investigations and recurrent controls (e.g., cancer patients) [5]. Second, these technologies often rely on expensive equipment and highly specialized operators, which limits their accessibility in resource-poor settings. Third, traditional methods rely on doctors' subjective judgments and suffer from subjective bias in the evaluation results. Finally, these methods are mostly used for diagnosis when symptoms are already evident rather than as proactive screening tools, limiting their role in the early detection of voice disorders [6].

The development of artificial intelligence [7], and especially the application of deep learning to audio and sound processing, provides new possibilities for overcoming the above challenges [8]. By enabling automated, non-invasive, and efficient diagnostics, deep learning methods can lower diagnostic costs and reduce the need for specialized professionals, making detection more accessible and accurate. In addition, these technologies can be integrated into portable devices or mobile applications for active screening and real-time monitoring, providing a new solution for the early detection and intervention of voice disorders.
Very recently, Transformer-based models [9, 10, 11] have been shown to be effective tools for the automatic detection of voice disorders. Their core advantage is the ability to capture long-term dependencies in time-series data, which is crucial for analyzing complex speech patterns. Through the self-attention mechanism, transformers can not only efficiently process large-scale datasets but also extract the complex patterns that determine voice characteristics. However, this area remains under-researched, with several questions still open due to the complexity and diversity of pathological voice features. Indeed, doctors perform different voice assessments to evaluate different voice properties, such as asking the patient to read pre-defined sentences and to emit sustained vowels.

This study addresses this challenge by proposing a multimodal approach to voice disorder detection. We leverage the strengths of the transformer architecture to analyze multimodal pathological speech data. Specifically, dealing with two different types of data, namely sentence readings and sustained vowels, we design a unified model to process them together. Three fusion strategies – early fusion, mid-level fusion, and late fusion – are investigated to effectively integrate cross-modal information.

We empirically demonstrate that mid-level fusion techniques are particularly suited to this task, demonstrating their ability to capture complementary features and improve detection performance. The cross-attention technique, in particular, achieves performance gains of +.03-.06 in accuracy and +.04-.05 in macro F1 compared to single-modality models, highlighting the potential of multimodal integration in enhancing detection performance. These findings highlight the feasibility of multimodal transformer-based models in clinical applications and lay a solid foundation for further advancement of automatic voice disorder detection.

The rest of the paper is organized as follows. In Section 2 we review the relevant research on voice disorder detection and analyze the main challenges of existing methods. Section 3 describes the proposed Transformer-based method in detail, focusing on the different multimodal fusion strategies. In Section 4 we introduce the experimental setup and evaluation methods, while in Section 5 we provide a comprehensive analysis and interpretation of the experimental results. Finally, in Section 6 we summarize the significance of our research, discuss potential limitations, and provide suggestions for future research directions.
2. Related work

This section reviews the relevant research on voice disorder detection and provides a theoretical basis for the tools and methods used in the subsequent sections. The discussion focuses on the evolution of voice feature analysis techniques, the application of classifiers in detecting voice disorders, and the latest progress in data augmentation and fusion models.

2.1. Automatic Voice Disorder Detection Methods

Traditional voice disorder detection methods rely on manual feature engineering, that is, extracting acoustic features such as Mel-frequency cepstral coefficients (MFCC), pitch jitter, and amplitude shimmer from speech signals [12, 13]. These features, rooted in digital signal processing and speech science, have long been the cornerstone of voice analysis. Based on these manual features, researchers have relied on shallow learning models such as support vector machines (SVMs) and multi-layer perceptrons (MLPs), which perform well in voice disorder detection problems in relatively simple or well-controlled environments [14, 15, 16]. However, the complexity of pathological voice features and the diversity of real-world scenarios have revealed the limitations of these traditional methods, particularly in terms of adaptability and generalization [17].

The advent of deep learning has transformed voice disorder detection, as it can automatically extract features from raw speech signals. Unlike traditional methods that rely on handcrafted features, deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can learn more abstract and comprehensive feature representations directly from data [12, 16, 18, 19, 20]. CNNs excel at capturing local patterns, while RNNs excel at modeling temporally related patterns, making them well suited for voice pathology analysis, particularly when employed together.

Recently, transformer-based architectures have made breakthroughs in automatic speech recognition and related tasks [21, 22, 23, 24, 25, 26]. These models use self-attention mechanisms to capture short- and long-range dependencies at the same time, thus performing well on complex speech patterns [8, 11, 27]. Among them, Wav2Vec2 [27] combines a convolutional encoder for extracting latent speech representations, a transformer-based context network for capturing long-distance dependencies, and a quantization module for self-supervised learning, further simplifying the feature extraction process. This end-to-end architecture enables the efficient and accurate analysis of voices under various conditions.

2.2. Multimodal fusion

In voice analysis, multi-modality refers to input data extracted from different data sources or forms of information [28]. For example, people chatting, singing, reading, or performing particular sound patterns are all typical modalities. The information provided by each modality may be different and complementary, and a single modality often cannot fully capture pathological features. Therefore, by fusing data from different modalities, we can obtain a more comprehensive picture of the pathological condition, thus improving the accuracy and robustness of detection.

In multimodal fusion, there are three main strategies: early fusion, mid-level fusion, and late fusion [28, 29]. Early fusion combines features from different modalities into a single vector before feeding them into a model. Mid-level fusion integrates data at an intermediate stage, allowing for more flexibility in capturing deeper correlations while maintaining some distinction between modalities. Late fusion trains separate models for each modality and combines their predictions via an aggregation function such as average voting, weighted voting, or a meta-classifier.
2.3. Shallow approach to Multi-modal Fusion for Voice Disorder detection

The research by Koudounas et al. [9] proposed an end-to-end method based on a transformer, which directly processes the original audio signal. To address the challenges posed by different recording types (such as sentence reading and sustained vowel utterances), they used a shallow mixture of experts (MoE) [30] framework to optimize the prediction alignment across recording types. Experimental results show that the method improves over single-modality approaches in speech pathology detection and classification tasks, and achieves good performance on public and private datasets. However, that study mainly focuses on synthetic data and the MoE framework, and lacks an in-depth exploration of multimodal fusion strategies.

Building on this, our study introduces a systematic investigation of multimodal fusion strategies in voice pathology detection. We focus on early, mid, and late fusion methods, especially mid and late fusion, because these two methods have greater flexibility and can capture deeper correlations between modalities. Compared with the method of [9], our study explores fusion strategies in more detail and demonstrates that mid-fusion strategies are the best multimodal approach in this domain to improve model generalization.

3. Method

This section outlines our contributions to multimodal fusion strategies, emphasizing the mathematical formulation of the problem and the model architecture. Specifically, we introduce early, mid-level, and late fusion strategies in a transformer architecture that integrates multiple modalities for robust prediction.

In this study, we use two speech-based modalities to solve the voice pathology detection task, each capturing different voice characteristics. The first modality x_1 represents the raw features extracted from the sentence-reading recording, while x_2 represents the features extracted from the second modality, the sustained-vowel recording. Given a multimodal architecture f, we input the raw audio waveforms into the Wav2Vec2 model [27], which performs the feature extraction for the different modalities. The model then outputs the probabilities ŷ, which are used to produce the final classification result.
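As a reference for this formulation, the following minimal sketch shows how frame-level representations can be obtained for the two modalities with the Hugging Face transformers implementation of Wav2Vec2. The checkpoint name and the dummy waveforms are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint: a base model pre-trained on LibriSpeech 960h.
CKPT = "facebook/wav2vec2-base-960h"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
backbone = Wav2Vec2Model.from_pretrained(CKPT)

def encode(waveform_16khz):
    """Map a raw 16 kHz waveform to frame-level contextual embeddings."""
    inputs = feature_extractor(waveform_16khz, sampling_rate=16_000,
                               return_tensors="pt", padding=True)
    with torch.no_grad():
        return backbone(**inputs).last_hidden_state  # (batch, n_frames, hidden_size)

# Dummy stand-ins for x_1 (sentence reading) and x_2 (sustained vowel).
x1 = np.random.randn(19 * 16_000).astype(np.float32)
x2 = np.random.randn(18 * 16_000).astype(np.float32)
h1, h2 = encode(x1), encode(x2)   # inputs to the fusion strategies described below
```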
3.1. Early fusion strategies

The early fusion strategy connects the raw features of the two modalities into a unified input representation. In this stage, we first truncate all audio samples to a uniform length to standardize the input and eliminate the bias caused by differences in sample length. Then, we directly concatenate the raw audio features from the two modalities. Specifically, the raw audio features are concatenated in a fixed order: modality x_1 first, then modality x_2. To further distinguish the two, a 1-second silence (s) padding is inserted between them, providing a clear boundary for the model (Eq. 1). After concatenation, the resulting unified features are fed into the pre-trained Wav2Vec2 model for prediction (Eq. 2). Figure 1 visually depicts this process.

Figure 1: Diagram of the early fusion.

x_early = [x_1; s; x_2]   (1)
ŷ = softmax(f(x_early))   (2)

where:
• s: a 1-second silence padding between the two modalities.
• x_early: the concatenated feature vector after early fusion.
• [;]: the concatenation operation.
• ŷ: the predicted output probabilities, produced by a softmax.

This method effectively captures modality-specific patterns from distinct modalities through a simple and direct feature combination.
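A minimal sketch of this waveform-level concatenation (Eqs. 1-2) is shown below, assuming 16 kHz audio stored as NumPy arrays; the sequence-classification head used here is a convenient stand-in and not necessarily the exact classification layer of our implementation.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

SR = 16_000
silence = np.zeros(SR, dtype=np.float32)          # 1-second silence separator s

def early_fuse(x1, x2):
    """Eq. 1: x_early = [x_1; s; x_2] at the raw-waveform level."""
    return np.concatenate([x1, silence, x2])

# Assumed checkpoint and a 2-class head (healthy vs. pathological).
CKPT = "facebook/wav2vec2-base-960h"
fe = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
clf = Wav2Vec2ForSequenceClassification.from_pretrained(CKPT, num_labels=2)

# Dummy waveforms standing in for one CS and one SV recording of the same subject.
x1 = np.random.randn(19 * SR).astype(np.float32)
x2 = np.random.randn(18 * SR).astype(np.float32)

inputs = fe(early_fuse(x1, x2), sampling_rate=SR, return_tensors="pt")
probs = torch.softmax(clf(**inputs).logits, dim=-1)   # Eq. 2: predicted probabilities
```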
3.2. Mid-level fusion strategies

In the mid-level fusion strategy, feature fusion is performed after CNN encoding but before the features are fed into the transformer encoder. This approach combines modality-specific features in a shared representation space, allowing the model to leverage interactions between modalities for more robust predictions. We analyze two different fusion strategies: concatenated embeddings and cross-attention.

3.2.1. Concatenated embeddings

In the concatenated embedding strategy, features are first extracted from each modality using a separate CNN layer and mapped to the same vector space (Eq. 3). We thus decompose the network into the composition of two modules f = g ∘ e, where e is the CNN-based feature extractor and g represents the transformer encoder layers. After feature extraction, the extracted embeddings are normalized to ensure consistency across modalities and then concatenated as in the early fusion approach (Eq. 4), but at a deeper feature level, as shown in Figure 2. Finally, the combined feature vector goes through a dimensionality reduction layer to fit the input size of the subsequent transformer encoder.

Figure 2: Mid-level fusion using concatenated embeddings.

h_1 = e_1(x_1),  h_2 = e_2(x_2)   (3)
x_mid = [h_1; h_2]   (4)
ŷ = softmax(g(x_mid))   (5)

where:
• h_1 and h_2: high-dimensional embeddings extracted from modalities x_1 and x_2 using the CNN extractor, respectively.
• x_mid: the concatenated feature embeddings from both modalities.
• g: the transformer encoder layers.

3.2.2. Cross-Attention

The cross-attention mechanism [11] dynamically captures interactions between modalities by computing attention weights based on the relationship between the Query (Q), Key (K), and Value (V) matrices. This allows the model to focus on important features across modalities.

First, given the input feature matrices h_1 and h_2 of the two modalities, we generate Q, K, and V through linear transformations,

Q = h_1 W_Q,  K = h_2 W_K,  V = h_2 W_V   (6)

Here, W_Q, W_K, and W_V are learnable weight matrices for the query, key, and value, respectively. Next, we calculate the attention matrix A between the Query (Q) and the Key (K) by measuring their similarity and normalizing it with a softmax. The attention weights are then used to perform a weighted sum of the Value V, generating the output features O:

A = softmax(Q K^T / √d_k)   (7)
O = A V   (8)

where:
• A is the attention matrix.
• d_k is the dimension of the key, and √d_k is the normalization factor used for scaling.

As illustrated in Figure 3, cross-attention is computed in both directions to effectively capture interactions between the two modalities:

1. We use h_1 as the query and h_2 as the key and value to compute the attention (Eq. 9).
2. We reverse the roles of the modalities and use h_2 as the query and h_1 as the key and value (Eq. 10).

Finally, the outputs of the cross-attention from both directions, O_1→2 and O_2→1, are concatenated to form a unified representation x_fused, as shown in Eq. 11. This concatenation merges the information from both modalities into a unified feature space. The fused features are then processed through a shared fusion layer before being passed to a transformer encoder for deeper feature extraction and, ultimately, classification.

O_1→2 = A_1→2 V_2 = CrossAttention(h_1, h_2)   (9)
O_2→1 = A_2→1 V_1 = CrossAttention(h_2, h_1)   (10)
x_fused = [O_1→2; O_2→1]   (11)
ŷ = softmax(g(x_fused))   (12)

where the arrows denote the direction of attention.

Figure 3: Mid fusion using cross-attention.
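The bidirectional cross-attention fusion (Eqs. 9-12) can be sketched in PyTorch as follows, using 4 attention heads as in our experimental setup (Section 4.3). The embedding dimension, the choice of concatenating the two outputs along the time axis, and the linear fusion layer are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttentionFusion(nn.Module):
    """Cross-attention in both directions, followed by concatenation and a fusion layer."""
    def __init__(self, dim=768, n_heads=4):
        super().__init__()
        # Each direction owns its learnable W_Q, W_K, W_V (inside MultiheadAttention).
        self.attn_1to2 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn_2to1 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.fusion = nn.Linear(dim, dim)   # shared fusion layer before the encoder g

    def forward(self, h1, h2):
        # h1, h2: (batch, frames, dim) embeddings of the two modalities.
        o_1to2, _ = self.attn_1to2(query=h1, key=h2, value=h2)   # Eq. 9
        o_2to1, _ = self.attn_2to1(query=h2, key=h1, value=h1)   # Eq. 10
        x_fused = torch.cat([o_1to2, o_2to1], dim=1)             # Eq. 11 (along time, an assumption)
        return self.fusion(x_fused)                              # passed on to the encoder g

# Example with dummy embeddings for the two modalities.
h1, h2 = torch.randn(2, 120, 768), torch.randn(2, 90, 768)
fused = BidirectionalCrossAttentionFusion()(h1, h2)   # shape (2, 210, 768)
```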
3.3. Late fusion strategies

While mid-level fusion captures fine-grained interactions between modality-specific embeddings, late fusion is performed at the decision level, allowing each modality to be optimized independently before being integrated into a unified prediction. This approach allows each model to focus on its specific modality, although it doubles the size of the final model. Two late fusion techniques are employed in our study.

Simple average. In this approach, the outputs of the two models, ŷ_1 and ŷ_2, are combined by taking their simple average, as illustrated in the top part of Figure 4. This strategy assumes that both models contribute equally to the final prediction. The combined output ŷ_late1 is computed as follows:

ŷ_late1 = (ŷ_1 + ŷ_2) / 2   (13)

where ŷ_1 and ŷ_2 are the probability distributions produced by the two individual models. This fusion method is simple and computationally efficient, as it avoids any extra parameters.

Mixture of Experts. As a second late fusion strategy, we employ a shallow mixture of experts (MoE) to combine the outputs of the two independent models and improve the overall performance of the system. Unlike the simple averaging method, this approach assigns weights to each model's predictions based on how relevant they are to the final output. As shown in Figure 4, we use a simple multi-layer perceptron (MLP) with a single hidden layer to predict the weights that combine the outputs of the two models. The input layer of the MLP is the concatenation of the probabilities of the two modalities (ŷ_1, ŷ_2), and the output layer applies a softmax function to ensure that the model weights sum to 1 (Eq. 14). During inference, the final prediction is computed using these weights to combine the contributions of both models (Eq. 15). This approach improves the system's performance on unseen data while maintaining a simple architecture.

[w_1, w_2] = softmax(MLP([ŷ_1; ŷ_2]))   (14)
ŷ_late2 = w_1 · ŷ_1,test + w_2 · ŷ_2,test   (15)

Here:
• [w_1, w_2]: weights learned from the concatenated outputs ŷ_1 and ŷ_2 on the validation set.
• ŷ_1,test, ŷ_2,test: predicted probabilities from the two models on the test set.

Figure 4: Two late fusion strategies. The upper right part is the simple average method, and the lower right part is the MoE method.
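Both decision-level combinations (Eqs. 13-15) can be sketched as follows; the 10-unit hidden layer mirrors the shallow MoE described in Section 4.3, while the remaining details (activation, training loop) are assumptions.

```python
import torch
import torch.nn as nn

def simple_average(p1, p2):
    """Eq. 13: average the per-class probabilities of the two unimodal models."""
    return 0.5 * (p1 + p2)

class MoEGate(nn.Module):
    """Shallow MoE: an MLP maps the concatenated probabilities to two modality weights."""
    def __init__(self, n_classes=2, hidden=10):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * n_classes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, p1, p2):
        w = torch.softmax(self.mlp(torch.cat([p1, p2], dim=-1)), dim=-1)   # Eq. 14
        return w[..., :1] * p1 + w[..., 1:] * p2                           # Eq. 15

# Example with dummy probability vectors from the CS and SV models.
p1, p2 = torch.tensor([[0.7, 0.3]]), torch.tensor([[0.4, 0.6]])
print(simple_average(p1, p2), MoEGate()(p1, p2))
```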
4. Results

This section provides an overview of the dataset and pre-processing methods used in our experiments, followed by a detailed description of the training setup to ensure reproducibility.

All experiments were conducted in a cloud-based environment equipped with a Tesla P100-PCIE-16GB GPU¹. Details of the software environment can be found in the project repository².

¹ We gratefully acknowledge the computational resources provided by Kaggle (https://www.kaggle.com/) for this research. We also appreciate the early-stage support from HPC@Polito (http://www.hpc.polito.it).
² GitHub repository: github.com/multimodal_pathologies_prediction

4.1. Dataset

IPV. The Italian Pathological Voice (IPV) dataset is a novel and diverse resource designed specifically for voice pathology research, currently unpublished and introduced in [9]. Collected from participants in Italian otolaryngology and voice therapy clinics, the dataset includes both healthy individuals and patients with varying degrees of voice disorders. All recordings were conducted under strict standardization protocols in quiet environments, ensuring high-quality samples with a signal-to-noise ratio exceeding 30 dB and a fixed microphone distance of 30 cm.

The dataset comprises two modalities: sustained phonation of the vowel /a/ (SV) and reading of five phonetically balanced sentences (CS) adapted from the Italian version of CAPE-V [31]. Each sample includes detailed health condition notes and diagnoses from experienced physicians. Table 1 provides a detailed summary of the dataset characteristics, including sample distribution, record length, and modality information.

Table 1: Characteristics of the dataset. Healthy and Pathological report the number of healthy and diseased samples, respectively. CS indicates the number of sentence reading samples, and SV the number of sustained vowel samples. T(s) denotes the average duration of the audio samples in seconds.

        Healthy   Pathological   CS    SV    T(s)
IPV     362       672            517   517   12.95

Audio Preprocessing. To ensure the consistency of audio duration and facilitate comparison, we cropped the samples in the dataset to fixed lengths: CS samples were cropped to 19 seconds, and SV samples to 18 seconds. These lengths are designed to cover approximately 90% of the samples in each modality, ensuring that most voice information is preserved for effective model training while reducing the impact of outlier samples that are too long. Samples shorter than the fixed lengths were zero-padded to the required duration.

The audio data was then standardized using the predefined processor provided by the Wav2Vec2 framework. The processor first resamples the audio to 16 kHz to ensure compatibility with the framework and reduce computational overhead. The resulting feature representation not only effectively captures the key information in the speech signal, but also provides consistent and efficient input features for the model to support the subsequent training tasks.

To address the imbalance of the pathological voice data (healthy samples are fewer than pathological samples), a stratified sampling method was used when splitting the data to ensure proportional representation of healthy and pathological samples across all splits. We divided the data into training, validation, and test sets in a ratio of 8:1:1 to ensure fair and reproducible evaluations. The test set was first separated using a fixed random seed. Subsequently, the training and validation sets were further split using three different random seeds to create multiple splits. The final results are calculated by averaging the performance metrics over these splits to ensure the robustness and reliability of the evaluation.
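A sketch of the cropping, zero-padding, and Wav2Vec2 standardization described above is given below, assuming librosa is used for loading; the exact I/O code of our pipeline may differ.

```python
import numpy as np
import librosa
from transformers import Wav2Vec2FeatureExtractor

SR = 16_000
MAX_SEC = {"CS": 19, "SV": 18}   # fixed lengths covering ~90% of each modality

def load_fixed_length(path, modality):
    """Load a recording at 16 kHz, crop it to the fixed length, zero-pad if shorter."""
    wav, _ = librosa.load(path, sr=SR)
    n = MAX_SEC[modality] * SR
    wav = wav[:n]
    if len(wav) < n:
        wav = np.pad(wav, (0, n - len(wav)))
    return wav

fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
# inputs = fe(load_fixed_length("sample_cs.wav", "CS"), sampling_rate=SR, return_tensors="pt")
```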
4.2. Baselines

To verify the effectiveness of our proposed method and provide a comparison, we designed a series of traditional baseline models, including a classic multi-layer perceptron (MLP) and a lightweight convolutional neural network (MobileNetV2 [32]) based on transfer learning. These baseline models are trained on traditional audio features to evaluate the performance of different model architectures. In contrast, the unimodal model based on the Wav2Vec2 processor directly processes the audio waveform to extract features, reflecting the advantages of end-to-end methods.

In the feature extraction process of the baseline models, the audio data is uniformly resampled to 16 kHz and truncated to a fixed maximum duration to ensure sample consistency. We extract 40-dimensional MFCC features through librosa, transpose them into a time-step sequence, and uniformly zero-fill the feature sequences. At the same time, a padding mask is generated to distinguish between real data and padding. The following is the specific design of the two baseline models.

MLP. The MLP is designed with two fully connected layers containing 50 hidden units each, using the ReLU activation function to extract high-dimensional features, aggregating the time-dimension information through a global average pooling layer, and finally performing binary classification through a Softmax output layer. The training process uses the Adam optimizer with a learning rate of 0.01, a batch size of 16, and an early stopping strategy to prevent overfitting.

2D-CNN. The audio features are converted to 2D images by repeating the single channel into three RGB channels to fit the input requirements of the pre-trained model. We load the ImageNet [33] pre-trained weights of MobileNetV2 [32], remove the top classification head, and add a global average pooling layer, a 512-unit fully connected layer, and a Softmax classification layer. Dropout is added to the top network to improve generalization, and the pre-trained feature extraction part is fine-tuned. Two fine-tuning strategies are used: full fine-tuning and head-only fine-tuning. In full fine-tuning, all layers of MobileNetV2 are updated during training to maximize performance; in head-only fine-tuning, only the newly added classification head is trained, while the pre-trained feature extraction layers are frozen to retain the common features learned from ImageNet. The training hyperparameters of both strategies are consistent with the MLP model.
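The baseline feature pipeline and the MLP baseline can be sketched as follows: 40-dimensional MFCCs from librosa, transposed to a (time, features) sequence with a padding mask, followed by two 50-unit layers and masked average pooling over time. The maximum number of frames and the PyTorch realization are assumptions for illustration.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def mfcc_sequence(path, sr=16_000, n_mfcc=40, max_frames=600):
    """40-dim MFCCs as a zero-padded (time, features) sequence plus a padding mask."""
    wav, _ = librosa.load(path, sr=sr)
    feats = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc).T[:max_frames]  # (frames, 40)
    mask = np.zeros(max_frames, dtype=bool)
    mask[:len(feats)] = True                          # True = real frames, False = padding
    padded = np.zeros((max_frames, n_mfcc), dtype=np.float32)
    padded[:len(feats)] = feats
    return padded, mask

class MLPBaseline(nn.Module):
    """Two 50-unit ReLU layers, masked average pooling over time, softmax output."""
    def __init__(self, n_mfcc=40, hidden=50, n_classes=2):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(n_mfcc, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x, mask):          # x: (batch, frames, 40), mask: (batch, frames)
        h = self.ff(x) * mask.unsqueeze(-1)
        pooled = h.sum(dim=1) / mask.sum(dim=1, keepdim=True)   # average over real frames
        return torch.softmax(self.head(pooled), dim=-1)
```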
4.3. Training Procedure

Our method is based on a pre-trained Wav2Vec2.0 model (trained on the LibriSpeech 960-hour dataset) and evaluates three fusion strategies on the IPV dataset: early fusion, mid-level fusion, and late fusion.

Early fusion is accomplished by directly concatenating the original CS and SV audio of the same individual and inserting 1 second of silence between them (for a total audio length of 38 seconds) to avoid feature loss. The concatenated audio signals are uniformly processed by the Wav2Vec2.0 processor to ensure consistency in feature extraction. Mid-level fusion is based on two fine-tuned Wav2Vec2.0 models, and global feature modeling is achieved through a shared Transformer encoder (initialized with pre-trained Wav2Vec2 parameters). The first method directly concatenates the extracted CS and SV features, while the second achieves feature interaction through a bidirectional cross-modal attention mechanism. The number of attention heads is set to 4. Late fusion utilizes the fine-tuned CS and SV models and generates the final classification results by combining the probabilities from both modalities, either through simple averaging or through a shallow MoE (an MLP with 10 hidden nodes) that determines the modality weighting based on the probabilities from the training and validation sets.

All experiments were completed within 50 training epochs, using fixed random seeds to ensure the reproducibility of the results. The AdamW optimizer (weight decay = 0.01) was used for all experiments. A linear learning rate scheduler was used, reducing the learning rate linearly over the total number of training steps, with no warm-up steps. Initial learning rates were tuned manually, using 1e-5 for single-modality and concatenated-fusion models and 6e-6 for cross-attention fusion. To address class imbalance, a weighted cross-entropy loss function was applied, with class weights computed from the training dataset's label distribution. The batch size was set to 8, and an early stopping strategy with a patience of 10 epochs was used to terminate training when the validation performance plateaued. More experimental details and hyperparameter configurations can be found in the GitHub repository of the article.
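The optimization setup described above can be sketched as follows (AdamW with weight decay 0.01, linear decay without warm-up, class-weighted cross-entropy). The model and label tensors are placeholders, and the inverse-frequency weighting formula is one common choice assumed here for illustration.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Placeholders standing in for the fusion model and the training labels.
model = torch.nn.Linear(768, 2)
train_labels = torch.randint(0, 2, (827,))

EPOCHS, BATCH_SIZE, LR = 50, 8, 1e-5                 # 6e-6 was used for cross-attention fusion

# Class weights from the training label distribution to counter class imbalance.
counts = torch.bincount(train_labels).float()
weights = counts.sum() / (len(counts) * counts)
criterion = torch.nn.CrossEntropyLoss(weight=weights)

optimizer = AdamW(model.parameters(), lr=LR, weight_decay=0.01)
total_steps = EPOCHS * (len(train_labels) // BATCH_SIZE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=total_steps)

# Per batch: loss = criterion(logits, labels); loss.backward(); optimizer.step();
# scheduler.step(); optimizer.zero_grad() -- with early stopping (patience 10) on validation.
```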
4.4. Evaluation Metrics

To evaluate the performance of the model in the voice disorder detection task, we used two key metrics.

Accuracy. Accuracy measures the proportion of correctly predicted samples over the total number of samples, providing an overall assessment of classification performance:

Accuracy = Number of Correct Predictions / Total Number of Samples   (16)

While accuracy is a useful general metric, it can be less informative on imbalanced datasets.

Macro F1-Score. To better evaluate performance across imbalanced classes, we adopted the macro-averaged F1 score, which calculates the F1 score for each class and then averages them:

F1 = 2 × (Precision · Recall) / (Precision + Recall)   (17)
Macro F1-Score = (1/N) Σ_{i=1}^{N} F1_i   (18)

Here, N is the total number of categories. The macro F1-score gives equal importance to all classes, making it particularly suitable for tasks with class imbalance, such as voice disorder detection.
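Both metrics (Eqs. 16-18) correspond to standard scikit-learn calls; the sketch below uses dummy labels for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Dummy ground truth and predictions (0 = healthy, 1 = pathological).
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)                # Eq. 16
macro_f1 = f1_score(y_true, y_pred, average="macro")     # Eqs. 17-18
print(f"accuracy={accuracy:.3f}, macro-F1={macro_f1:.3f}")
```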
Table 2: Performance comparison of single-modality baselines and dual-modality fusion methods. CS refers to a single modality with sentence reading, SV to a single modality with vowel pronunciation. Fusion methods report a single value per metric, as they use both modalities.

Modality   Method                              Accuracy (CS / SV)        Macro F1 (CS / SV)
Single     MLP                                 .801±.011 / .750±.057     .767±.022 / .686±.053
Single     2D-CNN (Train all layers)           .667±.011 / .673±.000     .400±.004 / .402±.000
Single     2D-CNN (Fine-tune classify head)    .789±.019 / .782±.048     .765±.021 / .723±.063
Single     Wav2Vec2                            .859±.029 / .827±.000     .837±.038 / .793±.000
Multi      Early Fusion                        .859±.011                 .829±.016
Multi      Mid (Concatenated Embeddings)       .878±.011                 .838±.014
Multi      Mid (Cross Attention)               .885±.000                 .843±.005
Multi      Late (Simple Average)               .852±.022                 .824±.027
Multi      Late (MoE)                          .872±.011                 .857±.012

5. Discussion

In this section, we analyze and interpret the experimental results by focusing on two key aspects: comparing the baseline models to assess their effectiveness as references, and evaluating the different fusion models to explore their ability to integrate multimodal information and improve generalization to unseen data. By systematically studying these factors, we aim to highlight the strengths and limitations of the proposed approach and provide insights for future improvements.

Benchmark comparison. Table 2 presents a comparison of the performance of unimodal baseline models for voice disorder detection on the IPV dataset.

As expected, Wav2Vec2 achieved the best results among the four baseline models, with accuracies of .859 and .827 in the CS and SV modes, respectively, and .837 and .793 macro F1. The superior performance of Wav2Vec2 underscores the benefits of self-supervised pre-training on large-scale audio data. This means that the model does not need to be trained from scratch: through pre-training and transfer learning, it provides audio features with good generalization capabilities, even with a small amount of labeled data. Moreover, it benefits from the attention mechanism, which better extracts relevant features from long sequences of data.

The MLP model performs well in CS mode, with an accuracy of .801 and macro F1 of .767, but drops to .750 and .686 in SV mode, highlighting its limitations in capturing complex audio features with limited contextual information. Compared to the MLP, our method improves macro F1 by +.07 in CS mode and +.10-.11 in SV mode, with corresponding accuracy improvements of +.05-.06 and +.07-.08.

For the 2D-CNN, fully fine-tuning all layers leads to poor performance (.667 and .673 accuracy in CS and SV modes, respectively; .400 and .402 macro F1), likely due to the disruption of the pre-trained features. Fine-tuning only the classification head improves the performance to .789 and .782 accuracy in CS and SV modes, and .765 and .723 macro F1, respectively. However, our method performs better than the fine-tuned 2D-CNN, with a +.07-.08 improvement in macro F1 and +.07-.08 and +.04-.05 improvements in accuracy in CS and SV modes.

The above results show that fine-tuning the pre-trained Wav2Vec2 model is an effective solution for small-dataset tasks, and they highlight the necessity of carefully designed optimization methods.

Fusion strategy vs. single-modality performance. The fusion methods show an advantage over the single modalities by effectively combining the complementary information of the CS and SV inputs. In particular, as shown in Table 2, the proposed mid-level fusion pipeline shows significant improvements over the single-modality models. Concatenated Embeddings improves accuracy by +.02 and macro F1 by +.001 over the CS model, and by +.05 and +.04 over the SV model, respectively. Cross Attention performs even better, with accuracy and F1 gains of +.03 and +.006 over the CS model, and +.06 and +.05 over the SV model, respectively. These results highlight the benefits of leveraging complementary information from multiple modalities.

When compared to the other fusion strategies, mid-level fusion consistently outperforms both early and late fusion methods. The cross-attention method achieves the best results with .885 accuracy and .843 macro F1, which is +.02-.03 in accuracy and +.01-.02 in macro F1 compared with early fusion. Similarly, it achieves a +.01-.04 improvement in accuracy and a +.01-.03 improvement in macro F1 compared to late fusion strategies such as the Mixture of Experts (MoE). These results demonstrate the effectiveness of dynamically capturing inter-modality dependencies during feature integration.

Compared with early fusion, which concatenates raw features, the proposed mid-level fusion method can model complex inter-dependencies, leading to robust feature representations. In contrast, late fusion methods, while simpler to implement, operate at the decision level and cannot fully exploit the interactions between modalities.

In summary, our proposed mid-level fusion strategy, especially the cross-attention variant, achieves the best performance among all methods. The results show that it is able to dynamically integrate complementary modality information, leading to significant improvements in accuracy and macro F1 performance.
Baralis, doi:10.21437/Interspeech.2024- 522 . P. Garza, L. Cagliero, S. M. Siniscalchi, Benchmarking [11] A. Vaswani, Attention is all you need, Advances in representations for speech, music, and acoustic events, Neural Information Processing Systems (2017). in: 2024 IEEE International Conference on Acoustics, [12] X. Peng, H. Xu, J. Liu, J. Wang, C. He, Voice disor- Speech, and Signal Processing Workshops (ICASSPW), der classification using convolutional neural network 2024. based on deep transfer learning, Scientific Reports 13 [26] S. wen Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, (2023) 7264. K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. [13] L. W. Lopes, L. B. Simões, J. D. da Silva, D. da Silva Evan- Lin, T.-H. Huang, W.-C. Tseng, K. tik Lee, D.-R. Liu, gelista, A. C. d. N. e Ugulino, P. O. C. Silva, V. J. D. Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, Vieira, Accuracy of acoustic analysis measurements H. yi Lee, Superb: Speech processing universal per- in the evaluation of patients with different laryngeal formance benchmark, in: Interspeech 2021, 2021, pp. diagnoses, Journal of voice 31 (2017) 382–e15. 1194–1198. doi:10.21437/Interspeech.2021- 1775 . [14] M. Alhussein, G. Muhammad, Automatic voice pathol- [27] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec ogy monitoring using parallel deep models for smart 2.0: A framework for self-supervised learning of healthcare, Ieee Access 7 (2019) 46474–46479. speech representations, Advances in neural informa- [15] P. H. Leung, K. T. Chui, K. Lo, P. O. de Pablos, A support tion processing systems 33 (2020) 12449–12460. vector machine–based voice disorders detection using [28] S. R. Stahlschmidt, B. Ulfenborg, J. Synnergren, Mul- human voice signal, in: Artificial Intelligence and Big timodal deep learning for biomedical data fusion: a Data Analytics for Smart Healthcare, Elsevier, 2021, review, Briefings in Bioinformatics 23 (2022) bbab569. pp. 197–208. [29] L. Ilias, D. Askounis, J. Psarras, Detecting dementia [16] X. Peng, H. Xu, J. Liu, J. Wang, C. He, Voice disor- from speech and transcripts using transformers, Com- der classification using convolutional neural network puter Speech & Language 79 (2023) 101485. based on deep transfer learning, Scientific Reports 13 [30] R. Gupta, K. Audhkhasi, S. Narayanan, A mixture of (2023) 7264. experts approach towards intelligibility classification [17] U. K. Lilhore, S. Dalal, N. Faujdar, M. Margala, of pathological speech, in: 2015 IEEE international P. Chakrabarti, T. Chakrabarti, S. Simaiya, P. Kumar, conference on acoustics, speech and signal processing P. Thangaraju, H. Velmurugan, Hybrid cnn-lstm model (ICASSP), IEEE, 2015, pp. 1986–1990. with efficient hyperparameter tuning for prediction of [31] G. B. Kempster, B. R. Gerratt, K. V. Abbott, J. Barkmeier- parkinson’s disease, Scientific Reports 13 (2023) 14605. Kraemer, R. E. Hillman, Consensus auditory- [18] A. S. Almasoud, T. A. E. Eisa, F. N. Al-Wesabi, A. Elsafi, perceptual evaluation of voice: development of a stan- M. Al Duhayyim, I. Yaseen, M. A. Hamza, A. Motwakel, dardized clinical protocol (2009). Parkinson’s detection using rnn-graph-lstm with op- [32] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, timization based on speech signals, Comput. Mater. L. Chen, Inverted residuals and linear bottlenecks: Contin 72 (2022) 872–886. Mobile networks for classification, detection and seg- [19] R. Islam, E. Abdel-Raheem, M. Tarique, Voice pathol- mentation, CoRR abs/1801.04381 (2018). 
References

[1] N. Bhattacharyya, The prevalence of voice problems among adults in the United States, The Laryngoscope 124 (2014) 2359-2362.
[2] N. Roy, R. M. Merrill, S. D. Gray, E. M. Smith, Voice disorders in the general population: prevalence, risk factors, and occupational impact, The Laryngoscope 115 (2005) 1988-1995.
[3] C. L. Payten, G. Chiapello, K. A. Weir, C. J. Madill, Frameworks, terminology and definitions used for the classification of voice disorders: a scoping review, Journal of Voice (2022).
[4] P. Daraei, C. R. Villari, A. D. Rubin, A. T. Hillel, E. R. Hapner, A. M. Klein, M. M. Johns, The role of laryngoscopy in the diagnosis of spasmodic dysphonia, JAMA Otolaryngology-Head & Neck Surgery 140 (2014) 228-232.
[5] G. Ciravegna, A. Koudounas, M. Fantini, T. Cerquitelli, E. Baralis, E. Crosetti, G. Succo, Non-invasive AI-powered diagnostics: The case of voice-disorder detection - vision paper, EDBT/ICDT Workshop 2348 (2024).
[6] M. Fantini, G. Ciravegna, A. Koudounas, T. Cerquitelli, E. Baralis, G. Succo, E. Crosetti, The rapidly evolving scenario of acoustic voice analysis in otolaryngology, Cureus 16 (2024) e73491.
[7] P. Rajpurkar, E. Chen, O. Banerjee, E. J. Topol, AI in health and medicine, Nature Medicine 28 (2022) 31-38.
[8] S. Schneider, A. Baevski, R. Collobert, M. Auli, wav2vec: Unsupervised pre-training for speech recognition, arXiv preprint arXiv:1904.05862 (2019).
[9] A. Koudounas, G. Ciravegna, M. Fantini, E. Crosetti, G. Succo, T. Cerquitelli, E. Baralis, Voice disorder analysis: a transformer-based approach, in: Interspeech 2024, 2024, pp. 3040-3044. doi:10.21437/Interspeech.2024-1122.
[10] M. La Quatra, M. F. Turco, T. Svendsen, G. Salvi, J. R. Orozco-Arroyave, S. M. Siniscalchi, Exploiting foundation models and speech enhancement for Parkinson's disease detection from speech in real-world operative conditions, in: Interspeech 2024, 2024, pp. 1405-1409. doi:10.21437/Interspeech.2024-522.
[11] A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017).
[12] X. Peng, H. Xu, J. Liu, J. Wang, C. He, Voice disorder classification using convolutional neural network based on deep transfer learning, Scientific Reports 13 (2023) 7264.
[13] L. W. Lopes, L. B. Simões, J. D. da Silva, D. da Silva Evangelista, A. C. d. N. e Ugulino, P. O. C. Silva, V. J. D. Vieira, Accuracy of acoustic analysis measurements in the evaluation of patients with different laryngeal diagnoses, Journal of Voice 31 (2017) 382.e15.
[14] M. Alhussein, G. Muhammad, Automatic voice pathology monitoring using parallel deep models for smart healthcare, IEEE Access 7 (2019) 46474-46479.
[15] P. H. Leung, K. T. Chui, K. Lo, P. O. de Pablos, A support vector machine-based voice disorders detection using human voice signal, in: Artificial Intelligence and Big Data Analytics for Smart Healthcare, Elsevier, 2021, pp. 197-208.
[16] X. Peng, H. Xu, J. Liu, J. Wang, C. He, Voice disorder classification using convolutional neural network based on deep transfer learning, Scientific Reports 13 (2023) 7264.
[17] U. K. Lilhore, S. Dalal, N. Faujdar, M. Margala, P. Chakrabarti, T. Chakrabarti, S. Simaiya, P. Kumar, P. Thangaraju, H. Velmurugan, Hybrid CNN-LSTM model with efficient hyperparameter tuning for prediction of Parkinson's disease, Scientific Reports 13 (2023) 14605.
[18] A. S. Almasoud, T. A. E. Eisa, F. N. Al-Wesabi, A. Elsafi, M. Al Duhayyim, I. Yaseen, M. A. Hamza, A. Motwakel, Parkinson's detection using RNN-graph-LSTM with optimization based on speech signals, Comput. Mater. Contin. 72 (2022) 872-886.
[19] R. Islam, E. Abdel-Raheem, M. Tarique, Voice pathology detection using convolutional neural networks with electroglottographic (EGG) and speech signals, Computer Methods and Programs in Biomedicine Update 2 (2022) 100074.
[20] X. Xie, H. Cai, C. Li, Y. Wu, F. Ding, A voice disease detection method based on MFCCs and shallow CNN, Journal of Voice (2023).
[21] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505-1518.
[22] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, HuBERT: Self-supervised speech representation learning by masked prediction of hidden units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451-3460.
[23] A. Koudounas, E. Pastor, G. Attanasio, L. de Alfaro, E. Baralis, Prioritizing data acquisition for end-to-end speech model improvement, in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 7000-7004. doi:10.1109/ICASSP48485.2024.10446326.
[24] A. Koudounas, M. La Quatra, S. M. Siniscalchi, E. Baralis, voc2vec: A foundation model for non-verbal vocalization, in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[25] M. La Quatra, A. Koudounas, L. Vaiani, E. Baralis, P. Garza, L. Cagliero, S. M. Siniscalchi, Benchmarking representations for speech, music, and acoustic events, in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2024.
[26] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, H.-y. Lee, SUPERB: Speech processing universal performance benchmark, in: Interspeech 2021, 2021, pp. 1194-1198. doi:10.21437/Interspeech.2021-1775.
[27] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449-12460.
[28] S. R. Stahlschmidt, B. Ulfenborg, J. Synnergren, Multimodal deep learning for biomedical data fusion: a review, Briefings in Bioinformatics 23 (2022) bbab569.
[29] L. Ilias, D. Askounis, J. Psarras, Detecting dementia from speech and transcripts using transformers, Computer Speech & Language 79 (2023) 101485.
[30] R. Gupta, K. Audhkhasi, S. Narayanan, A mixture of experts approach towards intelligibility classification of pathological speech, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 1986-1990.
[31] G. B. Kempster, B. R. Gerratt, K. V. Abbott, J. Barkmeier-Kraemer, R. E. Hillman, Consensus auditory-perceptual evaluation of voice: development of a standardized clinical protocol (2009).
[32] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, L. Chen, Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation, CoRR abs/1801.04381 (2018). URL: http://arxiv.org/abs/1801.04381. arXiv:1801.04381.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248-255.