<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Privacy-Preserving Unsupervised Speaker Disentanglement Method for Depression Detection from Speech⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vijay Ravi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jinhan Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonathan Flint</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abeer Alwan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Computer Engineering, University of California Los Angeles</institution>
          ,
          <addr-line>Los Angeles, CA 90095</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Psychiatry and Biobehavioral Sciences, University of California Los Angeles</institution>
          ,
          <addr-line>Los Angeles, CA 90095</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2021</volume>
      <fpage>25</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>The proposed method focuses on speaker disentanglement in the context of depression detection from speech signals. Previous approaches require patient/speaker labels, encounter instability due to loss maximization, and introduce unnecessary parameters for adversarial domain prediction. In contrast, the proposed unsupervised approach reduces the cosine similarity between the latent spaces of depression and pre-trained speaker classification models. This method outperforms baseline models, matches or exceeds adversarial methods in performance, and does so without relying on speaker labels or introducing additional model parameters, leading to a reduction in model complexity. The higher the speaker de-identification score (DeID), the better the depression detection system is at masking a patient's identity, thereby enhancing the privacy attributes of depression detection systems. On the DAIC-WOZ dataset with ComparE16 features and an LSTM-only model, our method achieves an F1-Score of 0.776 and a DeID score of 92.87%, outperforming its adversarial counterpart, which achieves an F1-Score of 0.762 and a DeID of 68.37%, respectively. Furthermore, we demonstrate that speaker-disentanglement methods are complementary to text-based approaches, and a score-level fusion with a Word2vec-based depression detection model further enhances the overall performance to an F1-Score of 0.830.</p>
      </abstract>
      <kwd-group>
        <kwd>Speaker disentanglement</kwd>
        <kwd>Depression detection</kwd>
        <kwd>Privacy</kwd>
        <kwd>Healthcare AI</kwd>
        <kwd>DAIC-WOZ</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Depression is anticipated to become the second leading cause of disability globally, revealing significant diagnostic accessibility gaps [1]. Recent advancements in speech-based automatic detection have proven invaluable in tackling the challenges posed by this formidable illness [2]. The evolution of speech-based depression detection encompasses diverse acoustic features [3, 4, 5], sophisticated backend modeling techniques [6, 7, 8], and innovative data augmentation frameworks [9, 10]. While the efficacy of depression detection systems has seen notable improvements, safeguarding patient privacy remains a paramount concern in digital healthcare systems [11], particularly within the realm of mental health, where societal stigma persists as a formidable challenge [12].</p>
      <p>Given the pivotal importance of privacy preservation in speech-based depression detection, numerous studies have attempted to address this issue. Approaches such as federated learning [13] and sine wave speech [14] have been explored to safeguard patient identity; however, these methods often incur a performance degradation in depression detection. More recently, adversarial learning (ADV), introduced in [15, 16], has demonstrated an enhancement in depression detection performance at the cost of a reduction in speaker classification accuracy. In the work by [17], non-uniform adversarial weights (NUSD) were identified as superior to vanilla adversarial methods in the context of raw audio signals. Additionally, in [18], the utilization of reconstruction loss in conjunction with an autoencoder was found effective in achieving speaker disentanglement, consequently leading to improved depression detection performance.</p>
      <p>Despite the notable progress achieved by the aforementioned studies in enhancing depression detection performance while reducing dependency on a patient's identity, there are significant drawbacks. Firstly, the training of these systems still necessitates speaker labels from patient datasets, posing a challenge to the privacy-preserving aspect of depression detection systems. Secondly, many prior methods rely on an adversarial loss maximization training procedure for speaker disentanglement. While effective in achieving good performance, it is acknowledged that loss maximization is inherently unstable due to the absence of upper bounds for the adversarial domain objective function [19]. Thirdly, all the aforementioned methods introduce additional parameters, such as adversarial domain prediction layers or reconstruction decoders, to the model training framework, which are extraneous for the primary task.</p>
      <p>Driven by the widespread adoption of unsupervised learning approaches [20],
this paper introduces a novel speaker disentanglement
method to address the above-mentioned challenges. The
proposed method focuses on reducing the cosine
similarity between the latent spaces of a depression detection
model and a speaker classification model. Operating at
the embedding level, this approach eliminates the need
for speaker labels from the patient dataset. By
reformulating the training process into a loss minimization
framework, we overcome the issues of unboundedness
associated with adversarial methods. Since the speaker
classification models serve as embedding extractors and
undergo neither retraining nor fine-tuning, our method
achieves efficiency by not requiring domain prediction
or reconstruction, resulting in fewer model parameters
compared to previous approaches.</p>
      <p>Extensive experiments are conducted to validate the
efficacy of the proposed method, showcasing its
superiority over baseline models (without speaker
disentanglement) in terms of depression detection. Furthermore, the
method demonstrates performance that is either better
than or comparable to adversarial methods. Evaluation
across multiple input features and backend models
establishes the generalizability of the proposed framework
to diverse architectures. The complementary nature of
speaker disentanglement methods is highlighted through
score-level fusion with text-based models, resulting in
an enhanced overall performance when the models are
combined.</p>
      <p>The remainder of this paper is organized as follows: Section 2 describes the proposed method, Section 3 outlines experimental details, Section 4 presents and discusses the results, and Section 5 discusses future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Method</title>
      <p>In conventional speaker disentanglement methods [21, 22], the loss function for the adversarial domain (speaker prediction) is maximized. Consider the depression prediction loss $L_{dep}$ and the speaker prediction loss for the adversarial method $L_{adv\text{-}spk}$. The total loss for the model training can be written as
$$L_{total\text{-}adv} = L_{dep} - \lambda \cdot L_{adv\text{-}spk}, \quad (1)$$
where $\lambda$ is a hyperparameter controlling the contribution of the adversarial loss to the main loss function, and the negative sign indicates that the speaker prediction loss is maximized, thereby forcing the model to learn more depression-discriminatory features and fewer speaker-discriminatory features. The speaker prediction loss $L_{adv\text{-}spk}$ is usually the Cross-Entropy loss, defined as
$$L_{adv\text{-}spk}(s, \hat{s}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{S} s_{ij} \cdot \log(\hat{s}_{ij}), \quad (2)$$
where $s$ is the ground-truth speaker label and $\hat{s}$ is the predicted speaker probabilities for $N$ samples and $S$ speakers.</p>
      <p>As discussed earlier, this approach has three major issues: 1) this method requires the ground-truth speaker label $s$ to achieve disentanglement, 2) the disentanglement of speaker identity is based on loss maximization ($-\lambda \cdot L_{adv\text{-}spk}$), which does not have an upper bound, resulting in degraded stability during training, and 3) the speaker prediction branch in the model, needed to obtain $\hat{s}$, adds additional model parameters that are not useful for depression detection, making this approach inefficient.</p>
      <p>In [18], along with speaker labels, feature reconstruction is used for speaker disentanglement, which adds even more unnecessary parameters. In contrast, we propose an unsupervised method of speaker disentanglement that does not need any patient-dataset speaker labels and neither involves loss maximization nor adds additional model parameters. The proposed method is depicted in Figure 1.</p>
      <p>Consider a depression classification model ($M_{dep}$) and a speaker classification model ($M_{spk}$). For a given speech input $x \in \mathbb{R}^{N \times F}$ ($N$ is the batch size and $F$ is the number of features), the latent embeddings of these models are:
$$E_{dep} = M_{dep}(x), \quad (3)$$
$$E_{spk} = M_{spk}(x), \quad (4)$$
where $E_{dep}, E_{spk} \in \mathbb{R}^{N \times D}$ and $D$ is the embedding size. Next, we compute the predicted cosine similarity matrix between the two latent space embeddings by computing the cosine similarity between every pair of embeddings:
$$CS_{pred}(i,j) = \frac{E_{dep,i} \cdot E_{spk,j}}{\lVert E_{dep,i} \rVert \cdot \lVert E_{spk,j} \rVert}, \quad (5)$$
where $1 \le i, j \le N$ and $CS_{pred} \in \mathbb{R}^{N \times N}$. The objective of the disentanglement process is to minimize the cosine similarity between the two embedding spaces by enforcing orthogonality between the depression and speaker latent spaces. To achieve this, we specifically set the target similarity to 0, instead of -1. To enhance convergence during implementation, a small noise value, denoted as $\epsilon$, is incorporated [23]:
$$CS_{target}(i,j) = 0 + \epsilon, \quad (6)$$
$$\epsilon \sim \mathcal{N}(0, 10^{-8}). \quad (7)$$
We define the proposed speaker disentanglement loss function $L_{USSD}$ as the mean squared error between the predicted and target similarity matrices,
$$L_{USSD} = MSE(CS_{pred}, CS_{target}), \quad (8)$$
and the total loss as
$$L_{total\text{-}USSD} = L_{dep} + \lambda \cdot L_{USSD}. \quad (9)$$</p>
      <p>Minimizing the loss function described in Eq. 9 compels the model to emphasize learning more discriminatory information related to depression while reducing its focus on speaker-related distinctions. In contrast to ADV (Eq. 1), speaker disentanglement in the proposed method is achieved via loss minimization. It is important to note that embeddings from $M_{spk}$ can be extracted without the necessity of speaker labels, rendering the proposed speaker disentanglement method unsupervised. Moreover, only the parameters of $M_{dep}$ require updating, as the $M_{spk}$ model does not need fine-tuning and can remain a pre-trained model with frozen weights. Lastly, experiments where $\epsilon$ is set to zero, meaning the squared cosine similarity is directly minimized, yielded subpar performance compared to those with a non-zero $\epsilon$. Consequently, results from experiments with $\epsilon = 0$ are not included in this paper.</p>
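      <p>To make Eqs. 3-9 concrete, the following is a minimal PyTorch sketch of the proposed loss: it computes the pairwise cosine-similarity matrix between depression and speaker embeddings, regresses it toward a near-zero noisy target, and adds the result to the depression loss. The tensor names, the noise scale, and the default value of lambda are illustrative assumptions, not the exact implementation.</p>
      <preformat>
# Sketch of the USSD objective (Eqs. 3-9); a simplified illustration, not the paper's exact code.
import torch
import torch.nn.functional as F

def ussd_loss(e_dep: torch.Tensor, e_spk: torch.Tensor, eps_std: float = 1e-8) -> torch.Tensor:
    """Mean-squared error between the pairwise cosine-similarity matrix and a noisy zero target."""
    e_dep = F.normalize(e_dep, dim=-1)                 # (N, D), unit-norm depression embeddings
    e_spk = F.normalize(e_spk, dim=-1)                 # (N, D), unit-norm speaker embeddings
    cs_pred = e_dep @ e_spk.T                          # (N, N) cosine similarities, Eq. 5
    cs_target = torch.randn_like(cs_pred) * eps_std    # 0 + small Gaussian noise, Eqs. 6-7
    return F.mse_loss(cs_pred, cs_target)              # Eq. 8

def total_loss(l_dep: torch.Tensor, e_dep: torch.Tensor, e_spk: torch.Tensor,
               lambda_: float = 0.1) -> torch.Tensor:
    """Eq. 9; lambda_ = 0.1 is a hypothetical default, not the tuned value."""
    return l_dep + lambda_ * ussd_loss(e_dep, e_spk)
      </preformat>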
      <sec id="sec-2-1">
        <title>3.2. Input Features</title>
        <sec id="sec-2-1-1">
          <title>For the audio, four input features are evaluated to show</title>
          <p>that the proposed framework is independent of the
acoustic features used. Mel-Spectrograms, raw-audio signals,
ComparE16 features from the OpenSmile library [29],
and the last hidden state of the Wav2Vec2 [30] model
are used. Mel-Spectrograms are 40 and 80 dimensional,
(6) raw-audio features are 1-dimensional, ComparE16
features are 130-dimensional and Wav2vec2 features are 768
(7) dimensional. For the text, a Word2vec model [31] is used
to extract word-level embeddings from the transcripts of
the patient’s audio. The embeddings are 200 dimensional.
Audio and text feature processing is based on publicly
available code repository [26]. Since there is an
imbal(8) ance in the dataset, similar to [25, 26], random cropping
and segmentation are applied. To negate the bias efects
of randomness, 5 models are trained with diferent
ran(9) dom seeds, and performances are obtained via majority
voting (MV).</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Minimizing the loss function described in Eq. 9 com</title>
          <p>pels the model to emphasize learning more discrimina- 3.3. Models
tory information related to depression while reducing
its focus on speaker-related distinctions. In contrast to Similar to input features, multiple model architectures
ADV (Eq. 1), speaker disentanglement in the proposed are designed for the audio modality to show that the
method is achieved via loss minimization. proposed method generalizes to diferent model
archi</p>
          <p>It is important to note that embeddings from    tectures. Mel-spectrogram features and Raw-Audio
sigcan be extracted without the necessity of speaker labels, nals are used with two model configurations -
CNNrendering the proposed speaker disentanglement method LSTM and ECAPA-TDNN [32, 33]. The other two
feaunsupervised. Moreover, only the parameters of   tures, ComparE16 and Wav2vec2 are used with an
LSTMrequire updating, as the    model does not need fine- only configuration. For the speaker classification model,
tuning and can remain a pre-trained model with frozen two pre-trained models are used - ECAPA-TDNN
(128weights. Lastly, experiments where  is set to zero, mean- dimensional embedding) and the X-Vector model [34]
ing the squared cosine similarity is directly minimized, (256-dimensional embedding) from the hugging face
yielded subpar performance compared to those with a speechbrain library [35]. Note that the number of
panon-zero  . Consequently, results from experiments with rameters reported for each experiment does not include
 = 0 are not included in this paper. of-the-shelf speaker classification models that have not
undergone re-training or fine-tuning. For the text model,
a simple CNN-LSTM framework was used. In the interest
3. Experimental Details of space and since this paper does not propose any new
neural network architecture but rather uses previously
3.1. Dataset: DAIC-WoZ established models, we do not explain the model
archiThe dataset [24], comprises audio-visual interviews con- tecture in detail. However, the model weights and code
ducted in English with 189 participants experiencing psy- repository will be publicly available here1.
chological distress, including male and female speakers.</p>
          <p>For our experiments, 107 speakers were employed for
training, while an additional 35 speakers were designated</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>1Model weights and code repository available at</title>
          <p>https://github.com/vijaysumaravi/USSD-depression</p>
        </sec>
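        <p>To make the frozen speaker branch concrete, the sketch below loads a pre-trained SpeechBrain speaker encoder and uses it purely as an embedding extractor, mirroring the frozen speaker model described in Section 2. The checkpoint identifiers are the publicly released SpeechBrain ones and, together with the dummy batch, are assumptions for illustration rather than the exact checkpoints and preprocessing used here.</p>
        <preformat>
# Sketch: frozen pre-trained speaker encoder used only as an embedding extractor.
import torch
from speechbrain.pretrained import EncoderClassifier

spk_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",   # or "speechbrain/spkrec-xvect-voxceleb"
    run_opts={"device": "cpu"},
)
for p in spk_encoder.parameters():                # no re-training or fine-tuning
    p.requires_grad = False

wavs = torch.randn(4, 16000)                      # dummy batch: 4 one-second 16 kHz clips
with torch.no_grad():
    spk_emb = spk_encoder.encode_batch(wavs).squeeze(1)   # (batch, embedding_dim)
print(spk_emb.shape)
        </preformat>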
      </sec>
      <sec id="sec-2-2">
        <title>3.4. Evaluation Metrics</title>
        <p>3.4.1. Depression Detection</p>
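        <p>A minimal sketch of the diagonal dominance and $DeID$ computation of Eq. 10 follows; it assumes the two voice-similarity matrices have already been populated with PLDA log-likelihood ratios, and the helper names are hypothetical rather than the exact evaluation code.</p>
        <preformat>
# Sketch of diagonal dominance and the DeID score (Eq. 10).
# m_aa, m_ab are NxN voice-similarity matrices of PLDA LLRs (assumed precomputed).
import numpy as np

def diagonal_dominance(m: np.ndarray) -> float:
    """Absolute difference between the mean diagonal and mean off-diagonal elements."""
    n = m.shape[0]
    diag_mean = np.mean(np.diag(m))
    off_mask = ~np.eye(n, dtype=bool)
    off_mean = np.mean(m[off_mask])
    return abs(diag_mean - off_mean)

def deid_score(m_aa: np.ndarray, m_ab: np.ndarray) -> float:
    """DeID = 1 - DD(M_ab) / DD(M_aa), reported as a percentage."""
    return 100.0 * (1.0 - diagonal_dominance(m_ab) / diagonal_dominance(m_aa))
        </preformat>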
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results and Discussion</title>
      <sec id="sec-3-1">
        <title>4.1. Speaker Disentanglement versus</title>
      </sec>
      <sec id="sec-3-2">
        <title>Baseline</title>
        <sec id="sec-3-2-1">
          <title>As is common in the depression detection literature, to</title>
          <p>measure system performance, the F1 scores [36] for the
two classes (Depressed: D and Non-Depressed: ND) F1-D
and F1-ND as well as their macro-average, F1-AVG were
reported.</p>
          <p>Table 1 shows enhanced depression detection
performance (F1-AVG) across all experiments when applying
speaker disentanglement, either in the form of ADV or
USSD. On average, a notable improvement of 8.3% and
3.4.2. Privacy Preservation 8.2% over the baseline was observed for ADV and USSD,
respectively, for the six experiments. The highest
imTo assess the privacy-preserving capabilities of the mod- provement with ADV, 13.8%, occurred when utilizing
els, we employ the De-Identification score ( [37]), Raw-Audio features with the ECAPA-TDNN model, while
a metric inspired by the voice privacy literature[38]. The the lowest improvement, 5.3%, was observed with
Mel score calculation begins with a voice similarity Spectrograms features and the ECAPA-TDNN model. In
matrix denoted as  , computed for a set of N speak- the case of USSD, the highest improvement was 11.7%
ers. This matrix is derived from the log-likelihood ratio with ComparE16 features and the LSTM-only model, and
(LLR) of two segments—one from model A and the other the lowest improvement was 3.8% with Mel-Spectrogram
from model B—considered to be from the same speaker. features and the CNN-LSTM model. This highlights the
The LLR computation uses a Probabilistic Linear Discrim- advantage of USSD over ADV in scenarios where speaker
inant Analysis (PLDA) model [39]. labels for the training set are unavailable.</p>
          <p>Subsequently, voice similarity matrices,  and ,
are calculated.  utilizes embeddings solely from 4.2. USSD versus ADV
the baseline model (), while  incorporates
embeddings from both the baseline model () and the speaker- Comparing USSD to its adversarial counterpart, ADV,
disentangled model ().. The next step involves calcu- we observe that the proposed method outperforms ADV
lating the diagonal dominance ( ) for both  in 2 out of 6 experiments: Raw-Audio with CNN-LSTM
and . This measure is determined as the absolute (0.746 for USSD vs. 0.709 for ADV) and ComparE16 with
diference between the average diagonal and of-diagonal LSTM-only (0.776 for USSD vs. 0.762 for ADV).
Conelements in the matrices. The diagonal dominance value versely, ADV exhibits better performance than USSD in
serves as an indicator of how identifiable individual 3 out of 6 experiments, with both methods yielding the
speakers are within a given embedding space, ranging same results in 1 out of 6 experiments. In the aggregate,
from 0 to 1. ADV achieves the best overall results with an F1-Score</p>
          <p>When () equals 1, speakers are completely of 0.79, whereas the corresponding USSD model achieves
identifiability in the original embedding space, whereas 0.773—a slight decrease of 2.15%, despite using 15k fewer
if () equals 0, speakers are unidentifiable after parameters and not relying on speaker labels. Even
withdisentanglement. To measure how good the anonymiza- out utilizing speaker labels or additional parameters for
tion (disentanglement) process is, the  score is predicting speakers, USSD showcases comparable or
suformulated as - perior performance to ADV. This highlights the
advantage of USSD over ADV in scenarios where speaker labels
() for the training set are unavailable .</p>
          <p>(10)
DeID = 1 −</p>
          <p>()
 is expressed as a percentage, where 0% signi- 4.3. Privacy Preservation -  
ifes poor anonymization, and 100% denotes fully
successful anonymization. As  relies on voice
similarity matrices constructed from embeddings pre and
post-disentanglement, it is exclusively reported for the
experiments involving speaker disentanglement.</p>
          <p>Privacy is a crucial aspect of speech-based depression
detection, and Table 1 demonstrates positive  results
for both USSD and ADV across all models. Notably,
ComparE16 features with USSD achieve the highest 
at 92.87%. Despite a marginal depression detection
performance drop in USSD compared to ADV, USSD excels
in privacy preservation. An intriguing finding is that
USSD’s efectiveness is independent of the type or
dimension of speaker embeddings used. Mel-spectrogram
and Raw-Audio experiments employed ECAPA-TDNN
speaker embeddings, while ComparE16 and Wav2Vec2</p>
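        <p>Table 2, below, reports score-level fusion of the best audio-only ADV and USSD systems with the Word2vec text-based model. A minimal sketch of such late fusion is given here; it assumes each model outputs per-class probabilities and that a simple weighted average is used, with the fusion rule and weight being illustrative assumptions rather than the exact fusion used for Table 2.</p>
        <preformat>
# Sketch of score-level (late) fusion of audio and text depression classifiers.
# p_audio, p_text are class-probability arrays of shape (num_samples, 2); the 0.5 weight is an assumption.
import numpy as np

def fuse_scores(p_audio: np.ndarray, p_text: np.ndarray, w_audio: float = 0.5) -> np.ndarray:
    """Weighted average of per-class probabilities, then argmax for the fused decision."""
    p_fused = w_audio * p_audio + (1.0 - w_audio) * p_text
    return p_fused.argmax(axis=1)          # 0 = non-depressed, 1 = depressed

# Example with dummy scores for two utterances.
p_audio = np.array([[0.3, 0.7], [0.8, 0.2]])
p_text = np.array([[0.4, 0.6], [0.6, 0.4]])
print(fuse_scores(p_audio, p_text))        # -> [1 0]
        </preformat>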
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Score-level fusion (F1-AVG) of the best audio-only ADV and USSD systems with the Word2vec text-only model, together with the DeID score of the audio systems.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Audio-Model</th><th>Disent.</th><th>Audio-only</th><th>Word2vec (Text-only)</th><th>Fusion</th><th>DeID</th></tr>
            </thead>
            <tbody>
              <tr><td>Raw-Audio, ECAPA-TDNN</td><td>ADV</td><td>0.790</td><td>0.762</td><td>0.860</td><td>22.32%</td></tr>
              <tr><td>ComparE16, LSTM-only</td><td>USSD</td><td>0.776</td><td>0.762</td><td>0.830</td><td>92.87%</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>References</title>
      <p>[23] … dcnn, IET Signal Processing 16 (2022) 62–79.</p>
      <p>[24] M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, M. Pantic, AVEC 2016: Depression, mood, and emotion recognition workshop and challenge, in: Proc. 6th AVEC, 2016, pp. 3–10.</p>
      <p>[25] X. Ma, H. Yang, Q. Chen, D. Huang, Y. Wang, Depaudionet: An efficient deep model for audio based depression classification, in: Proc. 6th Audio Visual Emotion Challenge, 2016, pp. 35–42.</p>
      <p>[26] A. Bailey, M. D. Plumbley, Gender bias in depression detection using audio features, in: 29th EUSIPCO, IEEE, 2021, pp. 596–600.</p>
      <p>[27] K. Feng, T. Chaspari, Toward knowledge-driven speech-based models of depression: Leveraging spectrotemporal variations in speech vowels, in: IEEE-EMBS ICBHI, IEEE, 2022, pp. 01–07.</p>
      <p>[28] W. Wu, C. Zhang, P. C. Woodland, Self-supervised representations in speech-based depression detection, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi:10.1109/ICASSP49357.2023.10094910.</p>
      <p>[29] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the munich versatile and fast open-source audio feature extractor, in: Proc. 18th ACM-MM, 2010, pp. 1459–1462.</p>
      <p>[30] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, NIPS 33 (2020) 12449–12460.</p>
      <p>[31] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.</p>
      <p>[32] B. Desplanques, J. Thienpondt, K. Demuynck, ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification, in: Proc. Interspeech, 2020, pp. 3830–3834. doi:10.21437/Interspeech.2020-2650.</p>
      <p>[33] D. Wang, Y. Ding, Q. Zhao, P. Yang, S. Tan, Y. Li, ECAPA-TDNN Based Depression Detection from Clinical Speech, in: Proc. Interspeech, 2022, pp. 3333–3337. doi:10.21437/Interspeech.2022-10051.</p>
      <p>[34] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-vectors: Robust dnn embeddings for speaker recognition, in: ICASSP, IEEE, 2018, pp. 5329–5333.</p>
      <p>[35] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, Y. Bengio, SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624.</p>
      <p>[36] N. Chinchor, MUC-4 evaluation metrics, in: Proc. of the Fourth Message Understanding Conference, 1992, pp. 22–29.</p>
      <p>[37] P.-G. Noé, J.-F. Bonastre, D. Matrouf, N. Tomashenko, A. Nautsch, N. Evans, Speech Pseudonymisation Assessment Using Voice Similarity Matrices, in: Proc. Interspeech 2020, 2020, pp. 1718–1722. doi:10.21437/Interspeech.2020-2720.</p>
      <p>[38] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O'Brien, et al., The voiceprivacy 2020 challenge: Results and findings, Computer Speech &amp; Language 74 (2022) 101362.</p>
      <p>[39] P. Kenny, Bayesian speaker verification with heavy tailed priors, Proc. Odyssey 2010 (2010).</p>
      <p>[40] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
      <p>[41] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., Wavlm: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518.</p>
      <p>[42] K. Qian, Y. Zhang, H. Gao, J. Ni, C.-I. Lai, D. Cox, M. Hasegawa-Johnson, S. Chang, Contentvec: An improved self-supervised speech representation by disentangling speakers, in: ICML, PMLR, 2022, pp. 18003–18017.</p>
      <p>[43] J. Wang, V. Ravi, J. Flint, A. Alwan, Unsupervised Instance Discriminative Learning for Depression Detection from Speech Signals, in: Proc. Interspeech, 2022, pp. 2018–2022. doi:10.21437/Interspeech.2022-10814.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>