<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Privacy-Preserving Unsupervised Speaker Disentanglement Method for Depression Detection from Speech⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vijay Ravi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jinhan Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonathan Flint</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abeer Alwan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Computer Engineering, University of California Los Angeles</institution>
          ,
          <addr-line>Los Angeles, CA 90095</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Psychiatry and Biobehavioral Sciences, University of California Los Angeles</institution>
          ,
          <addr-line>Los Angeles, CA 90095</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2021</volume>
      <fpage>25</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>The proposed method focuses on speaker disentanglement in the context of depression detection from speech signals. Previous approaches require patient/speaker labels, encounter instability due to loss maximization, and introduce unnecessary parameters for adversarial domain prediction. In contrast, the proposed unsupervised approach reduces the cosine similarity between the latent spaces of depression and pre-trained speaker classification models. This method outperforms baseline models, matches or exceeds adversarial methods in performance, and does so without relying on speaker labels or introducing additional model parameters, leading to a reduction in model complexity. The higher the speaker de-identification score (DeID), the better the depression detection system is at masking a patient's identity, thereby enhancing the privacy attributes of depression detection systems. On the DAIC-WOZ dataset with ComparE16 features and an LSTM-only model, our method achieves an F1-Score of 0.776 and a DeID score of 92.87%, outperforming its adversarial counterpart, which achieves an F1-Score of 0.762 and a DeID of 68.37%, respectively. Furthermore, we demonstrate that speaker-disentanglement methods are complementary to text-based approaches, and a score-level fusion with a Word2vec-based depression detection model further enhances the overall performance to an F1-Score of 0.830.</p>
      </abstract>
      <kwd-group>
        <kwd>Speaker disentanglement</kwd>
        <kwd>Depression detection</kwd>
        <kwd>Privacy</kwd>
        <kwd>Healthcare AI</kwd>
        <kwd>DAIC-WOZ</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Depression is anticipated to become the second leading cause of disability globally, revealing significant diagnostic accessibility gaps [1]. Recent advancements in speech-based automatic detection have proven invaluable in tackling the challenges posed by this formidable illness [2]. The evolution of speech-based depression detection encompasses diverse acoustic features [3, 4, 5], sophisticated backend modeling techniques [6, 7, 8], and innovative data augmentation frameworks [9, 10]. While the efficacy of depression detection systems has seen notable improvements, safeguarding patient privacy remains a paramount concern in digital healthcare systems [11], particularly within the realm of mental health, where societal stigma persists as a formidable challenge [12].</p>
      <p>Given the pivotal importance of privacy preservation in speech-based depression detection, numerous studies have attempted to address this issue. Approaches such as federated learning [13] and sine wave speech [14] have been explored to safeguard patient identity; however, these methods often incur a performance degradation in depression detection. More recently, adversarial learning (ADV), introduced in [15, 16], has demonstrated an enhancement in depression detection performance at the cost of a reduction in speaker classification accuracy. In the work by [17], non-uniform adversarial weights (NUSD) were identified as superior to vanilla adversarial methods in the context of raw audio signals. Additionally, in [18], the utilization of reconstruction loss in conjunction with an autoencoder was found effective in achieving speaker disentanglement, consequently leading to improved depression detection performance.</p>
      <p>Despite the notable progress achieved by the aforementioned studies in enhancing depression detection performance while reducing dependency on a patient's identity, there are significant drawbacks. Firstly, the training of these systems still necessitates speaker labels from patient datasets, posing a challenge to the privacy-preserving aspect of depression detection systems. Secondly, many prior methods rely on an adversarial loss maximization training procedure for speaker disentanglement. While effective in achieving good performance, it is acknowledged that loss maximization is inherently unstable due to the absence of upper bounds for the adversarial domain objective function [19]. Thirdly, all the aforementioned methods introduce additional parameters, such as adversarial domain prediction layers or reconstruction decoders, to the model training framework, which are extraneous for the primary task.</p>
      <p>Driven by the widespread adoption of unsupervised learning approaches [20],
this paper introduces a novel speaker disentanglement
method to address the above-mentioned challenges. The
proposed method focuses on reducing the cosine
similarity between the latent spaces of a depression detection
model and a speaker classification model. Operating at
the embedding level, this approach eliminates the need
for speaker labels from the patient dataset. By
reformulating the training process into a loss minimization
framework, we overcome the issues of unboundedness
associated with adversarial methods. Since the speaker
classification models serve as embedding extractors and
undergo neither retraining nor fine-tuning, our method
achieves efficiency by not requiring domain prediction
or reconstruction, resulting in fewer model parameters
compared to previous approaches.</p>
      <p>Extensive experiments are conducted to validate the
efficacy of the proposed method, showcasing its
superiority over baseline models (without speaker
disentanglement) in terms of depression detection. Furthermore, the
method demonstrates performance that is either better
than or comparable to adversarial methods. Evaluation
across multiple input features and backend models
establishes the generalizability of the proposed framework
to diverse architectures. The complementary nature of
speaker disentanglement methods is highlighted through
score-level fusion with text-based models, resulting in
an enhanced overall performance when the models are
combined.</p>
      <p>The remainder of this paper is organized as follows: Section 2 describes the proposed method, Section 3 outlines experimental details, Section 4 presents and discusses the results, and Section 5 discusses future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Method</title>
      <p>In conventional speaker disentanglement methods [21, 22], the loss function for the adversarial domain (speaker prediction) is maximized. Consider the depression prediction loss $L_{dep}$ and the speaker prediction loss for the adversarial method $L_{adv\text{-}spk}$. The total loss for the model training can be written as
$$L_{total\text{-}adv} = L_{dep} - \lambda \cdot L_{adv\text{-}spk}, \quad (1)$$
where $\lambda$ is a hyperparameter controlling the contribution of the adversarial loss to the main loss function, and the negative sign indicates that the speaker prediction loss is maximized, thereby forcing the model to learn more depression-discriminatory features and fewer speaker-discriminatory features. The speaker prediction loss $L_{adv\text{-}spk}$ is usually the Cross-Entropy loss, defined as
$$L_{adv\text{-}spk}(s, \hat{s}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{S} s_{ij} \cdot \log(\hat{s}_{ij}), \quad (2)$$
where $s$ is the ground-truth speaker label and $\hat{s}$ is the predicted speaker probabilities for $N$ samples and $S$ speakers.</p>
      <p>As discussed earlier, this approach has three major issues: 1) this method requires the ground-truth speaker label $s$ to achieve disentanglement, 2) the disentanglement of speaker identity is based on loss maximization ($-\lambda \cdot L_{adv\text{-}spk}$), which does not have an upper bound, resulting in degraded stability during training, and 3) the speaker prediction branch in the model, needed to obtain $\hat{s}$, adds additional model parameters that are not useful for depression detection, making this approach inefficient.</p>
      <p>In [18], along with speaker labels, feature reconstruction is used for speaker disentanglement, which adds even more unnecessary parameters. In contrast, we propose an unsupervised method of speaker disentanglement that does not need any patient-dataset speaker labels and neither involves loss maximization nor adds additional model parameters. The proposed method is depicted in Figure 1.</p>
      <p>Consider a depression classification model ($M_{dep}$) and a speaker classification model ($M_{spk}$). For a given speech input $x \in \mathbb{R}^{N \times F}$ ($N$ is the batch size and $F$ is the number of features), the latent embeddings of these models are:
$$E_{dep} = M_{dep}(x), \quad (3)$$
$$E_{spk} = M_{spk}(x), \quad (4)$$
where $E_{dep}, E_{spk} \in \mathbb{R}^{N \times D}$ and $D$ is the embedding size. Next, we compute the predicted cosine similarity matrix between the two latent space embeddings by computing the cosine similarity between every pair of embeddings:
$$CS_{pred}(i,j) = \frac{E_{dep,i} \cdot E_{spk,j}}{\lVert E_{dep,i} \rVert \cdot \lVert E_{spk,j} \rVert}, \quad (5)$$
where $1 \le i, j \le N$ and $CS_{pred} \in \mathbb{R}^{N \times N}$. The objective of the disentanglement process is to minimize the cosine similarity between the two embedding spaces by enforcing orthogonality between the depression and speaker latent spaces. To achieve this, we specifically set the target similarity to 0, instead of -1. To enhance convergence during implementation, a small noise value, denoted as $\epsilon$, is incorporated [23]:
$$CS_{target}(i,j) = 0 + \epsilon, \quad (6)$$
$$\epsilon \sim \mathcal{N}(0, 10^{-8}). \quad (7)$$
We define the proposed speaker disentanglement loss function $L_{USSD}$ as the mean squared error between the predicted and target similarity matrices,
$$L_{USSD} = MSE(CS_{pred}, CS_{target}), \quad (8)$$
and the total loss as
$$L_{total\text{-}USSD} = L_{dep} + \lambda \cdot L_{USSD}. \quad (9)$$</p>
      <p>Minimizing the loss function described in Eq. 9 compels the model to emphasize learning more discriminatory information related to depression while reducing its focus on speaker-related distinctions. In contrast to ADV (Eq. 1), speaker disentanglement in the proposed method is achieved via loss minimization. It is important to note that embeddings from $M_{spk}$ can be extracted without the necessity of speaker labels, rendering the proposed speaker disentanglement method unsupervised. Moreover, only the parameters of $M_{dep}$ require updating, as the $M_{spk}$ model does not need fine-tuning and can remain a pre-trained model with frozen weights. Lastly, experiments where $\epsilon$ is set to zero, meaning the squared cosine similarity is directly minimized, yielded subpar performance compared to those with a non-zero $\epsilon$. Consequently, results from experiments with $\epsilon = 0$ are not included in this paper.</p>
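      <p>To make Eqs. 3-9 concrete, the following is a minimal PyTorch sketch of the proposed loss: it computes the pairwise cosine-similarity matrix between depression and speaker embeddings, regresses it toward a near-zero noisy target, and adds the result to the depression loss. The tensor names, the noise scale, and the default value of lambda are illustrative assumptions, not the exact implementation.</p>
      <preformat>
# Sketch of the USSD objective (Eqs. 3-9); a simplified illustration, not the paper's exact code.
import torch
import torch.nn.functional as F

def ussd_loss(e_dep: torch.Tensor, e_spk: torch.Tensor, eps_std: float = 1e-8) -> torch.Tensor:
    """Mean-squared error between the pairwise cosine-similarity matrix and a noisy zero target."""
    e_dep = F.normalize(e_dep, dim=-1)                 # (N, D), unit-norm depression embeddings
    e_spk = F.normalize(e_spk, dim=-1)                 # (N, D), unit-norm speaker embeddings
    cs_pred = e_dep @ e_spk.T                          # (N, N) cosine similarities, Eq. 5
    cs_target = torch.randn_like(cs_pred) * eps_std    # 0 + small Gaussian noise, Eqs. 6-7
    return F.mse_loss(cs_pred, cs_target)              # Eq. 8

def total_loss(l_dep: torch.Tensor, e_dep: torch.Tensor, e_spk: torch.Tensor,
               lambda_: float = 0.1) -> torch.Tensor:
    """Eq. 9; lambda_ = 0.1 is a hypothetical default, not the tuned value."""
    return l_dep + lambda_ * ussd_loss(e_dep, e_spk)
      </preformat>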
      <sec id="sec-2-1">
        <title>3.2. Input Features</title>
        <sec id="sec-2-1-1">
          <title>For the audio, four input features are evaluated to show</title>
          <p>that the proposed framework is independent of the
acoustic features used. Mel-Spectrograms, raw-audio signals,
ComparE16 features from the OpenSmile library [29],
and the last hidden state of the Wav2Vec2 [30] model
are used. Mel-Spectrograms are 40 and 80 dimensional,
(6) raw-audio features are 1-dimensional, ComparE16
features are 130-dimensional and Wav2vec2 features are 768
(7) dimensional. For the text, a Word2vec model [31] is used
to extract word-level embeddings from the transcripts of
the patient’s audio. The embeddings are 200 dimensional.
Audio and text feature processing is based on publicly
available code repository [26]. Since there is an
imbal(8) ance in the dataset, similar to [25, 26], random cropping
and segmentation are applied. To negate the bias efects
of randomness, 5 models are trained with diferent
ran(9) dom seeds, and performances are obtained via majority
voting (MV).</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Minimizing the loss function described in Eq. 9 com</title>
          <p>pels the model to emphasize learning more discrimina- 3.3. Models
tory information related to depression while reducing
its focus on speaker-related distinctions. In contrast to Similar to input features, multiple model architectures
ADV (Eq. 1), speaker disentanglement in the proposed are designed for the audio modality to show that the
method is achieved via loss minimization. proposed method generalizes to diferent model
archi</p>
          <p>It is important to note that embeddings from    tectures. Mel-spectrogram features and Raw-Audio
sigcan be extracted without the necessity of speaker labels, nals are used with two model configurations -
CNNrendering the proposed speaker disentanglement method LSTM and ECAPA-TDNN [32, 33]. The other two
feaunsupervised. Moreover, only the parameters of   tures, ComparE16 and Wav2vec2 are used with an
LSTMrequire updating, as the    model does not need fine- only configuration. For the speaker classification model,
tuning and can remain a pre-trained model with frozen two pre-trained models are used - ECAPA-TDNN
(128weights. Lastly, experiments where  is set to zero, mean- dimensional embedding) and the X-Vector model [34]
ing the squared cosine similarity is directly minimized, (256-dimensional embedding) from the hugging face
yielded subpar performance compared to those with a speechbrain library [35]. Note that the number of
panon-zero  . Consequently, results from experiments with rameters reported for each experiment does not include
 = 0 are not included in this paper. of-the-shelf speaker classification models that have not
undergone re-training or fine-tuning. For the text model,
a simple CNN-LSTM framework was used. In the interest
3. Experimental Details of space and since this paper does not propose any new
neural network architecture but rather uses previously
3.1. Dataset: DAIC-WoZ established models, we do not explain the model
archiThe dataset [24], comprises audio-visual interviews con- tecture in detail. However, the model weights and code
ducted in English with 189 participants experiencing psy- repository will be publicly available here1.
chological distress, including male and female speakers.</p>
          <p>For our experiments, 107 speakers were employed for
training, while an additional 35 speakers were designated</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>1Model weights and code repository available at</title>
          <p>https://github.com/vijaysumaravi/USSD-depression</p>
        </sec>
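        <p>To make the frozen speaker branch concrete, the sketch below loads a pre-trained SpeechBrain speaker encoder and uses it purely as an embedding extractor, mirroring the frozen speaker model described in Section 2. The checkpoint identifiers are the publicly released SpeechBrain ones and, together with the dummy batch, are assumptions for illustration rather than the exact checkpoints and preprocessing used here.</p>
        <preformat>
# Sketch: frozen pre-trained speaker encoder used only as an embedding extractor.
import torch
from speechbrain.pretrained import EncoderClassifier

spk_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",   # or "speechbrain/spkrec-xvect-voxceleb"
    run_opts={"device": "cpu"},
)
for p in spk_encoder.parameters():                # no re-training or fine-tuning
    p.requires_grad = False

wavs = torch.randn(4, 16000)                      # dummy batch: 4 one-second 16 kHz clips
with torch.no_grad():
    spk_emb = spk_encoder.encode_batch(wavs).squeeze(1)   # (batch, embedding_dim)
print(spk_emb.shape)
        </preformat>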
      </sec>
      <sec id="sec-2-2">
        <title>3.4. Evaluation Metrics</title>
        <p>3.4.1. Depression Detection</p>
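        <p>A minimal sketch of the diagonal dominance and $DeID$ computation of Eq. 10 follows; it assumes the two voice-similarity matrices have already been populated with PLDA log-likelihood ratios, and the helper names are hypothetical rather than the exact evaluation code.</p>
        <preformat>
# Sketch of diagonal dominance and the DeID score (Eq. 10).
# m_aa, m_ab are NxN voice-similarity matrices of PLDA LLRs (assumed precomputed).
import numpy as np

def diagonal_dominance(m: np.ndarray) -> float:
    """Absolute difference between the mean diagonal and mean off-diagonal elements."""
    n = m.shape[0]
    diag_mean = np.mean(np.diag(m))
    off_mask = ~np.eye(n, dtype=bool)
    off_mean = np.mean(m[off_mask])
    return abs(diag_mean - off_mean)

def deid_score(m_aa: np.ndarray, m_ab: np.ndarray) -> float:
    """DeID = 1 - DD(M_ab) / DD(M_aa), reported as a percentage."""
    return 100.0 * (1.0 - diagonal_dominance(m_ab) / diagonal_dominance(m_aa))
        </preformat>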
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results and Discussion</title>
      <sec id="sec-3-1">
        <title>4.1. Speaker Disentanglement versus</title>
      </sec>
      <sec id="sec-3-2">
        <title>Baseline</title>
        <sec id="sec-3-2-1">
          <title>As is common in the depression detection literature, to</title>
          <p>measure system performance, the F1 scores [36] for the
two classes (Depressed: D and Non-Depressed: ND) F1-D
and F1-ND as well as their macro-average, F1-AVG were
reported.</p>
          <p>Table 1 shows enhanced depression detection
performance (F1-AVG) across all experiments when applying
speaker disentanglement, either in the form of ADV or
USSD. On average, a notable improvement of 8.3% and
3.4.2. Privacy Preservation 8.2% over the baseline was observed for ADV and USSD,
respectively, for the six experiments. The highest
imTo assess the privacy-preserving capabilities of the mod- provement with ADV, 13.8%, occurred when utilizing
els, we employ the De-Identification score ( [37]), Raw-Audio features with the ECAPA-TDNN model, while
a metric inspired by the voice privacy literature[38]. The the lowest improvement, 5.3%, was observed with
Mel score calculation begins with a voice similarity Spectrograms features and the ECAPA-TDNN model. In
matrix denoted as  , computed for a set of N speak- the case of USSD, the highest improvement was 11.7%
ers. This matrix is derived from the log-likelihood ratio with ComparE16 features and the LSTM-only model, and
(LLR) of two segments—one from model A and the other the lowest improvement was 3.8% with Mel-Spectrogram
from model B—considered to be from the same speaker. features and the CNN-LSTM model. This highlights the
The LLR computation uses a Probabilistic Linear Discrim- advantage of USSD over ADV in scenarios where speaker
inant Analysis (PLDA) model [39]. labels for the training set are unavailable.</p>
          <p>Subsequently, voice similarity matrices,  and ,
are calculated.  utilizes embeddings solely from 4.2. USSD versus ADV
the baseline model (), while  incorporates
embeddings from both the baseline model () and the speaker- Comparing USSD to its adversarial counterpart, ADV,
disentangled model ().. The next step involves calcu- we observe that the proposed method outperforms ADV
lating the diagonal dominance ( ) for both  in 2 out of 6 experiments: Raw-Audio with CNN-LSTM
and . This measure is determined as the absolute (0.746 for USSD vs. 0.709 for ADV) and ComparE16 with
diference between the average diagonal and of-diagonal LSTM-only (0.776 for USSD vs. 0.762 for ADV).
Conelements in the matrices. The diagonal dominance value versely, ADV exhibits better performance than USSD in
serves as an indicator of how identifiable individual 3 out of 6 experiments, with both methods yielding the
speakers are within a given embedding space, ranging same results in 1 out of 6 experiments. In the aggregate,
from 0 to 1. ADV achieves the best overall results with an F1-Score</p>
          <p>When () equals 1, speakers are completely of 0.79, whereas the corresponding USSD model achieves
identifiability in the original embedding space, whereas 0.773—a slight decrease of 2.15%, despite using 15k fewer
if () equals 0, speakers are unidentifiable after parameters and not relying on speaker labels. Even
withdisentanglement. To measure how good the anonymiza- out utilizing speaker labels or additional parameters for
tion (disentanglement) process is, the  score is predicting speakers, USSD showcases comparable or
suformulated as - perior performance to ADV. This highlights the
advantage of USSD over ADV in scenarios where speaker labels
() for the training set are unavailable .</p>
          <p>(10)
DeID = 1 −</p>
          <p>()
 is expressed as a percentage, where 0% signi- 4.3. Privacy Preservation -  
ifes poor anonymization, and 100% denotes fully
successful anonymization. As  relies on voice
similarity matrices constructed from embeddings pre and
post-disentanglement, it is exclusively reported for the
experiments involving speaker disentanglement.</p>
          <p>Privacy is a crucial aspect of speech-based depression
detection, and Table 1 demonstrates positive  results
for both USSD and ADV across all models. Notably,
ComparE16 features with USSD achieve the highest 
at 92.87%. Despite a marginal depression detection
performance drop in USSD compared to ADV, USSD excels
in privacy preservation. An intriguing finding is that
USSD’s efectiveness is independent of the type or
dimension of speaker embeddings used. Mel-spectrogram
and Raw-Audio experiments employed ECAPA-TDNN
speaker embeddings, while ComparE16 and Wav2Vec2</p>
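        <p>Table 2, below, reports score-level fusion of the best audio-only ADV and USSD systems with the Word2vec text-based model. A minimal sketch of such late fusion is given here; it assumes each model outputs per-class probabilities and that a simple weighted average is used, with the fusion rule and weight being illustrative assumptions rather than the exact fusion used for Table 2.</p>
        <preformat>
# Sketch of score-level (late) fusion of audio and text depression classifiers.
# p_audio, p_text are class-probability arrays of shape (num_samples, 2); the 0.5 weight is an assumption.
import numpy as np

def fuse_scores(p_audio: np.ndarray, p_text: np.ndarray, w_audio: float = 0.5) -> np.ndarray:
    """Weighted average of per-class probabilities, then argmax for the fused decision."""
    p_fused = w_audio * p_audio + (1.0 - w_audio) * p_text
    return p_fused.argmax(axis=1)          # 0 = non-depressed, 1 = depressed

# Example with dummy scores for two utterances.
p_audio = np.array([[0.3, 0.7], [0.8, 0.2]])
p_text = np.array([[0.4, 0.6], [0.6, 0.4]])
print(fuse_scores(p_audio, p_text))        # -> [1 0]
        </preformat>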
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Score-level fusion (F1-AVG) of the best audio-only ADV and USSD systems with the Word2vec text-only model, together with the DeID score of the audio systems.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Audio-Model</th><th>Disent.</th><th>Audio-only</th><th>Word2vec (Text-only)</th><th>Fusion</th><th>DeID</th></tr>
            </thead>
            <tbody>
              <tr><td>Raw-Audio, ECAPA-TDNN</td><td>ADV</td><td>0.790</td><td>0.762</td><td>0.860</td><td>22.32%</td></tr>
              <tr><td>ComparE16, LSTM-only</td><td>USSD</td><td>0.776</td><td>0.762</td><td>0.830</td><td>92.87%</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>References</title>
      <p>[23] … dcnn, IET Signal Processing 16 (2022) 62–79.</p>
      <p>[24] M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, M. Pantic, AVEC 2016: Depression, mood, and emotion recognition workshop and challenge, in: Proc. 6th AVEC, 2016, pp. 3–10.</p>
      <p>[25] X. Ma, H. Yang, Q. Chen, D. Huang, Y. Wang, Depaudionet: An efficient deep model for audio based depression classification, in: Proc. 6th Audio Visual Emotion Challenge, 2016, pp. 35–42.</p>
      <p>[26] A. Bailey, M. D. Plumbley, Gender bias in depression detection using audio features, in: 29th EUSIPCO, IEEE, 2021, pp. 596–600.</p>
      <p>[27] K. Feng, T. Chaspari, Toward knowledge-driven speech-based models of depression: Leveraging spectrotemporal variations in speech vowels, in: IEEE-EMBS ICBHI, IEEE, 2022, pp. 01–07.</p>
      <p>[28] W. Wu, C. Zhang, P. C. Woodland, Self-supervised representations in speech-based depression detection, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi:10.1109/ICASSP49357.2023.10094910.</p>
      <p>[29] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the munich versatile and fast open-source audio feature extractor, in: Proc. 18th ACM-MM, 2010, pp. 1459–1462.</p>
      <p>[30] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, NIPS 33 (2020) 12449–12460.</p>
      <p>[31] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.</p>
      <p>[32] B. Desplanques, J. Thienpondt, K. Demuynck, ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification, in: Proc. Interspeech, 2020, pp. 3830–3834. doi:10.21437/Interspeech.2020-2650.</p>
      <p>[33] D. Wang, Y. Ding, Q. Zhao, P. Yang, S. Tan, Y. Li, ECAPA-TDNN Based Depression Detection from Clinical Speech, in: Proc. Interspeech, 2022, pp. 3333–3337. doi:10.21437/Interspeech.2022-10051.</p>
      <p>[34] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-vectors: Robust dnn embeddings for speaker recognition, in: ICASSP, IEEE, 2018, pp. 5329–5333.</p>
      <p>[35] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, Y. Bengio, SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624.</p>
      <p>[36] N. Chinchor, MUC-4 evaluation metrics, in: Proc. of the Fourth Message Understanding Conference, 1992, pp. 22–29.</p>
      <p>[37] P.-G. Noé, J.-F. Bonastre, D. Matrouf, N. Tomashenko, A. Nautsch, N. Evans, Speech Pseudonymisation Assessment Using Voice Similarity Matrices, in: Proc. Interspeech 2020, 2020, pp. 1718–1722. doi:10.21437/Interspeech.2020-2720.</p>
      <p>[38] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O'Brien, et al., The voiceprivacy 2020 challenge: Results and findings, Computer Speech &amp; Language 74 (2022) 101362.</p>
      <p>[39] P. Kenny, Bayesian speaker verification with heavy tailed priors, Proc. Odyssey 2010 (2010).</p>
      <p>[40] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
      <p>[41] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., Wavlm: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518.</p>
      <p>[42] K. Qian, Y. Zhang, H. Gao, J. Ni, C.-I. Lai, D. Cox, M. Hasegawa-Johnson, S. Chang, Contentvec: An improved self-supervised speech representation by disentangling speakers, in: ICML, PMLR, 2022, pp. 18003–18017.</p>
      <p>[43] J. Wang, V. Ravi, J. Flint, A. Alwan, Unsupervised Instance Discriminative Learning for Depression Detection from Speech Signals, in: Proc. Interspeech, 2022, pp. 2018–2022. doi:10.21437/Interspeech.2022-10814.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>