=Paper=
{{Paper
|id=Vol-3619/AISD_Paper_1
|storemode=property
|title=Depression Diagnosis through Audio Analysis Using Machine Learning Models Ensuring Sustainable Development of Mankind
|pdfUrl=https://ceur-ws.org/Vol-3619/AISD_Paper_1.pdf
|volume=Vol-3619
|authors=Prathika Yadav,Pooja Jain,Tapan Jain
|dblpUrl=https://dblp.org/rec/conf/aisd/YadavJJ23
}}
==Depression Diagnosis through Audio Analysis Using Machine Learning Models Ensuring Sustainable Development of Mankind==
Prathika Yadav, Pooja Jain and Tapan Jain
Indian Institute of Information Technology, Nagpur, India

AISD 2023: First International Workshop on Artificial Intelligence: Empowering Sustainable Development, September 4-5, 2023, co-located with the First International Conference on Artificial Intelligence: Towards Sustainable Intelligence (AI4S-2023), Pune, India
prathikayadav@gmail.com (P. Yadav); poojajain@iiitn.ac.in (P. Jain); tapankumarjain@gmail.com (T. Jain)
https://dblp.org/pid/29/5985.html (P. Jain)

Abstract
Depression is one of the biggest issues in the world today; it affects an individual's quality of life to a considerable extent. In this study we examine the use of machine learning (ML) models and their performance in detecting depression from audio data drawn from a single source, the DAIC-WOZ corpus. We extracted features from the audio/voice recordings of patients and trained several different models. The results show that several models can achieve high accuracy in predicting depression levels. Future research could explore the integration of multiple modalities and deep learning approaches to further improve the accuracy of depression detection. Overall, this study demonstrates that machine learning models have great potential for depression detection using audio data, though further research is required to validate the approach.

Keywords
Depression Detection, Machine Learning, Sustainable Development, Depression Prediction, Model Comparison

1. Introduction
Depression is a prevalent mental health condition that affects millions of people across the globe. It is marked by a persistent feeling of sadness and/or a loss of interest in activities that were once pleasurable, usually lasting for a long period of time [1]. One of the most effective ways to deal with depression is to detect it early in a person's journey, preventing long-term negative outcomes such as chronic disability and suicide [2]. However, the traditional methods of depression diagnosis, primarily self-reporting and clinical assessment, are highly subjective and vulnerable to confounding factors, including social desirability bias and differences in the interpretation of symptoms [3].

Recent advances in machine learning and audio analysis have opened the door to objective and non-invasive techniques for depression detection. Audio data, including speech and voice patterns, has shown promise as a modality for detecting depression. It can support early detection, since even individuals without severe symptoms can exhibit acoustic patterns that are precursors to depression [4]. Research has shown that depression is diagnosed about twice as often in women as in men, and it ranks among the primary contributors to the burden of disease in women [5]. According to an Economic Times article [12], India has 7.5 psychiatrists per million people. Additionally, a survey on mental health conducted across 12 Indian states indicates a significant treatment gap, ranging from 70% to 92%, for various mental disorders in these regions.
2. Related Work
Depression is a widespread mental health disorder that impacts millions of people globally. However, conventional methods of diagnosing depression, such as clinical interviews and self-report questionnaires, may have limited accuracy due to their subjective nature. In recent years, audio data, including speech and voice patterns, has been studied as a promising source of information for detecting depression: distinct patterns in acoustic features such as pitch, intensity, and speech rate have been identified between individuals with depression and healthy controls.

Nicholas Cummins et al. [6] take a step towards treating speech as a key objective marker to aid clinical assessment, reviewing how the characteristics of datasets of patients with depression and suicidal thoughts, including their size, associated clinical scores, and data collection methods, influence the development of prediction and classification systems for these conditions. The review covers spectral features such as power spectral density (PSD) and mel-frequency cepstral coefficients (MFCCs).

Laura Verde et al. [7] explored the use of speech features extracted from audio recordings to predict depression. The study used features such as pitch, intensity, and spectral entropy and achieved an accuracy of about 85%.

In another study, Emna Rejaibi et al. [8] examined the potential of MFCCs extracted from audio recordings for predicting depression. They applied various machine learning algorithms, such as support vector machines (SVMs) and random forests, and attained a 72% accuracy rate; their MFCC-based recurrent neural network (RNN) achieved an overall validation accuracy of 76.27%.

Ah Young Kim et al. [13] built a model combining log-mel spectrograms with a deep convolutional neural network (CNN). Focusing on acoustic features, they achieved an accuracy of 78.14%, showing that deep-learned acoustic characteristics can support automated depression detection.

The study by Hande Kaymaz Keskinpala et al. [14] demonstrated that mel-cepstral coefficients and energy in frequency bands can serve as a means of distinguishing between depressed and suicidal patients. Different numbers of cepstral coefficients were compared using unimodal Gaussian modeling on two kinds of speech samples: interview sessions and reading sessions. The study concluded that controlled reading has the potential to offer better results than interviews.
A study by Sharifa Alghowinem et al. [15] compared the efficacy of various acoustic and prosodic features when used with different classifiers (GMM, SVM, MLP, and HFS). They also investigated hybrid classifiers using GMM output as input and observed that the best combination was GMM used with SVM. The features that performed best in detecting depression were loudness, root mean square energy, and intensity.

3. Data
Data from the DAIC-WOZ depression database is used in our study. It is part of a larger corpus, the Distress Analysis Interview Corpus (DAIC) (Gratch et al., 2014) [9], which contains clinical interviews designed to support the diagnosis of psychological conditions including depression. The interviews were collected to develop a computer agent that interviews people and identifies verbal and non-verbal indicators of mental illness (DeVault et al., 2014) [10]. The collected data includes audio and video recordings along with responses to an extensive questionnaire. The interview was conducted by a virtual interviewer, Ellie, operated alongside a human interviewer in another room.

The dataset includes a CSV file containing binary labels indicating whether the subject is depressed (1) or not depressed (0). The labels are derived from the PHQ-8 score calculated over the overall interview. The Patient Health Questionnaire (PHQ-8) is a widely accepted depression screening tool consisting of 8 questions aligned with the major diagnostic criteria for Major Depressive Disorder as defined in the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) [11]. Each question is scored from 0 to 3, giving a total possible score ranging from 0 to 24. Studies have found the PHQ-8, which relies on self-reported symptoms, to be both reliable and valid for gauging an individual's depression severity, making it a commonly used depression measure in both clinical and research contexts. In our study, we treat the PHQ-8 scores as the gold standard for assessing depression severity within the DAIC-WOZ dataset.

4. Data Exploration
While exploring the data, a correlation matrix was computed to gain insight into the relationship between the different attributes and depression. The findings revealed that gender exhibited a cooler tone in comparison to the other attributes, which had warmer tones; it may therefore be beneficial to explore potential gender differences in the acoustic features of speech related to depression.

Figure 1: Correlation matrix.

A graph was plotted to understand how the data points are distributed with respect to gender and whether or not the individual is depressed.

Figure 2: Gender distribution of depressed and non-depressed individuals.

The data shows that there are more depressed females than depressed males, with 38.7% of females and 22.8% of males being depressed. These findings point to a class imbalance, which can lead to biased model performance, where the model is more likely to predict the majority class.

5. Data Processing
The data includes transcript files that record who is speaking at any given time. The transcripts include timestamps and speaker information, which can be used to extract the subject's audio from the raw interview recording. Using the transcript file, the raw audio is trimmed and the subject's segments are concatenated into a single audio file on which feature extraction is performed. A minimal sketch of this step is shown below.
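The following sketch illustrates the trimming-and-concatenation step, assuming DAIC-WOZ-style transcript files with start_time, stop_time, and speaker columns; the exact file names and column names here are assumptions, not the paper's verified layout.

```python
# Sketch: extract and concatenate the participant's speech segments
# from a raw interview recording using the transcript timestamps.
import pandas as pd
import numpy as np
import librosa
import soundfile as sf

def extract_participant_audio(wav_path, transcript_path, out_path):
    # Load the full interview recording at its native sampling rate.
    audio, sr = librosa.load(wav_path, sr=None)

    # Transcripts are assumed tab-separated; keep only the participant's turns.
    transcript = pd.read_csv(transcript_path, sep="\t")
    turns = transcript[transcript["speaker"] == "Participant"]

    # Trim each participant turn out of the raw audio and concatenate.
    segments = []
    for _, row in turns.iterrows():
        start = int(row["start_time"] * sr)
        stop = int(row["stop_time"] * sr)
        segments.append(audio[start:stop])
    sf.write(out_path, np.concatenate(segments), sr)

# Hypothetical file names for one DAIC-WOZ session.
extract_participant_audio("300_AUDIO.wav", "300_TRANSCRIPT.csv", "300_PARTICIPANT.wav")
```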
Figure 3: Visual representation of the raw audio and the extracted subject's audio.

An imbalanced class distribution can significantly affect the performance of machine learning models. In the context of audio data, a shortage of depressed individuals in the dataset can result in poor performance when trying to identify and classify depressed speech. To address this issue, data augmentation was employed to increase the diversity and quantity of the data. Specifically, time-stretching was used to randomly adjust the speed of the audio recordings, creating new, altered versions of the original data. The process involves loading an audio file, applying the time-stretching effect, and saving the new audio file for feature extraction. By increasing the number of depressed audio samples through augmentation, the accuracy of the machine learning models can be improved; a sketch of this step follows the figure below.

Figure 4: Distribution of depressed and non-depressed patients before and after data augmentation.
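A minimal sketch of the time-stretching augmentation, using librosa's time_stretch effect as described in the text; the rate range is an illustrative assumption, not a value reported by the paper.

```python
# Sketch: create a time-stretched copy of an audio file for augmentation.
import random
import librosa
import soundfile as sf

def augment_with_time_stretch(wav_path, out_path):
    y, sr = librosa.load(wav_path, sr=None)
    # Randomly slow down or speed up the recording (rate > 1 is faster).
    rate = random.uniform(0.8, 1.2)  # assumed range for illustration
    stretched = librosa.effects.time_stretch(y, rate=rate)
    sf.write(out_path, stretched, sr)

# Applied to the minority (depressed) class to balance the dataset.
augment_with_time_stretch("300_PARTICIPANT.wav", "300_PARTICIPANT_aug.wav")
```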
6. Feature Extraction
How features are identified within the data is important for machine learning models that detect conditions such as depression. Feature extraction distills complex raw data into more manageable inputs for model training. For depression detection, analyzing audio for meaningful features provides insight into a person's emotions and speech patterns: certain characteristics of a voice may signal depression, and by identifying these characteristics, machine learning models can recognize patterns that predict a person's mental state. Extracting meaningful information from audio recordings lets us analyze aspects such as speech pace, tone and inflection, the use of certain words and phrases, long pauses and moments of silence, and other vocal qualities that give clues about a person's emotional state. By spotting the relevant details in the audio, feature extraction helps the model "learn" what to listen for; the right features improve its ability to make accurate assessments of a person's mental health. In this study we leveraged the following feature extraction techniques to obtain meaningful information from the audio data:

1. Mel spectrogram: A mel spectrogram represents the power spectrum of a signal in the frequency domain, with the spectral energies mapped onto the mel scale, a non-linear scale that approximates the human auditory response to different frequencies. In the context of depression detection, changes in the frequency distribution of speech have been found to be associated with depression and other mental health conditions, so the mel spectrogram gives insight into the spectral characteristics of the speech signal and helps identify features indicative of depression.

2. Log mel spectrogram: The log mel spectrogram is the logarithm of the mel spectrogram. It compresses the dynamic range of the spectrogram, making it easier to interpret and analyze.

3. Fundamental frequency (F0): Fundamental frequency, perceived as pitch, is the frequency of a periodic waveform; in speech, it corresponds to the vibration frequency of the vocal cords. F0 carries information about the prosodic features of speech, such as intonation, rhythm, and stress, which are known to be associated with emotional states and let us dig deeper for insights into the speaker's emotional state.

4. Spectral contrast: Spectral contrast measures the spectral shape of a signal by capturing the difference in energy between frequency bands. It provides insight into the spectral characteristics of the audio signal, such as the presence of harmonics and formants. Changes in the spectral contrast of speech have been found to be associated with emotional states and can provide insight into a speaker's emotional state.

5. Recurrence matrix: A recurrence matrix is a binary matrix used in audio processing to identify repeating patterns in a sound. It compares each time point in the signal to all other time points, applying a threshold to decide whether two points are similar enough to count as a "recurrence" event. Recurrence matrices are useful in depression detection because they can reveal speech patterns indicative of depression, such as repetitive speech and reduced vocabulary.

In summary, these features describe the spectral and prosodic characteristics of the speech signal, which have been found to be associated with emotional states such as depression; extracting them provides insight into the speaker's emotional state and aids depression detection. A sketch of this extraction stage is given after the figure below.

Figure 5: Visual representation of the extracted features.
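The sketch below computes the features listed above with librosa. The choices to pool each feature into per-file means and standard deviations, the pyin pitch range, and the use of MFCC frames for the recurrence matrix are all illustrative assumptions made to obtain fixed-length vectors for the classical classifiers; the paper does not specify these details.

```python
# Sketch: compute the paper's feature set and pool it into one vector per file.
import numpy as np
import librosa

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)

    # Mel spectrogram and its log-compressed version.
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    log_mel = librosa.power_to_db(mel)

    # Fundamental frequency (F0) via probabilistic YIN; unvoiced frames are NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]

    # Spectral contrast across frequency sub-bands.
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

    # Binary recurrence matrix over MFCC frames to capture repeating patterns.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    rec = librosa.segment.recurrence_matrix(mfcc)

    # Pool everything into one fixed-length vector per recording.
    return np.hstack([
        log_mel.mean(axis=1), log_mel.std(axis=1),
        [f0.mean(), f0.std()] if f0.size else [0.0, 0.0],
        contrast.mean(axis=1), contrast.std(axis=1),
        [rec.mean()],  # recurrence rate as a single summary statistic
    ])
```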
7. Implementation Details
The first step in our approach was to obtain the raw audio data and preprocess it to extract the relevant speech segments using the start and end times and speaker information. Next, we applied data augmentation, specifically time stretching, to increase the size of the dataset and improve the robustness of the models. Feature extraction was then performed on the processed and augmented audio files using a range of techniques, including Mel-frequency cepstral coefficients (MFCCs), fundamental frequency (F0), spectral contrast, and the recurrence matrix. This produced a feature vector for each audio file, which was used to train and test various classifiers.

We conducted training and evaluation on the labeled dataset with five classification models: Support Vector Machine (SVM), Random Forest, Logistic Regression, Decision Tree, and Gradient Boosting, chosen for their relevance and efficacy on this classification task. To evaluate the models we used the F1 score, which considers both precision and recall, providing a balanced measure of a model's accuracy in correctly identifying positive instances while handling class imbalance.

Figure 6: Overall approach for depression detection.

8. Results
This section presents the results of the implemented approach for automatic depression classification. The models were evaluated using the F1 score and were trained and tested on a labeled dataset consisting of male and female participants. The table below lists the F1 score achieved by each classification model.

Table 1
F1 scores of the evaluated models

Model                F1 Score
SVM                  0.71
Random Forest        0.79
Logistic Regression  0.73
Decision Tree        0.82
Gradient Boosting    0.85

Among the models evaluated, the Decision Tree and Gradient Boosting models exhibited the highest F1 scores, at 0.82 and 0.85 respectively. Additionally, a gender-specific analysis was performed to investigate the impact of gender separation on the classification results. The Gradient Boosting model, which achieved the highest overall F1 score, was further evaluated on separate male and female subsets of the dataset, obtaining F1 scores of 0.886 for females and 0.865 for males. The comparison of gender-specific F1 scores highlights the effect of accounting for gender-related differences in depression and its classification: the model scored higher when trained and tested on the female subset than on the male subset. This suggests that gender-specific classification can improve the accuracy of depression detection by capturing the speech patterns and expressions of depression particular to each gender.

Overall, the results demonstrate the effectiveness of the Decision Tree and Gradient Boosting models for depression classification, and the gender-specific analysis shows improved performance when the unique characteristics of each gender are addressed. A sketch of the training and evaluation loop is given below.
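A minimal sketch of the training and F1-based evaluation loop with scikit-learn, where X is the matrix of pooled audio features and y the binary depression labels from the dataset CSV; the split ratio and default hyperparameters are illustrative assumptions, not the paper's reported settings.

```python
# Sketch: train the five classifiers and report their F1 scores.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def evaluate_models(X, y):
    # Stratified split to preserve the depressed/non-depressed ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    models = {
        "SVM": SVC(),
        "Random Forest": RandomForestClassifier(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(),
        "Gradient Boosting": GradientBoostingClassifier(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, f1_score(y_test, model.predict(X_test)))
```

The gender-specific results can be reproduced under the same scheme by calling evaluate_models separately on the male and female subsets of X and y.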
9. Conclusion
In this study, we proposed an approach for the automatic classification of depression states using speech-based features. The methodology involved preprocessing the raw audio data to extract the relevant speech segments. Data augmentation, specifically time stretching, was applied to increase the dataset size and enhance model robustness. Feature extraction combined several techniques, including Mel-frequency cepstral coefficients (MFCCs), fundamental frequency (F0), spectral contrast, and the recurrence matrix. To evaluate classification performance, we trained and tested five classification models: Support Vector Machine (SVM), Random Forest, Logistic Regression, Decision Tree, and Gradient Boosting, using the F1 score, which considers both precision and recall, as a balanced measure of accuracy. Among the models evaluated, the Decision Tree and Gradient Boosting models demonstrated the highest F1 scores, achieving 0.82 and 0.85 respectively, indicating their effectiveness in accurately identifying instances of depression from speech-based features.

Additionally, we explored the impact of gender separation on classification performance. The Gradient Boosting model, which achieved the highest overall F1 score, was further evaluated on separate male and female subsets of the original dataset. The results showed that considering gender-specific characteristics and patterns in depression classification improved detection performance: the model exhibited a higher F1 score when trained and tested on the female subset than on the male subset, highlighting the importance of accounting for gender-related differences.

In conclusion, this study highlights the potential of speech-based features for depression classification. The Decision Tree and Gradient Boosting models showed very promising results, outperforming the other classification models, and incorporating a gender-specific analysis enhanced the overall classification accuracy, emphasizing the value of tailoring model training and evaluation to the genders present in the dataset.

References
[1] Depressive disorder (depression). World Health Organization. Available at: https://www.who.int/news-room/fact-sheets/detail/depression
[2] "Practice guideline for the treatment of patients with major depressive disorder (revision). American Psychiatric Association." Am J Psychiatry. 2000 Apr;157(4 Suppl):1-45. PMID: 10767867.
[3] Latkin CA, Edwards C, Davey-Rothwell MA, Tobin KE. "The relationship between social desirability bias and self-reports of health, substance use, and social network factors among urban substance users in Baltimore, Maryland." Addict Behav. 2017 Oct;73:133-136. doi: 10.1016/j.addbeh.2017.05.005. Epub 2017 May 9. PMID: 28511097; PMCID: PMC5519338.
[4] Albuquerque L, Valente ARS, Teixeira A, Figueiredo D, Sa-Couto P, Oliveira C. "Association between acoustic speech features and non-severe levels of anxiety and depression symptoms across lifespan." PLoS One. 2021 Apr 8;16(4):e0248842. doi: 10.1371/journal.pone.0248842. PMID: 33831018; PMCID: PMC8031302.
[5] Depression: His versus hers (2021). Johns Hopkins Medicine. Available at: https://www.hopkinsmedicine.org/health/conditions-and-diseases/depression-his-versus-hers
[6] Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, Thomas F. Quatieri, "A review of depression and suicide risk assessment using speech analysis," Speech Communication, Volume 71, 2015, Pages 10-49, ISSN 0167-6393.
[7] L. Verde et al., "A Lightweight Machine Learning Approach to Detect Depression from Speech Analysis," 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI), Washington, DC, USA, 2021, pp. 330-335, doi: 10.1109/ICTAI52525.2021.00054.
[8] Emna Rejaibi, Ali Komaty, Fabrice Meriaudeau, Said Agrebi, Alice Othmani, "MFCC-based Recurrent Neural Network for automatic clinical depression recognition and assessment from speech," Biomedical Signal Processing and Control, Volume 71, Part A, 2022, 103107, ISSN 1746-8094.
[9] Gratch, Jonathan; Artstein, Ron; Lucas, Gale; Stratou, Giota; Scherer, Stefan; Nazarian, Angela; Wood, Rachel; Boberg, Jill; DeVault, David; Marsella, Stacy; Traum, David; Rizzo, Albert; Morency, Louis-Philippe (2014). "The Distress Analysis Interview Corpus of human and computer interviews."
[10] DeVault, David; Artstein, Ron; Benn, Grace; Dey, Teresa; Fast, Ed; Gainer, Alesia; Georgila, Kallirroi; Gratch, Jonathan; Hartholt, Arno; Lhommet, Margot; Lucas, Gale; Marsella, Stacy; Morbini, Fabrizio; Nazarian, Angela; Scherer, Stefan; Stratou, Giota; Suri, Apar; Traum, David; Wood, Rachel; Morency, Louis-Philippe (2014). "SimSensei Kiosk: A Virtual Human Interviewer for Healthcare Decision Support," 13th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2014), pp. 1061-1068.
[11] Kurt Kroenke, Tara W. Strine, Robert L. Spitzer, Janet B.W. Williams, Joyce T. Berry, Ali H. Mokdad, "The PHQ-8 as a measure of current depression in the general population," Journal of Affective Disorders, Volume 114, Issues 1-3, 2009, Pages 163-173, ISSN 0165-0327.
[12] India has 0.75 psychiatrists per 100,000 people. Can telepsychiatry bridge the gap between mental health experts & patients? The Economic Times. Available at: https://economictimes.indiatimes.com/magazines/panache/india-has-0-75-psychiatristsper-100000-people-can-telepsychiatry-bridge-the-gap-between-mental-health-expertspatients/articleshow/78572684.cms?from=mdr
[13] Kim A, Jang E, Lee S, Choi K, Park J, Shin H, "Automatic Depression Detection Using Smartphone-Based Text-Dependent Speech Signals: Deep Convolutional Neural Network Approach," J Med Internet Res 2023;25:e34474. URL: https://www.jmir.org/2023/1/e34474. DOI: 10.2196/34474.
[14] Keskinpala, Hande; Yingthawornsuk, Thaweesak; Wilkes, D.M.; Shiavi, Richard; Salomon, Ronald (2007). "Screening for high risk suicidal states using mel-cepstral coefficients and energy in frequency bands," European Signal Processing Conference.
[15] S. Alghowinem et al., "A comparative study of different classifiers for detecting depression from spontaneous speech," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 8022-8026, doi: 10.1109/ICASSP.2013.6639227.
[16] Haihua Jiang, Bin Hu, Zhenyu Liu, Gang Wang, Lan Zhang, Xiaoyu Li, Huanyu Kang, "Detecting Depression Using an Ensemble Logistic Regression Model Based on Multiple Speech Features," Computational and Mathematical Methods in Medicine, vol. 2018, Article ID 6508319, 9 pages, 2018. https://doi.org/10.1155/2018/6508319