<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting anxiety and depression in dialogues: a multi-label and explainable approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francisco de Arriba-Pérez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia García-Méndez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies Group, atlanTTic, University of Vigo</institution>
          ,
          <addr-line>Vigo</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>Anxiety and depression are the most common mental health issues worldwide, affecting a non-negligible part of the population. Accordingly, stakeholders, including governments' health systems, are developing new strategies to promote early detection and prevention from a holistic perspective (i.e., addressing several disorders simultaneously). In this work, an entirely novel system for the multi-label classification of anxiety and depression is proposed. The input data consists of dialogues from user interactions with an assistant chatbot. Another relevant contribution lies in using Large Language Models (llms) for feature extraction, given the complexity and variability of language. The combination of llms, with their high capability for language understanding, and Machine Learning (ml) models, with the contextual knowledge about the classification problem that the labeled data provides, constitutes a promising approach towards mental health assessment. To promote the solution's trustworthiness, reliability, and accountability, explainability descriptions of the model's decisions are provided in a graphical dashboard. Experimental results on a real dataset attain 90 % accuracy, improving on those in the prior literature. The ultimate objective is to contribute, in an accessible and scalable way, before formal treatment occurs in the healthcare systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Anxiety and depression</kwd>
        <kwd>clinical decision-support system</kwd>
        <kwd>eXplainable Artificial Intelligence</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>multi-label classification</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>More than 55 million people in the United States suffer from mental illness as indicated by the National Institutes of Health (nih, 2023)1. More in detail, the most common mental conditions are anxiety (19.1 %) and major depression (8.3 %). At the global level, 4 % of the population is affected by anxiety disorder. At the same time, 280 million people worldwide suffer from depression, as stated by the World Health Organization (who, 2023)2. However, only 25 % of people suffering from anxiety receive treatment. A recent report by Forbes3 completes this information and indicates that 50 % of people affected by depression go undiagnosed in the primary care system.</p>
      <p>
        In this regard, it should be noted that traditional screening methods (i.e., those that rely on subjective and time-consuming interviews composed of binary questions for patients and their families) face several issues [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Among them, the unreliability of self-reported diagnoses due to bias introduced by subjectivity, intentional concealment, and even the inconvenience of the number of questions must be considered, resulting in low rates of diagnosis and intervention [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Another concern is stigma, which prevents treatment seeking and favors ignorance of the condition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Representative examples of these self-reporting methodologies are the Beck Depression Inventory (bdi), the General Health Questionnaire (ghq), the Hamilton Rating Scale for Depression (hrsd), and the Patient Health Questionnaire (phq). Similar to the ghq and the phq, the Depression, Anxiety, and Stress Scale (dass) combines the questionnaires of each factor [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Consideration should also be given to the popular Diagnostic and Statistical Manual of Mental Disorders, fifth edition (dsm-5), published by the American Psychiatric Association.
      </p>
      <p>
3rd AIxIA Workshop on Artificial Intelligence For Healthcare and 5th Data4SmartHealth, co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence, November 25–28, 2024, Bolzano, Italy.
* Corresponding author.
      </p>
      <p>
        Given the severe consequences of anxiety and depression, which even increase the risk of suicide [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], early detection and timely diagnoses are critical. In this regard, language can be a good predictor of mood disorders [5]. More in detail, how users engage in a conversation and express themselves is a strong indicator of their mental health state. Accordingly, the arrival of Large Language Models (llms, e.g., gpt-44, Palm5, and Alpaca6) has contributed significantly to health-related topics thanks to their in-context learning capabilities, mostly in generative tasks. Specifically, the literature has reported promising performance of these models in three relevant scenarios: (i) language comprehension, (ii) text generation, and (iii) knowledge inference [6]. Moreover, the potential of these models to leverage large volumes of online data is of great importance for both diagnosis and treatment [7].
      </p>
      <p>
        Consequently, several pre-trained language models (plms) and llms have been deployed to address health issues like mental disorders. This is the case of llmental [8], Mentalbert [9] and Mentalllama [10]. Besides, Psychbert [11] is fine-tuned to detect language patterns in behavioral health, mental health, psychiatry, and psychology texts. However, as indicated by previous works [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], their performance in specific classification problems with task-specific data like anxiety and depression is still immature when they are used as final solutions (i.e., in zero-shot/few-shot learning or with limited fine-tuning). This is due to the poor detection of nuances and task-specific patterns essential for accurate detection. Similarly, the plms exhibit limited generalization and low multitask robustness [12, 13]. Another key limitation is their low interpretability, which prevents their practical use beyond academic research [10].
      </p>
      <p>Summing up, using plms or directly llms in zero-shot settings is the prominent approach [14]. Regarding the experimental data, most researchers use social media [15]. In recent years, there has even been an increasing interest in detecting mental health states with tracking devices, which results from the growing importance that modern society places on mental well-being [16]. Regardless of the approach, the aspect on which most researchers agree is the necessity of providing interpretable results along with explainable descriptions of the rationale of machine-based solutions, of utmost importance in the healthcare field given their direct impact on the decision-making of clinicians and, thus, the patients' well-being. In this regard, eXplainable Artificial Intelligence (xai) comprises post hoc and self-explanatory techniques. While the former aim to explain the predictions of black-box classification models, like the popular explanatory model-agnostic tools (i.e., those that combine local linear and random models, like lime and shap [17], to approximate feature importance weights with regression and game theory), the self-explanatory approach relies on intrinsically interpretable models that can provide explanations along with the predictions [14]. However, feature importance methods like lime and shap only provide the weight of the selected features, without considering the interactions among the features, and are unintuitive for end users [18]. In this regard, a major regulatory milestone in the ai field materialized with the Artificial Intelligence Act (aia). Its final text pays particular attention to interpretability, the right of end users to receive clear explanations, and the disclosure of the use of ai in human interactions [19].</p>
      <p>Given the safety-critical nature of these conditions, our solution must provide high accuracy and explainability to promote trust among end users and professionals. Accordingly, we combine traditional Machine Learning (ml) models (which can offer higher accuracy but lack explainability) operating in a multi-label setting with llms (which are intrinsically explainable but lack specific downstream knowledge). Note that in our approach, llms are leveraged to extract users' expert features related to anxiety and depression by detecting linguistic patterns and language usage, taking advantage of their understanding capabilities. Specifically, by relying on llms solely as part of the feature engineering module to extract user-level knowledge, we tackle the hallucination problem, that is, those predictions that, even if they seem correct, present underlying misconceptions due to the absence of a comprehensive understanding of the problem and expert data. Furthermore, accurate diagnosis requires formal clinical knowledge [5]. Consequently, in this work, we understand the necessity of leveraging formal medical knowledge in machine-based solutions (i.e., by incorporating transparent assessment based on official methods, scales, and standards). Hence, we used formal clinical scales for anxiety and depression to label the experimental data. We also acknowledge the limitations of using social media data. Thus, we exploit free dialogues with a conversational assistant, since, compared to free dialogues, clinical questionnaires limit the users' ability to express their state freely. Our ultimate objective is to perform an on-demand and scalable assessment of anxiety and depression before formal clinical screening in the healthcare systems. Note that intentional concealment is reduced in our study since the tests are embedded in the dialogues, and the questions are adapted accordingly.</p>
      <p>The remainder of this manuscript is organized as follows. Section 2 summarizes the key prior works on anxiety and depression detection using plms and llms, paying particular attention to multi-label approaches and those that provide explainability. Section 3 details our system architecture, while Section 4 shows the results obtained with our methodology and compares them with other works in the state of the art. Finally, Section 5 draws the main conclusions of this work and proposes future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
        Traditional ml, deep learning, and Natural Language Processing (nlp) techniques have been used in the literature for mental health assessment [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The most recent works involve word embeddings with transformer-based models (also known as plms) to take advantage of contextual data [20, 21]. However, scant research is available in the state of the art on the models underlying llms like Chatgpt; most works use them as final classification solutions with limited prompt engineering or fine-tuning.
      </p>
      <p>Regarding the detection of anxiety and depression, many researchers apply a multitask learning perspective (i.e., defining a primary and an auxiliary task, e.g., emotion inference). This is due to the availability of experimental labeled data in terms of emotional content [22]. These works sustain that stressed users are more likely to express negative emotions (e.g., anger, fear, and sadness) rather than positive ones (e.g., happiness). In this regard, Qureshi et al. [23] defined emotion classification as the second task, which follows depression as the main task, similar to what Ghosh et al. [24] proposed. The solution developed by Turcan et al. [25] is another representative example. Notably, the authors applied this approach to stress detection. They explored single-task models that operate similarly to bert and multitask learning with a bert model fine-tuned on emotion detection and stress labels. Finally, they exploited lime for interpretability. Although we agree with the strong relation between emotional load and mental health state, we believe that relying mainly on emotion detection to assess anxiety, depression, or stress may lead to false positive results. Thus, we incorporated this knowledge into the engineered features, using anxiety- and depression-labeled data as main tasks jointly in a multi-label setting.</p>
      <p>Moreover, Ghosh et al. [26] proposed a multitasking framework (not based on ml models) for depression detection, sentiment classification, and emotion recognition. Even if only slightly related to our research, the promising results obtained prove the appropriateness of addressing anxiety and depression simultaneously, given the strong link between both mental health conditions. Conversely, Sarkar et al. [27] developed a multitask learning solution with a data-sharing mechanism, given the relation between anxiety and depression. The authors used word embedding models like bert for feature engineering to feed traditional ml models, similar to our work but without the advantage in terms of explainability that llms provide. Similar to the work by Sarkar et al. [27] is the more recent proposal by Park et al. [28]. Additionally, Ilias and Askounis [29] defined a multitask learning framework in which depression and stress detection are the main and auxiliary tasks, respectively, using social media data. Note that two datasets gathered and labeled under different conditions are used. The first proposed approach encompasses a bert-based layer shared by both tasks, primary and auxiliary, followed by separate bert-based encoder layers. In contrast, the second approach derives from the first but exploits weighting layers by attention fusion networks. However, no hyperparameter tuning was performed due to limited access to computational resources. Explainability was not provided either.</p>
      <p>
        Despite the strong relation between anxiety or stress and depression, few studies address the joint assessment of several conditions [29]. In this regard, Lee et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focused on geriatric anxiety and depression detection (i.e., with experimental data from mild cognitive impairment patients) by exploiting low-cost activity trackers. Regarding the multi-label classification approach followed, the authors applied the binary relevance method. That is, unlike in our work, they used two single-label classifiers for anxiety and depression, respectively, which is a more straightforward way of approaching the problem. However, accuracy may be compromised since the solution does not consider the correlation between labels. As in our work, they include questionnaire-based features from the Geriatric Anxiety Inventory (gai) and the Geriatric Depression Scale (gds). In addition, Park et al. [30] also integrated the dsm-5 diagnostic criteria into their predictive methodology, which is based on a variant of the bert model. Similarly, de Souza et al. [31] proposed a stacking solution with two single-binary classifiers for anxiety, depression, and their comorbidity, leveraging social media data. Note that the authors used shap for interpretability.
      </p>
      <p>
        Some authors exploited the already mentioned plms. This is the case of Ahmed et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], who proposed a transformer-based architecture for multi-class depression detection (i.e., in severity levels: absent, mild, moderate, and severe). After text processing, different variants of the bert model are used for classification. The final result is obtained following a voting approach. The authors applied lime to provide interpretability to the solution. Ultimately, the proposed system was compared with Chatgpt (gpt-3.5-turbo model, non-fine-tuned), which attained poor performance. Related to our work, Chowdhury et al. [7] studied early depression detection from social media data using llms (i.e., gpt-4), deep learning (e.g., lstm) and transformer models (e.g., bert). However, the authors' approach to explainability is to provide feature-level interpretability. More recently, Ilias et al. [32] developed a transformer-based solution for stress and depression detection from social media data. Extra-linguistic information is introduced into the bert and Mentalbert models. However, the solution does not approach the detection tasks simultaneously, as in our work, which is much more challenging. Conversely, experiments were performed with datasets for the binary classification of stress and depression, respectively, and a multi-class (i.e., with different severity levels) depression dataset.
      </p>
      <p>When it comes to the application of llms, Wang et al. [6] leveraged a fine-tuned version of Chatgpt to detect depression. To ensure accurate predictions, the authors proposed a knowledge-enhanced pre-training scheme with emotion analysis capabilities and human feedback. Moreover, Liu et al. [14] used Chatgpt for data collection along with manually created psychology data that feed bert and Roberta models for depression detection. Regarding interpretability, shap was exploited. Similarly, Ohse et al. [33] investigated several plms and llms (e.g., bert, gpt-4, llama) for depression assessment using clinical interviews as experimental data. The authors exploited the models following the zero-shot paradigm without fine-tuning or prompt engineering. Despite being a relevant study to endorse the applicability of these models to the mental health field, the authors did not exploit their full potential, as already mentioned regarding the lack of tuning and also regarding explainability. Furthermore, Wang et al. [34] proposed a solution that searches for depression-related texts from the bdi questionnaire. Then, llms are used to fill in the latter survey using user data from social media to infer their mental state. Ultimately, Xu et al. [15] evaluated different llms (e.g., Alpaca, llama, gpt-4) for mental health classification (binary and multi-class prediction for stress, depression and suicide) from social media data exploiting prompt engineering. Note that this work differs from ours in the absence of the multi-label setting and explainable capabilities.</p>
      <sec id="sec-2-1">
        <title>2.1. Research contributions</title>
        <p>Table 1 shows the solutions most closely related to ours, to easily compare and assess our contributions. To the best of our knowledge, our work is the first to apply llms to extract users' expert features related to anxiety and depression. By this means, we can detect linguistic patterns and language usage, using the comprehension capabilities of llms without sacrificing explainability. Moreover, another relevant contribution is combining traditional ml models in a multi-label setting, which can offer higher accuracy. Consideration should also be given to the integration of formal clinical knowledge through the standard tests used for data labeling. More in detail, the experimental data consists of free conversations between patients and a conversational chatbot, in contrast to the popular social media data for anxiety and depression detection and the rigidity of self-reporting questionnaires. Ultimately, an explainability dashboard describes the most relevant data that leads to the classification decision and its confidence.</p>
        <p>Summing up, the main contributions of the proposed solution to the field are:
• A multi-label framework able to jointly predict anxiety and depression.
• The use of llms to extract high-level reasoning features used to train the ml models.
• An explainability dashboard that promotes trust and makes the solution accountable and reliable.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Data acquisition</title>
        <p>The experimental dataset is composed of conversations with the Celia chatbot7. This chatbot establishes an entertaining and engaging dialogue with end users, including fun facts about the conversation topics. Moreover, every 3 months, the chatbot uses the standard questionnaires presented in the Spanish versions of the Goldberg Anxiety and Depression Scales (gads) and the Yesavage Geriatric Depression Scale (ygds) to assess the cognitive state of the user. These questions are embedded in the conversation flow. The latter data is used as the label of the user (i.e., absence or presence of anxiety and depression) for the supervised learning stage.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature engineering</title>
        <p>The solution combines feature generation based on prompt engineering with a sliding window strategy to consider the history of past sessions8. Table 2 shows the engineered features, which can verse on the cognitive state of the end user (i.e., their emotional well-being or health condition) or on the dialogue itself (i.e., the discursive and linguistic characteristics of the conversation with the chatbot). These features are calculated using an llm and prompt engineering, and their values range from 0.0 to 1.09.</p>
        <p>7 Available at https://celiatecuida.com/en/home_en, October 2024. 8 A session is a complete dialogue with the end user until they decide to stop the conversation.</p>
        <p>Each generated feature is expanded with four new statistical features (the average and the three quartiles Q1, Q2, and Q3). For this purpose, a sliding window over the last 30 sessions is applied (see Equation (1), where $t$ is the current session, $F[t]$ is the history of the feature values up to session $t$, $N$ is the number of values in the window, and $S[t]$ is the ordered version of $F[t]$).</p>
        <p>$\forall t \in \{1 \ldots \infty\}$: $F[t] = \{f[0], \ldots, f[t]\}$; $S[t] = \{s_0[t], s_1[t], \ldots, s_{N-1}[t]\} \mid s_0[t] \le s_1[t] \le \ldots \le s_{N-1}[t]$, where $\forall s \in S[t]$, $s \in F[t]$; $\mathrm{avg}[t] = \frac{1}{N}\sum_{i=0}^{N-1} f[i]$; $Q_1[t] = s_{\lfloor N/4 \rceil}[t]$; $Q_2[t] = s_{\lfloor 2N/4 \rceil}[t]$; $Q_3[t] = s_{\lfloor 3N/4 \rceil}[t]$. (1)</p>
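<p>As a minimal sketch of this expansion, the statistics of Equation (1) can be computed over the sliding window in plain Python (the function and variable names are illustrative; the quartile indexing follows the rounded-index rule of the equation):</p>

```python
from statistics import mean

WINDOW = 30  # the last 30 sessions, as described in Section 3.2

def expand_feature(history):
    """Expand one llm-generated feature into four statistical features:
    the average and the quartiles Q1, Q2, Q3 over the sliding window."""
    window = history[-WINDOW:]   # F[t]: feature values of the last sessions
    s = sorted(window)           # S[t]: ordered version of F[t]
    n = len(s)

    def quartile(k):
        # Element at the rounded index k*N/4, clipped to the list bounds.
        return s[min(round(k * n / 4), n - 1)]

    return {"avg": mean(window),
            "q1": quartile(1), "q2": quartile(2), "q3": quartile(3)}
```

For instance, a history `[0.1, 0.4, 0.2, 0.9]` is sorted to `[0.1, 0.2, 0.4, 0.9]`, yielding the mean together with the three rounded-index quartiles.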
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Feature analysis &amp; selection</title>
        <p>In the cold-start step, the system uses 10 % of the samples to select the most relevant features. In this first phase, a selector based on a meta-transformer wrapper10 is applied (i.e., following a model-agnostic strategy). In order to select the most significant features, this transformer uses a tree-based ml classifier that calculates the Mean Decrease in Impurity (mdi)11 of each feature. Ultimately, the features with an mdi lower than the average are discarded.</p>
        <p>9 Polarity takes integer values: 0 for negative, 1 for neutral, and 2 for positive. 10 Available at https://scikit-learn.org/1.5/modules/feature_selection.html, October 2024.</p>
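<p>The mean-threshold rule can be sketched in plain Python as follows (an illustrative sketch: the feature names and importance values are hypothetical, and in practice the meta-transformer wrapper applies this rule to the mdi values of a fitted tree classifier):</p>

```python
def select_by_mdi(feature_names, importances):
    """Keep only the features whose importance (mdi) is at least the
    average importance; the remaining features are discarded."""
    threshold = sum(importances) / len(importances)
    return [name for name, imp in zip(feature_names, importances)
            if imp >= threshold]
```

With hypothetical importances `[0.4, 0.3, 0.2, 0.1]` for features `a`–`d`, the average is 0.25, so only `a` and `b` survive the selection.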
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Classification</title>
        <p>In this study, we solve a multi-label binary classification problem. More in detail, this scenario differs from the binary-class one in the number of labels assigned: in the latter, the classifier provides a single label between the two classes in the experimental data (i.e., the number of classes is 2 and the number of resulting labels is 1). Since ours is a binary problem, we applied the Multi-class Transformation Strategy (mts) [35]. Particularly, in our approach, we group the binary classes into four categories: (none_none), (none_depression), (anxiety_none), and (anxiety_depression). Note that multi-label classification is a complex problem because the results may be partially correct, preventing the use of standard ml evaluation metrics. Instead, the metrics described below (in micro and macro versions) are computed, where $Y_i$ is the actual label set, $\hat{Y}_i$ is the predicted label set, and $n$ is the number of samples:
• Exact match ratio represents the proportion of predictions where both labels are correct (see Equation (2)).</p>
        <p>$\mathrm{EMR} = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = \hat{Y}_i)$ (2)</p>
        <p>• Accuracy is the percentage of correctly predicted labels over the total predicted and actual categories (see Equation (3)).</p>
        <p>$\mathrm{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|}$ (3)</p>
        <p>• Precision is the percentage of correctly predicted labels over predicted labels (see Equation (4)).</p>
        <p>$\mathrm{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap \hat{Y}_i|}{|\hat{Y}_i|}$ (4)</p>
        <p>• Recall is the percentage of correctly predicted labels over actual labels (see Equation (5)).</p>
        <p>$\mathrm{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i|}$ (5)</p>
        <p>• Hamming Loss (hl) calculates the proportion of incorrectly predicted labels, where $L$ is the number of labels and $\triangle$ denotes the symmetric difference. In our binary problem, it complements the accuracy (see Equation (6)).</p>
        <p>$\mathrm{HL} = \frac{1}{n L} \sum_{i=1}^{n} |Y_i \,\triangle\, \hat{Y}_i|$ (6)</p>
        <p>11 Available at https://scikit-learn.org/1.5/auto_examples/inspection/plot_permutation_importance.html, October 2024.</p>
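<p>The metrics above can be sketched over label sets in plain Python (an illustrative implementation, not the authors' evaluation code; samples with empty sets score 1.0 in the ratio-based metrics by convention):</p>

```python
def multilabel_metrics(actual, predicted, n_labels=2):
    """Multi-label metrics over samples given as sets of label names."""
    n = len(actual)
    inter = [len(y.intersection(p)) for y, p in zip(actual, predicted)]
    # Exact match: both labels must be correct for the sample to count.
    exact = sum(1 for y, p in zip(actual, predicted) if y == p) / n
    # Accuracy: correct labels over the union of predicted and actual labels.
    acc = sum(i / len(y.union(p)) if y.union(p) else 1.0
              for i, y, p in zip(inter, actual, predicted)) / n
    # Precision: correct labels over predicted labels.
    prec = sum(i / len(p) if p else 1.0
               for i, p in zip(inter, predicted)) / n
    # Recall: correct labels over actual labels.
    rec = sum(i / len(y) if y else 1.0
              for i, y in zip(inter, actual)) / n
    # Hamming loss: share of wrongly predicted labels (symmetric difference).
    hloss = sum(len(y.symmetric_difference(p))
                for y, p in zip(actual, predicted)) / (n * n_labels)
    return {"exact_match": exact, "accuracy": acc, "precision": prec,
            "recall": rec, "hamming_loss": hloss}
```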
        <p>
          Our solution exploits the Naive Bayes (nb), Decision Tree (dt), and Random Forest (rf) classifiers, widely used in the literature to solve similar classification problems [
          <xref ref-type="bibr" rid="ref2">2, 24, 27</xref>
          ]. We analyze two different scenarios. Scenario 1 uses all user sessions to summarize the analysis, while scenario 2 evaluates the behavior of the system when reducing the number of samples by selecting 2 out of every 3 entries.
        </p>
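<p>The mts grouping described in this section can be sketched as a reversible mapping between the two binary labels and the four categories (an illustrative sketch; the category names follow the text):</p>

```python
# Map (anxiety, depression) binary label pairs to the four mts categories.
TO_CATEGORY = {
    (False, False): "none_none",
    (False, True): "none_depression",
    (True, False): "anxiety_none",
    (True, True): "anxiety_depression",
}
FROM_CATEGORY = {cat: pair for pair, cat in TO_CATEGORY.items()}

def encode(anxiety, depression):
    """Fold the two binary labels into one four-class target."""
    return TO_CATEGORY[(anxiety, depression)]

def decode(category):
    """Unfold a predicted category back into the two binary labels."""
    return FROM_CATEGORY[category]
```

Because the mapping is bijective, a single multi-class classifier trained on the four categories preserves the correlation between the anxiety and depression labels, unlike binary relevance.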
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Explainability</title>
        <p>Our system explains the predictions obtained by leveraging an llm with a prompt engineering template. This approach creates an explanation of the predicted majority category every 7 sessions. For this purpose, and to limit the computational load of the explainability module, the most representative statistics of the features in Table 2 (i.e., the average and the Q2) are considered. Moreover, the conversations of the last two sessions are also sent to the model with the predicted category for interpretability purposes. In addition to promoting trust among end users and clinicians and the accountability and reliability of the solution, this information can be exploited to recommend formal assessment in the primary care health system or treatment to prevent anxiety and depression.</p>
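<p>The input assembled for such an explanation request could look as follows (a hypothetical sketch: the prompt wording and field names are ours, not the authors' template):</p>

```python
def build_explanation_prompt(category, feature_stats, last_sessions):
    """Assemble the explainability request: the predicted majority category,
    the average and Q2 of each engineered feature, and the last two
    dialogues, as described in Section 3.5."""
    stats_lines = "\n".join(
        f"- {name}: avg={s['avg']:.2f}, q2={s['q2']:.2f}"
        for name, s in feature_stats.items()
    )
    dialogues = "\n---\n".join(last_sessions[-2:])
    return (
        f"Predicted category: {category}\n"
        f"Feature statistics (average and second quartile):\n{stats_lines}\n"
        f"Conversations of the last two sessions:\n{dialogues}\n"
        "Explain, in plain language, which evidence supports this prediction."
    )
```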
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <p>This section explains the experimental dataset used and the results obtained. The analysis was conducted
on a computer with the following specifications:
• Operating System (os): Ubuntu 18.04.2 LTS 64 bits
• Processor: Intel Core i9-10900K @ 2.80 GHz
• RAM: 96 GB DDR4
• Disk storage: 480 GB NVME + 500 GB SSD</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental data</title>
        <p>The dataset12 contains the complete conversations between voluntary users and the Celia chatbot from 16 May 2023 to 9 October 2024. Notably, it comprises 2186 user sessions and 32 users, with an average of 68 sessions per user. Moreover, each session comprises an average of 26 interactions and 157 words. Table 3 shows the distribution of sessions by category. More in detail, most cases are concentrated in people without any pathology, and the presence of depression overlaps with anxiety, reducing the number of isolated cases of the former. This increases the difficulty of the classification problem, given the imbalance of the experimental data, even more so in the multi-label scenario.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data acquisition</title>
        <p>To motivate a new conversation with Celia, the assistant sends notifications by email and shows reminders to the users. In this line, to detect the end of a session, the user must say goodbye, or the session is automatically finished after 3 minutes of inactivity. Sessions with 5 or fewer human interventions are discarded to ensure that a significant amount of data enters the anxiety and depression detection system. In addition, as described in Section 3.1, every 3 months, the samples are re-labeled on anxiety and depression using the standard gads and ygds questionnaires.</p>
        <p>12 The experimental dataset is available on request from the authors.</p>
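<p>The discarding rule can be sketched as a simple filter (the session records and field names are hypothetical):</p>

```python
MIN_INTERVENTIONS = 6  # sessions with 5 or fewer human interventions are discarded

def keep_session(session):
    """Return True when the session carries enough user text for detection."""
    return session["human_interventions"] >= MIN_INTERVENTIONS

# Hypothetical session records: only the second carries enough interventions.
sessions = [
    {"id": 1, "human_interventions": 3},
    {"id": 2, "human_interventions": 12},
]
valid = [s for s in sessions if keep_session(s)]
```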
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Feature engineering</title>
        <p>The features described in Table 2 are generated using the gpt-4o-mini13 model. It is accessed by sending requests to the Openai api14, using the prompt in Listing 1. Each request contains the text of the complete session and the temperature parameter set to 0. This removes the randomness and ensures that the model provides the same response to the same input content.</p>
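<p>Such a request can be sketched as the payload sent per session (an illustrative sketch: the placeholder prompt text stands in for Listing 1, which is not reproduced here, and no api call is made):</p>

```python
def build_feature_request(session_text, prompt):
    """Build the payload for one feature-extraction request: the
    engineering prompt plus the full session text, with temperature 0
    so that repeated requests yield the same feature scores."""
    return {
        "model": "gpt-4o-mini",
        "temperature": 0,
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": session_text},
        ],
    }
```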
        <p>For each of the 14 features, the average and the three quartiles Q1, Q2, and Q3 are calculated. In total, 56 features are used in this multi-label problem. Once calculated, the values are rounded to 2 digits, and those features with identical values are discarded.</p>
        <sec id="sec-4-3-1">
          <title>LISTING 1: Feature engineering using gpt-4o-mini.</title>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Feature analysis &amp; selection</title>
        <p>Our approach uses the SelectFromModel15 class of scikit-learn in combination with the rf classifier16 to analyze and select the most relevant features. This analysis is performed using 10 % of the dataset. Specifically, 39 % of the original features are selected in scenarios 1 and 2.
13 Available at https://platform.openai.com/docs/models/gpt-4o-mini, October 2024. 14 Available at https://openai.com/api, October 2024. 15 Available at https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html, October 2024. 16 Available at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html, October 2024.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Classification</title>
        <p>We evaluate our approach using the scikit-learn Python library. The nb17, dt18 and rf models are
selected.</p>
        <p>The hyperparameters of these classifiers are optimized using the GridSearchCV19 method. We use 10-fold cross-validation to select the configuration that offers the best accuracy value over 10 % of the experimental dataset. Listings 2, 3 and 4 contain the parameter sets for nb, dt, and rf, respectively. The values selected for scenarios 1 and 2 are shown below.</p>
        <p>Scenario 1:
• nb: var_smoothing = 1e-05
• dt: splitter = best, max_features = None, max_depth = 100, min_samples_split = 0.001,
min_samples_leaf = 0.001, criterion = entropy
• rf: n_estimators = 200, max_features = sqrt, max_depth = 10, min_samples_split = 2,
min_samples_leaf = 1, criterion = gini
Scenario 2:
• nb: var_smoothing = 1e-05
• dt: splitter = random, max_features = sqrt, max_depth = 100, min_samples_split =
0.001, min_samples_leaf = 0.001, criterion = entropy
• rf: n_estimators = 100, max_features = None, max_depth = 5, min_samples_split = 2,
min_samples_leaf = 1, criterion = entropy</p>
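        <p>A minimal sketch of the grid search over the rf parameter space follows; the grid is abridged from Listing 4 for brevity, and the synthetic data are illustrative:</p>

```python
# Sketch of hyperparameter optimization with GridSearchCV
# (abridged grid; the paper searches the full Listing 4 grid).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [10, 20],   # paper grid: [100, 150, 200]
    "max_depth": [5, None],     # paper grid: [5, 10, 100, None]
    "criterion": ["gini", "entropy"],
}

# 10-fold cross-validation keeps the configuration with the best accuracy.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```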
        <sec id="sec-4-5-1">
          <title>LISTING 2: nb hyperparameter configuration.</title>
          <p>var_smoothing: [1e-9, 1e-5, 1e-1]</p>
        </sec>
        <sec id="sec-4-5-2">
          <title>LISTING 3: dt hyperparameter configuration.</title>
          <p>splitter: [best, random],
max_features: [None, sqrt, log2],
max_depth: [1, 100, None],
min_samples_split: [0.001, 0.1, 1],
min_samples_leaf: [0.001, 0.1, 1],
criterion: [gini, entropy]</p>
          <p>LISTING 4: rf hyperparameter configuration.
n_estimators: [100, 150, 200],
max_features: [sqrt, log2, None],
max_depth: [5, 10, 100, None],
min_samples_split: [2, 5, 10],
min_samples_leaf: [1, 2, 5],
criterion: [gini, entropy]</p>
          <p>Table 4 shows the classification results obtained. Note that all three models attain
promising results, with rf attaining the best values, all above 80 % regardless of the scenario.
Moreover, the accuracy value is close to 90 % (i.e., 10 % for hl) and the exact match value is close to
85 %. In scenario 1, rf obtains an increase of +22 % compared to the exact match of nb and +10 %
compared to the macro precision of dt. Moreover, in scenario 2, the reduction of the dataset has a
slight, non-critical effect on the classifiers.
17Available at https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html, October 2024.
18Available at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html, October 2024.
19Available at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html, October 2024.</p>
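          <p>For reference, the exact match and hl metrics discussed above can be computed with scikit-learn as follows; the predictions here are toy values, not the actual model outputs:</p>

```python
# Sketch of the multi-label metrics reported in Table 4 (toy predictions).
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Rows: samples; columns: [anxiety, depression] labels.
y_true = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1]])
y_pred = np.array([[1, 0], [1, 0], [0, 0], [0, 1], [1, 1]])

exact_match = accuracy_score(y_true, y_pred)  # all labels correct per sample
hl = hamming_loss(y_true, y_pred)             # fraction of wrong labels
print(f"exact match = {exact_match:.2f}, hamming loss = {hl:.2f}")
```

Note that in the multi-label setting accuracy_score computes subset accuracy (exact match), while hamming loss counts individual label errors, so one partially wrong sample lowers exact match by a full sample but hl by only one label.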
          <p>
            Compared to the most closely related work in the literature, the multi-label approach by Lee et al. [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] attains similar results, even though it exploits data from low-cost activity trackers. However, as
explained in the related work discussion, the authors focused on mild cognitive impairment. Moreover,
their multi-label strategy is also different: they applied the binary relevance method with single-label
classifiers for anxiety and depression separately, without considering the correlation between these two
conditions, whereas we leverage the mts. Furthermore, their experimental data are limited
to 20 samples, while our approach provides a more realistic and transparent assessment using free
dialogues.
          </p>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Explainability</title>
        <p>Figure 2 shows the dashboard accessible to caregivers, physicians, and end users, with the four most
relevant features from Table 2 on the top. The boxes are green if the feature values are below 50 % and
red otherwise. At the bottom of the figure, the explanation generated by the gpt-4o-mini model is
shown. Our approach uses the prompt engineering template described in Listing 5, filled with the average
values of the features in Table 2, the conversations of the last 2 sessions, and the
predicted majority category. On the right, the dashboard indicates the prediction and the confidence
percentage obtained with the predict_proba function20.</p>
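        <p>The confidence percentage shown on the dashboard can be obtained from the fitted classifier as sketched below; the model and data are illustrative stand-ins, not the deployed system:</p>

```python
# Sketch of the dashboard confidence value via predict_proba
# (toy single-label model; names and settings are assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:1])[0]  # class probabilities for one user
confidence = 100 * proba.max()       # reported as a percentage
print(f"prediction = {clf.predict(X[:1])[0]}, confidence = {confidence:.1f} %")
```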
        <sec id="sec-4-6-1">
          <title>LISTING 5: Prompt for explainability.</title>
          <p>This system analyzes the user’s anxiety and depression state using conversations
with a chatbot. The last 30 sessions of the user return the following feature
values:
Average:
insecurity: X,
loneliness: X,
negative_emotion: X,
positive_emotion: X,
sadness: X,
anguish: X,
health_issues: X,
catastrophic_terms: X,
emphasized_terms: X,
repeated_concepts: X,
interjections: X,
negative_adverbs: X,
negatives_terms: X,
polarity: X
insecurity: X,
loneliness: X,
negative_emotion: X,
positive_emotion: X,
sadness: X,
anguish: X,
health_issues: X,
catastrophic_terms: X,
emphasized_terms: X,
repeated_concepts: X,
interjections: X,
negative_adverbs: X,
negatives_terms: X,
polarity: X
Moreover, the last 2 conversations are:
-----------------------------------------
[conversations]
-----------------------------------------
The prediction of our machine learning model is that the user &lt;does not suffer |
suffers&gt; [anxiety/depression/anxiety and depression].
20Available at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html, October 2024.</p>
          <p>Generate an exposition of no more than 400 characters in natural language that
summarizes the reasons why this prediction has been generated by the model based on
the information provided.</p>
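          <p>Filling the Listing 5 template programmatically can be sketched as follows; the build_prompt helper and the toy values are assumptions for illustration, and the actual call to gpt-4o-mini is omitted:</p>

```python
# Sketch of filling the Listing 5 prompt template before it is sent to
# gpt-4o-mini (feature names from Table 2; values here are toy data).
FEATURES = ["insecurity", "loneliness", "negative_emotion", "positive_emotion",
            "sadness", "anguish", "health_issues", "catastrophic_terms",
            "emphasized_terms", "repeated_concepts", "interjections",
            "negative_adverbs", "negatives_terms", "polarity"]

def build_prompt(averages, conversations, prediction):
    lines = ["This system analyzes the user's anxiety and depression state "
             "using conversations with a chatbot. The last 30 sessions of "
             "the user return the following feature values:"]
    lines.append("Average:")
    lines += [f"{name}: {averages[name]}" for name in FEATURES]
    lines.append("Moreover, the last 2 conversations are:")
    lines.extend(conversations)
    lines.append("The prediction of our machine learning model is that "
                 f"the user {prediction}.")
    lines.append("Generate an exposition of no more than 400 characters in "
                 "natural language that summarizes the reasons why this "
                 "prediction has been generated by the model based on the "
                 "information provided.")
    return "\n".join(lines)

prompt = build_prompt({name: 0.5 for name in FEATURES},
                      ["User: I feel tired lately.", "User: I slept badly."],
                      "suffers anxiety")
print(prompt[:80])
```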
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Given the appalling consequences of anxiety and depression, timely detection of these conditions is of
the utmost importance. Traditional screening methods are time-consuming and rely on rigid, subjective
assessment with interviews and questionnaires. Moreover, despite the strong relationship between
anxiety or stress and depression, few studies address the joint assessment of several conditions.</p>
      <p>ai-based solutions offer several advantages regarding flexibility, scalability, and
personalization. However, their performance in specific classification problems with task-specific data, like
anxiety and depression, is still immature when they are used as final solutions, as is the case with llms. Moreover,
some solutions lack generalization and multitask robustness, apart from low interpretability, which
prevents their practical use beyond academic research. Interpretability and explainability are especially
relevant in this field, given their direct impact on clinicians’ decision-making and, thus, the patient’s
well-being.</p>
      <p>In this work, an entirely novel system for the multi-label classification of anxiety and depression is
proposed. Another relevant contribution lies in combining llms for feature extraction, which are intrinsically
explainable but lack specific downstream knowledge, with ml models operating in a multi-label setting,
which can offer higher accuracy but lack explainability. Specifically, by relying on llms solely as part of the
feature engineering module to extract user-level knowledge from free dialogues with a conversational
assistant, we mitigate the hallucination problem. In addition, we leverage formal medical knowledge
using clinical scales for anxiety and depression to label the experimental data. Moreover, explainability
descriptions of the model’s decision are provided in a graphical dashboard along with the confidence
of the results to promote the solution’s trustworthiness, reliability, and accountability. Experimental
results on a real dataset attain 90 % accuracy, improving those in the prior literature. The ultimate
objective is to contribute in an accessible and scalable way before formal treatment occurs in the
healthcare systems.</p>
      <p>In future work, we plan to evolve the solution to study severity levels of mental health conditions, as
well as deploy the system in a real-world setting (i.e., stream-based ml). Another line of work will focus
on the analysis of non-verbal and paraverbal data (e.g., voice modulation).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration of competing interest</title>
      <p>The authors have no competing interests to declare relevant to this article’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration of studies in humans</title>
      <p>This study was carried out following the World Medical Association Declaration of Helsinki.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was partially supported by Xunta de Galicia grants ED481B-2022-093 and ED481D 2024/014,
Spain.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[5] F. B. Tumaliuan, L. Grepo, E. R. Jalao, Development of Depression Data Sets and a Language Model
for Depression Detection: Mixed Methods Study, JMIR Data 5 (2024) e53365. doi:10.2196/53365.
[6] X. Wang, K. Liu, C. Wang, Knowledge-enhanced Pre-Training large language model for depression
diagnosis and treatment, in: Proceeding of IEEE International Conference on Cloud Computing
and Intelligence Systems, IEEE, 2023, pp. 532–536. doi:10.1109/CCIS59572.2023.10263217.
[7] A. K. Chowdhury, S. R. Sujon, M. S. S. Shafi, T. Ahmmad, S. Ahmed, K. M. Hasib, F. M. Shah,
Harnessing large language models over transformer models for detecting Bengali depressive
social media text: A comprehensive study, Natural Language Processing Journal 7 (2024) 100075.
doi:10.1016/j.nlp.2024.100075.
[8] A. Nowacki, W. Sitek, H. Rybiński, LLMental: Classification of Mental Disorders with Large
Language Models, in: Proceedings of the International Symposium on Methodologies for Intelligent
Systems, Springer, 2024, pp. 35–44. doi:10.1007/978-3-031-62700-2_4.
[9] S. Ji, T. Zhang, L. Ansari, J. Fu, P. Tiwari, E. Cambria, MentalBERT: Publicly Available Pretrained
Language Models for Mental Healthcare, in: Proceedings of the Language Resources and Evaluation
Conference, European Language Resources Association, 2022, p. 7184–7190.
[10] K. Yang, T. Zhang, Z. Kuang, Q. Xie, J. Huang, S. Ananiadou, MentaLLaMA: interpretable mental
health analysis on social media with large language models, in: Proceedings of the ACM on Web
Conference, Association for Computing Machinery, 2024, pp. 4489–4500. doi:10.1145/3589334.3648137.
[11] V. Vajre, M. Naylor, U. Kamath, A. Shehu, PsychBERT: A Mental Health Language Model for Social
Media Mental Health Behavioral Analysis, in: Proceedings of the IEEE International Conference
on Bioinformatics and Biomedicine, IEEE, 2021, pp. 1077–1082. doi:10.1109/BIBM52615.2021.9669469.
[12] T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W. Chung, I. Beltagy, J. Launay, C. Raffel, What
Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?,
in: Proceedings of Machine Learning Research, volume 162, MLR Press, 2022, pp. 1–21.
[13] M. Moor, O. Banerjee, Z. S. H. Abad, H. M. Krumholz, J. Leskovec, E. J. Topol, P. Rajpurkar,
Foundation models for generalist medical artificial intelligence, Nature 616 (2023) 259–265.
doi:10.1038/s41586-023-05881-4.
[14] Y. Liu, X. Ding, S. Peng, C. Zhang, Leveraging ChatGPT to optimize depression intervention
through explainable deep learning, Frontiers in Psychiatry 15 (2024) 1383648. doi:10.3389/fpsyt.2024.1383648.
[15] X. Xu, B. Yao, Y. Dong, S. Gabriel, H. Yu, J. Hendler, M. Ghassemi, A. K. Dey, D. Wang, Mental-llm:
Leveraging large language models for mental health prediction via online text data, in: Proceedings
of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, volume 8, Association
for Computing Machinery, 2024, pp. 1–32. doi:10.1145/3643540.
[16] N. Gomes, M. Pato, A. R. Lourenço, N. Datia, A Survey on Wearable Sensors for Mental Health
Monitoring, Sensors 23 (2023) 1330. doi:10.3390/s23031330.
[17] A. M. Salih, Z. Raisi-Estabragh, I. B. Galazzo, P. Radeva, S. E. Petersen, K. Lekadir, G. Menegaz, A
Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME, Advanced Intelligent
Systems (2024) 2400304. doi:10.1002/aisy.202400304.
[18] L. Coroama, A. Groza, Explainable Artificial Intelligence for Person Identification, in: Proceedings
of the IEEE International Conference on Intelligent Computer Communication and Processing,
IEEE, 2021, pp. 375–382. doi:10.1109/ICCP53602.2021.9733525.
[19] L. Nannini, J. Alonso-Moral, A. Catala, M. Lama, S. Barro, Operationalizing Explainable AI in the EU
Regulatory Ecosystem, IEEE Intelligent Systems (2024) 37–48. doi:10.1109/MIS.2024.3383155.
[20] L. Ren, H. Lin, B. Xu, S. Zhang, L. Yang, S. Sun, Depression Detection on Reddit With an
Emotion-Based Attention Network: Algorithm Development and Validation, JMIR Medical Informatics 9
(2021) e28754. doi:10.2196/28754.
[21] A. B. S. Rahman, H.-T. Ta, L. Najjar, A. Azadmanesh, A. S. Gönul, DepressionEmo: A novel dataset
for multilabel classification of depression emotions, Journal of Affective Disorders (2024) 445–458.
doi:10.1016/j.jad.2024.08.013.
[22] R. W. Levenson, Stress and Illness: A Role for Specific Emotions, Psychosomatic Medicine 81
(2019) 720–730. doi:10.1097/PSY.0000000000000736.
[23] S. A. Qureshi, G. Dias, M. Hasanuzzaman, S. Saha, Improving Depression Level Estimation by
Concurrently Learning Emotion Intensity, IEEE Computational Intelligence Magazine 15 (2020)
47–59. doi:10.1109/MCI.2020.2998234.
[24] S. Ghosh, A. Ekbal, P. Bhattacharyya, What Does Your Bio Say? Inferring Twitter Users’
Depression Status From Multimodal Profile Information Using Deep Learning, IEEE Transactions on
Computational Social Systems 9 (2022) 1484–1494. doi:10.1109/TCSS.2021.3116242.
[25] E. Turcan, S. Muresan, K. McKeown, Emotion-Infused Models for Explainable Psychological Stress
Detection, in: Proceedings of the Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Association for Computational
Linguistics, 2021, p. 2895–2909. doi:10.18653/v1/2021.naacl-main.230.
[26] S. Ghosh, A. Ekbal, P. Bhattacharyya, A Multitask Framework to Detect Depression, Sentiment
and Multi-label Emotion from Suicide Notes, Cognitive Computation 14 (2022) 110–129. doi:10.1007/s12559-021-09828-7.
[27] S. Sarkar, A. Alhamadani, L. Alkulaib, C.-T. Lu, Predicting Depression and Anxiety on Reddit: a
Multi-task Learning Approach, in: Proceedings of the IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining, IEEE, 2022, pp. 427–435. doi:10.1109/ASONAM55673.2022.10068655.
[28] D. Park, G. Lee, S. Kim, T. Seo, H. Oh, S. J. Kim, Probability-based multi-label classification
considering correlation between labels–focusing on DSM-5 depressive disorder diagnostic criteria,
IEEE Access (2024) 70289–70296. doi:10.1109/ACCESS.2024.3401704.
[29] L. Ilias, D. Askounis, Multitask learning for recognizing stress and depression in social media,
Online Social Networks and Media 37-38 (2023) 100270. doi:10.1016/j.osnem.2023.100270.
[30] D. Park, S. Lim, Y. Choi, H. Oh, Depression Emotion Multi-Label Classification Using Everytime
Platform With DSM-5 Diagnostic Criteria, IEEE Access 11 (2023) 89093–89106. doi:10.1109/ACCESS.2023.3305477.
[31] V. B. de Souza, J. C. Nobre, K. Becker, DAC Stacking: A Deep Learning Ensemble to Classify
Anxiety, Depression, and Their Comorbidity From Reddit Texts, IEEE Journal of Biomedical and
Health Informatics 26 (2022) 3303–3311. doi:10.1109/JBHI.2022.3151589.
[32] L. Ilias, S. Mouzakitis, D. Askounis, Calibration of Transformer-Based Models for Identifying
Stress and Depression in Social Media, IEEE Transactions on Computational Social Systems 11
(2024) 1979–1990. doi:10.1109/TCSS.2023.3283009.
[33] J. Ohse, B. Hadžić, P. Mohammed, N. Peperkorn, M. Danner, A. Yorita, N. Kubota, M. Rätsch,
Y. Shiban, Zero-Shot Strike: Testing the generalisation capabilities of out-of-the-box LLM models
for depression detection, Computer Speech &amp; Language 88 (2024) 101663. doi:10.1016/j.csl.2024.101663.
[34] Y. Wang, D. Inkpen, P. K. Gamaarachchige, Explainable depression detection using large language
models on social media data, in: Proceedings of the Workshop on Computational Linguistics and
Clinical Psychology, Association for Computational Linguistics, 2024, pp. 108–126.
[35] A. Rivolli, J. Read, C. Soares, B. Pfahringer, A. C. P. L. F. de Carvalho, An empirical analysis of
binary transformation strategies and base algorithms for multi-label learning, Machine Learning
109 (2020) 1509–1563. doi:10.1007/s10994-020-05879-3.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Ahmed, S. Ivan, A. Munir, S. Ahmed, Decoding depression: Analyzing social network insights for depression severity assessment with transformers and explainable AI, Natural Language Processing Journal 7 (2024) 100079. doi:10.1016/j.nlp.2024.100079.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. R. Lee, G. H. Kim, M. T. Choi, Identification of Geriatric Depression and Anxiety Using Activity Tracking Data and Minimal Geriatric Assessment Scales, Applied Sciences (Switzerland) 12 (2022) 2488. doi:10.3390/app12052488.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. Saylam, Ö. D. İncel, Multitask Learning for Mental Health: Depression, Anxiety, Stress (DAS) Using Wearables, Diagnostics 14 (2024) 501. doi:10.3390/diagnostics14050501.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Marwaha, E. Palmer, T. Suppes, E. Cons, A. H. Young, R. Upthegrove, Novel and emerging treatments for major depression, The Lancet 401 (2023) 141–153. doi:10.1016/S0140-6736(22)02080-3.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>