Leveraging Prompt Engineering and Large Language Models for Automating MADRS Score Computation for Depression Severity Assessment

Alessandro Raganato1,∗, Francesco Bartoli2, Cristina Crocamo2, Daniele Cavaleri2, Giuseppe Carrà2,3, Gabriella Pasi1 and Marco Viviani1,∗

1 Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
2 School of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
3 Division of Psychiatry, University College London, London, UK

Abstract
This study ventures into the field of psychiatry by investigating the interactive dynamics between psychiatrists and their patients. The primary goal is to create an automated scoring mechanism, based on prompt engineering techniques applied to Large Language Models (LLMs), to assess the severity of depressive symptoms from these dialogues. In particular, the generation of a depression severity score against the MADRS, a rating scale widely used in psychiatry, is automated. This work aims to highlight the potential of these techniques to improve traditional diagnostic approaches in psychiatry. The results, while not optimal, are promising, also with a view to developing a full-fledged system in the future that enables more targeted and timely interventions, thereby improving patient outcomes and the overall level of mental health.

Keywords
Mental Health, MADRS, Prompt Engineering, Large Language Models, Natural Language Processing
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29–30, 2024, Naples, Italy.
∗ Corresponding author.
alessandro.raganato@unimib.it (A. Raganato); francesco.bartoli@unimib.it (F. Bartoli); cristina.crocamo@unimib.it (C. Crocamo); d.cavaleri1@campus.unimib.it (D. Cavaleri); giuseppe.carra@unimib.it (G. Carrà); gabriella.pasi@unimib.it (G. Pasi); marco.viviani@unimib.it (M. Viviani)
ORCID: 0000-0002-7018-7515 (A. Raganato); 0000-0003-2612-4119 (F. Bartoli); 0000-0002-2979-2107 (C. Crocamo); 0000-0001-5342-9394 (D. Cavaleri); 0000-0002-6877-6169 (G. Carrà); 0000-0002-6080-8170 (G. Pasi); 0000-0002-2274-9050 (M. Viviani)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

The assessment of symptom severity plays a crucial role in the clinical management of mental disorders, being pivotal in diagnosing and monitoring the mental well-being of patients [1]. Traditionally, this evaluation has relied heavily on clinical experience, sometimes supported by questionnaires and rating scales during in-person visits. However, advances in Machine Learning (ML) and Natural Language Processing (NLP) techniques offer the potential for automated systems that can support the assessment of symptom severity in dialogues between psychiatrists and a growing number of patients. In particular, the evolving landscape of prompt engineering techniques applied to Large Language Models (LLMs) presents a novel avenue for developing such systems, to better support psychiatric assessment practices in the future.

This study embarks on the task of automatically mapping psychiatrist–patient dialogue content to the Montgomery–Åsberg Depression Rating Scale (MADRS) [2], a widely accepted instrument for evaluating depression severity, through the potential of recently developed generative Artificial Intelligence (AI) models [3]. To establish a foundation, a manual mapping performed by clinical experts is employed to connect question–answer pairs from a set of psychiatrist–patient dialogues to the corresponding items of the MADRS questionnaire, together with the corresponding scores (both at the individual-item level and at the global level). This manual mapping serves as a benchmark for comparison with the results obtained from the considered AI-based approaches.

In a first approach, distinct prompt engineering techniques applied to LLMs are leveraged to compute a depression severity score for each MADRS item. Each item is devoted to assessing a different symptom domain, such as sadness, inner tension, or reduced sleep, rated on a scale from 0 to 6, with higher scores indicating more severe depressive symptoms. The computed item scores are then aggregated to provide an overall assessment ranging from 0 to 60, with higher scores indicating more severe depression. In a second approach, we evaluate the effectiveness of using prompts to directly compute the overall depression severity score.

This study serves as a preliminary step to explore the future feasibility of an advanced conversational system that generates questions and analyzes responses to automatically assess symptom severity levels.
The obtained results show that the proposed approaches, with the best models tested, reach an accuracy of about 70% in mapping conversations to MADRS scores, with a fairly high correlation. While not optimal, this result appears encouraging, in the belief that refinements to the models (via fine-tuning) and to the prompts could lead to better results and to the goal of developing a fully automated system.

2. Related Work

The urgent need for innovation around access to and quality of mental health care has become clear in the last few years [4]. A growing number of mental health-related digital strategies for therapeutic approaches have been offered via ML and, more generally, AI models, contributing to the development of detection systems for mental disorders, e.g., [5, 6, 7].

However, although significant progress has been made in the field, several barriers hinder the implementation of detection systems in real-world applications, including the need for increased transparency and replication [8]. Moreover, the literature is sparse, with a high degree of heterogeneity between studies and the use of non-standardized metrics reporting [9]. In addition, several areas remain understudied, including the use of these approaches among people suffering from mental disorders such as depression. Nonetheless, a few studies have analyzed automated approaches for evaluating depression.

A recent study trained ML models to diagnose depression from the spontaneous responses of 113 outpatients, using interviews by experienced physicians that were first audio-recorded and transcribed verbatim; it showed that automated depression diagnosis based on interviews is a feasible approach [10]. The use of transcribed autobiographical memory interviews was also considered for patients with treatment-resistant depression treated with psilocybin [11]. Quantitative speech measures were computed from the interview data of 17 patients and 18 untreated age-matched healthy control subjects, and an ML algorithm was developed to classify controls versus patients and to predict treatment response. Results showed that speech analytics and ML successfully differentiated individuals with depression from healthy controls and identified treatment responders from non-responders with a significant level of accuracy and precision. More generally, question-based computational language assessment, based on self-reported and freely generated word responses analyzed with AI, has been shown to be a potential tool that may complement rating scales and evaluate mental health issues in clinical settings [12].

A recent systematic review highlighted preliminary favorable evidence about the use of conversational agents (i.e., tools providing feedback to user input related to well-being and mental health queries) and their promising role in the screening, assessment, diagnosis, and treatment of mental disorders, including the effective identification of people with depressive symptoms [8, 13, 14]. For instance, discreet text interfaces possibly allowed participants to feel more comfortable using conversational agents in public [15].

Although these approaches appear to ensure optimal control over conversation flow and topics, benefiting users and providers, a pre-defined response range may decrease usability across diverse clinical settings, with risks such as possibly disrupting the therapeutic alliance [15]. Indeed, a feasible option for a mass-screening integrated approach for the early detection of depression is intended as a means of assisting, through automation and concealed communication with verified scoring systems, rather than replacing clinical interviews [16]. Moreover, the diversity of outcomes and of outcome measurement instruments employed in studies on conversational agents for mental health points to the need for an established minimum core outcome set and greater use of validated instruments [17]. Therefore, an enhanced personalization of conversational agents, leveraging the interdisciplinary use of NLP techniques to better understand the context of conversations about vulnerable experiences related to depressive symptoms – with a more human-like approach – appears desirable [18].

3. Guiding LLMs to Automate MADRS Score Computation

LLMs are advanced AI systems [19] capable of generating human-like text across a wide range of topics, and thus seem to be the most suitable tool for addressing the problem outlined above. However, to accomplish a particular task, specific instructions or prompts must be crafted to guide these models; this process is known as prompt engineering [20], and it has been gaining importance in medicine in recent years [21].

3.1. Basics of Prompt Engineering

The main prompting techniques employed in the literature today are Zero-Shot (ZS), Few-Shot (FS), and Chain-of-Thought (CoT) learning. In ZS learning, the LLM is provided with a prompt describing the task to be accomplished, without any examples or task-specific training data; the model attempts to generate a suitable response based solely on its understanding of the task description. FS learning extends ZS by providing the model with a small number of examples or demonstrations of the task at hand, which serve as additional context to help the model understand the task better. Finally, CoT prompts guide the model to generate coherent and logically connected responses by structuring the prompt sequentially: each step builds upon the previous one, creating a chain of thoughts that guides the model's generation process.

3.2. Automated Score Computation

Having made this necessary premise about prompt engineering, we can illustrate the two approaches proposed in this article to perform the considered task, denoted as local and global. For both approaches we consider ZS and CoT prompting, since the number of examples available in the considered dataset (detailed in Section 4.1) is insufficient to perform FS. This entails designing an appropriate prompt template for each prompting technique with respect to each approach.

3.2.1. Local Computation Approach

We ask LLMs, appropriately guided by prompts, to generate a score for each item of the MADRS. The items and their descriptions are illustrated in Figure 1, while the ZS and CoT prompt templates are detailed in the following.

Zero-Shot Learning. The model is simply asked to generate a score for each item of the MADRS. These items are specified in the template, as follows:

    Given the following document containing a conversation between a physician and a patient, denoted by M and P respectively, following the Montgomery-Åsberg Depression Rating Scale (MADRS), answer me with the severity score, from a minimum of 0 (symptom absent) to a maximum of 6 (extremely severe), for the following item only: [item title, description]. Answer me only with a value between a minimum of 0 and a maximum of 6 related only to the described label. Below is the document to be analyzed: [document].

This template is repeated for each of the 10 items of the MADRS, and [item title, description] contains the title and description shown in Figure 1 for each item, for example: Reduced sleep, representing the experience of reduced duration or depth of sleep compared to the subject's own normal pattern when well. Once the scores for each item are obtained, they are simply added together to obtain the overall score.
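The local computation approach described above can be sketched in a few lines of Python: one ZS prompt per MADRS item, a parse of the returned 0–6 value, and a sum into the 0–60 total. This is an illustrative sketch only, not the authors' released code: the `query_llm` callable is a hypothetical placeholder for any LLM API, and the item titles are taken from the standard MADRS (the paper's Figure 1 also carries full descriptions).

```python
import re

# The 10 standard MADRS item titles (descriptions omitted; Figure 1 has them).
MADRS_ITEMS = [
    "Apparent sadness", "Reported sadness", "Inner tension", "Reduced sleep",
    "Reduced appetite", "Concentration difficulties", "Lassitude",
    "Inability to feel", "Pessimistic thoughts", "Suicidal thoughts",
]

ZS_LOCAL_TEMPLATE = (
    "Given the following document containing a conversation between a physician "
    "and a patient, denoted by M and P respectively, following the Montgomery-Åsberg "
    "Depression Rating Scale (MADRS), answer me with the severity score, from a "
    "minimum of 0 (symptom absent) to a maximum of 6 (extremely severe), for the "
    "following item only: {item}. Answer me only with a value between a minimum of 0 "
    "and a maximum of 6 related only to the described label. "
    "Below is the document to be analyzed: {document}"
)

def parse_item_score(answer: str) -> int:
    """Extract the first standalone integer in [0, 6] from the model's answer."""
    match = re.search(r"\b[0-6]\b", answer)
    if match is None:
        raise ValueError(f"No 0-6 score found in: {answer!r}")
    return int(match.group())

def local_madrs_score(document: str, query_llm, use_cot: bool = False) -> int:
    """Query one prompt per item and sum the item scores into the 0-60 total."""
    total = 0
    for item in MADRS_ITEMS:
        prompt = ZS_LOCAL_TEMPLATE.format(item=item, document=document)
        if use_cot:
            # The paper's CoT variant simply appends a rationale request;
            # a robust parser would then extract the score from the rationale.
            prompt += " Provide the rationale before answering."
        total += parse_item_score(query_llm(prompt))
    return total
```

For example, with a stub model that always answers "3", `local_madrs_score(doc, lambda p: "3")` returns 30, i.e., ten items at severity 3 each.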
Figure 1: A detail of the 10 items, with related descriptions, that constitute the MADRS.

CoT Learning. In this preliminary work, the CoT approach simply asks the model to provide a motivation before performing the task; this helps the model make a more informed decision than in the ZS scenario. The CoT template used is therefore:

    [ZS "local" template] + Provide the rationale before answering.

Also in this case, the scores for each item are summed to obtain the overall score.

3.2.2. Global Computation Approach

Here, LLMs are appropriately guided to directly generate the overall depression score with respect to the MADRS.

Zero-Shot Learning. The ZS template employed in the global computation approach is as follows:

    Given the following document containing a conversation between a physician and a patient, denoted by M and P respectively, following the Montgomery-Åsberg Depression Rating Scale (MADRS), answer me with what would be the severity score with respect to depression that you would assign. The threshold values are: 0 to 6 no depression, 7 to 19 mild depression, 20 to 34 moderate depression, and 35 to 60 severe depression. Answer only with a value between a minimum of 0 and a maximum of 60. Below is the document to be analyzed: [document].

CoT Learning. CoT learning in the global approach uses the ZS "global" template, in which reasoning is required before the answer is provided:

    [ZS "global" template] + Provide the rationale before answering.

4. Comparative Evaluation

In this section, we present the results of the comparative evaluation of the local and global approaches, in relation to the various proposed prompt engineering techniques (and thus the different templates used). First, we introduce the dataset employed in the evaluations and the technical characteristics of the implemented models.

4.1. The Conversation Dataset

It is well understood, especially in a field as delicate as psychiatry, that dealing with patient data is complex and ethically sensitive. For this reason, for this preliminary study, a team of medical experts generated a small dataset in which clinicians took on the roles of both doctor and patient. This was done to create typical conversations covering various levels of depression severity, namely: severe depression, moderate depression, mild depression, and absence of depression. In total, 10 doctor–patient conversations were generated in Italian, with at least 3 conversations for each of the first three severity levels. Clinicians also labeled the questions and answers against the corresponding items of the MADRS and provided both item-level and global scores for each conversation.1

1 The dataset used and the respective labels and scores can be downloaded at the following address: https://drive.google.com/file/d/18HL5v8Hh2GBm1l0dt9Z8cHW0Opy8JgA7/view?usp=sharing.

4.2. Technical Details

To assess the effectiveness of generative models in addressing the considered problem, various LLMs were tested. These models were trained on diverse datasets and tailored to a multilingual context, given that our psychiatrist–patient conversations are in Italian. In particular, the following models were used. GPT-3.5: GPT-3.5-turbo-0613, an iteration of the Generative Pre-trained Transformer (GPT) model developed by OpenAI; it is an advanced version of its predecessor, GPT-3, with improvements in aspects such as model architecture, training data, and fine-tuning techniques. GPT-4: GPT-4-0613, a large multimodal model (accepting image and text inputs, emitting text outputs).2 Mistral: Mistral-7B-Instruct-v0.2, an instruction fine-tuned 7B LLM, trained mainly on English data but also acquainted with Italian during its pretraining phase [22]. Mixtral: Mixtral-8x7B-Instruct-v0.1, a pretrained generative Sparse Mixture-of-Experts model, trained mainly on 5 languages including Italian; it has 46.7B total parameters but only uses 12.9B parameters per token.3 Dante: DanteLLM_instruct_7b-v0.2-boosted, a recent state-of-the-art Italian LLM based on the 7B Mistral model.4 Hermes: Hermes7b_ITA, a 7B LLM trained on a 120K instruction/answer dataset in Italian; it is based on the Nous-Hermes-llama-2-7b LLM, a version of meta/Llama-2-7b fine-tuned to follow instructions.5

2 https://platform.openai.com/docs/models/overview
3 https://mistral.ai/news/mixtral-of-experts/
4 https://github.com/RSTLess-research/DanteLLM
5 https://huggingface.co/raicrits/Hermes7b_ITA

4.3. Results

The results obtained measure the effectiveness of the above-mentioned models, in conjunction with the appropriate prompting templates, in correctly predicting the item-level scores and overall score of each conversation compared with those assigned by the medical experts. They are reported in terms of accuracy (Acc.) and of the Pearson (P.) and Spearman (S.) correlation coefficients.

4.3.1. Local Computation Results

Tables 1 and 2 show the results of the prompts and LLMs applied to the local computation approach.

Table 1: Overall results for the local computation approach.

                 ZS                  CoT
Model     Acc.   P.     S.    Acc.   P.     S.
GPT-3.5   0.30   0.81   0.81  0.30   0.86   0.83
GPT-4     0.30   0.92   0.88  0.40   0.93   0.90
Mistral   0.30   0.70   0.69  0.40   0.85   0.91
Mixtral   0.40   0.92   0.91  0.40   0.86   0.87
Dante     0.30   0.47   0.42  0.40   0.27   0.16
Hermes    0.40   0.51   0.54  0.60   0.31   0.15

The results in Table 1 show that, especially in terms of accuracy, the local approach does not provide satisfactory overall results. However, a substantial improvement can be appreciated when models are asked to explain the reasons for their choices (CoT), in particular for the Hermes model. The Pearson and Spearman correlation coefficients are overall quite high, improving in the CoT scenario for models trained on larger amounts of data and decreasing for the smaller ones.

Table 2: Correlation results for each MADRS item in the local CoT scenario.

          #1           #2           #3           #4           #5           #6           #7           #8           #9           #10
Model     P.    S.     P.    S.     P.    S.     P.    S.     P.    S.     P.    S.     P.    S.     P.    S.     P.    S.     P.    S.
GPT-3.5   0.61  0.80   0.35  0.24   0.48  0.56   0.73  0.81   0.74  0.79   0.60  0.66   0.54  0.58   0.17  0.24   0.31  0.41   0.83  0.87
GPT-4     0.65  0.51   0.61  0.50   0.70  0.67   0.89  0.79   0.90  0.89   0.18  0.36   0.83  0.76   0.47  0.37   0.84  0.83   0.95  0.96
Mistral   0.15  0.20   0.64  0.78   0.53  0.21   0.71  0.79   0.21  0.20   0.40  0.54  -0.34 -0.37   0.31  0.31   0.82  0.82   0.94  0.93
Mixtral   0.46  0.48   0.91  0.88   0.73  0.43   0.76  0.69   0.84  0.90   0.21  0.35   0.72  0.64  -0.52 -0.36   0.36  0.39   0.83  0.87
Dante    -0.32 -0.49   0.49  0.66   0.68  0.75   0.47  0.50  -0.78 -0.76  -0.08 -0.08  -0.25 -0.05  -0.04  0.09   0.11  0.11   0.24  0.25
Hermes    0.57  0.56  -0.25 -0.61   0.06  0.24   0.07  0.01  -0.16 -0.22  -0.25 -0.32   0.30  0.17   0.24  0.16   0.18  0.29  -0.02  0.22

As emerges from Table 2, referring to the local computation approach in the CoT scenario, the correlation of the scores predicted for the individual items is generally not very high, although it is objectively better for some specific items, such as #4 (i.e., reduced sleep, for the models trained on more data) and #10 (i.e., suicidal thoughts, again for the larger models). The smaller, Italian-specific models do not correlate well on this task.

4.3.2. Global Computation Results

Table 3 shows the results of the prompts and LLMs applied to the global computation approach.

Table 3: Overall results for the global computation approach.

                 ZS                  CoT
Model     Acc.   P.     S.    Acc.   P.     S.
GPT-3.5   0.70   0.66   0.62  0.60   0.79   0.71
GPT-4     0.60   0.96   0.94  0.40   0.87   0.82
Mistral   0.20   0.47   0.23  0.60   0.22   0.51
Mixtral   0.50   0.43   0.57  0.50   0.33   0.20
Dante     0.30  -0.03   0.13  0.70   0.68   0.86
Hermes    0.30   0.31   0.47  0.50   0.76   0.64

The results in this case show that an accuracy of around 70% can be achieved. It is particularly interesting to note that the best models are the GPT-based ones in the ZS case, while Dante is the best in the CoT case, despite being one of the worst with the ZS technique. The Pearson and Spearman correlation coefficients show a significant increase for the smaller models in the CoT scenario, with variable fluctuations for the larger models.

Concerning Figure 2, which illustrates the confusion matrix for the global computation approach with the Dante model in the CoT scenario, we can observe that the model does not confuse depression severity classes that are too distant from each other.

Figure 2: Dante's CoT global confusion matrix.

4.3.3. Further Investigating Best Results

Across the approaches, prompt engineering techniques, and LLMs considered, the global approach is clearly superior to the local one. This suggests that LLMs have a greater chance of success on this task when the conversation is considered as a whole to produce the global MADRS score, without the model being asked to generate MADRS item-based scores to be aggregated later. However, we operated in a context in which we did not provide the model with specific examples according to a Few-Shot strategy, which needs to be investigated in the future.
scores for each MADRS item, and the direct calculation of As it emerges from Table 2, referring to the local com- the overall depression severity score. The results demon- putation approach in the CoT scenario, the correlation strated that the proposed approaches, coupled with the with respect to the scores predicted in the individual best-performing models, achieved an accuracy of approx- imately 70% in mapping conversations to MADRS scores. Though the current accuracy shows promise, there is L. Fitzgerald, J. Stroud, D. J. Nutt, R. L. Carhart- room for improvement. Future studies could refine mod- Harris, Natural speech algorithm applied to base- els, improve prompt techniques, explore new methods, line interview data can predict which patients will and use more data sources. This could lead to an au- respond to psilocybin for treatment-resistant de- tomated system that generates questions and evaluates pression, Journal of affective disorders (2018). symptom severity from dialogue analysis. [12] K. Kjell, P. Johnsson, S. Sikström, Freely generated word responses analyzed with artificial intelligence predict self-reported symptoms of depression, anx- References iety, and worry, Frontiers in Psychology (2021). [13] P. Philip, J.-A. Micoulaud-Franchi, P. Sagaspe, E. D. [1] J. J. Silverman, M. Galanter, M. Jackson-Triche, D. G. Sevin, J. Olive, S. Bioulac, A. Sauteraud, Virtual Jacobs, J. W. Lomax, M. B. Riba, L. D. Tong, K. E. human as a new diagnostic tool, a proof of concept Watkins, L. J. Fochtmann, R. S. Rhoads, et al., The study in the field of major depressive disorders, american psychiatric association practice guide- Scientific reports 7 (2017) 42656. lines for the psychiatric evaluation of adults, Amer- [14] G. Dosovitsky, B. S. Pineda, N. C. Jacobson, ican Journal of Psychiatry 172 (2015) 798–802. C. Chang, E. L. Bunge, et al., Artificial intelligence [2] B. Fantino, N. 
Moore, The self-reported chatbot for depression: descriptive study of usage, montgomery-åsberg depression rating scale is a JMIR Formative Research 4 (2020) e17065. useful evaluative tool in major depressive disorder, [15] A. N. Vaidyam, D. Linggonegoro, J. Torous, Changes BMC psychiatry 9 (2009) 1–6. to the psychiatric chatbot landscape: A systematic [3] K.-B. Ooi, G. W.-H. Tan, M. Al-Emran, M. A. Al- review of conversational agents in serious mental Sharafi, A. Capatina, A. Chakraborty, Y. K. Dwivedi, illness: Changements du paysage psychiatrique des T.-L. Huang, A. K. Kar, V.-H. Lee, et al., The poten- chatbots: une revue systématique des agents con- tial of generative artificial intelligence across disci- versationnels dans la maladie mentale sérieuse, The plines: Perspectives and future directions, Journal Canadian Journal of Psychiatry 66 (2021) 339–348. of Computer Information Systems (2023) 1–32. [16] P. Kaywan, K. Ahmed, A. Ibaida, Y. Miao, B. Gu, [4] J. Torous, K. J. Myrick, N. Rauseo-Ricupero, J. Firth, Early detection of depression using a conversational et al., Digital mental health and covid-19: using ai bot: A non-clinical trial, Plos one (2023). technology today to accelerate the curve on access [17] A. I. Jabir, L. Martinengo, X. Lin, J. Torous, M. Sub- and quality tomorrow, JMIR mental health (2020). ramaniam, L. Tudor Car, Evaluating conversational [5] M. Fokkema, D. Iliescu, S. Greiff, M. Ziegler, Ma- agents for mental health: Scoping review of out- chine learning and prediction in psychological as- comes and outcome measurement instruments, J sessment, European Journal of Psychological As- Med Internet Res 25 (2023). sessment (2022). [18] A. Ahmed, A. Hassan, S. Aziz, A. A. Abd-Alrazaq, [6] S. S. Panicker, P. Gayathri, A survey of ma- N. Ali, M. Alzubaidi, D. Al-Thani, B. Elhusein, M. A. chine learning techniques in physiology based men- Siddig, M. 
Ahmed, et al., Chatbot features for anxi- tal stress detection systems, Biocybernetics and ety and depression: a scoping review, Health infor- Biomedical Engineering 39 (2019) 444–469. matics journal 29 (2023) 14604582221146719. [7] M. Viviani, C. Crocamo, M. Mazzola, F. Bartoli, [19] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, G. Carrà, G. Pasi, Assessing vulnerability to psy- H. Chen, X. Yi, C. Wang, Y. Wang, et al., A sur- chological distress during the covid-19 pandemic vey on evaluation of large language models, ACM through the analysis of microblogging content, Fu- Transactions on Intelligent Systems and Technol- ture Generation Computer Systems (2021). ogy (2023). [8] A. N. Vaidyam, H. Wisniewski, J. D. Halamka, M. S. [20] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, Kashavan, J. B. Torous, Chatbots and conversational A. Chadha, A systematic survey of prompt engi- agents in mental health: a review of the psychiatric neering in large language models: Techniques and landscape, The Canadian Journal of Psychiatry 64 applications, 2024. arXiv:2402.07927 . (2019) 456–464. [21] B. Meskó, Prompt engineering as an important [9] A. Viduani, V. Cosenza, R. M. Araújo, C. Kieling, emerging skill for medical professionals: tutorial, Chatbots in the field of mental health: challenges Journal of Medical Internet Research (2023). and opportunities, Digital Mental Health: A Practi- [22] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, tioner’s Guide (2023) 133–148. D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, [10] K. Mao, Y. Wu, J. Chen, A systematic review on G. Lample, L. Saulnier, et al., Mistral 7b, arXiv automated clinical depression diagnosis, npj Mental preprint arXiv:2310.06825 (2023). Health Research 2 (2023) 20. [11] F. Carrillo, M. Sigman, D. F. Slezak, P. Ashton,