<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Leveraging Prompt Engineering and Large Language Models for Automating MADRS Score Computation for Depression Severity Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Raganato</string-name>
          <email>alessandro.raganato@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Bartoli</string-name>
          <email>francesco.bartoli@unimib.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Crocamo</string-name>
          <email>cristina.crocamo@unimib.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Cavaleri</string-name>
          <email>d.cavaleri1@campus.unimib.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Carrà</string-name>
          <email>giuseppe.carra@unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <email>gabriella.pasi@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Viviani</string-name>
          <email>marco.viviani@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Mental Health, MADRS, Prompt Engineering, Large Language Models, Natural Language Processing</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, Systems, and Communication, University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Division of Psychiatry, University College London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Medicine and Surgery, University of Milano-Bicocca</institution>
          ,
          <addr-line>Monza</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>This study ventures into the field of psychiatry by investigating the interactive dynamics between psychiatrists and their patients. The primary goal is to create an automated scoring mechanism that applies prompt engineering techniques to Large Language Models (LLMs) in order to assess the severity of depressive symptoms from these dialogues. In particular, we automate the generation of a depression severity score against MADRS, a rating scale widely used in psychiatry. This work aims to highlight the potential of these techniques to improve traditional diagnostic approaches in psychiatry. The results, while not optimal, are promising and motivate the future development of a full-fledged system that enables more targeted and timely interventions, thereby improving patient outcomes and the overall level of mental health.</p>
      </abstract>
      <kwd-group>
        <kwd>Mental Health</kwd>
        <kwd>MADRS</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The assessment of symptom severity plays a crucial role
in diagnosing and monitoring the mental well-being
of patients [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Traditionally, this evaluation has relied heavily
on clinical experience, sometimes supported by
questionnaires and rating scales during in-person
visits. However, advancements in Machine Learning (ML)
and Natural Language Processing (NLP) techniques offer
the potential for automated systems that can support clinicians in
assessing symptom severity from dialogues
between psychiatrists and patients. In particular, the evolving landscape of prompt
engineering techniques applied to Large Language Models (LLMs)
presents a novel avenue for developing such
systems to better support psychiatric assessment practices
in the future.
      </p>
      <p>
        This study, in particular, embarks on the task of automatically mapping
psychiatrist-patient dialogue content to the Montgomery-Åsberg Depression
Rating Scale (MADRS) using generative AI
models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To establish a foundation, clinical experts manually
map question-answer pairs from
some psychiatrist-patient dialogues to the
corresponding items of the MADRS questionnaire and assign
the corresponding scores (both at the individual item
level and for the conversation as a whole). These mappings and scores serve
as a benchmark for subsequent comparison with the results
obtained from the considered AI-based approaches.
      </p>
      <p>Ital-IA 2024: 4th National Conference on Artificial Intelligence.</p>
      <p>In a first approach, distinct prompt engineering
techniques applied to LLMs are leveraged to compute
depression severity scores for each MADRS item. Each item is
devoted to assessing a different symptom domain, such
as sadness, inner tension, reduced sleep, etc., rated on a
scale from 0 to 6, with higher scores indicating more
severe depressive symptoms. The computed scores are
then aggregated to provide an overall assessment,
ranging from 0 to 60, with higher scores indicating more
severe depression. In a second approach, we evaluate the
effectiveness of using prompts to directly compute the
overall depression severity score.</p>
      <p>This study serves as a preliminary step to explore the
feasibility, in the future, of creating an advanced
conversational system that generates questions and analyzes patients'
responses to automatically assess symptom severity levels.
The obtained results illustrate that the proposed
approaches and the best models tested have an accuracy
of about 70% in mapping conversations
to MADRS scores, with a fairly high correlation.
While not optimal, this result is encouraging, in the
belief that refinements of the models (via fine-tuning)
and of the prompts could lead to better results and advance
the goal of developing a fully automated system.</p>
    </sec>
    <sec id="sec-related-work">
      <title>2. Related Work</title>
      <p>
        The urgent need for innovation around access and
quality of mental health care has become clear in the last few
years [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. More and more mental health-related digital
strategies for therapeutic approaches have been offered
via ML and, in general, AI models, thus contributing to
the development of detection systems for mental
disorders, e.g., [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ].
      </p>
      <p>
        However, although significant progress has been made
in the field, there are several barriers to the
implementation of detection systems in real-world applications,
including a need for increased transparency and replication
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Moreover, the literature is sparse, with a high degree
of heterogeneity between studies and the use of
non-standardized metrics reporting [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In addition, several
areas remain understudied, including the use of these
approaches among people suffering from mental disorders
such as depression. Nonetheless, a few studies have analyzed
automated approaches for evaluating depression.
      </p>
      <p>
        A recent study trained ML models to diagnose
depression from spontaneous responses of 113 outpatients
using interviews by experienced physicians that were
first audio-recorded and transcribed verbatim. The study
showed automated depression diagnosis based on
interviews to be a feasible approach [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The use of transcribed
autobiographical memory interviews was also considered
for patients with treatment-resistant depression treated
with psilocybin [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Quantitative speech measures were
computed using the interview data from 17 patients and
18 untreated age-matched healthy control subjects, and
an ML algorithm was developed to classify between
controls and patients and predict treatment response.
Results showed that speech analytics and ML successfully
differentiated individuals with depression from healthy
controls and identified treatment responders from
non-responders with a significant level of accuracy and
precision. More generally, question-based computational
language assessment, based on self-reported and freely
generated word responses analyzed with AI, has been shown
to be a potential tool that may complement rating scales and
evaluate mental health issues in clinical settings [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A
recent systematic review highlighted preliminary
favorable evidence about the use of conversational agents (i.e.,
tools providing feedback to user input related to
well-being and mental health queries) and their promising
role in screening, assessment, diagnosis, and treatment
of mental disorders, including the effective
identification of people with depressive symptoms [
        <xref ref-type="bibr" rid="ref13 ref14 ref8">8, 13, 14</xref>
        ]. For
instance, discreet text interfaces possibly allowed
participants to feel more comfortable using conversational
agents in public [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Although these approaches appear to ensure optimal
control over conversation flow and topics, benefiting
users and providers, a pre-defined response range may
decrease usability in a diverse range of clinical settings with
different risks, such as possibly disrupting the therapeutic
alliance [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Indeed, a feasible option for developing a
mass screening integrated approach for early detection
of depression is intended as a means of assisting with
automation and concealed communication with verified
scoring systems rather than replacing clinical interviews
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Moreover, the diversity of outcomes and the choice
of outcome measurement instruments employed in
studies on conversational agents for mental health point to
the need for an established minimum core outcome set
and greater use of validated instruments [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Therefore,
an enhanced personalization of conversational agents,
leveraging the interdisciplinary use of NLP techniques to
better understand the context of the conversation about
vulnerable experiences related to depressive symptoms
with a more human-like approach, appears desirable
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Guiding LLMs to Automate MADRS Score Computation</title>
      <p>
        LLMs are advanced AI systems [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] that can
generate human-like text across a wide range
of topics, and thus appear well suited to
addressing the open problems in the literature outlined above.
However, to accomplish a particular task, specific instructions
(prompts) must be crafted to
guide these models; this process is known as prompt
engineering [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and it has been gaining importance in medicine in recent
years [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>3.1. Basics of Prompt Engineering</title>
        <p>The main prompting techniques employed today in the
literature are known as Zero-Shot (ZS), Few-Shot (FS),
and Chain-of-Thought (CoT) learning. In ZS learning, the
LLM is provided with a prompt (describing the task to be
accomplished) without any examples or specific training
data for that task. Despite this, the model attempts to
generate a suitable response based solely on its
understanding of the task description. FS learning extends ZS
by providing the model with a small number of examples
or demonstrations for the task at hand. These examples
serve as additional context for the model to understand
the task better. Finally, CoT prompts guide the model
to generate coherent and logically connected responses
by sequentially structuring the prompt. Each step of the
prompt builds upon the previous one, creating a chain of
thoughts that guide the model’s generation process.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Automated Score Computation</title>
        <p>Having introduced the basics of prompt
engineering, we can illustrate the two different approaches
proposed in this article to perform the considered task,
denoted as local and global. For both approaches, we
consider the ZS and CoT prompting techniques, since the
number of available examples in the
considered dataset (detailed in Section 4.1) is insufficient to perform FS.
This means designing appropriate prompt templates for
each prompting technique with respect to each approach.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Local Computation Approach</title>
          <p>We ask LLMs appropriately guided by prompts to
generate a score for each item of the MADRS. Such items and
their descriptions are illustrated in Figure 1, while ZS
and CoT prompt templates are detailed in the following.</p>
          <p>Zero-Shot Learning. The model is simply asked to
generate a score for each item of the MADRS. These items
are specified in the template, as follows:</p>
          <p>Given the following document containing a
conversation between a physician and a
patient, denoted by M and P respectively,
following the Montgomery-Åsberg Depression
Rating Scale (MADRS), answer me with the
severity score, from a minimum of 0
(symptom absent) to a maximum of 6 (extremely
severe), for the following item only: [item
title, description]. Answer me only with
a value between a minimum of 0 and a
maximum of 6 related only to the described
label. Below is the document to be analyzed:
[document].</p>
          <p>This template is repeated for each of the 10 items of
MADRS, and [item title, description] contains the title
and description shown in Figure 1 for each item, for
example: Reduced sleep, representing the experience of
reduced duration or depth of sleep compared to the subject’s
own normal pattern when well. Once the scores for each
item are obtained, they are simply added together to
obtain the overall score.</p>
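          <p>As an illustration only, the local Zero-Shot procedure described above can be sketched in Python as follows. The template string is transcribed from the text; the function names, the (title, description) item representation, and the ask_llm callable (standing in for whatever model interface is used and assumed to return an integer) are hypothetical and do not come from the original work.</p>
          <preformat><![CDATA[
```python
from typing import Callable, Sequence, Tuple

# Zero-Shot "local" template, transcribed from the text above.
LOCAL_ZS_TEMPLATE = (
    "Given the following document containing a conversation between a "
    "physician and a patient, denoted by M and P respectively, following the "
    "Montgomery-Åsberg Depression Rating Scale (MADRS), answer me with the "
    "severity score, from a minimum of 0 (symptom absent) to a maximum of 6 "
    "(extremely severe), for the following item only: {item}. Answer me only "
    "with a value between a minimum of 0 and a maximum of 6 related only to "
    "the described label. Below is the document to be analyzed: {document}."
)


def build_local_prompt(title: str, description: str, document: str) -> str:
    """Fill the template for a single MADRS item."""
    return LOCAL_ZS_TEMPLATE.format(item=f"{title}, {description}", document=document)


def aggregate(item_scores: Sequence[int]) -> int:
    """Sum ten item scores (each 0 to 6) into the overall 0 to 60 score."""
    if len(item_scores) != 10:
        raise ValueError("MADRS has 10 items")
    if any(not 0 <= s <= 6 for s in item_scores):
        raise ValueError("each item score must lie between 0 and 6")
    return sum(item_scores)


def local_score(
    items: Sequence[Tuple[str, str]],
    document: str,
    ask_llm: Callable[[str], int],  # hypothetical model interface
) -> int:
    """Query the model once per item and add up the answers."""
    return aggregate([ask_llm(build_local_prompt(t, d, document)) for t, d in items])
```
]]></preformat>
          <p>Keeping the model call behind a plain callable means the same loop can be reused with any of the tested LLMs; validating the range before summing guards against replies outside the 0 to 6 scale.</p>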
          <p>CoT Learning. In this preliminary work, the CoT
approach is based on simply asking the model to provide
a motivation before performing the task. This helps the
model make a more informed decision than the ZS
scenario. Therefore, the CoT template used is as follows:
[ZS “local” template] + Provide the
rationale before answering.</p>
          <p>Also in this case, the scores for each item are summed
up to obtain the overall score.</p>
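          <p>With CoT prompting the reply contains a rationale, so the numeric score has to be read out of free text. The text does not say how this is done; the following Python sketch is one possible approach under a stated assumption, namely that the last integer within the admissible range is the answer. The names to_cot and extract_score are hypothetical.</p>
          <preformat><![CDATA[
```python
import re
from typing import Optional

# Suffix that turns a ZS template into its CoT variant (from the text above).
COT_SUFFIX = " Provide the rationale before answering."


def to_cot(zs_prompt: str) -> str:
    """Append the CoT instruction to an already filled ZS prompt."""
    return zs_prompt + COT_SUFFIX


def extract_score(reply: str, maximum: int = 6) -> Optional[int]:
    """Return the last integer in [0, maximum] found in the reply, else None.

    Assumption: the model states its rationale first and the score last.
    """
    candidates = [int(tok) for tok in re.findall(r"\d+", reply)]
    valid = [c for c in candidates if c <= maximum]
    return valid[-1] if valid else None
```
]]></preformat>
          <p>This heuristic is fragile: a reply such as "4/6" or "4.5" would be misread, so in practice replies that do not end with a single clear number may need manual inspection. Passing maximum=60 adapts the same function to an overall MADRS score.</p>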
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Global Computation Approach</title>
          <p>Here, LLMs are appropriately guided to directly generate
the overall depression score with respect to MADRS.</p>
          <p>Zero-Shot Learning. The ZS template employed in
this global approach to computation is as follows:</p>
          <p>Given the following document containing a
conversation between a physician and a
patient, denoted by M and P respectively,
following the Montgomery-Åsberg Depression
Rating Scale (MADRS), answer me with
what would be the severity score with
respect to depression that you would assign.
The threshold values are: 0 to 6 no
depression, 7 to 19 mild depression, 20 to 34
moderate depression, and 35 to 60 severe
depression. Answer only with a value between
the minimum of 0 and a maximum of 60.
Below is the document to be analyzed: [document].</p>
          <p>CoT Learning. CoT learning in the global approach
uses the ZS “global” template, in which reasoning is
required before providing the answer:
[ZS “global” template] + Provide the rationale before answering.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-comparative-evaluation">
      <title>4. Comparative Evaluation</title>
      <p>In this section, we present the results of the comparative
evaluation of the local and global approaches, in relation
to the various proposed prompt engineering techniques
(and thus, regarding the different templates used). Firstly,
we introduce the dataset employed in the evaluations and
the technical characteristics of the implemented models.</p>
      <sec id="sec-3-3">
        <title>4.1. The Conversation Dataset</title>
        <p>It is well understood, especially in such a delicate field
as psychiatry, that dealing with patient data is rather
complex and ethically sensitive. For this reason, for this
preliminary study, a team of medical experts generated a
small dataset in which clinicians took on the roles of both
the doctor and the patient. This was done to create
typical conversations regarding various levels of depression
severity, namely: severe depression, moderate depression,
mild depression, and absence of depression. In total, 10
doctor-patient conversations were generated in Italian,
with at least 3 conversations for the first three previously
outlined severity levels. Clinicians also labeled the
questions and answers against the corresponding items of the
MADRS and provided both item-level and global scores
for the entire conversation. The dataset used and the respective
labels and scores can be downloaded at the following address:
https://drive.google.com/file/d/18HL5v8Hh2GBm1l0dt9Z8cHW0Opy8JgA7/view?usp=sharing.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.2. Technical Details</title>
        <p>To assess the effectiveness of generative models in
addressing the considered problem, various LLMs were
tested. These models were trained on diverse datasets,
tailored for a multilingual context, given that our
psychiatrist-patient conversations are in Italian. In
particular, the following models were used:</p>
        <list list-type="bullet">
          <list-item>
            <p>GPT-3.5: GPT-3.5-turbo-0613, an iteration of the Generative Pre-trained Transformer (GPT) model developed by OpenAI. It is an advanced version of its predecessor, GPT-3, with improvements in various aspects such as model architecture, training data, and fine-tuning techniques.</p>
          </list-item>
          <list-item>
            <p>GPT-4: GPT-4-0613, a large multimodal model (accepting image and text inputs, emitting text outputs) (https://platform.openai.com/docs/models/overview).</p>
          </list-item>
          <list-item>
            <p>
              Mistral: Mistral-7B-Instruct-v0.2, an instruct fine-tuned 7B LLM, trained mainly on English data, but also acquainted with Italian during its pretraining phase [
              <xref ref-type="bibr" rid="ref22">22</xref>
              ].
            </p>
          </list-item>
          <list-item>
            <p>Mixtral: Mixtral-8x7B-Instruct-v0.1, a pretrained generative Sparse Mixture of Experts model, trained mainly on 5 languages including Italian. It has 46.7B total parameters but only uses 12.9B parameters per token (https://mistral.ai/news/mixtral-of-experts/).</p>
          </list-item>
          <list-item>
            <p>Dante: DanteLLM_instruct_7b-v0.2-boosted, a recent state-of-the-art Italian LLM based on the 7B Mistral model (https://github.com/RSTLess-research/DanteLLM).</p>
          </list-item>
          <list-item>
            <p>Hermes: Hermes7b_ITA, a 7B LLM trained on a 120K instruction/answer dataset in Italian. It is based on the Nous-Hermes-llama-2-7b LLM, a version of meta/Llama-2-7b fine-tuned to follow instructions (https://huggingface.co/raicrits/Hermes7b_ITA).</p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-3-5">
        <title>4.3. Results</title>
        <p>The results obtained measure the effectiveness of the
above-mentioned models, in conjunction with the
appropriate prompting templates, in correctly predicting the
item-level scores and overall score of each conversation
compared with those assigned by the medical experts.
They are illustrated in terms of accuracy (Acc.), Pearson
(P.), and Spearman (S.) correlation coefficients.</p>
        <sec id="sec-3-5-1">
          <title>4.3.1. Local Computation Results</title>
          <p>[Table 1: local computation results in terms of Acc., P., and S.
The table layout was not recoverable; surviving values:
Acc. 0.30, 0.40, 0.40, 0.40, 0.40, 0.60; 0.86, 0.93, 0.85, 0.86, 0.27, 0.31.]</p>
          <p>The results in Table 1 show that,
especially in terms of accuracy, the local approach does not
provide satisfactory overall results. However, a
substantial improvement can be appreciated when models are
asked to explain the reasons for their choices (CoT),
in particular for the Hermes model. Regarding the
Pearson and Spearman correlation coefficients, we can
observe that these are globally quite high, improving in
the CoT scenario for models trained on larger amounts
of data and decreasing for smaller ones.</p>
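          <table-wrap id="tab-2">
            <label>Table 2</label>
            <caption>
              <p>Pearson (P.) and Spearman (S.) correlation for each MADRS item (#1 to #10), local computation approach, CoT scenario.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Model</th><th colspan="2">#1</th><th colspan="2">#2</th><th colspan="2">#3</th><th colspan="2">#4</th><th colspan="2">#5</th><th colspan="2">#6</th><th colspan="2">#7</th><th colspan="2">#8</th><th colspan="2">#9</th><th colspan="2">#10</th></tr>
                <tr><th></th><th>P.</th><th>S.</th><th>P.</th><th>S.</th><th>P.</th><th>S.</th><th>P.</th><th>S.</th><th>P.</th><th>S.</th><th>P.</th><th>S.</th><th>P.</th><th>S.</th><th>P.</th><th>S.</th><th>P.</th><th>S.</th><th>P.</th><th>S.</th></tr>
              </thead>
              <tbody>
                <tr><td>GPT-3.5</td><td>0.61</td><td>0.80</td><td>0.35</td><td>0.24</td><td>0.48</td><td>0.56</td><td>0.73</td><td>0.81</td><td>0.74</td><td>0.79</td><td>0.60</td><td>0.66</td><td>0.54</td><td>0.58</td><td>0.17</td><td>0.24</td><td>0.31</td><td>0.41</td><td>0.83</td><td>0.87</td></tr>
                <tr><td>GPT-4</td><td>0.65</td><td>0.51</td><td>0.61</td><td>0.50</td><td>0.70</td><td>0.67</td><td>0.89</td><td>0.79</td><td>0.90</td><td>0.89</td><td>0.18</td><td>0.36</td><td>0.83</td><td>0.76</td><td>0.47</td><td>0.37</td><td>0.84</td><td>0.83</td><td>0.95</td><td>0.96</td></tr>
                <tr><td>Mistral</td><td>0.15</td><td>0.20</td><td>0.64</td><td>0.78</td><td>0.53</td><td>0.21</td><td>0.71</td><td>0.79</td><td>0.21</td><td>0.20</td><td>0.40</td><td>0.54</td><td>-0.34</td><td>-0.37</td><td>0.31</td><td>0.31</td><td>0.82</td><td>0.82</td><td>0.94</td><td>0.93</td></tr>
                <tr><td>Mixtral</td><td>0.46</td><td>0.48</td><td>0.91</td><td>0.88</td><td>0.73</td><td>0.43</td><td>0.76</td><td>0.69</td><td>0.84</td><td>0.90</td><td>0.21</td><td>0.35</td><td>0.72</td><td>0.64</td><td>-0.52</td><td>-0.36</td><td>0.36</td><td>0.39</td><td>0.83</td><td>0.87</td></tr>
                <tr><td>Dante</td><td>-0.32</td><td>-0.49</td><td>0.49</td><td>0.66</td><td>0.68</td><td>0.75</td><td>0.47</td><td>0.50</td><td>-0.78</td><td>-0.76</td><td>-0.08</td><td>-0.08</td><td>-0.25</td><td>-0.05</td><td>-0.04</td><td>0.09</td><td>0.11</td><td>0.11</td><td>0.24</td><td>0.25</td></tr>
                <tr><td>Hermes</td><td>0.57</td><td>0.56</td><td>-0.25</td><td>-0.61</td><td>0.06</td><td>0.24</td><td>0.07</td><td>0.01</td><td>-0.16</td><td>-0.22</td><td>-0.25</td><td>-0.32</td><td>0.30</td><td>0.17</td><td>0.24</td><td>0.16</td><td>0.18</td><td>0.29</td><td>-0.02</td><td>0.22</td></tr>
              </tbody>
            </table>
          </table-wrap>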
        </sec>
        <sec id="sec-3-5-2">
          <title>4.3.2. Global Computation Results</title>
          <p>[Results table residue; row and column alignment not recoverable: 0.79, 0.87, 0.22, 0.33, 0.68, 0.76.]</p>
          <p>The results in this case show that an accuracy of
around 70% can be achieved. It is particularly interesting
to note that the best models are the GPT-based ones in the
ZS case, while Dante is the best in the CoT case, despite
being one of the worst with the ZS technique.
The Pearson and Spearman correlation coefficients
show a significant increase in correlation for the smaller
models in the CoT scenario, with variable fluctuations in
the case of the larger models.</p>
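          <p>The three evaluation measures used in this section can be sketched in plain Python as follows. The severity bands are those stated in the global prompt template (0 to 6 none, 7 to 19 mild, 20 to 34 moderate, 35 to 60 severe). Computing accuracy as agreement between predicted and expert severity bands is an assumption made for this sketch, since the text does not spell out the exact definition; all function names are hypothetical.</p>
          <preformat><![CDATA[
```python
from math import sqrt
from typing import List, Sequence

# Upper bound of each band, taken from the global prompt template.
BANDS = [(6, "none"), (19, "mild"), (34, "moderate"), (60, "severe")]


def severity_band(total: int) -> str:
    """Map an overall MADRS score (0 to 60) to its severity band."""
    if not 0 <= total <= 60:
        raise ValueError("MADRS total must lie between 0 and 60")
    for upper, label in BANDS:
        if total <= upper:
            return label
    raise AssertionError("unreachable")


def accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Share of conversations whose predicted band matches the expert band."""
    hits = sum(severity_band(p) == severity_band(g) for p, g in zip(pred, gold))
    return hits / len(gold)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```
]]></preformat>
          <p>With only 10 conversations, each misclassified conversation shifts such an accuracy by 0.10, which is worth keeping in mind when comparing models.</p>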
        </sec>
        <sec id="sec-3-5-3">
          <title>4.3.3. Further Investigating Best Results</title>
          <p>Considering the approaches, prompt engineering
techniques, and LLMs examined, it is clear that
the global approach is superior to the local one. This
would seem to suggest that LLMs have a greater chance
of success on the considered task when the
whole conversation is used to produce the global MADRS
score, without the model being asked to generate MADRS
item-based scores to be aggregated later. However, we
operated in a context in which we did not provide specific
examples to the model according to a Few-Shot strategy,
which needs to be investigated in the future.</p>
          <p>As emerges from Table 2, referring to the local
computation approach in the CoT scenario, the correlation
with respect to the scores predicted for the individual
items is generally not very high, although it is clearly
better for some specific items such as #4 (i.e., reduced sleep,
for the models trained on more data) and #10 (i.e., suicidal
thoughts, again for larger models). The smaller,
Italian-specific models do not correlate well on this task.</p>
          <p>Concerning Figure 2, which illustrates the confusion
matrix of the global computation approach for the
Dante model in the CoT scenario, we can
observe that the model does not confuse depression severity
classes that are too distant from each other.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-conclusion">
      <title>5. Conclusion and Future Research</title>
      <p>This study explored the utilization of generative
Artificial Intelligence (AI) models for automatically mapping
psychiatrist-patient dialogue content to the
Montgomery-Åsberg Depression Rating Scale (MADRS). Two distinct
approaches were investigated: the application of prompt
engineering techniques to compute symptom severity
scores for each MADRS item, and the direct calculation of
the overall depression severity score. The results
demonstrated that the proposed approaches, coupled with the
best-performing models, achieved an accuracy of
approximately 70% in mapping conversations to MADRS scores.</p>
      <p>Though the current accuracy shows promise, there is
room for improvement. Future studies could refine
models, improve prompt techniques, explore new methods,
and use more data sources. This could lead to an
automated system that generates questions and evaluates
symptom severity from dialogue analysis.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Silverman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Galanter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jackson-Triche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Lomax</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Riba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Fochtmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Rhoads</surname>
          </string-name>
          , et al.,
          <article-title>The american psychiatric association practice guidelines for the psychiatric evaluation of adults</article-title>
          ,
          <source>American Journal of Psychiatry</source>
          <volume>172</volume>
          (
          <year>2015</year>
          )
          <fpage>798</fpage>
          -
          <lpage>802</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Fantino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <article-title>The self-reported montgomery-åsberg depression rating scale is a useful evaluative tool in major depressive disorder</article-title>
          ,
          <source>BMC psychiatry</source>
          <volume>9</volume>
          (
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.-B.</given-names>
            <surname>Ooi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.-H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Emran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>AlSharafi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Capatina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          , et al.,
          <article-title>The potential of generative artificial intelligence across disciplines: Perspectives and future directions</article-title>
          ,
          <source>Journal of Computer Information Systems</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Torous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Myrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rauseo-Ricupero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Firth</surname>
          </string-name>
          , et al.,
          <article-title>Digital mental health and covid-19: using technology today to accelerate the curve on access and quality tomorrow</article-title>
          ,
          <source>JMIR mental health</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fokkema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iliescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Greif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <article-title>Machine learning and prediction in psychological assessment</article-title>
          ,
          <source>European Journal of Psychological Assessment</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Panicker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gayathri</surname>
          </string-name>
          ,
          <article-title>A survey of machine learning techniques in physiology based mental stress detection systems</article-title>
          ,
          <source>Biocybernetics and Biomedical Engineering</source>
          <volume>39</volume>
          (
          <year>2019</year>
          )
          <fpage>444</fpage>
          -
          <lpage>469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Viviani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Crocamo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazzola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bartoli</surname>
          </string-name>
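          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Carrà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasi</surname>
          </string-name>
          ,
          <article-title>Assessing vulnerability to psychological distress during the COVID-19 pandemic through the analysis of microblogging content</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          (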
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Vaidyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wisniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Halamka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Kashavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Torous</surname>
          </string-name>
          ,
          <article-title>Chatbots and conversational agents in mental health: a review of the psychiatric landscape</article-title>
          ,
          <source>The Canadian Journal of Psychiatry</source>
          <volume>64</volume>
          (
          <year>2019</year>
          )
          <fpage>456</fpage>
          -
          <lpage>464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Viduani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cosenza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Araújo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kieling</surname>
          </string-name>
          ,
          <article-title>Chatbots in the field of mental health: challenges and opportunities</article-title>
          ,
          <source>Digital Mental Health: A Practitioner's Guide</source>
          (
          <year>2023</year>
          )
          <fpage>133</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A systematic review on automated clinical depression diagnosis</article-title>
          ,
          <source>npj Mental Health Research</source>
          <volume>2</volume>
          (
          <year>2023</year>
          )
          <fpage>20</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Carrillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sigman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Slezak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ashton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fitzgerald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stroud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Nutt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Carhart-Harris</surname>
          </string-name>
          ,
          <article-title>Natural speech algorithm applied to baseline interview data can predict which patients will respond to psilocybin for treatment-resistant depression</article-title>
          ,
          <source>Journal of Affective Disorders</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kjell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Johnsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sikström</surname>
          </string-name>
          ,
          <article-title>Freely generated word responses analyzed with artificial intelligence predict self-reported symptoms of depression, anxiety, and worry</article-title>
          ,
          <source>Frontiers in Psychology</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Philip</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-A.</given-names>
            <surname>Micoulaud-Franchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sagaspe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sevin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Olive</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bioulac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sauteraud</surname>
          </string-name>
          ,
          <article-title>Virtual human as a new diagnostic tool, a proof of concept study in the field of major depressive disorders</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>7</volume>
          (
          <year>2017</year>
          )
          <fpage>42656</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dosovitsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Jacobson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Bunge</surname>
          </string-name>
          , et al.,
          <article-title>Artificial intelligence chatbot for depression: descriptive study of usage</article-title>
          ,
          <source>JMIR Formative Research</source>
          <volume>4</volume>
          (
          <year>2020</year>
          )
          <fpage>e17065</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Vaidyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Linggonegoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Torous</surname>
          </string-name>
          ,
          <article-title>Changes to the psychiatric chatbot landscape: A systematic review of conversational agents in serious mental illness</article-title>
          ,
          <source>The Canadian Journal of Psychiatry</source>
          <volume>66</volume>
          (
          <year>2021</year>
          )
          <fpage>339</fpage>
          -
          <lpage>348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaywan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ibaida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>Early detection of depression using a conversational AI bot: A non-clinical trial</article-title>
          ,
          <source>PLOS ONE</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Jabir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martinengo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Torous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subramaniam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tudor Car</surname>
          </string-name>
          ,
          <article-title>Evaluating conversational agents for mental health: Scoping review of outcomes and outcome measurement instruments</article-title>
          ,
          <source>Journal of Medical Internet Research</source>
          <volume>25</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aziz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Abd-Alrazaq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alzubaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Al-Thani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Elhusein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Siddig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          , et al.,
          <article-title>Chatbot features for anxiety and depression: a scoping review</article-title>
          ,
          <source>Health Informatics Journal</source>
          <volume>29</volume>
          (
          <year>2023</year>
          )
          <fpage>14604582221146719</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>A survey on evaluation of large language models</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mondal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <article-title>A systematic survey of prompt engineering in large language models: Techniques and applications</article-title>
          ,
          <year>2024</year>
          . arXiv:<pub-id pub-id-type="arxiv">2402.07927</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Meskó</surname>
          </string-name>
          ,
          <article-title>Prompt engineering as an important emerging skill for medical professionals: tutorial</article-title>
          ,
          <source>Journal of Medical Internet Research</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lengyel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          ,
          <source>arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>