<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Natural Language Explanations for Suicide Risk Classification Using Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>William Stern</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seng Jhing Goh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nasheen Nur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick J Aragon</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Mercer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siddhartha Bhattacharyya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiradeep Sen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Van Minh Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Electrical Engineering and Computer Science, Florida Institute of Technology</institution>
          ,
          <addr-line>Melbourne, FL 32901</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mathematical Sciences, Florida Institute of Technology</institution>
          ,
          <addr-line>Melbourne, FL 32901</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Mechanical and Civil Engineering, Florida Institute of Technology</institution>
          ,
          <addr-line>Melbourne, FL 32901</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Psychology, Florida Institute of Technology</institution>
          ,
          <addr-line>Melbourne, FL 32901</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, automated machine learning has assisted mental health researchers in analyzing large amounts of data. Despite this, many individuals and organizations hesitate to use these methods, as they are black-box methods. Our work uses of-the-shelf generative large language models (LLM) to generate natural language explanations for suicide risk from users' Reddit posts. We benchmark various language models utilizing annotations and explanations by psychology experts. Generated explanations can accurately explain the risk of suicide using evidence from users' posts. We optimized this method for low-resource settings by leveraging pre-existing general instruction-tuned and quantized models. We conduct user studies with experts to compare the explanations and predictions generated by our approach with human expert perception and report some interesting findings for further improving our approach. We find that LLMs can efectively classify and respond with helpful reasoning of a suicidal risk diagnosis.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;mental health</kwd>
        <kwd>large language models</kwd>
        <kwd>social media</kwd>
        <kwd>explainable AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>gained many users and applications. The LLaMA family the potential to meaningfully analyze complex sets of
is more lightweight than many other generative LLMs. data points to understand the complex nature of
menConsumer desktop computers and laptops can run many tal disorders better. XAI determines which features in
LLaMA models locally2. This is because they utilized the dataset contribute most to a specific classification
smaller models trained on more data. As a result, these pattern by methods such as saliency maps to visualize
models can achieve similar performances to much larger the relative weight or importance of features in the data
models. Using these models becomes more convenient [10, 11, 12]. Combining XAI with social media data,
facilas a result. Additionally, many generative LLMs are not ity data, and human sensory data is intended to improve
open-sourced and often collect user-inputted data. This the understanding of recurring patterns of mental health
is in contrast to LLaMA 2 [6], which was a permissive disorders [7, 9, 13]. This will help accurately predict risk
license, and the models can be used locally, allowing and disease trajectories, leading to the development of
mental health professionals to control their own data scalable detection and prevention tools.
use. For many mental health professionals, sharing their
client’s data with external companies would violate 2.2. LLMs in Natural Language Processing
confidentiality. We plan to evaluate the quality of the
generated responses from LLaMA models by comparing (NLP)
them to those written by mental health professionals. LLMs have evolved in many variations, showing their
To do that, we used a variety of automatic metrics and powerful capabilities. It is clear that pre-trained models
qualitative analysis of the responses generated. significantly outperform non-pre-trained models in NLP.
The advancement and success of NLP attract researchers
Here is a list of four key contributions: to explore and address mental health behaviors.
According to [14], the lack of a taxonomy, extensive data for
eval• Generating natural language explanations for sui- uation, and the inability to distinguish between mental
cide risk using a generative LLM utilizing anno- health disorder-related behaviors and healthy behaviors
tations directly from mental health professionals. in social media text are challenges faced in ML and NLP
• Measuring the performance of diferent models research. Today, BERT [14] is one of the most popular
for natural language explanations and suicide risk and best-performing methods for language interpretation
classifications. tasks. BERT-like transformer models usually undergo
su• Measuring the accuracy of generated suicide ex- pervised pretraining on a large corpus [14]. Research has
planations by comparing them to expert-written shown that pretraining a model towards a smaller corpus
explanations. or a specific task, such as in Domain Adaptation (DAPT)
• Assessing the strengths and drawbacks of using pretraining and Task Adaptation (TAPT) pretraining, can
generated explanations of suicide risk with hu- improve downstream performance in BERT-like models
man experts in a user experimentation environ- [14]. TAPT refers to using a smaller pretraining corpus
ment. on the unlabeled training set for a given task. In
contrast, DAPT uses a large pretraining corpus of unlabeled
2. Related Work domain-specific text [ 15]. The authors, seeking better
performance, used DAPT to propose PsychBERT, a new
language model modified from the BERT model for
lan2.1. Explainable Artificial Intelligence guage patterns in psychology, psychiatry, mental health,
(XAI) for Mental Health or behavioral health text [14]. BioGPT, a domain-specific
In recent years, the gradual increase of clinical and trans- generative pre-trained transformed language model, was
lational science in medicine has resulted in the progres- created to perform biomedical text mining and
knowlsion of mental healthcare [7]. However, knowledge of edge discovery [16]. It can generate distinct descriptions
mental disorders in terms of their diagnosis, etiopatho- of biomedical terms in various large-scale biomedical
physiology, and treatment has not yet been expanded. literature. Downstream tasks such as end-to-end
relaAccording to the authors, even when conducting in-depth tion extraction, question answering (QA), and document
research in neuroimaging, no single indicator in human classification were applied and adapted by analyzing and
brain biology can accurately diferentiate between vari- designing the target sequence format and the prompt for
ous mental disorders and their subtypes in patients [8, 9]. better modeling of the tasks [16]. The authors contended
Data quality and diversity have increased due to tech- that this model achieves satisfactory performance in
betnical innovation, but data collection alone will not help ter biomedical text generation ability than GPT-2 in text
improve our understanding of mental disorders. XAI has generation tasks [16].</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>2.3. Generative Artificial Intelligence in</title>
      </sec>
      <sec id="sec-2-2">
        <title>Mental Healthcare</title>
        <sec id="sec-2-2-1">
          <title>Our data pipeline begins with the LLM prompts. To craft</title>
          <p>In the realm of NLP, significant strides have recently been these prompts, a psychology professor evaluated 12
sammade in generative AI. This surge in advancements has ples from the University of Maryland Reddit Suicidality
spurred innovation, particularly in leveraging generative Dataset, Version 2 [18, 19] (UMD dataset). The professor
AI for mental health treatment. One current state-of- determined the level of suicide risk (no, low, medium, or
the-art model is MentaLLaMA [5]. This is a fine-tuned high risk) for each post and provided explanations for
version of the LLaMA LLM [6] that was trained on mental why each post indicated that specific level of suicide risk.
health data. There has also been research on the use of One sample was randomly selected from each risk level
general-purpose LLMs like ChatGPT for classifying and for the four-shot prompt input, resulting in four samples
explaining mental health data such as depression and from the posts annotated by the professor. This four-shot
anxiety [17]. In our work, we are looking at applying and prompt was then utilized as input for diferent LLMs. The
evaluating open-source LLMs to classify and generate LLMs, along with the UMD expert-labeled dataset and the
natural language explanations. collected Reddit data, were provided with this prompt to</p>
          <p>In our work, we are investigating the use of instruction- generate explanations for suicide risk. The same datasets
tuned, openly available general LLMs to generate expla- were also given to participants in the user study to
gennations for suicide risk. We also conducted a user study erate human explanations. The quality of the
explanawith psychology experts to evaluate and verify the qual- tions provided by the users was assessed by comparing
ity of the explanations generated from diferent models. them with LLM-generated explanations using Bilingual
Evaluation Understudy (BLEU) [20], and Recall-Oriented
Understudy for Gisting Evaluation (ROUGE) [21] scores.</p>
          <p>This workflow can be seen in Figure 1.</p>
          <p>Instruction: Tell me the risk of suicide in the following social media post and explain the reasoning step by step with quotes.
The risk can be no, low, medium, or high. Take the whole post into context.</p>
          <p>Input: I am feeling....</p>
          <p>Response: This post is no risk because...</p>
          <p>Instruction: Tell me the risk of suicide in the following social media post and explain the reasoning step by step
with quotes. The risk can be no, low, medium, or high. Take the whole post into context.</p>
          <p>Input: I am feeling....</p>
          <p>Response: This post represents high risk because...</p>
          <p>Instruction: Tell me the risk of suicide in the following social media post and explain the reasoning step by step
with quotes. The risk can be no, low, medium, or high. Take the whole into post context.</p>
          <p>Input: I am feeling....</p>
          <p>Response: This post is low risk because...</p>
          <p>Instruction: Tell me the risk of suicide in the following social media post and explain the reasoning step by step
with quotes. The risk can be no, low, medium, or high. Take the whole post into context.</p>
          <p>Input: I am feeling....</p>
          <p>Response: This post represents medium risk because...</p>
          <p>Instruction: Tell me the risk of suicide in the following social media post and explain the reasoning step by step
with quotes. The risk can be no, low, medium, or high. Take the whole post into context.</p>
          <p>Input: (Post from dataset)</p>
          <p>Response: This post represents (Will be completed by generative LLM)</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.1. Data Sources and Pre-processing</title>
        <p>We used two social media datasets mined from Reddit;
one was labeled by mental health experts (UMD dataset)
and has been widely used by researchers, and the other
was collected by researchers in our team for other
ongoing research work [3]. The detailed descriptions of
datasets are provided below:
• UMD Dataset: The University of Maryland
Reddit Suicidality Dataset, Version 2 Task A. It
consists of the post history of 850 users of the
r/SuicideWatch subreddit on the social media
website Reddit. Among these 850 posts, 245 posts
were classified by mental health experts, and the
rest were annotated using a crowdsourcing
process. The experts annotated the posts with four
classes: no, low, medium, and high risk of suicide.</p>
        <p>We randomly selected 12 posts annotated from
the crowd-sourced section of the UMD dataset
(3 from each label). Then, these annotated posts
were given to a human expert, a professor of
psychology, to classify into four labels and to provide
an explanation for each classification.</p>
        <p>The UMD dataset has been used to help
researchers study the risk of suicide in posts that
have been uploaded to Reddit. It consists of posts
that were uploaded to Reddit prior to 2015. The
dataset includes two diferent sections: a
crowdsourced testing and training set and a golden
standard set labeled by mental health experts. In this
dataset, each user is given a label: no risk, low
risk, medium risk, or high risk. These risk
factors correspond to the risk of suicide for each
user based on their writing. It consists of
several diferent tasks. These include using only
SuicideWatch subreddit data, using only
nonSuicideWatch Reddit data, and using users’
complete Reddit history, including SuicideWatch and
non-SuicideWatch data. Task A is where one tries
to predict users’ risk of suicide based only on
posts made on the SuicideWatch subreddit, the
largest suicide-related subreddit on Reddit. For
our study, we used the methodology of Task A.
• Scraped Reddit Data: This unlabeled data was
collected by our research team from the
r/SuicideWatch subreddit. Due to the huge amount of data,
we used Reddit posts posted to that subreddit
during February 2023, which contained the most
recent data collected. We generated the dataset
by filtering user posts only from /r/SuicideWatch
and classifying them individually without
considering the user ID. Once again, we limited the text
size to 1000 characters to fit within the context
limit and to standardize the texts.</p>
        <p>Input Text: I have been feeling very sad lately. I have had some thoughts of suicide. I feel very alone. Nobody seems to want
to talk to me. School is not going well, and I feel like I am failing always.</p>
        <p>Speechless 13B: This post represents medium risk. The person mentions they’ve "had some thoughts of suicide," which
indicates a direct connection with self-harm or ending their life, even though the context doesn’t specify if it’s the imminent
actionable plan. Additionally, feelings of sadness and loneliness contribute to this categorization as well.
WizardLM 13B: This post represents low risk. The poster mentions feelings of isolation, academic struggles, and a history of
sadness, but there’s no explicit mention or plan for suicide or any immediate danger. It seems they are looking for some form of
support/connection, which can be seen as an attempt to prevent escalating distress into more serious issues like self-harming
behaviors.</p>
        <p>LLaMA Chat 13B: This post represents low risk. The poster mentions feelings of sadness, as well as a history of non-suicidal
self-injury in the form of cutting. However, there are no explicit statements about suicide plans or intentions and instead
seems to be seeking support from others.</p>
        <p>LLaMA Chat 70B: This post represents medium risk. The poster mentions a history of recent suicidal thinking, feelings of
loneliness/social isolation, decreased interest in activities they used to enjoy (school), as well as hopelessness about their
academic performance, which could increase the future lethality of any potential attempt at self-harm.
3.2. LLMs for Text Generation ing, it was ranked as one of the top LLM on a variety of
LLM benchmarks using the Language Model Evaluation
The Generative LLM we used was the LLaMA model Harness [25]. We chose to use these models since they
developed by Meta AI research [6]. After weighing per- represent some of the most common and best performing
formance benefits and resource limitations, we chose to LLMs available at the time of testing. We think that if
use quantized versions of the models using the llama.cpp these LLMs are able to classify and generate explanations
GGUF format3. We used the Q5_K_M quantization for for suicide risk successfully, then any more advanced
each model. This allowed us to run the models at a high models or fine-tuning will be able to do so as well.
speed using only the CPU while maintaining most of For all the models, we used few-shot prompting to
the performance of the original non-quantized model. generate the explanations. The posts were taken from
We used the recent LLaMA 2 model as our foundational the UMD dataset and were labeled with risk level and
model. To improve LLaMA’s performance across tasks, it explanation from a mental health expert. An example
needs to be instruction-tuned [22]. This means it must be of the few-shot prompt can be seen in Figure 2. This
ifne-tuned on instructions and outputs that are similar to prompt was inputted to the LLM. The classification result
what people might ask it to do. The LLaMA 2 Chat model, from the LLM was scraped from the generated text using
from Meta AI, was fine-tuned using 27,540 instructions. regular expressions.</p>
        <p>It also utilizes Reinforcement Learning with Human
Feedback (RLHF) [23]. We tested both the 70 billion parameter
(70B) and the 13 billion parameter variations (13B) of the 3.3. User Study
chat models. We conducted a user study by including five PhD
stu</p>
        <p>The other models we tested were WizardLM [24] and- dents studying psychology. Four of them completed at
SpeechlessLM4. WizardLM is an instruction-tuned ver- least their second year of study, and one of them is in
sion of the LLaMA 2 foundational model. Its fine-tuning his/her first year. Three of the five participants were
fedataset was generated using a special generation method male, and the rest were male. All participants confirmed
called Evol-Instruct [24] to generate complex instruc- that they use social media regularly. We want to clarify
tions. We tested the 13B model of WizardLM 1.2. The that the psychology professor mentioned earlier is not
third model we tested was SpeechlessLM. It is also an a participant in the study. In addition, he is a member
instruction-tuned version of the LLaMA 2 foundational of the research team and one of the authors. The user
model. This model is a merge of WizardLM-13B-V1.2 study consists of three rounds. First, they were randomly
and OpenOrca-Platypus2-13B, which is another popular presented with two posts from posts originally annotated
instruction-tuned LLaMA 2 model. At the time of test- and explained by our psychology professor to understand
their level of understanding and expertise compared to
3https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
4https://huggingface.co/uukuguy/speechless-llama2-13B
the professor. In the second round, three diferent Reddit highest recall was a tie between MentalBERT and
Speechusers’ posts from the UMD Expert dataset were shown to lessLM. For suicide risk detection, recall is essential as we
them. They were instructed to classify the risk of suicide want to minimize the number of cases that go undetected.
(no, low, medium, or high risk) for each of the posts se- We suspect that the unique training data that is used for
quentially and provide an explanation for their reasoning. the WizardLM and Speechless models helps them achieve
The posts they were given came from a subset of 25 Red- slightly higher accuracy than the LLaMA Chat models.
dit users’ posts. These Reddit users’ posts came from the Compared to MentaLBERT, we can see that the LLMs
UMD expert-labeled dataset that was sampled without were able to classify the risk of suicide fairly efectively.
replacement. In the third round, we repeated the same We also wanted to compare the quality of the
generprocess with our second data source, the Reddit data ated explanations across the diferent LLMs by
genermined by our research team. Table 2 and Table 3 provide ating explanations for the UMD expert-labeled dataset.
the evaluation metrics for rounds 2 and 3, respectively. An example of how each model generates diferent
explanations can be found in Figure 3. Table 2 lists the
3.4. Evaluation average scores grouped by four labels based on ground
truth for the UMD dataset (round 2 of the user study).</p>
        <p>BLEU and ROUGE methods were used to evaluate the Here, the reference explanations are posts from the UMD
generated explanations by comparing them to those the dataset annotated by participants in the user study.
Acparticipants in our user study wrote. The human expla- cording to the scores, the LLM was able to most closely
nations of the user study were used as the baseline and match the users’ responses on the low and medium risk
then compared with those generated from the model. posts, with some models, such as LLaMA 2 Chat 70B,</p>
        <p>On the crowdsourced data portion of the UMD dataset, performing better than the no-risk classifications.
Morewe fine-tuned the MentalBERT model to provide a classi- over, the models of which the explanations most closely
ifcation baseline. The MentalBERT model is a commonly matched the users were the WizardLM and the LLaMA
used model for classification tasks in mental health. The Chat 70B model; however, there was not a significant
MentalBERT model is fine-tuned on the UMD suicide diference from the other models. The LLaMA Chat 70B
crowdsourced training dataset for 6 epochs. The training and the WizardLM models performed the best on the
dataset consists of 484 user posts. entire dataset.</p>
        <p>We used the same few-shot prompts that we used with
Model F1 Precision Recall the UMD dataset for round 3, where the participants in
MentalBERT Base 0.39 0.42 0.43 the study labeled and wrote explanations for our
selfWizardLM 13B 0.41 0.43 0.42 collected Reddit data. The results of the explanation
Speechless 13B 0.41 0.40 0.43 evaluation metrics can be found in Table 3. The results
LLaMA Chat 13B 0.34 0.42 0.36 show that the LLaMA Chat 70B has the highest BLEU
LLaMA Chat 70B 0.32 0.36 0.33 and ROUGE scores, with the WizardLM model being a
Table 1 close second.</p>
        <p>Classification accuracy of LLMs on the UMD dataset BLEU scores are known to penalize longer texts than
the reference ones. In our case, the BLEU scores in both
rounds (both tables) did not exceed 0.3. We suspect that
one factor causing this may be that the model-generated
4. Results and Evaluations explanations are more detailed than the human
explanations collected during the user study. We observed that
This section describes the results of the experimentation during the user study, participants sometimes provided
and user study. Quantitative results include model ac- explanations in phrases and incomplete sentences,
causcuracy and explanation evaluation scores using BLEU ing a greater diference between the AI-generated and
and ROUGE scores. We also discuss the quality of the human explanations, leading to a lower BLEU score. For
generated explanations based on human evaluation. that reason, we chose to use the ROUGE-1 score, as it will
help mitigate the efects of the diferent lengths of
expla4.1. Quantitative Results nations between the participants and the LLM. Similarly
to the BLEU score, the ROUGE score also has limitations.</p>
        <p>We investigated the classification accuracy of the difer- It may not fully capture the semantic meaning or
coherent LLMs. The classification F1, precision, and recall can ence of the text summary, as it relies solely on 1-gram
be found in Table 1. The model with the top F1 was the overlap. We suspect that the lower BLEU and ROUGE
WizardLM 13 billion (B) parameter version and Speech- scores can be partially attributed to the small sample
less 13B. This model also had the highest precision. The size and the short user study explanations. In addition,
our evaluation results show that the instruction-tuning
LLaMA 2 Chat 13B</p>
        <p>WizardLM 13B</p>
        <p>SpeechlessLM 13B</p>
        <p>LLaMA 2 Chat 70B
Avg. F1 Scores
No risk
Low risk
Medium risk
High risk
Entire dataset</p>
        <p>to high risk of suicidal ideation. Some examples of how
LLaMA 2 Chat 13B 0.222 0.391 the diferent explanations look for the diferent versions</p>
        <p>WizardLM 13B 0.283 0.514 of the LLMs can be found in Figure 3.</p>
        <p>SpeechlessLM 13B 0.266 0.464 One issue we observed in the evaluation was the
occurLLaMA 2 Chat 70B 0.286 0.525 rence of generated explanations that were hallucinations.
Table 3 Hallucinations occur when the model starts providing
Generation evaluation metrics for collected Reddit dataset inaccurate information. While this is a common issue in
LLMs, it is highly undesirable [28]. One example of that
can be seen in Figure 3 in the LLaMA Chat 13B models
method used by the WizardLM and SpeechlessLM models explanations. The model mentions that the poster has
and the larger parameter size of the LLaMA 70B model been cutting themself. However, that is not mentioned
give them an advantage in both explanation quality and anywhere in the input post.
classification accuracy. The mental health expert professor reviewed some</p>
        <p>It is often very dificult to judge the risk level of a so- LLM classifications and generated explanations.
Accordcial media post because suicide risk assessment difers ing to him, one of the most significant indicators of risk
between people. In order to measure agreement between of suicide was previous suicide attempts, and the LLM
people in our user study, we calculated the Krippendorf did not take that much into consideration. The LLM puts
alpha coeficient [ 26, 27]. The alpha value is 0.612, mean- a substantial focus on non-suicidal self-harm. The
presing there was a moderate agreement of suicide risk. ence of this factor can be a contributing factor to suicidal
behavior, but it does not always correlate directly with
4.2. Qualitative Results it. However, he agreed that the model and explanations
successfully distinguished between suicidal thinking and
We found that our method produces very promising re- depression. As he observed, neither the model nor the
sults qualitatively. Each of the models is able to provide experts could diferentiate between "no risk" behavior
a reasonable explanation for the classification. The gen- and "low risk" behavior.
erated explanations start by mentioning the risk level
("This post represents medium risk of suicide."). Then,
the explanations provide reasoning behind the classifica- 5. Discussion and Limitations
tion using parts from the users’ posts as evidence. Many In this work, we see that we are able to use LLMs to
explanations, particularly in the low-risk or medium-risk predict the risk of suicide and generate natural language
category, include reasoning for why the post is not in a explanations for the classification.
diferent risk category. As an example, an explanation of There are several limitations in using this method,
a medium risk category post may include reasoning for however. LLMs are very resource intensive, and some
why the post is not high risk and why the post is not low mental health facilities may not have the resources. As
risk. Word clouds generated on the explanations for low a result, proper infrastructure, including training and
(Figure 4) and medium risks (Figure 5) are also in line large amounts of computing, needs to be developed for
with our observations. For example, the predicted expla- LLMs to benefit mental health professionals. Another
nations for low-risk posts (Figure 4 right) emphasize the limitation is that the classification of the risk of suicide
high importance of non-suicidal thoughts with uncer- is very subjective as interpreted from natural language.
tainty by words such as ’feeling’,’mention’, and ’indicat- Each expert may find the risk of suicide to be diferent,
ing’ implying low risk that the person may be suicidal. thus making it very dificult to benchmark. We saw this
However, in Figure 5, for the medium risk, the strength quite frequently in our work. Many of the mental health
of the words is similar for both human participants and experts we consulted difered in opinion on the risk of
generated explanations. This demonstrates a moderate
suicide as risk assessment for suicide is very subjective. locally on a standard computer, so mental health
profesAdditionally, the explanations that are produced do not sionals do not need to share patient data with external
make generative LLM actually interpretable. It is still companies.
not known how the generative LLMs produce classifica- In the future, we will conduct a more extensive user
tions or explanations. The explanations are only useful study and use the responses to fine-tune an LLM to
profor giving estimated reasoning for a classification; expla- vide better explanations and increase the zero-shot
pernations cannot easily tell us the classification methods formance. We will evaluate the performance and
accuof the generative LLM. Lastly, LLMs are often prone to racy of the machine-generated explanations more
efechallucinations, which can lead to incorrect classifications tively this way. Due to incomplete sentences or phrases
and explanations. We saw this during our testing, where delivered by the participants in the user study, we
obthe explanation referenced something that did not occur tained lower BLEU and ROUGE scores. In real life,
psyin the social media post. chologists and doctors often write short notes about their
diagnoses. We plan to automate the generation of full
sentences from diagnostic feedback to enhance AI-generated
6. Conclusion and Future Work explanations. We are also testing other foundational
models to see how they compare to LLaMA. We can also
extrapolate trends and behaviors that describe the LLM’s
classification process, but more research is required to
determine their accuracy.</p>
        <p>Our approach demonstrates a proof of concept method
where LLMs with prompt engineering are used to provide
explanations for classification for evaluating the risk of
suicide. We were able to benchmark the accuracy of risk
assessment of suicide done by LLMs. We also compared
machine-generated explanations with human
explanations to help LLMs generate an explanation humans can
better understand and trust. The LLMs of 13B and 70B
can efectively classify and produce natural language
explanations for the predictions. These LLMs can also run</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>7. Ethical Considerations</title>
      <sec id="sec-3-1">
        <title>The University of Maryland granted access to the protected UMD dataset, requiring an IRB and a detailed approval process. We anonymized all the data to the best</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>8. Acknowledgment</title>
      <sec id="sec-4-1">
        <title>This work is supported by Florida Institute</title>
        <p>of Technology’s Institutional Research
Initiatives (https://www.fit.edu/institutional-research/
institutional-research/) and AccessComputing
(https://www.washington.edu/accesscomputing/).</p>
        <p>Dr. Nasheen Nur is the project’s primary investigator
and the corresponding author of this paper.
of our abilities to reduce bias and the spread of personal
information, and only members of the research team had
access to them. Please note that the research is still a
work in progress, and none of the models guarantee that
the results presented are accurate. These evaluations
should not be used as a substitute for evaluations by
mental health professionals. Experts and users, such as
mental health professionals, are responsible for verifying
any results.
Society, Los Alamitos, CA, USA, 2021, pp. 1077– [26] K. Krippendorf, Content analysis : an introduction
1082. URL: https://doi.ieeecomputersociety.org/ to its methodology, SAGE, 2013.
10.1109/BIBM52615.2021.9669469. doi:10.1109/ [27] S. Castro, Fast Krippendorf: Fast computation of
BIBM52615.2021.9669469. Krippendorf’s alpha agreement measure, https:
[15] S. Gururangan, A. Marasović, S. Swayamdipta, //github.com/pln-fing-udelar/fast-krippendorf,
K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t stop 2017.
pretraining: Adapt language models to domains and [28] M. A. Ahmad, I. Yaramis, T. D. Roy, Creating
trusttasks, arXiv preprint arXiv:2004.10964 (2020). worthy llms: Dealing with hallucinations in
health[16] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, care ai, 2023. arXiv:2311.01463.</p>
        <p>T.-Y. Liu, BioGPT: generative pre-trained
transformer for biomedical text generation and
mining, Briefings in Bioinformatics 23 (2022). URL:
https://doi.org/10.1093%2Fbib%2Fbbac409. doi:10.</p>
        <p>1093/bib/bbac409.
[17] K. Yang, S. Ji, T. Zhang, Q. Xie, Z. Kuang, S.
Ananiadou, Towards interpretable mental health analysis
with chatgpt, 2023. arXiv:2304.03347.
[18] H.-C. Shing, S. Nair, A. Zirikly, M. Friedenberg,</p>
        <p>H. Daumé III, P. Resnik, Expert, crowdsourced,
and machine assessment of suicide risk via online
postings, in: Proceedings of the Fifth Workshop on
Computational Linguistics and Clinical Psychology:</p>
        <p>From Keyboard to Clinic, 2018, pp. 25–36.
[19] A. Zirikly, P. Resnik, Ö. Uzuner, K. Hollingshead,</p>
        <p>CLPsych 2019 shared task: Predicting the degree of
suicide risk in Reddit posts, in: Proceedings of the
Sixth Workshop on Computational Linguistics and</p>
        <p>Clinical Psychology, 2019.
[20] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a
method for automatic evaluation of machine
translation, in: Proceedings of the 40th annual meeting
of the Association for Computational Linguistics,
2002, pp. 311–318.
[21] C. Lin, Rouge: A package for automatic evaluation
of summaries. text summarization branches out,
74–81, 2004.
[22] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang,</p>
        <p>J. Li, R. Hu, T. Zhang, F. Wu, G. Wang, Instruction
tuning for large language models: A survey, 2023.</p>
        <p>arXiv:2308.10792.
[23] P. Christiano, J. Leike, T. B. Brown, M.
Martic, S. Legg, D. Amodei, Deep reinforcement
learning from human preferences, 2023.</p>
        <p>arXiv:1706.03741.
[24] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng,</p>
        <p>C. Tao, D. Jiang, Wizardlm: Empowering large
language models to follow complex instructions, 2023.</p>
        <p>arXiv:2304.12244.
[25] L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi,</p>
        <p>C. Foster, L. Golding, J. Hsu, K. McDonell, N.
Muennighof, J. Phang, L. Reynolds, E. Tang, A. Thite,
B. Wang, K. Wang, A. Zou, A framework for
fewshot language model evaluation, 2021. URL: https:
//doi.org/10.5281/zenodo.5371628. doi:10.5281/
zenodo.5371628.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>