1. Introduction

Natural Language Explanations for Suicide Risk Classification Using Large Language Models

William Stern

Seng Jhing Goh

Nasheen Nur

Patrick J Aragon

Thomas Mercer

Siddhartha Bhattacharyya

Chiradeep Sen

Van Minh Nguyen

1 0 Electrical Engineering and Computer Science, Florida Institute of Technology , Melbourne, FL 32901 USA 1 Mathematical Sciences, Florida Institute of Technology , Melbourne, FL 32901 USA 2 Mechanical and Civil Engineering, Florida Institute of Technology , Melbourne, FL 32901 USA 3 School of Psychology, Florida Institute of Technology , Melbourne, FL 32901 USA

In recent years, automated machine learning has assisted mental health researchers in analyzing large amounts of data. Despite this, many individuals and organizations hesitate to use these methods, as they are black-box methods. Our work uses of-the-shelf generative large language models (LLM) to generate natural language explanations for suicide risk from users' Reddit posts. We benchmark various language models utilizing annotations and explanations by psychology experts. Generated explanations can accurately explain the risk of suicide using evidence from users' posts. We optimized this method for low-resource settings by leveraging pre-existing general instruction-tuned and quantized models. We conduct user studies with experts to compare the explanations and predictions generated by our approach with human expert perception and report some interesting findings for further improving our approach. We find that LLMs can efectively classify and respond with helpful reasoning of a suicidal risk diagnosis.

eol>mental health large language models social media explainable AI

1. Introduction

gained many users and applications. The LLaMA family the potential to meaningfully analyze complex sets of is more lightweight than many other generative LLMs. data points to understand the complex nature of menConsumer desktop computers and laptops can run many tal disorders better. XAI determines which features in LLaMA models locally2. This is because they utilized the dataset contribute most to a specific classification smaller models trained on more data. As a result, these pattern by methods such as saliency maps to visualize models can achieve similar performances to much larger the relative weight or importance of features in the data models. Using these models becomes more convenient [10, 11, 12]. Combining XAI with social media data, facilas a result. Additionally, many generative LLMs are not ity data, and human sensory data is intended to improve open-sourced and often collect user-inputted data. This the understanding of recurring patterns of mental health is in contrast to LLaMA 2 [6], which was a permissive disorders [7, 9, 13]. This will help accurately predict risk license, and the models can be used locally, allowing and disease trajectories, leading to the development of mental health professionals to control their own data scalable detection and prevention tools. use. For many mental health professionals, sharing their client’s data with external companies would violate 2.2. LLMs in Natural Language Processing confidentiality. We plan to evaluate the quality of the generated responses from LLaMA models by comparing (NLP) them to those written by mental health professionals. LLMs have evolved in many variations, showing their To do that, we used a variety of automatic metrics and powerful capabilities. It is clear that pre-trained models qualitative analysis of the responses generated. significantly outperform non-pre-trained models in NLP. The advancement and success of NLP attract researchers Here is a list of four key contributions: to explore and address mental health behaviors. According to [14], the lack of a taxonomy, extensive data for eval• Generating natural language explanations for sui- uation, and the inability to distinguish between mental cide risk using a generative LLM utilizing anno- health disorder-related behaviors and healthy behaviors tations directly from mental health professionals. in social media text are challenges faced in ML and NLP • Measuring the performance of diferent models research. Today, BERT [14] is one of the most popular for natural language explanations and suicide risk and best-performing methods for language interpretation classifications. tasks. BERT-like transformer models usually undergo su• Measuring the accuracy of generated suicide ex- pervised pretraining on a large corpus [14]. Research has planations by comparing them to expert-written shown that pretraining a model towards a smaller corpus explanations. or a specific task, such as in Domain Adaptation (DAPT) • Assessing the strengths and drawbacks of using pretraining and Task Adaptation (TAPT) pretraining, can generated explanations of suicide risk with hu- improve downstream performance in BERT-like models man experts in a user experimentation environ- [14]. TAPT refers to using a smaller pretraining corpus ment. on the unlabeled training set for a given task. In contrast, DAPT uses a large pretraining corpus of unlabeled 2. Related Work domain-specific text [ 15]. The authors, seeking better performance, used DAPT to propose PsychBERT, a new language model modified from the BERT model for lan2.1. Explainable Artificial Intelligence guage patterns in psychology, psychiatry, mental health, (XAI) for Mental Health or behavioral health text [14]. BioGPT, a domain-specific In recent years, the gradual increase of clinical and trans- generative pre-trained transformed language model, was lational science in medicine has resulted in the progres- created to perform biomedical text mining and knowlsion of mental healthcare [7]. However, knowledge of edge discovery [16]. It can generate distinct descriptions mental disorders in terms of their diagnosis, etiopatho- of biomedical terms in various large-scale biomedical physiology, and treatment has not yet been expanded. literature. Downstream tasks such as end-to-end relaAccording to the authors, even when conducting in-depth tion extraction, question answering (QA), and document research in neuroimaging, no single indicator in human classification were applied and adapted by analyzing and brain biology can accurately diferentiate between vari- designing the target sequence format and the prompt for ous mental disorders and their subtypes in patients [8, 9]. better modeling of the tasks [16]. The authors contended Data quality and diversity have increased due to tech- that this model achieves satisfactory performance in betnical innovation, but data collection alone will not help ter biomedical text generation ability than GPT-2 in text improve our understanding of mental disorders. XAI has generation tasks [16].

3. Methodology 2.3. Generative Artificial Intelligence in Mental Healthcare Our data pipeline begins with the LLM prompts. To craft

In the realm of NLP, significant strides have recently been these prompts, a psychology professor evaluated 12 sammade in generative AI. This surge in advancements has ples from the University of Maryland Reddit Suicidality spurred innovation, particularly in leveraging generative Dataset, Version 2 [18, 19] (UMD dataset). The professor AI for mental health treatment. One current state-of- determined the level of suicide risk (no, low, medium, or the-art model is MentaLLaMA [5]. This is a fine-tuned high risk) for each post and provided explanations for version of the LLaMA LLM [6] that was trained on mental why each post indicated that specific level of suicide risk. health data. There has also been research on the use of One sample was randomly selected from each risk level general-purpose LLMs like ChatGPT for classifying and for the four-shot prompt input, resulting in four samples explaining mental health data such as depression and from the posts annotated by the professor. This four-shot anxiety [17]. In our work, we are looking at applying and prompt was then utilized as input for diferent LLMs. The evaluating open-source LLMs to classify and generate LLMs, along with the UMD expert-labeled dataset and the natural language explanations. collected Reddit data, were provided with this prompt to

In our work, we are investigating the use of instruction- generate explanations for suicide risk. The same datasets tuned, openly available general LLMs to generate expla- were also given to participants in the user study to gennations for suicide risk. We also conducted a user study erate human explanations. The quality of the explanawith psychology experts to evaluate and verify the qual- tions provided by the users was assessed by comparing ity of the explanations generated from diferent models. them with LLM-generated explanations using Bilingual Evaluation Understudy (BLEU) [20], and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [21] scores.

This workflow can be seen in Figure 1.

Instruction: Tell me the risk of suicide in the following social media post and explain the reasoning step by step with quotes. The risk can be no, low, medium, or high. Take the whole post into context.

Input: I am feeling....

Response: This post is no risk because...

Input: I am feeling....

Response: This post represents high risk because...

Input: I am feeling....

Response: This post is low risk because...

Input: I am feeling....

Response: This post represents medium risk because...

Input: (Post from dataset)

Response: This post represents (Will be completed by generative LLM)

3.1. Data Sources and Pre-processing

We used two social media datasets mined from Reddit; one was labeled by mental health experts (UMD dataset) and has been widely used by researchers, and the other was collected by researchers in our team for other ongoing research work [3]. The detailed descriptions of datasets are provided below: • UMD Dataset: The University of Maryland Reddit Suicidality Dataset, Version 2 Task A. It consists of the post history of 850 users of the r/SuicideWatch subreddit on the social media website Reddit. Among these 850 posts, 245 posts were classified by mental health experts, and the rest were annotated using a crowdsourcing process. The experts annotated the posts with four classes: no, low, medium, and high risk of suicide.

We randomly selected 12 posts annotated from the crowd-sourced section of the UMD dataset (3 from each label). Then, these annotated posts were given to a human expert, a professor of psychology, to classify into four labels and to provide an explanation for each classification.

The UMD dataset has been used to help researchers study the risk of suicide in posts that have been uploaded to Reddit. It consists of posts that were uploaded to Reddit prior to 2015. The dataset includes two diferent sections: a crowdsourced testing and training set and a golden standard set labeled by mental health experts. In this dataset, each user is given a label: no risk, low risk, medium risk, or high risk. These risk factors correspond to the risk of suicide for each user based on their writing. It consists of several diferent tasks. These include using only SuicideWatch subreddit data, using only nonSuicideWatch Reddit data, and using users’ complete Reddit history, including SuicideWatch and non-SuicideWatch data. Task A is where one tries to predict users’ risk of suicide based only on posts made on the SuicideWatch subreddit, the largest suicide-related subreddit on Reddit. For our study, we used the methodology of Task A. • Scraped Reddit Data: This unlabeled data was collected by our research team from the r/SuicideWatch subreddit. Due to the huge amount of data, we used Reddit posts posted to that subreddit during February 2023, which contained the most recent data collected. We generated the dataset by filtering user posts only from /r/SuicideWatch and classifying them individually without considering the user ID. Once again, we limited the text size to 1000 characters to fit within the context limit and to standardize the texts.

Input Text: I have been feeling very sad lately. I have had some thoughts of suicide. I feel very alone. Nobody seems to want to talk to me. School is not going well, and I feel like I am failing always.

Speechless 13B: This post represents medium risk. The person mentions they’ve "had some thoughts of suicide," which indicates a direct connection with self-harm or ending their life, even though the context doesn’t specify if it’s the imminent actionable plan. Additionally, feelings of sadness and loneliness contribute to this categorization as well. WizardLM 13B: This post represents low risk. The poster mentions feelings of isolation, academic struggles, and a history of sadness, but there’s no explicit mention or plan for suicide or any immediate danger. It seems they are looking for some form of support/connection, which can be seen as an attempt to prevent escalating distress into more serious issues like self-harming behaviors.

LLaMA Chat 13B: This post represents low risk. The poster mentions feelings of sadness, as well as a history of non-suicidal self-injury in the form of cutting. However, there are no explicit statements about suicide plans or intentions and instead seems to be seeking support from others.

LLaMA Chat 70B: This post represents medium risk. The poster mentions a history of recent suicidal thinking, feelings of loneliness/social isolation, decreased interest in activities they used to enjoy (school), as well as hopelessness about their academic performance, which could increase the future lethality of any potential attempt at self-harm. 3.2. LLMs for Text Generation ing, it was ranked as one of the top LLM on a variety of LLM benchmarks using the Language Model Evaluation The Generative LLM we used was the LLaMA model Harness [25]. We chose to use these models since they developed by Meta AI research [6]. After weighing per- represent some of the most common and best performing formance benefits and resource limitations, we chose to LLMs available at the time of testing. We think that if use quantized versions of the models using the llama.cpp these LLMs are able to classify and generate explanations GGUF format3. We used the Q5_K_M quantization for for suicide risk successfully, then any more advanced each model. This allowed us to run the models at a high models or fine-tuning will be able to do so as well. speed using only the CPU while maintaining most of For all the models, we used few-shot prompting to the performance of the original non-quantized model. generate the explanations. The posts were taken from We used the recent LLaMA 2 model as our foundational the UMD dataset and were labeled with risk level and model. To improve LLaMA’s performance across tasks, it explanation from a mental health expert. An example needs to be instruction-tuned [22]. This means it must be of the few-shot prompt can be seen in Figure 2. This ifne-tuned on instructions and outputs that are similar to prompt was inputted to the LLM. The classification result what people might ask it to do. The LLaMA 2 Chat model, from the LLM was scraped from the generated text using from Meta AI, was fine-tuned using 27,540 instructions. regular expressions.

It also utilizes Reinforcement Learning with Human Feedback (RLHF) [23]. We tested both the 70 billion parameter (70B) and the 13 billion parameter variations (13B) of the 3.3. User Study chat models. We conducted a user study by including five PhD stu

The other models we tested were WizardLM [24] and- dents studying psychology. Four of them completed at SpeechlessLM4. WizardLM is an instruction-tuned ver- least their second year of study, and one of them is in sion of the LLaMA 2 foundational model. Its fine-tuning his/her first year. Three of the five participants were fedataset was generated using a special generation method male, and the rest were male. All participants confirmed called Evol-Instruct [24] to generate complex instruc- that they use social media regularly. We want to clarify tions. We tested the 13B model of WizardLM 1.2. The that the psychology professor mentioned earlier is not third model we tested was SpeechlessLM. It is also an a participant in the study. In addition, he is a member instruction-tuned version of the LLaMA 2 foundational of the research team and one of the authors. The user model. This model is a merge of WizardLM-13B-V1.2 study consists of three rounds. First, they were randomly and OpenOrca-Platypus2-13B, which is another popular presented with two posts from posts originally annotated instruction-tuned LLaMA 2 model. At the time of test- and explained by our psychology professor to understand their level of understanding and expertise compared to 3https://github.com/ggerganov/ggml/blob/master/docs/gguf.md 4https://huggingface.co/uukuguy/speechless-llama2-13B the professor. In the second round, three diferent Reddit highest recall was a tie between MentalBERT and Speechusers’ posts from the UMD Expert dataset were shown to lessLM. For suicide risk detection, recall is essential as we them. They were instructed to classify the risk of suicide want to minimize the number of cases that go undetected. (no, low, medium, or high risk) for each of the posts se- We suspect that the unique training data that is used for quentially and provide an explanation for their reasoning. the WizardLM and Speechless models helps them achieve The posts they were given came from a subset of 25 Red- slightly higher accuracy than the LLaMA Chat models. dit users’ posts. These Reddit users’ posts came from the Compared to MentaLBERT, we can see that the LLMs UMD expert-labeled dataset that was sampled without were able to classify the risk of suicide fairly efectively. replacement. In the third round, we repeated the same We also wanted to compare the quality of the generprocess with our second data source, the Reddit data ated explanations across the diferent LLMs by genermined by our research team. Table 2 and Table 3 provide ating explanations for the UMD expert-labeled dataset. the evaluation metrics for rounds 2 and 3, respectively. An example of how each model generates diferent explanations can be found in Figure 3. Table 2 lists the 3.4. Evaluation average scores grouped by four labels based on ground truth for the UMD dataset (round 2 of the user study).

BLEU and ROUGE methods were used to evaluate the Here, the reference explanations are posts from the UMD generated explanations by comparing them to those the dataset annotated by participants in the user study. Acparticipants in our user study wrote. The human expla- cording to the scores, the LLM was able to most closely nations of the user study were used as the baseline and match the users’ responses on the low and medium risk then compared with those generated from the model. posts, with some models, such as LLaMA 2 Chat 70B,

On the crowdsourced data portion of the UMD dataset, performing better than the no-risk classifications. Morewe fine-tuned the MentalBERT model to provide a classi- over, the models of which the explanations most closely ifcation baseline. The MentalBERT model is a commonly matched the users were the WizardLM and the LLaMA used model for classification tasks in mental health. The Chat 70B model; however, there was not a significant MentalBERT model is fine-tuned on the UMD suicide diference from the other models. The LLaMA Chat 70B crowdsourced training dataset for 6 epochs. The training and the WizardLM models performed the best on the dataset consists of 484 user posts. entire dataset.

We used the same few-shot prompts that we used with Model F1 Precision Recall the UMD dataset for round 3, where the participants in MentalBERT Base 0.39 0.42 0.43 the study labeled and wrote explanations for our selfWizardLM 13B 0.41 0.43 0.42 collected Reddit data. The results of the explanation Speechless 13B 0.41 0.40 0.43 evaluation metrics can be found in Table 3. The results LLaMA Chat 13B 0.34 0.42 0.36 show that the LLaMA Chat 70B has the highest BLEU LLaMA Chat 70B 0.32 0.36 0.33 and ROUGE scores, with the WizardLM model being a Table 1 close second.

Classification accuracy of LLMs on the UMD dataset BLEU scores are known to penalize longer texts than the reference ones. In our case, the BLEU scores in both rounds (both tables) did not exceed 0.3. We suspect that one factor causing this may be that the model-generated 4. Results and Evaluations explanations are more detailed than the human explanations collected during the user study. We observed that This section describes the results of the experimentation during the user study, participants sometimes provided and user study. Quantitative results include model ac- explanations in phrases and incomplete sentences, causcuracy and explanation evaluation scores using BLEU ing a greater diference between the AI-generated and and ROUGE scores. We also discuss the quality of the human explanations, leading to a lower BLEU score. For generated explanations based on human evaluation. that reason, we chose to use the ROUGE-1 score, as it will help mitigate the efects of the diferent lengths of expla4.1. Quantitative Results nations between the participants and the LLM. Similarly to the BLEU score, the ROUGE score also has limitations.

We investigated the classification accuracy of the difer- It may not fully capture the semantic meaning or coherent LLMs. The classification F1, precision, and recall can ence of the text summary, as it relies solely on 1-gram be found in Table 1. The model with the top F1 was the overlap. We suspect that the lower BLEU and ROUGE WizardLM 13 billion (B) parameter version and Speech- scores can be partially attributed to the small sample less 13B. This model also had the highest precision. The size and the short user study explanations. In addition, our evaluation results show that the instruction-tuning LLaMA 2 Chat 13B

WizardLM 13B

SpeechlessLM 13B

LLaMA 2 Chat 70B Avg. F1 Scores No risk Low risk Medium risk High risk Entire dataset

to high risk of suicidal ideation. Some examples of how LLaMA 2 Chat 13B 0.222 0.391 the diferent explanations look for the diferent versions

WizardLM 13B 0.283 0.514 of the LLMs can be found in Figure 3.

SpeechlessLM 13B 0.266 0.464 One issue we observed in the evaluation was the occurLLaMA 2 Chat 70B 0.286 0.525 rence of generated explanations that were hallucinations. Table 3 Hallucinations occur when the model starts providing Generation evaluation metrics for collected Reddit dataset inaccurate information. While this is a common issue in LLMs, it is highly undesirable [28]. One example of that can be seen in Figure 3 in the LLaMA Chat 13B models method used by the WizardLM and SpeechlessLM models explanations. The model mentions that the poster has and the larger parameter size of the LLaMA 70B model been cutting themself. However, that is not mentioned give them an advantage in both explanation quality and anywhere in the input post. classification accuracy. The mental health expert professor reviewed some

It is often very dificult to judge the risk level of a so- LLM classifications and generated explanations. Accordcial media post because suicide risk assessment difers ing to him, one of the most significant indicators of risk between people. In order to measure agreement between of suicide was previous suicide attempts, and the LLM people in our user study, we calculated the Krippendorf did not take that much into consideration. The LLM puts alpha coeficient [ 26, 27]. The alpha value is 0.612, mean- a substantial focus on non-suicidal self-harm. The presing there was a moderate agreement of suicide risk. ence of this factor can be a contributing factor to suicidal behavior, but it does not always correlate directly with 4.2. Qualitative Results it. However, he agreed that the model and explanations successfully distinguished between suicidal thinking and We found that our method produces very promising re- depression. As he observed, neither the model nor the sults qualitatively. Each of the models is able to provide experts could diferentiate between "no risk" behavior a reasonable explanation for the classification. The gen- and "low risk" behavior. erated explanations start by mentioning the risk level ("This post represents medium risk of suicide."). Then, the explanations provide reasoning behind the classifica- 5. Discussion and Limitations tion using parts from the users’ posts as evidence. Many In this work, we see that we are able to use LLMs to explanations, particularly in the low-risk or medium-risk predict the risk of suicide and generate natural language category, include reasoning for why the post is not in a explanations for the classification. diferent risk category. As an example, an explanation of There are several limitations in using this method, a medium risk category post may include reasoning for however. LLMs are very resource intensive, and some why the post is not high risk and why the post is not low mental health facilities may not have the resources. As risk. Word clouds generated on the explanations for low a result, proper infrastructure, including training and (Figure 4) and medium risks (Figure 5) are also in line large amounts of computing, needs to be developed for with our observations. For example, the predicted expla- LLMs to benefit mental health professionals. Another nations for low-risk posts (Figure 4 right) emphasize the limitation is that the classification of the risk of suicide high importance of non-suicidal thoughts with uncer- is very subjective as interpreted from natural language. tainty by words such as ’feeling’,’mention’, and ’indicat- Each expert may find the risk of suicide to be diferent, ing’ implying low risk that the person may be suicidal. thus making it very dificult to benchmark. We saw this However, in Figure 5, for the medium risk, the strength quite frequently in our work. Many of the mental health of the words is similar for both human participants and experts we consulted difered in opinion on the risk of generated explanations. This demonstrates a moderate suicide as risk assessment for suicide is very subjective. locally on a standard computer, so mental health profesAdditionally, the explanations that are produced do not sionals do not need to share patient data with external make generative LLM actually interpretable. It is still companies. not known how the generative LLMs produce classifica- In the future, we will conduct a more extensive user tions or explanations. The explanations are only useful study and use the responses to fine-tune an LLM to profor giving estimated reasoning for a classification; expla- vide better explanations and increase the zero-shot pernations cannot easily tell us the classification methods formance. We will evaluate the performance and accuof the generative LLM. Lastly, LLMs are often prone to racy of the machine-generated explanations more efechallucinations, which can lead to incorrect classifications tively this way. Due to incomplete sentences or phrases and explanations. We saw this during our testing, where delivered by the participants in the user study, we obthe explanation referenced something that did not occur tained lower BLEU and ROUGE scores. In real life, psyin the social media post. chologists and doctors often write short notes about their diagnoses. We plan to automate the generation of full sentences from diagnostic feedback to enhance AI-generated 6. Conclusion and Future Work explanations. We are also testing other foundational models to see how they compare to LLaMA. We can also extrapolate trends and behaviors that describe the LLM’s classification process, but more research is required to determine their accuracy.

Our approach demonstrates a proof of concept method where LLMs with prompt engineering are used to provide explanations for classification for evaluating the risk of suicide. We were able to benchmark the accuracy of risk assessment of suicide done by LLMs. We also compared machine-generated explanations with human explanations to help LLMs generate an explanation humans can better understand and trust. The LLMs of 13B and 70B can efectively classify and produce natural language explanations for the predictions. These LLMs can also run

7. Ethical Considerations The University of Maryland granted access to the protected UMD dataset, requiring an IRB and a detailed approval process. We anonymized all the data to the best 8. Acknowledgment This work is supported by Florida Institute

of Technology’s Institutional Research Initiatives (https://www.fit.edu/institutional-research/ institutional-research/) and AccessComputing (https://www.washington.edu/accesscomputing/).

Dr. Nasheen Nur is the project’s primary investigator and the corresponding author of this paper. of our abilities to reduce bias and the spread of personal information, and only members of the research team had access to them. Please note that the research is still a work in progress, and none of the models guarantee that the results presented are accurate. These evaluations should not be used as a substitute for evaluations by mental health professionals. Experts and users, such as mental health professionals, are responsible for verifying any results. Society, Los Alamitos, CA, USA, 2021, pp. 1077– [26] K. Krippendorf, Content analysis : an introduction 1082. URL: https://doi.ieeecomputersociety.org/ to its methodology, SAGE, 2013. 10.1109/BIBM52615.2021.9669469. doi:10.1109/ [27] S. Castro, Fast Krippendorf: Fast computation of BIBM52615.2021.9669469. Krippendorf’s alpha agreement measure, https: [15] S. Gururangan, A. Marasović, S. Swayamdipta, //github.com/pln-fing-udelar/fast-krippendorf, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t stop 2017. pretraining: Adapt language models to domains and [28] M. A. Ahmad, I. Yaramis, T. D. Roy, Creating trusttasks, arXiv preprint arXiv:2004.10964 (2020). worthy llms: Dealing with hallucinations in health[16] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, care ai, 2023. arXiv:2311.01463.

T.-Y. Liu, BioGPT: generative pre-trained transformer for biomedical text generation and mining, Briefings in Bioinformatics 23 (2022). URL: https://doi.org/10.1093%2Fbib%2Fbbac409. doi:10.

1093/bib/bbac409. [17] K. Yang, S. Ji, T. Zhang, Q. Xie, Z. Kuang, S. Ananiadou, Towards interpretable mental health analysis with chatgpt, 2023. arXiv:2304.03347. [18] H.-C. Shing, S. Nair, A. Zirikly, M. Friedenberg,

H. Daumé III, P. Resnik, Expert, crowdsourced, and machine assessment of suicide risk via online postings, in: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology:

From Keyboard to Clinic, 2018, pp. 25–36. [19] A. Zirikly, P. Resnik, Ö. Uzuner, K. Hollingshead,

CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts, in: Proceedings of the Sixth Workshop on Computational Linguistics and

Clinical Psychology, 2019. [20] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318. [21] C. Lin, Rouge: A package for automatic evaluation of summaries. text summarization branches out, 74–81, 2004. [22] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang,

J. Li, R. Hu, T. Zhang, F. Wu, G. Wang, Instruction tuning for large language models: A survey, 2023.

arXiv:2308.10792. [23] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, D. Amodei, Deep reinforcement learning from human preferences, 2023.

arXiv:1706.03741. [24] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng,

C. Tao, D. Jiang, Wizardlm: Empowering large language models to follow complex instructions, 2023.

arXiv:2304.12244. [25] L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi,

C. Foster, L. Golding, J. Hsu, K. McDonell, N. Muennighof, J. Phang, L. Reynolds, E. Tang, A. Thite, B. Wang, K. Wang, A. Zou, A framework for fewshot language model evaluation, 2021. URL: https: //doi.org/10.5281/zenodo.5371628. doi:10.5281/ zenodo.5371628.