REBECCA at eRisk 2024: Search for Symptoms of Depression Using Sentence Embeddings and Prompt-Based Filtering

Notebook for the eRisk Lab at CLEF 2024

Anna Barachanou1,*, Filareti Tsalakanidou1 and Symeon Papadopoulos1

1 Information Technologies Institute, Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
Email: barachanou@iti.gr (A. Barachanou); filareti@iti.gr (F. Tsalakanidou); papadop@iti.gr (S. Papadopoulos)
ORCID: 0009-0007-1193-7682 (A. Barachanou); 0000-0002-5310-8045 (F. Tsalakanidou); 0000-0002-5441-7341 (S. Papadopoulos)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Depression is a complex mental health disorder characterized by persistent feelings of sadness, hopelessness, and a lack of interest or pleasure in daily activities. It significantly affects an individual's well-being, impairing their ability to work, socialize and be creative. Social media are used by billions of people globally, who interact and generate an abundance of posts and texts. Analysis of this social interaction data offers opportunities to gain valuable insights into people's mental health and potentially take supportive action. eRisk 2024 focuses on the challenge of early risk detection on the Internet and has established a number of tasks to this end. We participated in Task 1: Search for symptoms of depression, whose aim is to rank user sentences in terms of 21 symptoms of depression. This paper presents our approach, which combines sentence ranking based on cosine similarity over Transformer embeddings with result refinement using a Large Language Model (LLM). Our LLM-refined approach was among the best performing of the 29 runs submitted by the 9 participating teams.

Keywords
early risk detection, natural language processing, depression, text retrieval, prompt engineering, transformers

1. Introduction

Depression is a debilitating mental health condition affecting 5% of people worldwide according to the WHO (World Health Organization)1. Individuals suffering from depression experience a variety of symptoms beyond a persistently depressed mood and dysphoria. Depression may also manifest as a loss of interest in activities once enjoyed, significant changes in sleep and appetite, feelings of guilt and hopelessness, fatigue, restlessness, problems with concentration and even suicidal ideation [1]. Beck's Depression Inventory (BDI-II) [2] is one of the most widely used psychometric assessment tools for depression; it takes the form of a questionnaire measuring the severity of such symptoms of depression in adolescents and adults.

1 https://www.who.int/news-room/fact-sheets/detail/depression

In today's digitally connected world, social media platforms such as Facebook, Instagram, YouTube and Twitter are used by more than 4.76 billion people worldwide2. Among these users are many people affected by mental health conditions, including depression. Through social media, users interact and share their thoughts, opinions and emotions with others. As a result, vast amounts of data are generated every day that could potentially be leveraged to provide insights into users' mental well-being. This presents a unique opportunity for mental health professionals and researchers to analyze language patterns using modern Natural Language Processing (NLP) techniques. By examining the textual content shared on social media, it should be possible to build methods for the early detection of depression.

2 https://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-social-media-research/

Early detection of risk factors such as depression can prevent numerous negative outcomes in an individual's life. Recognizing and addressing symptoms of depression early on facilitates timely, helpful intervention and support, which can significantly improve the effectiveness of treatment.
Individuals are more likely to respond positively to treatment when intervention begins early, and early intervention helps them avoid intensified and persistent symptoms. Because depression is a major risk factor for suicide, the early offer of support can also potentially minimize the risk of suicide and suicidal behaviors. This eventually leads to an enhanced quality of life free of symptoms of depression, enabling individuals to engage actively and socially in their everyday lives.

The eRisk lab of CLEF (Conference and Labs of the Evaluation Forum) focuses on early risk prediction on the Internet. Ever since its beginning in 2017, when it piloted a task on early detection of depression [3], eRisk's primary objective has been depression, later expanding to tasks related to other mental illnesses as well. In this paper, we present our participation, motivated by our involvement in the Horizon 2020 REBECCA project, in Task 1: Search for symptoms of depression. This task is a continuation of the same task in eRisk 2023. We were inspired by the systems developed by the participating teams and attempted to improve their results with the use of Large Language Models (LLMs) and prompt engineering. Traditional information retrieval methods such as BM25 or TF-IDF can effectively handle document ranking but often lack the semantic depth needed for precise results; given that depression is a complex and delicate subject, highly accurate methods are needed to rank sentences with respect to depression symptoms. We initially ranked the sentences against the symptoms using Transformer embeddings, computing the ranking scores with cosine similarity. Subsequently, we leveraged the reasoning capabilities of an LLM (namely GPT-4) to refine the results of the base method, emulating the process of providing relevance feedback by removing non-relevant sentences that do not reflect the author's state with respect to the symptoms. Our methods achieved highly competitive results among the 9 participating teams, outperforming all competing approaches in terms of Precision@10 in the unanimity setting, and revealed the potential of the systems we developed, especially of utilizing GPT-4 to better grasp the concepts of depression in sentences.

2. Related work

A significant portion of the related literature about depression focuses on depression identification. For example, Jamil et al. [4] aimed to identify depression from individual tweets and to assess the risk of depression from a user's set of tweets. They computed a small number of features, using indicators such as the percentage of depressed tweets, self-reported depression, bag-of-words (BOW) representations and other lexical features. They employed an SVM for classification and used balancing methods such as undersampling and SMOTE. Similarly, Peng et al. [5] used various ML models and a multi-kernel SVM to combine features from a user's texts, profile and behaviour.
Chen et al. [6] used emotion analysis with EMOTIVE [7], linguistic features from LIWC, and behavioral features to identify mental health conditions, employing several ML models for the classification task.

eRisk 2023 [8] established three tasks surrounding mental health, including Task 1: Search for symptoms of depression. The task we participate in this year is a continuation of it, with the aim of further expanding research on this promising topic. The Formula-ML team [9] achieved the best performance in 2023 by leveraging Transformer embeddings and word2vec for sentence embeddings. They then applied soft cosine similarity between sentences and BDI-II terms for each symptom and performed a weighted aggregation of these scores to compute the final scores and rank the sentences in relation to symptoms of depression. A number of participating teams utilized LLMs in their systems for various eRisk 2023 tasks. For Task 1 in particular, the BLUE team [10] utilized ChatGPT to enrich the BDI-II questionnaire terms, enhancing their diversity. They then computed embeddings using two Transformer models and applied cosine similarity to ultimately rank the sentences.

Large Language Models are a relatively recent innovation in the fields of Artificial Intelligence (AI) and NLP; however, they already show great potential in many domains, including mental health. Hadzic et al. [11] compared the efficacy of three popular language models, BERT, GPT-3.5 and GPT-4, for early detection of depression in textual data. Their study, conducted across three datasets, revealed that GPT-4 significantly outperforms both BERT and GPT-3.5, demonstrating superior performance without prior fine-tuning. This suggests that GPT-4 could be a highly effective tool for early depression detection. The study also highlights the potential of models like GPT-4 in mental health beyond depression, proposing further development and fine-tuning of LLMs.

3. Methodology

We participated in Task 1: Search for symptoms of depression of the eRisk 2024 lab [12, 13] of CLEF 2024. This is a continuation of the same task from CLEF eRisk 2023. The task consists of ranking sentences from social media in terms of the 21 symptoms of depression (Table 1) from the Beck Depression Inventory–II (BDI-II) questionnaire. The BDI-II is a self-report rating inventory consisting of 21 multiple-choice questions, each relating to a specific symptom. Each question has four possible answers ordered from least to most severe, associated with scores from 0 to 3 respectively. The scores assigned to the questions are summed into a total score with a maximum of 63; high total scores indicate a high chance of depressive symptoms.

Table 1
The 21 symptoms of depression according to BDI-II

1. Sadness                        12. Loss of Interest
2. Pessimism                      13. Indecisiveness
3. Past Failure                   14. Worthlessness
4. Loss of Pleasure               15. Loss of Energy
5. Guilty Feelings                16. Changes in Sleep Patterns
6. Punishment Feelings            17. Irritability
7. Self-Dislike                   18. Changes in Appetite
8. Self-Criticalness              19. Concentration Difficulty
9. Suicidal Thoughts or Wishes    20. Tiredness or Fatigue
10. Crying                        21. Loss of Interest in Sex
11. Agitation

In more detail, each social media sentence should be assigned to the most relevant symptom of the 21. Subsequently, the sentences assigned to each symptom should be ordered from most to least relevant. Relevant sentences should convey the author's state concerning the symptom, even if the sentiment is positive. For example, a sentence that expresses happiness should also be considered relevant to the symptom of sadness. It is also emphasized that a sentence is relevant only when it concerns the author's own feelings related to the symptom and not the feelings of other individuals. For example, a post mentioning that the user's sister is sad is not considered relevant to sadness for that user, because it is the sister, not the user, who is sad.
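To make the scoring scheme of the questionnaire concrete, the following minimal sketch computes a BDI-II total from the per-question scores described above. The answer values are hypothetical; the sketch is illustrative and not part of our ranking system.

```python
# Minimal sketch of BDI-II scoring: 21 questions, each answered on a
# 0-3 severity scale, summed into a total between 0 and 63.
# The answer values below are hypothetical.
answers = [1, 0, 2, 1, 0, 0, 1, 2, 0, 1, 1, 0, 1, 2, 1, 3, 0, 1, 2, 1, 0]

assert len(answers) == 21 and all(0 <= a <= 3 for a in answers)
total = sum(answers)  # maximum possible total is 21 * 3 = 63
print(f"BDI-II total score: {total} / 63")
```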
3.1. Dataset

We were provided with two TREC-formatted, sentence-tagged datasets, one for training and one for testing. Both consist of unlabeled user sentences from Reddit posts. The training dataset consists of last year's data, while the test set contains new data for this year's eRisk, to be used for the evaluation of the participating systems. As presented in Table 2, the test data consist of a total of 15M sentences, 11M more than the dataset used in 2023, with approximately 18 words per sentence on average. We additionally created a small third dataset containing all symptoms and their respective relevant answers from the BDI-II questionnaire. Examples for the symptoms of sadness and pessimism are presented in Table 3.

Table 2
Corpus statistics

                                 Training    Test
Number of sentences              4M          15M
Number of users                  3,106       553
Avg number of words/sentence     13.99       17.99

Table 3
Relevant answers from the BDI-II for sadness and pessimism

Sadness                                         Pessimism
I do not feel sad                               I am not discouraged about my future
I feel sad much of the time                     I feel more discouraged about my future than I used to
I am sad all the time                           I do not expect things to work out for me
I am so sad or unhappy that I can't stand it    I feel my future is hopeless and will only get worse

3.2. Ranking system

The system we developed is illustrated in the flowchart of Figure 1. It involves multiple steps, which we expand on below: text pre-processing; dataset cleaning, by discarding sentences that are not about the authors; sentence ranking, using a pre-trained Transformer for sentence embeddings and cosine similarity; and result refinement using GPT-4.

For pre-processing, we translated all texts to English, lowercased them, removed punctuation and non-alphabetic symbols, and expanded word contractions. A sentence is considered relevant only when it reflects the author's state with respect to a symptom; consequently, we conducted keyword matching to keep only sentences indicating that the author is talking about themselves (I, me, my, mine, myself, we, us, our, ours, ourselves). Having removed the sentences that contain none of these keywords, we are confident that we eliminated a substantial portion of irrelevant texts, while simultaneously reducing the computational workload from 15M to 11M sentences (Table 4). A sketch of this cleaning step is given after Table 4.

Table 4
Number of sentences after cleaning

                                         Training    Test
Initial number of sentences              4M          15M
Number of sentences after elimination    1M          11M
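The following is a minimal sketch of the cleaning steps just described (lowercasing, contraction expansion, punctuation removal and first-person keyword filtering). It is a simplified reconstruction rather than our exact code: the contraction mapping shown is only a sample, and the translation step is omitted.

```python
import re

# First-person keywords used to keep only sentences about the author.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

# Sample contraction mapping; a full mapping covers many more forms.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot", "it's": "it is"}

def preprocess(sentence: str) -> str:
    """Lowercase, expand contractions and strip non-alphabetic symbols."""
    text = sentence.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return re.sub(r"[^a-z\s]", " ", text)

def is_about_author(sentence: str) -> bool:
    """Keep a sentence only if it contains a first-person keyword."""
    return bool(FIRST_PERSON & set(preprocess(sentence).split()))

sentences = ["I feel sad all the time.", "My sister is always sad."]
print([s for s in sentences if is_about_author(s)])
# Both survive: the second contains "my" although it describes the sister,
# which is exactly the kind of case the later GPT-4 refinement removes.
```

The keyword filter is deliberately permissive; sentences that mention the author but describe someone else's state are handled by the later refinement step.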
Because the provided datasets are unlabeled, we focused on unsupervised methods for our systems. We chose a pre-trained Transformer model to calculate the embeddings of the sentences and of the answers for each symptom. The Massive Text Embedding Benchmark (MTEB) [14] evaluates text embeddings across a broad range of tasks and datasets to provide a comprehensive assessment of their performance; it spans 8 embedding tasks, 58 datasets and 112 languages. The MTEB Leaderboard3 presents all tested models across all tasks, including text ranking, along with numerous evaluation metrics. We considered models for the Retrieval and Reranking tasks, which are evaluated using NDCG@k (Normalized Discounted Cumulative Gain at k) and MAP (Mean Average Precision), respectively. Since last year's submissions already indicate how some Transformer-based embeddings perform, we explored new Transformer models for this part of the task, excluding models that were involved in last year's submissions. Based on the above criteria and the need for a model that is as lightweight as possible without sacrificing substantial performance, we opted for the bge-small-en-v1.5 Transformer model4 [15], which computes 384-dimensional embeddings and has 33M parameters.

3 https://huggingface.co/spaces/mteb/leaderboard
4 https://huggingface.co/BAAI/bge-small-en-v1.5

Figure 1: Methodology flowchart: text pre-processing → keyword matching → sentence and answer embeddings → cosine similarity → sentence ranking → refinement with GPT-4.

We calculated the cosine similarity score of each sentence paired with each answer of every symptom. For every sentence, we kept the maximum similarity score among the sentence-answer pairs of each symptom and assigned the sentence to the symptom with the overall maximum score. We then ranked the sentences under every symptom by this score and kept the top 1,000 per symptom, resulting in a total of 21,000 ranked sentences from the initial corpus; a sketch of this step is given at the end of this section.

Depression is a complex and delicate subject, hence we expected our initial ranking using the above method to be a decent but crude approximation to the task. To further refine our results, we resorted to prompt engineering on top of GPT-4, which is considered one of the state-of-the-art LLMs. We used prompt engineering to discard non-relevant sentences that were ranked high by the previous steps of our system, asking GPT-4 to decide whether a sentence is actually relevant (according to GPT-4) to the symptom. We first conducted experiments using ChatGPT, testing various candidate prompts and comparing a shared prompt strategy (i.e., using the same prompt for all symptoms) against a symptom-specific prompt strategy. After this initial experimentation, we concluded that the symptom-specific strategy was more effective. All symptom-specific prompts followed the same syntax for the sake of uniformity. Subsequently, we used the more powerful gpt-4-turbo model, accessed via the OpenAI API5, for the final results. Our 21 prompts followed the structure "We will provide you with some sentences. Your task is to decide whether they are related to {symptom} in a positive/negative sentiment or not", where each symptom and its respective positive and negative sentiment were filled in; the detailed prompts are provided in Appendix A. Since positive feelings about a symptom are to be considered relevant as well, we made an effort to cover positive sentiment through our prompts. We removed all sentences that GPT-4 did not consider relevant, leaving 14,815 sentences, meaning that 6,185 sentences were discarded as non-relevant. We submitted both the method without prompt engineering and the method with GPT-4 assessment, in order to evaluate whether GPT-4 improved the overall performance.

5 https://openai.com/api/
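To make the ranking step concrete, the following minimal sketch shows the assignment and scoring logic on a toy example, assuming the sentence-transformers library and the bge-small-en-v1.5 checkpoint named above. The variable names and example data are our own, and the real pipeline additionally batches over millions of sentences.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim embeddings

# BDI-II answer texts per symptom (two symptoms shown; cf. Table 3).
answers = {
    "Sadness": ["I do not feel sad", "I feel sad much of the time",
                "I am sad all the time",
                "I am so sad or unhappy that I can't stand it"],
    "Pessimism": ["I am not discouraged about my future",
                  "I do not expect things to work out for me",
                  "I feel my future is hopeless and will only get worse"],
}
sentences = ["i feel sad all the time", "nothing will ever work out for me"]

# With normalized embeddings, the dot product equals cosine similarity.
sent_emb = model.encode(sentences, normalize_embeddings=True)
symptom_scores = {}
for symptom, texts in answers.items():
    ans_emb = model.encode(texts, normalize_embeddings=True)
    sims = sent_emb @ ans_emb.T                  # (n_sentences, n_answers)
    symptom_scores[symptom] = sims.max(axis=1)   # best-matching answer

# Assign each sentence to its highest-scoring symptom.
for i, sentence in enumerate(sentences):
    best = max(symptom_scores, key=lambda s: symptom_scores[s][i])
    print(f"{sentence!r} -> {best} ({symptom_scores[best][i]:.3f})")
```

In the full system, the score of a sentence under its assigned symptom determines the per-symptom ranking, of which only the top 1,000 sentences are retained.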
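The refinement step can be sketched in a similar fashion using the OpenAI Python client. The prompt shown is our sadness prompt from Appendix A, but the system/user message split, the strict YES/NO answer format and the zero temperature are illustrative assumptions, not a verbatim reproduction of our setup.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Symptom-specific prompt (here: sadness, prompt 1 in Appendix A).
PROMPT = ("We will provide you with some sentences. Your task is to decide "
          "whether they are related to sadness/happiness or not.")

def is_relevant(sentence: str) -> bool:
    """Ask GPT-4 whether a top-ranked sentence is relevant to the symptom."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": PROMPT + " Answer only YES or NO."},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

ranked = ["i feel sad all the time", "my sister is always sad"]
refined = [s for s in ranked if is_relevant(s)]  # drop non-relevant sentences
```

Sentences judged non-relevant are simply removed; the surviving sentences keep their cosine-similarity ordering.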
4. Results

We submitted two runs with our results in the requested TREC format: TransformerEmbeddings_CosineSimilarity, containing the results of our baseline method, and TransformerEmbeddings_CosineSimilarity_gpt, containing our final results after ranking refinement with GPT-4. In total, 9 teams participated in eRisk 2024 Task 1 with 29 submitted runs. eRisk selected a number of sentences from all teams' submissions using top-k pooling; human assessors then examined whether each sentence was correctly ranked under a symptom. Two types of evaluation took place: a) a majority vote, where the agreement of the majority of the assessors suffices to label a ranking as correct (or not); and b) a unanimity vote, where all assessors are required to agree. Five metrics were used for the evaluation of all submissions: AP (Average Precision), MAP (Mean Average Precision), R-PREC (R-Precision), P@10 (Precision at 10) and NDCG (Normalized Discounted Cumulative Gain).

As presented in Tables 5 and 6, our systems demonstrated good performance across all metrics under both the majority and the unanimity vote. Regarding the majority vote, we approach the performance levels of the top-performing teams on all metrics and lie above the mean and median of all submitted runs, while our method with GPT-4 ranking refinement, TransformerEmbeddings_CosineSimilarity_gpt, improves performance on all scores except NDCG.

Table 5
Majority voting results

Team             Method                                        AP       R-PREC   P@10     NDCG
MeVer-REBECCA    TransformerEmbeddings_CosineSimilarity_gpt    0.301    0.340    0.981    0.506
MeVer-REBECCA    TransformerEmbeddings_CosineSimilarity        0.295    0.332    0.976    0.517
NUS-IDS          Config 5                                      0.375    0.434    0.924    0.631
APB-UC3M         APB-UC3M sentsim-all-MiniLM-L6-v2             0.354    0.391    0.986    0.591
All team runs    Mean                                          0.226    0.253    0.685    0.375
All team runs    Median                                        0.252    0.322    0.738    0.453

Concerning the unanimity vote, we obtained the best P@10 score, 0.833, for TransformerEmbeddings_CosineSimilarity_gpt among all 29 runs of the participating teams. On the remaining metrics we are close to the best-performing team, while our scores again exceed both the mean and the median of all teams' runs. Consequently, the results indicate the strength of both our baseline model and our refinement method: the ranking refinement improved overall performance, with an increase across all metrics with the exception of NDCG.

Table 6
Unanimity voting results

Team             Method                                        AP       R-PREC   P@10     NDCG
MeVer-REBECCA    TransformerEmbeddings_CosineSimilarity_gpt    0.305    0.357    0.833    0.551
MeVer-REBECCA    TransformerEmbeddings_CosineSimilarity        0.294    0.349    0.824    0.556
NUS-IDS          Config 5                                      0.392    0.436    0.795    0.692
All team runs    Mean                                          0.220    0.248    0.548    0.411
All team runs    Median                                        0.227    0.275    0.576    0.499
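For reference, the headline P@10 metric admits a one-line computation; the sketch below uses hypothetical relevance judgments, whereas the official scores are derived from the pooled human assessments described above.

```python
def precision_at_k(relevance: list[bool], k: int = 10) -> float:
    """Fraction of the top-k ranked sentences that were judged relevant."""
    top = relevance[:k]
    return sum(top) / len(top)

# Hypothetical assessor judgments for one symptom's top-10 sentences.
judgments = [True, True, True, False, True, True, True, True, True, True]
print(precision_at_k(judgments))  # 0.9
```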
5. Conclusion and future work

In conclusion, based on the mean and median of the assessment scores of all teams, our methods are competitive and exhibit potential for future research. Our proposed methodology consisted of a few pre-processing and cleaning steps followed by a simple ranking using sentence embeddings, which was further refined through a prompt engineering strategy on top of GPT-4. There is nevertheless room for improving the scores by enhancing our methodology. One future step is to experiment with other prompting strategies that could be more effective in separating relevant from non-relevant sentences. Moreover, one could leverage publicly available depression-annotated corpora to fine-tune GPT-4 so that it better recognizes the relevance of sentences to depression symptoms. Finally, we could investigate using LLMs to annotate parts of the dataset and then use these annotations to train more accurate deep learning models in a supervised manner.

Acknowledgments

This work has been partially funded by the H2020 project "REBECCA: REsearch on BrEast Cancer induced chronic conditions supported by Causal Analysis of multi-source data" under Grant Agreement no. 965231 (https://rebeccaproject.eu/).

References

[1] J. W. Kanter, A. M. Busch, C. E. Weeks, S. J. Landes, The nature of clinical depression: symptoms, syndromes, and behavior analysis, The Behavior Analyst 31 (2008) 1–21. doi:10.1007/BF03392158.
[2] A. T. Beck, C. H. Ward, M. Mendelson, J. Mock, J. Erbaugh, An inventory for measuring depression, Archives of General Psychiatry 4 (1961) 561–571. doi:10.1001/archpsyc.1961.01710120031004.
[3] D. Losada, F. Crestani, A test collection for research on depression and language use, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2016), LNCS volume 9822, 2016, pp. 28–39. doi:10.1007/978-3-319-44564-9_3.
[4] Z. Jamil, D. Inkpen, P. Buddhitha, K. White, Monitoring tweets for depression to detect at-risk users, in: K. Hollingshead, M. E. Ireland, K. Loveys (Eds.), Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology — From Linguistic Signal to Clinical Reality, Association for Computational Linguistics, Vancouver, BC, 2017, pp. 32–40. URL: https://aclanthology.org/W17-3104. doi:10.18653/v1/W17-3104.
[5] Z. Peng, Q. Hu, J. Dang, Multi-kernel SVM based depression recognition using social media data, International Journal of Machine Learning and Cybernetics 10 (2017) 43–57. doi:10.1007/s13042-017-0697-1.
[6] X. Chen, M. Sykora, T. Jackson, S. Elayan, F. Munir, Tweeting your mental health: an exploration of different classifiers and features with emotional signals in identifying mental health conditions, 2018. doi:10.24251/HICSS.2018.421.
[7] M. Sykora, T. Jackson, A. O'Brien, S. Elayan, EMOTIVE ontology: Extracting fine-grained emotions from terse, informal messages, International Journal on Computer Science and Information Systems 8 (2013) 106–118.
[8] J. Parapar, P. Martin-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2023: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18–21, 2023, Proceedings, Springer-Verlag, Berlin, Heidelberg, 2023, pp. 294–315. URL: https://doi.org/10.1007/978-3-031-42448-9_22. doi:10.1007/978-3-031-42448-9_22.
[9] N. Recharla, P. Bolimera, Y. Gupta, A. K. Madasamy, Exploring depression symptoms through similarity methods in social media posts, 2023. URL: https://ceur-ws.org/Vol-3497/paper-065.pdf.
[10] A.-M. Bucur, Utilizing ChatGPT generated data to retrieve depression symptoms from social media, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18–21, 2023, Proceedings, 2023, pp. 662–671. URL: https://ceur-ws.org/Vol-3497/paper-055.pdf.
[11] B. Hadzic, P. Mohammed, M. Danner, J. Ohse, Y. Zhang, Y. Shiban, M. Rätsch, Enhancing early depression detection with AI: a comparative use of NLP models, SICE Journal of Control, Measurement, and System Integration 17 (2024) 135–143. doi:10.1080/18824889.2024.2342624.
[12] J. Parapar, P. Martin-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2024: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 15th International Conference of the CLEF Association, CLEF 2024, Springer International, Grenoble, France, 2024.
[13] J. Parapar, P. Martin-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2024: Early risk prediction on the internet (extended overview), in: Working Notes of the Conference and Labs of the Evaluation Forum CLEF 2024, Grenoble, France, September 9th to 12th, 2024, CEUR Workshop Proceedings, 2024.
[14] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, MTEB: Massive text embedding benchmark, in: A. Vlachos, I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 2014–2037. doi:10.18653/v1/2023.eacl-main.148.
[15] S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, C-Pack: Packaged resources to advance general Chinese embedding, 2023. arXiv:2309.07597.

A. GPT-4 Prompts

Here we present the prompts used in this work in detail:

1. We will provide you with some sentences. Your task is to decide whether they are related to sadness/happiness or not.
2. We will provide you with some sentences. Your task is to decide whether they are relevant to pessimism/optimism or not.
3. We will provide you with some sentences. Your task is to decide whether they are relevant to past failure/success or not.
4. We will provide you with some sentences. Your task is to decide whether they are relevant to the recent loss (or not) of pleasure.
5. We will provide you with some sentences. Your task is to decide whether they are relevant to feeling (or not feeling) guilty.
6. We will provide you with some sentences. Your task is to decide whether they are relevant to someone feeling like they themselves are being (or not being) punished.
7. We will provide you with some sentences. Your task is to decide whether they are relevant to someone disliking or liking themselves.
8. We will provide you with some sentences. Your task is to decide whether they are relevant to someone feeling (or not feeling) critical towards themselves.
9. We will provide you with some sentences. Your task is to decide whether they are relevant to having (or not having) suicidal thoughts and wishes.
10. We will provide you with some sentences. Your task is to decide whether the sentences mention crying or not crying now or any other time.
11. We will provide you with some sentences. Your task is to decide whether they are relevant to feeling (or not feeling) agitated.
12. We will provide you with some sentences. Your task is to decide whether they are relevant to losing (or not losing) interest in things.
13. We will provide you with some sentences. Your task is to decide whether they are relevant to being (or not being) indecisive.
14. We will provide you with some sentences. Your task is to decide whether they are relevant to feeling (or not feeling) worthless.
15. We will provide you with some sentences. Your task is to decide whether they are relevant to having (or not having) energy.
16. We will provide you with some sentences. Your task is to decide whether they are relevant to experiencing (or not experiencing) changes in sleeping pattern.
17. We will provide you with some sentences. Your task is to decide whether they are relevant to feeling (or not feeling) irritable.
18. We will provide you with some sentences. Your task is to decide whether they are relevant to experiencing (or not experiencing) changes in appetite.
19. We will provide you with some sentences. Your task is to decide whether they are relevant to having (or not having) difficulty concentrating.
20. We will provide you with some sentences. Your task is to decide whether they are relevant to feeling (or not feeling) tired.
21. We will provide you with some sentences. Your task is to decide whether they are relevant to losing (or not losing) interest in sex.