Overview of eRisk at CLEF 2024: Early Risk Prediction on the Internet (Extended Overview)

Notebook for the eRisk Lab at CLEF 2024

Javier Parapar1,*, Patricia Martín-Rodilla1, David E. Losada2 and Fabio Crestani3

1 Information Retrieval Lab, Centro de Investigación en Tecnoloxías da Información e as Comunicacións (CITIC), Universidade da Coruña. Campus de Elviña s/n, C.P. 15071 A Coruña, Spain
2 Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela. Rúa de Jenaro de la Fuente Domínguez, C.P. 15782, Santiago de Compostela, Spain
3 Faculty of Informatics, Università della Svizzera italiana (USI). Campus EST, Via alla Santa 1, 6900 Viganello, Switzerland

Abstract
This paper presents eRisk 2024, the eighth edition of the CLEF lab dedicated to early risk detection. Since its inception, the lab has been at the forefront of developing and refining evaluation methodologies, effectiveness metrics, and processes for early risk detection across various domains. Early alerting models hold significant value, particularly in sectors focused on health and safety, where timely intervention can be crucial. eRisk 2024 featured three tasks designed to push the boundaries of early risk detection techniques. The first task challenged participants to rank sentences based on their relevance to standardized depression symptoms, a crucial step in identifying early signs of depression from textual data. The second task focused on the early detection of anorexia indicators, aiming to develop models that can recognize the subtle cues of this eating disorder before it becomes critical. The third task centered on estimating responses to an eating disorders questionnaire by analyzing users' social media posts, requiring participants to leverage the rich, real-world textual data available on social media to gauge potential mental health risks. Through these tasks, eRisk 2024 continues to advance the field of early risk detection, fostering innovations that could lead to significant improvements in public health interventions.

Keywords
Early risk, Depression, Anorexia, Eating disorders

1. Introduction

The primary goal of eRisk is to explore evaluation methodologies, metrics, and other factors essential for developing research collections and identifying early risk signs. Early detection technologies are increasingly important in safety and health fields. These technologies are particularly useful for detecting mental illness symptoms, identifying interactions between minors and sexual abusers, or spotting antisocial threats online, where they can provide early warnings and potentially prevent harmful outcomes.

Our lab focuses on a range of psychological issues, including depression, self-harm, pathological gambling, and eating disorders. We have found that the relationship between psychological conditions and language use is complex, highlighting the need for more effective automatic language-based screening models. This complexity arises from the subtle and varied ways in which psychological distress can manifest in language, necessitating sophisticated analytical techniques.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09-12, 2024, Grenoble, France
* Corresponding author.
javier.parapar@udc.es (J. Parapar); patricia.martin.rodilla@udc.es (P. Martín-Rodilla); david.losada@usc.es (D. E. Losada); fabio.crestani@usi.ch (F. Crestani)
https://www.dc.fi.udc.es/~parapar (J. Parapar); http://www.incipit.csic.es/gl/persoa/patricia-martin-rodilla (P. Martín-Rodilla); http://tec.citius.usc.es/ir/ (D. E. Losada); https://search.usi.ch/en/people/4f0dd874bbd63c00938825fae1843200/crestani-fabio (F. Crestani)
ORCID: 0000-0002-5997-8252 (J. Parapar); 0000-0002-1540-883X (P. Martín-Rodilla); 0000-0001-8823-7501 (D. E. Losada); 0000-0001-8672-0700 (F. Crestani)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In 2017, we initiated our efforts with a task aimed at detecting early signs of depression. This task utilized new evaluation methods and a test dataset described in [1, 2]. The goal was to develop models capable of identifying depressive symptoms from textual data, which could then be used for early intervention. In 2018, we expanded our scope to include the early detection of anorexia [3, 4]. This task required models to identify language patterns indicative of anorexia, providing a tool for early diagnosis.

In 2019, we continued our work on anorexia and introduced new challenges: detecting early signs of self-harm and estimating responses to a depression questionnaire based on social media activity [5, 6, 7]. The self-harm detection task aimed to identify individuals at risk by analyzing their online posts for signs of self-injurious behavior. The severity estimation task aimed to quantify the level of depressive symptoms exhibited in social media posts, providing a more nuanced understanding of an individual's mental health status. In 2020, our focus included the further development of self-harm detection and a new task on depression severity estimation [8, 9, 10]. In 2021, we concentrated on early detection tasks for pathological gambling and self-harm, along with a task for estimating depression severity [11, 12, 13]. The pathological gambling task involved identifying language patterns associated with gambling addiction, which could be used to flag individuals at risk. The self-harm and depression severity tasks continued to build on our previous work, refining the models and evaluation methods.

The 2022 edition of eRisk introduced tasks for the early detection of pathological gambling and depression, and for the severity estimation of eating disorders [14, 15, 16]. These tasks aimed to improve the accuracy and reliability of early detection models, providing valuable tools for mental health professionals. In 2023, eRisk tasks included ranking sentences by their relevance to depression symptoms, early detection of gambling signs, and severity estimation of eating disorders [17, 18, 19]. The sentence ranking task required models to assess the relevance of individual sentences to specific depressive symptoms, as outlined in the BDI-II questionnaire. This task aimed to enhance the precision of symptom identification in textual data.

In 2024, eRisk presented three campaign-style tasks [20, 19]. The first task focused on ranking sentences related to the 21 symptoms of depression in the BDI-II questionnaire, using sentences extracted from social media posts. The second task continued our work on the early detection of anorexia, requiring models to identify language patterns indicative of this eating disorder. The third task revisited the severity estimation of eating disorders, aiming to quantify the severity of symptoms exhibited in textual data.
Detailed descriptions of these tasks are provided in the subsequent sections of this overview article. In 2024, 84 teams registered for the lab, and we received results from 17 of them: 29 runs for Task 1, 44 runs for Task 2, and 14 runs for Task 3. These results provided valuable insights into the effectiveness of different models and approaches, contributing to the ongoing development of early detection technologies.

2. Task 1: Search for Symptoms of Depression

This task builds on eRisk 2023's Task 1, which focused on ranking sentences from user writings based on their relevance to specific depression symptoms. Participants had to order sentences according to their relevance to the 21 standardized symptoms listed in the BDI-II questionnaire [21]. A sentence was considered relevant if it reflected the user's condition related to a symptom, including positive statements (e.g., "I feel quite happy lately" is relevant for the symptom "Sadness"). This year, the dataset included the target sentence plus the sentences immediately before and after it, to provide additional context.

2.1. Dataset

The dataset provided was in TREC format, tagged with sentences derived from eRisk's historical data. Table 1 presents some statistics of the corpus.

Table 1: Corpus statistics for Task 1 (Search for Symptoms of Depression).
Number of users: 551,311
Number of sentences: 15,542,200
Average number of words per sentence: 17.98

1 Q0 251001_0_1 0001 10 myGroupNameMyMethodName
1 Q0 251202_5_4 0002 9.5 myGroupNameMyMethodName
1 Q0 858202_3_2 0003 9 myGroupNameMyMethodName
...
21 Q0 153202_2_2 0998 1.25 myGroupNameMyMethodName
21 Q0 331302_1_1 0999 1 myGroupNameMyMethodName
21 Q0 223133_9_8 1000 0.9 myGroupNameMyMethodName

Figure 1: Example of a participant's run.

2.2. Assessment Process

Participants were given the corpus of sentences and the description of the symptoms from the BDI-II questionnaire. They were free to decide on the best strategy to derive queries representing the BDI-II symptoms. Each team could submit up to 5 variants (runs). Each run included 21 TREC-style formatted rankings of sentences, as shown in Figure 1 (a small parsing sketch for this format is given at the end of this subsection). For each symptom, participants submitted up to 1000 results sorted by estimated relevance. We received 29 runs from 9 participating teams (see Table 2).

Table 2: Task 1 (Search for Symptoms of Depression): number of runs from participants.
Team | # of submissions
ThinkIR | 1
SINAI [22] | 2
RELAI [23] | 5
NUS-IDS [24] | 5
MindwaveML [25] | 3
MeVer-REBECCA [26] | 2
GVIS | 1
DSGT [27] | 5
APB-UC3M [28] | 5
Total | 29

To create the relevance judgments, three assessors annotated a pool of sentences associated with each symptom. These candidate sentences were obtained by performing top-k pooling from the relevance rankings submitted by the participants. The assessors were given specific instructions (see Figure 2) to determine the relevance of candidate sentences. They considered a sentence relevant if it addressed the topic and provided explicit information about the individual's state in relation to the symptom. This dual concept of relevance (on-topic and reflective of the user's state with respect to the symptom) introduced a higher level of complexity compared to standard relevance assessments. Consequently, we developed a robust annotation methodology and formal assessment guidelines to ensure consistency and accuracy. The main change from eRisk 2023's assessment process was that the assessors were presented with the sentence and its context (previous and following sentences, if available).
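For concreteness, the following minimal sketch parses one submission line in the standard whitespace-separated TREC run format of Figure 1; the dictionary field names are illustrative, not part of the official specification.

def parse_run_line(line: str) -> dict:
    """Parse one TREC-style run line: item, 'Q0', sentence id, rank, score, tag."""
    item, _q0, sentence_id, rank, score, tag = line.split()
    return {
        "bdi_item": int(item),       # BDI-II symptom number (1..21)
        "sentence_id": sentence_id,  # identifier of the ranked sentence
        "rank": int(rank),           # position in the ranking (up to 1000)
        "score": float(score),       # estimated relevance score
        "tag": tag,                  # team/run identifier
    }

print(parse_run_line("1 Q0 251001_0_1 0001 10 myGroupNameMyMethodName"))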
To create the pool of sentences for assessment, we implemented top-k pooling with k = 50 (a small sketch of this step is given after Figure 2). The resulting pool sizes per BDI item are reported in Table 3. The annotation process involved a team of three assessors with different backgrounds and expertise.

Assume you are given a BDI item, e.g.:

15. Loss of Energy
- I have as much energy as ever.
- I have less energy than I used to have.
- I don't have enough energy to do very much.
- I don't have enough energy to do anything.

The task consists of annotating sentences in the collection that are topically-relevant to the item (related to the question and/or to the answers).

Note: A relevant sentence should provide some information about the state of the individual related to the topic of the BDI item. But it is not necessary that the exact same words are used. Assessors should label the relevance of the sentence taking into account its context (preceding and following sentence).

Your job is to assess sentences on how topically-relevant they are for a concrete BDI item. The relevance grades are:

1. Relevant: A relevant sentence should be topically-related to the BDI-item (regardless of the wording) and, additionally, it should refer to the state of the writer about the BDI-item.

0. Non-relevant: A non-relevant sentence does not address any topic related to the question and/or the answers of the BDI-item (or it is related to the topic but does not represent the writer's state about the BDI-item). For example, for BDI-item 15, a sentence that does not talk about the individual's level of energy (regardless of the wording) is a non-relevant sentence.

Examples (assessment of sentences ranked for BDI-item number 15):

I cannot control my energy these days: Relevant
My sister has no energy at all: Non-relevant (because it does not refer to the writer who posted this sentence)
The book was about a highly energetic man: Non-relevant (because it does not refer to the writer who posted this sentence)
I feel more tired than usual: Relevant
The football team is named Top Energy: Non-relevant
I am totally lonely: Non-relevant (it does not mention energy)
I have just recharged my batteries: Relevant
I am lost: Non-relevant

We advise you to not stop the assessment session in the middle of one BDI-item (this helps to maintain consistency in the judgments). Assessors should label the relevance of the sentence taking into account its context (preceding and following sentence). To measure the assessment effort, we ask you to record the time spent on fully evaluating the sentences presented for each BDI-item.

Figure 2: Guidelines for labelling sentences related to depression symptoms (Task 1).
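The pooling step referenced above admits a very small rendering in code (the data structures are illustrative): for each BDI item, the pool to be assessed is the union of the top-50 sentences over all submitted rankings.

def build_pool(rankings, k=50):
    """rankings: one list of sentence ids per submitted run for a BDI item,
    each sorted by estimated relevance. Returns the set of sentences to assess."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])  # keep the top-k sentences of this run
    return pool

run_a = ["s1", "s2", "s3"]
run_b = ["s2", "s4", "s5"]
print(sorted(build_pool([run_a, run_b], k=2)))  # ['s1', 's2', 's4']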
One assessor had professional training in psychology, while the other two were computer science researchers (a postdoctoral fellow and a Ph.D. student) specializing in early risk technologies. To ensure consistency and clarity throughout the process, the lab organizers conducted a preparatory session with the assessors. During this session, an initial version of the guidelines was discussed, and any doubts or questions raised by the assessors were addressed. This collaborative effort resulted in the final version of the guidelines1. According to these guidelines, a sentence is considered relevant only if it provides "some information about the state of the individual related to the topic of the BDI item". This criterion serves as the basis for determining the relevance of sentences during the annotation process. The final outcomes of the annotation process are presented in Table 3, where the number of relevant sentences per BDI item is reported (last two columns). We marked a sentence as relevant following two aggregation criteria: unanimity (3/3 assessors) and majority (2/3 assessors).

1 https://erisk.irlab.org/guidelines_erisk24_task1.html

Table 3: Task 1 (Search for Symptoms of Depression): size of the pool for every BDI item.
BDI Item (#) | pool | # rels (3/3) | # rels (2/3)
Sadness (1) | 783 | 226 | 442
Pessimism (2) | 747 | 122 | 294
Past Failure (3) | 715 | 160 | 270
Loss of Pleasure (4) | 652 | 116 | 196
Guilty Feelings (5) | 737 | 311 | 399
Punishment Feelings (6) | 611 | 87 | 162
Self-Dislike (7) | 730 | 308 | 385
Self-Criticalness (8) | 700 | 187 | 281
Suicidal Thoughts or Wishes (9) | 701 | 326 | 410
Crying (10) | 755 | 311 | 433
Agitation (11) | 758 | 276 | 400
Loss of Interest (12) | 657 | 131 | 211
Indecisiveness (13) | 784 | 164 | 308
Worthlessness (14) | 567 | 222 | 258
Loss of Energy (15) | 609 | 181 | 243
Changes in Sleeping Pattern (16) | 777 | 244 | 365
Irritability (17) | 727 | 192 | 305
Changes in Appetite (18) | 694 | 219 | 334
Concentration Difficulty (19) | 581 | 204 | 286
Tiredness or Fatigue (20) | 682 | 238 | 343
Loss of Interest in Sex (21) | 847 | 137 | 304

2.3. Results

The performance results for the participating systems are shown in Tables 4 (majority-based qrels) and 5 (unanimity-based qrels). The tables report several standard performance metrics: Mean Average Precision (MAP), mean R-Precision, mean Precision at 10 (P@10), and mean NDCG at 1000. Run Config_5, from the NUS-IDS team [24], achieved the top-ranking performance for nearly all metrics and both types of relevance judgments. It consists of an ensemble model designed to compute semantic similarity with respect to different expanded descriptions of the BDI symptoms. The ensemble leverages three pre-trained language models: all-mpnet-base-v2, all-MiniLM-L12-v2, and all-distilroberta-v1. This approach is similar to the proposal of the APB-UC3M team [28], which achieved the best results in terms of P@10 under majority voting. In contrast, the MeVer-REBECCA team [26] opted for the bge-small-en-v1.5 embedding model, attaining the highest P@10 scores in the unanimity case.
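To illustrate the semantic-similarity ranking strategy shared by several of these runs, the sketch below scores corpus sentences against a set of symptom-related reference sentences with one of the cited models. This is an illustrative reconstruction under stated assumptions, not the participants' actual code: it requires the sentence-transformers package, and the reference and corpus sentences are made up.

from sentence_transformers import SentenceTransformer, util

# One of the pre-trained models mentioned above.
model = SentenceTransformer("all-mpnet-base-v2")

# Illustrative reference sentences for BDI item 15 (Loss of Energy).
reference = [
    "I have no energy to do anything lately.",
    "I feel drained all day long.",
]
corpus = [
    "I have just recharged my batteries.",
    "The football team is named Top Energy.",
]

ref_emb = model.encode(reference, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Average cosine similarity of each corpus sentence to the reference set,
# sorted in descending order to obtain the ranking for this symptom.
scores = util.cos_sim(corpus_emb, ref_emb).mean(dim=1)
for sentence, score in sorted(zip(corpus, scores.tolist()), key=lambda t: -t[1]):
    print(f"{score:.3f}  {sentence}")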
3. Task 2: Early Detection of Signs of Anorexia

This task is the third edition of the challenge to develop models for the early identification of anorexia signs. The goal was to process evidence sequentially and detect early indications of anorexia as soon as possible. Participating systems analyzed user posts on social media in the order they were written. Successful outcomes from this task could be used for the sequential monitoring of user interactions across various online platforms, such as blogs, social networks, and other digital media.

The test collection used for this task followed the format described by Losada and Crestani [29]. It contains writings (posts and comments) from a selected group of social media users. Users are categorized into two groups: anorexia and non-anorexia. For each user, the collection contains a sequence of writings arranged in chronological order. To facilitate the task and ensure a uniform distribution of the data, we established a server that systematically provided user writings to the participating teams. Further details about the server's setup are available on the lab's official website2.

2 https://early.irlab.org/server.html

Table 4: Ranking-based evaluation for Task 1 (majority voting).
Team | Run | MAP | R-PREC | P@10 | NDCG
ThinkIR | BM25Similarity | 0.203 | 0.258 | 0.881 | 0.410
SINAI | SINAI_DR_majority_daug | 0.064 | 0.107 | 0.562 | 0.174
SINAI | GPT3-Insight-8 | 0.008 | 0.024 | 0.200 | 0.044
RELAI | RELAI_paraphrase-MiniLM-L12-v2 | 0.267 | 0.346 | 0.738 | 0.525
RELAI | RELAI_paraphrase-MiniLM-L6-v2 | 0.236 | 0.325 | 0.590 | 0.503
RELAI | RELAI_all-MiniLM-L6-v2-simcse | 0.226 | 0.322 | 0.595 | 0.495
RELAI | tfidf_sgd | 0.163 | 0.240 | 0.552 | 0.394
RELAI | RELAI_word2vec | 0.000 | 0.000 | 0.000 | 0.000
NUS-IDS | Config_5 | 0.375 | 0.434 | 0.924 | 0.631
NUS-IDS | Config_2 | 0.352 | 0.415 | 0.881 | 0.616
NUS-IDS | Config_4 | 0.336 | 0.401 | 0.890 | 0.599
NUS-IDS | Config_1 | 0.312 | 0.386 | 0.871 | 0.576
NUS-IDS | Config_3 | 0.286 | 0.359 | 0.857 | 0.556
MindwaveML | Mindwave-MLMiniLML12MLP_weighted | 0.159 | 0.240 | 0.567 | 0.396
MindwaveML | Mindwave-MLMiniLML12MLP_0.5 | 0.149 | 0.231 | 0.538 | 0.378
MindwaveML | Mindwave-MLMiniLML12 | 0.133 | 0.212 | 0.490 | 0.335
MeVer-REBECCA | Transformer-Embeddings_CosineSimilarity_gpt | 0.301 | 0.340 | 0.981 | 0.506
MeVer-REBECCA | Transformer-Embeddings_CosineSimilarity | 0.295 | 0.332 | 0.976 | 0.517
GVIS | GVIS | 0.000 | 0.002 | 0.035 | 0.005
DSGT | logistic_transformer_v5 | 0.000 | 0.009 | 0.000 | 0.014
DSGT | logistic_word2vec_v5 | 0.000 | 0.001 | 0.000 | 0.003
DSGT | count_logistic | 0.000 | 0.000 | 0.000 | 0.001
DSGT | count_nb | 0.000 | 0.000 | 0.000 | 0.000
DSGT | word2vec_logistic | 0.000 | 0.000 | 0.000 | 0.000
APB-UC3M | APB-UC3M_sentsim-all-MiniLM-L6-v2 | 0.354 | 0.391 | 0.986 | 0.591
APB-UC3M | APB-UC3M_sentsim-all-MiniLM-L12-v2 | 0.337 | 0.378 | 0.990 | 0.564
APB-UC3M | APB-UC3M_sentsim-all-mpnet-base-v2 | 0.293 | 0.330 | 0.967 | 0.525
APB-UC3M | APB-UC3M_ensemble | 0.057 | 0.120 | 0.324 | 0.191
APB-UC3M | APB-UC3M_classifier_roberta-base-go_emotions | 0.056 | 0.118 | 0.371 | 0.206

This was a train-test task. During the training stage, teams had access to the entire history of writings of the training users, and we indicated which users had explicitly mentioned being diagnosed with anorexia. Participants could tune their systems with this training data. In 2024, the training data included the users from previous editions of the anorexia task (2018 and 2019). During the test stage, participants connected to our server and engaged in an iterative process of receiving user writings and sending their responses. At any point within the chronology of user writings, participants could halt the process and issue an alert.
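The iterative test protocol can be pictured with the following hypothetical client sketch. The endpoint paths, payload fields, and the trivial risk_score function are all illustrative assumptions (the actual REST API is documented on the lab website); only the general interaction pattern, in which the server withholds new writings until decisions are received, follows the description above.

import requests

SERVER = "https://example.org/erisk"  # placeholder base URL (hypothetical)
TOKEN = "my-team-token"               # placeholder team token (hypothetical)

def risk_score(text: str) -> float:
    # Placeholder scorer: a real system would apply its trained model here.
    return 1.0 if "anorexia" in text.lower() else 0.0

while True:
    # Fetch the next round of writings (one new post/comment per user).
    writings = requests.get(f"{SERVER}/getwritings/{TOKEN}").json()
    if not writings:
        break  # no more writings: the test stage is over

    decisions = []
    for w in writings:
        score = risk_score(w["content"])
        decisions.append({
            "nick": w["nick"],                    # user identifier
            "decision": 1 if score > 0.5 else 0,  # 1 = alert (final), 0 = no alert
            "score": score,                       # used for the ranking-based evaluation
        })

    # One decision per user must be sent before the server releases more data.
    requests.post(f"{SERVER}/submit/{TOKEN}", json=decisions)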
After reading each user writing, teams had to decide between two options: (i) alerting about the user, indicating a predicted sign of anorexia, or (ii) not alerting about the user. Participants made this choice independently for each user in the test split. Once an alert was issued, it was final, and no further decisions regarding that individual were considered. Conversely, the absence of an alert was non-final, allowing participants to submit an alert later if they detected signs of risk. We evaluated the systems' performance using two indicators: the accuracy of the decisions made and the number of user writings required to reach those decisions. These criteria provide insights into the effectiveness and the efficiency of the systems.

To support the test stage, we deployed a REST service. The server iteratively distributed user writings and waited for responses from participants. New user data was not provided to a participant until the service received a decision from that team. The submission period for the task was open from February 5th, 2024, until April 12th, 2024.

Table 5: Ranking-based evaluation for Task 1 (unanimity).
Team | Run | MAP | R-PREC | P@10 | NDCG
ThinkIR | BM25Similarity | 0.174 | 0.246 | 0.652 | 0.417
SINAI | SINAI_DR_majority_daug | 0.046 | 0.098 | 0.362 | 0.150
SINAI | GPT3-Insight-8 | 0.001 | 0.009 | 0.052 | 0.014
RELAI | RELAI_paraphrase-MiniLM-L12-v2 | 0.248 | 0.329 | 0.576 | 0.537
RELAI | RELAI_paraphrase-MiniLM-L6-v2 | 0.207 | 0.287 | 0.410 | 0.509
RELAI | RELAI_all-MiniLM-L6-v2-simcse | 0.194 | 0.275 | 0.433 | 0.499
RELAI | tfidf_sgd | 0.138 | 0.207 | 0.376 | 0.383
RELAI | RELAI_word2vec | 0.000 | 0.000 | 0.000 | 0.000
NUS-IDS | Config_5 | 0.392 | 0.436 | 0.795 | 0.692
NUS-IDS | Config_2 | 0.370 | 0.431 | 0.752 | 0.677
NUS-IDS | Config_4 | 0.358 | 0.416 | 0.771 | 0.662
NUS-IDS | Config_1 | 0.329 | 0.391 | 0.786 | 0.636
NUS-IDS | Config_3 | 0.312 | 0.375 | 0.757 | 0.621
MindwaveML | Mindwave-MLMiniLML12MLP_weighted | 0.158 | 0.238 | 0.471 | 0.427
MindwaveML | Mindwave-MLMiniLML12MLP_0.5 | 0.147 | 0.227 | 0.457 | 0.408
MindwaveML | Mindwave-MLMiniLML12 | 0.128 | 0.203 | 0.410 | 0.360
MeVer-REBECCA | Transformer-Embeddings_CosineSimilarity_gpt | 0.305 | 0.357 | 0.833 | 0.551
MeVer-REBECCA | Transformer-Embeddings_CosineSimilarity | 0.294 | 0.349 | 0.824 | 0.556
GVIS | GVIS | 0.000 | 0.002 | 0.030 | 0.004
DSGT | logistic_transformer_v5 | 0.000 | 0.006 | 0.000 | 0.010
DSGT | logistic_word2vec_v5 | 0.000 | 0.001 | 0.000 | 0.003
DSGT | count_logistic | 0.000 | 0.000 | 0.000 | 0.000
DSGT | count_nb | 0.000 | 0.000 | 0.000 | 0.000
DSGT | word2vec_logistic | 0.000 | 0.000 | 0.000 | 0.000
APB-UC3M | APB-UC3M_sentsim-all-MiniLM-L6-v2 | 0.345 | 0.407 | 0.829 | 0.630
APB-UC3M | APB-UC3M_sentsim-all-MiniLM-L12-v2 | 0.333 | 0.389 | 0.805 | 0.608
APB-UC3M | APB-UC3M_sentsim-all-mpnet-base-v2 | 0.285 | 0.342 | 0.776 | 0.561
APB-UC3M | APB-UC3M_ensemble | 0.052 | 0.106 | 0.248 | 0.193
APB-UC3M | APB-UC3M_classifier_roberta-base-go_emotions | 0.033 | 0.084 | 0.190 | 0.169

Table 6: Task 2 (anorexia): main statistics of the test collection.
| Anorexia | Control
Num. subjects | 92 | 692
Num. submissions (posts & comments) | 28,043 | 338,843
Avg num. of submissions per subject | 304.8 | 489.6
Avg num. of days from first to last submission | ≈ 482 | ≈ 971
Avg num. words per submission | 28.5 | 21.4

To construct the ground-truth assessments, we adopted established approaches to optimize the use of assessors' time, as documented in previous studies [30, 31]. These methods employ simulated pooling strategies to create effective test collections. The main statistics of the test collection used for Task 2 are presented in Table 6.

3.1. Decision-based Evaluation

This evaluation approach uses the binary decisions made by the participating systems for each user.
In addition to standard classification measures such as Precision, Recall, and F1 (computed with respect to the positive class), we also calculate ERDE (Early Risk Detection Error), used in previous editions of the lab. A detailed description of ERDE was presented by Losada and Crestani [29]. ERDE is an error measure that incorporates a penalty for delayed correct alerts (true positives). The penalty increases with the delay in issuing the alert, measured by the number of user posts processed before making the alert.

Since 2019, we have complemented the evaluation report with additional decision-based metrics that try to capture further aspects of the problem. These metrics try to overcome some limitations of ERDE, namely:

• the penalty associated with true positives goes quickly to 1, due to the functional form (sigmoid) of the cost function;
• a perfect system, which detects the true positive case right after the first round of messages (first chunk), does not get an error equal to 0;
• with a method based on releasing data in a chunk-based way (as was done in 2017 and 2018), the contribution of each user to the performance evaluation has a large variance (different for users with few writings per chunk vs. users with many writings per chunk);
• ERDE is not interpretable.

Some research teams have analysed these issues and proposed alternative ways of evaluation. Trotzek and colleagues [32] proposed ERDE_o%, a variant of ERDE that does not depend on the number of user writings seen before the alert but, instead, on the percentage of user writings seen before the alert. In this way, users' contributions to the evaluation are normalized (all users weigh the same). However, ERDE_o% has an important limitation: in real-life applications, the overall number of user writings is not known in advance. Social media users post content online, and screening tools have to make predictions with the evidence seen so far. In practice, one does not know when (and if) a user's thread of messages is exhausted. Thus, the performance metric should not depend on knowledge about the total number of user writings.

Another alternative evaluation metric for early risk prediction was proposed by Sadeque and colleagues [33]: F_latency, which fits better with our purposes. This measure is described next. Imagine a user u ∈ U and an early risk detection system that iteratively analyzes u's writings (e.g., in chronological order, as they appear in social media) and, after analyzing k_u user writings (k_u ≥ 1), takes a binary decision d_u ∈ {0, 1}, which represents the decision of the system about the user being a risk case. By g_u ∈ {0, 1}, we refer to the user's golden truth label. A key component of an early risk evaluation should be the delay in detecting true positives (we do not want systems to detect these cases too late). Therefore, a first and intuitive measure of delay can be defined as follows3:

latency_TP = median{k_u : u ∈ U, d_u = g_u = 1}    (1)

This measure of latency is calculated over the true positives detected by the system and assesses the system's delay based on the median number of writings that the system had to process to detect such positive cases. This measure can be included in the experimental report together with standard measures such as Precision (P), Recall (R) and the F-measure (F):

P = |{u ∈ U : d_u = g_u = 1}| / |{u ∈ U : d_u = 1}|    (2)

3 Observe that Sadeque et al. (see [33], pg. 497) computed the latency for all users such that g_u = 1.
We argue that latency should be computed only for the true positives. The false negatives (g_u = 1, d_u = 0) are not detected by the system and, therefore, would not generate an alert.

R = |{u ∈ U : d_u = g_u = 1}| / |{u ∈ U : g_u = 1}|    (3)

F = 2 · P · R / (P + R)    (4)

Furthermore, Sadeque et al. proposed a measure, F_latency, which combines the effectiveness of the decision (estimated with the F measure) and the delay4 in the decision. This is calculated by multiplying F by a penalty factor based on the median delay. More specifically, each individual (true positive) decision, taken after reading k_u writings, is assigned the following penalty:

penalty(k_u) = -1 + 2 / (1 + exp(-p · (k_u - 1)))    (5)

where p is a parameter that determines how quickly the penalty should increase. In [33], p was set such that the penalty equals 0.5 at the median number of posts of a user5. Observe that a decision right after the first writing has no penalty (i.e., penalty(1) = 0).

Figure 3: Latency penalty as a function of the number of observed writings (k_u).

The system's overall speed factor is computed as:

speed = 1 - median{penalty(k_u) : u ∈ U, d_u = g_u = 1}    (6)

where speed equals 1 for a system whose true positives are detected right at the first writing. A slow system, which detects true positives after hundreds of writings, will be assigned a speed score near 0. Finally, the latency-weighted F score is simply:

F_latency = F · speed    (7)

Since 2019, users' data have been processed by the participants on a post-by-post basis (i.e., we avoided a chunk-based release of data). Under these conditions, the evaluation approach has the following properties:

• smooth growth of penalties;
• a perfect system gets F_latency = 1;
• for each user u, the system can opt to stop at any point k_u and, therefore, we no longer have the effect of an imbalanced importance of users;
• F_latency is more interpretable than ERDE.

4 Again, we adopt Sadeque et al.'s proposal, but we estimate latency only over the true positives.
5 In the evaluation we set p to 0.0078, a setting obtained from the eRisk 2017 collection.

Table 7: Task 2 (anorexia): participating teams, number of runs, number of user writings processed by the team, and lapse of time taken for the entire process.
team | #runs | #user writings processed | lapse of time (from 1st to last response)
BioNLP-IISERB [34] | 5 | 10 | 09:39
GVIS | 5 | 352 | 3 days 12:36
Riewe-Perla [35] | 5 | 2001 | 2 days 11:25
UNSL [36] | 3 | 2001 | 07:00
UMU [37] | 5 | 2001 | 06:34
COS-470-Team-2 | 5 | 1 | -
ELiRF-UPV [38] | 4 | 2001 | 12:27
NLP-UNED [39] | 5 | 2001 | 09:40
SINAI [22] | 5 | 2001 | 3 days 23:49
APB-UC3M [28] | 2 | 2001 | 6 days 21:34
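As a concrete rendering of Equations (5)-(7), the following minimal sketch computes the penalty, speed, and latency-weighted F score, with p = 0.0078 as stated in footnote 5; the example F value and delays are made up.

import math
from statistics import median

P = 0.0078  # penalty steepness, as set from the eRisk 2017 collection

def penalty(k: int, p: float = P) -> float:
    """Latency penalty after reading k writings (Equation 5)."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))

def f_latency(f_measure: float, tp_delays: list) -> float:
    """Latency-weighted F (Equations 6 and 7).
    tp_delays: writings read before each true-positive alert."""
    speed = 1.0 - median([penalty(k) for k in tp_delays])
    return f_measure * speed

print(penalty(1))                  # 0.0: alerting after the first writing costs nothing
print(f_latency(0.63, [6, 6, 8]))  # illustrative F measure and delays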
3.2. Ranking-based Evaluation

In addition to the evaluation discussed above, we employed an alternative form of evaluation to further assess the systems. After each data release (i.e., each new user writing, be it a post or a comment), participants were required to provide the following information for each user in the collection:

• A decision for the user (alert or no alert), which was used to calculate the decision-based metrics discussed previously.
• A score representing the user's level of risk, estimated from the evidence observed thus far.

The scores were used to create a ranking of users in descending order of estimated risk. For each participating system, a ranking was generated at each data release point, simulating a continuous re-ranking approach based on the observed evidence. In a real-life scenario, this ranking would be presented to an expert who could make decisions based on it (e.g., by inspecting the top of the ranking). Each ranking can be evaluated with standard ranking metrics, such as P@10 or NDCG. Therefore, we report the performance of the systems based on the rankings obtained after observing different numbers of writings.
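A minimal sketch of this evaluation step, with made-up inputs: at a given release point, users are sorted by their risk scores and the resulting ranking is scored with P@10 (other metrics such as NDCG follow the same pattern).

def precision_at_k(ranked_users, positive_users, k=10):
    """Fraction of the top-k ranked users that are true risk cases."""
    top = ranked_users[:k]
    return sum(1 for u in top if u in positive_users) / k

# Risk scores for each user at some data release point (illustrative).
scores = {"u1": 0.10, "u2": 0.95, "u3": 0.40, "u4": 0.80, "u5": 0.05}
ranking = sorted(scores, key=scores.get, reverse=True)  # descending risk
print(precision_at_k(ranking, positive_users={"u2", "u3"}, k=2))  # 0.5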
3.3. Results

Table 7 shows the participating teams, the number of runs submitted, and the approximate lapse of time from the first response to the last response. This time lapse indicates the degree of automation of each team's algorithms. Many of the submitted runs processed the entire thread of messages (2001 writings), but a few variants stopped earlier. Five teams processed the thread of messages reasonably fast (less than a day for the entire history of user messages), while the rest took several days to run the whole process.

Table 8 reports the decision-based performance achieved by the participating teams. In terms of F1 and latency-weighted F1, the best-performing team was NLP-UNED [39] (run 1), while Riewe-Perla [35] submitted the best run (run 0) in terms of the ERDE metrics. The majority of teams made quick decisions. Overall, these findings indicate that some systems achieved a relatively high level of effectiveness after only a few user submissions. Social and public health systems may use the best predictive algorithms to assist human experts in detecting signs of anorexia as early as possible.

Table 9 presents the ranking-based results. UNSL [36] (run 1) obtained the best overall values after only one writing, while NLP-UNED [39] (run 3) obtained the highest scores after 100 writings. These two teams also contributed the best-performing variants at the 500 and 1000 writing cutoffs.

Table 8: Decision-based evaluation for Task 2.
Team | Run | P | R | F1 | ERDE_5 | ERDE_50 | latency_TP | speed | latency-weighted F1
BioNLP-IISERB | 0 | 0.53 | 0.23 | 0.32 | 0.10 | 0.09 | 2.00 | 1.00 | 0.32
BioNLP-IISERB | 1 | 0.54 | 0.75 | 0.62 | 0.08 | 0.04 | 4.00 | 0.99 | 0.62
BioNLP-IISERB | 2 | 0.58 | 0.16 | 0.25 | 0.10 | 0.10 | 1.00 | 1.00 | 0.25
BioNLP-IISERB | 3 | 0.67 | 0.51 | 0.58 | 0.08 | 0.06 | 3.00 | 0.99 | 0.58
BioNLP-IISERB | 4 | 0.73 | 0.62 | 0.67 | 0.08 | 0.05 | 4.00 | 0.99 | 0.66
GVIS | 0 | 0.12 | 1.00 | 0.21 | 0.12 | 0.10 | 1.00 | 1.00 | 0.21
GVIS | 1 | 0.12 | 1.00 | 0.22 | 0.12 | 0.10 | 1.00 | 1.00 | 0.22
GVIS | 2 | 0.12 | 1.00 | 0.22 | 0.12 | 0.10 | 1.00 | 1.00 | 0.22
GVIS | 3 | 0.12 | 1.00 | 0.22 | 0.12 | 0.10 | 1.00 | 1.00 | 0.22
GVIS | 4 | 0.12 | 1.00 | 0.22 | 0.12 | 0.10 | 1.00 | 1.00 | 0.22
Riewe-Perla | 0 | 0.45 | 0.97 | 0.62 | 0.07 | 0.02 | 6.00 | 0.98 | 0.60
Riewe-Perla | 1 | 0.47 | 0.95 | 0.63 | 0.10 | 0.03 | 6.00 | 0.98 | 0.62
Riewe-Perla | 2 | 0.47 | 0.95 | 0.63 | 0.10 | 0.03 | 6.00 | 0.98 | 0.62
Riewe-Perla | 3 | 0.47 | 0.95 | 0.63 | 0.10 | 0.03 | 6.00 | 0.98 | 0.62
Riewe-Perla | 4 | 0.47 | 0.95 | 0.63 | 0.10 | 0.03 | 6.00 | 0.98 | 0.62
UNSL | 0 | 0.35 | 0.99 | 0.52 | 0.14 | 0.03 | 12.00 | 0.96 | 0.49
UNSL | 1 | 0.42 | 0.96 | 0.59 | 0.14 | 0.03 | 12.00 | 0.96 | 0.56
UNSL | 2 | 0.42 | 0.97 | 0.59 | 0.14 | 0.03 | 12.00 | 0.96 | 0.56
UMU | 0 | 0.14 | 0.99 | 0.25 | 0.20 | 0.09 | 18.00 | 0.93 | 0.23
UMU | 1 | 0.15 | 0.99 | 0.26 | 0.19 | 0.09 | 27.00 | 0.90 | 0.24
UMU | 2 | 0.14 | 0.99 | 0.25 | 0.20 | 0.09 | 19.00 | 0.93 | 0.23
UMU | 3 | 0.15 | 0.99 | 0.27 | 0.19 | 0.09 | 28.00 | 0.90 | 0.24
UMU | 4 | 0.16 | 0.98 | 0.27 | 0.19 | 0.10 | 35.50 | 0.87 | 0.23
COS-470-Team-2 | 0 | 0.00 | 0.00 | 0.00 | 0.12 | 0.12 | - | - | -
COS-470-Team-2 | 1 | 0.00 | 0.00 | 0.00 | 0.12 | 0.12 | - | - | -
COS-470-Team-2 | 2 | 0.00 | 0.00 | 0.00 | 0.12 | 0.12 | - | - | -
COS-470-Team-2 | 3 | 0.00 | 0.00 | 0.00 | 0.12 | 0.12 | - | - | -
COS-470-Team-2 | 4 | 0.00 | 0.00 | 0.00 | 0.12 | 0.12 | - | - | -
ELiRF-UPV | 0 | 0.43 | 0.99 | 0.60 | 0.10 | 0.04 | 12.00 | 0.96 | 0.57
ELiRF-UPV | 1 | 0.41 | 1.00 | 0.58 | 0.10 | 0.04 | 12.00 | 0.96 | 0.56
ELiRF-UPV | 2 | 0.32 | 0.99 | 0.49 | 0.12 | 0.04 | 10.00 | 0.96 | 0.47
ELiRF-UPV | 3 | 0.43 | 0.99 | 0.60 | 0.11 | 0.04 | 15.00 | 0.94 | 0.57
NLP-UNED | 0 | 0.64 | 0.97 | 0.77 | 0.09 | 0.04 | 13.00 | 0.95 | 0.73
NLP-UNED | 1 | 0.67 | 0.97 | 0.79 | 0.09 | 0.04 | 14.00 | 0.95 | 0.75
NLP-UNED | 2 | 0.63 | 0.97 | 0.76 | 0.09 | 0.04 | 12.00 | 0.96 | 0.73
NLP-UNED | 3 | 0.63 | 0.98 | 0.77 | 0.09 | 0.03 | 11.00 | 0.96 | 0.74
NLP-UNED | 4 | 0.63 | 0.97 | 0.76 | 0.09 | 0.04 | 14.00 | 0.95 | 0.72
SINAI | 0 | 0.21 | 0.92 | 0.34 | 0.10 | 0.07 | 3.00 | 0.99 | 0.34
SINAI | 1 | 0.21 | 0.92 | 0.34 | 0.10 | 0.07 | 3.00 | 0.99 | 0.34
SINAI | 2 | 0.21 | 0.92 | 0.34 | 0.10 | 0.07 | 3.00 | 0.99 | 0.34
SINAI | 3 | 0.12 | 1.00 | 0.21 | 0.13 | 0.10 | 2.00 | 1.00 | 0.21
SINAI | 4 | 0.12 | 1.00 | 0.21 | 0.13 | 0.10 | 2.00 | 1.00 | 0.21
APB-UC3M | 0 | 0.17 | 0.99 | 0.28 | 0.15 | 0.08 | 9.00 | 0.97 | 0.28
APB-UC3M | 1 | 0.15 | 0.99 | 0.26 | 0.13 | 0.09 | 2.00 | 1.00 | 0.26

4. Task 3: Measuring the Severity of Eating Disorders

The objective of this task is to estimate the severity of various symptoms related to the diagnosis of eating disorders. Participants were provided with a thread of user submissions to work with. For each user, a history of posts and comments from social media was given, and participants had to estimate the user's responses to a standardized eating disorder questionnaire based on the evidence found in that history. The questionnaire used in the task is derived from the Eating Disorder Examination Questionnaire (EDE-Q)6, a self-reported questionnaire consisting of 28 items. It is adapted from the semi-structured interview Eating Disorder Examination (EDE)7 [40]. For this task, we focused on questions 1-12 and 19-28 of the EDE-Q.
This questionnaire is designed to assess the various aspects and the severity of features associated with eating disorders.

6 https://www.corc.uk.net/media/1273/ede-q_quesionnaire.pdf
7 https://www.corc.uk.net/media/1951/ede_170d.pdf

Table 9: Ranking-based evaluation for Task 2. For each cutoff (1, 100, 500, and 1000 writings), the columns report P@10, NDCG@10, and NDCG@100.
Team | Run | 1 writing | 100 writings | 500 writings | 1000 writings
BioNLP-IISERB | 0 | 0.10 0.19 0.06 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
BioNLP-IISERB | 1 | 0.00 0.00 0.07 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
BioNLP-IISERB | 2 | 0.00 0.00 0.05 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
BioNLP-IISERB | 3 | 0.10 0.06 0.09 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
BioNLP-IISERB | 4 | 0.20 0.21 0.10 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
GVIS | 0 | 0.40 0.37 0.40 | 0.20 0.18 0.23 | 0.00 0.00 0.00 | 0.00 0.00 0.00
GVIS | 1 | 0.40 0.37 0.40 | 0.30 0.32 0.42 | 0.00 0.00 0.00 | 0.00 0.00 0.00
GVIS | 2 | 0.40 0.37 0.40 | 0.30 0.32 0.42 | 0.00 0.00 0.00 | 0.00 0.00 0.00
GVIS | 3 | 0.40 0.37 0.40 | 0.30 0.32 0.42 | 0.00 0.00 0.00 | 0.00 0.00 0.00
GVIS | 4 | 0.40 0.37 0.40 | 0.30 0.32 0.42 | 0.00 0.00 0.00 | 0.00 0.00 0.00
Riewe-Perla | 0 | 0.50 0.47 0.17 | 0.70 0.62 0.74 | 0.70 0.62 0.74 | 0.70 0.62 0.75
Riewe-Perla | 1 | 0.50 0.47 0.17 | 0.70 0.62 0.74 | 0.70 0.62 0.74 | 0.70 0.62 0.75
Riewe-Perla | 2 | 0.50 0.47 0.17 | 0.70 0.62 0.74 | 0.70 0.62 0.74 | 0.70 0.62 0.75
Riewe-Perla | 3 | 0.50 0.47 0.17 | 0.70 0.62 0.74 | 0.70 0.62 0.74 | 0.70 0.62 0.75
Riewe-Perla | 4 | 0.50 0.47 0.17 | 0.70 0.62 0.74 | 0.70 0.62 0.74 | 0.70 0.62 0.75
UNSL | 0 | 0.90 0.81 0.63 | 1.00 1.00 0.81 | 1.00 1.00 0.77 | 1.00 1.00 0.76
UNSL | 1 | 1.00 1.00 0.69 | 1.00 1.00 0.80 | 0.90 0.81 0.69 | 0.80 0.88 0.72
UNSL | 2 | 0.40 0.38 0.42 | 0.90 0.92 0.71 | 0.80 0.85 0.69 | 0.80 0.84 0.68
UMU | 0 | 0.20 0.12 0.14 | 0.10 0.06 0.03 | 0.00 0.00 0.05 | 0.20 0.21 0.12
UMU | 1 | 0.20 0.12 0.14 | 0.10 0.06 0.03 | 0.00 0.00 0.05 | 0.20 0.21 0.12
UMU | 2 | 0.20 0.12 0.14 | 0.00 0.00 0.02 | 0.00 0.00 0.06 | 0.00 0.00 0.06
UMU | 3 | 0.20 0.12 0.14 | 0.00 0.00 0.02 | 0.00 0.00 0.06 | 0.00 0.00 0.06
UMU | 4 | 0.20 0.12 0.14 | 0.00 0.00 0.02 | 0.00 0.00 0.06 | 0.00 0.00 0.06
COS-470-Team-2 | 0 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
COS-470-Team-2 | 1 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
COS-470-Team-2 | 2 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
COS-470-Team-2 | 3 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
COS-470-Team-2 | 4 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
ELiRF-UPV | 0 | 0.20 0.12 0.14 | 0.20 0.13 0.14 | 0.20 0.13 0.14 | 0.20 0.13 0.14
ELiRF-UPV | 1 | 0.10 0.19 0.17 | 0.20 0.14 0.15 | 0.20 0.25 0.14 | 0.10 0.19 0.11
ELiRF-UPV | 2 | 0.10 0.07 0.13 | 0.20 0.14 0.15 | 0.20 0.25 0.14 | 0.10 0.06 0.10
ELiRF-UPV | 3 | 0.00 0.00 0.11 | 0.20 0.14 0.15 | 0.20 0.25 0.14 | 0.10 0.06 0.10
NLP-UNED | 0 | 1.00 1.00 0.44 | 1.00 1.00 0.89 | 1.00 1.00 0.91 | 1.00 1.00 0.91
NLP-UNED | 1 | 1.00 1.00 0.44 | 1.00 1.00 0.89 | 1.00 1.00 0.92 | 1.00 1.00 0.92
NLP-UNED | 2 | 1.00 1.00 0.44 | 1.00 1.00 0.89 | 1.00 1.00 0.91 | 1.00 1.00 0.91
NLP-UNED | 3 | 1.00 1.00 0.45 | 1.00 1.00 0.91 | 1.00 1.00 0.91 | 1.00 1.00 0.89
NLP-UNED | 4 | 1.00 1.00 0.44 | 1.00 1.00 0.89 | 1.00 1.00 0.91 | 1.00 1.00 0.91
SINAI | 0 | 0.00 0.00 0.07 | 0.00 0.00 0.02 | 0.00 0.00 0.02 | 0.00 0.00 0.03
SINAI | 1 | 0.00 0.00 0.07 | 0.00 0.00 0.02 | 0.00 0.00 0.02 | 0.00 0.00 0.03
SINAI | 2 | 0.00 0.00 0.07 | 0.00 0.00 0.02 | 0.00 0.00 0.02 | 0.00 0.00 0.03
SINAI | 3 | 0.00 0.00 0.07 | 0.10 0.07 0.06 | 0.00 0.00 0.07 | 0.00 0.00 0.07
SINAI | 4 | 0.00 0.00 0.07 | 0.10 0.07 0.06 | 0.00 0.00 0.07 | 0.00 0.00 0.07
APB-UC3M | 0 | 0.00 0.00 0.03 | 0.40 0.56 0.26 | 0.00 0.00 0.09 | 0.00 0.00 0.13
APB-UC3M | 1 | 0.10 0.06 0.07 | 0.00 0.00 0.18 | 0.00 0.00 0.10 | 0.00 0.00 0.08
It includes four subscales (Restraint, Eating Concern, Shape Concern, and Weight Concern) along with a global score. Table 10 shows an excerpt of the EDE-Q.

Table 10: Excerpt of the Eating Disorder Examination Questionnaire.

Instructions: The following questions are concerned with the past four weeks (28 days) only. Please read each question carefully. Please answer all the questions. Thank you.

1. Have you been deliberately trying to limit the amount of food you eat to influence your shape or weight (whether or not you have succeeded)?
0. NO DAYS / 1. 1-5 DAYS / 2. 6-12 DAYS / 3. 13-15 DAYS / 4. 16-22 DAYS / 5. 23-27 DAYS / 6. EVERY DAY

2. Have you gone for long periods of time (8 waking hours or more) without eating anything at all in order to influence your shape or weight?
0. NO DAYS / 1. 1-5 DAYS / 2. 6-12 DAYS / 3. 13-15 DAYS / 4. 16-22 DAYS / 5. 23-27 DAYS / 6. EVERY DAY

3. Have you tried to exclude from your diet any foods that you like in order to influence your shape or weight (whether or not you have succeeded)?
0. NO DAYS / 1. 1-5 DAYS / 2. 6-12 DAYS / 3. 13-15 DAYS / 4. 16-22 DAYS / 5. 23-27 DAYS / 6. EVERY DAY

...

22. Has your weight influenced how you think about (judge) yourself as a person?
0. NOT AT ALL (0) / 1. SLIGHTLY (1) / 2. SLIGHTLY (2) / 3. MODERATELY (3) / 4. MODERATELY (4) / 5. MARKEDLY (5) / 6. MARKEDLY (6)

23. Has your shape influenced how you think about (judge) yourself as a person?
0. NOT AT ALL (0) / 1. SLIGHTLY (1) / 2. SLIGHTLY (2) / 3. MODERATELY (3) / 4. MODERATELY (4) / 5. MARKEDLY (5) / 6. MARKEDLY (6)

24. How much would it have upset you if you had been asked to weigh yourself once a week (no more, or less, often) for the next four weeks?
0. NOT AT ALL (0) / 1. SLIGHTLY (1) / 2. SLIGHTLY (2) / 3. MODERATELY (3) / 4. MODERATELY (4) / 5. MARKEDLY (5) / 6. MARKEDLY (6)

The primary objective of this task was to explore the possibility of automatically estimating the severity of multiple symptoms related to eating disorders. The algorithms were required to estimate the user's response to each individual question based on the user's writing history. To evaluate the performance of the participating systems, we collected questionnaires completed by social media users along with their corresponding writing histories. The user-completed questionnaires serve as the ground truth against which the responses provided by the systems are evaluated.

During the training phase, participants were provided with data from 28 users from the 2022 edition and 46 users from the 2023 edition. This training data included the writing history of the users as well as their responses to the EDE-Q questions. In the test phase, there were 18 new users for whom the participating systems had to generate results. The results were expected to follow this file structure:

username1 answer1 answer2 ... answer12 answer19 ... answer28
username2 answer1 answer2 ... answer12 answer19 ... answer28
...

Each line contains the username and 22 values (there are no answers for questions 13 to 18). These values correspond to the responses to the questions above (the possible values are 0, 1, 2, 3, 4, 5, and 6).
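A minimal sketch of this output format (the function and variable names are illustrative): each submission line concatenates the username with the predicted answers to the 22 EDE-Q items used in the task.

QUESTIONS = list(range(1, 13)) + list(range(19, 29))  # items 1-12 and 19-28

def format_line(username, answers):
    """answers: dict mapping question number to a predicted value in 0..6."""
    assert all(0 <= answers[q] <= 6 for q in QUESTIONS)
    return " ".join([username] + [str(answers[q]) for q in QUESTIONS])

dummy = {q: 3 for q in QUESTIONS}  # dummy predictions for one user
print(format_line("username1", dummy))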
4.1. Evaluation Metrics

Evaluation is based on the following effectiveness metrics:

• Mean Zero-One Error (MZOE) between the questionnaire filled in by the real user and the questionnaire filled in by the system (i.e., the fraction of incorrect predictions):

MZOE(f, Q) = |{q_i ∈ Q : R(q_i) ≠ f(q_i)}| / |Q|    (8)

where f denotes the classification done by an automatic system, Q is the set of questions of each questionnaire, q_i is the i-th question, R(q_i) is the real user's answer for the i-th question, and f(q_i) is the system's predicted answer for the i-th question. Each user produces a single MZOE score, and the reported MZOE is the average over all users.

• Mean Absolute Error (MAE) between the questionnaire filled in by the real user and the questionnaire filled in by the system (i.e., the average deviation of the predicted response from the true response):

MAE(f, Q) = ( Σ_{q_i ∈ Q} |R(q_i) − f(q_i)| ) / |Q|    (9)

Again, each user produces a single MAE score, and the reported MAE is the average over all users.

• Macroaveraged Mean Absolute Error (MAE_macro) between the questionnaire filled in by the real user and the questionnaire filled in by the system (see [41]):

MAE_macro(f, Q) = (1/7) Σ_{j=0}^{6} ( Σ_{q_i ∈ Q_j} |R(q_i) − f(q_i)| ) / |Q_j|    (10)

where Q_j represents the set of questions whose true answer is j (j goes from 0 to 6 because those are the possible answers to each question). Again, each user produces a single MAE_macro score, and the reported MAE_macro is the average over all users.

The following measures are based on aggregated scores obtained from the questionnaires. Further details about the EDE-Q instrument can be found elsewhere (e.g., see the scoring section of the questionnaire).

• Restraint Subscale (RS): Given a questionnaire, its restraint score is obtained as the mean response to the first five questions. Each user u_i is associated with a real subscale ED score (referred to as R_RS(u_i)) and an estimated subscale ED score (referred to as f_RS(u_i)). This metric computes the RMSE between the real and the estimated subscale ED scores over the user set U:

RMSE(f, U) = sqrt( Σ_{u_i ∈ U} (R_RS(u_i) − f_RS(u_i))² / |U| )    (11)

• Eating Concern Subscale (ECS): Given a questionnaire, its eating concern score is obtained as the mean response to questions 7, 9, 19, 20, and 21. This metric computes the RMSE (Equation 12) between the eating concern ED score obtained from the questionnaire filled in by the real user and the one obtained from the questionnaire filled in by the system:

RMSE(f, U) = sqrt( Σ_{u_i ∈ U} (R_ECS(u_i) − f_ECS(u_i))² / |U| )    (12)

• Shape Concern Subscale (SCS): Given a questionnaire, its shape concern score is obtained as the mean response to questions 6, 8, 10, 11, 23, 26, 27, and 28. This metric computes the RMSE (Equation 13) between the real and the estimated shape concern ED scores:

RMSE(f, U) = sqrt( Σ_{u_i ∈ U} (R_SCS(u_i) − f_SCS(u_i))² / |U| )    (13)

• Weight Concern Subscale (WCS): Given a questionnaire, its weight concern score is obtained as the mean response to questions 8, 12, 22, 24, and 25. This metric computes the RMSE (Equation 14) between the real and the estimated weight concern ED scores:

RMSE(f, U) = sqrt( Σ_{u_i ∈ U} (R_WCS(u_i) − f_WCS(u_i))² / |U| )    (14)

• Global ED (GED): To obtain an overall or 'global' score, the four subscale scores are summed and the resulting total is divided by the number of subscales (i.e., four) [40]. This metric computes the RMSE between the real and the estimated global ED scores:

RMSE(f, U) = sqrt( Σ_{u_i ∈ U} (R_GED(u_i) − f_GED(u_i))² / |U| )    (15)
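The per-questionnaire metrics of Equations (8)-(10) can be rendered directly in code. The sketch below is a minimal implementation with made-up answers; it assumes that answer values j with no occurrences contribute nothing to the macro average, a detail Equation (10) leaves implicit. The per-user scores would then be averaged over all users, as described above.

def mzoe(truth, pred):
    """Fraction of questions answered incorrectly (Equation 8)."""
    return sum(truth[q] != pred[q] for q in truth) / len(truth)

def mae(truth, pred):
    """Mean absolute deviation from the true answers (Equation 9)."""
    return sum(abs(truth[q] - pred[q]) for q in truth) / len(truth)

def mae_macro(truth, pred):
    """MAE macroaveraged over the 7 possible answer values (Equation 10)."""
    total = 0.0
    for j in range(7):
        q_j = [q for q in truth if truth[q] == j]
        if q_j:  # answer values that never occur are skipped (assumption)
            total += sum(abs(truth[q] - pred[q]) for q in q_j) / len(q_j)
    return total / 7

truth = {1: 0, 2: 3, 3: 6}  # made-up true answers (question -> value)
pred = {1: 1, 2: 3, 3: 4}   # made-up system predictions
print(mzoe(truth, pred), mae(truth, pred), mae_macro(truth, pred))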
Table 11: Task 3 results. Participating teams and runs with corresponding scores for the metrics.
team | run ID | MAE | MZOE | MAE_macro | GED | RS | ECS | SCS | WCS
baseline all 0s | - | 3.790 | 0.813 | 4.254 | 4.472 | 3.869 | 4.479 | 4.363 | 3.361
baseline all 6s | - | 1.937 | 0.551 | 3.018 | 3.076 | 3.352 | 2.868 | 3.029 | 2.472
baseline average | - | 1.965 | 0.884 | 1.973 | 2.337 | 2.486 | 1.559 | 2.002 | 1.783
APB-UC3M [28] | 0 | 2.003 | 0.869 | 2.142 | 2.647 | 2.253 | 1.884 | 2.101 | 1.823
DSGT [27] | 0 | 1.965 | 0.588 | 1.713 | 2.211 | 2.321 | 1.969 | 1.944 | 2.117
RELAI [23] | 0 | 2.331 | 0.914 | 2.243 | 2.394 | 2.222 | 2.324 | 2.340 | 1.812
RELAI | 1 | 2.346 | 0.917 | 2.237 | 2.507 | 2.199 | 2.216 | 2.328 | 1.836
RELAI | 2 | 2.758 | 0.934 | 2.885 | 2.883 | 2.767 | 3.126 | 3.061 | 2.171
RELAI | 3 | 2.356 | 0.775 | 2.700 | 2.928 | 3.266 | 2.106 | 2.821 | 2.310
RELAI | 4 | 2.851 | 0.884 | 2.979 | 3.159 | 2.784 | 3.150 | 3.068 | 2.336
SCaLAR-NITK [42] | 0 | 1.912 | 0.591 | 1.643 | 2.495 | 2.713 | 1.568 | 1.536 | 2.098
SCaLAR-NITK | 1 | 1.980 | 0.664 | 1.972 | 2.570 | 2.562 | 1.553 | 1.960 | 2.066
SCaLAR-NITK | 2 | 1.879 | 0.568 | 1.942 | 2.158 | 2.477 | 2.222 | 2.245 | 2.364
SCaLAR-NITK | 3 | 1.932 | 0.586 | 1.868 | 2.117 | 2.430 | 2.046 | 2.242 | 2.407
SCaLAR-NITK | 4 | 1.874 | 0.672 | 1.820 | 2.292 | 2.140 | 1.557 | 1.880 | 2.061
UMU [37] | 0 | 2.366 | 0.798 | 2.833 | 3.261 | 3.285 | 2.659 | 2.771 | 2.218
UMU | 1 | 2.227 | 0.859 | 2.286 | 2.326 | 2.911 | 2.142 | 2.560 | 2.026

4.2. Results

Table 11 reports the results obtained by the participants in this task. To provide some context, the top block of the table includes the performance of three baseline variants: "all 0s", "all 6s", and "average". The "all 0s" variant represents a strategy where the same response (0) is submitted for all questions; similarly, the "all 6s" variant submits the response 6 for all questions. The "average" variant calculates the mean of the responses provided by all participants for each question and submits the response that is closest to this mean value (e.g., if the mean response provided by the participants equals 3.7, this approach submits a 4).

The results indicate that the top-performing system in terms of Mean Absolute Error (MAE) was run 4 by SCaLAR-NITK [42]. This team also achieved the best MZOE (run 2), the best MAE_macro (run 0), the best GED (run 3), the best RS (run 4), the best ECS (run 1), and the best SCS (run 0). The best WCS, instead, was achieved by team RELAI [23] (run 0). In some cases, the best participating system was not better than some of the baselines (e.g., the lowest MZOE is obtained by the "all 6s" baseline).

5. Participating Teams

Table 12 reports the participating teams and the runs that they submitted for each eRisk task. The next paragraphs give a brief summary of the techniques implemented by each of them. Further details are available in the CLEF 2024 working notes proceedings.

Table 12: eRisk 2024 participants (number of runs per task).
team | Task 1 | Task 2 | Task 3
APB-UC3M [28] | 5 | 2 | 1
BioNLP-IISERB [34] | - | 5 | -
COS-470-Team-2 | - | 5 | -
DSGT [27] | 5 | - | 1
ELiRF-UPV [38] | - | 4 | -
GVIS | 1 | 5 | -
MeVer-REBECCA [26] | 2 | - | -
MindwaveML [25] | 3 | - | -
NLP-UNED [39] | - | 5 | -
NUS-IDS [24] | 5 | - | -
RELAI [23] | 5 | - | 5
Riewe-Perla [35] | - | 5 | -
SCaLAR-NITK [42] | - | - | 5
SINAI [22] | 2 | 5 | -
ThinkIR | 1 | - | -
UMU [37] | - | 5 | 2
UNSL [36] | - | 3 | -

APB-UC3M [28]. The APB-UC3M team, affiliated with Universidad Carlos III de Madrid (UC3M) in Spain, participated in the three tasks of the eRisk 2024 challenge.
For Task 1, which involved searching for symptoms of depression, the team employed sentence similarity models to compare BDI items with paragraphs, in conjunction with a RoBERTa classifier. They also explored ensemble methods combining these approaches. In Task 2, focused on the early detection of anorexia, the team used an ensemble model comprising three classification algorithms: they generated embeddings with BART and Doc2Vec models and fed these embeddings to three traditional classifiers, namely Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF). For Task 3, which involved measuring the severity of signs of eating disorders, the team fine-tuned a neural network model. This model included an embedding layer, a fully connected layer, and a ReLU activation function, and it was trained to predict the 22 categories of the Eating Disorder Examination (EDE) interview.

BioNLP-IISERB [34]. The BioNLP-IISERB team, affiliated with the Indian Institute of Science Education and Research, Bhopal, participated in Task 2 of the eRisk 2024 challenge. The team's approach involved a combination of various classification methods and feature engineering techniques to identify signs of anorexia in the provided texts. They utilized both bag-of-words features and transformer-based embedding methods. For classification, they employed Random Forest, Adaptive Boosting (AdaBoost), Logistic Regression, Support Vector Machine (SVM), and transformer-based classifiers. Their experimental analysis revealed that the best performance was achieved with SVM and AdaBoost classifiers, particularly with TF-IDF and entropy-based weighting strategies. Some experimental runs achieved F1 scores higher than 0.65, indicating the potential of these frameworks to identify textual patterns indicative of anorexia. Despite the promising results, the complexity of the task suggests there is room for future improvements.

DSGT [27]. The DSGT team, from the Georgia Institute of Technology, participated in Tasks 1 and 3 of the eRisk 2024 challenge. For Task 1, they developed two distinct pipelines to detect signs of depression. Their approach combined traditional NLP techniques, such as TF-IDF, with vector-based models. Specifically, they constructed a logistic regression classifier, treating the 21 symptoms as targets of a multiclass classification problem. The results on the hidden test set demonstrated that vector models and transformer-based models can achieve notable performance on information retrieval metrics, even without advanced sentence filtering and fine-tuning. For Task 3, the team employed simpler models, including XGBoost and Random Forests, which showed better performance on smaller datasets.

ELiRF-UPV [38]. The ELiRF-VRAIN team, affiliated with the Valencian Research Institute for Artificial Intelligence (VRAIN) at Universitat Politècnica de València, participated in Task 2. Their work involved three distinct approaches: a Support Vector Machine (SVM) and two pre-trained Transformer models. Among the Transformer models, one approach utilized BERT-like models, while the other employed LongFormer models to expand the context available when making decisions. To balance the set of training examples, the authors proposed a data augmentation method, which yielded positive results during the training process.

MeVer-REBECCA [26].
The REBECCA team, affiliated with the Information Technologies Institute at the Centre for Research and Technology Hellas (CERTH) in Thessaloniki, Greece, participated in Task 1 of the eRisk 2024 challenge, which focused on searching for symptoms of depression. Their approach combined ranking sentences with cosine similarity over Transformer embeddings and refinement through a Large Language Model (LLM), specifically ChatGPT-4. The process began with text pre-processing and dataset cleaning: sentences not related to the authors were discarded, and sentences were considered relevant only if they reflected the author's state with respect to a symptom. They conducted keyword matching with sentences indicating self-reference. Following this, the team performed sentence ranking with BGE-M3 (Multi-Linguality, Multi-Functionality, and Multi-Granularity) and the questionnaire answers.

MindwaveML [25]. The MindwaveML team, from the University of Bucharest, participated in Task 1. The team leveraged a paraphrasing model to match sentences with BDI texts. Specifically, they encoded the four alternative responses to each of the 21 BDI symptoms into dense embeddings using the paraphrase-MiniLM-L12-v2 model. These embeddings captured the semantic information contained in each response. The sentences from the Reddit corpus were also encoded with paraphrase-MiniLM-L12-v2, and the cosine similarity between each sentence and the BDI responses was then computed. Additionally, the team incorporated features checking that the sentences contained first-person expressions, ensuring relevance to the individual's state. The resulting set of features was fed to standard learning algorithms.

NLP-UNED [39]. The NLP-UNED team, from UNED in Madrid, participated in Task 2. Their system comprised several steps, starting with an initial embedding representation using sentence encoders. This was followed by a relabelling process based on Approximate Nearest Neighbors (ANN) techniques to generate a training dataset annotated at the message level instead of the user level. The encoding process was further refined with fine-tuning based on contrastive learning, aiming to maximize the distance between embeddings belonging to different classes. For classification, the team also employed ANN techniques, combined with rules and heuristics to expand the number of messages considered from each user when making the final decision. Their system achieved the best results in both the decision-based and the ranking-based evaluations.

NUS-IDS [24]. The NUS-IDS team, affiliated with the National University of Singapore's Integrative Sciences and Engineering Programme and the Institute of Data Science, participated in Task 1 of the eRisk 2024 challenge. The team's approach ranked candidate sentences for depression symptoms by their average similarity to a predefined set of training sentences. Using methods for computing dense representations of sentences, the team calculated the score of a test sentence as the average cosine similarity between the test sentence and each sentence in a set of training sentences associated with a specific symptom. The authors experimented with different configurations of this algorithm, employing various models for dense representation computation and different sets of training sentences.
RELAI [23]. The RELAI team, from Université du Québec à Montréal, Canada, and McMaster University, Canada, participated in Tasks 1 and 3. They approached Task 1, which involved searching for symptoms of depression, as a multilabel classification task, using feed-forward neural networks over contextual embeddings to mine, from a large set of social media sentences, those relevant to each item of a standard depression questionnaire. Their methods aimed to be lightweight, minimizing computational costs and infrastructure needs. In Task 3, the team used BERTopic to extract the 16 topics most correlated with signs of eating disorders as features for prediction, employing feed-forward neural networks over topic probabilities to automatically fill out a standard eating disorders questionnaire from social media writing histories. The authors noted significant room for improvement, particularly in exploring different representations of the writing history and in improving model calibration for the classification transformations.

Riewe-Perła (MHRec) [35]. The Poznan University of Economics and Business team participated in Task 2. Their approach merged language models with recommender systems to analyze and predict whether recommended content originated from individuals with mental health conditions. The model was built on document embeddings, user embeddings, and a recommendation engine based on the sentence transformer architecture (SBERT). They employed a hybrid recommendation method (LightFM) that leverages both document and user embeddings to flag publications indicating mental health challenges. The system aimed to enable fast classification of new messages, determining as early as possible whether an individual was suffering from anorexia.

SCaLAR-NITK [42]. Team SCaLAR-NITK, from the National Institute of Technology Karnataka, Surathkal, participated in Task 3. The team employed a range of standard techniques across 21 different models, one for each symptom question. Their first approach used Support Vector Machine (SVM) classifiers over traditional TF-IDF representations. In the second approach, they again used SVMs but leveraged pre-trained Word2Vec embeddings to model both users and questions, aggregating the question embeddings with each user publication. To address response imbalance, they employed back-translation. Their final method followed the second approach but added Principal Component Analysis (PCA) for dimensionality reduction of the embeddings. Their methods performed well, achieving the best results in 7 out of the 8 evaluated metrics.

SINAI [22]. The SINAI team, a collaboration between the Computer Science Department of the Universidad de Jaén (Spain) and the Instituto Nacional de Astrofísica, Óptica y Electrónica (Mexico), participated in Tasks 1 and 2. For Task 1, one of SINAI's approaches involved training a DistilRoBERTa base model on labeled sentences, with additional data augmentation from the BDI-Sen dataset. Another approach for Task 1 used GPT-3 prompts to infer connections between PHQ-8 symptoms and BDI symptoms. For Task 2, the team implemented two transformer-based models trained with causal language modeling, one on texts from positive users and the other on texts from negative users. This dual-model solution was used to produce perplexity estimates.
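SINAI's dual-model design lends itself to a compact sketch: a post is scored by comparing its perplexity under the two language models. The code below is a hedged illustration of that idea, not the team's implementation; the checkpoints are placeholders (in practice each model would be fine-tuned on the corresponding user group), and the decision rule shown is one plausible reading of the description above.

```python
# Hedged sketch: perplexity-based screening with two causal LMs, one
# (notionally) fine-tuned on positive users and one on negative users.
# "gpt2" is a placeholder checkpoint, not the team's actual model.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of `text` under a causal LM (exp of the mean token NLL)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

tok = AutoTokenizer.from_pretrained("gpt2")
pos_lm = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in: positive-user LM
neg_lm = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in: negative-user LM

post = "i have been restricting my meals again and it scares me"
# A post that the positive-user LM finds less surprising (lower perplexity)
# than the negative-user LM does is treated as evidence of risk.
at_risk = perplexity(pos_lm, tok, post) < perplexity(neg_lm, tok, post)
print(at_risk)
```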
UMU Team [37]. The UMU Team, from the University of Murcia (Spain), participated in Tasks 2 and 3. For Task 2, the team proposed a method that classifies user posts by combining the last-layer hidden representation of a BERT-based model with sentiment features extracted from the text. They used BERT and RoBERTa models for text representation, together with the Cardiff NLP TweetEval model for sentiment analysis, aiming to capture both the semantic and the emotional aspects of users' posts when detecting signs of anorexia. For Task 3, they fine-tuned a sentence transformer model to compute the similarity between the user's text and the responses of the EDE-Q questionnaire, measuring the textual closeness between user posts and the EDE-Q questions to assess the severity of eating disorder symptoms.

UNSL [36]. The UNSL team, from Universidad Nacional de San Luis (Argentina), participated in Task 2 with two solutions: CPI-DMC, which addresses precision and speed as separate objectives, and a time-aware approach in which both objectives are tackled together. The first approach aimed to balance the identification of positive users against the decision-making time, and consisted of two separate components: a Classifier with Partial Information (CPI) and a component for Deciding the Moment of the Classification (DMC). The second approach optimized both objectives simultaneously by incorporating time into the learning process and using ERDE as the training objective. To implement this, they included a [TIME] token in the representations and integrated temporal metrics to validate and select the optimal models. Their methods achieved good results on the ERDE50 metric and on ranking-based metrics, demonstrating consistency in solving early risk detection problems.
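Since UNSL optimize ERDE directly, it is worth recalling how the metric behaves. The sketch below follows the standard eRisk formulation of ERDE_o [1, 2]: false positives cost c_fp, false negatives cost c_fn, true negatives cost nothing, and the cost of a true positive grows with the number of posts read before the alert. The cost constants chosen here are illustrative.

```python
# Sketch of the per-user ERDE_o cost from the eRisk evaluation setup.
# Cost constants are illustrative; in eRisk, c_fp is usually set to the
# proportion of positive users in the collection.
import math

def erde(decision: int, truth: int, k: int, o: int = 50,
         c_fp: float = 0.1, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """ERDE_o for one user.

    decision: the system's binary alert (1 = at risk), truth: gold label,
    k: number of posts seen before deciding, o: the metric's deadline.
    """
    if decision == 1 and truth == 0:
        return c_fp
    if decision == 0 and truth == 1:
        return c_fn
    if decision == 1 and truth == 1:
        # Latency cost: close to 0 when k << o, close to c_tp when k >> o.
        return c_tp * (1.0 - 1.0 / (1.0 + math.exp(k - o)))
    return 0.0  # true negative

# A correct alert after 10 posts is far cheaper than one after 90 posts:
print(round(erde(1, 1, k=10), 4), round(erde(1, 1, k=90), 4))
```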
6. Conclusions

This paper provided an extended overview of eRisk 2024, the eighth edition of the lab, which featured three types of tasks: symptom search (Task 1, on depression), early detection (Task 2, on anorexia), and severity estimation (Task 3, on eating disorders). Participants in Task 1 were given a collection of sentences and had to rank them according to their relevance to each of the BDI-II depression symptoms. Participants in Task 2 had sequential access to social media posts and had to send alerts about individuals showing risks of anorexia. In Task 3, participants were given the full user history and had to automatically estimate the user's responses to a standard eating disorders questionnaire.

A total of 87 runs were submitted by 17 teams for the proposed tasks. The experimental results demonstrate the value of extracting evidence from social media, indicating that automatic or semi-automatic screening tools to detect at-risk individuals could be promising. These findings highlight the need to develop benchmarks for text-based risk indicator screening.

Acknowledgments

This work was supported by project PLEC2021-007662 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Plan de Recuperación, Transformación y Resiliencia, Unión Europea-Next Generation EU). The first and second authors thank the financial support provided by the Xunta de Galicia-Consellería de Cultura, Educación, Formación Profesional e Universidade (GPC ED431B 2022/33), the European Regional Development Fund, and project PID2022-137061OB-C21 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Proyectos de Generación de Conocimiento; supported by "ERDF A way of making Europe", European Union). CITIC, as a center accredited for excellence within the Galician University System and a member of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities, and Vocational Training of the Xunta de Galicia, and is co-financed by the EU through the FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01). The third author thanks the financial support provided by the Xunta de Galicia-Consellería de Cultura, Educación, Formación Profesional e Universidade (accreditation 2019-2022 ED431G-2019/04, ED431C 2022/19), which recognizes the CiTIUS-Research Center in Intelligent Technologies of the University of Santiago de Compostela as a Research Center of the Galician University System, and the European Regional Development Fund. David E. Losada also thanks the financial support obtained from project SUBV23/00002 (Ministerio de Consumo, Subdirección General de Regulación del Juego) and project PID2022-137061OB-C22 (Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Proyectos de Generación de Conocimiento; supported by the European Regional Development Fund).

References

[1] D. E. Losada, F. Crestani, J. Parapar, eRisk 2017: CLEF lab on early risk prediction on the internet: Experimental foundations, in: G. J. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2017, pp. 346–360.
[2] D. E. Losada, F. Crestani, J. Parapar, eRisk 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental foundations, in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2017, Dublin, Ireland, 2017.
[3] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk: Early Risk Prediction on the Internet, in: P. Bellot, C. Trabelsi, J. Mothe, F. Murtagh, J. Y. Nie, L. Soulier, E. SanJuan, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2018, pp. 343–361.
[4] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2018: Early Risk Prediction on the Internet (extended lab overview), in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2018, Avignon, France, 2018.
[5] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2019: Early risk prediction on the Internet, in: F. Crestani, M. Braschler, J. Savoy, A. Rauber, H. Müller, D. E. Losada, G. Heinatz Bürki, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, 2019, pp. 340–357.
[6] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk at CLEF 2019: Early risk prediction on the Internet (extended overview), in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2019, Lugano, Switzerland, 2019.
[7] D. E. Losada, F. Crestani, J. Parapar, Early detection of risks on the internet: An exploratory campaign, in: Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part II, 2019, pp. 259–266.
[8] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2020: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22-25, 2020, Proceedings, 2020, pp. 272–287.
[9] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk at CLEF 2020: Early risk prediction on the internet (extended overview), in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, 2020.
[10] D. E. Losada, F. Crestani, J. Parapar, eRisk 2020: Self-harm and depression challenges, in: Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part II, 2020, pp. 557–563.
[11] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2021: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 12th International Conference of the CLEF Association, CLEF 2021, Virtual Event, September 21-24, 2021, Proceedings, 2021, pp. 324–344.
[12] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk at CLEF 2021: Early risk prediction on the internet (extended overview), in: Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21-24, 2021, 2021, pp. 864–887.
[13] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, eRisk 2021: Pathological gambling, self-harm and depression challenges, in: Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Proceedings, Part II, 2021, pp. 650–656.
[14] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2022: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 13th International Conference of the CLEF Association, CLEF 2022, Bologna, Italy, September 5-8, 2022, 2022, pp. 233–256.
[15] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk at CLEF 2022: Early risk prediction on the internet (extended overview), in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022, 2022, pp. 821–850.
[16] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, eRisk 2022: Pathological gambling, depression, and eating disorder challenges, in: Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part II, 2022, pp. 436–442.
[17] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2023: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023, 2023, pp. 233–256.
[18] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk at CLEF 2023: Early risk prediction on the internet (extended overview), in: Proceedings of the Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 18-21, 2023, 2023.
[19] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, eRisk 2023: Depression, pathological gambling, and eating disorder challenges, in: Advances in Information Retrieval - 45th European Conference on IR Research, ECIR 2023, Dublin, Ireland, April 2-6, 2023, Proceedings, Part III, 2023, pp. 585–592.
[20] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2024: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9-12, 2024, 2024.
[21] A. T. Beck, C. H. Ward, M. Mendelson, J. Mock, J. Erbaugh, An Inventory for Measuring Depression, JAMA Psychiatry 4 (1961) 561–571.
[22] A. M. Mármol-Romero, P. A.-O. Adrián Moreno-Muñoz, K. M. Valencia-Segura, E. Martínez-Cámara, M. García-Vega, A. Montejo-Ráez, SINAI at eRisk@CLEF 2024: Approaching the Search for Symptoms of Depression and Early Detection of Anorexia Signs using Natural Language Processing, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[23] D. Maupomé, Y. Ferstler, S. Mosser, M.-J. Meurs, Automatically finding evidence, predicting answers in mental health self-report questionnaires, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[24] B. H. Ang, S. D. Gollapalli, S.-K. Ng, NUS-IDS@eRisk2024: Ranking Sentences for Depression Symptoms using Early Maladaptive Schemas and Ensembles, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[25] R.-M. Hanciu, MindwaveML at eRisk 2024: Identifying Depression Symptoms in Reddit Users, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[26] A. Barachanou, F. Tsalakanidou, S. Papadopoulos, REBECCA at eRisk 2024: Search for symptoms of depression using sentence embeddings and prompt-based filtering, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[27] D. Guecha, A. Potdar, A. Miyaguchi, DS@GT eRisk 2024 Working Notes, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[28] A. P. Bascuñana, I. S. Bedmar, APB-UC3M at eRisk 2024: Natural Language Processing and Deep Learning for the Early Detection of Mental Disorders, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[29] D. E. Losada, F. Crestani, A test collection for research on depression and language use, in: Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2016, Évora, Portugal, 2016.
[30] D. Otero, J. Parapar, Á. Barreiro, Beaver: Efficiently building test collections for novel tasks, in: Proceedings of the First Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2020), Samatan, Gers, France, July 6-9, 2020, 2020.
[31] D. Otero, J. Parapar, Á. Barreiro, The wisdom of the rankers: a cost-effective method for building pooled test collections without participant systems, in: SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, Republic of Korea, March 22-26, 2021, 2021, pp. 672–680.
[32] M. Trotzek, S. Koitka, C. Friedrich, Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences, IEEE Transactions on Knowledge and Data Engineering (2018).
[33] F. Sadeque, D. Xu, S. Bethard, Measuring the latency of depression detection in social media, in: WSDM, ACM, 2018, pp. 495–503.
[34] P. Sarangi, S. Kumar, S. Agrawal, T. Basu, A natural language processing based framework for early detection of anorexia via sequential text processing, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[35] O. Riewe-Perła, A. Filipowska, Combining Recommender Systems and Language Models in Early Detection of Signs of Anorexia, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[36] H. Thompson, M. Errecalde, A Time-Aware Approach to Early Detection of Anorexia: UNSL at eRisk 2024, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[37] R. Pan, J. A. G. Díaz, T. B. Beltrán, R. Valencia-Garcia, UMUTeam at eRisk@CLEF 2024: Fine-Tuning Transformer Models with Sentiment Features for Early Detection and Severity Measurement of Eating Disorders, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[38] A. C. Segarra, V. A. Esteve, A. M. Marco, L.-F. H. Oliver, ELiRF-VRAIN at eRisk 2024: Using LongFormers for Early Detection of Signs of Anorexia, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[39] H. Fabregat, D. Deniz, A. Duque, L. Araujo, J. Martinez-Romo, NLP-UNED at eRisk 2024: Approximate Nearest Neighbors with Encoding Refinement for Early Detecting Signs of Anorexia, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[40] C. G. Fairburn, Z. Cooper, M. O'Connor, Eating Disorder Examination (Edition 17.0D), April 2014.
[41] S. Baccianella, A. Esuli, F. Sebastiani, Evaluation measures for ordinal regression, in: Ninth International Conference on Intelligent Systems Design and Applications (ISDA 2009), 2009, pp. 283–287. doi:10.1109/ISDA.2009.230.
[42] S. Prasanna, A. S. Gulati, S. Karmakar, M. Y. Hiranmayi, A. K. Madasamy, Measuring the severity of the signs of eating disorders using machine learning techniques, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.