
Overview of eRisk at CLEF 2025: Early Risk Prediction on the Internet (Extended Overview)

Javier Parapar

javier.parapar@udc.es 1

Anxo Perez

anxo.pvila@udc.es 1

Xi Wang

xi.wang@sheffield.ac.uk 2

Fabio Crestani

fabio.crestani@usi.ch 0

0 Faculty of Informatics, Università della Svizzera italiana (USI), Campus EST, Via alla Santa 1, 6900 Viganello, Switzerland

1 Information Retrieval Lab, Centro de Investigación en Tecnoloxías da Información e as Comunicacións (CITIC), Universidade da Coruña, Campus de Elviña s/n, C.P. 15071 A Coruña, Spain

2 University of Sheffield, Sheffield, England, United Kingdom

This paper presents an extended overview of eRisk 2025, the ninth edition of the CLEF lab on early risk detection. Since its beginnings, eRisk has served as a benchmark for assessing methodologies, evaluation metrics, and challenges in the early identification of personal risks, particularly within health and safety domains. The 2025 edition marks an important evolution, broadening the lab's scope toward problems that require richer contextual and conversational understanding. The first task, the only one preserved from last year, asks systems to rank sentences by their relevance to the BDI-II depression symptoms, enabling fine-grained retrieval of depressive cues. The second task reformulates early detection as a contextual decision problem: the full conversational thread, including the user's posts and all the interactions from the other people involved, is revealed incrementally. At each step, the models must decide whether sufficient evidence exists to predict depression for the user, thereby rewarding both accuracy and timeliness. Finally, the pilot task pioneers an interactive scenario: fine-tuned large language models engage participants in dialogue and must infer depressive signals from the evolving conversations, probing the feasibility and safety of conversational screening agents. Together, these three tasks continue to advance the field of early risk detection, open new research avenues, and align the evaluation framework more closely with real-world conversational settings.

Keywords: Early risk detection, Depression, Conversational analysis, LLMs, Large Language Models, eRisk

1. Introduction

The eRisk lab was designed as a benchmark environment for constructing resources, defining evaluation protocols, and developing approaches that enable the timely detection of different personal risk situations. Early alert technologies are becoming indispensable across healthcare-oriented domains. The rapid recognition of warning signs, whether for emergent mental health crises, predatory behaviour, or violent threats, can turn marginal time gains into life-saving interventions. eRisk focuses on psychological and mental health risks such as depression, self-harm, pathological gambling, and eating disorders, where language provides subtle yet informative signals. However, the intricate relationship between linguistic expression and mental state continues to challenge automatic methods and screeners, underscoring the need for increasingly robust, context-aware models and high-quality annotated public datasets.

The inaugural eRisk 2017 edition introduced the pilot task on early detection of depression, establishing the sequential evidence evaluation framework that is still present in the lab today [ 1, 2 ]. In 2018 the scope broadened to include anorexia, creating a dual-task campaign that demonstrated that the proposals generalise across related mental-health disorders [ 3, 4 ]. The 2019 programme consolidated the anorexia work, introduced a self-harm prediction task, and, for the first time, asked systems to infer answers to a depression severity questionnaire (BDI-II [ 5 ]) purely from social media activity [ 6, 7, 8 ]. The 2019 inclusion of the BDI-II moved the lab toward symptom-level modelling, encouraging participants to move beyond binary depressed/not-depressed labels and design methods that capture the nuances of individual depressive symptoms. eRisk 2020 deepened the self-harm task and added a refined depression severity estimation challenge, further emphasising continuous severity scales over binary outcomes [ 9, 10, 11 ]. The 2021 edition was the first to address behavioural addictions: it pioneered an early pathological gambling detection task while revisiting self-harm detection and severity estimation for depression [ 12, 13, 14 ].

In 2022, we returned to gambling and depression and introduced a new challenge centered on estimating the severity of eating-disorder activities [ 15, 16, 17 ]. The 2023 campaign shifted emphasis to fine-grained symptom prediction, presenting a sentence ranking task that maps individual user sentences to the 21 BDI-II depression symptoms, and retained tasks on gambling risk and eating-disorder severity [ 18, 19, 20 ]. Finally, eRisk 2024 consolidated the BDI-II sentence ranking benchmark, maintained the anorexia early detection task, and updated the eating-disorder severity task, setting the stage for the more conversational focus adopted in 2025 [ 21, 22, 23 ].

The current edition, eRisk 2025 [ 24, 25 ], extends this trajectory by introducing, for the first time, tasks that demand not only early recognition of risk but also deeper contextual reasoning and, in the pilot task, true conversational interactions. Full task specifications appear in the next sections, yet the broad shift is clear: systems must now interpret entire discussion threads and interactions, bringing the evaluation environment closer to real-world online settings. This year, 128 different teams registered for the eRisk lab. We finally received results from 25 distinct teams: 67 runs for Task 1, 50 runs for Task 2, and 11 runs for the pilot task.

2. Task 1: Search for Symptoms of Depression

This task continues from eRisk 2023’s and 2024’s Task 1, which involved ranking sentences from user writings based on their relevance to specific depression symptoms. This is the last year of the task. Again, participants were required to order sentences according to their relevance to the 21 standardized symptoms listed in the BDI-II questionnaire [ 5 ]. A sentence was deemed relevant if it reflected the user’s condition related to a symptom, including positive statements (e.g., “I feel quite happy lately” is relevant for the symptom “Sadness”). As in 2024, the test collection provides not only the target sentence but also its immediate predecessor and successor to give more context.

2.1. Task 1: Dataset and Assessment Process

The dataset provided was in TREC format, tagged with sentences derived from Reddit historical data. Table 1 presents some statistics of the corpus. Given the corpus of sentences and the description of the symptoms from the BDI-II questionnaire, the participants were free to decide on the best strategy to derive queries for representing the BDI-II symptoms. Each participating team submitted up to 5 variants (runs). Each run included 21 TREC-style formatted rankings of sentences, as shown in Figure 1. For each symptom, the participants should submit up to 1000 results sorted by estimated relevance. We received 67 runs from 17 participating teams (see Table 2).
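As an illustration of what a submitted run might look like, the following minimal sketch writes per-symptom rankings in the conventional six-column TREC run layout (query id, the literal `Q0`, item id, rank, score, run tag). This layout is an assumption; the layout shown in Figure 1 is authoritative. The function name `format_run` and the score precision are hypothetical.

```python
from typing import Dict, List

MAX_RESULTS = 1000  # per-symptom limit stated in the task description


def format_run(scores: Dict[int, Dict[str, float]], tag: str) -> List[str]:
    """Return TREC-style run lines.

    `scores` maps a BDI-II symptom number to {sentence_id: estimated relevance}.
    The six-column layout is the conventional TREC one and is assumed here.
    """
    lines = []
    for symptom in sorted(scores):
        # Sort by descending score; break ties by sentence id for determinism.
        ranked = sorted(scores[symptom].items(), key=lambda kv: (-kv[1], kv[0]))
        for rank, (sent_id, score) in enumerate(ranked[:MAX_RESULTS], start=1):
            lines.append(f"{symptom} Q0 {sent_id} {rank} {score:.4f} {tag}")
    return lines
```

Truncating to the top 1000 results after sorting keeps each symptom ranking within the stated limit.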

[Table 1: corpus statistics (number of users, number of sentences, average number of words per sentence).]

Relevance labels were produced through a stratified, two-stage pooling procedure. First, for every BDI-II symptom we implemented top-k pooling, collecting the top five sentences returned by each submitted run (k = 5), forming an initial pool that served to rank systems provisionally. We then selected the twenty highest-ranked runs and performed a second pooling step that extended the cut-off to the top fifty sentences (k = 50). Unlike the 2023 setup, assessors were shown the target sentence together with its immediate context (the preceding and following sentences), a change designed to reduce annotation ambiguity.
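The two pooling stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the organisers' tooling: the data layout, the function names, and the `provisional_scores` input (one effectiveness value per run, obtained from judging the first pool) are hypothetical.

```python
from typing import Dict, List, Set

Run = Dict[int, List[str]]  # symptom number -> ranked sentence ids


def pool(runs: Dict[str, Run], symptom: int, k: int) -> Set[str]:
    """Top-k pooling: union of the first k sentences of every run."""
    pooled: Set[str] = set()
    for ranking in runs.values():
        pooled.update(ranking.get(symptom, [])[:k])
    return pooled


def two_stage_pool(
    runs: Dict[str, Run],
    symptom: int,
    provisional_scores: Dict[str, float],
    k1: int = 5,
    k2: int = 50,
    n_top: int = 20,
) -> Set[str]:
    """Stage 1 pools every run at depth k1; stage 2 pools the n_top
    provisionally best runs at depth k2. Returns the union of both pools."""
    first = pool(runs, symptom, k1)
    best = sorted(runs, key=lambda name: provisional_scores[name], reverse=True)[:n_top]
    second = pool({name: runs[name] for name in best}, symptom, k2)
    return first | second
```

The defaults mirror the depths described above (5, then 50 for the twenty best runs).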

Three annotators worked independently: one with professional training in psychology, and two computer-science researchers specialising in early risk technologies. Before judging, the organisers held a session to walk through an initial guideline draft, resolve doubts, and agree on how to handle different cases. The consolidated guidelines, which are publicly available (footnote 1), define a sentence as relevant only when it both addresses the symptom and conveys explicit information about the user’s state. This dual concept of relevance (on-topic and reflective of the user’s state with respect to the symptom) introduced a higher level of complexity compared to more standard relevance assessments. Each pooled sentence received three independent judgements, and we provide two ground-truth sets (qrels):

• Majority-based qrels: a sentence was deemed relevant if at least two of the three assessors marked it so.

• Unanimity-based qrels: a sentence was deemed relevant only when all three assessors agreed.

The final pool sizes and qrels for each symptom are reported in Table 3. Providing both qrels enables analyses with different agreement thresholds, continuing the dual-qrel strategy introduced in earlier eRisk campaigns.
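The two qrel sets follow mechanically from the three judgements per sentence. The sketch below assumes binary votes (1 = relevant, 0 = not relevant) stored per sentence id; the function name and data layout are illustrative.

```python
from typing import Dict, List, Tuple


def build_qrels(judgements: Dict[str, List[int]]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Derive majority-based and unanimity-based qrels.

    `judgements` maps sentence id -> three binary votes, one per assessor.
    """
    majority: Dict[str, int] = {}
    unanimity: Dict[str, int] = {}
    for sent_id, votes in judgements.items():
        if len(votes) != 3:
            raise ValueError(f"expected three judgements for {sent_id}")
        positives = sum(votes)
        majority[sent_id] = int(positives >= 2)   # at least two assessors agree
        unanimity[sent_id] = int(positives == 3)  # all three assessors agree
    return majority, unanimity
```

By construction, every sentence that is relevant under unanimity is also relevant under majority, so the unanimity qrels are the stricter set.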

2.2. Task 1: Results

The performance results for the participating systems are shown in Tables 4 (majority-based qrels) and 5 (unanimity-based qrels). The tables report several standard performance metrics: mean Average Precision (AP), mean R-Precision, mean Precision at 10, and mean NDCG at 1000. Remarkably, the runs unanimity and max from the team INESC-ID achieved the top-ranking performance for nearly all metrics and relevance judgement types. The teams UET-Psyche-Warriors, SonUIT, BGU-Data-Science and PJs-Team obtained results close to these, demonstrating strong competence in this task. Taken together, the results confirm that sentence-level symptom retrieval remains a challenging task.

1 https://erisk.irlab.org/guidelines_erisk24_task1.html
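For reference, the four ranking metrics reported for Task 1 can be computed per symptom as in the sketch below and then averaged over the 21 symptoms. It assumes binary relevance and rankings without duplicate sentence ids. It is an illustration, not the lab's evaluation script, and the function names are hypothetical.

```python
import math
from typing import List, Set


def precision_at(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k positions holding a relevant sentence."""
    return sum(1 for s in ranking[:k] if s in relevant) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each relevant sentence retrieved."""
    hits, total = 0, 0.0
    for i, s in enumerate(ranking, start=1):
        if s in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0


def r_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant sentences."""
    r = len(relevant)
    return precision_at(ranking, relevant, r) if r else 0.0


def ndcg_at(ranking: List[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains and a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, s in enumerate(ranking[:k], start=1)
        if s in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Note that `precision_at` divides by k even when fewer than k sentences were returned, which penalises short rankings.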

3. Task 2: Contextualized Early Detection of Depression (New Task)

This new task in 2025 introduces a different scenario in depression detection by incorporating full conversational contexts. Whereas earlier eRisk editions always released isolated posts authored by a single user, the 2025 task provided the entire Reddit discussion thread in which the target user intervened. Consequently, in the test dataset, systems had access not only to the messages produced by the target user but also to every other contribution in the thread and to the interaction structure that links the messages (e.g., the different replies to each comment).

This design is motivated by the observation that the clinical relevance of a message often becomes more evident when interpreted alongside the surrounding conversation. A user’s response may only gain relevance when viewed in conjunction with the preceding or subsequent interactions from other participants. For instance, a seemingly neutral sentence may reveal hopelessness if it answers a direct plea for support. For this reason, the task is designed to simulate real-world scenarios where depression detection may rely on analyzing exchanges between multiple participants. This setup presents unique challenges, as systems must consider not only the textual content of individual posts but also the interplay between participants and how this context influences the detection of depressive symptoms.

The test collection used for this task followed the same format as the collection described by Losada and Crestani [37]. The collection contains writings, including posts and comments, obtained from a selected group of social media users. To construct the ground truth assessments, we adopted established approaches that aim to optimise the use of assessors’ time, as documented in previous studies [38, 39]. These methods employ simulated pooling strategies, enabling the effective creation of test collections. The main statistics of the test collection used for Task 2 are presented in Table 6. Within this dataset, users are categorised into two groups: depression and control. For each user, the collection contains, in chronological order, the sequence of writings and threads in which the user participated. To facilitate the task and ensure uniform distribution, we established a dedicated server that systematically provided user writings to the participating teams. Further details regarding the server’s setup and functioning are available at the lab’s official website (footnote 2).

The task was divided into two phases:

• During the training phase, participants worked with a static dataset consisting of isolated user writings from depressed and control users, without any conversational context. This training dataset came from the early depression detection tasks of prior eRisk editions.

• The test phase, in contrast, was carried out interactively. For each target user, the server released a sequence of discussion threads in real time. Each thread constituted a submission round. At any round within the chronology of user writings, participants had the freedom to stop the process and issue an alert. After reading each user thread, teams were required to decide between two options: i) alerting about the target user, indicating a predicted sign of depression, or ii) not alerting about the target user. Participants independently made this choice for each user in the test split. It is important to note that once an alert was issued, it was considered final, and no further decisions regarding that particular user were taken into account. Conversely, the absence of an alert was considered non-final, allowing participants to submit an alert later if they detected emerging signs of risk.

To evaluate the systems’ performance, we employed two indicators: the accuracy of the decisions made and the number of user writings required to reach those decisions. These criteria provide valuable insights into the effectiveness and efficiency of the systems under evaluation. To support the test stage, we deployed a REST service. The server iteratively distributes user writings and waits for responses from participants. Importantly, new user data was not provided to a specific participant until the service received a decision from that particular team. The submission period for the task was open from February 5th, 2025 until April 12th, 2025.
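The round-based protocol, with final alerts and non-final "no alert" decisions, can be simulated offline as in the following sketch. It makes no network calls and does not reproduce the REST service's interface. The data layout, the `decide` callback and the function name are hypothetical.

```python
from typing import Callable, Dict, List


def run_rounds(
    rounds: List[Dict[str, str]],
    decide: Callable[[str, List[str]], bool],
) -> Dict[str, int]:
    """Simulate the interactive test phase.

    `rounds[i]` maps user id -> thread text released in round i + 1
    (hypothetical layout). `decide(user, history)` returns True to alert.
    Returns the round at which each alerted user was flagged.
    """
    history: Dict[str, List[str]] = {}
    alert_round: Dict[str, int] = {}
    for number, released in enumerate(rounds, start=1):
        for user, thread in released.items():
            history.setdefault(user, []).append(thread)
            if user in alert_round:
                continue  # an alert is final: later decisions are ignored
            if decide(user, history[user]):
                alert_round[user] = number
    return alert_round
```

Users absent from the returned mapping were never alerted, which corresponds to a final negative decision once their writings are exhausted.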

2 https://erisk.irlab.org/eRisk25Servert2Details.html

3.1. Task 2: Evaluation Metrics

3.1.1. Decision-based Evaluation

This evaluation approach uses the binary decisions made by the participating systems for each user. In addition to standard classification measures such as Precision, Recall, and F1 score (computed with respect to the positive class), we also calculate ERDE (Early Risk Detection Error), used in previous editions of the lab. A detailed description of ERDE was presented by Losada and Crestani in [37]. ERDE is an error measure that incorporates a penalty for delayed correct alerts (true positives). The penalty increases with the delay in issuing the alert, measured by the number of user posts processed before making the alert.
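As a rough illustration of how such an error measure behaves, the sketch below assumes the cost-based formulation of ERDE described in [37]: a false positive costs c_fp, a false negative costs c_fn, a true negative costs nothing, and a true positive costs c_tp scaled by a sigmoid latency factor centred at a cut-off o (o = 5 and o = 50 give ERDE5 and ERDE50). The function names and the cost values used in the example are illustrative assumptions, not the lab's official settings.

```python
import math
from typing import List, Tuple


def latency_cost(k: int, o: int) -> float:
    """Sigmoid factor 1 - 1 / (1 + e^(k - o)), written in an overflow-safe form."""
    x = k - o
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def erde(
    decisions: List[Tuple[int, int, int]],
    o: int,
    c_fp: float,
    c_fn: float = 1.0,
    c_tp: float = 1.0,
) -> float:
    """Mean error over users; each item is (decision, gold label, writings seen)."""
    total = 0.0
    for d, g, k in decisions:
        if d == 1 and g == 0:
            total += c_fp                        # false positive
        elif d == 0 and g == 1:
            total += c_fn                        # false negative
        elif d == 1 and g == 1:
            total += latency_cost(k, o) * c_tp   # delayed true positive
        # true negatives contribute no error
    return total / len(decisions)
```

For example, a true positive issued exactly at the cut-off (k = o) incurs half of c_tp, and the cost approaches c_tp as the delay grows.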

Since 2019, we have complemented the evaluation report with additional decision-based metrics that try to capture further aspects of the problem. These metrics try to overcome some limitations of ERDE, namely:

• the penalty associated with true positives goes quickly to 1. This is due to the functional form of the cost function (sigmoid).

• a perfect system, which detects the true positive case right after the first round of messages (first chunk), does not get an error equal to 0.

• with a method based on releasing data in a chunk-based way (as was done in 2017 and 2018), the contribution of each user to the performance evaluation has a large variance (it differs between users with few writings per chunk and users with many writings per chunk).

• ERDE is not interpretable.

Some research teams have analysed these issues and proposed alternative ways for evaluation. Trotzek and colleagues [40] proposed $\mathrm{ERDE}_o^{\%}$. This is a variant of ERDE that does not depend on the number of user writings seen before the alert but, instead, on the percentage of user writings seen before the alert. In this way, users’ contributions to the evaluation are normalized (currently, all users weigh the same). However, $\mathrm{ERDE}_o^{\%}$ has an important limitation. In real-life applications, the overall number of user writings is not known in advance. Social Media users post content online and screening tools have to make predictions with the evidence seen. In practice, you do not know when (and if) a user’s thread of messages is exhausted. Thus, the performance metric should not depend on knowledge about the total number of user writings.

Another alternative evaluation metric for early risk prediction was proposed by Sadeque and colleagues [41]. They proposed $F_{latency}$, which fits better with our purposes. This measure is described next.

Imagine a user $u \in U$ and an early risk detection system that iteratively analyzes $u$’s writings (e.g. in chronological order, as they appear in Social Media) and, after analyzing $k_u$ user writings ($k_u \geq 1$), takes a binary decision $d_u \in \{0, 1\}$, which represents the decision of the system about the user being a risk case. By $g_u \in \{0, 1\}$, we refer to the user’s golden truth label. A key component of an early risk evaluation should be the delay in detecting true positives (we do not want systems to detect these cases too late). Therefore, a first and intuitive measure of delay can be defined as follows (footnote 3):

$$\mathrm{latency} = \mathrm{median}\{k_u : u \in U,\ d_u = g_u = 1\} \tag{1}$$

This measure of latency is calculated over the true positives detected by the system and assesses the system’s delay based on the median number of writings that the system had to process to detect such positive cases. This measure can be included in the experimental report together with standard measures such as Precision (P), Recall (R) and the F-measure (F):

$$P = \frac{|u \in U : d_u = g_u = 1|}{|u \in U : d_u = 1|} \tag{2}$$

$$R = \frac{|u \in U : d_u = g_u = 1|}{|u \in U : g_u = 1|} \tag{3}$$

$$F = \frac{2 \cdot P \cdot R}{P + R} \tag{4}$$

Furthermore, Sadeque et al. proposed a measure, $F_{latency}$, which combines the effectiveness of the decision (estimated with the F measure) and the delay (footnote 4) in the decision. This is calculated by multiplying F by a penalty factor based on the median delay. More specifically, each individual (true positive) decision, taken after reading $k_u$ writings, is assigned the following penalty:

$$\mathrm{penalty}(k_u) = -1 + \frac{2}{1 + \exp(-p \cdot (k_u - 1))} \tag{5}$$

where $p$ is a parameter that determines how quickly the penalty should increase. In [41], $p$ was set such that the penalty equals 0.5 at the median number of posts of a user (footnote 5). Observe that a decision right after the first writing has no penalty (i.e. $\mathrm{penalty}(1) = 0$). Figure 2 plots how the latency penalty increases with the number of observed writings.

The system’s overall speed factor is computed as:

$$\mathrm{speed} = 1 - \mathrm{median}\{\mathrm{penalty}(k_u) : u \in U,\ d_u = g_u = 1\} \tag{6}$$

where speed equals 1 for a system whose true positives are detected right at the first writing. A slow system, which detects true positives after hundreds of writings, will be assigned a speed score near 0. Finally, the latency-weighted F score is simply:

$$F_{latency} = F \cdot \mathrm{speed} \tag{7}$$

Since 2019, users’ data have been processed by the participants on a post-by-post basis (i.e. we avoided a chunk-based release of data). Under these conditions, the evaluation approach has the following properties:

• smooth growth of penalties;

• a perfect system gets $F_{latency} = 1$;

• for each user the system can opt to stop at any point and, therefore, we no longer have the effect of an imbalanced importance of users;

• $F_{latency}$ is more interpretable than ERDE.

3 Observe that Sadeque et al. (see [41], pg. 497) computed the latency for all users such that $g_u = 1$. We argue that latency should be computed only for the true positives. The false negatives ($g_u = 1$, $d_u = 0$) are not detected by the system and, therefore, they would not generate an alert.

4 Again, we adopt Sadeque et al.’s proposal but we estimate latency only over the true positives.

5 In the evaluation we set $p$ to 0.0078, a setting obtained from the eRisk 2017 collection.

3.1.2. Ranking-based Evaluation

In addition to the evaluation discussed above, we employed an alternative form of evaluation to further assess the systems. After each data release (a new user writing, that is, a post or comment), participants were required to provide the following information for each user in the collection:

• A decision for the user (alert or no alert), which was used to calculate the decision-based metrics discussed previously.

• A score representing the user’s level of risk, estimated based on the evidence observed thus far.

The scores were used to create a ranking of users in descending order of estimated risk. For each participating system, a ranking was generated at each data release point, simulating a continuous re-ranking approach based on the observed evidence. In a real-life scenario, this ranking would be presented to an expert user who could make decisions based on the rankings (e.g., by inspecting the top of the rankings). Each ranking can be evaluated using standard ranking metrics such as P@10 or NDCG. Therefore, we report the performance of the systems based on the rankings after observing different numbers of writings.

3.2. Task 2: Participant Teams

3.3. Task 2: Results

Table 8 shows the decision-based results of Task 2, and Table 9 shows the ranking-based results. In the decision setting, HIT-SCIR dominates: its best run attains the highest F1 (0.85) while keeping both ERDE5 and ERDE50 at or very near the minimum error values. That performance is achieved with a median latency of only eight writings, illustrating a good balance between earliness and accuracy. ELiRF-UPV follows at a short distance, with a top F1 of 0.79 but slightly worse error-aware metrics.

The ranking-based evaluation shows a complementary picture. HIT-SCIR again exhibits near-perfect precision at every cut-off and sustains the highest NDCG values as additional writings become available, confirming the robustness of its retrieval component. Lotu-Ixa excels in the one-writing scenario, matching HIT-SCIR for P@10 and NDCG@10. However, its advantage diminishes once longer histories are considered, suggesting that its decision policy strongly weights the earliest cues.
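The decision-based measures discussed above (precision, recall, F, latency, speed and the latency-weighted F of Section 3.1.1) can be reproduced from per-user decisions with a short script. The following is a minimal sketch rather than the official evaluation code. The input layout and function names are assumptions, and p defaults to the value 0.0078 reported for the evaluation.

```python
import math
from statistics import median
from typing import Dict, List, Optional, Tuple


def penalty(k: int, p: float = 0.0078) -> float:
    """Latency penalty for a true positive detected after k >= 1 writings."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))


def latency_weighted_f(
    decisions: List[Tuple[int, int, int]], p: float = 0.0078
) -> Dict[str, Optional[float]]:
    """Each item is (decision d_u, gold label g_u, writings seen k_u)."""
    tp_delays = [k for d, g, k in decisions if d == 1 and g == 1]
    alerts = sum(1 for d, _, _ in decisions if d == 1)
    positives = sum(1 for _, g, _ in decisions if g == 1)
    if not tp_delays:
        # No true positives: latency is undefined and all scores are zero.
        return {"P": 0.0, "R": 0.0, "F": 0.0, "latency": None,
                "speed": 0.0, "F_latency": 0.0}
    precision = len(tp_delays) / alerts
    recall = len(tp_delays) / positives
    f = 2 * precision * recall / (precision + recall)
    # Latency and speed are computed over true positives only.
    speed = 1.0 - median(penalty(k, p) for k in tp_delays)
    return {"P": precision, "R": recall, "F": f,
            "latency": median(tp_delays), "speed": speed,
            "F_latency": f * speed}
```

Because the penalty is zero at the first writing, a system that flags every positive user immediately keeps its plain F score, while slower systems see it scaled down by the speed factor.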

4. Pilot Task: Conversational Depression Detection via LLMs

We introduced this pilot task in 2025 as a novel challenge to explore the use of conversational agents for detecting depression symptoms. Participants interacted with LLM-based personas that had been instructed using user writings and example user profiles, simulating real-world conversational exchanges. Twelve distinct personas were instantiated with ChatGPT. The challenge lies in asking participants to determine whether the LLM persona exhibits signs of depression and, if so, the level of depression severity and the key depression symptoms expressed over the conversation. The diagnostic target for the LLMs was framed in terms of the BDI-II, as in Task 1. The BDI-II is a 21-item self-report questionnaire widely used in clinical psychology; its items are listed in Table 10. Each item corresponds to a concrete symptom, for example Sadness, Loss of Energy, or Indecisiveness. Each symptom is scored 0 to 3 according to severity. Table 11 shows the possible response options (0-3) for the symptoms Sadness and Self-Dislike. The sum of all 21 symptom scores yields a global index in the range 0–63. The scores are interpreted in four categories: 0–9 as minimal depression, 10–18 as mild, 19–29 as moderate, and 30 or above as severe. Because the personas are simulations, no ground-truth questionnaire exists; instead, a group of three clinicians examined the seed user data that shaped each persona and agreed on both an overall BDI-II score and the subset of symptoms included. These consensus judgments constitute the gold standard.
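The scoring scheme just described reduces to a sum followed by a threshold lookup. The sketch below uses the category boundaries stated above; the function names are illustrative.

```python
from typing import Sequence


def bdi_total(item_scores: Sequence[int]) -> int:
    """Sum the 21 BDI-II item scores (each 0 to 3) into a global index (0 to 63)."""
    if len(item_scores) != 21:
        raise ValueError("the BDI-II has 21 items")
    if any(not 0 <= s <= 3 for s in item_scores):
        raise ValueError("each item is scored from 0 to 3")
    return sum(item_scores)


def bdi_category(score: int) -> str:
    """Map a global BDI-II index to the severity category used in the task."""
    if not 0 <= score <= 63:
        raise ValueError("the BDI-II index ranges from 0 to 63")
    if score <= 9:
        return "minimal"
    if score <= 18:
        return "mild"
    if score <= 29:
        return "moderate"
    return "severe"
```

For instance, a persona scoring 1 on every item has a global index of 21, which falls in the moderate band.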

Participants did not receive any labelled training material. We deliberately framed the task as training-free to encourage a variety of methodological responses, ranging from rule-based interviewers and zero-shot LLM prompts to different classifiers trained on public mental-health corpora. During the test window, teams used the links we provided to open a dialogue with each LLM persona through the ChatGPT interface. The participant system sent a free-form prompt, the server produced the next turn, and so on. This loop continued until the system chose to terminate the dialogue and submit its diagnosis. Since this is a pilot task, there was no hard cap on the number of turns, but we encouraged the participants to produce their decisions as early as possible.

After ending the conversation with a persona, a participating system had to return two files. The first was a structured log that preserves, in chronological order, every prompt–response pair exchanged with the agent; this file serves auditing and qualitative analysis. The second was a JSON record containing three fields: the predicted BDI-II score (an integer 0–63), the corresponding severity category, and up to four symptoms, drawn from the BDI-II list in Table 10, that best explained the score.
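A record of this kind could be assembled and validated as in the sketch below. The field names (`bdi_score`, `severity`, `key_symptoms`) and the category labels are hypothetical placeholders, since the exact schema required by the lab is not reproduced here.

```python
import json
from typing import List

# Hypothetical labels for the four severity categories.
SEVERITIES = ("minimal", "mild", "moderate", "severe")


def build_record(score: int, severity: str, symptoms: List[str]) -> str:
    """Serialise a prediction as JSON after basic sanity checks.

    Field names are illustrative assumptions, not the official schema.
    """
    if not 0 <= score <= 63:
        raise ValueError("BDI-II score must be an integer in 0..63")
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity category: {severity}")
    if len(symptoms) > 4:
        raise ValueError("at most four key symptoms may be reported")
    return json.dumps(
        {"bdi_score": score, "severity": severity, "key_symptoms": symptoms}
    )
```

Validating the record before submission guards against the most likely formatting mistakes, such as an out-of-range score or a fifth symptom.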

4.1. Pilot Task: LLM Personas Design and Construction

We adopted a clinician-in-the-loop design workflow to build the twelve LLM personas. A team of three clinical psychologists co-designed a template that captures both general biographical detail and clinically relevant information. Using this template we instantiated a pool of draft personas with GPT-4o, each conditioned on a different user history.

The same clinicians then conducted free-form interviews with every draft, rating each dialogue along two main dimensions:

• The overall dimension covered traits associated with conversational attributes: human-likeness, lexical fluency, coherence, and affective naturalness.

• The diagnostic dimension targeted domain realism, including emotional consistency, fidelity to depressive symptomatology, willingness to elaborate, and cognitive style (rumination, processing speed, abstraction level).

Feedback was recorded on a five-point Likert scale and complemented with qualitative comments. Insights from this evaluation cycle informed a second engineering pass in which every persona was represented through a structured prompt comprising the following main elements:

• Core profile. A stable set of attributes: name, age, gender, marital status and a pre-defined BDI-II score.

• Key negative symptoms. Up to four key BDI-II symptoms (or fewer for control personas) that the agent should manifest recurrently and coherently.

• Memory and reflection. Specific snippets describing life history, social context, and salient past events; these cues allow the agent to maintain narrative continuity and to provide retrospective insight into its mood.

• Language and communication style. Vocabulary use and typical sentence length, so that each persona speaks with a recognisable “voice”.

• Behavioural constraints. Guard-rails that prohibit explicit self-diagnosis and that keep the agent away from clinical recommendations, thereby forcing participants to infer depression indirectly.

• Response goals. High-level objectives such as “answer candidly but not expansively,” “avoid mentioning diagnosis unless prompted,” and “display mild self-disclosure”.

• Environment and context. Brief situational framing (e.g. studying for exams, recent job change) that provides topical depth without locking the dialogue.

• Few-shot exemplars. Short question–answer pairs illustrating the expected tone and symptom expression.

• Restricted responses. A blacklist of phrases that would break immersion (e.g. “As an AI language model...”), replaced with context-appropriate alternatives.

The final personas were frozen only after a second round of clinician interaction confirmed that they satisfied a minimum threshold on both the overall and diagnostic scales. This iterative, expert-guided construction process proved essential to achieve dialogues that are simultaneously natural and diagnostically meaningful, laying the groundwork for future large-scale evaluations of conversational mental-health screening systems.

4.2. Pilot Task: Participant Teams

Table 12 shows the participant teams and some statistics about their interactions, such as the mean number of messages per run and the mean number of characters per message. The numbers reveal a wide range of interaction strategies:

• ixa-ave submitted the maximum number of runs (four) and tended to carry out relatively lengthy dialogues (≈ 31 messages each) while keeping their prompts concise (≈ 415 characters per turn).

• SINAI-UJA used a fast approach, with only 6–7 turns on average, yet still packed almost 490 characters into every message, suggesting dense, information-rich questioning.

• DS-GT followed an intermediate approach, with ≈ 21 messages per run and 783 characters per message, balancing breadth and depth of interaction.

• PJs-team produced long messages (≈ 1045 characters) within a limited number of turns (≈ 8), delivering extended prompts.

• LT4SG employed a fixed sequence of ten short messages averaging only 41 characters, representing the most lightweight strategy.

4.3. Pilot Task: Evaluation Metrics

Building on the evaluation metrics developed for eRisk 2019 [48], which involved the use of BDI-II questionnaires and scores, we extend the evaluation approaches as follows:

• Depression Category Hit Rate (DCHR): Based on the four depression level categories discussed above, from minimal to severe depression, this effectiveness measure is the fraction of cases where the BDI-II score estimated by the participants for a simulated persona lies in the correct depression category.

• Average DODL (ADODL): For this pilot task, we reuse the Average Difference between Overall Depression Levels (ADODL), which measures the closeness between the actual and estimated depression levels. For each persona, the DODL is calculated as $\mathrm{DODL} = (\mathrm{MAD} - |\mathrm{ADL} - \mathrm{EDL}|)/\mathrm{MAD}$, where $|\mathrm{ADL} - \mathrm{EDL}|$ is the absolute difference between the Actual Depression Level (ADL) and the Estimated Depression Level (EDL), and MAD is the Maximum Absolute Difference (i.e., 63), which normalises the score to [0, 1]. ADODL is the average of DODL over personas. For example, if a simulated persona has a minimal depression severity (depression level 5) and a participant estimates the depression level to be 9, the DODL is (63 − |9 − 5|)/63 = 0.9365.

• Average Symptom Hit Rate (ASHR): Aside from estimating the depression level of simulated personas as per BDI-II scores, this pilot task also involves identifying the major depression symptoms of the simulated personas. Hence, the SHR is the ratio of a persona’s major symptoms that the participants correctly identify, and ASHR averages it over personas. For example, each simulated persona has four major symptoms. If a participant accurately identifies two of them, then the SHR equals 2/4 = 0.5.
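The three measures can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the official scorer: the category boundaries are those given in Section 4 (0–9, 10–18, 19–29, 30 or above), personas with no gold symptoms are skipped when averaging SHR (an assumption about how control personas are handled), and all names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Set

UPPER_BOUNDS = [9, 18, 29]  # minimal, mild, moderate; anything above is severe
MAD = 63                    # maximum absolute difference between BDI-II scores


def band(score: int) -> int:
    """Index of the severity category (0 = minimal ... 3 = severe)."""
    return bisect_left(UPPER_BOUNDS, score)


def dchr(actual: List[int], estimated: List[int]) -> float:
    """Fraction of personas whose estimated score falls in the correct category."""
    return sum(band(a) == band(e) for a, e in zip(actual, estimated)) / len(actual)


def adodl(actual: List[int], estimated: List[int]) -> float:
    """Average of (MAD - |ADL - EDL|) / MAD over personas."""
    return sum((MAD - abs(a - e)) / MAD for a, e in zip(actual, estimated)) / len(actual)


def ashr(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Average share of gold symptoms recovered; empty gold sets are skipped."""
    rates = [len(g & p) / len(g) for g, p in zip(gold, predicted) if g]
    return sum(rates) / len(rates) if rates else 0.0
```

With an actual level of 5 and an estimate of 9, `adodl([5], [9])` reproduces the worked value of 59/63, and both scores fall in the minimal band, so the category counts as a hit.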

4.4. Pilot Task: Results

Table 13 presents the official runs, ranked by best ADODL. The strongest submission, SINAI-UJA (run 1), achieves an ADODL of 0.93, meaning the predicted scores differ by less than five points on average from the clinician reference. Its DCHR of 0.58 shows that, despite these small errors, most of its predictions still fall within the correct severity band. DS-GT attains comparable category accuracy (0.50) with only a modest drop in ADODL, reaching a similar level of reliability despite larger absolute score errors.

Across all teams, however, symptom recognition lags behind score estimation: even the best ASHR values hover below 0.30, indicating that systems often capture the global severity signal without isolating which symptoms drive it.

5. Participating Teams

Table 14 reports the participating teams and the runs that they submitted for each eRisk task. The following paragraphs briefly summarise the methods implemented by each team. Further details are available in the participants' papers in the CLEF 2025 working notes proceedings.

Lotu-ixa [42]. The Lotu-ixa team, affiliated with the University of the Basque Country (Spain), participated in Task 2 of eRisk 2025. The team proposes a method that (i) applies a semantic relabelling process to the training data, (ii) designs and fine-tunes a classification model, and (iii) finally combines risk signals derived from both the target user and the conversational context. For (i), a similarity score was computed against representative positive and negative examples, and a percentile-based strategy then determined which messages were suitable for relabelling. The classifier in (ii) was derived from XLM-RoBERTa and fine-tuned on the relabelled dataset from (i) using a binary cross-entropy loss function, with hyperparameters optimised via grid search. Finally, for (iii), the team computed user risk, context risk and thread risk scores and derived a binary decision from them. The team submitted five runs using different thread risk score settings. Their run #4 obtained the best recall (1.0), and runs #0-#3 yielded the best ERDE5 (0.05) among the participants of this task. Their approach was also highly competitive across the ranking-based metrics, with most of their runs achieving 1.0 in precision and NDCG.

SINAI-UJA [44]. The SINAI-UJA team, from the University of Jaén (Spain), participated in Tasks 2 and 3 of the eRisk 2025 challenge. For Task 2, the team relied on the provided train and test sets, but also developed a new dataset in order to have training data with context. The team fine-tuned RoBERTa and Mental RoBERTa models in different settings, optimised with Optuna, and submitted five runs using different combinations of models and parameters. Their system was one of the top three in terms of efficiency, completing the task in less than 10 hours. It achieved perfect recall (1.0) in all runs, but at the cost of low precision (0.17-0.24) and low F1 (0.29-0.39). In the ranking-based evaluation, the team performed competitively at early stages, in line with the top-ranked teams. For Task 3 (the pilot task), the SINAI team proposed a modular system composed of two collaborating LLMs: the first is responsible for interacting with the user, while the second does not interact with the user but receives the conversation, analyses it, and updates the state of the depressive symptoms. This second LLM also reasons about whether it needs more information, ending the conversation when appropriate. The team uses the Llama-3.1-8B-Instruct model for both LLMs. They submitted three runs with different prompt configurations and achieved the fastest interaction, with an average of 6.54 messages per run, as well as the best overall ADODL (0.93), ASHR (0.29) and DCHR (0.66), indicating that their estimations were highly aligned with the BDI-II levels of the simulated personas and that their approach effectively identified key symptoms.

COTECMAR-UTB [31]. COTECMAR-UTB, affiliated with Universidad Tecnológica de Bolívar (Colombia), participated in Tasks 1 and 2 of eRisk 2025. For Task 1, the team focused on high-confidence training data and balanced the data using EDA and SMOTE. The authors propose a pipeline that includes data preprocessing and cleaning, followed by the training of ML models, including LR, SVM and BERT, among others. After that, they apply VADER to identify texts with negative sentiment and score the sentences. They submitted one run, achieving middle-tier performance. For Task 2, the team trained an LSTM model to predict the risk of depression. The team submitted two runs with moderate performance, achieving a best F1 of 0.40 and a recall of 0.65. The ranking-based results leave room for improvement, suggesting that the model had difficulties prioritizing relevant messages.

HULAT_UC3M [36]. The HULAT-UC3M team, affiliated with Universidad Carlos III de Madrid (Spain), participated in Task 1 of the eRisk 2025 challenge. The team proposed training a multi-class classifier (SVM) to assign every sentence to its corresponding symptom, keeping only the sentences with the highest probabilities according to different thresholds; filtering sentences according to different criteria for each run; and scoring the sentences using either VADER or roberta-base-sentiment. The authors use only the training data labelled with unanimity in order to minimise noise. Their best run uses RoBERTa and selects the top 1000 sentences based on confidence scores, achieving an AP of 0.018. The high-confidence filtering used in two of their runs had a positive impact on performance, but the scoring method can be improved.

BGU-Data-Science [35]. The BGU-Data-Science team, affiliated with Ben-Gurion University of the Negev (Israel), participated in Task 1 of eRisk 2025. The authors approached the task as a sentence ranking problem by computing the semantic similarity between user sentences and BDI-II symptom descriptions embedded using Sentence-BERT. The team performed query expansion and filtered out sentences that were not in the first person. For the first-person filtering, the team employed three different methods: a basic filtering approach using first-person pronouns, a method using spaCy, and Claude Sonnet 3.7 to assess whether a sentence conveys the user’s personal experience. The team achieved their best results with the baseline approach, using only embeddings from Sentence-BERT, which resulted in an AP score of 0.240. Although incorporating query expansion and first-person filtering did not yield the highest AP, it did achieve the highest P@10 among the team's runs.
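The simplest of these ideas, a pronoun-based first-person filter followed by similarity ranking, can be sketched as follows. This is an illustrative sketch rather than the team's implementation: the pronoun list, the function names, and the toy bag-of-words `toy_embed` (standing in for a real sentence encoder such as Sentence-BERT) are all assumptions made for the example.

```python
import math
import re

# ASSUMPTION: a minimal first-person pronoun list; the team's list is not given.
FIRST_PERSON = re.compile(r"\b(i|me|my|mine|myself)\b", re.IGNORECASE)

# Toy vocabulary used only to make the example self-contained.
VOCAB = ["sad", "tired", "sleep", "happy"]


def is_first_person(sentence):
    """True if the sentence contains at least one first-person pronoun."""
    return bool(FIRST_PERSON.search(sentence))


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: word counts over VOCAB."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [float(tokens.count(word)) for word in VOCAB]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_sentences(query_vec, sentences, embed, first_person_only=True):
    """Optionally drop non-first-person sentences, then rank by similarity."""
    candidates = [
        s for s in sentences if not first_person_only or is_first_person(s)
    ]
    return sorted(
        candidates, key=lambda s: cosine(query_vec, embed(s)), reverse=True
    )
```

For instance, ranking the sentences "I feel sad and tired", "He is sad" and "My sleep is fine" against a query built from "sad tired" keeps only the first and third sentence when the filter is on, with the first ranked highest.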

INESC-ID [33]. The INESC-ID team, affiliated with the University of Lisbon (Portugal), participated in the first task of the eRisk lab. Although this task is framed as an information retrieval challenge, the authors approach it as a regression or classification problem. The team explored several methods, including fine-tuned foundation models (DeBERTa-v3-large), unsupervised similarity-based approaches, and LLM-based classification using GPT-4o-Mini. The authors make use of the training data provided for this task to train and validate their approaches. The DeBERTa model was fine-tuned for regression to predict a relevance score ranging from 0 to 1, while the other two methods were framed as binary classification tasks. The best-performing run of the team was an ensemble approach that combined the outputs of all the methods, achieving the highest scores in AP, R-PREC, and P@10.

HU [46]. The HU team, affiliated with Habib University (Pakistan), participated in Task 2 of the eRisk 2025 challenge. The runs submitted by the team cover a wide range of approaches, including transformer-based models (ModernBERT) with time-aware loss or data augmentation strategies, Llama 3.1 summarization with BERT classification, a zero-shot model using Llama-4-Scout-17B, and a simple threshold approach. The best-performing method, using Llama 3.1 for summarization and BERT classification with an incorporated alert policy (run #1), achieved an F1 score of 0.75, ranking 3rd out of 12 teams in the decision-based evaluation. In the ranking-based evaluation, the same run obtains a perfect score of 1.00 in P@10 and NDCG@10 after one writing.

FU-TU-DFKI [47]. The FU-TU-DFKI team is affiliated with three different organizations from Germany: the Freie Universität Berlin, the University of Hannover, and the Technical University of Berlin. They participated in Task 2 of eRisk 2025. The authors conducted two pilot studies that focused on the linguistic analysis of the dataset provided for the task. The first study examined the use of first-person singular pronouns and the verbs commonly associated with them. The second study involved a concept analysis of the keywords found in the data. The insights gained from these studies informed their proposed method. The team’s hybrid system combines a transformer-based model (MentalBERT) with linguistically informed features, such as the use of first-person pronouns and associated verbs, as well as other relevant keywords. In addition, the system incorporates metadata, including late-night posting frequency and the sentiment of the posts. The team achieved modest results, having processed only 449 out of a total of 1,280 user threads.

ThinkIR [27]. The ThinkIR team comes from two organizations in India, the Indian Institute of Science Education and Research Kolkata and the Vellore Institute of Technology. ThinkIR submitted five runs for Task 1. Four rely on classical IR ranking with different query expansion strategies, namely kNN word-embedding expansion, pseudo-relevance feedback (PRF), and GPT-generated prompt reformulations, while the remaining one uses a RoBERTa-based multi-label classifier. The best run, which involves RoBERTa fine-tuning, achieved an AP of 0.068, R-Precision of 0.157, P@10 of 0.409, and NDCG of 0.228, leading every metric among their runs. The experiments confirm that transformer fine-tuning outperformed all classical expansion methods, although PRF on the top ten documents still produced competitive rankings.

Ixa_ave [28]. The ixa_ave team is affiliated with the HiTZ Basque Center for Language Technology, from the University of the Basque Country (Spain). ixa_ave took part in Task 1 and the inaugural pilot task at eRisk 2025. For Task 1, they fine-tuned multilingual BERT, appending a 21-dimensional vector of cosine-similarity scores to each sentence and predicting with a 21-head classifier. They tried two similarity-based data-reduction ideas: (i) skipping training sentences whose similarity to any BDI-II item exceeds a threshold of 0.5, and (ii) keeping at inference only those sentences whose similarity is at least a threshold of 0.3 or 0.5. Among the five submitted runs, base_filter30 (inference threshold of 0.3, no training pruning) was best, reaching AP = 0.102 under majority voting. In the pilot task, they compared a manual questionnaire interview (run 0) with three LLM agents: GPT-4-long (run 1), GPT-4-short (run 2) and Falcon-11B (run 3). Both GPT-4 variants matched the human baseline with DCHR = 0.33, whereas Falcon obtained worse results, with 0.17.

UET-Psyche-Warriors [34]. The UET-Psyche-Warriors team is affiliated with the VNU University of Engineering and Technology (Vietnam). The authors participated in Task 1 and Task 2 of the eRisk 2025 challenge. For Task 1, they explored both semantic similarity-based ranking and a machine learning approach using a multi-task DepRoBERTa model fine-tuned for symptom detection and severity estimation. Their best run (Run 4) achieved an NDCG of 0.623 and an AP of 0.339, ranking second overall. For Task 2, the team implemented a multi-stage system combining sentence-level severity scoring with rule-based aggregation strategies. Run 2, which incorporated temporal accumulation with a bonus heuristic, achieved their best results with an F1 score of 0.73 and a latency-aware F1 of 0.68, placing them fourth overall.

ELiRF-UPV [30]. The ELiRF-UPV team, affiliated with the Polytechnic University of Valencia (Spain), participated in Tasks 1 and 2 of the eRisk 2025 challenge. For Task 1, the team developed an adapter architecture over pre-trained sentence similarity models, incorporating attention over reference embeddings derived from both cluster centroids and the BDI-II question-answer pairs. For Task 2, they explored three approaches: a classical SVM classifier, a Longformer fine-tuned on user-level data, and a task-adapted Longformer model trained using a data augmentation strategy designed to simulate early detection conditions. Their best-performing system in Task 2 was a Linear SVM using TF-IDF features, ranking 6th overall in the competition.

HIT-SCIR [43]. The HIT-SCIR team is affiliated with the Harbin Institute of Technology, in Harbin (China). They participated in Task 2 of the CLEF 2025 eRisk lab. Their proposal focuses on contextualized early detection of depression on social media, utilizing a multi-stage framework. Their approach addresses the challenge of limited interactive context in training data by employing LLMs for contextual data augmentation. Specifically, they use LLMs to simulate social interactions, generating comments for original user posts and then summarizing these comments to create a rich semantic context. A core component of their system is a psychiatric scale-guided risky post screening module, which identifies depression-related information from user post histories. This module calculates a risk score for each post based on its cosine similarity with symptom descriptions from established psychological scales, such as the BDI-II. Posts with higher risk scores are then selected for depression risk detection. The detection itself uses MentalBERT, a BERT variant optimized for mental health texts, to generate post embeddings, and a Transformer with attention mechanisms to model inter-post interactions and generate user features. The entire screening and detection process is trained end-to-end using a Straight-Through Estimator (STE). For early detection testing, a dynamic risky post queue and different alerting strategies are employed. The team submitted five runs with varying operational parameters for their dynamic user-level early risk assessment strategy, using a voting ensemble of their top three performing models. This integrated approach led to strong performance, achieving first rank in several evaluation metrics, including F1 (0.85 for HIT-SCIR-4), ERDE50 (0.03 for HIT-SCIR-4), and F_latency (0.82 for HIT-SCIR-2 and HIT-SCIR-4). They also achieved first place in the majority of ranking-based metrics, such as P@10 and NDCG@10, across almost all writing cut-offs.

PJs-team [29]. The PJs-team, affiliated with Netaji Subhas University of Technology (India), presented distinct approaches for the three tasks. In Task 1, the team used fine-tuned bi-encoders (e.g., DistilRoBERTa, e5-small) with CoSENTLoss and their ensemble using Reciprocal Rank Fusion (RRF). They also employed fine-tuned cross-encoders (’ModernBERT-large’, ’ModernBERT-base’) with BinaryCrossEntropyLoss for reranking, and reranker ensembles using majority voting or scaled mean averaging. The cross-encoder ensemble run gave their top scores (AP 0.279, P@10 0.800). In Task 2, they presented a two-stage pipeline that first filters each new post with a custom DistilRoBERTa sentence-transformer against four early BDI cues (pessimism, punishment feelings, self-dislike, indecisiveness). High-scoring texts or users previously flagged are analysed by an ensemble of four hosted LLMs (Claude 3.7 Sonnet, Amazon Nova Pro, Llama 3-70B, Claude 3.5 Haiku), and a majority vote delivers the final decision. The single-model Sonnet run achieved the team’s best F1 = 0.71 with a low ERDE5 = 0.09. In Task 3, they built a single LLM agent (Claude Sonnet) driven by a long system prompt embedding the full BDI-II questionnaire. The agent chats about movies to elicit emotions and updates the 21 BDI scores each turn, ending when all scores are set. In the pilot evaluation, it reached ADODL 0.73 and DCHR 0.33.

LHS712-Team-1 [32]. The LHS712Team comes from the School of Information and the Department of Learning Health Sciences at the University of Michigan (USA). The authors participated in Task 1 and benchmarked a wide spectrum of approaches over ten runs, covering: (i) classical baselines, namely Logistic Regression and SVM coupled with CountVectorizer or TF-IDF features; (ii) domain-specific embeddings, with ClinicalBERT and SentenceBERT sentence vectors fed into Linear-SVC or LR classifiers; (iii) a BERT model with a “[SYMPTOM] [SEP] sentence” formulation, fine-tuned for five epochs, where a symptom keyword filter first pruned the 17-million-sentence test set to keep inference tractable; and (iv) a hybrid retrieval method, where BM25 selects candidates that are reranked by SBERT cosine similarity. The fine-tuned BERT with unanimous-label training was their top performer, yielding AP 0.078, R-Prec 0.169, P@10 0.344 and NDCG 0.287 on the official unanimity evaluation, well above their traditional baselines.
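Reciprocal Rank Fusion, mentioned above as the ensembling method for the PJs-team bi-encoders, combines several ranked lists by summing 1/(k + rank) for each item. The following is a minimal generic sketch of that technique, not the team's code: the function name is hypothetical, and k = 60 is the commonly used default constant, assumed here because the team's setting is not stated.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of item ids into a single ranking.

    rankings: iterable of lists, each ordered from most to least relevant.
    k: smoothing constant (ASSUMPTION: the conventional default of 60).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three hypothetical rankers that partially disagree on four sentence ids.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["b", "a", "d"],
])
```

An item that appears near the top of several lists accumulates a higher fused score than one ranked first by a single system only, which is why the method is a popular, parameter-light way to ensemble retrievers.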

DS-GT [45]. The DS-GT team, from the Georgia Institute of Technology (USA), participated in Task 2 and the pilot task. In Task 2, the team contrasted two pipelines: (i) a Voting Classifier combining engineered features (TF-IDF, VADER sentiment, LIWC-style counts, posting-gap timings) in a soft-vote ensemble of Random Forest, SGD-LogReg and Gradient Boosting; and (ii) LightGBM with temporal attention, where MentalRoBERTa sentence embeddings feed a linearly-weighted recency mechanism and a sparse “depression-indicator” content matrix before classification. Both runs achieved recall = 1.0 but low precision (P = 0.11, F1 = 0.20) and an identical ERDE5 = 0.12, with the embedding-based model yielding far better ranking scores (P@10 = 0.90, NDCG@10 = 0.92 on the 1-writing cut). In the pilot task, a unified prompt-engineering framework used several LLMs (Claude 3.7 Sonnet, GPT-4o, Gemini Flash/Pro) to conduct interviews of approximately 20 turns, outputting structured JSON with item-level BDI-II scores and key symptoms. The best run (Claude Sonnet) placed second overall (DCHR 0.50, ADODL 0.89, ASHR 0.27). Exploratory analysis showed strong cross-model consistency (R² = 0.91 between label level and BDI score) but wide variance on appetite and agitation cues.

SonUIT [26]. The SonUIT team is affiliated with the University of Information Technology (UIT), in Vietnam, and participated in Task 1. Their system uses a two-stage pipeline: (i) filtering, where they build averaged all-MiniLM-L6-v2 embeddings for each BDI-II symptom and retrieve the top 1000 sentences per symptom via cosine similarity; and (ii) reranking, where the candidate set is optionally re-sorted with BM25, a cross-encoder, or larger embedding models (bge-large-en-v1.5 and text-embedding-3-large). Five runs explored raw vs. pre-processed text and the different rerankers. Their configuration #2 (pre-processed text + embedding filter) posted the team’s best scores and placed within the top three teams on every evaluation metric (MAP = 0.334, R-Prec = 0.392, P@10 = 0.790, NDCG@1000 = 0.613).
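The filtering stage of such a pipeline can be illustrated with a small sketch. It assumes that each symptom is represented by the mean of several embedding vectors (which texts are averaged is not specified in the summary above) and that sentence embeddings are already available; the function names and the plain-list vector representation are hypothetical choices made to keep the example dependency-free.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_per_symptom(symptom_vectors, sentence_vectors, k=1000):
    """Return, for each symptom, the ids of the k most similar sentences.

    symptom_vectors: dict mapping symptom name -> list of embedding vectors
        (ASSUMPTION: these are averaged into one centroid per symptom).
    sentence_vectors: dict mapping sentence id -> embedding vector.
    """
    candidates = {}
    for symptom, vectors in symptom_vectors.items():
        centroid = mean_vector(vectors)
        ranked = sorted(
            sentence_vectors,
            key=lambda sid: cosine(centroid, sentence_vectors[sid]),
            reverse=True,
        )
        candidates[symptom] = ranked[:k]
    return candidates
```

A reranker (for example BM25 or a cross-encoder) would then reorder only the returned candidates, which keeps the more expensive second stage tractable on a large sentence collection.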

6. Conclusions

This paper provided an overview of eRisk 2025, the ninth edition of the eRisk lab, which moved toward two new tasks that require richer conversational understanding and interactive settings. Task 1, the final edition of the sentence-ranking challenge for BDI-II symptoms, attracted 67 runs from 17 teams. Task 2 introduced full-thread context for the first time in early detection of depression. In this task, we received 50 runs from 12 teams, which showed that models able to exploit dialogue structure can issue accurate alerts after remarkably few turns, although a clear trade-off persists between earliness and recall. The pilot task went a step further, replacing static corpora with live interaction against LLM-driven personas. Despite the absence of training data, five teams submitted 13 runs; top systems achieved near-perfect BDI-II score estimation yet still struggled to pinpoint the specific symptoms behind those scores, highlighting the difficulty of symptom-level grounding in open conversation. Taken together, the 130 runs submitted this year confirm both the community’s engagement and the practicality of evaluation settings that approach real conversational use cases. Three broad lessons emerge: adding even modest context improves detection; timeliness must remain a core metric; and clinician-guided LLM personas, despite ample room for improvement, can support realistic yet privacy-preserving evaluation frameworks. Future eRisk editions will continue to shift toward dialogue-centric tasks and deeper integration of LLM capabilities to keep pace with how people communicate online and how assistive technologies are deployed.

7. Acknowledgments

The authors thank the financial support supplied by the grant PID2022-137061OB-C21 funded by MICIU/AEI/10.13039/501100011033 and by “ERDF/EU”. The authors also thank the funding supplied by the Consellería de Cultura, Educación, Formación Profesional e Universidades (accreditations ED431G 2023/01 and ED431C 2025/49) and the European Regional Development Fund, which acknowledges CITIC as a center accredited for excellence within the Galician University System and a member of the CIGUS Network. CITIC receives subsidies from the Department of Education, Science, Universities, and Vocational Training of the Xunta de Galicia. Additionally, it is co-financed by the EU through the FEDER Galicia 2021-27 operational program (Ref.

8. Declaration on Generative AI

During the preparation of this manuscript, generative AI tools were employed solely for light editing purposes, including proofreading, grammar correction, vocabulary improvement, and overall language polishing. All substantive ideas, analyses, experiments, and written content were created by the co-authors without direct text generation from any AI model.

Madrid, Spain, September 9-12, 2025.

[32] A. Benloucif, Y. Nannapuraju, S. Bellam, Y. Hu, Z. Zhao, V. Vydiswaran, LHS712Team-1 at eRisk@CLEF 2025: Searching for depression symptoms using various natural language processing algorithms, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[33] D. A. Nunes, E. Ribeiro, INESC-ID @ eRisk 2025: Exploring fine-tuned, similarity-based, and prompt-based approaches to depression symptom identification, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[34] T.-P. Mai, M.-H. L. H., D.-L. Tran, D.-C. Can, H.-Q. Le, UET@eRisk2025: Severity estimation for depression symptoms searching and early risk detection, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[35] N. Munz, E. Aharon, A. Segal, K. Gal, Semantic retrieval of BDI symptoms in user writings, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[36] J. C. Molina, P. M. Fernandez, HULAT-UC3M at task 1@eRisk 2025: Detecting depression using machine learning approaches, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[37] D. E. Losada, F. Crestani, A test collection for research on depression and language use, in: Proceedings Conference and Labs of the Evaluation Forum CLEF 2016, Evora, Portugal, 2016.

[38] D. Otero, J. Parapar, Á. Barreiro, Beaver: Efficiently building test collections for novel tasks, in: Proceedings of the First Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2020), Samatan, Gers, France, July 6-9, 2020, 2020.

[39] D. Otero, J. Parapar, Á. Barreiro, The wisdom of the rankers: a cost-effective method for building pooled test collections without participant systems, in: SAC ’21: The 36th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, Republic of Korea, March 22-26, 2021, 2021, pp. 672–680.

[40] M. Trotzek, S. Koitka, C. Friedrich, Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences, IEEE Transactions on Knowledge and Data Engineering (2018).

[41] F. Sadeque, D. Xu, S. Bethard, Measuring the latency of depression detection in social media, in: WSDM, ACM, 2018, pp. 495–503.

[42] X. Larrayoz, A. Casillas, A. Pérez, Leveraging conversational context and semantic relabeling for early depression detection, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[43] Y. Zi, B. Wang, Y. Zhao, B. Qin, HIT-SCIR@eRisk2025: Exploring the potential of a learnable screening model and risk post buffer-based framework for contextualized early prediction of depression on social media, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[44] A. M. Mármol-Romero, M. García-Vega, M. Ángel García-Cumbreras, A. Montejo-Ráez, SINAI at eRisk@CLEF 2025: Transformer-based and conversational strategies for depression detection, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[45] D. Guecha, Y. Chiu, A. Miyaguchi, S. Gaur, DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[46] M. Saad, M. Abbas, A. U. Chaudhry, F. Alvi, A. Samad, Contextualized early detection of depression – hybrid and time-aware approaches: HU at eRisk task 2 2025, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[47] E. Kara, R. E. M. Peña, L. Raithel, FU-TU-DFKI@eRisk 2025: A linguistically informed but overdiagnosing approach to early depression detection, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[48] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2019 early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 10th International Conference of the CLEF Association, CLEF 2019, Lugano, Switzerland, September 9–12, 2019, Proceedings 10, Springer, 2019, pp. 340–357.

[1]

D. E.

Losada ,

Crestani , J. Parapar, eRisk 2017 : CLEF lab on early risk prediction on the internet: Experimental foundations , in: G. J. Jones , S.

Lawless , J.

Gonzalo , L.

Kelly , L.

Goeuriot , T.

Mandl , L.

Cappellato , N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction , Springer International Publishing, Cham, 2017 , pp. 346 - 360 .

[2]

D. E.

Losada ,

Crestani , J. Parapar, eRisk 2017 : CLEF Lab on Early Risk Prediction on the Internet: Experimental foundations , in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2017 , Dublin, Ireland, 2017 .

[3]

D. E.

Losada ,

Crestani ,

Parapar , Overview of eRisk: Early Risk Prediction on the Internet , in: P. Bellot,

Trabelsi ,

Mothe ,

Murtagh ,

J. Y.

Nie ,

Soulier , E. SanJuan, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction , Springer International Publishing, Cham, 2018 , pp. 343 - 361 .

[4]

D. E.

Losada ,

Crestani ,

Parapar , Overview of eRisk 2018: Early Risk Prediction on the Internet (extended lab overview) , in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2018 , Avignon, France, 2018 .

[5]

A. T.

Beck ,

C. H.

Ward ,

Mendelson ,

Mock ,

Erbaugh , An Inventory for Measuring Depression, JAMA Psychiatry 4 ( 1961 ) 561 - 571 .

[6]

D. E.

Losada ,

Crestani ,

Parapar , Overview of eRisk 2019: Early risk prediction on the Internet , in: F. Crestani,

Braschler ,

Savoy ,

Rauber ,

Müller ,

D. E.

Losada ,

G. Heinatz

Bürki ,

Cappellato , N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction , Springer International Publishing, 2019 , pp. 340 - 357 .

[7]

D. E.

Losada ,

Crestani ,

Parapar , Overview of eRisk at CLEF 2019: Early risk prediction on the Internet (extended overview) , in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2019 , Lugano, Switzerland, 2019 .

[8]

D. E.

Losada ,

Crestani ,

Parapar , Early detection of risks on the internet: An exploratory campaign , in: Advances in Information Retrieval - 41st European Conference on IR Research , ECIR 2019 , Cologne, Germany, April 14- 18 , 2019 , Proceedings, Part

, 2019 , pp. 259 - 266 .

[9]

D. E.

Losada ,

Crestani ,

Parapar , Overview of eRisk 2020: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality , Multimodality, and Interaction - 11th International Conference of the CLEF Association, CLEF 2020 , Thessaloniki, Greece, September 22-25 , 2020 , Proceedings, 2020 , pp. 272 - 287 .

[10]

D. E.

Losada ,

Crestani ,

Parapar , Overview of eRisk at CLEF 2020: Early risk prediction on the internet (extended overview) , in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum , Thessaloniki, Greece, September 22-25 , 2020 , 2020 .

[11]

D. E.

Losada ,

Crestani , J. Parapar, eRisk 2020 : Self-harm and depression challenges , in: Advances in Information Retrieval - 42nd European Conference on IR Research , ECIR 2020 , Lisbon, Portugal, April 14-17 , 2020 , Proceedings, Part

, 2020 , pp. 557 - 563 .

[12]

Parapar ,

Martín-Rodilla ,

D. E.

Losada ,

Crestani , Overview of eRisk 2021: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality , Multimodality, and Interaction - 12th International Conference of the CLEF Association, CLEF 2021 ,

Virtual

Event , September 21-24 , 2021 , Proceedings, 2021 , pp. 324 - 344 .

[13]

Parapar ,

Martín-Rodilla ,

D. E.

Losada ,

Crestani , Overview of eRisk at CLEF 2021: Early risk prediction on the internet (extended overview) , in: Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum , Bucharest, Romania, September 21st - to - 24th, 2021 , 2021 , pp. 864 - 887 .

[14]

Parapar ,

Martín-Rodilla ,

D. E.

Losada ,

Crestani , eRisk 2021 : Pathological gambling, self-harm and depression challenges , in: Advances in Information Retrieval - 43rd European Conference on IR Research , ECIR 2021 , Virtual

Event

, March 28 - April 1, 2021 , Proceedings, Part

, 2021 , pp. 650 - 656 .

[15]

Parapar ,

Martín-Rodilla ,

D. E.

Losada ,

Crestani , Overview of eRisk 2022: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality , Multimodality, and Interaction - 13th International Conference of the CLEF Association, CLEF 2022 , Bologna, Italy, September 5- 8 , 2022 , 2022 , p. 233 - 256 .

[16]

Parapar ,

Martín-Rodilla ,

D. E.

Losada ,

Crestani , Overview of eRisk at CLEF 2022: Early risk prediction on the internet (extended overview) , in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum , Bologna, Italy, September 5- 8 , 2022 , 2022 , pp. 821 - 850 .

[17]

Parapar ,

Martín-Rodilla ,

D. E.

Losada ,

Crestani , eRisk 2022 : Pathological gambling, depression, and eating disorder challenges , in: Advances in Information Retrieval - 44th European Conference on IR Research , ECIR 2022 , Stavanger, Norway, April 10-14 , 2022 , Proceedings, Part

, 2022 , pp. 436 - 442 .

[18]

Parapar ,

Martín-Rodilla ,

D. E.

Losada ,

Crestani , Overview of eRisk 2023: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality , Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023 , Thessaloniki, Greece, September 18-21 , 2023 , 2023 , p. 233 - 256 .

[19]

Parapar ,

Martín-Rodilla ,

D. E.

Losada ,

Crestani , Overview of eRisk at CLEF 2023: Early risk prediction on the internet (extended overview) , in: Proceedings of the Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum , Thessaloniki, Greece, September 18-21 , 2023 , 2023 .

[20]

Parapar ,

Martín-Rodilla ,

D. E.

Losada ,

Crestani , eRisk 2023 : Depression, pathological gambling, and eating disorder challenges , in: Advances in Information Retrieval - 45th European Conference on IR Research , ECIR 2023 , Dublin, Ireland, April 2- 6 , 2023 , Proceedings, Part

III

, 2023 , p. 585 - 592 .

[21] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, eRisk 2024: Depression, anorexia, and eating disorder challenges, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part, volume 14612 of Lecture Notes in Computer Science, Springer, 2024, pp. 474-481. URL: https://doi.org/10.1007/978-3-031-56069-9_65. doi: 10.1007/978-3-031-56069-9_65.

[22] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2024: Early risk prediction on the internet (extended overview), in: G. Faggioli, N. Ferro, P. Galuscáková, A. G. S. de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September, 2024, volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 759-781. URL: https://ceur-ws.org/Vol-3740/paper-72.pdf.

[23] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2024: Early risk prediction on the internet, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, G. M. D. Nunzio, L. Soulier, P. Galuscáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9-12, 2024, Proceedings, Part, volume 14959 of Lecture Notes in Computer Science, Springer, 2024, pp. 73-92. URL: https://doi.org/10.1007/978-3-031-71908-0_4. doi: 10.1007/978-3-031-71908-0_4.

[24] J. Parapar, A. Perez, X. Wang, F. Crestani, eRisk 2025: contextual and conversational approaches for depression challenges, in: European Conference on Information Retrieval, Springer, 2025, pp. 416-424.

[25] J. Parapar, A. Perez, X. Wang, F. Crestani, Overview of eRisk 2025: Early risk prediction on the internet (extended overview), in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2025), Madrid, Spain, 9-12 September, 2025, CEUR Workshop Proceedings, CEUR-WS.org, 2025.

[26] N. M. Son, D. V. Thin, Sonuit eRisk2025: Enhanced depression detection on social media via filtering and re-ranking, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[27] Adhikary, J. Das, D. Roy, Thinkir at eRisk 2025: Early detection and risk assessment of depression using transformer models, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[28] Varela, Oronoz, Casillas, Pérez, Detection of depression with symptom similarity: Data reduction and llm personas, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[29] Vachharajani, Transformer ensembles and llm-powered approaches for depression symptom analysis and contextualized early risk detection, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[30] A. C. Segarra, V. A. Esteve, A. M. Marco, L.-F. H. Oliver, Elirf-upv at eRisk 2025: New approaches to the detection and early detection of symptoms and signs of depression, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.

[31] L. F. M. Cardona, J. M. S. Loaiza, E. A. P. D. Castillo, J. C. M. Santos, J. E. S. Castañeda, Cotecmar-utb at eRisk 2025: Semantic-centroid symptom ranking and early depression detection using adaptive decision rule, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum,