<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of eRisk at CLEF 2025: Early Risk Prediction on the Internet (Extended Overview)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Javier Parapar</string-name>
          <email>javier.parapar@udc.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anxo Perez</string-name>
          <email>anxo.pvila@udc.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xi Wang</string-name>
          <email>xi.wang@sheffield.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Crestani</string-name>
          <email>fabio.crestani@usi.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics, Università della Svizzera italiana (USI). Campus EST</institution>
          ,
          <addr-line>Via alla Santa 1, 6900 Viganello</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Retrieval Lab, Centro de Investigación en Tecnoloxías da Información e as Comunicacións (CITIC), Universidade da Coruña.</institution>
          <addr-line>Campus de Elviña s/n C.P 15071 A Coruña</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Sheffield</institution>
          ,
          <addr-line>Sheffield, England</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents an extended overview of eRisk 2025, the ninth edition of the CLEF lab on early risk detection. Since its beginnings, eRisk has served as a benchmark for assessing methodologies, evaluation metrics, and challenges in the early identification of personal risks, particularly within health and safety domains. The 2025 edition marks an important evolution, broadening the lab's scope toward problems that require richer contextual and conversational understanding. The first task, the only one preserved from last year, asks systems to rank sentences by their relevance to the BDI-II depression symptoms, enabling fine-grained retrieval of depressive cues. The second task reformulates early detection as a contextual decision problem. In this task, the full conversational thread, including the user's posts and all interactions from the other participants, is revealed incrementally. At each step, the models must decide whether sufficient evidence exists to predict depression for the user, thereby rewarding both accuracy and timeliness. Finally, the pilot task pioneers an interactive scenario: participants converse with fine-tuned large language models and must infer depressive signals from the evolving conversations, probing the feasibility and safety of conversational screening agents. Together, these three tasks continue to advance the field of early risk detection, open new research avenues, and align the evaluation framework more closely with real-world conversational settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Early risk detection</kwd>
        <kwd>Depression</kwd>
        <kwd>Conversational analysis</kwd>
        <kwd>LLMs</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>eRisk</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The eRisk lab was designed as a benchmark environment for building resources, defining evaluation protocols,
and developing approaches that enable the timely detection of different personal risk situations. Early
alert technologies are becoming indispensable across healthcare-oriented domains. The rapid recognition
of warning signs, whether for emergent mental health crises, predatory behaviour, or violent threats,
can turn marginal time gains into life-saving interventions.
eRisk focuses on psychological and mental health risks such as depression, self-harm, pathological
gambling, and eating disorders, where language provides subtle yet informative signals. However, the
intricate relationship between linguistic expression and mental state continues to challenge automatic
methods and screeners, underscoring the need for increasingly robust, context-aware models and
high-quality annotated public datasets.</p>
      <p>
        The inaugural eRisk 2017 edition introduced the pilot task on early detection of depression, establishing
the sequential evidence evaluation framework that is still present in the lab today [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. In 2018 the scope
broadened to include anorexia, creating a dual-task campaign that demonstrated the generalization of
the proposals across related mental-health disorders [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The 2019 programme consolidated the anorexia
work, introduced a self-harm prediction task, and, for the first time, asked systems to infer answers to
a depression severity questionnaire (BDI-II [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) purely from social media activity [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ].
The 2019 inclusion of the BDI-II moved the lab toward symptom-level modelling, encouraging
participants to move beyond binary depressed/not-depressed labels and design methods that capture the
nuances of individual depressive symptoms. eRisk 2020 deepened the self-harm task and added a
refined depression severity estimation challenge, further emphasising continuous severity scales over
binary outcomes [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ]. The 2021 edition first introduced behavioural addictions; it pioneered
an early pathological gambling task while revisiting self-harm detection and severity estimation for
depression [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ].
      </p>
      <p>
        In 2022, we returned to gambling and depression and introduced a new challenge centered on estimating
the severity of eating-disorder activities [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]. The 2023 campaign shifted emphasis to fine-grained
symptom prediction, presenting a sentence ranking task that maps individual user sentences to the 21
BDI-II depression symptoms, and retained tasks on gambling risk and eating-disorder severity [
        <xref ref-type="bibr" rid="ref18 ref19 ref20">18, 19, 20</xref>
        ].
Finally, eRisk 2024 consolidated the BDI-II sentence ranking benchmark, maintained the anorexia
early detection task, and updated the eating-disorder severity task, setting the stage for the more
conversational focus adopted in 2025 [
        <xref ref-type="bibr" rid="ref21 ref22 ref23">21, 22, 23</xref>
        ].
      </p>
      <p>
        The current edition, eRisk 2025 [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ], extends this trajectory by introducing, for the first time, tasks
that demand not only early recognition of risk but also deeper contextual reasoning and, in the pilot
task, true conversational interactions. Full task specifications appear in the next sections, yet the
broad shift is clear: systems must now interpret entire discussion threads and interactions, bringing the
evaluation environment closer to real-world online settings. This year, 128 different
teams registered for the eRisk lab. We ultimately received results from 25 distinct teams: 67 runs for Task 1, 50 runs
for Task 2, and 11 runs for the pilot task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Search for Symptoms of Depression</title>
      <p>
        This task continues from eRisk 2023’s and 2024’s Task 1, which involved ranking sentences from user
writings based on their relevance to specific depression symptoms. This is the last year of the task.
Again, participants were required to order sentences according to their relevance to the 21 standardized
symptoms listed in the BDI-II questionnaire [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A sentence was deemed relevant if it reflected the
user’s condition related to a symptom, including positive statements (e.g., “I feel quite happy lately”
is relevant for the symptom “Sadness”). As in 2024, the test collection provides not only the target
sentence but also its immediate predecessor and successor to give more context.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Task 1: Dataset and Assessment Process</title>
        <p>The dataset provided was in TREC format, tagged with sentences derived from Reddit historical data.
Table 1 presents some statistics of the corpus. Given the corpus of sentences and the description of
the symptoms from the BDI-II questionnaire, the participants were free to decide on the best strategy
to derive queries for representing the BDI-II symptoms. Each participating team submitted up to 5
variants (runs). Each run included 21 TREC-style formatted rankings of sentences, as shown in Figure 1.
For each symptom, the participants should submit up to 1000 results sorted by estimated relevance. We
received 67 runs from 17 participating teams (see Table 2).</p>
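        <p>To make the submission format concrete, the following minimal sketch writes the ranking for one symptom as TREC-style run lines. The exact layout used by the lab is the one shown in Figure 1; here we assume the common six-column TREC layout (query id, the literal Q0, item id, rank, score, run tag). The function name, the sentence identifiers, and the run tag are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import Iterable


def format_trec_run(symptom_id: int,
                    scored: Iterable[tuple[str, float]],
                    run_tag: str,
                    max_results: int = 1000) -> list[str]:
    """Return TREC-style run lines for one symptom, best score first.

    Assumed column layout: symptom_id Q0 sentence_id rank score run_tag.
    """
    # Sort by estimated relevance and keep at most `max_results` sentences.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)[:max_results]
    return [
        f"{symptom_id} Q0 {sentence_id} {rank} {score:.4f} {run_tag}"
        for rank, (sentence_id, score) in enumerate(ranked, start=1)
    ]


# Hypothetical sentence ids and scores for symptom 1.
lines = format_trec_run(1, [("s_12_3", 0.41), ("s_7_0", 0.93)], "demoRun")
```
]]></preformat>
        <p>A full run under this assumption would concatenate the output of 21 such calls, one per BDI-II symptom.</p>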
        <p>Table 1. Corpus statistics: number of users, number of sentences, and average number of words per sentence.</p>
        <p>Relevance labels were produced through a stratified, two-stage pooling procedure. First, for every
BDI-II symptom we implemented top-k pooling, collecting the top five sentences returned by each
submitted run (k = 5), forming an initial pool that served to rank systems provisionally. We then
selected the twenty highest-ranked runs and performed a second pooling step that extended the cut-off
to the top fifty sentences (k = 50). Unlike the 2023 setup, assessors were shown the target sentence
together with its immediate context (the preceding and following sentences), a change designed to
reduce annotation ambiguity.</p>
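        <p>The two pooling stages can be sketched as follows. This is an illustration under stated assumptions rather than the organisers' actual tooling: runs are assumed to be dictionaries mapping each symptom to a ranked list of sentence identifiers, and provisional_score is a hypothetical callable that returns the provisional effectiveness of a run once the first-stage pool has been judged.</p>
        <preformat><![CDATA[
```python
def pool_top_k(runs, k):
    """Union of the top-k sentences of every run, per symptom.

    runs: {run_name: {symptom_id: [sentence_id, ...]}} with ranked lists.
    """
    pool = {}
    for rankings in runs.values():
        for symptom, ranked in rankings.items():
            pool.setdefault(symptom, set()).update(ranked[:k])
    return pool


def two_stage_pool(runs, provisional_score, first_k=5, n_best=20, second_k=50):
    """Stage 1: top-`first_k` of all runs. Stage 2: top-`second_k` of the best runs."""
    first = pool_top_k(runs, first_k)
    # `provisional_score` is assumed to come from judging the first-stage pool.
    best = sorted(runs, key=provisional_score, reverse=True)[:n_best]
    second = pool_top_k({name: runs[name] for name in best}, second_k)
    return {s: first.get(s, set()) | second.get(s, set())
            for s in set(first) | set(second)}


# Toy example with two runs and one symptom (ids are made up).
toy_runs = {"a": {1: ["s1", "s2", "s3"]}, "b": {1: ["s4", "s5", "s6"]}}
toy_scores = {"a": 0.9, "b": 0.1}
toy_pool = two_stage_pool(toy_runs, toy_scores.get, first_k=1, n_best=1, second_k=2)
```
]]></preformat>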
        <p>Three annotators worked independently: one with professional training in psychology, and two
computer-science researchers specialising in early risk technologies. Before judging, the
organisers held a session to walk through an initial guideline draft, resolve doubts, and agree on how to handle different
cases. The consolidated guidelines, publicly available (footnote 1), define a sentence as relevant only when it both
addresses the symptom and conveys explicit information about the user’s state. This dual concept of
relevance (on-topic and reflective of the user’s state with respect to the symptom) introduced a higher
level of complexity compared to more standard relevance assessments. Each pooled sentence received
three independent judgements, and we provide two ground-truth sets (qrels):
• Majority-based qrels: a sentence was deemed relevant if at least two of the three assessors
marked it so.
• Unanimity-based qrels: a sentence was deemed relevant only when all three assessors agreed.</p>
        <p>The final pool sizes and qrels for each symptom are reported in Table 3. Providing both qrels enables
analyses with different agreement thresholds, continuing the dual-qrel strategy introduced in earlier
eRisk campaigns.</p>
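        <p>As a small illustration of the two agreement thresholds, the sketch below derives both qrel sets from three binary judgements per pooled sentence. The input layout (a dictionary keyed by symptom and sentence identifier) and the function name are assumptions made for this example.</p>
        <preformat><![CDATA[
```python
def build_qrels(judgements):
    """Derive majority-based and unanimity-based qrels.

    judgements: {(symptom_id, sentence_id): (vote1, vote2, vote3)} with 0/1 votes.
    Returns two dicts with the same keys and 0/1 relevance values.
    """
    majority, unanimity = {}, {}
    for key, votes in judgements.items():
        positive = sum(votes)
        majority[key] = int(positive >= 2)            # at least two of three assessors
        unanimity[key] = int(positive == len(votes))  # all three assessors agree
    return majority, unanimity


# Hypothetical judgements for symptom 1.
toy_judgements = {
    (1, "s1"): (1, 1, 1),
    (1, "s2"): (1, 1, 0),
    (1, "s3"): (0, 1, 0),
}
majority_qrels, unanimity_qrels = build_qrels(toy_judgements)
```
]]></preformat>
        <p>By construction, every sentence that is relevant under the unanimity-based qrels is also relevant under the majority-based qrels.</p>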
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task 1: Results</title>
        <p>The performance results for the participating systems are shown in Table 4 (majority-based qrels)
and Table 5 (unanimity-based qrels). The tables report several standard performance metrics: mean
Average Precision (AP), mean R-Precision, mean Precision at 10, and mean NDCG at 1000. Remarkably, the
runs unanimity and max from team INESC-ID achieved the top-ranking performance for nearly all
metrics and relevance judgement types. The teams UET-Psyche-Warriors, SonUIT, BGU-Data-Science,
and PJs-Team obtained results close to the best. Their effective results demonstrate strong
competence in this task. Taken together, the results confirm that sentence-level symptom retrieval
remains a challenging task.</p>
        <p>Footnote 1: https://erisk.irlab.org/guidelines_erisk24_task1.html</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task 2: Contextualized Early Detection of Depression (New Task)</title>
      <p>This new task in 2025 introduces a different scenario in depression detection by incorporating full
conversational contexts. Whereas earlier eRisk editions always released isolated posts authored by
a single user, the 2025 task provided the entire Reddit discussion thread in which the target user
intervened. Consequently, in the test dataset, systems had access not only to the messages produced by
the target user but also to every other contribution in the thread and to the interaction structure that
links the messages (e.g., the different replies to each comment).</p>
      <p>This design is motivated by the observation that the clinical relevance of a message often becomes more
evident when interpreted alongside the surrounding conversation. Thus, a user’s response may only
gain relevance when viewed in conjunction with the preceding or subsequent interactions from other
participants. For instance, a seemingly neutral sentence may reveal hopelessness if it answers a direct
plea for support. For this reason, the task is designed to simulate real-world scenarios where depression
detection may rely on analyzing exchanges between multiple participants. This setup presents unique
challenges, as systems must consider not only the textual content of individual posts but also the
interplay between participants and how this context influences the detection of depressive symptoms.</p>
      <p>The test collection utilised for this task followed the same format as the collection described in the
work by Losada and Crestani [37]. The collection contains writings, including posts and comments,
obtained from a selected group of social media users. To construct the ground truth assessments, we
adopted established approaches that aim to optimise the utilisation of assessors’ time, as documented
in previous studies [38, 39]. These methods employ simulated pooling strategies, enabling the effective
creation of test collections. The main statistics of the test collection used for Task 2 are presented in
Table 6.</p>
      <p>Within this dataset, users are categorised into two groups: depression and control. For each user, the
collection contains, in chronological order, the sequence of writings and threads in which the user participated.
To facilitate the task and ensure uniform distribution, we established a dedicated server that
systematically provided user writings to the participating teams. Further details regarding the server’s
setup and functioning are available at the lab’s official website (footnote 2).</p>
      <p>The task was divided into two phases:
• During the training phase, participants worked with a static dataset consisting of isolated user
writings from depressed and control users, without any conversational context. This training
dataset came from the early depression detection tasks of prior eRisk editions.
• The test phase, in contrast, was carried out interactively. For each target user, the server released
a sequence of discussion threads in real time. Each thread constituted a submission round. At any
round within the chronology of user writings, participants had the freedom to stop the process
and issue an alert. After reading each user thread, teams were required to decide between two
options: i) alerting about the target user, indicating a predicted sign of depression, or ii) not
alerting about the target user. Participants independently made this choice for each user in the
test split. It is important to note that once an alert was issued, it was considered final, and no
further decisions regarding that particular user were taken into account. Conversely, the absence
of alerts was considered non-final, allowing participants to subsequently submit an alert if they
detected signs of risk emerging.</p>
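      <p>The decision protocol of the test phase can be illustrated with a small offline simulation. This is a sketch under stated assumptions, not the lab's server or client code: threads are plain strings, decide is a hypothetical callback standing in for a participant system, and the only rules modelled are the ones described above (one round per thread, an alert is final, the absence of an alert can be revised in later rounds).</p>
      <preformat><![CDATA[
```python
def simulate_test_phase(user_threads, decide):
    """Simulate round-by-round decisions with final alerts.

    user_threads: {user_id: [thread, ...]} in chronological order.
    decide(user_id, history) -> bool, where history holds the threads seen so far.
    Returns {user_id: round_number_of_alert} for alerted users only.
    """
    alerts = {}
    n_rounds = max(len(threads) for threads in user_threads.values())
    for round_no in range(1, n_rounds + 1):
        for user, threads in user_threads.items():
            # An alert is final: alerted users are no longer evaluated.
            if user in alerts or round_no > len(threads):
                continue
            if decide(user, threads[:round_no]):
                alerts[user] = round_no
    return alerts


# Toy data and a deliberately naive keyword rule (both hypothetical).
toy_threads = {
    "u1": ["fine today", "I feel hopeless", "ok"],
    "u2": ["hello", "bye"],
}
alerts = simulate_test_phase(toy_threads, lambda user, history: "hopeless" in history[-1])
```
]]></preformat>
      <p>Users that never appear in the returned dictionary correspond to the non-alert decision at the end of the process.</p>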
      <p>To evaluate the systems’ performance, we employed two indicators: the accuracy of the decisions made
and the number of user writings required to reach those decisions. These criteria provide valuable
insights into the effectiveness and efficiency of the systems under evaluation. To support the test stage,
we deployed a REST service. The server iteratively distributed user writings and waited for responses
from participants. Importantly, new user data was not provided to a specific participant until the
service received a decision from that particular team. The submission period for the task was open
from February 5th, 2025 until April 12th, 2025.</p>
      <p>Footnote 2: https://erisk.irlab.org/eRisk25Servert2Details.html</p>
      <sec id="sec-3-1">
        <title>3.1. Task 2: Evaluation Metrics</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. Decision-based Evaluation</title>
          <p>This evaluation approach uses the binary decisions made by the participating systems for each user.
In addition to standard classification measures such as Precision, Recall, and F1 score (computed with
respect to the positive class), we also calculate ERDE (Early Risk Detection Error), used in previous
editions of the lab. A detailed description of ERDE was presented by Losada and Crestani in [37]. ERDE
is an error measure that incorporates a penalty for delayed correct alerts (true positives). The penalty
increases with the delay in issuing the alert, measured by the number of user posts processed before
making the alert.</p>
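          <p>For readers who want a concrete reference point, the sketch below computes ERDE following the definition usually attributed to Losada and Crestani [37], which this overview does not restate: a false positive costs c_fp, a false negative costs c_fn, a true negative costs nothing, and a true positive issued after k writings costs c_tp multiplied by the latency cost 1 − 1/(1 + e^(k − o)). The default cost values (c_fn = c_tp = 1 and c_fp equal to the proportion of positive users) are assumptions of this example, as are the function and argument names.</p>
          <preformat><![CDATA[
```python
import math


def erde(decisions, delays, labels, o, c_fp=None, c_fn=1.0, c_tp=1.0):
    """Mean Early Risk Detection Error over users (lower is better).

    decisions, labels: 0/1 per user; delays: writings seen before the decision.
    o: the deadline parameter (for example 5 or 50).
    """
    if c_fp is None:
        # Assumed default: proportion of positive users in the collection.
        c_fp = sum(labels) / len(labels)
    errors = []
    for d, k, g in zip(decisions, delays, labels):
        if d == 1 and g == 0:        # false positive
            errors.append(c_fp)
        elif d == 0 and g == 1:      # false negative
            errors.append(c_fn)
        elif d == 1 and g == 1:      # true positive, penalised by its delay
            exponent = min(k - o, 700)  # guard against overflow in exp
            errors.append((1.0 - 1.0 / (1.0 + math.exp(exponent))) * c_tp)
        else:                        # true negative
            errors.append(0.0)
    return sum(errors) / len(errors)


# Toy example: one TP at k = 5, one FP, one FN, one TN.
toy_erde_5 = erde([1, 1, 0, 0], [5, 3, 10, 10], [1, 0, 1, 0], o=5)
```
]]></preformat>
          <p>In the toy example the true positive is issued exactly at k = o, so its latency cost is 0.5; alerts issued well after o quickly approach the full cost c_tp.</p>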
          <p>Since 2019, we have complemented the evaluation report with additional decision-based metrics that try to
capture additional aspects of the problem. These metrics try to overcome some limitations of ERDE,
namely:
• the penalty associated with true positives goes quickly to 1. This is due to the functional form of
the cost function (sigmoid).
• a perfect system, which detects the true positive case right after the first round of messages (first
chunk), does not get an error equal to 0.
• with a method based on releasing data in a chunk-based way (as was done in 2017 and 2018)
the contribution of each user to the performance evaluation has a large variance (different for
users with few writings per chunk vs users with many writings per chunk).
• ERDE is not interpretable.</p>
          <p>Some research teams have analysed these issues and proposed alternative ways for evaluation. Trotzek
and colleagues [40] proposed ERDE<sub>o</sub><sup>%</sup>. This is a variant of ERDE that does not depend on the number
of user writings seen before the alert but, instead, on the percentage of user writings seen
before the alert. In this way, users’ contributions to the evaluation are normalized (all users
weigh the same). However, there is an important limitation of ERDE<sub>o</sub><sup>%</sup>. In real-life applications, the
overall number of user writings is not known in advance. Social Media users post content online and
screening tools have to make predictions with the evidence seen. In practice, you do not know when
(and if) a user’s thread of messages is exhausted. Thus, the performance metric should not depend on
knowledge about the total number of user writings.</p>
          <p>Another alternative evaluation metric for early risk prediction was proposed by Sadeque and
colleagues [41]. They proposed F<sub>latency</sub>, which fits better with our purposes. This measure is described
next.</p>
          <p>Imagine a user u ∈ U and an early risk detection system that iteratively analyzes u’s writings (e.g. in
chronological order, as they appear in Social Media) and, after analyzing k<sub>u</sub> user writings (k<sub>u</sub> ≥ 1),
takes a binary decision d<sub>u</sub> ∈ {0, 1}, which represents the decision of the system about the user being a
risk case. By g<sub>u</sub> ∈ {0, 1}, we refer to the user’s golden truth label. A key component of an early risk
evaluation should be the delay in detecting true positives (we do not want systems to detect these cases
too late). Therefore, a first and intuitive measure of delay can be defined as follows (see footnote 3):
<disp-formula><label>(1)</label><tex-math>\mathrm{latency}_{TP} = \mathrm{median}\{k_u : u \in U,\ d_u = g_u = 1\}</tex-math></disp-formula>
This measure of latency is calculated over the true positives detected by the system and assesses the
system’s delay based on the median number of writings that the system had to process to detect such
positive cases. This measure can be included in the experimental report together with standard measures
such as Precision (P), Recall (R) and the F-measure (F):
<disp-formula><label>(2)</label><tex-math>P = \frac{|\{u \in U : d_u = g_u = 1\}|}{|\{u \in U : d_u = 1\}|}</tex-math></disp-formula>
<disp-formula><label>(3)</label><tex-math>R = \frac{|\{u \in U : d_u = g_u = 1\}|}{|\{u \in U : g_u = 1\}|}</tex-math></disp-formula>
<disp-formula><label>(4)</label><tex-math>F = \frac{2 \cdot P \cdot R}{P + R}</tex-math></disp-formula></p>
          <p>Footnote 3: Observe that Sadeque et al. (see [41], pg 497) computed the latency for all users such that g<sub>u</sub> = 1. We argue that latency
should be computed only for the true positives. The false negatives (g<sub>u</sub> = 1, d<sub>u</sub> = 0) are not detected by the system and,
therefore, they would not generate an alert.</p>
          <p>Furthermore, Sadeque et al. proposed a measure, F<sub>latency</sub>, which combines the effectiveness of the
decision (estimated with the F measure) and the delay (see footnote 4) in the decision. This is calculated by multiplying
F by a penalty factor based on the median delay. More specifically, each individual (true positive)
decision, taken after reading k<sub>u</sub> writings, is assigned the following penalty:
<disp-formula><label>(5)</label><tex-math>\mathrm{penalty}(k_u) = -1 + \frac{2}{1 + \exp\left(-p \cdot (k_u - 1)\right)}</tex-math></disp-formula>
where p is a parameter that determines how quickly the penalty should increase. In [41], p was set such
that the penalty equals 0.5 at the median number of posts of a user (see footnote 5). Observe that a decision right after
the first writing has no penalty (i.e. penalty(1) = 0). Figure 2 plots how the latency penalty increases
with the number of observed writings.</p>
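          <p>A compact sketch of these latency-aware measures is given below. It follows the definitions in the text: latency and penalty are computed over true positives only, the speed factor is one minus the median penalty, and the latency-weighted F score multiplies F by that speed factor. The function name, the returned dictionary keys, and the handling of the no-true-positive case are choices made for this example; the default p = 0.0078 is the value reported for the evaluation.</p>
          <preformat><![CDATA[
```python
import math
from statistics import median


def latency_weighted_f(decisions, delays, labels, p=0.0078):
    """Compute latency_TP, speed and F_latency from per-user decisions.

    decisions, labels: 0/1 per user; delays: writings read before the decision.
    """
    tp_delays = [k for d, k, g in zip(decisions, delays, labels) if d == 1 and g == 1]
    if not tp_delays:
        # No true positives: F is 0, so the weighted score is 0 as well.
        return {"latency_tp": None, "speed": None, "f": 0.0, "f_latency": 0.0}
    precision = len(tp_delays) / sum(decisions)
    recall = len(tp_delays) / sum(labels)
    f = 2 * precision * recall / (precision + recall)
    # penalty(k) = -1 + 2 / (1 + exp(-p * (k - 1))), which is 0 at k = 1.
    penalties = [-1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1))) for k in tp_delays]
    speed = 1.0 - median(penalties)
    return {"latency_tp": median(tp_delays), "speed": speed,
            "f": f, "f_latency": f * speed}


# Toy example: one TP detected at the first writing, one FP, one FN.
toy_result = latency_weighted_f([1, 1, 0], [1, 1, 4], [1, 0, 1])
```
]]></preformat>
          <p>In the toy example the only true positive is detected at the first writing, so the speed factor is 1 and the latency-weighted score equals the plain F score.</p>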
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>The system’s overall speed factor is computed as:</title>
        <p><disp-formula><label>(6)</label><tex-math>\mathrm{speed} = 1 - \mathrm{median}\{\mathrm{penalty}(k_u) : u \in U,\ d_u = g_u = 1\}</tex-math></disp-formula>
where speed equals 1 for a system whose true positives are detected right at the first writing. A slow
system, which detects true positives after hundreds of writings, will be assigned a speed score near 0.
Finally, the latency-weighted F score is simply:
<disp-formula><label>(7)</label><tex-math>F_{\mathrm{latency}} = F \cdot \mathrm{speed}</tex-math></disp-formula></p>
        <p>Since 2019, users’ data have been processed by the participants on a post-by-post basis (i.e. we avoided
a chunk-based release of data). Under these conditions, the evaluation approach has the following
properties:
• smooth growth of penalties;
• a perfect system gets F<sub>latency</sub> = 1;
• for each user u the system can opt to stop at any point k<sub>u</sub> and, therefore, we no longer have the
effect of an imbalanced importance of users;
• F<sub>latency</sub> is more interpretable than ERDE.</p>
        <p>Footnote 4: Again, we adopt Sadeque et al.’s proposal but we estimate latency only over the true positives.</p>
        <p>Footnote 5: In the evaluation we set p to 0.0078, a setting obtained from the eRisk 2017 collection.</p>
        <p>3.1.2. Ranking-based Evaluation</p>
        <p>In addition to the evaluation discussed above, we employed an alternative form of evaluation to further
assess the systems. After each data release (a new user writing, that is, a post or comment), participants
were required to provide the following information for each user in the collection:
• A decision for the user (alert or no alert), which was used to calculate the decision-based metrics
discussed previously.
• A score representing the user’s level of risk, estimated based on the evidence observed thus far.</p>
        <p>The scores were used to create a ranking of users in descending order of estimated risk. For each
participating system, a ranking was generated at each data release point, simulating a continuous
re-ranking approach based on the observed evidence. In a real-life scenario, this ranking would be
presented to an expert user who could make decisions based on the rankings (e.g., by inspecting the
top of the rankings). Each ranking can be evaluated using standard ranking metrics such as P@10 or
NDCG. Therefore, we report the performance of the systems based on the rankings after observing
different numbers of writings.</p>
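        <p>The ranking-based evaluation can be sketched as follows for a single data release point. The example assumes binary gold labels (1 for at-risk users, 0 for controls), risk scores stored in a dictionary keyed by user identifier, and the common log2-discounted formulation of NDCG; the exact NDCG variant used in the official evaluation is not specified here, so this is an illustration only.</p>
        <preformat><![CDATA[
```python
import math


def precision_at_k(scores, labels, k=10):
    """Fraction of at-risk users among the k users with the highest risk score."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(labels[user] for user in ranked) / k


def ndcg_at_k(scores, labels, k=10):
    """NDCG@k with binary gains and a log2 rank discount (assumed variant)."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    dcg = sum(labels[user] / math.log2(rank + 1)
              for rank, user in enumerate(ranked, start=1))
    ideal = sorted(labels.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


# Hypothetical scores submitted after one data release, with gold labels.
toy_scores = {"u1": 0.9, "u2": 0.8, "u3": 0.1}
toy_labels = {"u1": 1, "u2": 0, "u3": 1}
toy_p2 = precision_at_k(toy_scores, toy_labels, k=2)
toy_ndcg2 = ndcg_at_k(toy_scores, toy_labels, k=2)
```
]]></preformat>
        <p>Repeating the computation with the scores submitted after 1, 100, 500, or any other number of writings gives one value per release point, which is how the continuous re-ranking behaviour of a system can be tracked.</p>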
        <sec id="sec-3-2-1">
          <title>3.2. Task 2: Participant Teams</title>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.3. Task 2: Results</title>
          <p>Table 8 shows the decision-based results of Task 2, and Table 9 shows the ranking-based results. In the
decision setting, HIT-SCIR dominates: its best run attains the highest F1 (0.85) while keeping both
ERDE<sub>5</sub> and ERDE<sub>50</sub> at or very near the minimum error values. That performance is achieved with
a median latency of only eight writings, illustrating a good balance between earliness and accuracy.
ELiRF-UPV follows at a short distance, with a top F1 of 0.79 but slightly worse error-aware metrics.</p>
          <p>The ranking-based evaluation shows a complementary picture. HIT-SCIR again exhibits near-perfect
precision at every cut-off and sustains the highest NDCG values as additional writings become available,
confirming the robustness of its retrieval component. Lotu-Ixa excels in the one-writing scenario,
matching HIT-SCIR for P@10 and NDCG@10. However, its advantage diminishes once longer histories
are considered, suggesting that its decision policy strongly weights the earliest cues.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Pilot Task: Conversational Depression Detection via LLMs</title>
      <p>We introduced this pilot task in 2025 as a novel challenge to explore the use of
conversational agents in detecting depression symptoms. Participants interacted with LLM-based
personas that had been instructed using user writings, simulating real-world conversational exchanges
and example user profiles. Twelve distinct personas were instantiated with ChatGPT.</p>
      <p>The challenge asks participants to determine whether the LLM persona exhibits signs of
depression and, if so, to estimate the level of depression severity and the key depression symptoms expressed
over the conversation. The diagnostic target for the LLMs was framed in terms of the BDI-II, as in Task 1.
The BDI-II is a 21-item self-report questionnaire widely used in clinical psychology; its items are listed in
Table 10.</p>
      <p>Each item corresponds to a concrete symptom, for example Sadness, Loss of Energy, or Indecisiveness.
Each symptom is scored from 0 to 3 according to severity. Table 11 shows the possible response options (0–3)
for the symptoms Sadness and Self-Dislike. The sum of all 21 item scores yields a global index in the
range 0–63. The scores are grouped into four categories: 0–9 is interpreted as minimal depression,
10–18 as mild, 19–29 as moderate, and 30 or above as severe. Because the personas are simulations, no
ground-truth questionnaire exists; instead, a group of three clinicians examined the seed user data that
shaped each persona and agreed on both an overall BDI-II score and the subset of symptoms included.
These consensual judgments constitute the gold standard.</p>
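      <p>The scoring scheme just described is simple enough to state as code. The sketch below sums 21 item scores (each 0 to 3) into the global index and maps that index to the four severity categories using the cut-offs given above; the function names are ours and only serve this illustration.</p>
      <preformat><![CDATA[
```python
def bdi_total(item_scores):
    """Sum the 21 BDI-II item scores (each 0-3) into a global index in 0-63."""
    if len(item_scores) != 21:
        raise ValueError("the BDI-II has exactly 21 items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item is scored 0, 1, 2 or 3")
    return sum(item_scores)


def bdi_category(total):
    """Map a global BDI-II index to the four categories used in this task."""
    if not 0 <= total <= 63:
        raise ValueError("the global index must lie in 0-63")
    if total <= 9:
        return "minimal"
    if total <= 18:
        return "mild"
    if total <= 29:
        return "moderate"
    return "severe"


# A made-up answer sheet: seven items scored 1, seven scored 2, the rest 0.
toy_total = bdi_total([1] * 7 + [2] * 7 + [0] * 7)
toy_category = bdi_category(toy_total)
```
]]></preformat>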
      <p>Participants did not receive any labelled training material. We deliberately framed the task as
training-free to encourage a variety of methodological responses, ranging from rule-based interviewers and
zero-shot LLM prompts to different classifiers trained on public mental-health corpora. During the
test window, teams used the links we provided to reach each LLM persona through the ChatGPT interface and hold the
dialogue with it. A participating system sent a free-form prompt, the server
produced the next turn, and so on. This loop continued until the system chose to terminate the dialogue
and submit its diagnosis. Since this is a pilot task, there was no hard cap on the number of turns, but
we encouraged the participants to produce their decisions as early as possible.</p>
      <p>After ending the conversation with a persona, a participating system had to return two files. The first
was a structured log that preserves, in chronological order, every prompt–response pair exchanged with
the agent; this file serves auditing and qualitative analysis. The second was a JSON record containing
three fields: the predicted BDI-II score (an integer in 0–63), the corresponding severity category, and up to
four symptoms, drawn from the BDI-II list in Table 10, that best explained the score.</p>
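      <p>As an illustration of the second file, the sketch below assembles a record with the three fields and serialises it as JSON. The key names (bdi_score, severity, key_symptoms) are hypothetical, since the official schema is not reproduced in this overview; the severity cut-offs are the ones stated for this task.</p>
      <preformat><![CDATA[
```python
import json


def build_prediction(score, symptoms):
    """Build the three-field prediction record (key names are assumptions)."""
    if not isinstance(score, int) or not 0 <= score <= 63:
        raise ValueError("the BDI-II score must be an integer in 0-63")
    if len(symptoms) > 4:
        raise ValueError("at most four key symptoms may be reported")
    if score <= 9:
        severity = "minimal"
    elif score <= 18:
        severity = "mild"
    elif score <= 29:
        severity = "moderate"
    else:
        severity = "severe"
    return {"bdi_score": score, "severity": severity, "key_symptoms": list(symptoms)}


# Example prediction for one persona.
record = build_prediction(21, ["Sadness", "Loss of Energy"])
payload = json.dumps(record, indent=2)
```
]]></preformat>
      <p>Deriving the category from the score, rather than predicting it separately, keeps the two fields consistent by construction.</p>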
      <sec id="sec-4-1">
        <title>4.1. Pilot Task: LLM Personas Design and Construction</title>
        <p>We adopted a clinician-in-the-loop design workflow to build the twelve LLM personas. A team of
three clinical psychologists co-designed a template that captures both general biographical detail and
clinically relevant information. Using this template we instantiated a pool of draft personas with GPT-4o, each
conditioned on a different user history.</p>
        <p>The same clinicians then conducted free-form interviews with every draft, rating each dialogue along
two main dimensions:
• The overall dimension covered traits associated with conversational attributes: human-likeness,
lexical fluency, coherence, and affective naturalness.
• The diagnostic dimension targeted domain realism, including emotional consistency, fidelity to
depressive symptomatology, willingness to elaborate, and cognitive style (rumination, processing
speed, abstraction level).</p>
        <p>Feedback was recorded on a five-point Likert scale and complemented with qualitative comments.
Insights from this evaluation cycle informed a second engineering pass in which every persona was
represented through a structured prompt comprising the following main elements:
• Core profile. A stable set of attributes: name, age, gender, marital status, and a pre-defined
BDI-II score.
• Key negative symptoms. Up to four key BDI-II symptoms (or fewer for control personas) that the
agent should manifest recurrently and coherently.
• Memory and reflection. Specific snippets describing life history, social context, and salient past
events; these cues allow the agent to maintain narrative continuity and to provide retrospective
insight into its mood.
• Language and communication style. Vocabulary use and typical sentence length, so that each
persona speaks with a recognisable “voice”.
• Behavioural constraints. Guard-rails that prohibit explicit self-diagnosis and that keep the agent
away from clinical recommendations, thereby forcing participants to infer depression indirectly.
• Response goals. High-level objectives such as “answer candidly but not expansively,” “avoid
mentioning diagnosis unless prompted,” and “display mild self-disclosure”.
• Environment and context. Brief situational framing (e.g. studying for exams, recent job change)
that provides topical depth without locking the dialogue.
• Few-shot exemplars. Short question–answer pairs illustrating the expected tone and symptom
expression.
• Restricted responses. A blacklist of phrases that would break immersion (e.g. “As an AI language
model...”), replaced with context-appropriate alternatives.</p>
        <p>The final personas were frozen only after a second round of clinician interaction confirmed that
they satisfied a minimum threshold on both the overall and diagnostic scales. This iterative,
expert-guided construction process proved essential to achieving dialogues that are simultaneously natural and
diagnostically meaningful, laying the groundwork for future large-scale evaluations of conversational
mental-health screening systems.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Pilot Task: Participant Teams</title>
        <p>Table 12 shows the participant teams and some statistics about their interactions such as the mean
number of messages per run, and the mean number of characters per message. The numbers reveal a
wide range of interaction strategies:
• ixa-ave submitted the maximum number of runs (four) and tended to carry out relatively lengthy
dialogues (≈ 31 messages each) while keeping their prompts concise (≈ 415 characters per turn).
• SINAI-UJA used a fast approach, with only 6–7 turns on average, yet still packed almost 490
characters into every message, suggesting dense, information-rich questioning.
• DS-GT followed an intermediate approach, with ≈ 21 messages per run and 783 characters per
message, balancing breadth and depth of interaction.
• PJs-team produced long messages (≈ 1 045 characters) within a limited number of turns (≈ 8),
delivering extended prompts.
• LT4SG employed a fixed sequence of ten short messages averaging only 41 characters, representing
the most lightweight strategy.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Pilot Task: Evaluation Metrics</title>
        <p>
          Building on the evaluation metrics developed for eRisk 2019 [48], which relied on BDI-II
questionnaires and scores, we extend the evaluation approach as follows:
• Depression Category Hit Rate (DCHR): Based on the four depression level categories that
we have discussed, from minimal depression to severe depression, this effectiveness measure
is the fraction of simulated personas for which the BDI-II score estimated by the participants
lies in the correct depression category.
• Average DODL (ADODL): For this pilot task, we reuse the Average Difference between Overall
Depression Levels (ADODL), which measures how close the estimated depression level is to the
actual one. For each persona, the DODL is computed as
$\mathrm{DODL} = (\mathrm{MAD} - |\mathrm{ADL} - \mathrm{EDL}|)/\mathrm{MAD}$, where
$|\mathrm{ADL} - \mathrm{EDL}|$ is the absolute difference between
the Actual Depression Level (ADL) and the Estimated Depression Level (EDL), and MAD is the
Maximum Absolute Difference (i.e., 63), which normalises the score to [0, 1]. The ADODL is the
average of the DODL values over all personas.
For example, if a simulated persona has a minimal depression severity (depression level 5) and a
participant estimates the depression level is 9, the DODL is calculated as (63 − |9 − 5|)/63 =
0.9365.
• Average Symptom Hit Rate (ASHR): For the last effectiveness measure, aside from estimating
the depression level of simulated personas as per BDI-II scores, this pilot task also involves the
identification of the major depression symptoms of simulated personas. Hence, the SHR is the
fraction of a persona's major symptoms that the participants correctly identify, and the ASHR
averages it over all personas. For example, suppose a simulated persona has four major symptoms. If a participant
accurately identifies two of them, then the SHR equals 2/4 = 0.5.
        </p>
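        <p>The three measures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the official evaluation script: the function names are hypothetical, the symptom labels in the usage lines are invented, and the category cut-offs (13, 19, 28) are the standard BDI-II bands, which may differ from the bands used in the task.</p>
        <preformat>
```python
from bisect import bisect_left

MAD = 63  # maximum absolute difference between two BDI-II totals (0 to 63)

# Assumption: upper bounds of the minimal, mild and moderate bands, taken from
# the standard BDI-II cut-offs; replace them if the task defines other bands.
ASSUMED_UPPER_BOUNDS = (13, 19, 28)


def category(score, upper_bounds=ASSUMED_UPPER_BOUNDS):
    """Return the band index of a BDI-II total: 0 = minimal ... 3 = severe."""
    return bisect_left(upper_bounds, score)


def dodl(actual, estimated):
    """DODL for one persona: (MAD - |ADL - EDL|) / MAD."""
    return (MAD - abs(actual - estimated)) / MAD


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def adodl(actual_scores, estimated_scores):
    """Average DODL over all personas."""
    return mean(dodl(a, e) for a, e in zip(actual_scores, estimated_scores))


def dchr(actual_scores, estimated_scores):
    """Fraction of personas whose estimated score falls in the correct band."""
    return mean(category(a) == category(e)
                for a, e in zip(actual_scores, estimated_scores))


def shr(actual_symptoms, predicted_symptoms):
    """Fraction of a persona's major symptoms that were identified.

    Assumes the persona has at least one major symptom.
    """
    actual = set(actual_symptoms)
    return len(actual.intersection(predicted_symptoms)) / len(actual)


def ashr(actual_sets, predicted_sets):
    """Average SHR over all personas."""
    return mean(shr(a, p) for a, p in zip(actual_sets, predicted_sets))


# Worked examples from the text (symptom names are illustrative only).
print(round(dodl(5, 9), 4))  # 0.9365
print(shr({"sadness", "pessimism", "guilt", "fatigue"},
          {"sadness", "guilt", "crying"}))  # 0.5
```
        </preformat>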
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Pilot Task: Results</title>
        <p>Table 13 presents the official runs, ranked by best ADODL. The strongest submission, SINAI-UJA (run
1), achieves an ADODL of 0.93, meaning the predicted scores differ by less than five points on average
(0.07 × 63 ≈ 4.4) from the clinician reference. Its DCHR of 0.58 shows that most of these small errors still fall within the
correct severity band. DS-GT attains comparable category accuracy (0.50) with only a modest drop in
ADODL, reaching a similar level of category reliability despite larger absolute score errors.</p>
        <p>Across all teams, however, symptom recognition lags behind score estimation: even the best ASHR
values hover below 0.30, indicating that systems often capture the global severity signal without isolating
which symptoms drive it.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Participating Teams</title>
      <p>Table 14 reports the participating teams and the runs that they submitted for each eRisk task. The next
paragraphs briefly summarise the methods implemented by each of them. Further details are
available in the participants' papers in the CLEF 2025 Working Notes proceedings.</p>
      <p>Lotu-ixa [42]. The Lotu-ixa team, affiliated with the University of the Basque Country (Spain), participated
in Task 2 of eRisk 2025. The team proposes a method to (i) apply a semantic
relabelling process to the training data, (ii) design and fine-tune a classification model, and (iii)
combine risk signals derived from both the target user and the conversational context. For (i), a
similarity score was computed for representative positive and negative examples, and a
percentile-based strategy then determined the messages suitable for relabelling. The classifier (ii) was derived from
XLM-RoBERTa and fine-tuned on the relabelled dataset from (i) using a binary cross-entropy loss
function, with hyperparameters optimised via grid search. Finally, for (iii), the team computed
user risk, context risk and thread risk scores, and derived a binary decision from them. The team
submitted five runs using different thread risk score settings. Run #4 obtained the best recall
score (1.0), and runs #0-#3 yielded the best ERDE5 (0.05) among the participants of this task. On the
ranking-based metrics, their approach is highly competitive in precision and
NDCG (most of their runs achieve 1.0).</p>
      <p>SINAI-UJA [44]. The SINAI-UJA team, from the University of Jaén (Spain), participated in Tasks 2 and 3
of the eRisk 2025 challenge. For Task 2, the team relied on the provided train and test sets, but also
developed a new dataset to obtain training data with context. The team fine-tuned
RoBERTa and Mental RoBERTa models in different settings, optimised with Optuna, and submitted
five runs using different combinations of models and parameters. Their system was one of the
top three in terms of efficiency, completing the task in less than 10 hours. Their system achieved
perfect recall (1.0) in all runs, but at the cost of low precision (0.17-0.24) and low F1
(0.29-0.39). In the ranking-based evaluation, the team performed competitively at early stages, on par
with the top-ranked teams. For Task 3, the SINAI team proposed a modular system composed of two
collaborating LLMs: the first is responsible for interacting with the user, while the second does not
interact with the user but receives the conversation, analyses it, and updates the state of the depressive symptoms.
The second LLM also reasons about whether it needs more information, ending the conversation when
appropriate. The team uses the Llama-3.1-8B-Instruct model for both LLMs. They submitted three runs with
different prompt configurations, achieving the fastest interaction, with an average of 6.54 messages per
run, as well as the best overall ADODL (0.93), ASHR (0.29) and DCHR (0.66), highlighting
that the estimations were highly aligned with the BDI-II levels of the simulated personas and that their
approach effectively identified key symptoms.</p>
      <p>
        COTECMAR-UTB [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. COTECMAR-UTB, affiliated with Universidad Tecnológica de Bolívar (Colombia),
participated in Tasks 1 and 2 of eRisk 2025. For Task 1, the team
focused on high-confidence training data and balanced the data using EDA and SMOTE. The authors
propose a pipeline that includes data preprocessing and cleaning, followed by training ML models such as LR,
SVM and BERT. They then apply VADER to identify texts with negative sentiment
and score the sentences. They submitted one run, achieving a middle-tier performance. For Task 2,
the team trained an LSTM model to predict the risk of depression. The team submitted two runs with
moderate performance, achieving a best F1 of 0.40 and a recall of 0.65. The ranking-based
metrics leave room for improvement, suggesting that the model had difficulties prioritizing
relevant messages.
      </p>
      <p>HULAT_UC3M [36]. The HULAT-UC3M team, affiliated with Universidad Carlos III de Madrid
(Spain), participated in Task 1 of the eRisk 2025 challenge. The team proposed training a
multi-class classifier (SVM) to assign each sentence to its corresponding symptom, keeping only
the sentences with the highest probabilities according to different thresholds; filtering sentences according to
different criteria for each run; and scoring the sentences using either VADER or roberta-base-sentiment.
The authors use only the training data with unanimous labels to minimise noise. Their best run uses RoBERTa,
selecting the top 1000 sentences based on confidence scores, achieving an AP of 0.018. The
high-confidence filtering used in two of their runs had a positive impact on performance, but the scoring method
can be improved.</p>
      <p>BGU-Data-Science [35]. The BGU-Data-Science team, affiliated with Ben-Gurion University of the Negev
(Israel), participated in Task 1 of eRisk 2025. The authors approached the task as a sentence ranking
problem by computing the semantic similarity between user sentences and BDI-II symptom descriptions
embedded using Sentence-BERT. The team performed query expansion and filtered out sentences that
were not in the first person. For the first-person filtering, the team employed three different methods: a
basic filtering approach using first-person pronouns, a method using spaCy, and Claude 3.7 Sonnet to
assess whether a sentence conveys the user’s personal experience. The team achieved their best results
with the baseline approach, using only embeddings from Sentence-BERT, which resulted in an AP score
of 0.240. Although incorporating query expansion and first-person filtering did not yield the highest
AP, it did achieve the highest P@10 among the team's runs.</p>
      <p>INESC-ID [33]. The INESC-ID team, affiliated with the University of Lisbon (Portugal), participated
in the first task of the eRisk Lab. Although this task is framed as an information retrieval challenge,
the authors approach it as a regression or classification problem. The team explored several methods,
including fine-tuned foundation models (DeBERTa-v3-large), unsupervised similarity-based approaches,
and LLM-based classification using GPT-4o-Mini. The authors make use of the training data provided
for this task to train and validate their approaches. The DeBERTa model was fine-tuned for regression
to predict a relevancy score ranging from 0 to 1, while the other two methods were framed as binary
classification tasks. The best-performing run of the team was an ensemble approach that combined
outputs from all the methods, achieving the highest scores in AP, R-PREC, and P@10.</p>
      <p>HU [46]. The HU team, affiliated with Habib University (Pakistan), participated in Task 2 of
the eRisk 2025 challenge. The runs submitted by the team cover a wide range of approaches, including
transformer-based models (ModernBERT) with time-aware loss or data augmentation strategies, Llama
3.1 summarization with BERT classification, a zero-shot model using Llama-4-Scout-17B, and a simple
threshold approach. The best-performing method, using Llama 3.1 for summarization and BERT
classification with an incorporated alert policy (run #1), achieved an F1 score of 0.75, ranking 3rd out
of 12 teams in the decision-based evaluation. In the ranking-based evaluation, the same run obtains a perfect
score of 1.00 in P@10 and NDCG@10 after one writing.</p>
      <p>FU-TU-DFKI [47]. The FU-TU-DFKI team is affiliated with three different organizations from Germany:
the Freie Universität Berlin, the University of Hannover, and the Technical University of Berlin. They
participated in Task 2 of eRisk 2025. The authors conducted two pilot studies that focused on the
linguistic analysis of the dataset provided for the task. The first study examined the use of first-person
singular pronouns and the verbs commonly associated with them. The second study involved a concept
analysis of the keywords found in the data. The insights gained from these studies helped inform their
proposed method. The team’s hybrid system combines a transformer-based model (MentalBERT) with
linguistically informed features, such as the use of first-person pronouns and associated verbs, as well
as other relevant keywords. In addition, the system incorporates metadata, including late-night posting
frequency and the sentiment of the posts. The team achieved modest results, processing only 449 out
of a total of 1 280 user threads.</p>
      <p>
        ThinkIR [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. The ThinkIR team comes from two organizations in India, the Indian Institute of Science
Education and Research Kolkata, and the Vellore Institute of Technology. ThinkIR submitted five runs
for Task 1. Four rely on classical IR ranking with different query expansion strategies, namely kNN word
embedding expansion, pseudo-relevance feedback (PRF), and GPT-generated prompt reformulations,
while the remaining one uses a RoBERTa-based multi-label classifier. The best run, which involves
RoBERTa fine-tuning, achieved an AP of 0.068, R-Precision of 0.157, P@10 of 0.409, and NDCG of
0.228, leading every metric among their runs. The experiments confirm that transformer fine-tuning
outperformed all classical expansion methods, although PRF on the top ten documents still produced
competitive rankings.
      </p>
      <p>
        Ixa_ave [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. The ixa_ave team is affiliated with the HiTZ Basque Center for Language Technology,
from the University of the Basque Country (Spain). ixa_ave took part in Task 1 and the inaugural pilot
task at eRisk 2025. For Task 1 they fine-tuned multilingual BERT, appending a 21-dimensional vector
of cosine-similarity scores to each sentence and predicting with a 21-head classifier. They tried two
similarity-based data-reduction ideas: (i) skip training sentences whose similarity to any BDI-II item
exceeds a threshold of 0.5, and (ii) at inference keep only sentences whose similarity is at least a threshold of 0.3 or 0.5. Among
the five submitted runs, base_filter30 (inference threshold of 0.3, no training pruning) was best, reaching AP = 0.102 under
majority voting. In the pilot task they compared a manual questionnaire interview (run 0) with three
LLM agents: GPT-4-long (run 1), GPT-4-short (run 2) and Falcon-11B (run 3). Both GPT-4 variants
matched the human baseline with DCHR = 0.33, whereas Falcon obtained worse results, with 0.17.</p>
      <p>UET-Psyche-Warriors [34]. The UET-Psyche-Warriors team is affiliated with the VNU University
of Engineering and Technology (Vietnam). The authors participated in Task 1 and Task 2 of the eRisk
CLEF 2025 challenge. For Task 1, they explored both semantic similarity-based ranking and a machine
learning approach using a multi-task DepRoBERTa model fine-tuned for symptom detection and severity
estimation. Their best run (Run 4) achieved an NDCG of 0.623 and an AP of 0.339, ranking second
overall. For Task 2, the team implemented a multi-stage system combining sentence-level severity
scoring with rule-based aggregation strategies. Run 2, which incorporated temporal accumulation with
a bonus heuristic, achieved their best results with an F1 score of 0.73 and a latency-aware F1 of 0.68,
placing them fourth overall.
      </p>
      <p>
        ELiRF-UPV [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. The ELiRF-UPV team, affiliated with the Polytechnic University of Valencia (Spain),
participated in Tasks 1 and 2 of the eRisk 2025 challenge. For Task 1, the team developed an adapter
architecture over pre-trained sentence similarity models, incorporating attention over reference
embeddings derived from both cluster centroids and the BDI-II question-answer pairs. For Task 2, they
explored three approaches: a classical SVM classifier, a Longformer fine-tuned on user-level data, and a
task-adapted Longformer model trained using a data augmentation strategy designed to simulate early
detection conditions. Their best-performing system in Task 2 was a Linear SVM using TF-IDF features,
ranking 6th overall in the competition.
      </p>
      <p>
        HIT-SCIR [43]. The HIT-SCIR team is affiliated with the Harbin Institute of Technology, in
Harbin (China). They participated in Task 2 of the CLEF 2025 eRisk Lab. Their proposal focuses on
contextualized early detection of depression on social media, utilizing a multi-stage framework. Their
approach addresses the challenge of limited interactive context in training data by employing LLMs for
contextual data augmentation. Specifically, they use LLMs to simulate social interactions, generating
comments for original user posts and then summarizing these comments to create a rich semantic
context. A core component of their system is a psychiatric scale-guided risky post screening module,
which identifies depression-related information from user post histories. This module calculates a
risk score for each post based on its cosine similarity with symptom descriptions from established
psychological scales, like the BDI-II. Posts with higher risk scores are then filtered for depression
risk detection. The detection itself uses MentalBERT, a BERT variant optimized for mental health
texts, to generate post embeddings, and a Transformer with attention mechanisms to model inter-post
interactions and generate user features. The entire screening and detection process is trained end-to-end
using a Straight-Through Estimator (STE). For early detection testing, a dynamic risky post queue
and different alerting strategies are employed. The team submitted five runs with varying operational
parameters for their dynamic user-level early risk assessment strategy, using a voting ensemble of
their top three performing models. This integrated approach led to strong performance, achieving first
rank in several evaluation metrics, including F1 (0.85 for HIT-SCIR-4), ERDE50 (0.03 for HIT-SCIR-4),
and F-latency (0.82 for HIT-SCIR-2 and HIT-SCIR-4). They also achieved first place in the majority of
ranking-based metrics, such as P@10 and NDCG@10, across almost all writing cut-offs.
      </p>
      <p>
        PJs-team [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. The PJs-team, affiliated with Netaji Subhas University of Technology (India),
presented distinct approaches for the three tasks. In Task 1, the team used fine-tuned bi-encoders
(e.g., DistilRoBERTa, e5-small) with CoSENTLoss and their ensemble using Reciprocal Rank Fusion
(RRF). They also employed fine-tuned cross-encoders (ModernBERT-large, ModernBERT-base) with
BinaryCrossEntropyLoss for reranking, and reranker ensembles using majority voting or scaled mean
averaging. The cross-encoder ensemble run gave their top scores (AP 0.279, P@10 0.800). In Task 2,
they presented a two-stage pipeline that first filters each new post with a custom DistilRoBERTa
sentence-transformer against four early BDI cues (pessimism, punishment feelings, self-dislike, indecisiveness).
High-scoring texts, or texts from users previously flagged, are analysed by an ensemble of four hosted LLMs (Claude
3.7 Sonnet, Amazon Nova Pro, Llama 3-70B, Claude 3.5 Haiku), and a majority vote delivers the final decision.
The single-model Sonnet run achieved the team’s best F1 = 0.71 with a low ERDE5 = 0.09. In Task 3, they
built a single LLM agent (Claude Sonnet) driven by a long system prompt embedding the full BDI-II
questionnaire. The agent chats about movies to elicit emotions and updates the 21 BDI scores each turn,
ending when all scores are set. On the pilot evaluation it reached ADODL 0.73 and DCHR 0.33.
      </p>
      <p>
LHS712-Team-1 [32]. The LHS712-Team-1 team comes from the School of Information and the Department of Learning
Health Sciences at the University of Michigan (USA). The authors participated in Task 1 and benchmarked
a wide spectrum of ten runs, covering: (i) classical baselines, namely Logistic Regression and SVM coupled
with CountVectorizer or TF-IDF features; (ii) domain-specific embeddings, with ClinicalBERT and
Sentence-BERT sentence vectors fed into Linear-SVC or LR classifiers; (iii) a fine-tuned BERT, with
a “[SYMPTOM] [SEP] sentence” formulation trained for five epochs, where a symptom keyword
filter first pruned the 17-million-sentence test set to keep inference tractable; and (iv) a method based on
hybrid retrieval, where BM25 selects candidates that are reranked by SBERT cosine similarity.
The fine-tuned BERT with unanimous-label training was their top performer, yielding AP 0.078, R-Prec
0.169, P@10 0.344 and NDCG 0.287 on the official unanimity evaluation, well above their traditional
baselines.
      </p>
      <p>DS-GT [45]. The DS-GT team, from the Georgia Institute of Technology (USA), participated in
Task 2 and the pilot task. In Task 2, the team contrasted two pipelines: (i) a Voting Classifier combining
engineered features (TF-IDF, VADER sentiment, LIWC-style counts, posting-gap timings) in a
soft-vote ensemble of Random Forest, SGD-LogReg and Gradient Boosting; and (ii) LightGBM + temporal
attention, where MentalRoBERTa sentence embeddings feed a linearly-weighted recency mechanism
and a sparse “depression-indicator” content matrix before classification. Both runs achieved recall = 1.0
but low precision (P = 0.11, F1 = 0.20) and an identical ERDE5 = 0.12, with the embedding-based model
yielding far better ranking scores (P@10 = 0.90, NDCG@10 = 0.92 on the 1-writing cut). In the pilot
task, a unified prompt-engineering framework used several LLMs (Claude 3.7 Sonnet, GPT-4o, Gemini
Flash/Pro) to conduct ≈ 20-turn interviews, outputting structured JSON with item-level BDI-II scores
and key symptoms. The best run (Claude Sonnet) placed second overall (DCHR 0.50, ADODL 0.89,
ASHR 0.27). Exploratory analysis showed strong cross-model consistency (R² = 0.91 between label level
and BDI score) but wide variance on appetite and agitation cues.</p>
      <p>
        SonUIT [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. The SonUIT team is affiliated with the University of Information Technology (UIT), in
Vietnam, and participated in Task 1. Their system uses a two-stage pipeline: (i) filtering, where they
build averaged all-MiniLM-L6-v2 embeddings for each BDI-II symptom and pull the top 1000 sentences
per symptom via cosine similarity; and (ii) reranking, where the candidate set is optionally re-sorted with
BM25, a cross-encoder, or larger embedding models (bge-large-en-v1.5 and text-embedding-3-large).
Five runs explored raw vs. pre-processed text and the different rerankers. Their configuration #2
(pre-processed text + embedding filter) posted the team’s best scores and placed within the top-three
teams on every evaluation metric (MAP = 0.334, R-Prec = 0.392, P@10 = 0.790, NDCG@1000 = 0.613).
      </p>
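      <p>The filtering stage described for SonUIT (average several embeddings per symptom, then keep the sentences closest to that average by cosine similarity) can be sketched as follows. This is a minimal illustration, not the team's code: the function names, the toy two-dimensional vectors and the sentence identifiers are hypothetical, and in practice the vectors would come from an embedding model such as all-MiniLM-L6-v2 with k = 1000.</p>
      <preformat>
```python
import math


def mean_vector(vectors):
    """Average several equal-length vectors component-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def top_k(symptom_vectors, sentence_vectors, k=1000):
    """Return the ids of the k sentences closest to the averaged symptom vector.

    symptom_vectors: list of embeddings describing one symptom.
    sentence_vectors: dict mapping a sentence id to its embedding.
    """
    query = mean_vector(symptom_vectors)
    ranked = sorted(sentence_vectors,
                    key=lambda sid: cosine(query, sentence_vectors[sid]),
                    reverse=True)
    return ranked[:k]


# Toy data standing in for real embeddings (hypothetical values).
symptom = [[1.0, 0.0], [0.8, 0.2]]
sentences = {"s1": [1.0, 0.1], "s2": [0.0, 1.0], "s3": [0.5, 0.5]}
print(top_k(symptom, sentences, k=2))  # ['s1', 's3']
```
      </preformat>
      <p>The candidates returned by such a filter would then be passed to the optional reranking stage.</p>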
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This paper provided an overview of eRisk 2025, the ninth edition of the eRisk lab, which moved toward
two new tasks that require richer conversational understanding and interactive settings. Task 1,
which was the final edition of the sentence-ranking challenge for BDI-II symptoms, attracted 67 runs
from 17 teams. Task 2 introduced full-thread context for the first time in early detection of depression. In
this task, we received 50 runs from 12 teams, and the results showed that models able to exploit dialogue structure
can issue accurate alerts after remarkably few turns, although a clear trade-off persists between earliness
and recall. The pilot task went a step further, replacing static corpora with live interaction against
LLM-driven personas. Despite the absence of training data, five teams submitted 13 runs; top systems
achieved near-perfect BDI-II score estimation yet still struggled to pinpoint the specific symptoms that
drive those scores, highlighting the difficulty of symptom-level grounding in open conversation.</p>
      <p>Taken together, the 130 runs submitted this year confirm both the community’s engagement and the
practicality of evaluation settings that approach real conversational use cases. Three broad lessons
emerge: adding even modest context improves detection; timeliness must remain a core metric; and
clinician-guided LLM personas, despite having considerable room for improvement, can provide realistic
yet privacy-preserving evaluation frameworks. Future eRisk editions will continue to shift toward dialogue-centric
tasks and deeper integration of LLM capabilities to keep pace with how people communicate online
and how assistive technologies are deployed.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>The authors thank the financial support supplied by the grant PID2022-137061OB-C21 funded by
MICIU/AEI/10.13039/501100011033 and by “ERDF/EU”. The authors also thank the funding supplied by
the Consellería de Cultura, Educación, Formación Profesional e Universidades (accreditations ED431G
2023/01 and ED431C 2025/49) and the European Regional Development Fund. CITIC, as a center
accredited for excellence within the Galician University System and a member
of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities,
and Vocational Training of the Xunta de Galicia. Additionally, it is co-financed by the EU through the
FEDER Galicia 2021-27 operational program (Ref.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
      <p>During the preparation of this manuscript, generative AI tools were employed solely for light editing
purposes, including proofreading, grammar correction, vocabulary improvement, and overall language
polishing. All substantive ideas, analyses, experiments, and written content were created by the
co-authors without direct text generation from any AI model.</p>
      <p>Madrid, Spain, September 9-12, 2025.
[32] A. Benloucif, Y. Nannapuraju, S. Bellam, Y. Hu, Z. Zhao, V. Vydiswaran, Lhs712team-1 at eRisk@clef
2025: Searching for depression symptoms using various natural language processing algorithms,
in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain,
September 9-12, 2025.
[33] D. A. Nunes, E. Ribeiro, Inesc-id @ eRisk 2025: Exploring fine-tuned, similarity-based, and
prompt-based approaches to depression symptom identification, in: Working Notes of CLEF 2025 -
Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.
[34] T.-P. Mai, M.-H. L. H., D.-L. Tran, D.-C. Can, H.-Q. Le, Uet@eRisk2025: Severity estimation
for depression symptoms searching and early risk detection, in: Working Notes of CLEF 2025 -
Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.
[35] N. Munz, E. Aharon, A. Segal, K. Gal, Semantic retrieval of bdi symptoms in user writings, in:
Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain,
September 9-12, 2025.
[36] J. C. Molina, P. M. Fernandez, Hulat-uc3m at task 1@eRisk 2025: Detecting depression using
machine learning approaches, in: Working Notes of CLEF 2025 - Conference and Labs of the
Evaluation Forum, Madrid, Spain, September 9-12, 2025.
[37] D. E. Losada, F. Crestani, A test collection for research on depression and language use, in:
Proceedings Conference and Labs of the Evaluation Forum CLEF 2016, Evora, Portugal, 2016.
[38] D. Otero, J. Parapar, Á. Barreiro, Beaver: Efficiently building test collections for novel tasks, in:
Proceedings of the First Joint Conference of the Information Retrieval Communities in Europe
(CIRCLE 2020), Samatan, Gers, France, July 6-9, 2020, 2020.
[39] D. Otero, J. Parapar, Á. Barreiro, The wisdom of the rankers: a cost-effective method for building
pooled test collections without participant systems, in: SAC ’21: The 36th ACM/SIGAPP
Symposium on Applied Computing, Virtual Event, Republic of Korea, March 22-26, 2021, 2021, pp.
672–680.
[40] M. Trotzek, S. Koitka, C. Friedrich, Utilizing neural networks and linguistic metadata for early
detection of depression indications in text sequences, IEEE Transactions on Knowledge and Data
Engineering (2018).
[41] F. Sadeque, D. Xu, S. Bethard, Measuring the latency of depression detection in social media, in:
WSDM, ACM, 2018, pp. 495–503.
[42] X. Larrayoz, A. Casillas, A. Pérez, Leveraging conversational context and semantic relabeling
for early depression detection, in: Working Notes of CLEF 2025 - Conference and Labs of the
Evaluation Forum, Madrid, Spain, September 9-12, 2025.
[43] Y. Zi, B. Wang, Y. Zhao, B. Qin, Hit-scir@eRisk2025: Exploring the potential of a learnable screening
model and risk post buffer-based framework for contextualized early prediction of depression on
social media, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum,
Madrid, Spain, September 9-12, 2025.
[44] A. M. Mármol-Romero, M. García-Vega, M. Ángel García-Cumbreras, A. Montejo-Ráez, Sinai
at eRisk@clef 2025: Transformer-based and conversational strategies for depression detection,
in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, Madrid, Spain,
September 9-12, 2025.
[45] D. Guecha, Y. Chiu, A. Miyaguchi, S. Gaur, Ds@gt at eRisk 2025: From prompts to predictions,
benchmarking early depression detection with conversational agent based assessments and
temporal attention models, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation
Forum, Madrid, Spain, September 9-12, 2025.
[46] M. Saad, M. Abbas, A. U. Chaudhry, F. Alvi, A. Samad, Contextualized early detection of depression
– hybrid and time-aware approaches: Hu at eRisk task 2 2025, in: Working Notes of CLEF 2025 -
Conference and Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.
[47] E. Kara, R. E. M. Peña, L. Raithel, Fu-tu-dfki@eRisk 2025: A linguistically informed but
overdiagnosing approach to early depression detection, in: Working Notes of CLEF 2025 - Conference and
Labs of the Evaluation Forum, Madrid, Spain, September 9-12, 2025.
[48] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2019 early risk prediction on the internet,
in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 10th International
Conference of the CLEF Association, CLEF 2019, Lugano, Switzerland, September 9–12, 2019,
Proceedings 10, Springer, 2019, pp. 340–357.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>eRisk 2017: CLEF lab on early risk prediction on the internet: Experimental foundations</article-title>
          , in:
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lawless</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer International Publishing, Cham,
          <year>2017</year>
          , pp.
          <fpage>346</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>eRisk 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental foundations</article-title>
          , in:
          <source>CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2017</source>
          , Dublin, Ireland,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk: Early Risk Prediction on the Internet</article-title>
          , in: P. Bellot,
          <string-name>
            <given-names>C.</given-names>
            <surname>Trabelsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Murtagh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soulier</surname>
          </string-name>
          , E. SanJuan, L. Cappellato, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer International Publishing, Cham,
          <year>2018</year>
          , pp.
          <fpage>343</fpage>
          -
          <lpage>361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2018: Early Risk Prediction on the Internet (extended lab overview)</article-title>
          , in:
          <source>CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2018</source>
          , Avignon, France,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mendelson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Erbaugh</surname>
          </string-name>
          ,
          <article-title>An Inventory for Measuring Depression</article-title>
          ,
          <source>JAMA Psychiatry</source>
          <volume>4</volume>
          (
          <year>1961</year>
          )
          <fpage>561</fpage>
          -
          <lpage>571</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2019: Early risk prediction on the Internet</article-title>
          , in: F. Crestani,
          <string-name>
            <given-names>M.</given-names>
            <surname>Braschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Heinatz</given-names>
            <surname>Bürki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cappellato</surname>
          </string-name>
          , N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer International Publishing,
          <year>2019</year>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2019: Early risk prediction on the Internet (extended overview)</article-title>
          , in:
          <source>CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2019</source>
          , Lugano, Switzerland,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Early detection of risks on the internet: An exploratory campaign</article-title>
          ,
          in:
          <source>Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019</source>
          , Cologne, Germany, April 14-18, 2019, Proceedings, Part II,
          <year>2019</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2020: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 11th International Conference of the CLEF Association, CLEF 2020</source>
          , Thessaloniki, Greece, September 22-25, 2020, Proceedings,
          <year>2020</year>
          , pp.
          <fpage>272</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2020: Early risk prediction on the internet (extended overview)</article-title>
          , in:
          <source>Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum</source>
          , Thessaloniki, Greece, September 22-25, 2020,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>eRisk 2020: Self-harm and depression challenges</article-title>
          , in:
          <source>Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020</source>
          , Lisbon, Portugal, April 14-17, 2020, Proceedings, Part II,
          <year>2020</year>
          , pp.
          <fpage>557</fpage>
          -
          <lpage>563</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2021: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 12th International Conference of the CLEF Association, CLEF 2021</source>
          , Virtual Event, September 21-24, 2021, Proceedings,
          <year>2021</year>
          , pp.
          <fpage>324</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2021: Early risk prediction on the internet (extended overview)</article-title>
          , in:
          <source>Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum</source>
          , Bucharest, Romania, September 21-24, 2021,
          <year>2021</year>
          , pp.
          <fpage>864</fpage>
          -
          <lpage>887</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>eRisk 2021: Pathological gambling, self-harm and depression challenges</article-title>
          , in:
          <source>Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021</source>
          , Virtual Event, March 28 - April 1, 2021, Proceedings, Part II,
          <year>2021</year>
          , pp.
          <fpage>650</fpage>
          -
          <lpage>656</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2022: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 13th International Conference of the CLEF Association, CLEF 2022</source>
          , Bologna, Italy, September 5-8, 2022,
          <year>2022</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2022: Early risk prediction on the internet (extended overview)</article-title>
          , in:
          <source>Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum</source>
          , Bologna, Italy, September 5-8, 2022,
          <year>2022</year>
          , pp.
          <fpage>821</fpage>
          -
          <lpage>850</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>eRisk 2022: Pathological gambling, depression, and eating disorder challenges</article-title>
          , in:
          <source>Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022</source>
          , Stavanger, Norway, April 10-14, 2022, Proceedings, Part II,
          <year>2022</year>
          , pp.
          <fpage>436</fpage>
          -
          <lpage>442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2023: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023</source>
          , Thessaloniki, Greece, September 18-21, 2023,
          <year>2023</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2023: Early risk prediction on the internet (extended overview)</article-title>
          , in:
          <source>Proceedings of the Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          , Thessaloniki, Greece, September 18-21, 2023,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>eRisk 2023: Depression, pathological gambling, and eating disorder challenges</article-title>
          , in:
          <source>Advances in Information Retrieval - 45th European Conference on IR Research, ECIR 2023</source>
          , Dublin, Ireland, April 2-6, 2023, Proceedings, Part III,
          <year>2023</year>
          , pp.
          <fpage>585</fpage>
          -
          <lpage>592</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>eRisk 2024: Depression, anorexia, and eating disorder challenges</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lipani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          (Eds.),
          <source>Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024</source>
          , Glasgow, UK, March 24-28, 2024, Proceedings, Part V
          , volume
          <volume>14612</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          , pp.
          <fpage>474</fpage>
          -
          <lpage>481</lpage>
          . URL: https://doi.org/10.1007/978-3-031-56069-9_65. doi:10.1007/978-3-031-56069-9_65.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2024: Early risk prediction on the internet (extended overview)</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024)</source>
          , Grenoble, France, 9-12 September, 2024, volume
          <volume>3740</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          , pp.
          <fpage>759</fpage>
          -
          <lpage>781</lpage>
          . URL: https://ceur-ws.org/Vol-3740/paper-72.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2024: Early risk prediction on the internet</article-title>
          , in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Quénot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>G. M. D. Nunzio</string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscáková</surname>
          </string-name>
          ,
          <string-name>A. G. S. de Herrera</string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024</source>
          , Grenoble, France, September 9-12, 2024, Proceedings, Part II
          , volume
          <volume>14959</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>92</lpage>
          . URL: https://doi.org/10.1007/978-3-031-71908-0_4. doi:10.1007/978-3-031-71908-0_4.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>eRisk 2025: contextual and conversational approaches for depression challenges</article-title>
          , in:
          <source>European Conference on Information Retrieval</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>416</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2025: Early risk prediction on the internet (extended overview)</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2025)</source>
          , Madrid, Spain, 9-12 September
          <year>2025</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Son</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Thin</surname>
          </string-name>
          ,
          <article-title>Sonuit eRisk2025: Enhanced depression detection on social media via filtering and re-ranking</article-title>
          , in:
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          , Madrid, Spain, September 9-12,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Adhikary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Thinkir at eRisk 2025: Early detection and risk assessment of depression using transformer models</article-title>
          , in:
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          , Madrid, Spain, September 9-12,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Varela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oronoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Casillas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Detection of depression with symptom similarity: Data reduction and LLM personas</article-title>
          , in:
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          , Madrid, Spain, September 9-12,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vachharajani</surname>
          </string-name>
          ,
          <article-title>Transformer ensembles and LLM-powered approaches for depression symptom analysis and contextualized early risk detection</article-title>
          , in:
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          , Madrid, Spain, September 9-12,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Segarra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Esteve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Marco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-F. H.</given-names>
            <surname>Oliver</surname>
          </string-name>
          ,
          <article-title>Elirf-upv at eRisk 2025: New approaches to the detection and early detection of symptoms and signs of depression</article-title>
          , in:
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          , Madrid, Spain, September 9-12,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L. F. M.</given-names>
            <surname>Cardona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M. S.</given-names>
            <surname>Loaiza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A. P. D.</given-names>
            <surname>Castillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. M.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. S.</given-names>
            <surname>Castañeda</surname>
          </string-name>
          ,
          <article-title>Cotecmar-utb at eRisk 2025: Semantic-centroid symptom ranking and early depression detection using adaptive decision rule</article-title>
          , in:
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>