<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UET@eRisk2025: Severity Estimation for Depression Symptoms Searching and Early Risk Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tu-Phuong Mai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Ha H. Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duc-Luong Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duy-Cat Can</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang-Quynh Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>VNU University of Engineering and Technology</institution>
          ,
          <addr-line>144 Xuan Thuy, Cau Giay, Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this working note, we describe our participation in Task 1 and Task 2 of the CLEF eRisk 2025 Lab, which focuses on the early detection of depression based on Reddit user-generated content. For Task 1, which involves ranking up to 1,000 sentences according to their relevance to each of the 21 BDI-II depressive symptoms, we combined symptom classification with two approaches: (i) a semantic similarity-based approach, in which clustering techniques group and rank sentences by their relevance to specific depressive symptoms; and (ii) a machine learning-based approach, in which the output scores of a model fine-tuned for symptom detection are used to rank sentences directly by predicted relevance. For Task 2, which targets early detection of depression within multi-user conversations, we design a multi-stage architecture that performs sentence-level symptom and severity detection, aggregates these signals at the post level, and finally estimates depression risk at the conversation level. This layered structure allows the model to capture both localized symptom cues and broader conversational patterns.</p>
      </abstract>
      <kwd-group>
        <kwd>Depression</kwd>
        <kwd>Symptoms Searching</kwd>
        <kwd>Early Risk Detection</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The CLEF eRisk 2025 Lab [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] focuses on the early detection of mental health risks through the
analysis of online user-generated content. The competition promotes the development of natural language
processing (NLP) systems that are capable of identifying early signs of depression based on social
media text. The data used in eRisk tasks is collected from the Reddit platform, where users share
personal experiences through posts or discussions. These environments often encourage openness
and anonymity, resulting in large volumes of natural language data that reflect individuals’ thoughts,
emotions, and behaviors. This year, eRisk 2025 features three tasks: (1) sentence ranking for depression
symptoms, based on the 21 symptoms from the Beck Depression Inventory-II (BDI-II) questionnaire [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ];
(2) contextualized early detection of depression, using full, multi-user conversational threads presented
in chronological order; and (3) a pilot task involving the detection of depression in LLM-powered
conversational agents, where systems must infer the mental state of a simulated user. Together, these
tasks aim to support the development of practical and scalable methods for mental health monitoring
and early intervention.
      </p>
      <p>
        Task 1 continues the setup from the eRisk 2024 challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In that edition, several teams employed retrieval-based
approaches by ranking user-generated content based on its cosine similarity to the Beck Depression
Inventory-II (BDI-II) questionnaire [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Among them, the NUS-IDS team [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] achieved top performance by
leveraging ensemble learning and contrastive fine-tuning. Their system combined sentence-transformer
models fine-tuned on task-specific data with expressive exemplars generated via prompting
GPT4 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], incorporating both BDI symptoms and features from the Early Maladaptive Schemas (EMS)
taxonomy [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Task 2 continues the setup from eRisk 2022 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where the NLPGroup-IISERB team [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
attained top performance using entropy-based bag-of-words features combined with an SVM classifier.
      </p>
      <p>Their approach demonstrated that traditional feature engineering, when carefully designed, can remain
competitive for early risk detection.</p>
      <p>
        Our team participated in Task 1 and Task 2 of the CLEF eRisk 2025 Lab. We leverage DepRoBERTa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
a RoBERTa-based [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] model pre-trained for depression detection, in both tasks to filter out irrelevant
sentences that do not reflect depressive content. In Task 1, after identifying relevant sentences using
the filtering model, we adopt two approaches: (i) a semantic similarity-based method, where we cluster
sentence embeddings to group semantically similar expressions for each symptom and rank sentences
based on their distance to cluster centroids; and (ii) a machine learning-based method, where we use the
output scores from a multi-task DepRoBERTa model fine-tuned for symptom detection and directly rank
sentences based on predicted relevance scores. For Task 2, we propose a multi-stage framework that
first applies the same filtering model to discard irrelevant sentences, then reuses it to produce
sentence-level embeddings for detecting symptom presence and estimating severity, and finally aggregates
this information at the post and conversation levels to estimate depression risk, integrating both local
and contextual cues for early mental health detection.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <sec id="sec-2-1">
        <title>2.1. Task 1: Search for Symptoms of Depression</title>
        <p>This task focuses on ranking documents relevant to symptoms of depression as outlined in the BDI-II
questionnaire. The goal is to produce ranked lists containing up to 1,000 of the most relevant sentences
for each specific symptom. Evaluation involved expert annotators labeling pooled candidate sentences
as relevant if they addressed the symptom and reflected the individual’s state, with context provided
for accuracy. The final relevance scores were determined using two approaches: majority voting, where
a sentence is marked relevant if most assessors agree, and unanimity, where all assessors must agree on
relevance. These methods ensure reliable and consistent evaluation for training and testing.</p>
        <p>The training data for Task 1 was provided from previous editions of the same task, specifically from
eRisk 2023 and eRisk 2024. The test set for this year includes data collected from 9,000 Reddit users,
comprising over 17 million sentences. The data is formatted according to the TREC format. The main
statistics1 of the corpus are presented in Table 1.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task 2: Contextualized Early Detection of Depression</title>
        <p>This task focuses on the early detection of depression by analyzing full conversational contexts. Unlike
previous tasks that consider isolated user posts, this task processes interactions among all participants
in a conversation sequentially, reflecting real-world social media dynamics. The dataset includes the
target user’s writing history and all comments from conversation members, enabling timely depression
detection based on evolving dialogue.</p>
        <p>
          The dataset follows the format described in Losada &amp; Crestani (2016)[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and consists of Reddit
conversations where each conversation forms a tree-structured thread centered around a target user.
The objective is to predict a depression score in [0, 1] for the target user based on contextual signals
from the conversation.
        </p>
        <p>1 Statistics of the training set are based on reports from the eRisk 2023 and eRisk 2024 editions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <sec id="sec-3-1">
        <title>3.1. Task 1: Search for Symptoms of Depression</title>
        <p>Our approach to Task 1 is based on two directions: (i) a semantic similarity-based ranking pipeline and
(ii) a machine learning-based ranking model.</p>
        <p>Semantic similarity-based approach. This direction first uses a multi-label classification model
to filter out irrelevant sentences. The remaining relevant sentences are embedded using sentence
transformers and grouped into symptom-specific clusters. At inference, test sentences are ranked based
on their similarity to these clusters. We explore three configurations: (a) direct semantic similarity, (b)
embeddings fine-tuned via contrastive learning, and (c) an ensemble of multiple embedding models.
Machine learning-based approach. In this direction, we directly use the output scores from a
fine-tuned multi-task model (described in Section 3.1.3) to rank sentences by relevance.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Pre-processing</title>
          <p>We began with the oficial sentence-level annotations provided in TREC format, where each sentence is
associated with a user ID and timestamp. To ensure high-quality input and reduce noise, we applied
several filtering steps. Texts were lowercased, and non-linguistic tokens such as URLs, emojis, and
special characters were removed. Crucially, we filtered for first-person expressions by detecting
first-person pronouns (e.g., “I”, “me”, “my”, ...), under the hypothesis that self-reported experiences better
reflect the user’s mental state than statements about others or general opinions. The resulting dataset
included relevant sentences from the 2023 and 2024 editions of eRisk, which were used for model
training and clustering.</p>
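<p>For illustration, the cleaning and first-person filtering steps above can be sketched as follows; the exact pronoun list and character-cleaning rules are our assumptions, not the authors' released pipeline.</p>

```python
import re

# Hypothetical re-implementation of the pre-processing step: lowercase,
# strip URLs and non-linguistic tokens, then keep first-person sentences.
FIRST_PERSON = re.compile(r"\b(i|me|my|mine|myself)\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
NON_LINGUISTIC = re.compile(r"[^a-z0-9\s'.,!?-]")

def preprocess(sentence: str) -> str:
    """Lowercase a sentence and remove URLs and special characters."""
    text = sentence.lower()
    text = URL.sub(" ", text)
    text = NON_LINGUISTIC.sub(" ", text)
    return " ".join(text.split())

def keep_first_person(sentences):
    """Retain only cleaned sentences containing a first-person pronoun."""
    cleaned = [preprocess(s) for s in sentences]
    return [s for s in cleaned if FIRST_PERSON.search(s)]

sentences = [
    "I feel empty most mornings.",
    "He seems happier lately.",
    "My sleep has been terrible, see https://example.com",
]
print(keep_first_person(sentences))
```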
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Semantic similarity-based approach</title>
          <p>We combine filtering with clustering to identify semantically representative symptom expressions. First,
sentences are filtered using a DepRoBERTa-based multi-label classifier to retain only those relevant to
any of the 21 BDI-II symptoms. After filtering for relevant sentences, we group them into semantic
clusters and rank new sentences based on their distance to the nearest cluster centroid. This enables the
system to identify symptom-relevant sentences that may express depressive cues in more varied ways.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Clustering and Semantic Representation.</title>
          <p>
            To capture variations in how each symptom is
linguistically expressed, we performed clustering over the relevant training sentences. Each sentence s was
embedded using a Sentence Transformer [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] model, specifically nomic-embed-text-v1.5 [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ],
to obtain a d-dimensional vector representation:
          </p>
          <p>v = Embed(s) ∈ R^d (1)</p>
          <p>For each symptom j ∈ {1, . . . , J} (with J = 21 for the BDI-II symptoms), we collected the subset of
training embeddings {v_i^(j)} relevant to that symptom. Then, we applied K-means clustering to this set
to form K clusters:</p>
          <p>{c_1^(j), . . . , c_K^(j)} = KMeans({v_i^(j)}) (2)</p>
          <p>where c_k^(j) denotes the centroid of the k-th cluster for symptom j.</p>
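<p>A minimal sketch of the per-symptom clustering step, assuming random toy vectors in place of the nomic-embed-text-v1.5 embeddings; in practice a library routine such as scikit-learn's KMeans would be the natural choice, but a small hand-rolled version keeps the example self-contained.</p>

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal K-means: returns a (k, d) matrix of centroids for embeddings X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each embedding to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

# Toy stand-in for the symptom-specific sentence embeddings; in the actual
# system these come from the nomic-embed-text-v1.5 encoder.
rng = np.random.default_rng(1)
emb_per_symptom = {j: rng.normal(size=(40, 8)) for j in range(21)}

# K clusters per symptom (the submitted runs used K = 5 or K = 11).
centroids_per_symptom = {j: kmeans(X, k=5) for j, X in emb_per_symptom.items()}
print(centroids_per_symptom[0].shape)
```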
          <p>This clustering strategy groups semantically similar sentences into coherent sub-themes within each
symptom category. The choice of K balances intra-cluster similarity against inter-cluster diversity.</p>
          <p>Contrastive Learning. To improve the discriminative quality of sentence embeddings, we
applied contrastive learning using the InfoNCE loss. Each sentence embedding v obtained from the
nomic-embed-text-v1.5 model was first projected into a lower-dimensional space via a linear
mapping layer:</p>
          <p>h = W · v + b (3)</p>
          <p>where W ∈ R^{128×d} is a trainable weight matrix and d is the original embedding dimension.</p>
          <p>Given a batch of training samples with known symptom labels, positive pairs were constructed from
sentences annotated with the same symptom, and negatives from sentences belonging to different
symptoms. The InfoNCE loss was then applied to pull embeddings of similar sentences closer and push
dissimilar ones apart:</p>
          <p>ℒ_contrast = − Σ_i (1 / |P(i)|) Σ_{p ∈ P(i)} log [ exp(sim(h_i, h_p)/τ) / Σ_{a ∈ A(i)} exp(sim(h_i, h_a)/τ) ] (4)</p>
          <p>where:
• P(i) = {p ≠ i | y_p = y_i} is the set of positive indices with the same label as anchor i;
• A(i) = {a ≠ i} is the set of all other samples in the batch;
• sim(·, ·) denotes cosine similarity and τ is a temperature hyperparameter.</p>
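<p>The supervised InfoNCE objective above can be sketched numerically as follows; the temperature value and the toy embeddings are illustrative assumptions. Label-clustered embeddings should yield a markedly lower loss than unstructured ones.</p>

```python
import numpy as np

def info_nce(h, labels, tau=0.1):
    """Supervised InfoNCE: average -log softmax over each anchor's positives."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)  # cosine sim via dot product
    sim = h @ h.T / tau
    n = len(h)
    losses = []
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not pos:
            continue  # anchors without positives contribute nothing
        others = [a for a in range(n) if a != i]
        log_denom = np.log(np.exp(sim[i, others]).sum())
        losses.append(np.mean([log_denom - sim[i, p] for p in pos]))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
labels = [0, 0, 1, 1]
# Same-label embeddings drawn near orthogonal centers (tight clusters).
centers = {0: np.eye(8)[0], 1: np.eye(8)[1]}
tight = np.stack([centers[y] + 0.01 * rng.normal(size=8) for y in labels])
rand_h = rng.normal(size=(4, 8))
print(info_nce(tight, labels), info_nce(rand_h, labels))
```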
          <p>This training objective encourages the embedding space to reflect symptom-level semantic distinctions
more clearly, enhancing the quality of downstream clustering and similarity-based ranking.</p>
          <p>Sentence Assignment and Ranking. Each test sentence predicted as relevant was also embedded
using the same nomic model. Then, for each symptom, we applied nearest-neighbor search over the
training clusters of that symptom (K = 11) to identify the closest cluster centroid. We assigned each test
sentence to the nearest cluster and computed its distance to the centroid. This distance was converted
to a normalized similarity score via:</p>
          <p>Similarity(s) = 1 − ‖v_s − c_j‖₂ / max_{s′ ∈ S_j} ‖v_{s′} − c_j‖₂ (5)</p>
          <p>where:
• v_s is the embedding vector of test sentence s;
• c_j is the nearest cluster centroid for symptom j;
• S_j is the set of all test sentences predicted as relevant to symptom j.</p>
          <p>The final ranking was derived by sorting all test sentences for each symptom in descending order of
similarity, selecting the top 1,000 as the system output.</p>
          <p>This approach combines high-precision filtering from the symptom classifier with semantic
granularity from clustering, enabling the system to surface sentences that are not only relevant to a symptom
but also representative of its most prototypical or central expressions.</p>
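<p>The centroid-based similarity and ranking step can be sketched as follows, with random vectors standing in for real sentence embeddings and centroids.</p>

```python
import numpy as np

# Distance to the nearest centroid is converted into a similarity in [0, 1],
# then used to rank candidate sentences for one symptom.
rng = np.random.default_rng(0)
test_emb = rng.normal(size=(6, 8))    # embeddings of sentences predicted relevant
centroids = rng.normal(size=(5, 8))   # cluster centroids for this symptom

# Distance of each sentence to its nearest centroid.
d = np.linalg.norm(test_emb[:, None, :] - centroids[None, :, :], axis=-1).min(axis=1)
# Normalize by the maximum distance over the candidate set.
similarity = 1.0 - d / d.max()

# Rank descending and keep the top sentences (top 1,000 in the real system).
ranking = np.argsort(-similarity)
print(ranking, np.round(similarity[ranking], 3))
```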
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.3. Machine Learning-based approach</title>
          <p>
            Given that the severity of a sentence often correlates with the presence and intensity of specific
depressive symptoms, we adopt a multi-task learning approach to jointly model both aspects. Specifically,
we fine-tune a DepRoBERTa [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] model to simultaneously predict symptom presence (as a 21-dimensional
multi-label output) and estimate severity (as a continuous score in [0, 1]). This joint training not only
allows the two tasks to benefit from shared representations but also encourages the model to capture
subtle linguistic cues that reflect both the type and intensity of depressive expressions. This model
takes an individual sentence as input and produces two outputs: a binary vector indicating the presence
of relevant symptoms, and a scalar severity score.
          </p>
        </sec>
        <sec id="sec-3-1-5">
          <title>Severity Label Generation Using Large Language Models.</title>
          <p>
            To create a reliable dataset for
sentence-level symptom detection and severity estimation, we extended the Task 1 training data, which
includes annotations for symptom relevance but lacks severity labels, by generating severity scores
using a large language model (LLM). For each relevant sentence, we prompted the LLM with the sentence
text and corresponding BDI-II symptom descriptions to assign a severity score in {0, 1, 2, 3} based on the
BDI-II criteria. These scores were then normalized to a continuous scale in [0, 1]. This process leverages
both the clinical structure of BDI-II and the contextual reasoning capabilities of the LLM to provide
consistent and meaningful severity annotations. The resulting dataset, containing both relevance and
severity scores, enables supervised training of a multi-task model while avoiding the need for costly
manual labeling.
          </p>
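<p>A minimal sketch of the label normalization, assuming a simple linear mapping from the discrete BDI-II levels {0, 1, 2, 3} to [0, 1] (the text does not specify the exact scaling, so this is an assumption).</p>

```python
# Hypothetical post-processing of LLM-assigned BDI-II severity labels:
# each sentence receives a discrete score in {0, 1, 2, 3}, normalized to [0, 1].
def normalize_severity(level: int) -> float:
    if level not in (0, 1, 2, 3):
        raise ValueError("BDI-II severity levels are 0-3")
    return level / 3.0

llm_labels = {"I cannot sleep at all anymore.": 3, "I sleep slightly worse.": 1}
normalized = {s: normalize_severity(v) for s, v in llm_labels.items()}
print(normalized)
```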
          <p>
            Architecture. Figure 1 illustrates the architecture of our multi-task fine-tuned DepRoBERTa model
for sentence-level symptom detection and severity estimation. The architecture consists of:
• Shared Backbone: The first 18 layers of the pre-trained DepRoBERTa model are frozen during
training.
• Branch 1 — Symptom Detection: A task-specific branch with 6 transformer layers and a
pooler, followed by a multi-label classification head to predict the relevance of a sentence to 21
depression-related symptoms.
• Branch 2 — Severity Estimation: Another 6-layer branch with a pooler. The pooled vector
from this branch is concatenated with the pooled output from Branch 1, then passed through
a linear mapping layer to combine features. The resulting vector is used both for computing
contrastive loss and as input to a regression MLP head that outputs the severity score in [0, 1].
          </p>
        </sec>
        <sec id="sec-3-1-6">
          <title>Training Strategy.</title>
          <p>We employ a two-phase training procedure:
1. Phase 1: Train the symptom detection branch while freezing the severity estimation branch.
2. Phase 2: Once Branch 1 stabilizes, we freeze it and start training Branch 2.</p>
          <p>
            Loss Functions. The model is optimized using a combination of three loss components:
ℒ_total = ℒ_BCE + ℒ_MSE + λ · ℒ_InfoCL (6)
• ℒ_BCE: Binary cross-entropy loss for multi-label classification.
• ℒ_MSE: Mean squared error for severity score regression.
• ℒ_InfoCL: Contrastive loss (InfoNCE [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]) applied on the pooled sentence embeddings from Branch
2 to improve representation quality.
          </p>
          <p>• λ: A weighting factor that balances the contrastive loss.</p>
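<p>The combined objective above can be illustrated numerically as follows; the toy predictions, the stand-in contrastive loss value, and the weighting factor are assumptions.</p>

```python
import numpy as np

def bce(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy over a multi-label target vector."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

def mse(y_true, y_pred):
    """Mean squared error for the severity regression head."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

y_sym = np.zeros(21); y_sym[[4, 15]] = 1          # gold 21-dim symptom vector
p_sym = np.full(21, 0.05); p_sym[[4, 15]] = 0.9   # predicted probabilities
# 0.25 is a stand-in InfoNCE value; 0.1 is an illustrative weighting factor.
l_total = bce(y_sym, p_sym) + mse([0.67], [0.58]) + 0.1 * 0.25
print(round(l_total, 4))
```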
          <p>
            Sentence Ranking. Given a post or comment P consisting of n sentences, P = {s_1, s_2, . . . , s_n}, each
sentence s_i is passed through the multi-task model:
          </p>
          <p>sym(s_i) = z_i ∈ {0, 1}^21, sev(s_i) = r_i ∈ [0, 1] (7)</p>
          <p>where z_i is a binary symptom vector over the 21 depressive symptoms, and r_i is the predicted severity
score if s_i is relevant.</p>
          <p>For each symptom j ∈ {1, . . . , 21}, we rank all sentences by their predicted probability for symptom j in
descending order and select the top 1,000 sentences. This method directly uses the model’s outputs to
perform sentence ranking and was used in our best-performing configuration (Run 4).</p>
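<p>The ranking step itself reduces to sorting the model's per-symptom scores; a sketch with random probabilities standing in for the multi-task model's outputs:</p>

```python
import numpy as np

# Per-symptom probabilities are sorted directly; shapes are toy stand-ins
# for the real test collection.
rng = np.random.default_rng(0)
n_sentences, n_symptoms, top_k = 5000, 21, 1000
probs = rng.uniform(size=(n_sentences, n_symptoms))  # model scores per sentence

rankings = {}
for j in range(n_symptoms):
    order = np.argsort(-probs[:, j])   # descending by predicted probability
    rankings[j] = order[:top_k]        # top 1,000 sentence indices for symptom j
print(len(rankings[0]), probs[rankings[0][0], 0])
```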
        </sec>
        <sec id="sec-3-1-7">
          <title>3.1.4. Submitted Configurations</title>
          <p>We submitted five configurations for Task 1, described as follows:
Run 0: Similarity. Semantic similarity-based approach using the original
nomic-embed-text-v1.5 model without contrastive learning. K-means clustering was applied with
K = 11 to form symptom-specific clusters.</p>
          <p>Run 1: Ensemble Similarity. An ensemble of three cluster-based similarity runs: (i)
nomic-embed-text-v1.5 with K = 5, (ii) nomic-embed-text-v1.5 with K = 11, and (iii)
modernbert-embed-base with K = 11.</p>
          <p>Run 2: Contrastive Learning. Similar to Run 0 but using contrastively fine-tuned
nomic-embed-text-v1.5 embeddings. Embeddings were projected to a 128-dimensional space
and trained using InfoNCE loss to improve symptom-level semantic separation.</p>
          <p>Run 3: Ensemble Contrastive Learning. An ensemble combining Run 1 and Run 2, leveraging
both diverse embedding sources and contrastive learning enhanced representations for more robust
similarity ranking.</p>
          <p>Run 4: Machine Learning. A machine learning-based approach using output scores from the
fine-tuned multi-task model described in Section 3.1.3. This model directly predicts symptom relevance
and severity, and its scores are used to rank the sentences.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 2: Contextualized Early Detection of Depression</title>
        <p>Our pipeline consists of three stages:
1. Sentence-level symptom detection and severity estimation: Each sentence is analyzed to
identify the presence of depressive symptoms and to assign a fine-grained severity score. This
stage uses a multi-task model, which is described in Section 3.1.3.
2. Post-level depression scoring: Relevant sentence representations and their associated severity
scores are aggregated to compute a depression score for each post or comment.
3. User-level depression estimation: Finally, a set of rule-based heuristics is applied to combine
post-level scores across the conversation tree, yielding a final depression score for the target user.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Pre-processing</title>
          <p>We adopt the dataset released by the organizers, which filters out noisy or incomplete threads. To
preserve relevant contextual information, each conversation tree is pruned to keep only the branches that:
• lead to the target user (i.e., ancestor nodes),
• or are direct responses from the target user (i.e., child nodes).</p>
          <p>In cases where parent nodes are missing, dummy nodes are inserted to maintain the tree structure and
avoid losing conversation branches.</p>
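<p>The pruning rules above can be sketched on a toy conversation tree; the child-to-parent dictionary encoding and the author map are assumptions made for this illustration.</p>

```python
# Toy conversation tree encoded as child -> parent (None marks the root).
parent = {"root": None, "a": "root", "target1": "a", "b": "target1",
          "c": "root", "d": "c"}
author = {"root": "u1", "a": "u2", "target1": "target", "b": "u3",
          "c": "u4", "d": "u5"}

def prune(parent, author, target="target"):
    """Keep ancestors of the target user's posts plus their direct replies."""
    keep = set()
    for node, who in author.items():
        if who == target:
            keep.add(node)
            # Walk the ancestor chain up to the root.
            cur = parent[node]
            while cur is not None:
                keep.add(cur)
                cur = parent[cur]
            # Direct responses to the target user's post.
            keep.update(ch for ch, par in parent.items() if par == node)
    return keep

print(sorted(prune(parent, author)))
```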
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Post-level Depression Scoring</title>
          <p>
            In this stage, we aggregate sentence-level information to estimate a depression score for each post or
comment. Let {s_i}_{i=1}^n denote the set of relevant sentences extracted from a given post, each associated
with a severity score r_i ∈ [0, 1].
          </p>
          <p>Each sentence s_i is encoded using the fine-tuned DepRoBERTa model, taking the sentence
representation from the Pooler layer of Branch 2:</p>
          <p>h_i = Pooler(DepRoBERTa(s_i)) (8)</p>
          <p>The sequence of sentence embeddings {h_i}_{i=1}^n is passed through a bidirectional LSTM to capture
contextual dependencies among sentences:</p>
          <p>h_text = BiLSTM_text({h_i}_{i=1}^n) (9)</p>
          <p>Similarly, the sequence of scalar severity scores {r_i}_{i=1}^n is fed into a separate BiLSTM to capture the
temporal structure and progression of severity:</p>
          <p>h_sev = BiLSTM_sev({r_i}_{i=1}^n) (10)</p>
          <p>The final representation is obtained by concatenating the textual and severity embeddings, followed
by a multi-layer perceptron (MLP) with a sigmoid activation to produce the depression score ŷ ∈ [0, 1]:</p>
          <p>ŷ = σ(MLP([h_text; h_sev])) (11)</p>
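<p>A minimal sketch of the post-level scorer's interface; for brevity the two BiLSTM encoders are replaced by simple pooling statistics and the MLP is reduced to one random linear layer, so only the input/output shapes and the sigmoid output range follow the described architecture.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_sent, dim = 6, 16
h = rng.normal(size=(n_sent, dim))   # sentence embeddings from the Pooler layer
r = rng.uniform(size=n_sent)         # sentence-level severity scores

# Stand-ins for the two BiLSTM encoders: mean pooling over embeddings,
# summary statistics over the severity sequence.
h_text = h.mean(axis=0)
h_sev = np.array([r.mean(), r.max(), r.min(), r.std()])

# Random linear layer + sigmoid standing in for the MLP head.
w = rng.normal(size=dim + 4)
score = float(sigmoid(w @ np.concatenate([h_text, h_sev])))
print(round(score, 3))
```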
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. User-level Depression Prediction</title>
          <p>
            In the final stage, we aggregate post-level severity scores to produce a depression prediction for the
target user. Given the severity scores of relevant posts across a conversation tree, we implement multiple
rule-based configurations to explore different aggregation strategies. Each configuration defines specific
rules for decision making.
          </p>
          <p>Run 0: Target Node Only, Max Score. We consider only the posts authored by the target user
(target nodes) and take the maximum severity score as the final prediction:</p>
          <p>ŷ = max_{p ∈ target} score(p) (12)</p>
          <p>where:
• target is the set of posts authored by the target user in the current conversation;
• score(p) denotes the predicted severity score (in [0, 1]) for post p.</p>
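<p>The maximum-based aggregation rules of this and the following runs can be sketched as follows; the threshold and bonus values are illustrative, not the submitted hyperparameters.</p>

```python
# Sketch of the rule-based aggregation for Runs 0-2.
def run0(current_scores):
    """Run 0: max severity over target posts in the current conversation."""
    return max(current_scores)

def run1(current_scores, history_scores):
    """Run 1: also accumulate the target user's historical posts."""
    return max(list(current_scores) + list(history_scores))

def run2(current_scores, history_scores, tau_high=0.7, bonus=0.1):
    """Run 2: add a bonus when the accumulated maximum exceeds tau_high."""
    m = run1(current_scores, history_scores)
    score = m + (bonus if m > tau_high else 0.0)
    return min(score, 1.0)  # assumption: clip to keep the score in [0, 1]

cur, hist = [0.42, 0.55], [0.75]
print(run0(cur), run1(cur, hist), run2(cur, hist))
```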
        </sec>
        <sec id="sec-3-2-4">
          <title>Run 1: Temporal Accumulation, Target Nodes Only</title>
          <p>We include historical posts of the target user and again use the maximum score across all such posts:</p>
          <p>ŷ = max_{p ∈ ht ∪ ct} score(p) (13)</p>
          <p>where:
• ct is the set of posts authored by the target user in the current conversation;
• ht is the set of posts authored by the same user in previous conversations;
• score(p) denotes the predicted severity score (in [0, 1]) for post p.</p>
          <p>Run 2: Temporal Accumulation with Bonus. Similar to Run 1, but we add a bonus score if the
maximum severity exceeds a high threshold τ_high. The final score is computed as:</p>
          <p>ŷ = max_{p ∈ ht ∪ ct} score(p) + bonus · 1[ max_{p ∈ ht ∪ ct} score(p) &gt; τ_high ] (14)</p>
          <p>where:
• τ_high is the high depression score threshold;
• bonus is the bonus term added to the final score when the threshold condition is met;
• 1[·] is the indicator function that returns 1 if the condition inside is true, otherwise 0.</p>
        </sec>
        <sec id="sec-3-2-5">
          <title>Run 3: Temporal Accumulation with Neighbor-based Uncertainty Handling</title>
          <p>We consider both current and historical posts from the target user. For each post p ∈ target, if its
severity score falls within an uncertainty range [τ_low, τ_high], we apply a neighbor-based adjustment
using its parent and root scores. The adjusted score s′_p is defined as:</p>
          <p>s′_p = (1 − α − β) · s_p + α · parent(p) + β · root if τ_low &lt; s_p ≤ τ_high; s′_p = s_p otherwise (15)</p>
          <p>The final decision score is the maximum of all adjusted scores:</p>
          <p>ŷ = max_{p ∈ target} (s′_p) (16)</p>
          <p>where:
• s_p is the original severity score of post p;
• parent(p) is the score of the parent node of p in the conversation tree;
• root is the score of the root node of the conversation;
• τ_low and τ_high are the low and high depression score thresholds;
• α is the weight for the parent node influence;
• β is the weight for the root node influence.</p>
        </sec>
        <sec id="sec-3-2-6">
          <title>Run 4: Temporal Accumulation with Community-based Adjustment</title>
          <p>We accumulate both current and historical posts from the target user. For each target post p ∈ target,
we consider all posts in the conversation branch from the root node to p, excluding p itself:</p>
          <p>N_p = {q ∈ Branch(root, p) | q ≠ p} (17)</p>
          <p>Let s_p be the original severity score of p, and s̄_p be the average severity score of the community:</p>
          <p>s̄_p = (1 / |N_p|) · Σ_{q ∈ N_p} s_q (18)</p>
          <p>The score is then adjusted relative to the community average, receiving a bonus when it clearly
exceeds it and a penalty in the symmetric case:</p>
          <p>s′_p = s_p + bonus · (s_p − s̄_p) if s_p &gt; max(τ_high, s̄_p); s′_p = s_p − penalty · (s̄_p − s_p) if s_p &lt; min(τ_low, s̄_p); s′_p = s_p otherwise (19)</p>
          <p>The final prediction score is the maximum over all adjusted scores:</p>
          <p>ŷ = max_{p ∈ target} (s′_p) (20)</p>
          <p>where:
• s_p is the original severity score of post p;
• τ_high and τ_low are the high and low depression score thresholds;
• bonus and penalty are the bonus term added to, and the penalty term subtracted from, the score
when the corresponding threshold condition is met.</p>
          <p>Each run serves as a configuration of the decision logic and can be evaluated independently to assess
the robustness of rule-based aggregation methods over tree-structured social media conversations.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Results &amp; Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Task 1: Search for Symptoms of Depression</title>
        <p>A total of 67 runs from participants were submitted for this task. In Table 2, we present the
ranking-based evaluation results for Task 1 (majority setting), comparing the best configuration from each
participating team. Our submission achieved the second-best performance in both NDCG and AP, while
also maintaining strong results across R-PREC and P@10. This demonstrates that our approach offers a
well-balanced trade-off between ranking quality and precision.</p>
        <p>Table 3 shows the ranking-based performance of our system under the majority and unanimity voting
schemes. Among our runs, the machine learning configuration consistently achieves the best results, notably with
an NDCG of 0.623 (majority) and 0.577 (unanimity), highlighting its effectiveness. Similarity-based
approaches also perform reasonably well, with slight improvements when ensembling is applied. In
contrast, contrastive learning methods underperform across all metrics, suggesting they may not be
well-suited for this task without further tuning.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task 2: Contextualized Early Detection of Depression</title>
        <p>We report the results of our models on the public leaderboard in Table 4. Among our runs, Run 2:
Temporal Accumulation with Bonus consistently yields the best performance, with a latency-weighted F1 of 0.68
and an F1 of 0.73, demonstrating the benefit of incorporating historical context and severity-based reward.</p>
        <p>A total of 50 runs from 12 participants were submitted for this task. In Table 5, we present the
decision-based evaluation results for Task 2, comparing the best configuration from each participating
team. Our submission achieved the fourth-best performance in both F1 and latency-weighted F1.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this report, we present our approaches for both Task 1 and Task 2 of the eRisk 2025 challenge,
focusing on early detection and symptom identification of depression from social media posts. For
Task 1, we explored various ranking-based methods based on two approaches: (i) semantic
similarity-based methods that cluster sentence embeddings and rank by proximity to symptom centroids, and (ii)
machine learning-based methods that directly use the output scores from the multi-task model. Among
these, the second approach achieved the best performance across evaluation metrics, demonstrating the
effectiveness of our fine-tuned multi-task model for sentence-level symptom detection.</p>
      <p>For Task 2, we designed several temporal aggregation strategies to detect early warning signs of
depression. These configurations leverage both current and historical user data, with enhancements such
as uncertainty handling and community-based score adjustment. The most effective setup integrated
severity scoring with threshold-based boosting, resulting in competitive latency-aware performance.</p>
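        <p>The accumulation-with-bonus idea can be sketched as follows. The decay factor, bonus value, and both thresholds are hypothetical parameters chosen for illustration; the submitted system's exact values and update rule may differ.</p>

```python
# Illustrative sketch (hypothetical parameters, not the submitted system):
# accumulate per-post severity over time with decay, add a bonus once the
# running score crosses a severity threshold, and flag the user early when
# the total exceeds a decision threshold.
def detect_risk(post_scores, decay=0.9, bonus=0.5,
                severity_threshold=1.5, decision_threshold=2.0):
    risk = 0.0
    for t, score in enumerate(post_scores):
        risk = decay * risk + score          # temporal accumulation
        if risk > severity_threshold:        # severity-based reward
            risk += bonus
        if risk > decision_threshold:
            return True, t                   # early positive decision
    return False, len(post_scores)

# Toy stream of per-post severity scores: flagged at the third post.
flag, when = detect_risk([0.2, 0.8, 1.1, 0.9])
```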
      <p>Across both tasks, the fine-tuned multi-task model played a central role: it drove sentence ranking in
Task 1 and provided representations while filtering out irrelevant content in Task 2, contributing to
improved robustness and precision. Overall, our approaches highlight the effectiveness of combining
fine-tuned language models with task-specific heuristics and temporal context for early detection of
mental health risks.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>In the preparation of this report, we only used Grammarly and ChatGPT for spell/grammar checking
and improving the readability of the manuscript. No part of the content, analyses, or results was
generated by AI tools. All methodological design, implementation, experiments, and interpretations
were conducted solely by the authors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2025: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 16th International Conference of the CLEF Association, CLEF 2025, Madrid, Spain, September 9-12, 2025, Proceedings, Part II</source>
          , volume to be published of Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2025: Early risk prediction on the internet (extended overview)</article-title>
          , in:
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2025), Madrid, Spain, 9-12 September, 2025</source>
          , volume to be published of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <article-title>Beck depression inventory-II</article-title>
          ,
          <source>Psychological assessment</source>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Ang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Gollapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-K.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>NUS-IDS@eRisk2024: Ranking sentences for depression symptoms using early maladaptive schemas and ensembles</article-title>
          , Working Notes of CLEF (
          <year>2024</year>
          )
          <fpage>9</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>GPT-4 technical report</article-title>
          , arXiv preprint arXiv:2303.08774 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <article-title>Cognitive therapy for personality disorders: A schema-focused approach</article-title>
          , Professional Resource Press/Professional Resource Exchange,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>eRisk 2022: pathological gambling, depression, and eating disorder challenges</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>436</fpage>
          -
          <lpage>442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lijin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sruthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <article-title>NLP-IISERB@eRisk2022: Exploring the potential of bag of words, document embeddings and transformer based framework for early prediction of eating disorder, depression and pathological gambling over social media</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>972</fpage>
          -
          <lpage>986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Poświata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perełkiewicz</surname>
          </string-name>
          ,
          <article-title>OPI@LT-EDI-ACL2022: Detecting signs of depression from social media text using RoBERTa pre-trained language models</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>276</fpage>
          -
          <lpage>282</lpage>
          . URL: https://aclanthology.org/2022.ltedi-1.40. doi:10.18653/v1/2022.ltedi-1.40.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>A test collection for research on depression and language use</article-title>
          , volume
          <volume>9822</volume>
          ,
          <year>2016</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>39</lpage>
          . doi:10.1007/978-3-319-44564-9_3.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nussbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Duderstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mulyar</surname>
          </string-name>
          ,
          <article-title>Nomic embed: Training a reproducible long context text embedder</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2402.01613. arXiv:2402.01613.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>van den Oord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <article-title>Representation learning with contrastive predictive coding</article-title>
          , arXiv preprint arXiv:1807.03748 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>