<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Contextualized Early Detection of Depression - Hybrid and Time-Aware Approaches: HU at eRisk Task 2 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Saad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asad Ullah Chaudhry</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meesum Abbas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faisal Alvi</string-name>
          <email>faisal.alvi@sse.habib.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdul Samad</string-name>
          <email>abdul.samad@sse.habib.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dhanani School of Science and Engineering, Habib University</institution>
          ,
          <addr-line>Karachi</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The early detection of depression in online conversational threads remains a pivotal challenge in computational mental health, particularly under the real-time, context-aware requirements of CLEF eRisk 2025 Task 2. We propose a multifaceted study evaluating five innovative approaches, including transformer-based models (e.g., ModernBERT with time-aware loss), Classification with Partial Information and Decision-Making Component (CPI+DMC) frameworks enhanced by Llama 3.1, data augmentation strategies, simple threshold policies, and zero-shot learning with Llama-4-Scout-17B. Leveraging the eRisk dataset, our methodologies integrate recent advances in time-aware training and hybrid ensembles, addressing the trade-offs between classification accuracy, earliness, and computational efficiency. Our results demonstrate that the CPI+DMC approach achieves a best F1 score of 0.75, securing a 3rd place ranking among 12 teams, with a competitive NDCG@100 of 0.62 for early detection and an ERDE50 of 0.05, highlighting its effectiveness in balancing accuracy and latency. These findings offer valuable insights into real-time mental health monitoring and underscore the potential for future research to refine decision policies and enhance long-term ranking stability.</p>
      </abstract>
      <kwd-group>
        <kwd>CLEF eRisk</kwd>
        <kwd>Early Depression Detection</kwd>
        <kwd>Conversational Context Analysis</kwd>
        <kwd>Transformer-based Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>
        Early depression detection from text initially used recurrent neural networks (RNNs) and convolutional
neural networks (CNNs). Trotzek et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] combined CNNs with linguistic metadata, like sentiment
and syntactic complexity, to improve classification by capturing semantic patterns and psychological
insights. Data sparsity and class imbalance necessitated ensemble techniques. Transformer-based
models, particularly BERT, advanced the field. Martínez-Castaño et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] showed fine-tuned BERT
outperformed RNNs and CNNs in early detection, while Devaguptam et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] noted DeBERTa’s slight
edge due to better contextual embeddings. These models require threshold tuning to balance sensitivity
and false positives, with computational costs limiting real-time use.
      </p>
      <p>
        Recent work explored large language models (LLMs) like GPT-3.5 and LLaMA2. Munir et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
achieved near-perfect accuracy with fine-tuned LLMs on social media text, surpassing BERT and
DeBERTa. Hadzic et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] reported GPT-4’s 81% precision and 71% F1 score on conversational
data, outperforming BERT. LLMs’ computational demands and lack of early decision mechanisms
require external policies. Zhang &amp; Poellabauer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduced contextual position encoding (CoPE) for
multimodal detection, effective for clinical data but challenging for social media.
      </p>
      <p>
        Thompson &amp; Errecalde [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used time-aware training with timestamp indicators and an ERDE loss
function, achieving top ERDE50 scores. Loyola et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed a Classification-Prediction &amp; Decision
(CPI+DMC) approach, with a transformer outputting probabilities and a separate alert module, reducing
false positives via adaptive policies. The CPI+DMC approach treats early risk detection as a
multiobjective problem, balancing precision through a classification component (CPI) and speed via a
decision-making component (DMC) that determines the optimal moment for issuing alerts based on
prediction history. This method has shown robustness across eRisk challenges by allowing independent
optimization of classification accuracy and timely decision-making. Gui et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] applied reinforcement
learning to select relevant text, improving precision by 14.6%, though requiring extensive data.
      </p>
      <p>
        Evaluation metrics like ERDE [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] penalize late detections but face interpretability issues. Sadeque et
al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposed a latency-weighted F1 score (Flatency) for clarity. Ranking metrics like Precision@10
and NDCG assess real-time prioritization. The UNSL team at eRisk 2024 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] combined a BERT-based
CPI+DMC model with a time-aware transformer for top performance. Future work may integrate RL,
time-aware LLMs, and multimodal data. Transformers have replaced CNNs and RNNs, but
computational costs, data imbalance, and interpretability remain challenges, with eRisk 2025 emphasizing
conversational context for hybrid model advances.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>
        For eRisk 2025 Task 2, we developed five approaches 1 to detect depression early in Reddit conversational
threads, analyzing posts and comments sequentially [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Common preprocessing includes lowercase
conversion, URL replacement with “[URL]”, HTML/Unicode artifact removal, and truncation to 2048
tokens (except Run 4) to balance context and efficiency [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. A hashmap tracks user IDs across rounds,
and tiered thresholds optimize decision-making [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. To visualize the pipeline of our five approaches,
including preprocessing steps and methodological components, refer to Figure 1. Below, we detail each
approach (Runs 0–4) with enumerated steps and key methodological insights.
      </p>
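        <p>As a concrete illustration, the shared preprocessing can be sketched as below; the function name and regular expression are our own simplification (a whitespace split stands in for the actual tokenizer), not the exact pipeline code.</p>

```python
import html
import re
import unicodedata

def preprocess(text, max_tokens=2048):
    """Normalize a post roughly as described: decode HTML entities,
    fold Unicode artifacts, lowercase, mask URLs, truncate."""
    text = html.unescape(text)                      # HTML artifact removal
    text = unicodedata.normalize("NFKC", text)      # Unicode artifact folding
    text = text.lower()                             # lowercase conversion
    text = re.sub(r"https?://\S+", "[URL]", text)   # URL replacement
    tokens = text.split()                           # whitespace proxy for the tokenizer
    return " ".join(tokens[:max_tokens])            # truncate to the token budget

print(preprocess("Check https://example.com NOW"))  # check [URL] now
```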
      <sec id="sec-3-1">
        <title>3.1. ModernBERT with Time-Aware Loss and Class Weighting (Run 0)</title>
        <p>
          ModernBERT’s 8192-token capacity enables processing of long conversational sequences, unlike BERT’s
512-token limit, making it ideal for capturing Reddit thread context. However, due to hardware
constraints, we truncate inputs to 2048 tokens. In addition, we incorporate a time-aware loss to prioritize
early detection, using ERDE50 to penalize late predictions [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
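        <p>For intuition, the ERDE_o cost that such a time-aware loss penalizes can be sketched in a few lines, following the definition of Losada and Crestani [11]; the constants and function name here are illustrative, not our training code.</p>

```python
import math

def erde_cost(decision, truth, delay, o=50, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """ERDE_o cost for one user: false positives cost c_fp, false
    negatives cost c_fn, and true positives incur a smooth latency
    penalty that grows once the delay exceeds o."""
    if decision and not truth:
        return c_fp
    if truth and not decision:
        return c_fn
    if decision and truth:
        latency_penalty = 1.0 - 1.0 / (1.0 + math.exp(delay - o))
        return latency_penalty * c_tp
    return 0.0

# A true positive at delay 5 is nearly free; at delay 100 it costs almost c_tp.
print(round(erde_cost(True, True, 5), 4), round(erde_cost(True, True, 100), 4))
```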
        <sec id="sec-3-1-1">
          <title>1Source code available at https://github.com/meesuma5/erisk_task2.</title>
          <p>[Figure 1: Pipeline overview of the five runs. Run 0: preprocessing to 2048 tokens; ModernBERT; ERDE50 loss with class weighting; concatenated posts with probability output; tiered thresholds (0.9–0.7). Run 1: Llama 3.1 summaries (200 words); BERT-base; cross-entropy; summarize then BERT probability; 5-prediction window, mean &gt; 0.7. Run 2: nlpaug augmentation (281 to 3936 samples); ModernBERT; cross-entropy with AdamW; concatenated posts with probability output; 5-prediction window, mean &gt; 0.7. Run 3: reuses Run 2’s augmented data and model; no training; concatenated posts with probability output; simple threshold (&gt; 0.5). Run 4: JSON extraction and preprocessing; Llama-4-Scout-17B zero-shot; no training; JSON prompt; hybrid rule (&gt; 0.9 or mean &gt; 0.7).]</p>
          <p>1. Data Preparation: Extract &lt;TEXT&gt; or &lt;TITLE&gt; from each user’s posts in the training dataset,
concatenate posts as [CLS] post1 [SEP] post2 [SEP] ..., preprocess with lowercase and
URL replacement.
2. Model: Fine-tune ModernBERT-base with CrossEntropy and ERDE50 loss, applying class
weighting to address 10% positive class imbalance.
3. Training: Split data 80:20 (train:validation), train for 5 epochs, learning rate 2e-5, achieving
validation F1-macro of 0.66.
4. Inference: Process posts with context as ([CLS] post [CONTEXT] comment1 [SEP] ...),
or if the target user is the author of a comment process it as ([CLS] comment of target
user (1) [CONTEXT] parent of 1 (2) [SEP] parent of
2 (3) [SEP] ...), and then output depression probability.
5. Decision: Apply tiered thresholds on mean scores (0.9 for 1 post, 0.85 for 2, 0.8 for 3, 0.75 for 4
and 0.7 for 5+ posts) using hashmap for user tracking. Tiered thresholds were used to balance
precision and speed by requiring higher confidence for early predictions with limited posts,
while allowing lower thresholds as more posts provide greater context, thus optimizing early risk
detection performance.</p>
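          <p>The tiered decision policy in step 5 can be sketched as follows; a plain dictionary stands in for the per-user hashmap, and all names are illustrative.</p>

```python
# Tiered thresholds from step 5: higher confidence is required when
# fewer posts have been seen for a user. Illustrative sketch.
THRESHOLDS = {1: 0.90, 2: 0.85, 3: 0.80, 4: 0.75}  # 5+ posts fall back to 0.70

def should_alert(scores):
    """scores: depression probabilities for one user's posts so far."""
    n = len(scores)
    if n == 0:
        return False
    threshold = THRESHOLDS.get(n, 0.70)
    mean_score = sum(scores) / n
    return mean_score > threshold

print(should_alert([0.8]))                           # one post needs mean above 0.90
print(should_alert([0.8, 0.75, 0.72, 0.71, 0.68]))   # 5 posts, threshold 0.70
```

In practice the per-user score lists would be kept in a dictionary keyed by user ID, mirroring the hashmap used for user tracking across rounds.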
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Summarization-Enhanced CPI+DMC with BERT (Run 1)</title>
        <p>
          This approach addresses the domain mismatch between training (isolated writings) and test
(conversational threads) data by using Llama 3.1 to summarize texts, preserving emotional cues. The CPI+DMC
framework balances classification accuracy with timely decisions [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>1. Data Preparation: For each user in the training set, all available posts are concatenated to form
a composite document. This aggregated text is then summarized to a maximum of 200 words
using Llama 3.1 with prompts designed to retain emotional depth and salient personal context.
The prompts used for summarization are as follows:
System Prompt:</p>
        <p>You are a focused, analytical summarizer. Your role is to extract and condense content
into a concise summary that captures the emotional state, notable life events, and
communication style expressed in the input text. Your output must be a standalone
summary of no more than 200 words, with no additional commentary or introductory
phrases.</p>
        <p>User Prompt:</p>
        <p>You have been provided with a text for analysis. Summarize the text into a concise
summary that focuses on the emotional state and notable life events of the person in
it (if any). Include any signs of sadness, depression, or concerning words or phrases
you find in the text. Ensure the summary is no longer than 200 words and contains
only the summary content. Always answer starting with "The user..."</p>
        <p>
          Text: {text}
Token length of model output was capped to 500 tokens in line with BERT’s input token limit of
512 tokens. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
2. Model: Fine-tune BERT-base-uncased (512-token input, 12 layers, 110M parameters) with linear
layer (768-to-1) and sigmoid for probability [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. BERT was chosen in line with UNSL 2024’s
choices and their obtained results [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
3. Training: Use 80:20 split, 3 epochs, Adam (lr 2e-5), binary cross-entropy; validation precision
0.977 (non-depressed), 0.418 (depressed).
4. Inference: Summarize threads with Llama 3.1, classify with BERT to output probability.
5. Decision: DMC uses 5-prediction sliding window, triggers alert if mean score &gt; 0.7 after 10
predictions.
        </p>
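        <p>A minimal sketch of this DMC, assuming per-user state kept in dictionaries (the class and attribute names are our illustration, not the submitted code):</p>

```python
from collections import defaultdict, deque

class DMC:
    """Decision-making component from step 5: a 5-prediction sliding
    window per user; alert once at least 10 predictions have been seen
    and the window mean exceeds 0.7."""
    def __init__(self, window=5, min_preds=10, threshold=0.7):
        self.window = defaultdict(lambda: deque(maxlen=window))
        self.count = defaultdict(int)
        self.min_preds = min_preds
        self.threshold = threshold

    def update(self, user_id, prob):
        """Record one CPI probability; return True when an alert fires."""
        self.window[user_id].append(prob)
        self.count[user_id] += 1
        w = self.window[user_id]
        if self.count[user_id] >= self.min_preds:
            return sum(w) / len(w) > self.threshold
        return False

dmc = DMC()
for _ in range(9):
    dmc.update("u1", 0.9)      # too few predictions: no alert yet
print(dmc.update("u1", 0.9))   # 10th prediction, window mean 0.9: True
```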
      </sec>
      <sec id="sec-3-3">
        <title>3.3. ModernBERT with Data Augmentation and DMC (Run 2)</title>
        <p>
          To mitigate class imbalance (312 depressed vs. 2824 non-depressed users), we augment positive samples
using multiple techniques, enhancing ModernBERT’s robustness. The DMC ensures controlled alert
triggering [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          1. Data Preparation: Parse XML, concatenate writings, augment 281 positive samples with nlpaug
(BackTranslation: English→French/German→English, SynonymAug, ContextualWordEmbsAug)
to 3936 samples [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
2. Model: Fine-tune ModernBERT-base (2048-token input) for sequence classification.
3. Training: Use 90:10 split, 5 epochs, AdamW (lr 2e-5), CrossEntropy loss; validation F1-macro
0.771.
4. Inference: Concatenate post title/body/comments, preprocess, tokenize, output probability.
5. Decision: DMC with 5-prediction window, alert if mean score &gt; 0.7 after 5 writings.
        </p>
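        <p>The oversampling logic behind step 1 can be sketched as below; the two toy augmenters stand in for the nlpaug back-translation, synonym, and contextual-embedding augmenters, and the multiplier is illustrative rather than the exact 281-to-3936 expansion.</p>

```python
import random

# Stand-ins for the nlpaug augmenters named above; real augmenters
# (BackTranslationAug, SynonymAug, ContextualWordEmbsAug) would be
# used in practice. Purely illustrative.
def synonym_swap(text):
    return text.replace("sad", "unhappy")

def shuffle_clauses(text):
    parts = text.split(". ")
    random.shuffle(parts)
    return ". ".join(parts)

def oversample_positives(positives, augmenters, passes=2):
    """Grow the positive class by applying each augmenter `passes` times."""
    out = list(positives)
    for _ in range(passes):
        for fn in augmenters:
            out.extend(fn(t) for t in positives)
    return out

pos = ["i feel sad today"] * 281
grown = oversample_positives(pos, [synonym_swap, shuffle_clauses], passes=3)
print(len(grown))   # 281 * (1 + 2 augmenters * 3 passes) = 1967
```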
      </sec>
      <sec id="sec-3-4">
        <title>3.4. ModernBERT with Simple Decision Threshold (Run 3)</title>
        <p>
          This approach simplifies decision-making to evaluate minimal DMC impact, reusing Run 2’s model and
data for efficiency [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>1. Data Preparation: Reuse Run 2’s augmented dataset (3936 samples) and preprocessing pipeline.
2. Model: Reuse Run 2’s fine-tuned ModernBERT-base.
3. Training: No additional training; leverage Run 2’s model.
4. Inference: Same as Run 2. Concatenate post title/body/comments, preprocess, tokenize, output
probability.</p>
        <p>5. Decision: Trigger alert if current probability &gt; 0.5, no sliding window.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Zero-Shot with Hybrid Decision Rule (Run 4)</title>
        <p>
          Using Cloudflare Workers AI Llama-4-Scout-17B (17B parameters, 16 experts), this approach avoids
training, relying on zero-shot inference for low-latency detection. A hybrid decision rule balances
immediate and trend-based alerts [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>System Prompt:</p>
        <p>Respond with a valid JSON object containing a ‘score’ field (float, 0-1). Do not include
any additional text.</p>
        <p>User Prompt:</p>
        <p>You are a mental health expert specializing in the detection of depression through text
analysis. Your task is to evaluate the following text written by {target_subject} for
signs of depression. Look for indicators such as expressions of sadness, hopelessness,
worthlessness, guilt, fatigue, loss of interest in activities, social withdrawal, or suicidal
thoughts. Consider the tone, word choice, and emotional context of the text. Assign a
depression likelihood score as a float between 0 and 1, where 0 indicates no signs of
depression and 1 indicates a very high likelihood of depression. Ensure your assessment
is balanced and avoids overgeneralization.</p>
        <p>Text to analyze is {target_text}
Return a JSON object with a single field "score" representing the likelihood of
depression as a float between 0 and 1. Do not include any additional text or explanations
outside the JSON object. For example: {"score": 0.5}</p>
        <p>3. Training: None; rely on the pretrained model. 4. Inference: Query the model with a JSON prompt for a score (0–1), defaulting to 0.5 if the output is invalid. 5. Decision: Use a 5-score window; alert if the current score &gt; 0.9 or the mean &gt; 0.7.</p>
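      <p>Steps 4 and 5 of this run can be sketched as below; the parsing fallback and hybrid rule follow the description above, while the function names are our illustration.</p>

```python
import json

def parse_score(raw):
    """Extract the 'score' field from the model's JSON reply,
    falling back to 0.5 on malformed output (step 4)."""
    try:
        score = float(json.loads(raw)["score"])
    except (ValueError, KeyError, TypeError):
        return 0.5
    return min(max(score, 0.0), 1.0)   # clamp to the expected 0-1 range

def hybrid_alert(history, score, window=5):
    """Step 5: alert on one very high score, or on a high window mean."""
    history.append(score)
    recent = history[-window:]
    return score > 0.9 or sum(recent) / len(recent) > 0.7

print(parse_score('{"score": 0.85}'))   # 0.85
print(parse_score("not json"))          # 0.5
print(hybrid_alert([], 0.95))           # True: single score above 0.9
```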
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>
        We submitted five distinct runs to address Task 2 of the eRisk 2025 challenge, focusing on the early
detection of depression through contextualized analysis of Reddit discussion threads. Each run
corresponds to a specific methodological approach designed to balance classification accuracy and decision
latency, ranging from transformer-based models with time-aware training to zero-shot learning with
large language models. Table 1 summarizes the mapping of our approaches to the official run IDs,
providing a comprehensive overview of the evaluated strategies. Based on the results released [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we achieved a commendable 3rd place ranking out of 12 participating teams, underscoring the
effectiveness of our hybrid methodologies in this competitive setting.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Decision-Based Evaluation</title>
        <p>
          The decision-based evaluation metrics, derived from the eRisk 2025 results [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], quantify the
performance of our runs in terms of classification accuracy (Precision, Recall, F1), earliness (ERDE5,
ERDE50, latencyTP), speed, and latency-weighted performance (Flatency). Table 2 presents these metrics
for the top three teams—HIT-SCIR, ELiRF-UPV, and us—highlighting the competitive landscape and our
standing.
        </p>
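        <p>For reference, the latency-weighted F1 of Sadeque et al. [12] scales F1 by a speed factor derived from the median delay over true positives; this sketch assumes the growth-rate constant p = 0.0078 commonly used in eRisk evaluations.</p>

```python
import math
import statistics

def latency_penalty(k, p=0.0078):
    """Penalty for a true positive issued after k user writings."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))

def f_latency(f1, tp_delays, p=0.0078):
    """Latency-weighted F1: speed = 1 - median penalty over true positives."""
    speed = 1.0 - latency_penalty(statistics.median(tp_delays), p)
    return f1 * speed

print(round(latency_penalty(1), 4))          # deciding on the first writing costs 0
print(round(f_latency(0.75, [1, 1, 1]), 2))  # speed 1.0 keeps F1 intact: 0.75
```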
        <p>Run 1 from our team, integrating Classification with Partial Information (CPI) and a Decision-Making
Component (DMC) with Llama 3.1 summarization and BERT-base-uncased classification, achieved an
F1 score of 0.75, securing our 3rd place ranking. This score reflects a balanced precision (0.72) and
recall (0.77), demonstrating robust classification across diverse conversational contexts. The ERDE
metrics (ERDE5 = 0.10, ERDE50 = 0.05) indicate competitive earliness, though they are outperformed by
HIT-SCIR’s best runs (ERDE50 = 0.03 across all runs), which also achieved the highest F1 of 0.85 in Run
4. ELiRF-UPV’s Run 0, with an F1 of 0.79, shows a strong precision-recall balance (0.78 and 0.81), but its
earliness (ERDE50 = 0.04) is slightly less optimal than HIT-SCIR’s. Run 0 from our team, employing
ModernBERT with time-aware loss, achieved an F1 of 0.68 and an ERDE50 of 0.05, closely aligning with
Run 1’s earliness but with reduced accuracy. Runs 2 and 3, despite high recall (0.94 and 1.00), suffer from
low precision (0.14 and 0.11), reflecting over-prediction issues, with Run 3’s simple threshold yielding
the fastest latencyTP (1.00) and perfect speed (1.00) at the cost of Flatency (0.20). Run 4, leveraging a
zero-shot Cloudflare Workers AI model (Llama-4-Scout-17B), achieved a moderate F1 of 0.41 with a
recall of 0.88, but its ERDE50 of 0.07 suggests reasonable earliness, albeit with lower precision (0.27).</p>
        <p>Compared to the top teams, our best performance (Run 1) trails HIT-SCIR’s peak F1 (0.85) and
ELiRF-UPV’s Run 0 (0.79), but our Flatency of 0.72 is competitive with HIT-SCIR’s 0.82, indicating a strong
latency-weighted performance. The superior earliness of HIT-SCIR (ERDE50 = 0.03) and ELiRF-UPV’s
Run 1 speed (1.00) suggest that we could improve by refining decision thresholds or leveraging faster
inference mechanisms.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ranking-Based Evaluation</title>
        <p>The ranking-based evaluation metrics assess the ability to prioritize at-risk users using Precision@10
(P@10), NDCG@10, and NDCG@100 across varying numbers of writings (1, 100, 500, 1000). Table 3
presents these metrics for the top three teams, providing insights into early detection capabilities.</p>
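        <p>For reference, NDCG@k compares a ranking’s discounted cumulative gain to that of the ideal ordering; a minimal sketch with binary relevance labels:</p>

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, k):
    """NDCG@k: DCG of the system ranking divided by the ideal DCG."""
    ideal = sorted(ranked_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / denom if denom > 0 else 0.0

# Users ranked by predicted risk; 1 = truly at-risk, 0 = not.
print(ndcg_at_k([1, 1, 0, 1, 0], k=5))   # imperfect ordering, below 1.0
print(ndcg_at_k([1, 1, 1, 0, 0], k=5))   # ideal ordering gives 1.0
```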
        <p>Run 1 from our team exhibits exceptional early detection capabilities, achieving perfect P@10 (1.00)
and NDCG@10 (1.00) scores after processing a single writing, with an NDCG@100 of 0.62. This
performance underscores the efficacy of the CPI+DMC approach, enhanced by Llama 3.1 summarization,
in prioritizing at-risk users from minimal data. Run 0 follows with strong early metrics (P@10 = 0.90,
NDCG@100 = 0.53), maintaining consistency up to 1000 writings (NDCG@100 = 0.49). HIT-SCIR
demonstrates superior long-term ranking performance, with all runs achieving an NDCG@100 of 0.90
across 1000 writings, reflecting a robust ability to sustain prioritization over time. ELiRF-UPV’s Run 0
excels early with an NDCG@100 of 0.36 after one writing, but its performance improves to 0.74 by 1000
writings, indicating a more stable ranking capability compared to our runs. Runs 2 and 3 from our team
show poor early performance (NDCG@10 = 0.21) and a complete drop by 500 writings, while Run 4
achieves moderate early scores (NDCG@100 = 0.33) but converges to 0.26 by 1000 writings, aligning
with Runs 1, 2 and 3.</p>
        <p>Against the top teams, our runs perform well early but fall short as writings increase (except for run
0), suggesting that our approaches require enhancement for long-term stability. The convergence of
our runs to similar NDCG@100 scores (0.26) by 1000 writings also indicates a common challenge in
maintaining ranking quality with increased data.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Discussion</title>
        <p>The results affirm that Run 1 from our team, with an F1 of 0.75 and Flatency of 0.72, represents our
strongest approach, securing a 3rd place ranking among 12 teams. The integration of Llama 3.1 for
summarization effectively captures contextual nuances, while the DMC component balances accuracy
and earliness, as evidenced by its competitive ERDE50 (0.05) and perfect early ranking scores (P@10 =
1.00, NDCG@10 = 1.00). Run 0’s solid performance (F1 = 0.68, ERDE50 = 0.05) highlights the potential
of time-aware training, suggesting a promising direction for refinement. However, Runs 2 and 3’s
low precision (0.14 and 0.11) despite high recall (0.94 and 1.00) indicates over-prediction, with Run 3’s
speed (1.00) and latencyTP (1.00) compromised by an Flatency of 0.20. Run 4’s zero-shot approach with the
Cloudflare Workers AI model (Llama-4-Scout-17B) offers moderate performance (F1 = 0.41, ERDE50
= 0.07), but its reliance on pretrained knowledge limits precision (0.27), underscoring the need for
task-specific adaptation.</p>
        <p>Compared to HIT-SCIR, whose Run 4 achieves the highest F1 (0.85) and ERDE50 (0.03), and
ELiRF-UPV’s Run 0 (F1 = 0.79), our best runs demonstrate competitive accuracy but lag in earliness and
long-term ranking stability (NDCG@100 = 0.90 for HIT-SCIR vs. 0.26 for us at 1000 writings). The
superior earliness of HIT-SCIR and ELiRF-UPV’s Run 1 speed (1.00) suggest that we could improve by
refining decision thresholds or leveraging faster inference mechanisms. The degradation in ranking
metrics over time across all our runs points to the need for dynamic models, potentially incorporating
temporal embeddings or ensemble techniques combining Run 1’s accuracy with Run 3’s speed.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Our exploration of hybrid and time-aware approaches for eRisk 2025 Task 2 has yielded meaningful
insights into the challenge of detecting depression early within Reddit’s dynamic conversational threads.
Achieving an F1 score of 0.75 with our CPI+DMC approach, securing a 3rd place ranking among 12
teams, underscores the potential of integrating Llama 3.1 summarization with BERT-based classification
to navigate complex social media contexts. The competitive ERDE50 of 0.05 and strong early ranking
metrics (P@10 = 1.00, NDCG@10 = 1.00) reflect our success in balancing timely detection with accuracy,
a critical step toward supporting timely mental health interventions. While our approaches faced
challenges in long-term ranking stability, these findings highlight opportunities to refine
decision-making policies and model architectures. This work not only contributes to the growing field of
computational mental health, but also reinforces the importance of context-aware, real-time systems in
addressing global mental health challenges. We are optimistic that continued research will build on
these foundations, paving the way for more effective tools to identify and support individuals at risk.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>The promising results of our study open several avenues for advancing early depression detection in
conversational contexts. A primary direction is the development of ensemble methods that integrate
the strengths of our five approaches. For instance, combining the time-aware precision of Run 0’s
ModernBERT with the contextual summarization of Run 1’s CPI+DMC framework could yield a model
that excels in both earliness and accuracy. Exploring ModernBERT’s full 8192-token capacity may further
enhance context capture, particularly for complex Reddit threads, though this would require optimizing
computational efficiency to ensure scalability in real-time applications. Another promising area is
refining data augmentation strategies, as seen in Run 2, to reduce noise and improve generalization
across diverse user expressions. For Run 4’s zero-shot approach, fine-tuning large language models
like Llama-4-Scout-17B on eRisk-specific data could address precision limitations, potentially bridging
the gap with supervised methods. Additionally, incorporating temporal embeddings to model user
behavior over time could improve long-term ranking stability, addressing the degradation observed in
our NDCG@100 scores. Beyond technical enhancements, we aim to explore cross-domain applications,
such as adapting our models for other mental health conditions or platforms like X, to broaden the
impact of real-time monitoring. These directions collectively aim to create more robust, adaptive
systems for early intervention in mental health.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to acknowledge the support provided by the Office of Research (OoR) at Habib University,
Karachi, Pakistan for funding this project through the internal research grant IRG-2235.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>International Conference of the CLEF Association, CLEF 2025, Madrid, Spain, September 9-12,
2025, Proceedings, Part II, volume to be published of Lecture Notes in Computer Science, Springer,
2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2023: Early risk prediction on the internet, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer Nature Switzerland, 2023, pp. 294-315. doi:10.1007/978-3-031-42448-9_22.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. Trotzek, S. Koitka, C. M. Friedrich, Utilizing Neural Networks and Linguistic Metadata for Early Detection of Depression Indications in Text Sequences, IEEE Transactions on Knowledge &amp; Data Engineering 32 (2020) 588-601. URL: https://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2885515. doi:10.1109/TKDE.2018.2885515.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Martínez-Castaño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Htait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Moshfeghi</surname>
          </string-name>
          ,
          <article-title>BERT-based transformers for early detection of mental health illnesses</article-title>
          , in: K. S. Candan,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>189</fpage>
          -
          <lpage>200</lpage>
          . doi:10.1007/978-3-030-85251-1_15.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Devaguptam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kogatam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kotian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Early detection of depression using BERT and DeBERTa</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2022</year>
          . URL: https://api.semanticscholar.org/CorpusID:251471697.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Gillani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S. A.</given-names>
            <surname>Baig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Siddiqui</surname>
          </string-name>
          ,
          <article-title>Advancing depression detection on social media platforms through fine-tuned large language models</article-title>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2409.14794. doi:10.48550/arXiv.2409.14794. arXiv:2409.14794 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Katoch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Depression detection and analysis using large language models on textual and audio-visual modalities</article-title>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2407.06125. doi:10.48550/arXiv.2407.06125. arXiv:2407.06125 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Poellabauer</surname>
          </string-name>
          ,
          <article-title>Multimodal depression detection with contextual position encoding and latent space regularization</article-title>
          ,
          <year>2025</year>
          . URL: https://openreview.net/forum?id=miOYgWl60q.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Errecalde</surname>
          </string-name>
          ,
          <article-title>A time-aware approach to early detection of anorexia: UNSL at eRisk 2024</article-title>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2410.17963. doi:10.48550/arXiv.2410.17963. arXiv:2410.17963 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Loyola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Burdisso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Cagnina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Errecalde</surname>
          </string-name>
          ,
          <article-title>UNSL at eRisk 2021: A comparison of three early alert policies for early risk detection</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>992</fpage>
          -
          <lpage>1021</lpage>
          . URL: https://ceur-ws.org/Vol-2936/paper-81.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Depression detection on social media with reinforcement learning</article-title>
          ,
          <source>in: Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings</source>
          , Springer-Verlag,
          <year>2019</year>
          , pp.
          <fpage>613</fpage>
          -
          <lpage>624</lpage>
          . URL: https://doi.org/10.1007/978-3-030-32381-3_49. doi:10.1007/978-3-030-32381-3_49.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>A test collection for research on depression and language use</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Quaresma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gonçalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cappellato</surname>
          </string-name>
          , N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , volume
          <volume>9822</volume>
          , Springer International Publishing,
          <year>2016</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>39</lpage>
          . URL: http://link.springer.com/10.1007/978-3-319-44564-9_3. doi:10.1007/978-3-319-44564-9_3, Series Title: Lecture Notes in Computer Science.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sadeque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <article-title>Measuring the latency of depression detection in social media</article-title>
          (
          <year>2018</year>
          )
          <fpage>495</fpage>
          -
          <lpage>503</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3159652.3159725. doi:10.1145/3159652.3159725, conference name:
          <source>WSDM 2018: The Eleventh ACM International Conference on Web Search and Data Mining</source>
          , ISBN: 9781450355810, Place: Marina Del Rey, CA, USA, Publisher: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2025: Early risk prediction on the internet (extended overview)</article-title>
          ,
          <source>in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2025), Madrid, Spain, 9-12 September 2025</source>
          , volume To be published of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2025: Early risk prediction on the internet</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 16th</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>