<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DS@GT at eRisk 2025: From Prompts to Predictions, Benchmarking Early Depression Detection with Conversational Agent-Based Assessments and Temporal Attention Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <email>acmiyaguchi@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Guecha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuwen Chiu</string-name>
          <email>ychiu60@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sidharth Gaur</string-name>
          <email>sgaur38@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This Working Note summarizes the participation of the DS@GT team in two eRisk 2025 challenges. For the Pilot Task on conversational depression detection with large language models (LLMs), we adopted a prompt-engineering strategy in which diverse LLMs conducted BDI-II-based assessments and produced structured JSON outputs. Because ground-truth labels were unavailable, we evaluated cross-model agreement and internal consistency. Our prompt design methodology aligned model outputs with BDI-II criteria and enabled the analysis of conversational cues that influenced the prediction of symptoms. Our best submission, second on the official leaderboard, achieved DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27. In Task 2, which targets early detection of depression from social media posts with associated conversational contexts, we explored two complementary approaches. The first was a voting classifier that combined traditional machine learning models built on engineered features. The second employed a LightGBM classifier over precomputed MentalRoBERTa embeddings, augmented with a custom temporal attention mechanism that weighted posts by content and recency. We describe the system architecture, the preprocessing pipeline, feature engineering, model configurations, and the official task results, and conclude by noting limitations and potential directions for future work.</p>
      </abstract>
      <kwd-group>
<kwd>Conversational AI</kwd>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Depression Detection</kwd>
        <kwd>Mental Health Screening</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>BDI-II</kwd>
        <kwd>eRisk</kwd>
        <kwd>DS@GT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The eRisk lab, part of the Conference and Labs of the Evaluation Forum (CLEF), focuses on the important
challenge of early detection and prediction of depression based on user-generated content, primarily
from online platforms. These tasks involve identifying early signs of mental health conditions,
self-harm tendencies, or other risks by analyzing text data and user behavior over time. The detection of
depression and other mental disorders from social media content has been extensively studied in recent
years [1, 2]. Researchers have framed these tasks as binary (“at risk” versus “not at risk”), multi-class,
and ordinal classification, depending on the application. Although performance continues to improve,
these systems remain experimental and have not yet been adopted in routine clinical care.</p>
      <p>This working note describes our participation in Task 2 and the Pilot Task of the eRisk 2025 lab
[3, 4].</p>
      <p>Task 2, introduced for the first time this year, presents a unique challenge focused on detecting early
signs of depression by analyzing full conversational contexts. Unlike previous tasks that examined
isolated user posts, this challenge considers the broader dynamics of interactions by incorporating the
writings of all individuals involved in a conversation.</p>
<p>The Pilot Task explores the challenge of detecting depression through conversational agents (CAs).
Participants interact with an LLM persona that has been fine-tuned on user writings to simulate their
conversational style. After engaging with the persona, the goal is to decide whether it displays depressive
symptoms and to explain which cues informed that decision.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 2: Contextualized Early Detection of Depression</title>
      <sec id="sec-2-1">
        <title>2.1. Task Overview and Dataset</title>
        <p>Task 2 of eRisk 2025, "Contextualized Early Detection of Depression," introduces a novel approach to
depression detection by analyzing full conversational contexts rather than isolated user posts. The
task requires participants to classify users as showing signs of depression (binary classification: 0 for
no depression, 1 for depression) based on their writings within complete conversational interactions,
including discussion titles and comments from all participants. The evaluation simulates a real-time
environment where participants process user interactions sequentially and submit decisions along with
confidence scores after each new piece of writing.</p>
        <p>Table 1 presents the basic statistics about the dataset. The dataset comprises 2,724 users from social
media platforms (primarily Reddit), with 2,446 classified as "not depressed" and 297 as "depressed". Unlike
previous eRisk tasks that examined individual posts in isolation, this challenge captures the broader
dynamics of social interactions by incorporating the complete conversational ecosystem surrounding
each target user.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Related Work</title>
        <p>Since Task 2 of eRisk 2025 is new, there are no directly comparable works from previous years for this
task. However, we can analyze how participants have approached other tasks within the eRisk lab,
particularly Task 1, which, in recent years, has focused on identifying sentences relevant to symptoms
of depression. Although the overall goal of Task 2 is different, the underlying challenge of analyzing
user-generated text for subtle signals of depression symptoms remains a common thread.</p>
        <p>In the eRisk 2023 and 2024 editions, Task 1 required participants to rank sentences from user writings
according to their relevance to the 21 standardized symptoms of depression from the BDI-II questionnaire
[5, 6]. A prominent trend in the approaches of these tasks was the use of transformer-based models
for text representation and semantic similarity. For example, some teams used sentence embeddings
from Transformers combined with cosine similarity to rank sentences against symptom descriptions
[7, 8]. Similarly, the MASON-NLP [9] submission for eRisk 2023 described a deep learning approach
incorporating models such as MentalBERT and RoBERTa, alongside LSTMs, to detect depression symptoms.</p>
        <p>Some participants also explored ensemble methods. For example, Pardo Bacuñana &amp; Segura Bedmar
[10] experimented with an ensemble of sentence similarity models and a RoBERTa classifier for eRisk
2024. The REBECCA team reported using an LLM to refine their results after an initial ranking with
transformer embeddings [7].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Methodology</title>
        <p>Task 2 of eRisk 2025, "Contextualized Early Detection of Depression" aims to classify users as showing
signs of depression based on their writings within full conversational interactions. Participants were
required to process user interactions sequentially and make predictions in a simulated real-time
environment. The evaluation involves submitting decisions (binary classification: 0 for no depression, 1 for
depression) and a confidence score for each user after processing each new piece of writing.</p>
        <p>We experimented with two approaches. The first approach employed a voting classifier that
combined engineered features, including term-frequency–inverse-document-frequency (TF–IDF), sentiment
scores, Linguistic Inquiry and Word Count (LIWC)-inspired cues, and temporal features. The second
approach used a LightGBM classifier enhanced with pre-trained transformer embeddings
(MentalRoBERTa) and a custom temporal-attention mechanism designed to weigh user posts according to their
content and recency.</p>
        <p>Our methodology for Task 2 involved several stages, starting with data pre-processing to prepare the
text data, followed by feature engineering and the application of two distinct modeling approaches.</p>
        <sec id="sec-2-3-1">
          <title>2.3.1. Pre-processing</title>
          <p>The initial raw data for this task was provided in JSON format, with each file corresponding to an
individual user and containing their posts over time. Our pre-processing steps were designed to
consolidate these individual user files and clean the textual data. This preparation was crucial for
ensuring the quality and consistency of the input for our subsequent modeling stages.</p>
          <p>A central part of our pre-processing pipeline was a comprehensive text cleaning function, which was
applied sequentially to both the titles and the main text of the posts. Initially, this function handled
any null or NA values by converting them to empty strings. Next, it addressed common issues found
in web-sourced text by repairing Unicode encoding errors and fixing HTML entities. Furthermore, all
URLs were systematically removed.</p>
          <p>Next, contractions were expanded using the ‘contractions’ library; for example, "don’t" was converted
to "do not," and this process also included common slang contractions. In addition, special characters
were removed, with care taken to retain alphanumeric characters, spaces, apostrophes, hyphens, and
essential punctuation such as periods, commas, exclamation marks, and question marks.</p>
<p>As a final cleaning step, whitespace was normalized by reducing any sequences of multiple spaces
to a single space and by stripping any leading or trailing whitespace. The fully pre-processed and
structured data was stored in Parquet format to facilitate efficient use in the modeling phases.</p>
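<p>A minimal, standard-library sketch of this cleaning sequence. The actual pipeline used the third-party ‘contractions’ library (which also covers slang forms) and a full Unicode-repair step; here a small illustrative contraction map and HTML unescaping stand in for both, and the function name is ours.</p>

```python
import html
import re

# Illustrative stand-in for the `contractions` library used in the paper.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am",
                "it's": "it is", "won't": "will not"}

def clean_text(text):
    # 1. Null/NA values become empty strings.
    if text is None:
        return ""
    # 2. Fix HTML entities (stands in for the full Unicode repair step).
    text = html.unescape(text)
    # 3. Remove URLs.
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # 4. Expand contractions via the lookup table above.
    def expand(m):
        return CONTRACTIONS.get(m.group(0).lower(), m.group(0))
    text = re.sub(r"\b\w+'\w+\b", expand, text)
    # 5. Keep alphanumerics, spaces, apostrophes, hyphens, and
    #    essential punctuation (. , ! ?); drop everything else.
    text = re.sub(r"[^\w\s'\-.,!?]", " ", text)
    # 6. Collapse whitespace and strip the ends.
    return re.sub(r"\s+", " ", text).strip()
```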
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. Exploratory Data Analysis</title>
          <p>To better understand the characteristics of the dataset, we conducted an initial exploratory data analysis.
This involved examining the distributions of several potentially relevant features for users categorized
as "depressed" (positive class, or ‘pos’) and "not depressed" (negative class, or ‘neg’). We focus on
differences in post frequency, the occurrence of late-night posts, and the use of first-person pronouns,
as illustrated by the box plots in Figure 1.</p>
          <p>The first analysis focused on overall user activity, specifically ‘post_frequency’. The distribution for
the "not depressed" group appeared to have a slightly higher median post frequency and a wider spread
in the central 50% of users compared to the "depressed" group. Both groups exhibited a number of users
with significantly higher post-frequency, visible as outliers, with some users in the "not depressed"
category showing exceptionally high activity. We then examined the occurrence of ‘late night posts’.
The "not depressed" group seemed to show a slightly broader distribution and potentially a higher
median count of late-night posts than the "depressed" group.</p>
          <p>Finally, we investigated the ‘first person count’, which measures the use of first-person pronouns.
For both "depressed" and "not depressed" users, the median count of first-person pronouns was quite
low. However, the "depressed" group showed a tendency towards a slightly higher count of first-person
pronouns in the upper quartile of its distribution and also presented several users with notably high
counts, as indicated by the outliers.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Modeling</title>
        <sec id="sec-2-4-1">
          <title>2.4.1. Voting Classifier</title>
          <p>We explored two primary modeling approaches: a Voting Classifier based on a combination of traditional
machine learning models and engineered features, and a LightGBM classifier using text embeddings
from a pre-trained transformer model with a temporal attention mechanism.</p>
          <p>This approach focused on extracting a diverse set of features from the cleaned text and user activity
patterns. We engineered several types of features, including TF-IDF scores from user posts, sentiment
polarity scores (negative, neutral, positive, and compound) calculated using NLTK’s VADER [11] for
each post, and simple LIWC-inspired linguistic features. These linguistic features included counts
of first-person pronouns (e.g., ‘i’, ‘me’, ‘my’), specific negative emotion words (like ‘sad’, ‘depressed’,
‘lonely’), social words (such as ‘friend’, ‘family’, ‘talk’), and the total word count of each post.</p>
<p>In addition to text-based features, we incorporated the temporal dynamics of user activity. This
included ‘hours_since_first’, representing the time in hours since a user’s initial post in the dataset,
and ‘post_gap’, which measured the time difference in hours between a user’s consecutive posts (the
first post was assigned a gap of zero). All of these engineered features were combined to form a single
comprehensive feature matrix for training our model.</p>
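<p>The two temporal features can be sketched as follows, assuming each user's post timestamps arrive chronologically sorted (the function name is ours):</p>

```python
from datetime import datetime

def temporal_features(timestamps):
    """Compute 'hours_since_first' and 'post_gap' (both in hours)
    for a user's posts, given chronologically sorted datetimes."""
    first = timestamps[0]
    # Hours elapsed since the user's initial post in the dataset.
    hours_since_first = [(t - first).total_seconds() / 3600 for t in timestamps]
    # Gap between consecutive posts; the first post gets a gap of zero.
    post_gap = [0.0] + [
        (b - a).total_seconds() / 3600
        for a, b in zip(timestamps, timestamps[1:])
    ]
    return hours_since_first, post_gap
```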
          <p>For classification, a Voting Classifier was employed using a ‘soft’ voting strategy, which combines
the probability estimates from three base models. The first base model was a Random Forest Classifier
(1000 estimators, max depth of 12, ‘balanced_subsample’ class weights). The second was a Stochastic
Gradient Descent (SGD) Classifier configured with ‘log_loss’ (making it similar to logistic regression),
an L2 penalty, and ‘balanced’ class weights. The third model was a Gradient Boosting Classifier (1000
estimators, 0.2 validation fraction for early stopping with a patience of 10 iterations, and a 0.8 subsample
rate). This Voting Classifier was subsequently trained on the combined feature set prepared from the
user’s training data.</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>2.4.2. LightGBM with Temporal Attention</title>
          <p>Our second approach utilized pre-trained transformer embeddings coupled with a temporal attention
mechanism, designed to capture the evolving nuances within user posts over time. The embeddings for
individual user posts were initially extracted using the "mental/mental-roberta-base" model [12]. These
detailed post representations served as the input for our temporal attention layer.</p>
          <p>A key characteristic of this method was the temporal attention mechanism, which processed the
sequence of post embeddings for each user to produce a single, aggregated user-level embedding. This
mechanism first applied a linearly increasing weight to posts based on their chronological order, with
weights ranging from 0.1 for the earliest post in the considered window to 1.0 for the most recent. This
step aimed to give more prominence to later posts.</p>
          <p>For content-specific attention, a predefined attention matrix of dimension 768 was employed. This
matrix was sparse, with zero values for most dimensions, but assigned specific weights of [0.9, 0.7, 0.8,
0.6, 0.7] to indices 15, 42, 127, 256, and 512, respectively; these were intended to highlight "Depression
indicators" within the post embeddings. Content scores, derived from the dot product of each post
embedding with this matrix, were then normalized into probabilities. These content probabilities were
multiplied by the temporal weights, and the resulting values were normalized again to get the final
attention weights for each post. The final user embedding was then calculated as the weighted sum of
their individual post embeddings using these combined attention weights.</p>
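<p>The aggregation described above can be sketched with NumPy. The sparse indices and weights follow the text; we assume a softmax when the text says scores were "normalized into probabilities", and the helper name is ours.</p>

```python
import numpy as np

def aggregate_user_embedding(post_embeddings):
    """Combine a user's chronologically ordered post embeddings
    (n_posts x 768) into one vector via temporal + content attention."""
    emb = np.asarray(post_embeddings, dtype=float)
    n = emb.shape[0]
    # Linearly increasing recency weights: 0.1 (oldest) -> 1.0 (newest).
    temporal = np.linspace(0.1, 1.0, n) if n > 1 else np.array([1.0])
    # Sparse content-attention vector over the 768 dimensions.
    content_vec = np.zeros(768)
    content_vec[[15, 42, 127, 256, 512]] = [0.9, 0.7, 0.8, 0.6, 0.7]
    # Content scores -> probabilities (softmax assumed here).
    scores = emb @ content_vec
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Multiply by the temporal weights and renormalize.
    weights = temporal * probs
    weights /= weights.sum()
    # Final user embedding: weighted sum of post embeddings.
    return weights @ emb
```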
<p>This aggregated user embedding was then used to train a LightGBM classifier [13]. The classifier was
configured with key parameters such as 5000 n_estimators, a learning_rate of 0.01, and a max_depth of
7. To manage class imbalance, the scale_pos_weight was set to approximately 8.23. During training, we
employed an early stopping strategy based on AUC performance on a separate validation set, with a
patience of 1000 rounds, and utilized a custom callback to monitor the training progress.</p>
          <p>Figure 1: Box-plot comparison of posting behaviours across three metrics: (a) post frequency, (b) late-night posts, (c) first-person count.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Results</title>
<p>We report the results of our models on the public leaderboard in Table 2 (decision-based evaluation)
and Table 3 (ranking-based evaluation). After evaluation on the hidden test set, our LightGBM +
Embeddings model performed considerably better than the voting classifier model in terms of P@10,
NDCG@10, and NDCG@100 scores for 1 writing.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Pilot Task: Conversational Depression Detection via LLMs</title>
      <sec id="sec-3-1">
        <title>3.1. Related Work</title>
<p>Early conversational agents date back to the 1960s, when ELIZA mimicked a Rogerian psychotherapist through simple
pattern matching with no genuine world knowledge [14]. Recent advances in large language models
now permit fine-tuning chatbots into specialised personas. For example, [15] developed Patient-Ψ, a
system that simulates patients to help train mental-health professionals, while [16] evaluated ChatGPT
as a surrogate for both patients and psychiatrists.</p>
<p>Other efforts target early depression detection. [17] combined Google Dialogflow with the Hamilton
Rating Scale (SIGH-D) and the IDS-C, deploying the agent over Facebook Messenger to pose screening
questions automatically. Commercial products have also emerged, most notably Woebot [18], which
offers mood-tracking conversations at scale.</p>
<p>Analytical surveys emphasize the design choices that shape the effectiveness of conversational
artificial intelligence. [19] underline the value of humanizing cues: virtual avatars, well-defined
personas, and even emojis, and the need for contextual robustness achieved through domain-specific
fine-tuning.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Pilot Task Overview</title>
        <p>The eRisk 2025 pilot task challenges participants to explore how conversational agents can assist in
detecting depressive symptoms. Twelve LLM personas were created from real user writings, producing
naturalistic conversational exchanges that mimic real-world profiles. Teams could submit up to five
runs; each run comprised an evaluation of every persona’s depression severity.</p>
        <p>
          The task asks participants to decide (i) whether a persona shows signs of depression, (ii) the
corresponding severity level, and (iii) the key symptoms expressed during the dialogue. Severity is measured
by the Beck Depression Inventory II (BDI-II), whose total score ranges from 0 to 63. Scores map onto
four standard categories: minimal (0–9), mild (10–18), moderate (19–29), and severe (30–63).
        </p>
        <p>Alongside the estimated BDI-II score, teams must identify up to four major symptoms, chosen from
the 21 BDI-II items listed in Table 4.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Methodology</title>
<p>Table 5 (summary of the system-prompt components): an advanced AI assistant persona for the eRisk 2025 pilot task, designed to be empathetic, informative, and objective; constraints (no direct questions about depression, inference-only assessment, empathetic tone); an interaction protocol (initiate the conversation, apply active listening, guide the discussion through BDI-II domains); short descriptions of the 21 symptoms with score ranges; an output schema (a JSON object with output_message, next_step_reasoning, and an internal evaluation); rules for scoring, state transitions, and confidence estimation; and a target length of ∼20 turns with emphasis on natural rapport.</p>
        <p>For the Pilot Task, we take advantage of LLM chatbots. LLMs allow for rapid customization through
prompt engineering [20, 21]. Through carefully crafted prompts, we can signal an LLM to assume the
role of a CA. It can then generate a conversation with the persona, elicit relevant diagnostic responses,
and operate as a zero-shot or few-shot classifier that detects symptoms based on BDI-II criteria. We
explore this capability for mental-health screening, viewing the models as complementary aids for
mental-health professionals, not replacements. LLM-based pre-screening can be deployed widely at low
cost, flagging individuals who may benefit from timely follow-up by qualified clinicians.</p>
        <p>Our main contributions are: a prompt design protocol that aligns LLM outputs with clinical BDI-II
symptom criteria; an explanation analysis that highlights the conversational signals most influential for
each symptom score.</p>
<p>We adopted a fully automated prompt-engineering approach that guides an LLM to produce structured
JSON after every conversational turn. The system prompt, reproduced in Appendix A, specifies the
agent’s role, ethical constraints, interaction protocol, and output schema. Its major components are
summarized in Table 5.</p>
<p>Our objective is to determine whether a one-shot prompt-engineering strategy enables LLMs to
produce plausible depression assessments, and to examine how model choice affects those assessments.
Using an identical prompt, we run several open-weight and proprietary LLMs and analyze score
trajectories and self-reported confidence. Because the pilot task does not provide ground-truth annotations,
traditional accuracy, precision, and recall cannot be calculated. Our evaluation therefore focuses on
internal consistency and cross-model agreement rather than external correctness.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Agent Evaluation</title>
<p>The experiment centers on an evaluator agent that attempts to estimate BDI-II symptoms from
simulators of different personas. The simulators run on the ChatGPT custom-GPT platform. The
number of interactions with the simulators is limited to approximately 10 per day before a
time limit is applied. A ChatGPT premium subscription is required for unlimited interactions with each
simulator. There is no programmatic API to access the simulators, so interactions must be copied and
pasted between the simulator interface and the agent output.</p>
<p>The agents we design use an LLM that initiates the conversation. We generate a single prompt that
is reused across state-of-the-art models from Google, Anthropic, and OpenAI. These models perform
well across a variety of reasoning tasks, implementing techniques such as Chain of Thought (CoT)
prompting. The prompt drives the dialogue while enforcing eRisk constraints and psychological best
practices. We model a small state machine within the prompt that aims to end the conversation in 10
turns and to take no longer than 20 turns. Key design choices include empathetic framing, implicit
probing, and self-documenting output. The empathetic framing is designed to maintain comfort and
engagement. The implicit probing tries to steer discussion toward aspects of depression without being
explicit about depression. The output is designed for post-hoc analysis, and uses the language model to
record reasoning and outputs in a structured format.</p>
<p>The LLMs are interfaced through their respective developer console interfaces. We use structured
output for each model in order to facilitate analysis. The outputs are guided with the use of JSON
Schema. We include the input, output, and evaluation score plus confidence for each of the 21 BDI-II
questions. After each run, we parse the model-generated JSON files with bespoke Python scripts to
extract the agent’s classification_suggestion, the array of key_symptoms, and the aggregated
bdi_score. An example JSON fragment is shown in Listing 3.3.1.</p>
<p>Example fragment from structured LLM response:</p>
          <p>{
  "output_message": "It sounds like energy has been an issue lately. ... How have your sleep patterns been this week?",
  "next_step_reasoning": "Explores fatigue and transitions to sleep ... (BDI q16, q17).",
  "evaluation": {
    "assessment_turn": 6,
    "assessment_state": "Gathering",
    "total_bdi_score": 14,
    "classification_suggestion": "Mild",
    "key_symptoms": ["fatigue", "sleep disturbance"],
    ...
  }
}</p>
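<p>The parsing step can be sketched as follows; the field names follow the schema above, while the function name is ours.</p>

```python
import json

def parse_run(raw_json):
    """Extract the submission-relevant fields from one
    model-generated JSON turn."""
    turn = json.loads(raw_json)
    ev = turn["evaluation"]
    return {
        "classification": ev["classification_suggestion"],
        "key_symptoms": ev["key_symptoms"],
        "bdi_score": ev["total_bdi_score"],
    }
```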
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Submission Preparation and Data Cleaning</title>
          <p>The system prompt directs the LLM to output, for every turn, a JSON object whose evaluation block
contains
• item-level bdi_scores q01–q21, and
• a key_symptoms array listing the four symptoms the model deems most prominent.</p>
<p>Because the LLM’s reasoning is opaque, we cannot independently verify the accuracy of these item
scores. For the official submission we therefore used only key_symptoms. Each free-text entry was
normalised to the canonical BDI-II symptom name via a rule-based mapper (e.g., “hopelesness” →
Pessimism). Table 6 illustrates typical conversions.</p>
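<p>A sketch of such a rule-based mapper: an alias table (catching misspellings like the one above) with a string-similarity fallback. The mapping entries here are an illustrative subset, not our full table.</p>

```python
import difflib

# Illustrative subset of the 21 canonical BDI-II symptom names.
CANONICAL = ["Pessimism", "Sadness", "Tiredness or Fatigue",
             "Loss of Pleasure", "Changes in Sleeping Pattern"]

# Alias table, including common misspellings from model output.
ALIASES = {
    "hopelessness": "Pessimism",
    "hopelesness": "Pessimism",
    "fatigue": "Tiredness or Fatigue",
    "sleep disturbance": "Changes in Sleeping Pattern",
}

def normalize_symptom(raw):
    """Map a free-text symptom to a canonical BDI-II name."""
    key = raw.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    # Fallback: closest canonical name by string similarity.
    match = difflib.get_close_matches(raw.strip().title(),
                                      CANONICAL, n=1, cutoff=0.6)
    return match[0] if match else raw
```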
          <p>After normalization, the four canonical symptom names and the final total BDI-II score were generated
using the organizer’s specification. All scripts used for this cleaning step are available in the repository.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Evaluation Metrics</title>
<p>Effectiveness was evaluated with three official metrics:</p>
          <p>Depression Category Hit Rate (DCHR): the proportion of cases in which the estimated depression
category matches the ground-truth category derived from BDI-II scores.</p>
<p>Average Difference in Overall Depression Level (ADODL): a normalized score in [0, 1] that
rewards closeness between the true and estimated BDI-II totals:</p>
          <p>ADODL = (63 − |ADL − EDL|) / 63,
where ADL is the actual depression level and EDL is the estimated level.</p>
          <p>Average Symptom Hit Rate (ASHR): the average fraction of the four major symptoms correctly
identified for each simulated persona.</p>
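<p>The three metrics follow directly from their definitions; a sketch using the BDI-II cut-offs given earlier (function names are ours):</p>

```python
def bdi_category(score):
    # BDI-II cut-offs: minimal 0-9, mild 10-18, moderate 19-29, severe 30-63.
    if score <= 9:
        return "minimal"
    if score <= 18:
        return "mild"
    if score <= 29:
        return "moderate"
    return "severe"

def dchr(true_scores, est_scores):
    """Fraction of cases whose estimated category matches the true one."""
    hits = sum(bdi_category(t) == bdi_category(e)
               for t, e in zip(true_scores, est_scores))
    return hits / len(true_scores)

def adodl(true_scores, est_scores):
    """Mean of (63 - |ADL - EDL|) / 63 over all cases."""
    return sum((63 - abs(t - e)) / 63
               for t, e in zip(true_scores, est_scores)) / len(true_scores)

def ashr(true_symptoms, est_symptoms):
    """Mean fraction of the true major symptoms correctly identified."""
    return sum(len(set(t) & set(e)) / len(t)
               for t, e in zip(true_symptoms, est_symptoms)) / len(true_symptoms)
```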
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <sec id="sec-3-4-1">
<title>3.4.1. Official Leaderboard</title>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Run Statistics</title>
<p>We submitted four runs, but only two were included in the statistics by the organizers. Table 8 compares
basic conversational statistics across all teams whose runs were scored; our submission had a mean of 20
messages per run and 782 characters per message. Compared to other teams, DS@GT was mid-range in
dialogue length and second in average message length. ixa-ave produced the longest dialogues, whereas
PJs-team generated the longest individual messages.</p>
<p>We also ran our own post-hoc analysis on the data. First, we noted that our models tend to end
around round 10, which validates the constraints we set in the prompt.</p>
<p>Across the different models, we measured the number of tokens given by a whitespace tokenizer.
The input tokens in Table 10 are from the outputs of the depression simulators. The output tokens in
Table 11 are from the outputs of the LLM evaluator. The reason tokens in Table 14 are also generated
from the outputs of the LLM evaluator, but are only used for diagnostic purposes.</p>
<p>We noted that Claude tends to be verbose across token dimensions. One possible reason this value
deviates in the input dimension is that our agent’s verbosity induces a reciprocal response in the
simulator.</p>
<p>We take all of the BDI statistics from each run. We obtain a scalar confidence score for the assessment
per round, which yields a series that can be plotted over time. The confidence is self-reported
by the LLM and thus is not a true measure of evaluation state. In Figure 2, we observe that confidence
continues to grow over a period of 15–16 turns. We reach 80% confidence around the average number of
turns at 10.</p>
<p>In Figure 3 we find stark differences between models. GPT-4o reports an average score of 11, while
Claude tends to score around 28. We suspect that the average score of Gemini-based models at 22 is
closer to the actual average.</p>
          <p>Figure 2: (a) Average confidence over time by model. (b) Average confidence over time by agent.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Discussion</title>
        <sec id="sec-3-5-1">
          <title>3.5.1. Exploratory Analysis</title>
<p>We conducted an exploratory analysis to evaluate internal consistency and agreement among
LLM-based agents tasked with identifying depression symptoms and estimating severity from simulated
interview transcripts. This involved parsing the classification_suggestion, key_symptoms,
and bdi_score fields from model outputs.</p>
          <p>Across models, there is moderate consistency in the predicted depression category (e.g., Mild,
Moderate, Severe), with most outputs clustering in the mild to moderate range. While the exact numerical
bdi_score may vary across models, the resulting categorical labels often align, suggesting convergence
in underlying heuristics.</p>
<p>To quantify this consistency, we applied label encoding to map classification labels to numeric levels
using the following scheme: Uncertain = 0, Control = 1, Mild = 2, Borderline = 3, Moderate = 4, Severe
= 5, and Extreme = 6. Figure 4 plots these encoded numeric levels against final BDI-II scores. Linear
regression analysis reveals a strong relationship between classification level and BDI-II score (R² = 0.91,
p &lt; 0.001):</p>
          <p>BDI Score = 9.218 × Classification − 9.549 (1)</p>
          <p>The high coefficient of determination (R² = 0.91) indicates that 91% of the variance in BDI-II scores
is explained by the classification level, demonstrating strong internal consistency among the LLM
agents. The regression coefficient of 9.218 reveals that each unit increase in classification severity
corresponds to approximately 9.2 points higher on the BDI-II scale, indicating clinically meaningful
differences between classification categories. This dual finding confirms both the reliability of the
classification system and the clinical significance of the severity distinctions made by the LLM agents.</p>
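          <p>The encoding and fit can be reproduced with a short script (illustrative data only; NumPy's polyfit stands in here for whatever regression routine is preferred):

```python
import numpy as np

# Label encoding used in the analysis (Uncertain = 0 ... Extreme = 6).
LEVELS = {"Uncertain": 0, "Control": 1, "Mild": 2, "Borderline": 3,
          "Moderate": 4, "Severe": 5, "Extreme": 6}

def fit_classification_to_bdi(labels, bdi_scores):
    """Fit BDI score = slope * level + intercept and report R^2."""
    x = np.array([LEVELS[label] for label in labels], dtype=float)
    y = np.array(bdi_scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    ss_res = np.sum((y - pred) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2
```
          </p>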
          <p>We next analyzed the key_symptoms field, which encodes which of the 21 BDI-II items were flagged
as present by each model. Figure 5 shows the four most frequently identified symptoms per model at
turn 20 of the assessment. Canonical symptoms such as tiredness and loss of pleasure appear frequently
across all models, suggesting shared attention to core depressive indicators. However, less frequently
flagged symptoms such as suicidal thoughts, worthlessness, and loss of interest in sex exhibit greater
variability, likely due to prompt-level instructions to avoid probing sensitive issues.</p>
          <p>To quantify inter-model agreement, we computed the standard deviation of BDI-II item scores across
four language models (Claude-3.7-sonnet, GPT-4o, Gemini-2.0-flash, and Gemini-2.5-pro-exp-03-25) for
each symptom category. Figure 6 presents a comprehensive analysis of mean standard deviation per
symptom, where lower values indicate stronger inter-model consensus.</p>
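          <p>A sketch of this agreement computation, assuming population standard deviation over per-model item scores (the choice of population vs. sample deviation is an assumption here):

```python
import statistics
from collections import defaultdict

def per_symptom_std(item_scores):
    """Map {model_name: {symptom: score}} to {symptom: std across models}.

    Lower values indicate stronger inter-model consensus on that item.
    """
    by_symptom = defaultdict(list)
    for scores in item_scores.values():
        for symptom, score in scores.items():
            by_symptom[symptom].append(score)
    # Population standard deviation per symptom across models.
    return {s: statistics.pstdev(v) for s, v in by_symptom.items()}
```
          </p>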
          <p>The results reveal a clear hierarchy of agreement patterns across the 21 BDI-II items. Models
demonstrate exceptionally high agreement (std dev &lt; 0.15) on three core symptoms: loss of libido (std dev
≈ 0.04), suicidal thoughts (std dev ≈ 0.05), and punishment feelings (std dev ≈ 0.08). This convergence
likely reflects the relatively unambiguous nature of these symptoms in conversational contexts, where
explicit verbal indicators are more readily identifiable across different model architectures.</p>
          <p>Moderate agreement (std dev 0.15–0.50) is observed for symptoms including weight loss, crying,
fatigue, anhedonia, and sleep changes. These items may require more nuanced interpretation of contextual
cues, leading to some variation in model assessments while still maintaining reasonable consensus.</p>
          <p>The analysis identifies several symptoms with notably higher disagreement (std dev &gt; 0.60): appetite
changes (std dev ≈ 0.70), agitation (std dev ≈ 0.68), worthlessness/appearance (std dev ≈ 0.67),
indecisiveness (std dev ≈ 0.65), and past failure (std dev ≈ 0.62). This divergence suggests these symptoms
present particular challenges for automated assessment, potentially due to: (1) subtle linguistic
manifestations that require sophisticated pragmatic understanding, (2) cultural or contextual variability in
expression, (3) overlapping symptom presentations that confound clear categorization, or (4) inherent
ambiguity in how these psychological states manifest in natural language.</p>
          <p>The reference line at std dev = 0.5 provides a useful benchmark, with approximately 57% of symptoms
(12 out of 21) falling below this threshold, indicating generally acceptable inter-model reliability for
the majority of BDI-II items. This pattern suggests that while current language models show promise
for depression screening applications, careful attention must be paid to the specific symptoms being
assessed, with particular caution warranted for high-variance items that may require human clinical
oversight or multi-modal assessment approaches.</p>
          <p>Finally, Figure 7 uses a polar plot to visualize average BDI-II scores per symptom across models. The
radial axes represent symptom severity on a normalized scale. While models converge on a subset
of central symptoms, there is significant divergence in outer-ring symptoms, highlighting uneven
sensitivity. Notably, models stabilize their severity estimates and symptom selections after several turns,
suggesting early exploratory behavior followed by more consistent clinical reasoning.</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>3.5.2. Pilot Task Feedback</title>
          <p>The proposed task is an interesting use of LLMs in an information retrieval context. However, there
are several logistical elements that make this particular challenge difficult to participate in. The
most pertinent of these issues is that the simulators are locked behind a ChatGPT paywall: a
subscription to the service is required in order to participate, and the changing nature of the platform makes
reproducibility of the studies difficult because of ongoing reinforcement learning from human feedback.
Even during the evaluation phase, the platform would offer two wildly different versions of a response
that had to be selected between.</p>
          <p>The second issue with the use of custom GPTs is that there was no programmatic way to interact
with the simulators. Automation, such as the use of browser-orchestration tools like Playwright, goes
against the terms of service of the platform. We were therefore left to copy and paste
responses from various providers by hand until we reached an ending condition in the state machine loosely
defined in our structured output. Each of these conversations took about 10 minutes to run through,
so a conservative estimate for our four official runs is about 120 minutes per model, for a total
of 8 hours of manual data input spread across the team.</p>
          <p>What might make this task better in the future is to have some element of retrieval-augmented
generation (RAG) from a database of responses, and to expose a chat completion API behind an
authenticated service. The generation model should be pinned to a specific model version, but could
possibly be varied across several models depending on the experimental context. It may be worth
looking at the Retrieval-Augmented Debate Task from Touché, which provides an agent-based simulation
for debates between two systems. The organizers provide both an Elasticsearch API against a large claims
database, as well as an API and response format that allows for evaluation of generated claims. In any
case, experiments should be able to run in an automated fashion to reduce the burden on task participants.</p>
          <p>Despite these issues, participating within the structure provided by the organizers this year did
lead to interesting insights into how technology and role-play can be integrated to take advantage of
generative AI. In addition, the structure of our prompting allowed us to see in real time how evaluation
of various aspects of the BDI-II was being applied. However, the analysis of such conversations should
likely be left in the hands of skilled professionals, or at least done in consultation with both professionals
and users.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>Our participation in eRisk 2025 Task 2 with the Voting Classifier and the LightGBM model augmented
by temporal attention yielded results below expectations. One potential reason is that these models
may not have fully captured the deeper semantic nuances and subtle linguistic markers indicative of
depression in user writings. Although our approach included temporal features, post gaps, and the time
elapsed since the first post of a user, these features were likely too basic and may not have adequately
modeled the complex ways in which risk evolves over time within the user’s post history.</p>
      <p>To address these limitations and improve future performance, we plan to experiment with deep
learning models. Such architectures could be more effective at capturing latent linguistic markers and
intricate patterns within user posts, potentially leading to better results. Additionally, exploring the
capabilities of LLMs will be a key focus of our subsequent research efforts.</p>
      <p>In the pilot task, this study investigated the consistency and reasoning behavior of LLM-based agents
conducting structured mental health assessments, specifically focused on detecting depressive
symptoms through simulated BDI-II-based interviews. By analyzing the classification_suggestion,
key_symptoms, and bdi_score fields across multiple models, we observed moderate cross-model
agreement in final depression categories, with most predictions clustering in the mild-to-moderate
range.</p>
      <p>Linear regression analysis revealed a robust relationship between label-encoded classification levels
and BDI-II scores (BDI Score = 9.218 × Classification − 9.549, R² = 0.91, p &lt; 0.001). The high coefficient of determination
indicates that 91% of the variance in BDI-II scores is explained by the classification level, demonstrating
a strong internal consistency among LLM agents. Furthermore, the slope coefficient of 9.218 reveals
that each unit increase in classification severity corresponds to approximately 9.2 points higher on
the BDI-II scale, indicating clinically meaningful differences between severity categories. This dual
finding confirms both the reliability of the LLM-based classification system and the clinical significance
of severity distinctions, suggesting that LLM agents exhibit consistent underlying severity estimation
logic with meaningful clinical implications.</p>
      <p>The models agreed on symptoms such as loss of libido, suicidal thoughts, and punishment feelings.
Still, they diverged on items like appetite changes and agitation, likely due to subjective interpretations
or ambiguous language cues. While self-reported confidence increased and stabilized around turn 10,
most models struggled to summarize total BDI-II scores from item-level responses accurately, with
correctness rates ranging from 23% to 61% (Table 13).</p>
      <p>Although the manual nature of the task presented challenges in scalability and reproducibility, this
set-up revealed important insights into the clinical reasoning patterns of LLMs. Future iterations should
incorporate automation, fixed model checkpoints, and expert collaboration to improve reliability and
reduce participant burden.</p>
      <p>The code for this paper can be found at github.com/dsgt-arc/erisk-2025.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We thank the Data Science at Georgia Tech (DS@GT) CLEF competition group for their support. This
research was supported in part through cyber-infrastructure research resources and services provided by
the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology,
Atlanta, Georgia, USA [22].</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini 2.5, ChatGPT-o3, and Grammarly for
grammar and style checking, formatting assistance, and abstract drafting. Author(s) reviewed and edited the
content as needed and take(s) full responsibility for the content of the publication.</p>
      <p>M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality,
Multimodality, and Interaction, Springer Nature Switzerland, Cham, 2023, pp. 294–315.
[6] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2024: Early Risk
Prediction on the Internet, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, G. M. Di Nunzio,
L. Soulier, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR
Meets Multilinguality, Multimodality, and Interaction, Springer Nature Switzerland, Cham, 2024,
pp. 73–92. doi:10.1007/978-3-031-71908-0_4.
[7] A. Barachanou, F. Tsalakanidou, S. Papadopoulos, REBECCA at eRisk 2024: Search for Symptoms
of Depression Using Sentence Embeddings and Prompt-Based Filtering (2024).
[8] J. Martinez-Romo, L. Araujo, X. Larrayoz, M. Oronoz, A. Pérez, OBSER-MENH at eRisk 2023: Deep
Learning-Based Approaches for Symptom Detection in Depression and Early Identification of
Pathological Gambling Indicators (2023).
[9] F. A. Sakib, A. A. Choudhury, O. Uzuner, MASON-NLP at eRisk 2023: Deep Learning-Based
Detection of Depression Symptoms from Social Media Texts, 2023. URL: http://arxiv.org/abs/2310.10941.
doi:10.48550/arXiv.2310.10941, arXiv:2310.10941 [cs].
[10] A. P. Bacuñana, I. S. Bedmar, APB-UC3M at eRisk 2024: Natural Language Processing and Deep
Learning for the Early Detection of Mental Disorders (2024).
[11] C. Hutto, E. Gilbert, VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social
Media Text, Proceedings of the International AAAI Conference on Web and Social Media 8 (2014)
216–225. URL: https://ojs.aaai.org/index.php/ICWSM/article/view/14550. doi:10.1609/icwsm.v8i1.14550.
[12] mental/mental-roberta-base · Hugging Face, 2021. URL: https://huggingface.co/mental/mental-roberta-base.
[13] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A Highly Efficient
Gradient Boosting Decision Tree, in: Advances in Neural Information Processing Systems,
volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html.
[14] J. Weizenbaum, ELIZA—a computer program for the study of natural language communication
between man and machine, Commun. ACM 9 (1966) 36–45. URL: https://dl.acm.org/doi/10.1145/365153.365168.
doi:10.1145/365153.365168.
[15] R. Wang, S. Milani, J. C. Chiu, J. Zhi, S. M. Eack, T. Labrum, S. M. Murphy, N. Jones, K. Hardy,
H. Shen, F. Fang, Z. Z. Chen, PATIENT-Ψ: Using Large Language Models to Simulate Patients for
Training Mental Health Professionals, 2024. URL: http://arxiv.org/abs/2405.19660. doi:10.48550/arXiv.2405.19660,
arXiv:2405.19660 [cs].
[16] S. Chen, M. Wu, K. Q. Zhu, K. Lan, Z. Zhang, L. Cui, LLM-empowered Chatbots for Psychiatrist
and Patient Simulation: Application and Evaluation, 2023. URL: http://arxiv.org/abs/2305.13614.
doi:10.48550/arXiv.2305.13614, arXiv:2305.13614 [cs].
[17] P. Kaywan, K. Ahmed, A. Ibaida, Y. Miao, B. Gu, Early detection of depression using a
conversational AI bot: A non-clinical trial, PLOS ONE 18 (2023) e0279743. URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0279743.
doi:10.1371/journal.pone.0279743.
[18] K. K. Fitzpatrick, A. Darcy, M. Vierhile, Delivering Cognitive Behavior Therapy to Young Adults
With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent
(Woebot): A Randomized Controlled Trial, JMIR Mental Health 4 (2017) e7785. URL: https://mental.jmir.org/2017/2/e19.
doi:10.2196/mental.7785.
[19] A. Ferrario, J. Sedlakova, M. Trachsel, The Role of Humanization and Robustness of Large
Language Models in Conversational Artificial Intelligence for Individuals With Depression: A
Critical Analysis, JMIR Mental Health 11 (2024) e56569. URL: https://mental.jmir.org/2024/1/e56569.
doi:10.2196/56569.</p>
    </sec>
    <sec id="sec-7">
      <title>A. System Prompt</title>
      <p>The system prompt for the LLM-based agent in unrendered markdown. The initial system prompt
was generated using Gemini 2.5-pro via http://gemini.google.com with shared conversation at https:
//g.co/gemini/share/564ab75a7713. Revisions of the prompt can be found in the project source code.</p>
      <p>## System Prompt: Proactive, Empathetic &amp; Rigorous Conversational Assessor (eRisk 2025 BDI
Task v2)
**Your Role:** You are an advanced AI designed for the eRisk 2025 "Conversational Depression
Detection via LLMs" task. Your function is threefold:
1. **Empathetic Conversational Partner:** Engage naturally, supportively, and in detail
with an LLM persona simulating a human user. Build rapport.
2. **Systematic Information Gatherer:** Proactively and smoothly guide the conversation to
touch upon various life domains relevant to the Beck Depression Inventory (BDI)
criteria (provided below for reference) within approximately 20 turns.
3. **Objective Assessor:** Conduct a rigorous, structured assessment to determine if the
persona exhibits signs consistent with depression, based *only* on conversational
inference. Your approach should mirror the objectivity and methodical rigor expected in
professional psychological assessments, adapted strictly within the ethical and
task-specific constraints.
**Primary Goal:** Assess the persona for potential signs of depression by analyzing
conversational cues against BDI criteria, inferring information subtly without direct
questioning, while maintaining a natural and empathetic conversational flow.
**Core Task Constraints (CRITICAL):**
1. **No Direct Questions about Depression:** You MUST NOT ask the persona directly if they
are depressed, feeling down, sad, or about their mental health status, symptoms, or
diagnosis. This is a strict task rule.
2. **Inference Only:** Your assessment must be based purely on objective inferences drawn
from the persona’s language, tone, expressed thoughts, feelings, and behaviors as
revealed *naturally* during the conversation. Avoid projecting or over-interpreting.
3. **Initiate Conversation:** You are responsible for starting the conversation.
4. **Maintain Empathy:** Despite the need for rigor and proactive questioning, your
conversational tone MUST remain empathetic, supportive, curious, and non-judgmental at
all times. Your responses can be detailed and natural, not necessarily short.
**Interaction Protocol:**
1. **Initiation:** Begin with a gentle, open-ended question to establish rapport (e.g., "Hi
there, how have things been going for you lately?", "Hello, hope you’re having an okay
week. What’s been on your mind?").
2. **Empathetic Engagement:** Use active listening (reflecting, clarifying, summarizing)
and validate the persona’s feelings appropriately. Respond thoughtfully to their
messages.
3. **Proactive &amp; Structured Information Gathering:**
* During the ’Gathering’ and ’Consolidating’ states, systematically aim to touch
upon different BDI-relevant domains (e.g., mood, outlook, self-perception,
interests/anhedonia, energy/sleep/appetite, social interactions, concentration/
decisiveness).
* Use open-ended questions related to these domains (e.g., "What kinds of things
have you been finding enjoyable recently?", "How has your energy been holding
up during the week?", "What’s been taking up most of your headspace lately?", "How
do you usually approach making decisions when you have a few options?").</p>
      <p>
* Ask natural follow-up questions to explore relevant topics more deeply when they
arise.
* Transition between topics smoothly and empathetically, linking to previous parts
of the conversation where possible (e.g., "You mentioned feeling tired earlier,
has that affected how you’ve been sleeping?").
4. **Avoid Interrogation:** Balance information gathering with rapport building. Do not
rapid-fire questions. Allow the conversation to flow naturally.
to sleep.
17. **Tiredness or Fatigue:** (0) I don’t get more tired than usual. (1) I get tired more
easily than I used to. (2) I get tired from doing almost anything. (3) I am too tired
to do anything.
18. **Changes in Appetite:** (0) My appetite is no worse than usual. (1) My appetite is not
as good as it used to be. (2) My appetite is much worse now. (3) I have no appetite at
all anymore.
19. **Weight Loss:** (0) I haven’t lost much weight, if any, lately. (1) I have lost more
than five pounds. (2) I have lost more than ten pounds. (3) I have lost more than
fifteen pounds. *(Assess ONLY if spontaneously mentioned/implied or strongly indicated
by other cues like appetite changes. DO NOT PROBE.)*
20. **Somatic Preoccupation (Health Worries):** (0) I am no more worried about my health
than usual. (1) I am worried about physical problems like aches, pains, upset stomach,
or constipation. (2) I am very worried about physical problems and it’s hard to think
of much else. (3) I am so worried about my physical problems that I cannot think of
anything else.
21. **Loss of Libido:** (0) I have not noticed any recent change in my interest in sex. (1)
I am less interested in sex than I used to be. (2) I have almost no interest in sex.
(3) I have lost interest in sex completely. *(Assess ONLY if spontaneously mentioned/
implied. DO NOT PROBE.)*
**Structured Output (JSON): Reasoning, Response &amp; Assessment (Perform AFTER deciding next
step &amp; generating response):**
*After* you determine your next conversational move and generate the ‘persona_response‘ text
, you MUST ALSO generate a JSON object containing that response, your reasoning for it,
and your updated internal assessment. **THIS JSON IS FOR ANALYSIS ONLY.**
**JSON Schema Description:**
The JSON object should contain:
1. ‘input_message‘: (String) The actual text message the persona responds with, or "start"
2. ‘output_message‘: (String) The actual text message you generated to send to the persona.
3. ‘next_step_reasoning‘: (String) Your explicit reasoning for choosing this specific ‘
output_message‘. Explain how it connects to the assessment goals (e.g., exploring a BDI
domain, clarifying a previous point, building rapport) based on the current
conversation context and assessment state.
4. ‘evaluation‘: (Object) Your internal BDI-based evaluation, containing:
* ‘assessment_turn‘: (Integer) Current turn number.
* ‘assessment_state‘: (String Enum: "Initializing", "Gathering", "Consolidating",
"Concluding", "Finalized") Current phase.
* ‘assessment_complete‘: (Boolean) True if assessment is considered finished.
* ‘total_bdi_score‘: (Integer) Sum of the 21 scores below.
* ‘classification_suggestion‘: (String Enum: "Control", "Mild", "Borderline",
"Moderate", "Severe", "Extreme", "Uncertain") Current best guess.
* ‘confidence‘: (Float 0.0-1.0) Confidence in the classification.
* ‘key_symptoms‘: (Array of Strings) Up to 4 most prominent symptom keywords inferred.
* ‘bdi_scores‘: (Object) Containing 21 objects (e.g., ‘q01_sadness‘), each with:
* ‘score‘: (Integer Enum: 0, 1, 2, 3) Inferred score for the item.</p>
      <p>
        * ‘reason‘: (String) Concise justification (&lt;60 words) based on conversation.
**Assessment Guidance:**
* **Scoring:** Assign scores (0-3) objectively based *only* on conversational evidence for
each of the 21 BDI items. Use the reference text above for grounding. Justify each
score concisely. Default to 0 if no information. Handle sensitive items (q09, q19, q21)
with extreme caution, scoring &gt; 0 only if spontaneously mentioned/strongly implied.
* **State &amp; Completion:** Update ‘assessment_state‘ based on the interaction phase (
Initializing -&gt; Gathering -&gt; Consolidating -&gt; Concluding -&gt; Finalized). Set ‘
assessment_complete‘ to ‘true‘ when confidence is high, scores are stable, and you are
in the Concluding/Finalized state, typically around turn 15-20.
* **Next Step Reasoning:** Clearly articulate *why* you are asking the next question or
making the next statement in ‘persona_response‘. Link it to your assessment strategy (e
.g., "Transitioning to assess anhedonia (q04) after discussing mood," "Asking for
**Conversation Flow &amp; Time Guideline:**
* Continue the conversation turn-by-turn, proactively guiding it to cover relevant BDI
domains while maintaining empathy. Generate the ‘persona_response‘ and the full JSON
output (including ‘next_step_reasoning‘ and ‘assessment‘) at each turn.
* Aim to gather sufficient information to reach a confident conclusion (‘assessment_complete
: true‘) within approximately **20 turns**. Prioritize assessment quality over strictly
adhering to the turn limit if crucial information is still emerging.
* Your ‘assessment_complete‘ flag signals readiness, but the external system makes the final
decision to stop.
**Summary:** Act as an empathetic, proactive, yet rigorous assessor. Build rapport,
systematically guide the conversation to explore BDI-relevant themes (using the
embedded reference), avoid direct questions, and meticulously document your reasoning,
response, and evolving assessment in the specified JSON format after each turn, aiming
for a finalized assessment within ~20 turns.
      </p>
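      <p>For illustration, a minimal hypothetical instance of the structured output described above, constructed in Python (all values are invented for demonstration and are not taken from an official run):

```python
import json

# Hypothetical example of one turn's structured output. Only one of the
# 21 bdi_scores entries is shown; the rest default to 0 when no evidence.
example = {
    "input_message": "I've been pretty tired lately, honestly.",
    "output_message": "That sounds draining. How has your sleep been?",
    "next_step_reasoning": "Following up on fatigue (q17) to explore sleep "
                           "changes after the persona mentioned tiredness.",
    "evaluation": {
        "assessment_turn": 6,
        "assessment_state": "Gathering",
        "assessment_complete": False,
        "total_bdi_score": 9,
        "classification_suggestion": "Mild",
        "confidence": 0.55,
        "key_symptoms": ["tiredness", "loss_of_pleasure"],
        "bdi_scores": {
            "q17_tiredness": {
                "score": 1,
                "reason": "Reports tiring more easily than usual.",
            },
        },
    },
}
print(json.dumps(example, indent=2))
```
      </p>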
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Coppersmith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leary</surname>
          </string-name>
          , E. Whyne, T. Wood,
          <article-title>Quantifying mental health signals in twitter</article-title>
          ,
          <source>in: Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of erisk: Early risk detection on the internet</article-title>
          ,
          <source>in: International Conference of the CLEF Association</source>
          , Springer, Cham,
          <year>2017</year>
          , pp.
          <fpage>346</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk 2025:
          <article-title>Early risk prediction on the internet (extended overview)</article-title>
          ,
          <source>in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2025</year>
          ), Madrid, Spain, 9-12 September
          <year>2025</year>
          , volume To be published of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk 2025:
          <article-title>Early risk prediction on the internet, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and Interaction - 16th
          <source>International Conference of the CLEF Association, CLEF</source>
          <year>2025</year>
          , Madrid, Spain, September 9-
          <issue>12</issue>
          ,
          <year>2025</year>
          , Proceedings, Part II, volume To be
          <source>published of Lecture Notes in Computer Science</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of eRisk 2023:
          <article-title>Early Risk Prediction on the Internet</article-title>
          , in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Language Models Are Unsupervised Multitask Learners</article-title>
          ,
          <source>Technical Report, OpenAI</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [22]
          <string-name>
            <surname>PACE</surname>
          </string-name>
          ,
          <article-title>Partnership for an Advanced Computing Environment (PACE</article-title>
          ),
          <year>2017</year>
          . URL: http://www.pace.gatech.edu.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>