<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Similarity-Based, and Prompt-Based Approaches to Depression Symptom Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diogo A.P. Nunes</string-name>
          <email>diogo.p.nunes@inesc-id.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugénio Ribeiro</string-name>
          <email>eugenio.ribeiro@inesc-id.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INESC-ID Lisboa</institution>
          ,
          <addr-line>Rua Alves Redol 9, 1000-029 Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Superior Técnico, Universidade de Lisboa</institution>
          ,
          <addr-line>Av. Rovisco Pais, 1049-001 Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituto Universitário de Lisboa (ISCTE-IUL), Avenida das Forças Armadas</institution>
          ,
          <addr-line>1649-026 Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>In this work, we describe our team’s approach to eRisk’s 2025 Task 1: Search for Symptoms of Depression. Given a set of sentences and the Beck’s Depression Inventory - II (BDI) questionnaire, participants were tasked with submitting up to 1,000 sentences per depression symptom in the BDI, sorted by relevance. Participant submissions were evaluated according to standard Information Retrieval (IR) metrics, including Average Precision (AP) and R-Precision (R-PREC). The provided training data, however, consisted of sentences labeled as to whether a given sentence was relevant or not w.r.t. one of BDI’s symptoms. Due to this labeling limitation, we framed our development as a binary classification task, for which we split the available labeled data into training and validation sets, and explored foundation model fine-tuning, sentence similarity, Large Language Model (LLM) prompting, and ensemble techniques. The validation results revealed that fine-tuning foundation models yielded the best performance, particularly when enhanced with synthetic data to mitigate class imbalance. We also observed that the optimal approach varied by symptom. Based on these insights, we devised five independent test runs, two of which used ensemble methods. These runs achieved the highest scores in the official IR evaluation, outperforming submissions from all other teams.</p>
      </abstract>
      <kwd-group>
        <kwd>eRisk</kwd>
        <kwd>depression symptoms</kwd>
        <kwd>fine-tuning</kwd>
        <kwd>sentence similarity</kwd>
        <kwd>large language models</kwd>
        <kwd>prompting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Mental health is central to overall physical health. Indeed, the risk of other diseases is increased in
the presence of psychological disorders [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Depression is one such disorder; it can be caused by both
physiological and psychological factors, and its symptoms may include a depressive mood, lack of
interest and pleasure, and reduced energy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. According to the World Health Organization (WHO)1,
5% of the global population suffers from depression, with a higher incidence in women. Depression
is also one of the most common comorbidities of chronic diseases, such as cancer and chronic pain,
in part because of their psychosocial burden; in these cases, the depression diagnosis is an increased
challenge due to overlapping symptoms and confounding factors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The relation between mental disorders and linguistic expression has been increasingly explored
[
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ]. In fact, depression symptoms manifest in patients’ language commonly as short and directive
communication, limited development of concepts, self-focused attention, negative sentiment, verbosity
of auxiliary terms, and disfluencies [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. This motivates the development of Natural Language
Processing (NLP) techniques to monitor and detect depression from language use. However, language
is modulated by a plethora of factors beyond psychological and clinical states, namely demographic
and sociocultural variables, which can be confounding factors towards that objective [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Social media presents an opportunity for the development of monitoring and detection systems for
depression in online platforms. These may allow for early detection and quick action on a large scale,
giving rise to eRisk’s task of sentence ranking for depression symptoms, which was introduced in
2023 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Previous participant submissions to this and similar tasks included (key)word-based frequency
features with downstream classification and ranking models [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], sentence embeddings for similarity
ranking [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and, more recently, Large Language Models (LLMs) for synthetic dataset generation [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Our team’s participation in this task comprised the exploration of multiple methods to select and
rank relevant sentences for a given Beck’s Depression Inventory - II (BDI) symptom. Although the
official task evaluation was based on standard Information Retrieval (IR) metrics, we mainly framed
our methods as binary classification or regression tasks due to training data limitations, as described
below. Our methodology included the fine-tuning of foundation models, similarity-based ranking
in an unsupervised setting, LLM relevancy prompting and synthetic data generation, and ensemble
techniques. We developed and validated these approaches on our training and validation splits of the
provided labeled training dataset, based on classification metrics. A high-precision ensemble run was
our best performing submission in the official IR evaluation, placed 1st among 17 teams and a total of
67 runs. This paper describes our approach and its results in detail.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Focusing on text as an instantiation of language, previous work has attempted to identify the linguistic
markers of depression. These are characteristics of language use that can be used to separate depression
patients from controls. Trifu et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] conducted such a study with 62 patients diagnosed with major
depressive disorder and 43 controls. They sampled language use through prompted narratives on
something that provided (or used to provide) pleasure. Participants’ transcribed answers were analyzed
with Linguistic Inquiry and Word Count (LIWC) [13], which is a proprietary knowledge- and
dictionary-based psycholinguistic feature extractor. Their statistical analysis found significant
language use differences between patients and controls; for instance, depression patients used shorter
sentences, and more frequently used the plural personal pronoun (“we”), informal language, interrogations,
and other punctuation in general. Their sentences were also more likely to be formed in the past tense.
Semantically, depression patients were more likely to talk about biological processes, health, and money,
and less about leisure. Other analyses observed similar findings [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], laying the foundation for the type
of information that should be monitored by systems for early detection of depression online.
      </p>
      <p>
        eRisk’s task [
        <xref ref-type="bibr" rid="ref9">9, 14, 15</xref>
        ] of ranking sentences for depression symptom detection in online platforms
entails slightly different constraints from the related work above. It focuses on learning the relevancy of
a given sentence for a given BDI symptom. The BDI includes 21 symptoms, such as sadness, pessimism, loss
of pleasure, self-dislike, worthlessness, and agitation. For each symptom, a number of descriptions are
provided, seemingly in order of intensity. Tab. 1 shows two such examples. The two major constraints
in this task are: 1) symptom-level detection is more granular than binary depression diagnosis, and
2) sentence-level detection of depression lacks context w.r.t. user-level detection. Indeed, since the
2024 edition of this task, sentences have been contextualized with previous and subsequent sentences; this,
however, is still far from the context available for user-level detection of depression. Below, we briefly
describe the approaches of the best performing teams in the past two editions.
      </p>
      <p>
        In 2023, Recharla et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] submitted four runs, all unsupervised and similarity-based (notably,
training data was not available in this edition, since it was the first). After pre-processing, they calculated
two types of embeddings for each sentence and BDI symptom option (locally trained Word2Vec [16] and
the pretrained paraphrase-MiniLM-L3-v22 SentenceTransformer [17]). They selected and ranked the top
1,000 most similar corpus sentences to each BDI symptom, according to their average similarity to the
symptom’s options. They included both weighted and unweighted similarity averages, where the weight
was given by the increasing intensity of the symptom’s options (see Tab. 1). SentenceTransformer-based
embeddings outperformed locally trained Word2Vec-based embeddings by a large margin. Overall,
the unweighted similarity average of SentenceTransformer-based embeddings performed the best.
      </p>
      <sec id="sec-2-1">
        <title>2https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L3-v2</title>
        <p>[Tab. 1 examples. Sadness: 0. I do not feel sad. 1. I feel sad much of the time. 2. I am sad all the time. 3. I am so sad or unhappy that I can’t stand it. Crying: 0. I don’t cry anymore than I used to. 1. I cry more than I used to. 2. I cry over every little thing. 3. I feel like crying, but I can’t.]</p>
        <p>
          Ang et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] submitted five runs in 2024 [14]. All of their runs were also similarity-based. In
order to calculate the relevance of a candidate sentence w.r.t. a BDI symptom, they developed three
sets of symptom exemplars. These included a set with the original BDI symptom options (see Tab. 1),
the previous set plus GPT-4 [18] synthesized exemplars based on Early Maladaptive Schemas (EMS),
and the previous two sets plus synthetic exemplars demonstrating positive-sentiment user state (e.g.,
“I’m sad” versus “I’m happy” for the Sadness symptom). They extracted embeddings of both
candidate sentences and symptom exemplars with pretrained and fine-tuned SentenceTransformer models,
including all-mpnet-base-v23, all-MiniLM-L12-v24, and all-distilroberta-v15. Their fine-tuning was
based on contrastive learning with annotated training data, which was officially available starting
in 2024. Like Recharla et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], similarity was measured with average cosine-similarity. Although
candidate sentence context was available in 2024, Ang et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] do not report having leveraged it
in their approach. Their best performing run was an ensemble of various pretrained and fine-tuned
sentence embeddings and symptom exemplars. Data from both previous editions and corresponding
annotations were available for this year’s edition.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this section we describe in detail the experimental setup defining our approach to eRisk’s “Task 1:
Search for Symptoms of Depression”. First, we discuss the official training and test data, and our own
development training and validation splits. We then discuss our technical approach, encompassing
foundation model fine-tuning, similarity-based methods, and LLM prompting. Finally, we describe our
regression/classification evaluation framework in contrast to the official evaluation based on IR metrics.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>Participants were provided with official training and test splits of the dataset. All sentences in the
dataset were presented in TREC format, and were characterized by a unique ID (&lt;DOCNO&gt;) and their
text (&lt;TEXT&gt;). Some sentences were also characterized by their surrounding context, when available,
i.e., the text of the previous and subsequent sentences (&lt;PRE&gt; and &lt;POST&gt;, respectively).</p>
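The TREC-style records described above can be read with a short parser. The sketch below is illustrative, not the official task tooling: the `<DOC>` wrapper element, the `Sentence` dataclass, and the `parse_trec` name are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sentence:
    docno: str
    text: str
    pre: Optional[str] = None   # previous-sentence context, when available
    post: Optional[str] = None  # subsequent-sentence context, when available

def parse_trec(raw: str) -> list:
    """Parse TREC-formatted records into Sentence objects (sketch)."""
    sentences = []
    for doc in re.findall(r"<DOC>(.*?)</DOC>", raw, flags=re.S):
        def field(tag):
            m = re.search(rf"<{tag}>(.*?)</{tag}>", doc, flags=re.S)
            return m.group(1).strip() if m else None
        sentences.append(Sentence(
            docno=field("DOCNO"),
            text=field("TEXT"),
            pre=field("PRE"),
            post=field("POST"),
        ))
    return sentences
```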
        <p>
          The official training set comprised data from the two previous editions of this task [
          <xref ref-type="bibr" rid="ref9">9, 14</xref>
          ]. A portion
of this set was labeled according to the task’s annotation guidelines, i.e., whether a given sentence
was relevant or not to a given BDI symptom. In fact, two binary labels were provided per annotated
sentence, one representing the annotators’ majority vote, and the other the annotators’ unanimous vote
w.r.t. that relevancy. Not all annotated sentences were labeled for all BDI symptoms. For development,
we randomly split the official annotated subsection of the training set (26,290 sentences) into training
(train; 80%) and validation (val; 20%) sets. We stratified the splits per symptom and per label (majority
and unanimity).
        </p>
        <sec id="sec-3-1-1">
          <title>3https://huggingface.co/sentence-transformers/all-mpnet-base-v2 4https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 5https://huggingface.co/sentence-transformers/all-distilroberta-v1</title>
          <p>[Tab. 2: label counts per split for the 21 BDI symptoms (Sadness, Pessimism, Past failure, Loss of pleasure, Guilty feelings, Punishment feelings, Self-dislike, Self-criticalness, Suicidal thoughts or wishes, Crying, Agitation, Loss of interest, Indecisiveness, Worthlessness, Loss of energy, Changes in sleeping pattern, Irritability, Changes in appetite, Concentration difficulty, Tiredness or fatigue, Loss of interest in sex) and in total.]</p>
          <p>Tab. 2 shows the distribution of labels per BDI symptom in our development splits. We
purposefully mixed annotated sentences from the 2023 and 2024 editions in both train and val splits to,
first, avoid annotation biases that might have occurred in any one of the previous editions, and second,
improve data imbalances.</p>
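The per-symptom 80/20 stratified split described above can be sketched as follows. This is a minimal stdlib-only illustration, not the authors' actual code; the function name and the (majority, unanimity) tuple representation of labels are assumptions.

```python
import random
from collections import defaultdict

def split_symptom(sentences, labels, val_frac=0.20, seed=42):
    """80/20 train/val split for one BDI symptom, stratified on the
    (majority, unanimity) label pair. Illustrative sketch only."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)  # label pair -> indices with that pair
    for idx, lab in enumerate(labels):
        by_stratum[lab].append(idx)
    train_idx, val_idx = [], []
    for lab, idxs in by_stratum.items():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_frac)  # 20% of each stratum to val
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    train = [(sentences[i], labels[i]) for i in sorted(train_idx)]
    val = [(sentences[i], labels[i]) for i in sorted(val_idx)]
    return train, val
```

Stratifying on the joint label pair preserves the proportion of each (majority, unanimity) combination in both splits, which matters given the class imbalance noted above.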
          <p>During the data exploration stage, we noticed duplicated sentences in the official training set, albeit
with varying capitalization or formatting (e.g., “I’m sad” and “i’m sad.”). These duplicates were not
always coherently annotated, and thus represented both a source of possible training data leakage and
labeling noise. To avoid both, we preemptively dropped all lower-cased and stripped duplicates, keeping
only the first occurrence. We labeled the kept occurrence with the majority vote over the majority or
unanimity labels of the corresponding duplicates.</p>
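The deduplication step can be sketched as below. This is an illustrative reconstruction under stated assumptions: records are (text, label) pairs, normalization is lower-casing plus whitespace stripping, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def deduplicate(records):
    """Drop duplicates that differ only in case or surrounding whitespace,
    keeping the first occurrence and relabeling it with the majority vote
    over the duplicates' labels. `records` is a list of (text, label)
    pairs (illustrative shape)."""
    groups = defaultdict(list)  # normalized text -> [(text, label), ...]
    order = []                  # first-seen order of normalized keys
    for text, label in records:
        key = text.strip().lower()
        if key not in groups:
            order.append(key)
        groups[key].append((text, label))
    result = []
    for key in order:
        entries = groups[key]
        first_text = entries[0][0]  # keep the first occurrence's text
        votes = Counter(label for _, label in entries)
        majority = votes.most_common(1)[0][0]  # majority vote over duplicates
        result.append((first_text, majority))
    return result
```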
          <p>The official test set comprised 17,558,066 sentences. Labels were not available for the official test set
during the development stage. Indeed, the task’s objective was not to classify the relevance of each test
sentence for each BDI symptom, but instead to retrieve and rank up to 1,000 sentences from the test
set, for each BDI symptom.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Fine-Tuning of Foundation Models</title>
        <p>Given the dichotomy in available labels per sentence in the train and val splits (i.e., the majority and
unanimity annotations), we framed foundation model fine-tuning as a regression task to take advantage
of all the available data. To that end, we mapped all majority and unanimity labels to a continuous
scale between 0 and 1 using the mapping function shown in Eq. 1. This scale encodes the intuition
that unanimously labeled sentences are closer to the given BDI symptom than majority-labeled ones. In this regression
framework, we fine-tuned the pretrained deberta-v3-large6 [19] foundation model on the train split
for each of the 21 BDI symptoms, obtaining 21 fine-tuned models. Each model was fine-tuned for 20</p>
        <sec id="sec-3-2-1">
          <title>6https://huggingface.co/microsoft/deberta-v3-large</title>
          <p>[Tab. 3 prompt: “Please generate 100 different sentences that are topically relevant to this item. Be as diverse in the language as possible. Just the sentences, nothing else.”]</p>
          <p>epochs, and the epoch with the highest performance on the val split was selected. We refer to this as the
mix23 approach. When reverting to the classification setting, outputs ≥ 0.5 were considered positive
(i.e., relevant sentences).</p>
          <p>mapping(s) = 0, if majority_label(s) = 0; 2/3, if majority_label(s) = 1 and unanimity_label(s) = 0; 1, if unanimity_label(s) = 1. (1)</p>
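The mapping of Eq. 1 is straightforward to express in code; this sketch uses a hypothetical function name and assumes the two binary annotations as integer inputs.

```python
def label_to_score(majority: int, unanimity: int) -> float:
    """Eq. 1: map the two binary annotations to a 0-1 regression target.
    Unanimously relevant sentences score 1, majority-only ones 2/3,
    and non-relevant sentences 0."""
    if majority == 0:
        return 0.0
    if majority == 1 and unanimity == 0:
        return 2.0 / 3.0
    return 1.0  # unanimity == 1 implies majority == 1
```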
          <p>As reflected in Tab. 2, there were more negative relevance labels than positive. To overcome this,
we up-sampled our train split by synthesizing positive examples for each BDI symptom. Accordingly,
for each symptom, we prompted GPT-4o7, Claude Sonnet 3.78, and Qwen2.5-32B [20] to generate 100
relevant sentences each. Thus, 300 positively labeled synthetic sentences were added to the train data of
each symptom. We prompted these three models, instead of a single one, in order to promote variability
in the up-sampling. The data synthesis prompt is shown in Tab. 3. We performed the same foundation
model fine-tuning as described above with the up-sampled data mixed with the original train data.
We refer to this as the mix23-aug-1step approach. We also included another approach, referred to
as mix23-aug-2step, which further fine-tuned the mix23-aug-1step models with just the original data.
We performed this second fine-tuning step to ensure that the model observed the original train data
distribution last.</p>
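The data assembly behind the three fine-tuning variants can be summarized as below. This is a sketch of the training schedules only (the actual regression fine-tuning of deberta-v3-large is omitted); the function name and the (text, score) pair representation are assumptions.

```python
def build_finetune_data(original, synthetic, two_step=False):
    """Assemble per-symptom fine-tuning stages (sketch).

    `original` is the symptom's train split as (text, score) pairs with
    scores from Eq. 1; `synthetic` holds the LLM-generated sentences, all
    labeled as fully relevant (score 1.0). mix23-aug-1step trains once on
    the mixture; mix23-aug-2step then continues on the original data only,
    so the model sees the original train distribution last.
    """
    mixture = original + [(text, 1.0) for text in synthetic]
    if two_step:
        return [mixture, original]  # two consecutive fine-tuning stages
    return [mixture]                # single fine-tuning stage
```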
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Unsupervised Similarity-Based Approach</title>
        <p>We identified several labeling inconsistencies in the train and val splits, which motivated us to include
an unsupervised approach. Following related work, for each BDI symptom, we listed its options and
extracted the corresponding embeddings with the all-mpnet-base-v29 SentenceTransformer. We used
the same model to extract all train and val sentence embeddings. We calculated the maximum similarity
between each candidate sentence and the list of options of each BDI symptom, obtaining a single
cosine-similarity score per sentence in both train and val splits.</p>
        <p>Contrary to the related work, we preferred the maximum similarity between a candidate sentence
and the list of symptom options, instead of the average similarity, because, presumably, a sentence does
not have to be semantically similar to all symptom options to be considered relevant. This appears
especially true given the increase in symptom intensity entailed in the option listing (see examples
in Tab. 1); indeed, a sentence relevant to the maximum-intensity option of a given symptom may very
well be semantically distant from the least intense option, and averaging would dilute this
information; hence our choice of the maximum similarity score.</p>
        <sec id="sec-3-3-1">
          <title>7https://openai.com/index/hello-gpt-4o 8https://www.anthropic.com/news/claude-3-7-sonnet 9https://huggingface.co/sentence-transformers/all-mpnet-base-v2</title>
          <p>[Tab. 4 prompt excerpt: “Together with each sentence, you will receive a set of examples to help with the classification. Answer with just the grade. Use the format [GRADE]. Example: {example sentence}. Classification: {example classification} (... other examples ...) Sentence: {sentence to assess}. Classification:”]</p>
          <p>We used the similarity scores of train sentences for a given BDI symptom to define its classification
threshold: the average sentence similarity score plus two standard deviations. We mapped all similarity
scores to binary labels according to these thresholds. We refer to this as the maxcos approach.</p>
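The maxcos scoring and thresholding can be sketched with NumPy, assuming the sentence and option embeddings have already been computed (e.g., with the all-mpnet-base-v2 SentenceTransformer). Function names are illustrative.

```python
import numpy as np

def maxcos_scores(sentence_embs: np.ndarray, option_embs: np.ndarray) -> np.ndarray:
    """Maximum cosine similarity between each candidate sentence and any
    option of a BDI symptom (rows are embedding vectors)."""
    s = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    o = option_embs / np.linalg.norm(option_embs, axis=1, keepdims=True)
    return (s @ o.T).max(axis=1)  # one score per candidate sentence

def maxcos_threshold(train_scores: np.ndarray) -> float:
    """Per-symptom classification threshold: mean of the train-split
    similarity scores plus two standard deviations."""
    return float(train_scores.mean() + 2 * train_scores.std())
```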
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Prompt-Based Approaches</title>
        <p>
          LLMs have demonstrated impressive zero-shot and few-shot performance in several tasks and domains
[21, 18], which motivated us to explore such approaches for this task. We prompted GPT-4o-Mini
to assess whether a given sentence was relevant or not for a given BDI symptom. We performed a prompt
experimentation stage to arrive at the most adequate prompt wording, but also to gauge performance
w.r.t. providing sentence context, k-shot prompting (k ∈ {0, 1, 3, 5}), random examples, and semantic
similarity examples. We observed the following general behaviors:
• Adding sentence &lt;PRE&gt; and &lt;POST&gt; context decreased performance when compared to no context.
• With few-shot prompting (k &gt; 0):
– Selecting k random examples decreased performance below 0-shot prompting.
– Selecting k semantically similar examples increased performance above 0-shot prompting.
– The relevance of the selected examples to the sentence under assessment, i.e., the definition
of the semantic similarity strategy, was crucial for improved performance.
        </p>
        <p>
          Given these observations, we arrived at a k-shot prompting strategy, where the k examples were
selected based on their semantic similarity to the sentence under assessment. The pool of exemplars
was restricted to the 0 and 1 labels of the mapping shown in Eq. 1. This ensured a clear separation of
the two possible outcomes. Note that 2 × k examples are always selected (i.e., k per relevance label).
Our prompt is shown in Tab. 4. The prompt’s preamble was based on the previous edition’s official
annotation guidelines [14]. We refer to this as the k-shot approach, k ∈ {0, 1, 3, 5}.
        </p>
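The exemplar selection for k-shot prompting can be sketched as below: for each relevance label, pick the k pool sentences most similar to the query. The embedding model and the function name are assumptions; embeddings are passed in precomputed.

```python
import numpy as np

def select_exemplars(query_emb, pool_embs, pool_labels, k):
    """Pick the k most semantically similar exemplars per relevance label
    (2*k in total) for few-shot prompting. The pool is restricted to the
    0 and 1 scores of the Eq. 1 mapping (sketch)."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q  # cosine similarity of each pool sentence to the query
    chosen = []
    for label in (0, 1):
        idxs = [i for i, lab in enumerate(pool_labels) if lab == label]
        idxs.sort(key=lambda i: sims[i], reverse=True)
        chosen.extend(idxs[:k])  # k most similar exemplars for this label
    return chosen  # indices into the exemplar pool
```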
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Evaluation</title>
        <p>The oficial evaluation is based on IR metrics, such as Average Precision (AP), R-Precision (R-PREC),
Precision @10 (P@10), and Normalized Discounted Cumulative Gain @1000 (NDCG@1000). We believe
that these metrics cannot be locally implemented due to under-specification. Given the binary labels
available in the official training set and data imbalances, we evaluated our approaches under classical
classification metrics, namely F1. We designed our approaches to maximize F1 on the val split.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>In this section we present and discuss the results of our approaches in the development stage, i.e.,
based on standard classification metrics, namely F1, followed by the official evaluation results of our
submissions, which were based on standard IR metrics.</p>
      <sec id="sec-4-1">
        <title>4.1. Development Stage</title>
        <p>Tab. 5 shows the average F1 performance of the previously described approaches in our development
stage (i.e., on the val split, described in Tab. 2). We emphasize again that we framed our development
as a classification task, in light of the officially available annotator majority and unanimity
binary labels. Indeed, similar to the past edition’s official evaluation, we observe model performance in both
the majority and unanimity annotation settings. The average performance (± standard deviation) is across
all 21 BDI symptoms. The foundation model fine-tuning approaches (mix23, mix23-aug-1step, and
mix23-aug-2step) were the best performing across the board (↑ F1), and the most stable across the various symptoms
(↓ standard deviation). Although there was a small average improvement with train data up-sampling, it
does not seem to have been critical, as evidenced by the small deltas between mix23 and mix23-aug-1step,
and between mix23 and mix23-aug-2step. The unsupervised, similarity-based approach (maxcos) was the worst
performing, with the largest variation across symptoms. We note that zero-shot prompting (0-shot)
is also unsupervised and the worst performing of the prompt-based methods, although with higher
performance than maxcos, revealing the positive impact of model size for language representation and
encoding (estimated parameter count of GPT-4 ≫ all-mpnet-base-v2). The performance of the few-shot
prompting approaches (k-shot, k ≥ 1) is aligned with our preliminary findings: performance increases
with the number of in-context semantically similar examples k (as opposed to random examples);
however, performance does seem to plateau for k ≥ 5.</p>
        <p>We note that the performance of all approaches dropped from the majority annotation setting to the
unanimity one (see the Δ column in Tab. 5). We believe there are two main reasons for this: 1) there were
fewer positively labeled sentences in the unanimity setting, further exacerbating an already unbalanced
scenario, and 2) counter-intuitively, we observed in a preliminary stage that the unanimity labels were
the noisiest, leading to labeling inconsistencies learned by the models. Regarding these, we see
that there was a smaller delta between majority and unanimity performance in mix23-aug-1step and
mix23-aug-2step than in mix23. Under this assessment, it becomes clear that the up-sampling strategy
with synthesized examples was critical in improving prediction robustness. The same conclusion can
be extrapolated to the prompting strategies, since the delta between the majority and unanimity settings
decreased as the number of examples k increased. The maxcos approach had the smallest delta.</p>
        <p>Fig. 1 shows the F1 performance distribution of each approach, for each BDI symptom, in both the “majority” and “unanimity” settings. Performance
varied between symptoms. We emphasize two main measures in this plot: the median value and the
distribution wideness. The higher the median value, the better the F1 performance for that symptom across
methodological approaches. The wider the distribution, the more variation in F1 performance for that
symptom across the same approaches. Thus, tight distributions with high median performance are
indicative of symptoms that are, overall, “easy” to detect (under eRisk’s task definition). This includes,
e.g., the symptom of Guilty Feelings. Conversely, wide distributions with low median performance
are indicative of symptoms that are, overall, “hard” to detect. This includes, e.g., the symptoms of
Past Failure, Indecisiveness, and Loss of Interest in Sex. We also observe that there were distribution
differences for certain symptoms when comparing the majority and unanimity annotation
settings. The symptoms of Agitation, Changes in Sleeping Pattern, and Pessimism are such examples.
However, the overall trend (as given by the decreasing order of median performance) was maintained
between evaluation settings, suggesting that BDI symptoms were equally easy or difficult to detect
under both settings. Tab. 6 complements these results by showing the best performing approach (and
corresponding F1 score) for each BDI symptom, under both majority and unanimity evaluation settings.
This shows that there was not a single best methodological approach for the detection of all BDI
symptoms. However, as already suggested in Tab. 5, the foundation model fine-tuning approaches were
by far the most frequently best performing across symptoms. There was only one symptom for which
the best performing approach was unsupervised (Loss of Pleasure; 0-shot).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Official Submission Evaluation</title>
        <p>Each team was allowed to submit five independent runs to eRisk’s task. We designed our submissions
according to the development stage results discussed above. Note that the outputs of the foundation
model fine-tuning and similarity-based approaches were self-ranked (fine-tuning was framed as a
regression task on the 0–1 scale). This was not true for the prompt-based approaches, whose output was
always binary. Due to the size of the official test set, we used the maxcos approach to first filter the
candidate test sentences to those that it would positively label as relevant. Our submitted
runs were based on these remaining test sentences. Our five runs are detailed below:
• mix23. This submission consisted entirely of the sorted output of the mix23 approach described
above. For each symptom, we selected up to the first 1,000 sentences that this approach predicted
as positive.
• aug-best. This submission obtained the regression scores of each sentence with
both mix23-aug-1step and mix23-aug-2step, described above, and chose, per symptom, the one that
performed best in the development stage (see Tab. 6). For each symptom, we selected up to the
first 1,000 sentences that this approach predicted as positive.
• maxcos. This submission consisted entirely of the sorted output of the maxcos approach
described above. For each symptom, we selected up to the first 1,000 sentences that this approach
predicted as positive (i.e., output &gt; symptom-specific threshold).
• max. This submission ranked the candidate test sentences according to the
maximum score per sentence across the previous three submissions (mix23, aug-best, and maxcos).
This was an ensemble approach leveraging the findings in Tab. 6: some methods may be
particularly better (and more confident) than others at detecting sentence relevance. By ranking
candidate sentences by the maximum score of three different approaches, this ensemble
prioritizes the individual strengths of each approach.
• unanimity. This submission selected only those sentences that were predicted as
positive by all of the first three submissions (mix23, aug-best, and maxcos) and, subsequently,
were also positively predicted by the prompt-based approach with five examples. These sentences were
ranked according to the minimum score of those three submissions. This ensemble approach
emphasizes precision and is made further conservative by its minimum-score ranking.</p>
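        <p>The two ensemble runs above can be sketched as follows. This is a minimal illustration, not the submission code: it assumes each constituent run exposes a per-sentence score dictionary, and the 0.5 positive-label cutoff stands in for the per-run, symptom-specific decision thresholds.</p>

```python
# Sketch of the max and unanimity ensemble runs. `scores[run][sid]` is the
# relevance score a run assigns to sentence `sid`; run names and the binary
# cutoff (score > 0.5) are illustrative assumptions.

def max_ensemble(scores, runs=("mix23", "aug-best", "maxcos"), limit=1000):
    """Rank sentences by the maximum score any constituent run assigns."""
    sids = set.union(*(set(scores[r]) for r in runs))
    best = {sid: max(scores[r].get(sid, 0.0) for r in runs) for sid in sids}
    return sorted(best, key=best.get, reverse=True)[:limit]

def unanimity_ensemble(scores, positive, runs=("mix23", "aug-best", "maxcos"),
                       limit=1000):
    """Keep only sentences that every run marks positive (and that passed the
    prompt-based check, given as `positive`), ranked conservatively by the
    minimum score across runs."""
    keep = [sid for sid in positive
            if all(scores[r].get(sid, 0.0) > 0.5 for r in runs)]
    worst = {sid: min(scores[r][sid] for r in runs) for sid in keep}
    return sorted(worst, key=worst.get, reverse=True)[:limit]
```

        <p>The design difference is visible directly in the aggregation: max rewards any single confident method (recall-leaning), while unanimity requires agreement and ranks by the weakest score (precision-leaning).</p>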
        <p>The official evaluation performance of our runs, according to IR metrics, is shown in Tab. 7. The
run unanimity performed best on the AP, R-PREC, and P@10 metrics. The run max performed
best on the NDCG@1000 metric. Notably, both of these runs were ensemble methods,
highlighting the importance of leveraging different approaches to capture all the relevant information
in the candidate sentences. This was in line with our discussion of results in the development stage. The
NDCG@1000 metric, in particular, rewards not only correctly identifying relevant sentences, but
also ranking them well; indeed, the run max placed first the sentences for which one of its ensembled
methods was highly confident, thus performing better on this metric than the precision-centric and
highly conservative unanimity run. We also note that mix23 outperformed aug-best on all metrics except
P@10. The LLM-synthesized sentences used to fine-tune the mix23-aug-1step and mix23-aug-2step
approaches were fairly obvious with respect to their symptom relevance. This may have caused the
aug-best run to perform accurately on “obvious” candidate sentences (hence the superior P@10 score), at
the cost of under-performing on less “obvious” ones, which were placed further down the list and thus not
captured by P@10. The relative performance of our runs was identical in both the majority and unanimity
annotation settings. We were the best performing team for all evaluation metrics.</p>
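        <p>For reference, the way NDCG@k rewards both correctness and ordering can be seen in a standard binary-relevance formulation; this is a sketch of the generic metric, not eRisk’s official evaluation script:</p>

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels, in rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=1000):
    """NDCG@k: DCG of the submitted ranking divided by the DCG of the ideal
    (relevance-sorted) ranking, so a relevant sentence placed low in the list
    contributes less than the same sentence placed high."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

        <p>The logarithmic discount is what favors the max run’s confidence-first ordering over unanimity’s conservative minimum-score ordering on this metric.</p>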
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The relation between mental health and linguistic expression opens up opportunities for early detection
of depression symptoms in online platforms. eRisk’s task of sentence ranking for depression symptoms
aims to explore these opportunities. In this work, we discussed our approaches to this year’s edition of
the task. Our methodology was largely aligned with related work and tackled some of the official data’s
limitations, such as duplicates, labeling inconsistencies, label imbalances, and labeling dichotomy (i.e.,
the majority and unanimity annotations). We explored multiple techniques, including foundation model
fine-tuning in a regression framework (to leverage all data available across the two annotations), with and
without additional synthetic data, similarity-based unsupervised methods, and LLM few-shot prompting.
Our local development evaluation, based on classification metrics, revealed foundation model
fine-tuning as the best performing, followed by few-shot prompting with five examples. Unsupervised
similarity-based methods were the worst performing. Based on these results, we submitted five runs for
official IR-metric evaluation, two of which used ensemble methods. These achieved the highest scores,
outperforming submissions from 16 other teams.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by Portuguese national funds through FCT, Fundação para a Ciência e a
Tecnologia, under project UIDB/50021/2020 (doi:10.54499/UIDB/50021/2020), and by the Portuguese
Recovery and Resilience Plan and Next Generation EU European Funds, through project
C64486576200000008 (Accelerat.AI).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <p>The author(s) have not employed any Generative AI tools.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <sec id="sec-8-1">
        <p>and Labs of the Evaluation Forum (CLEF), 2024, pp. 9–12. URL: https://ceur-ws.org/Vol-3740/
paper-73.pdf.
[13] Y. Tausczik, J. Pennebaker, The Psychological Meaning of Words: LIWC and Computerized Text
Analysis Methods, Journal of Language and Social Psychology 29 (2009) 24–54. doi:10.1177/
0261927X09351676.
[14] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2024: Early
Risk Prediction on the Internet, in: Experimental IR Meets Multilinguality, Multimodality,
and Interaction: International Conference of the CLEF Association, Part II, 2024, pp. 73–92.
doi:10.1007/978-3-031-71908-0_4.
[15] J. Parapar, A. Perez, X. Wang, F. Crestani, Overview of eRisk 2025: Early Risk Prediction on the
Internet (Extended Overview), in: Working Notes of the Conference and Labs of the Evaluation
Forum (CLEF 2025), Madrid, Spain, 9-12 September, CEUR Workshop Proceedings, CEUR-WS.org,
2025. To be published.
[16] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector
Space, Computing Research Repository arXiv:1301.3781 (2013). doi:10.48550/arXiv.1301.3781.
[17] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,
in: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for
Computational Linguistics, 2019, pp. 3982–3992. doi:10.18653/v1/D19-1410.
[18] OpenAI, GPT-4 Technical Report, Computing Research Repository arXiv:2303.08774 (2024).
doi:10.48550/arXiv.2303.08774.
[19] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing, Computing Research Repository arXiv:2111.09543
(2021). doi:10.48550/arXiv.2111.09543.
[20] Qwen Team, Qwen2.5 Technical Report, Computing Research Repository arXiv:2412.15115 (2024).
doi:10.48550/arXiv.2412.15115.
[21] DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement
Learning, Computing Research Repository arXiv:2501.12948 (2025). doi:10.48550/arXiv.2501.
12948.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Prince</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maselko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <article-title>No Health Without Mental Health</article-title>
          ,
          <source>The Lancet</source>
          <volume>370</volume>
          (
          <year>2007</year>
          )
          <fpage>859</fpage>
          -
          <lpage>877</lpage>
          . doi:10.1016/s0140-6736(07)61238-0.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Paykel</surname>
          </string-name>
          , Basic Concepts of Depression,
          <source>Dialogues in Clinical Neuroscience</source>
          <volume>10</volume>
          (
          <year>2008</year>
          )
          <fpage>279</fpage>
          -
          <lpage>289</lpage>
          . doi:10.31887/dcns.2008.10.3/espaykel.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Gold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Köhler-Forsberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Moss-Morris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mehnert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Miranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bullinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Steptoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Whooley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Otte</surname>
          </string-name>
          , Comorbid Depression in Medical Diseases,
          <source>Nature Reviews Disease Primers</source>
          <volume>6</volume>
          (
          <year>2020</year>
          )
          <fpage>69</fpage>
          . doi:10.1038/s41572-020-0211-z.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>Depression and Self-Harm Risk Assessment in Online Forums</article-title>
          ,
          <source>in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <source>ACL</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2968</fpage>
          -
          <lpage>2978</lpage>
          . doi:10.18653/v1/D17-1322.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Desmet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soldaini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Macavaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>SMHD: A Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions</article-title>
          ,
          <source>in: Proceedings of the International Conference on Computational Linguistics (COLING)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1485</fpage>
          -
          <lpage>1497</lpage>
          . URL: https://aclanthology.org/C18-1126/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. O</given-names>
            <surname>'Dea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Boonstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Venkatesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Christensen</surname>
          </string-name>
          ,
          <article-title>The Relationship Between Linguistic Expression in Blog Content and Symptoms of Depression, Anxiety, and Suicidal Thoughts: A Longitudinal Study</article-title>
          ,
          <source>Plos One</source>
          <volume>16</volume>
          (
          <year>2021</year>
          )
          <fpage>e0251787</fpage>
          . doi:10.1371/journal.pone.0251787.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Yahya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. Abdul</given-names>
            <surname>Rahim</surname>
          </string-name>
          ,
          <article-title>Linguistic Markers of Depression: Insights from English-Language Tweets Before and During the COVID-19 Pandemic</article-title>
          ,
          <source>Language and Health</source>
          <volume>1</volume>
          (
          <year>2023</year>
          )
          <fpage>36</fpage>
          -
          <lpage>50</lpage>
          . doi:10.1016/j.laheal.2023.10.001.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Trifu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nemeș</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Herta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bodea-Hategan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Talaș</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Coman</surname>
          </string-name>
          ,
          <article-title>Linguistic Markers for Major Depressive Disorder: A Cross-Sectional Study using an Automated Procedure</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1355734</fpage>
          . doi:10.3389/fpsyg.2024.1355734.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2023: Early Risk Prediction on the Internet</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: International Conference of the CLEF Association</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>294</fpage>
          -
          <lpage>315</lpage>
          . doi:10.1007/978-3-031-42448-9_22.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>eRisk 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental Foundations</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: International Conference of the CLEF Association</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>346</fpage>
          -
          <lpage>360</lpage>
          . doi:10.1007/978-3-319-65813-1_30.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Recharla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bolimera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Madasamy</surname>
          </string-name>
          ,
          <article-title>Exploring Depression Symptoms through Similarity Methods in Social Media Posts</article-title>
          ,
          <source>in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>763</fpage>
          -
          <lpage>772</lpage>
          . URL: https://ceur-ws.org/Vol-3497/paper-065.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Ang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Gollapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-K.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>NUS-IDS@eRisk2024: Ranking Sentences for Depression Symptoms Using Early Maladaptive Schemas and Ensembles</article-title>
          , in: Working Notes of the Conference
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>