<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELiRF-UPV at eRisk 2025: New Approaches to the Detection and Early Detection of Symptoms and Signs of Depression</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreu Casamayor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vicent Ahuir</string-name>
          <email>vahuir@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Molina</string-name>
          <email>amolina@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lluís-Felip Hurtado</string-name>
          <email>lhurtado@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València</institution>
          ,
          <addr-line>Camino de Vera s/n, 46022 Valencia.</addr-line>
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ValgrAI: Valencian Graduate School and Research Network of Artificial Intelligence</institution>
          ,
          <addr-line>Camí de Vera s/n, València, 46022</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes the approaches taken by the ELiRF-UPV team in Tasks 1 and 2 of eRisk at CLEF 2025. The tasks focus on the detection of mental health disorders in English-language social media, specifically: Search for Symptoms of Depression and Contextualized Early Detection of Depression. For Task 1, we developed lightweight adapters for pre-trained similarity Transformers, enabling efficient adaptation to limited data. These adapters used attention over reference embeddings to enhance input representations for relevance prediction across 21 depression symptoms. For Task 2, we explored three approaches: an SVM model and two Longformer-based models pre-trained on the mental health domain. One of the two Longformer-based approaches employed a straightforward fine-tuning strategy, while the other used task-adapted training enhanced by a data augmentation technique. Based on the validation results, overall performance fell short of expectations. A detailed analysis is required to identify the underlying causes.</p>
      </abstract>
      <kwd-group>
        <kwd>Longformer</kwd>
        <kwd>Transformers</kwd>
        <kwd>Support Vector Machine</kwd>
        <kwd>Sentence embeddings</kwd>
        <kwd>self-attention</kwd>
        <kwd>Mental disorder detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Depression, clinically known as major depressive disorder (MDD), is a common and serious mental health condition that negatively affects how a person feels, thinks, and behaves. It is characterized by persistent sadness, loss of interest or pleasure in daily activities, feelings of worthlessness, and, in severe cases, thoughts of death or suicide. Depression affects people across all age groups and backgrounds. Still, it is particularly prevalent among adolescents and young adults, with women being nearly twice as likely to experience it as men [1].</p>
      <p>Beyond its emotional toll, depression can severely impair physical health and social functioning, often co-occurring with other psychiatric disorders such as anxiety, substance use, or eating disorders. Despite its high prevalence and the availability of effective treatments, depression frequently goes undiagnosed or untreated due to stigma, lack of awareness, and insufficient access to mental health services [2].</p>
      <p>As a result, analyzing social interactions has recently become a key approach for identifying the risk
of depression. However, detecting depression is complex due to factors such as the quantity and quality
of the available data. To address this challenge, CLEF eRisk [3, 4] introduced a series of tasks designed
to provide high-quality data and foster the development of models for early detection.</p>
      <p>
        This year, the organizers of eRisk have proposed three tasks: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Search for Symptoms of Depression,
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) Contextualized Early Detection of Depression, and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) Conversational Depression Detection via
Large Language Models (LLMs).
      </p>
      <p>Our team has focused on the first two tasks, applying distinct approaches to address the specific challenges presented by each one:
• Task 1: To approach this task, we designed systems with minimal training computational cost that are capable of adapting to limited-data scenarios. Our approach focused on developing adapters for pre-trained similarity Transformer models. Our systems leverage the model’s existing capabilities while refining its performance specifically for the relevance task tied to the 21 distinct symptoms of depression. The core addition lay in the attention mechanisms within this adapter module; these utilized different sets of reference embeddings to dynamically enhance the input representation before applying regression feedforward layers.
• Task 2:
1. The first approach is based on a classical machine learning algorithm: Support Vector Machines (SVM). SVMs have consistently performed well in long-text classification tasks, making them a suitable choice for this context. This method is intended to serve as a baseline for assessing the effectiveness of the other models.
2. The second approach leverages Transformer architectures [5], using a pre-trained Longformer model [6] as its foundation. Due to its capacity to handle significantly longer input sequences, Longformer is particularly well-suited for tasks that demand the assimilation of extensive contextual information. This approach employed a straightforward fine-tuning strategy for model training, where each user is represented by a single sample.
3. The final approach builds upon the second by further adapting the model to the specific task of early detection. We employed task-adaptive fine-tuning to better align the training process with the objectives of the target task. To this end, we constructed a new dataset by concatenating user messages to generate multiple samples per user. This strategy exposes the model to various samples of each user’s textual data with different labels.</p>
      <p>For Task 1, we submitted five runs using the approach described above. For Task 2, we submitted five runs: one based on the first approach, two based on the second, and two based on the final approach. In each case, the best-performing model was selected through a preliminary evaluation phase, during which multiple model configurations and datasets were tested to identify the most effective setup.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Description of Tasks and Datasets</title>
      <p>This year, the competition has focused on detecting depression across different formats. For each task, a distinct dataset and a specific objective were provided.
2.1. Task 1
Task 1 ranks a collection of user-generated texts according to the relevance of depression-related symptoms. The ranking reflects how indicative each text is of the presence or absence of specific symptoms. For this task, the Beck Depression Inventory (BDI) questionnaire was used, which defines 21 possible symptoms of depression. A text is considered relevant if it provides information about the current state of any symptom, whether that information is positive or negative.</p>
      <p>The dataset consisted of a TREC-formatted, sentence-tagged corpus constructed from previous eRisk data [7], in combination with the BDI questionnaire. Each sentence was annotated to facilitate the identification of content relevant to the 21 depressive symptoms defined by the BDI. The provided training dataset contained 19 806 893 sentences, of which only 28 025 (0.14%) were considered informative for one or more depressive symptoms. For testing, 17 553 441 sentences were provided. Participants were expected to label all samples using the BDI questionnaire, specifying the degree of relevance for each symptom, and to submit the top 1000 most informative/relevant sentences per symptom.
2.2. Task 2
Task 2 focuses on the early detection of signs of depression through the comprehensive analysis of social media interactions, that is, considering the full context of conversations. This year, the task differs from previous editions [7], where the analysis was limited to isolated user content. Texts must be processed in the chronological order in which they were created, simulating more realistically how a system would monitor user interactions in real time on blogs, social networks, or other online platforms.</p>
      <p>The dataset provided by the organizers has been extracted from the competitions of previous years and follows a specific format [8]. It consists of a collection of writings sourced from social media. For each message, the dataset includes both the messages written by a user and that user’s complete interaction with others.</p>
      <p>The dataset is composed of two classes: users diagnosed with depression and control users (non-depressed). Table 1 shows the distribution across the different labels.</p>
      <p>
        As mentioned, the primary goal of this competition is to predict signs of depression as early as possible. To simulate realistic conditions, the organizers set up a server that sequentially delivers data packets, each containing a conversation in which the target user participates. The system must determine whether the user shows signs of depression by analyzing the current message and the history of prior messages before receiving the next packet.
3. Task 1
For this competition, we wanted to explore solutions with low computational cost during training that could adapt to situations with limited data. Therefore, we decided to start with similarity Transformer models and add an adapter at the output that performs the relevance task for the 21 depression symptoms. This adapter adds feedforward layers to perform the regression task and attention mechanisms over different sets of reference embeddings to improve the representation passed on to the dense layer intended for regression. In this work, we have explored adding reference embeddings of two kinds: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) a set of centroids extracted from the embeddings of the training sentences, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) embeddings of question-answer pairs from the 21 questions of the Beck Depression Inventory (BDI) questionnaire for detecting depression symptoms (84 pairs in total, four answers per question).
      </p>
      <sec id="sec-2-1">
        <title>3.1. Partitions for development</title>
        <p>Since the provided training dataset was highly imbalanced towards non-relevant samples, we decided to reduce that imbalance during development. If we had preserved the real distribution, the systems would have tended to score sentences as irrelevant most of the time. Therefore, we provided two irrelevant sentences for each relevant sentence. Additionally, we made a stratified partition of the relevant samples by combining the symptoms for which each sentence is informative (one sentence could be relevant for more than one symptom). We reserved 80% of the relevant samples for training and 20% for evaluation. Table 2 shows the distribution of the development dataset.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Adapter architecture</title>
        <p>The regression adapter was designed to predict relevance for the 21 symptoms of depression by transforming an input similarity embedding into a richer representation through the concatenation of three distinct embeddings. First, the original embedding served as the foundational input, capturing raw feature interactions. Second, a self-attention layer processed this embedding with respect to a set of precomputed centroids derived from clustering algorithms (e.g., k-means), generating an embedding that encoded similarity and dissimilarity relative to these clusters, thereby highlighting structural patterns or group memberships. Third, another self-attention layer compared the original embedding to the collection of embeddings extracted from each question-answer pair of the BDI guide, which aimed to capture contextual relationships and semantic alignment with the guide. By concatenating these three components (original, cluster-based, and reference-aligned), the adapter synthesized a comprehensive representation that integrated local feature interactions, cluster-level abstractions, and external contextual cues.</p>
        <p>In this work, we explored different combinations of these three types of embeddings for performing the regression task. In that way, we could evaluate the relevance of each piece of information to the system’s ranking performance.</p>
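        <p>The shape of this concatenation-based adapter can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: random vectors stand in for the real MiniLM embeddings, and attend is a simplified similarity-weighted average rather than the full attention mechanism of Section 3.3; the dimensions (384-d embeddings, 300 centroids, 84 BDI pairs, 21 symptom outputs) follow the text.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_SYMPTOMS = 384, 21          # MiniLM embedding size, BDI symptoms

def attend(x, refs):
    """Simplified stand-in for the adapter's attention: a similarity-
    weighted average of the reference embeddings."""
    scores = refs @ x / np.sqrt(DIM)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ refs

def adapter_forward(x, centroids, bdi_refs, W, b):
    """Concatenate the original embedding with its cluster-based and
    BDI-based views, then apply a linear head for the 21 symptoms."""
    rich = np.concatenate([x, attend(x, centroids), attend(x, bdi_refs)])
    return rich @ W + b

x = rng.normal(size=DIM)                       # sentence embedding
centroids = rng.normal(size=(300, DIM))        # clustering centroids
bdi_refs = rng.normal(size=(84, DIM))          # 21 questions x 4 answers
W = 0.01 * rng.normal(size=(3 * DIM, N_SYMPTOMS))
b = np.zeros(N_SYMPTOMS)
preds = adapter_forward(x, centroids, bdi_refs, W, b)
print(preds.shape)  # (21,)
```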
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Adapter self-attention mechanism</title>
        <p>The self-attention used in this work was based on the one used in the Transformer architecture [5]. The attention mechanism operated within a framework that aimed to dynamically modify the input sentence representations based on a collection of reference embeddings. In this process, the system first computed the attention information over the reference embeddings (which embeddings were more relevant for the given sentence embedding), and then modulated the sentence embedding based on that information.</p>
        <p>For computing the attention information, the list of reference embeddings was treated as a sequence of length N (the number of references), and the self-attention mechanism was computed in the same way as in Vaswani et al. (2017). However, the query (Q), key (K), and value (V) were modulated by the input embedding x before computing the attention scores, by performing the dot product between x and each reference embedding. Once the information of x was taken into account, the attention scores were computed as usual (d is the embedding dimension):
scores = (Q · Kᵀ) / √d.</p>
        <p>Once the attention scores were calculated, a softmax normalization and dropout regularization were applied over the attention weights, followed by the computation of the context-aware output by aggregating the values V according to the refined probabilities:</p>
        <p>AttInfo = dropout(softmax(scores)) · V.</p>
        <p>With the attention information computed (AttInfo), we used it to modify the original sentence embedding (x) with a contraction operation, obtaining the final sentence embedding (x′):</p>
        <p>x′ = x · ∑ᵢ AttInfoᵢ.</p>
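        <p>The steps above can be sketched in NumPy. This is a reconstruction under assumptions, since some symbols were lost in the source: we take "modulated by the input embedding" to mean an element-wise product of each reference embedding with x, and the final contraction to be an element-wise product of x with the summed attention output.</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reference_attention(x, refs, drop_p=0.0, rng=None):
    """Attention over N reference embeddings, modulated by the input x.
    'Modulation' is assumed to be an element-wise product of each
    reference with x before standard scaled dot-product attention."""
    d = x.shape[0]
    Q = refs * x                               # (N, d) modulated queries
    K = refs * x                               # (N, d) modulated keys
    V = refs * x                               # (N, d) modulated values
    scores = Q @ K.T / np.sqrt(d)              # (N, N) attention scores
    weights = softmax(scores)
    if drop_p > 0.0 and rng is not None:       # dropout over attention weights
        mask = rng.random(weights.shape) >= drop_p
        weights = weights * mask / (1.0 - drop_p)
    att_info = weights @ V                     # (N, d) context-aware output
    return x * att_info.sum(axis=0)            # contraction back to (d,)

rng = np.random.default_rng(0)
x = rng.normal(size=384)                       # sentence embedding
refs = rng.normal(size=(84, 384))              # e.g. the 84 BDI pairs
x_prime = reference_attention(x, refs)
print(x_prime.shape)  # (384,)
```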
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Sentence embedding and reference embeddings</title>
        <p>
          Since the shared task was conducted with English sentences, we employed the all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) semantic similarity model from Sentence Transformers [9] to generate sentence embeddings. This choice was driven by two key factors: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) its compact size significantly reduces computational overhead, making it more efficient for resource-constrained environments; and (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) in preliminary experiments, the embeddings produced by this model showed properties that could be beneficial for our task.
        </p>
        <p>We extracted reference embeddings from two sources:
1. Clustering centroids: Based on the training data, we extracted different sets of centroids.</p>
        <p>With KMeans, we extracted 300 centroids, and with Agglomerative Clustering, 300 and 512 centroids. For clustering, we used the Scikit-Learn implementation [10].
2. BDI embeddings: For each question, we combined the question with an answer and extracted the embedding. To create each unique question-answer pair, we used the following template: "question: answer".</p>
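        <p>A minimal sketch of how the two kinds of reference embeddings could be built. Random vectors stand in for the all-MiniLM-L6-v2 sentence embeddings, the BDI entry shown is only an illustrative question with two of its answers, and taking per-cluster means for Agglomerative Clustering (which, unlike KMeans, exposes no centers) is our assumption.</p>

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Random vectors stand in for all-MiniLM-L6-v2 sentence embeddings.
rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(800, 384))

# KMeans exposes its centroids directly.
km = KMeans(n_clusters=300, n_init=2, random_state=0).fit(train_embeddings)
km_centroids = km.cluster_centers_                       # (300, 384)

# Agglomerative Clustering does not, so we take per-cluster means.
agg = AgglomerativeClustering(n_clusters=300).fit(train_embeddings)
agg_centroids = np.stack(
    [train_embeddings[agg.labels_ == c].mean(axis=0) for c in range(300)]
)                                                        # (300, 384)

# BDI references: one "question: answer" string per question-answer pair,
# shown here with an illustrative question and two of its answers.
bdi = {"Sadness": ["I do not feel sad.", "I feel sad much of the time."]}
bdi_pairs = [f"{q}: {a}" for q, answers in bdi.items() for a in answers]
print(km_centroids.shape, agg_centroids.shape, bdi_pairs[0])
```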
      </sec>
      <sec id="sec-2-5">
        <title>3.5. System configurations</title>
        <p>Table 3 summarizes the different configurations used for approaching Task 1. We used one system that combined the sentence embedding with the BDI guide set of embeddings: x-guide. Two that combine the sentence embedding with clustering centroids as references: x-cluster-kmeans[300] and x-cluster-agglomerative[300]. And another two that combine the sentence embedding with both clustering and BDI references: x-cluster-agglomerative[300]-guide and x-cluster-agglomerative[512]-guide.</p>
        <p>All systems were trained for 20 epochs, with a linear scheduler, a linear warmup ratio of 0.1, and a learning rate of 1 × 10⁻⁴. To select the final configuration for each model, we employed the validation partition and chose the configuration that maximized the Average Precision (AP). AP measures how well a ranked list of retrieved documents matches the relevant items, considering both the precision (accuracy of retrieved items) and their ranking order; it is defined as follows:
AP = (1/R) ∑ₖ₌₁ⁿ P(k) × rel(k)</p>
        <p>Where:
- R = total number of relevant documents for the query.
- P(k) = precision at cutoff position k.
- rel(k) = binary indicator (1 if the k-th document is relevant, else 0).
4. Task 2</p>
      </sec>
      <sec id="sec-2-6">
        <title>4.1. System architecture</title>
        <p>For this competition, we aimed to investigate two critical factors relevant to this type of task: the size of the conversational context and the effectiveness of task-specific fine-tuning approaches.</p>
        <p>The first factor concerns the amount of context required for accurate detection. Since each user may produce a large number of messages and conversations, input length becomes a critical consideration. One of our main objectives was to investigate the impact of contextual information on system performance, that is, to evaluate how different models perform based on the volume of context they are capable of processing. To this end, we evaluated three distinct systems: one based on classical machine learning techniques, another utilizing a RoBERTa model [11], and a third employing a Longformer model. These systems support varying input lengths, enabling us to analyze how context size influences prediction accuracy.</p>
        <p>1. Classical machine learning approaches have no limit on the input size.
2. The selected RoBERTa model has a limit of 512 tokens in the input.
3. The selected Longformer model has a limit of 4096 tokens in the input.</p>
        <p>The second factor explores the impact of task-specific fine-tuning compared to more general, traditional fine-tuning approaches. To evaluate this, we constructed two distinct datasets for training and evaluating the performance of the Transformer-based systems. This setup allowed us to analyze how closely aligned fine-tuning strategies affect model effectiveness on the depression detection task.
1. Dataset 1: We created a single sample per user by aggregating all their messages, for both positively and negatively labeled users. This approach ensures the dataset captures each user’s context and communication patterns.
2. Dataset 2: If a priori evidence were available indicating the specific message in which a user begins to exhibit symptoms associated with mental health risk, we could label all preceding messages as negative and the message in question, along with all subsequent ones, as positive. This would allow us to generate additional positive samples, potentially leading to more precise model training.</p>
        <p>We split the original dataset into training and evaluation sets for our experiments, allocating 80% of
the users to training and 20% to evaluation. We ensured that the proportion of positive (Depression)
and negative (None) users was maintained in both partitions. ?? shows the distribution of classes across
the dataset.</p>
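        <p>Dataset 1 construction reduces to grouping each user's messages in chronological order into a single document. A minimal sketch with hypothetical records:</p>

```python
from collections import defaultdict

# Hypothetical records: (user_id, timestamp, text, user_label).
messages = [
    ("u1", 2, "cannot sleep again", "depression"),
    ("u1", 1, "had a rough week", "depression"),
    ("u2", 1, "great run this morning", "none"),
]

# One sample per user: all messages, concatenated in chronological order.
by_user, labels = defaultdict(list), {}
for user, ts, text, label in sorted(messages, key=lambda m: (m[0], m[1])):
    by_user[user].append(text)
    labels[user] = label

dataset1 = [(" ".join(texts), labels[u]) for u, texts in by_user.items()]
print(dataset1[0])  # ('had a rough week cannot sleep again', 'depression')
```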
      </sec>
      <sec id="sec-2-7">
        <title>4.2. Classical Machine Learning Classifier Approach</title>
        <p>To evaluate the role of contextual information, we employed a classical machine learning classifier capable of processing the entire input context. A key limitation of Transformer-based models lies in their constrained capacity to manage lengthy texts, primarily due to input size restrictions. This limitation can adversely affect classification performance, as the input may fail to encompass the complete sample, thereby risking the omission of valuable information.</p>
        <p>Initially, we compared several classical machine learning classifiers. For this purpose, we utilized the Scikit-learn library [10], which offers a comprehensive set of tools to support experimental workflows. All classifiers were evaluated using their default hyperparameters to ensure a fair and unbiased comparison, without applying data preprocessing or preliminary analysis. Feature extraction was carried out using the TF-IDF method provided by Scikit-learn, which produces a vector representation based on the vocabulary size. The configuration employed in this setting matched the previous year’s competition [12], where it demonstrated superior performance in empirical evaluations. Configuration: TF-IDF with the "char_wb" analyzer and 4–5 character n-grams, extracting subword patterns within word boundaries. This helps capture prefixes, suffixes, and internal character sequences that are useful for handling spelling variations.
1. TweetTokenizer with stop word removal: In this approach, text was first tokenized using the TweetTokenizer, followed by eliminating stop words to reduce noise in the input.
2. TweetTokenizer with cleaning and lemmatization: This method extends the previous approach by incorporating additional preprocessing steps, including text cleaning (removal of non-alphanumeric characters) and lemmatization of tokens to standardize word forms.
• Sentiment Analysis: We utilized the Transformer-based model "lxyuan/distilbert-base-multilingual-cased-sentiments-student" [13] to perform sentiment analysis on each message for every user. The model classifies input into three sentiment categories: positive, negative, and neutral. The resulting sentiment scores were normalized and integrated as additional features alongside the TF-IDF vectors.</p>
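        <p>The TF-IDF configuration described above maps directly onto Scikit-learn. The toy documents and labels below are hypothetical, and the sentiment features are omitted for brevity:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Character n-grams (4-5) within word boundaries, as in the configuration.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 5))

# Toy one-document-per-user inputs with hypothetical labels.
docs = [
    "i feel hopeless and tired",
    "went hiking, feeling great",
    "everything feels pointless",
    "lovely dinner with friends",
]
labels = ["depression", "none", "depression", "none"]

X = vectorizer.fit_transform(docs)              # (4, vocabulary_size)
clf = LinearSVC(dual=True).fit(X, labels)       # default hyperparameters
print(clf.predict(vectorizer.transform(["i feel hopeless and tired"])))
```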
        <p>We performed an exhaustive grid search using Scikit-learn’s GridSearchCV utility to determine the hyperparameters for each model and improve the performance. The search explored combinations of the parameters C, tol, and loss, evaluating their impact on model performance. The ranges of values considered for each parameter were as follows:
• C: [10⁻¹, 1, 10, 100]
• tol: [0.1, 0.01, 0.001, 0.0001]
• loss: [hinge, squared_hinge]</p>
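        <p>The grid search can be sketched as follows. Synthetic features stand in for the TF-IDF vectors, and dual=True is our addition so that both hinge losses are valid for LinearSVC:</p>

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# The grid from the text (C = 10^-1 written as 0.1).
param_grid = {
    "C": [0.1, 1, 10, 100],
    "tol": [0.1, 0.01, 0.001, 0.0001],
    "loss": ["hinge", "squared_hinge"],
}

# Synthetic features stand in for the TF-IDF vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
y = (X[:, 0] > 0).astype(int)   # linearly separable toy target

# dual=True keeps both hinge losses valid for LinearSVC.
search = GridSearchCV(LinearSVC(dual=True), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```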
        <p>In total, we obtained four distinct configurations for experimentation. Table 6 summarizes these
configurations, with the hyperparameter values for each listed in the Hyperparameters column.</p>
        <p>Table 7 presents the results obtained on the evaluation partition. We observe that the second and fourth configurations achieved identical results, suggesting that thorough data cleaning is the fundamental factor influencing performance. Consequently, since both configurations yielded the same outcome, we opted for the one that demands fewer computational resources and preprocessing effort. The selected configuration was therefore SVM-t1-2, corresponding to a Linear SVM without sentiment analysis, using the second preprocessing method, which encompasses comprehensive data cleaning and preprocessing steps.</p>
      </sec>
      <sec id="sec-2-8">
        <title>4.3. Straightforward Fine-tuning Approach</title>
        <p>The Transformer architecture is widely recognized as the state of the art in natural language processing. We employed two Transformer-based models for this shared task: RoBERTa and Longformer [11, 6].
• RoBERTa: RoBERTa is known for its robustness and strong performance in various classification tasks. However, a significant limitation of this model lies in its inability to process input sequences longer than 512 tokens. This constraint poses challenges for tasks requiring long-range contextual understanding, such as those explored in this study. As such, RoBERTa was used as a baseline model for comparison with architectures designed to handle longer inputs.
• Longformer: The Longformer (short for "Long-Document Transformer") was specifically developed to process long sequences more efficiently, making it better suited than conventional Transformer models like BERT or RoBERTa for tasks involving extended context. The architecture incorporates several key innovations:
– New attention mechanism: Rather than employing full self-attention, Longformer uses a sliding window attention mechanism in which each token attends only to a fixed number of neighboring tokens. This design significantly reduces the computational cost.
– Global attention: In addition to local attention, Longformer enables selected tokens to receive global attention, allowing them to attend to all other tokens in the sequence. This hybrid mechanism enhances efficiency and the model’s ability to capture long-range dependencies.</p>
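        <p>The combination of sliding-window and global attention can be visualized as a boolean attention mask. This is an illustrative sketch, not the Longformer implementation itself:</p>

```python
import numpy as np

def longformer_mask(seq_len, window, global_idx):
    """Boolean sketch of Longformer attention: True where a query token
    (row) may attend to a key token (column)."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):                   # sliding-window attention
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    for g in global_idx:                       # global attention tokens
        mask[g, :] = True                      # attend to everything ...
        mask[:, g] = True                      # ... and everything attends back
    return mask

# e.g. a 16-token sequence, window of 2 on each side, a [CLS]-like global token.
m = longformer_mask(seq_len=16, window=2, global_idx=[0])
# The local part grows as O(n * w) instead of the O(n^2) of full attention.
print(m.shape, int(m.sum()))
```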
        <p>We conducted a literature review to identify a base model pre-trained on domains related to depression and mental health. According to the findings of Pourkeyvan et al. [14], the current state-of-the-art model for mental disorder detection is MentalRoBERTa [15]. MentalRoBERTa is a domain-adapted variant of the RoBERTa architecture, tailored explicitly for mental health applications. It has been pre-trained on a specialized corpus comprising texts from mental health forums, clinical notes, and general language sources, the Suicidal and Mental Health (SWMH) corpus [16]. This domain-specific pre-training allows the model to more effectively comprehend and process mental health–related language, thereby improving its performance and relevance in related tasks. In addition to MentalRoBERTa, the authors have also developed MentalLongformer, an adaptation of the Longformer architecture, which is similarly pre-trained on mental health-related data.</p>
        <p>The models selected for our experiments were AIMH/mental-roberta-large [17], representing
the RoBERTa-based architecture, and AIMH/mental-longformer-base-4096 [18], corresponding to the
Longformer-based architecture.</p>
        <p>Each base model was fine-tuned using Dataset 1, applying a straightforward fine-tuning strategy.
During this process, the models were trained using the complete contextual information available for
each user. Table 8 shows the configuration used in the fine-tuning process.</p>
      </sec>
      <sec id="sec-2-9">
        <title>4.4. Task Adaptive Fine-tuning</title>
        <p>The second objective of this study was to examine the impact of task-specific training, particularly in
the context of early detection. To this end, we implemented a data augmentation strategy designed to
adapt the original dataset to the requirements of an early detection setting. This strategy followed the
methodology employed in previous editions of the task [12].</p>
        <p>The augmentation process aimed to generate additional training instances for each user labeled as
positive. Specifically, we sought to identify the message at which the user first exhibited signs of a
mental health disorder. Once this critical message was determined, all possible concatenations of the
user’s message history were generated and labeled according to the following rule: message sequences
occurring before the critical message were labeled as None, while those including or following it were
labeled as depression.</p>
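          <p>The labeling rule can be sketched as follows, with a hypothetical four-message history whose critical message is at index 2:</p>

```python
# Hypothetical history for one positive user; the critical message
# (first sign of risk) is at index 2.
history = ["msg0", "msg1", "msg2", "msg3"]
critical = 2

# Every prefix concatenation of the history becomes a sample: prefixes
# ending before the critical message are labeled none, the rest depression.
samples = []
for end in range(1, len(history) + 1):
    text = " ".join(history[:end])
    label = "depression" if end - 1 >= critical else "none"
    samples.append((text, label))

for text, label in samples:
    print(label, "<-", text)
```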
        <p>To identify the critical message, we conducted a series of experiments in which various models
were trained and evaluated to detect the onset point of disorder-indicative language within each user’s
message sequence.</p>
        <sec id="sec-2-9-1">
          <title>4.4.1. Best Model for Data Augmentation</title>
          <p>The goal of this experimental phase was to identify the model that most accurately detected the critical
message—that is, the point at which a user begins to exhibit signs of a mental health disorder. To
ensure consistency with the shared task guidelines, we replicated the inference procedure defined by
the competition. Specifically, each model processed input batches consisting of a single message per
user and was required to make predictions according to this sequential setup as early as possible.</p>
          <p>For these experiments, the whole dataset was utilized to determine the critical message for each user
labeled as positive. The model that achieved the highest performance in this task was then selected to
generate the augmented data used in the early detection training phase.</p>
          <p>The models selected for this analysis were Longformer-t2, RoBERTa-t2, and SVM-t2-2, as they
demonstrated the highest performance during the training phase. Table 10 presents the experimentation
results. Among the evaluated models, SVM-t2-2 achieved the best performance in classifying users
under an early detection framework. Consequently, this model was selected for the data augmentation
process.</p>
          <p>This technique resulted in a new dataset with increased training samples. Table 11 presents the two datasets generated through the data augmentation process described above, each corresponding to a different maximum input length: 512 tokens for RoBERTa and 4096 tokens for Longformer. Dataset 2 refers to the version constrained to a maximum sequence length of 512 tokens, while Dataset 3 corresponds to the version that allows sequences of up to 4096 tokens.</p>
          <p>Table 11 presents the relationship between the classes. As can be observed, the application of data augmentation has further exacerbated the class imbalance. To address this issue, we conducted a preliminary analysis to discard information of limited utility for training. This analysis aimed to identify the users who contributed the most informative instances for each class, while disregarding those with minimal contribution. Additionally, we were particularly interested in retaining those users whose data lay near the decision boundaries, as these examples are crucial for effectively delineating class separations.</p>
          <p>To identify the users who provide the most informative examples for each class and those located
near the decision boundary, we relied on the decisions made by the SVM-t2-2 model and the scores
assigned to each user. For the None class, users with highly negative scores were selected, whereas
for the Depression class, those with highly positive scores were chosen. To capture users near the
decision boundary, we selected those misclassified, False Positives (FP), i.e., users incorrectly classified
as positive, and conversely, False Negatives (FN). We also included users whose scores were close to
zero, as they reflect uncertainty in classification and are likely to lie near the decision threshold.</p>
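          <p>These selection criteria reduce to a single predicate over the SVM scores. A simplified sketch follows; the margin and near-zero thresholds are illustrative values, not the ones actually used.</p>

```python
def keep_user(score, y_true, y_pred, margin=1.0, eps=0.1):
    """Decide whether a user's samples enter the refined dataset.
    score: SVM-t2-2 decision value (negative leans None, positive leans
    Depression); margin and eps are illustrative thresholds."""
    confident = abs(score) > margin    # strongly negative or strongly positive
    misclassified = y_true != y_pred   # false positives and false negatives
    near_zero = eps > abs(score)       # close to the decision threshold
    return confident or misclassified or near_zero

# (score, gold, predicted) triples for four hypothetical users
users = [(-2.3, 0, 0), (-0.05, 0, 0), (0.4, 0, 1), (0.6, 1, 1)]
kept = [keep_user(s, t, p) for s, t, p in users]
print(kept)  # → [True, True, True, False]
```

          <p>Only the weakly scored but correctly classified user is dropped: confident examples anchor each class, while misclassified and near-zero users populate the region around the decision boundary.</p>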
          <p>With the sample filtering, we obtained a new dataset with the following class distribution: 1,039
negative users (None) and 312 positive users (Depression). Table 12 presents the new class distributions
after applying data augmentation to the refined dataset. As shown, the proportion between positive
and negative samples is preserved.</p>
          <p>Table 13 reports the performance of the two models, RoBERTa-t2-t and Longformer-t2-t, which were
trained on datasets obtained via the refinement procedure followed by data augmentation.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Runs</title>
      <p>5.1. Task 1
Table 14 shows the system configuration used for each submitted run and the results obtained on
the evaluation partition. We observe that most of the systems achieve similar performance in AP,
R-Precision (R-PREC), and Normalized Discounted Cumulative Gain (NDCG). However, Run 5,
which uses all the available information (input sentence embedding, cluster centroids, and BDI
embeddings), achieved better Precision at 10 (P@10) than the rest of the systems.</p>
      <p>5.2. Task 2
The rationale behind selecting the models for our Task 2 runs is to validate the hypotheses of our research:
1. To assess the impact of context on prediction performance and to evaluate how different models
handle long-context inputs, we selected two types of models: SVMs and Longformers.
2. To explore the effect of task-specific fine-tuning for model adaptation, considering not only the
models’ ability to classify samples accurately, but also the efficiency of their inference.
To this end, we selected two types of Longformer models (task-specific and straightforward)
and two types of initial context constraints (0 and 100) to examine how task-specific training
influences the model’s decision on when to make a prediction.</p>
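      <p>The ranking metrics reported for Task 1 (AP, R-PREC, NDCG, P@10) can be computed from a ranked list of binary relevance judgements as follows. This is a self-contained sketch of the standard definitions, not the official evaluation tooling.</p>

```python
import math

def precision_at_k(relevance, k=10):
    """relevance: ranked 0/1 judgements, best-ranked first."""
    return sum(relevance[:k]) / k

def r_precision(relevance):
    """Precision at rank R, where R is the number of relevant items."""
    r = sum(relevance)
    return sum(relevance[:r]) / r if r else 0.0

def average_precision(relevance):
    """Mean of the precision values at each rank holding a relevant item."""
    hits, total = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def ndcg(relevance):
    """Discounted cumulative gain normalized by the ideal ordering."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance, start=1))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

ranked = [1, 0, 1, 1, 0]          # toy judgements for one symptom ranking
print(precision_at_k(ranked))      # → 0.3
```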
      <sec id="sec-3-1">
        <title>5.2.1. Runs Configurations</title>
        <p>In addition to selecting the model for each run, the classification systems required configuring additional
parameters.</p>
        <p>• For each round of the competition, we built a new input sample for the classifier
by combining the user’s latest message with the preceding ones. We also included all replies to
other users within the corresponding post.
• Each system has an initial context constraint, so we configured our systems to wait until the initial
context was sufficiently large. This constraint varied between systems and can be seen in Table 15.
• The Longformer systems have a token limit. Once this limit was reached, we simply returned the
last prediction made.</p>
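        <p>Putting these rules together, the per-user decision loop can be sketched as follows. The class and parameter names are ours, and a whitespace token count stands in for the model tokenizer.</p>

```python
class EarlyDetector:
    """Sketch of the per-round decision loop for one user.
    predict: trained classifier mapping text to 0 (None) or 1 (Depression)."""

    def __init__(self, predict, min_context=100, max_tokens=4096):
        self.predict = predict
        self.min_context = min_context  # initial context constraint (0 or 100)
        self.max_tokens = max_tokens    # model input limit (e.g. Longformer)
        self.history = []
        self.last = 0

    def round(self, post):
        # Each round combines the newest message with all preceding ones.
        self.history.append(post)
        text = " ".join(self.history)
        n = len(text.split())           # whitespace count stands in for tokenizer
        if self.min_context > n:
            return 0                    # wait until the initial context is large enough
        if n > self.max_tokens:
            return self.last            # context window full: repeat last prediction
        self.last = self.predict(text)
        return self.last

clf = lambda text: 1                    # stand-in for the trained model
det = EarlyDetector(clf, min_context=3, max_tokens=10)
print([det.round(p) for p in ["one two", "three four", "a b c d e f g"]])  # → [0, 1, 1]
```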
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Results</title>
      <p>6.1. Task 1
Table 16 details our systems’ performance in the competition. The results show that the system that
only includes the embeddings of the BDI guide is the one that performs best, differing from the
results on the evaluation partition. We observed significant differences between these results and those
from evaluation (Table 14), particularly in generalization ability, which fell well below the level
suggested by internal metrics such as the high P@10 scores (typically exceeding 0.9). Notably, considering
that the organizers manually verified our top-50 samples reported in Table 16, achieving only a
P@10 score of 0.1 suggests that the ranking capabilities did not transfer from evaluation to test as
effectively as anticipated. We cannot yet identify specific reasons for this discrepancy
without further analysis and research with the gold labels of the test set used to evaluate the systems.</p>
      <p>6.2. Task 2
Table 17 presents the results achieved by our team in Task 2. The structure of Table 17 is as follows:
each row corresponds to an individual run, with a special row highlighting the highest values obtained
in the competition. The systems were ranked based on their Macro-F1 score (last column). A total of 50
different systems (runs) participated in this task.</p>
      <p>Run 0 in Table 17, corresponding to SVM-t2-2, a linear SVM model, stands out as
the best-performing system. This run achieved sixth place in the competition, making ours the second-best
team overall. Based on these results, the following conclusions have been drawn:</p>
      <p>Although our initial assumption was that the Longformer would outperform thanks to its representational
power and capacity to handle long texts, the SVM ultimately achieved superior results, largely owing
to its ability to effectively process long-context inputs. Beyond this, it is important to highlight that
the SVM relies on features extracted through TF-IDF, a technique inherently focused on capturing
vocabulary frequency and distribution. This focus on specific words or n-grams may play a decisive
role in detecting signs of depression, as certain linguistic patterns can serve as strong indicators. The
combination of these two factors likely plays a fundamental role in enhancing detection performance.
These findings suggest that classical approaches, such as SVMs, continue to hold significant value for
mental health detection tasks, particularly due to their robustness in managing extensive contexts.
Moreover, their efficiency and lower computational demands make them especially well-suited for
deployment in resource-constrained environments.</p>
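      <p>A minimal sketch of such a TF-IDF plus linear SVM pipeline with scikit-learn follows; the toy corpus, labels, and hyperparameters are ours, not the actual competition configuration.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus; the real model was trained on full user histories.
texts = [
    "i feel hopeless and tired all the time",
    "i cannot sleep and i feel worthless",
    "great hike in the mountains today",
    "excited about starting the new project",
]
labels = [1, 1, 0, 0]  # 1 = Depression, 0 = None

# TF-IDF imposes no input-length limit, so arbitrarily long histories fit.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["lately i feel hopeless and worthless"]))
```

      <p>Because the vectorizer simply accumulates term statistics, a user’s entire posting history can be fed to the classifier at once, with inference reduced to a sparse dot product.</p>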
      <p>Secondly, the performance gap between the evaluation and test phases led our team
to hypothesize that the reason lies in our refinement of the corpus provided by the competition.
Specifically, by removing users based on the criteria defined by the SVM-t2-2 model, we may have
inadvertently eliminated critical information necessary for training the models. As a result, while the
models performed competitively during evaluation, the lack of this information during training likely
caused a drop in performance when moving to test-time inference. Conversely, the users selected
for training may have also introduced noise or unnecessary information into the training process.
Therefore, one of the underlying causes may be not only the removal of important information but the
very act of removing data itself. In this particular case, it is possible that the more data the model was
exposed to during training, the better it was able to generalize and adjust during the inference phase.</p>
      <p>When examining the results obtained by the Longformer models, it becomes clear that the use of
task-specific training has led to superior performance. This improvement can be attributed to the adaptation
of the training process through the newly generated dataset, created using data augmentation techniques.
Such training has effectively aligned the models with the specific demands of the task, allowing them
to generalize more effectively during the test phase. Moreover, this approach has enhanced the overall
stability of the models, enabling them to achieve a better balance between precision and recall, a critical
factor for reliable performance in practical applications.</p>
      <p>By comparing Run-1 and Run-3, which both correspond to Longformer models without initial context
constraints, we observe that the model trained with task-specific adaptation not only achieved superior
results but also, as its LatencyTP shows, demonstrated a clear tendency to wait for more
contextual information before making predictions. This finding suggests that task-specific training
endows the model with a learned capacity to delay decisions until sufficient evidence is gathered, rather
than rushing to early conclusions. Such behavior reinforces the broader principle that, in complex
prediction tasks, acting rapidly is not always advantageous; instead, making informed decisions based
on richer context can lead to more accurate and reliable outcomes.</p>
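      <p>LatencyTP, as used in the eRisk evaluation, is the median delay (in processed posts) of the true-positive alerts. A minimal sketch of its computation, with a helper name of our choosing:</p>

```python
import statistics

def latency_tp(decisions, gold):
    """Median number of posts processed before the first positive decision,
    over users whose gold label is positive and who were ever flagged.
    decisions: dict mapping user to per-round 0/1 outputs; gold: user to 0/1."""
    delays = []
    for user, rounds in decisions.items():
        if gold[user] == 1 and 1 in rounds:
            delays.append(rounds.index(1) + 1)  # 1-based round of first alert
    return statistics.median(delays) if delays else None

decisions = {"u1": [0, 0, 1, 1], "u2": [0, 1, 1, 1], "u3": [0, 0, 0, 0]}
gold = {"u1": 1, "u2": 1, "u3": 0}
print(latency_tp(decisions, gold))  # → 2.5
```

      <p>A model that waits for richer context before alerting therefore shows a higher LatencyTP, which is exactly the behavior observed for the task-adapted Longformer.</p>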
    </sec>
    <sec id="sec-5">
      <title>7. Conclusion</title>
      <p>In this paper, we have presented the participation of the ELiRF-UPV team in the shared tasks of eRisk
at CLEF 2025. Beyond evaluating both traditional classification models and cutting-edge Transformer
architectures, our team’s most innovative contribution was the application of Longformer models
to expand the context available for decision-making. We leveraged pre-trained models specifically
tailored to the mental health domain and introduced a new data augmentation and refinement technique
that customizes model training to the specific task at hand.</p>
      <p>In Task 1, our evaluation phase showed promising results for the designed systems. However, the
ranking performance observed during evaluation did not translate consistently to the test phase. We
recognize that without access to the test data, we were limited in providing deeper insights into
this performance discrepancy or analyzing its specific causes on real-world tasks. Further research
and analysis are necessary to pinpoint critical areas where refinement could enhance generalization
capabilities.</p>
      <p>In Task 2, while the results were promising for some systems, they ultimately did not fully demonstrate
the potential of Longformer models for this type of task. However, the experiments did
show that classical classifiers remain highly competitive, thanks to their lack of input-length restrictions,
faster processing, and low computational resource requirements.</p>
      <p>For future work, three main lines of improvement are identified. First, efforts will focus
on enhancing early detection so that the system requires less context to make accurate decisions; second,
the application of Explainable Artificial Intelligence (XAI) techniques will be explored
to gain deeper insights into the system’s behavior. Finally, we will explore alternative data augmentation
techniques to improve upon the current approach, aiming to fully leverage all the information provided
by the competition, or to enhance dataset refinement in order to avoid the loss of important information.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is partially supported by MCIN/AEI/10.13039/501100011033, by the "European Union" and
“NextGenerationEU/MRR”, and by “ERDF A way of making Europe” under grants
PDC2021-120846C44 and PID2021-126061OB-C41. Partially supported by the Vicerrectorado de Investigación de la
Universitat Politècnica de València PAID-01-23. It is also partially funded by the Spanish Ministerio de
Universidades under the grant FPU21/05288 for university teacher training.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and
spelling checking, paraphrasing, translation, and rewording. After using these tools/services, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is All You Need, Advances in Neural Information Processing Systems 30 (2017). URL:
https://arxiv.org/abs/1706.03762, accessed: 2024-05-15.
[6] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The Long-Document Transformer, arXiv preprint
arXiv:2004.05150 (2020). URL: https://arxiv.org/abs/2004.05150.
[7] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2024: Early Risk Prediction
on the Internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 15th
International Conference of the CLEF Association (CLEF 2024), volume 14959 of Lecture Notes in
Computer Science, Springer, 2024, pp. 73–92. URL: https://doi.org/10.1007/978-3-031-71908-0_4.
doi:10.1007/978-3-031-71908-0_4.
[8] D. E. Losada, F. Crestani, A Test Collection for Research on Depression and Language Use,
in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 7th International
Conference of the CLEF Association (CLEF 2016), volume 9822 of Lecture Notes in Computer
Science, Springer, 2016, pp. 28–39. URL: https://doi.org/10.1007/978-3-319-44564-9_3.
doi:10.1007/978-3-319-44564-9_3.
[9] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,
in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing,
Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay,
Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
URL: https://jmlr.org/papers/v12/pedregosa11a.html.
[11] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv preprint arXiv:1907.11692
(2019). URL: https://arxiv.org/abs/1907.11692.
[12] A. Casamayor, V. Ahuir, A. Molina, L.-F. Hurtado, ELiRF-VRAIN at eRisk 2024: Using LongFormers
for Early Detection of Signs of Anorexia, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller
(Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, volume
3740 of CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024, pp. 803–812. URL:
https://ceur-ws.org/Vol-3740/paper-65.pdf.
[13] L. X. Yuan, distilbert-base-multilingual-cased-sentiments-student (revision 2e33845), 2023. URL:
https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student.
doi:10.57967/hf/1422.
[14] A. Pourkeyvan, R. Safa, A. Sorourkhah, Harnessing the Power of Hugging Face Transformers for
Predicting Mental Health Disorders in Social Networks, IEEE Access 12 (2024) 28025–28035. URL:
http://dx.doi.org/10.1109/ACCESS.2024.3366653. doi:10.1109/access.2024.3366653.
[15] S. Ji, T. Zhang, L. Ansari, J. Fu, P. Tiwari, E. Cambria, MentalBERT: Publicly Available Pretrained
Language Models for Mental Healthcare, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri,
T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, S. Piperidis (Eds.),
Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language
Resources Association, Marseille, France, 2022, pp. 7184–7190. URL:
https://aclanthology.org/2022.lrec-1.778.
[16] S. Ji, X. Li, Z. Huang, E. Cambria, Suicidal ideation and mental disorder detection with attentive
relation networks, Neural Computing and Applications (2021).
[17] AIMH, MentalRoBERTa: A Robustly Optimized BERT Pretraining Approach for Mental Health,
2024. URL: https://huggingface.co/AIMH/mental-roberta-large, accessed: 2024-05-15.
[18] AIMH, MentalLongformer: A Long-Document Transformer Model for Mental Health, 2024. URL:
https://huggingface.co/AIMH/mental-longformer-base-4096, accessed: 2024-05-15.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Kessler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Berglund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Demler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koretz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Merikangas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. E.</given-names>
            <surname>Walters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>The epidemiology of major depressive disorder: results from the National Comorbidity Survey Replication (NCS-R)</article-title>
          ,
          <source>JAMA</source>
          <volume>289</volume>
          (
          <year>2003</year>
          )
          <fpage>3095</fpage>
          -
          <lpage>3105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>World Health Organization</collab>
          , Depression,
          <year>2021</year>
          . Retrieved from https://www.who.int/news-room/fact-sheets/detail/depression.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of eRisk 2025:
          <article-title>Early Risk Prediction on the Internet (Extended Overview)</article-title>
          , in:
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2025)</source>
          , Madrid, Spain, 9-12 September 2025, volume To be published of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of eRisk 2025:
          <article-title>Early Risk Prediction on the Internet</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 16th
          <source>International Conference of the CLEF Association, CLEF 2025</source>
          , Madrid, Spain, September 9-12, 2025, Proceedings, Part II, volume To be published of
          <source>Lecture Notes in Computer Science</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>