<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Overview of eRisk at CLEF 2024: Early Risk Prediction on the Internet (Extended Overview)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Javier Parapar</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patricia Martín-Rodilla</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David E. Losada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Crestani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS),Universidade de Santiago de Compostela. Rúa de Jenaro de la Fuente Domínguez</institution>
          ,
          <addr-line>C.P 15782, Santiago de Compostela</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Informatics, Universitá della Svizzera italiana (USI). Campus EST</institution>
          ,
          <addr-line>Via alla Santa 1, 6900 Viganello</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information Retrieval Lab, Centro de Investigación en Tecnoloxías da Información e as Comunicacións (CITIC), Universidade da Coruña.</institution>
          <addr-line>Campus de Elviña s/n C.P 15071 A Coruña</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
<p>This paper presents eRisk 2024, the eighth edition of the CLEF conference's lab dedicated to early risk detection. Since its inception, the lab has been at the forefront of developing and refining evaluation methodologies, effectiveness metrics, and processes for early risk detection across various domains. These early alerting models hold significant value, particularly in sectors focused on health and safety, where timely intervention can be crucial. eRisk 2024 featured three main tasks designed to push the boundaries of early risk detection techniques. The first task challenged participants to rank sentences based on their relevance to standardized depression symptoms, a crucial step in identifying early signs of depression from textual data. The second task focused on the early detection of anorexia indicators, aiming to develop models that can recognize the subtle cues of this eating disorder before it becomes critical. The third task was centered around estimating responses to an eating disorders questionnaire by analyzing users' social media posts. Participants had to leverage the rich, real-world textual data available on social media to gauge potential mental health risks. Through these tasks, eRisk 2024 continues to advance the field of early risk detection, fostering innovations that could lead to significant improvements in public health interventions.</p>
      </abstract>
      <kwd-group>
<kwd>Early risk</kwd>
        <kwd>Depression</kwd>
        <kwd>Anorexia</kwd>
        <kwd>Eating disorders</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The primary goal of eRisk is to explore evaluation methodologies, metrics, and other factors essential
for developing research collections and identifying early risk signs. Early detection technologies
are increasingly important in safety and health fields. These technologies are particularly useful for
detecting mental illness symptoms, identifying interactions between infants and sexual abusers, or
spotting antisocial threats online, where they can provide early warnings and potentially prevent
harmful outcomes.</p>
      <p>Our lab focuses on a range of psychological issues, including depression, self-harm, pathological
gambling, and eating disorders. We have found that the relationship between psychological conditions
and language use is complex, highlighting the need for more effective automatic language-based
screening models. This complexity arises from the subtle and varied ways in which psychological
distress can manifest in language, necessitating sophisticated analytical techniques.</p>
<p>In 2017, we initiated our efforts with a task aimed at detecting early signs of depression. This task
utilized new evaluation methods and a test dataset described in [1, 2]. The goal was to develop models
capable of identifying depressive symptoms from textual data, which could then be used for early
intervention. In 2018, we expanded our scope to include the early detection of anorexia [3, 4]. This
task required models to identify language patterns indicative of anorexia, providing a tool for early
diagnosis.</p>
      <p>In 2019, we continued our work on anorexia and introduced new challenges. These included detecting
early signs of self-harm and estimating responses to a depression questionnaire based on social media
activity [5, 6, 7]. The self-harm detection task aimed to identify individuals at risk by analyzing their
online posts for signs of self-injurious behavior. The severity estimation task aimed to quantify the
level of depressive symptoms exhibited in social media posts, providing a more nuanced understanding
of an individual’s mental health status. In 2020, our focus included further development of self-harm
detection and a new task on depression severity estimation [8, 9, 10].</p>
      <p>In 2021, we concentrated on early detection tasks for pathological gambling and self-harm, along with a
task for estimating depression severity [11, 12, 13]. The pathological gambling task involved identifying
language patterns associated with gambling addiction, which could be used to flag individuals at risk.
The self-harm and depression severity tasks continued to build on our previous work, refining the
models and evaluation methods.</p>
      <p>The 2022 edition of eRisk introduced tasks for early detection of pathological gambling, depression,
and severity estimation of eating disorders [14, 15, 16]. These tasks aimed to improve the accuracy and
reliability of early detection models, providing valuable tools for mental health professionals.
In 2023, eRisk tasks included ranking sentences by their relevance to depression symptoms, early
detection of gambling signs, and severity estimation of eating disorders [17, 18, 19]. The sentence
ranking task required models to assess the relevance of individual sentences to specific depressive
symptoms, as outlined in the BDI-II questionnaire. This task aimed to enhance the precision of symptom
identification in textual data.</p>
      <p>In 2024, eRisk presented three campaign-style tasks [20, 19]. The first task focused on ranking sentences
related to the 21 symptoms of depression as per the BDI-II questionnaire, using sentences extracted
from social media posts. The second task continued our work on early detection of anorexia, requiring
models to identify language patterns indicative of this eating disorder. The third task revisited the
severity estimation of eating disorders, aiming to quantify the severity of symptoms exhibited in textual
data. Detailed descriptions of these tasks are provided in the subsequent sections of this overview
article.</p>
      <p>In 2024, we had 84 teams registered for the lab. We received results from 17 of them: 29 runs for
Task 1, 44 runs for Task 2, and 14 for Task 3. These results provided valuable insights into the
effectiveness of different models and approaches, contributing to the ongoing development of early
detection technologies.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Search for Symptoms of Depression</title>
      <p>This task builds on eRisk 2023’s Task 1, which focused on ranking sentences from user writings based
on their relevance to specific depression symptoms. Participants had to order sentences according to
their relevance to the 21 standardized symptoms listed in the BDI-II Questionnaire [21]. A sentence
was considered relevant if it reflected the user’s condition related to a symptom, including positive
statements (e.g., “I feel quite happy lately” is relevant for the symptom “Sadness”). This year, the dataset
included the target sentence and the sentences immediately before and after it to provide additional
context.</p>
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
<p>The dataset provided was in TREC format, tagged with sentences derived from eRisk's historical data.
Table 1 presents some statistics of the corpus. The following excerpt illustrates the expected run format (cf. Figure 1):
1 Q0 251001_0_1 0001 10 myGroupNameMyMethodName
1 Q0 251202_5_4 0002 9.5 myGroupNameMyMethodName
1 Q0 858202_3_2 0003 9 myGroupNameMyMethodName
...
21 Q0 153202_2_2 0998 1.25 myGroupNameMyMethodName
21 Q0 331302_1_1 0999 1 myGroupNameMyMethodName
21 Q0 223133_9_8 1000 0.9 myGroupNameMyMethodName</p>
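<p>A run line in this format can be produced with a small helper; this is a minimal sketch assuming the field layout above (query id, the literal Q0, sentence id, rank, score, run tag), and the function and variable names are our own, not part of the lab's tooling.

```python
# Sketch: formatting one TREC-style run line per ranked sentence.
# Assumed field layout: query_id Q0 sentence_id rank score run_tag

def trec_line(query_id, sentence_id, rank, score, run_tag):
    """Format a single TREC run line for one retrieved sentence."""
    return f"{query_id} Q0 {sentence_id} {rank:04d} {score:g} {run_tag}"

# Hypothetical ranking for BDI item 1: (sentence_id, score) pairs
ranking = [("251001_0_1", 10.0), ("251202_5_4", 9.5), ("858202_3_2", 9.0)]
lines = [trec_line(1, sid, i + 1, s, "myGroupNameMyMethodName")
         for i, (sid, s) in enumerate(ranking)]
print(lines[0])  # 1 Q0 251001_0_1 0001 10 myGroupNameMyMethodName
```
</p>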
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Assessment Process</title>
        <p>Participants were given the corpus of sentences and the description of the symptoms from the BDI-II
questionnaire. They were free to decide on the best strategy to derive queries for representing the
BDI-II symptoms. Each team could submit up to 5 variants (runs). Each run included 21 TREC-style
formatted rankings of sentences, as shown in Figure 1. For each symptom, participants submitted up to
1000 results sorted by estimated relevance. We received 29 runs from 9 participating teams (see Table 2).
To create the relevance judgments, three assessors annotated a pool of sentences associated with each
symptom. These candidate sentences were obtained by performing top-k pooling from the relevance
rankings submitted by the participants.</p>
        <p>The assessors were given specific instructions (see Figure 2) dto determine the relevance of
candidate sentences. They considered a sentence relevant if it addressed the topic and provided explicit
information about the individual’s state in relation to the symptom. This dual concept of relevance
(on-topic and reflective of the user’s state with respect to the symptom) introduced a higher level
of complexity compared to standard relevance assessments. Consequently, we developed a robust
annotation methodology and formal assessment guidelines to ensure consistency and accuracy. The
main change from eRisk 2023’s assessment process was that the assessors were presented with the
sentence and its context (previous and following sentences, if available).</p>
<p>To create the pool of sentences for assessment, we implemented top-k pooling with k = 50. The
resulting pool sizes per BDI item are reported in Table 3.</p>
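<p>This pooling step can be sketched as follows; the representation of a run (a dict mapping each BDI item to its ordered list of sentence ids) is our own assumption.

```python
# Sketch of top-k pooling (k = 50): the assessment pool for each BDI item
# is the union of the top-k sentence ids across all submitted runs.

def build_pool(runs, k=50):
    """Union of the top-k sentence ids per BDI item across all runs."""
    pool = {}
    for run in runs:
        for item, ranking in run.items():
            pool.setdefault(item, set()).update(ranking[:k])
    return pool

run_a = {1: ["s1", "s2", "s3"]}
run_b = {1: ["s2", "s4"]}
print(sorted(build_pool([run_a, run_b], k=2)[1]))  # ['s1', 's2', 's4']
```
</p>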
<p>The annotation process involved a team of three assessors with different backgrounds and expertise.
Assume you are given a BDI item, e.g.:
15. Loss of Energy
- I have as much energy as ever.
- I have less energy than I used to have.
- I don't have enough energy to do very much.
- I don't have enough energy to do anything.</p>
        <p>One assessor had professional training in psychology, while the other two were computer science
researchers—a postdoctoral fellow and a Ph.D. student—with a specialization in early risk technologies.
To ensure consistency and clarity throughout the process, the lab organizers conducted a preparatory
session with the assessors. During this session, an initial version of the guidelines was discussed, and
any doubts or questions raised by the assessors were addressed. This collaborative effort resulted in the
final version of the guidelines1.
        <p>According to these guidelines, a sentence is considered relevant only if it provides “some information
about the state of the individual related to the topic of the BDI item”. This criterion serves as the basis
for determining the relevance of sentences during the annotation process.</p>
        <p>The final outcomes of the annotation process are presented in Table 3, where the number of relevant
sentences per BDI item is reported (last two columns). We marked a sentence as relevant following two
aggregation criteria: unanimity and majority.</p>
        <sec id="sec-2-2-1">
          <title>1https://erisk.irlab.org/guidelines_erisk24_task1.html</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Results</title>
        <p>The performance results for the participating systems are shown in Tables 4 (majority-based qrels) and 5
(unanimity-based qrels). The tables report several standard performance metrics, such as Mean Average
Precision (MAP), mean R-Precision, mean Precision at 10, and mean NDCG at 1000. Run Config_5, from
the team NUS-IDS [24], achieved the top-ranking performance for nearly all metrics and relevance
judgment types. It consists of an ensemble model designed for computing semantic similarity with respect to
different expanded descriptions of BDI symptoms. This ensemble leverages three pre-trained language
models: all-mpnet-base-v2, all-MiniLM-L12-v2, and all-distilroberta-v1. This approach
is similar to the APB-UC3M [28] team's proposal, which achieved the best results in terms of P@10 using
majority voting. In contrast, the MeVer-REBECCA [26] team opted for the bge-small-en-v1.5
embedding model, attaining the highest P@10 scores in the unanimity case.</p>
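<p>As an illustration of the general idea (not the NUS-IDS implementation), an ensemble can average cosine similarities between a symptom description and a candidate sentence across several encoders; the embedding functions below are stand-ins for pre-trained models such as all-mpnet-base-v2.

```python
import numpy as np

# Illustrative sketch: averaging cosine similarities produced by several
# (hypothetical) embedding models for one (symptom, sentence) pair.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ensemble_scores(query_vecs, sentence_vecs):
    """query_vecs / sentence_vecs: one vector per model, aligned by index."""
    sims = [cosine(q, s) for q, s in zip(query_vecs, sentence_vecs)]
    return sum(sims) / len(sims)

# Two hypothetical models embedding the same (query, sentence) pair:
q = [np.array([1.0, 0.0]), np.array([0.6, 0.8])]
s = [np.array([1.0, 0.0]), np.array([0.6, 0.8])]
print(ensemble_scores(q, s))  # 1.0 for identical embeddings
```
</p>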
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task 2: Early Detection of Signs of Anorexia</title>
      <p>This task is the third edition of the challenge to develop models for early identification of anorexia
signs. The goal was to process evidence sequentially and detect early indications of anorexia as soon
as possible. Participating systems analyzed user posts on social media in the order they were written.
Successful outcomes from this task could be used for sequential monitoring of user interactions across
various online platforms like blogs, social networks, and other digital media.</p>
      <p>The test collection used for this task followed the format described by Losada and Crestani [29]. It
contains writings, including posts and comments, from a selected group of social media users. Users
are categorized into two groups: anorexia and non-anorexia. For each user, the collection contains
a sequence of writings arranged in chronological order. To facilitate the task and ensure uniform
distribution, we established a server that systematically provided user writings to the participating
teams. Further details about the server's setup are available on the lab's official website2.
This was a train-test task. During the training stage, teams had access to the entire history of writings
for training users. We indicated which users had explicitly mentioned being diagnosed with anorexia.
Participants could tune their systems with this training data. In 2024, the training data included users
from previous editions of the anorexia task (2018 and 2019).</p>
      <p>During the test stage, participants connected to our server and engaged in an iterative process of
receiving user writings and sending their responses. At any point within the chronology of user
writings, participants could halt the process and issue an alert. After reading each user writing, teams
had to decide between two options: i) alerting about the user, indicating a predicted sign of anorexia, or
ii) not alerting about the user. Participants made this choice independently for each user in the test
split. Once an alert was issued, it was final, and no further decisions regarding that individual were
considered. Conversely, the absence of alerts was non-final, allowing participants to submit an alert
later if they detected signs of risk.</p>
      <p>We evaluated the systems’ performance using two indicators: the accuracy of the decisions made and
the number of user writings required to reach those decisions. These criteria provide insights into the
effectiveness and efficiency of the systems. To support the test stage, we deployed a REST service. The
server iteratively distributed user writings and waited for responses from participants. New user data
was not provided to a participant until the service received a decision from that team. The submission
period for the task was open from February 5th, 2024, until April 12th, 2024.</p>
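<p>The iterative fetch-decide-submit loop can be sketched as below. This is only an illustration: the endpoint paths, base URL, team token, and JSON field names are hypothetical placeholders, not the lab's documented API.

```python
import json
import urllib.request

# Sketch of the iterative test-stage protocol: fetch the next round of user
# writings, decide for each user, and post the decisions and risk scores.

SERVER = "https://example.org/erisk"   # placeholder base URL
TOKEN = "TEAM_TOKEN"                   # placeholder team token

def build_answers(writings, decide):
    """One (decision, score) answer per user for the current round."""
    answers = []
    for w in writings:
        decision, score = decide(w["nick"], w["content"])
        answers.append({"nick": w["nick"], "decision": decision, "score": score})
    return answers

def run_client(decide):
    """decide(user_id, writing_text) -> (decision 0/1, risk score)."""
    while True:
        with urllib.request.urlopen(f"{SERVER}/getwritings/{TOKEN}") as resp:
            writings = json.load(resp)
        if not writings:               # empty round: test stage is over
            break
        payload = json.dumps(build_answers(writings, decide)).encode()
        req = urllib.request.Request(f"{SERVER}/submit/{TOKEN}", data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).close()
```

Note that the real protocol treats an alert as final once issued; a production client would track which users have already been flagged.</p>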
<p>To construct the ground truth assessments, we adopted established approaches to optimize the use of
assessors' time, as documented in previous studies [30, 31]. These methods employ simulated pooling
strategies to create effective test collections. The main statistics of the test collection used for T2
(number of subjects, number of submissions, average submissions per subject, average days from first to
last submission, and average words per submission) are presented in Table 6.</p>
      <sec id="sec-3-1">
        <title>3.1. Decision-based Evaluation</title>
        <p>This evaluation approach uses the binary decisions made by the participating systems for each user.
In addition to standard classification measures such as Precision, Recall, and F1 score (computed with
respect to the positive class), we also calculate ERDE (Early Risk Detection Error), used in previous
editions of the lab. A detailed description of ERDE was presented by Losada and Crestani in [29]. ERDE
is an error measure that incorporates a penalty for delayed correct alerts (true positives). The penalty
increases with the delay in issuing the alert, measured by the number of user posts processed before
making the alert.</p>
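<p>A minimal sketch of ERDE follows; the cost settings (c_fn = c_tp = 1, c_fp set to the proportion of positive users) follow common eRisk practice but are assumptions here, as is the exact sigmoid latency cost.

```python
import math

# Sketch of the ERDE_o measure (after Losada & Crestani [29]): the error of
# a single decision depends on its correctness and, for true positives, on
# how many writings k were read before alerting.

def erde(decision, truth, k, o, c_fp=0.1):
    """Error of one decision taken after reading k writings."""
    if decision == 1 and truth == 0:
        return c_fp                      # false positive: fixed cost
    if decision == 0 and truth == 1:
        return 1.0                       # false negative: maximum cost
    if decision == 1 and truth == 1:
        # latency cost grows with delay k; near 0 while k is well below o
        return 1.0 - 1.0 / (1.0 + math.exp(k - o))
    return 0.0                           # true negative: no cost

print(round(erde(1, 1, 1, o=50), 4))    # quick true positive: cost near 0
print(round(erde(1, 1, 200, o=50), 4))  # very late true positive: cost near 1
```
</p>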
<p>Since 2019, we have complemented the evaluation report with additional decision-based metrics that try to
capture additional aspects of the problem. These metrics try to overcome some limitations of ERDE,
namely:
• the penalty associated with true positives goes quickly to 1. This is due to the functional form of
the cost function (sigmoid);
• a perfect system, which detects the true positive case right after the first round of messages (first
chunk), does not get an error equal to 0;
• with a method based on releasing data in a chunk-based way (as it was done in 2017 and 2018)
the contribution of each user to the performance evaluation has a large variance (different for
users with few writings per chunk vs. users with many writings per chunk);
• ERDE is not interpretable.</p>
<p>Some research teams have analysed these issues and proposed alternative ways for evaluation. Trotzek
and colleagues [32] proposed ERDE%. This is a variant of ERDE that does not depend on the number
of user writings seen before the alert but, instead, on the percentage of user writings seen
before the alert. In this way, users' contributions to the evaluation are normalized (all users
weigh the same). However, there is an important limitation of ERDE%. In real-life applications, the
overall number of user writings is not known in advance. Social Media users post contents online and
screening tools have to make predictions with the evidence seen. In practice, you do not know when
(and if) a user's thread of messages is exhausted. Thus, the performance metric should not depend on
knowledge about the total number of user writings.</p>
<p>Another proposal of an alternative evaluation metric for early risk prediction was done by Sadeque and
colleagues [33]. They proposed the latency-weighted F score, which fits better with our purposes. This
measure is described next.</p>
<p>Imagine a user u ∈ U and an early risk detection system that iteratively analyzes u's writings (e.g. in
chronological order, as they appear in Social Media) and, after analyzing k_u user writings (k_u ≥ 1),
takes a binary decision d_u ∈ {0, 1}, which represents the decision of the system about the user being a
risk case. By g_u ∈ {0, 1}, we refer to the user's golden truth label. A key component of an early risk
evaluation should be the delay in detecting true positives (we do not want systems to detect these cases
too late). Therefore, a first and intuitive measure of delay can be defined as follows3:
latency_TP = median{k_u : u ∈ U, d_u = g_u = 1}    (1)</p>
<p>This measure of latency is calculated over the true positives detected by the system and assesses the
system's delay based on the median number of writings that the system had to process to detect such
positive cases. This measure can be included in the experimental report together with standard measures
such as Precision (P), Recall (R) and the F-measure (F):
P = |{u ∈ U : d_u = g_u = 1}| / |{u ∈ U : d_u = 1}|    (2)
R = |{u ∈ U : d_u = g_u = 1}| / |{u ∈ U : g_u = 1}|    (3)
F = 2 · P · R / (P + R)    (4)</p>
<p>3Observe that Sadeque et al. (see [33], pg. 497) computed the latency for all users such that g_u = 1. We argue that latency
should be computed only for the true positives. The false negatives (g_u = 1, d_u = 0) are not detected by the system and,
therefore, they would not generate an alert.</p>
<p>Furthermore, Sadeque et al. proposed a measure, F_latency, which combines the effectiveness of the
decision (estimated with the F measure) and the delay4 in the decision. This is calculated by multiplying
F by a penalty factor based on the median delay. More specifically, each individual (true positive)
decision, taken after reading k writings, is assigned the following penalty:
penalty(k) = −1 + 2 / (1 + exp(−p · (k − 1)))    (5)
where p is a parameter that determines how quickly the penalty should increase. In [33], p was set such
that the penalty equals 0.5 at the median number of posts of a user5. Observe that a decision right after
the first writing has no penalty (i.e. penalty(1) = 0). Figure 3 plots how the latency penalty increases
with the number of observed writings.</p>
        <sec id="sec-3-1-1">
          <title>The system’s overall speed factor is computed as:</title>
<p>speed = 1 − median{penalty(k_u) : u ∈ U, d_u = g_u = 1}    (6)
where speed equals 1 for a system whose true positives are detected right at the first writing. A slow
system, which detects true positives after hundreds of writings, will be assigned a speed score near 0.
Finally, the latency-weighted F score is simply:</p>
<p>F_latency = speed · F    (7)
Since 2019, users' data have been processed by the participants on a post-by-post basis (i.e. we avoided
a chunk-based release of data). Under these conditions, the evaluation approach has the following
properties:
• smooth growth of penalties;
• a perfect system gets F_latency = 1;
• for each user u the system can opt to stop at any point k_u and, therefore, we no longer have the
effect of an imbalanced importance of users;
• F_latency is more interpretable than ERDE.
4Again, we adopt Sadeque et al.'s proposal but we estimate latency only over the true positives.
5In the evaluation we set p to 0.0078, a setting obtained from the eRisk 2017 collection.</p>
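<p>The penalty, speed, and latency-weighted F computations above can be sketched directly; the parameter p = 0.0078 follows footnote 5.

```python
import math
from statistics import median

# Sketch of the latency-weighted F evaluation described above.

def penalty(k, p=0.0078):
    """Latency penalty for a true positive issued after k writings."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))

def latency_weighted_f(f_score, tp_delays, p=0.0078):
    """Multiply F by the speed factor (1 minus median true-positive penalty)."""
    speed = 1.0 - median(penalty(k, p) for k in tp_delays)
    return speed * f_score

print(penalty(1))                                    # 0.0: no penalty at the first writing
print(round(latency_weighted_f(0.8, [1, 1, 1]), 3))  # 0.8: perfect speed keeps F intact
```
</p>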
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Ranking-based Evaluation</title>
        <p>In addition to the evaluation discussed above, we employed an alternative form of evaluation to further
assess the systems. After each data release (a new user writing, that is, a post or comment), participants
were required to provide the following information for each user in the collection:
• A decision for the user (alert or no alert), which was used to calculate the decision-based metrics
discussed previously.</p>
        <p>• A score representing the user’s level of risk, estimated based on the evidence observed thus far.
The scores were used to create a ranking of users in descending order of estimated risk. For each
participating system, a ranking was generated at each data release point, simulating a continuous
re-ranking approach based on the observed evidence. In a real-life scenario, this ranking would be
presented to an expert user who could make decisions based on the rankings (e.g., by inspecting the top
of the rankings).</p>
        <p>Each ranking can be evaluated using standard ranking metrics such as P@10 or NDCG. Therefore,
we report the performance of the systems based on the rankings after observing different numbers of
writings.</p>
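<p>A ranking snapshot can be scored with the usual cutoff metrics; the following sketch shows Precision at 10, with the ranking (user ids sorted by decreasing estimated risk) and the set of true risk cases as assumed inputs.

```python
# Sketch: Precision at k over one ranking snapshot, as computed after each
# round of writings.

def precision_at_k(ranking, positives, k=10):
    """Fraction of the top-k ranked users that are true risk cases."""
    top = ranking[:k]
    hits = sum(1 for u in top if u in positives)
    return hits / k

ranking = ["u3", "u7", "u1", "u9"] + [f"u{i}" for i in range(20, 26)]
print(precision_at_k(ranking, {"u3", "u1", "u25"}))  # 0.3
```
</p>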
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
        <p>Table 7 shows the participating teams, the number of runs submitted, and the approximate lapse of
time from the first response to the last response. This time-lapse indicates the degree of automation of
each team’s algorithms. Many of the submitted runs processed the entire thread of messages (2001), but
a few variants stopped earlier. Five teams processed the thread of messages reasonably fast (less than a
day for processing the entire history of user messages). The rest of the teams took several days to run
the whole process.</p>
<p>Table 8 reports the decision-based performance achieved by the participating teams. In terms of F1 and
latency-weighted F1, the best performing team was NLP-UNED [39] (run 1), while Riewe-Perla [35]
was the team that submitted the best run (run 0) in terms of the ERDE metrics. The majority of teams
made quick decisions. Overall, these findings indicate that some systems achieved a relatively high
level of efectiveness with only a few user submissions. Social and public health systems may use the
best predictive algorithms to assist expert humans in detecting signs of anorexia as early as possible.
Table 9 presents the ranking-based results. UNSL [36] (run 1) obtained the best overall values after only
one writing, while NLP-UNED [39] (run 3) obtained the highest scores after 100 writings. These two
teams also contributed the best performing variants for the 500 and 1000 cutoffs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Task 3: Measuring the Severity of Eating Disorders</title>
      <p>The objective of the task is to estimate the severity of various symptoms related to the diagnosis of
eating disorders. Participants were provided with a thread of user submissions to work with. For each
user, a history of posts and comments from Social Media was given, and participants had to estimate
the user’s responses to a standardized eating disorder questionnaire based on the evidence found in the
history of posts/comments.</p>
<p>The questionnaire used in the task is derived from the Eating Disorder Examination Questionnaire
(EDE-Q)6, which is a self-reported questionnaire consisting of 28 items. It is adapted from the
semistructured interview Eating Disorder Examination (EDE)7 [40]. For this task, we focused on questions
1-12 and 19-28 from the EDE-Q. This questionnaire is designed to assess various aspects and severity of
features associated with eating disorders. It includes four subscales: Restraint, Eating Concern, Shape
Concern, and Weight Concern, along with a global score. Table 10 shows an excerpt of the EDE-Q:
The following questions are concerned with the past four weeks (28 days) only.
2. Have you gone for long periods of time (8 waking hours or more) without
eating anything at all in order to influence your shape or weight?
0. NO DAYS / 1. 1-5 DAYS / 2. 6-12 DAYS / 3. 13-15 DAYS / 4. 16-22 DAYS / 5. 23-27 DAYS / 6. EVERY DAY
3. Have you tried to exclude from your diet any foods that you like in order
to influence your shape or weight (whether or not you have succeeded)?
0. NO DAYS / 1. 1-5 DAYS / 2. 6-12 DAYS / 3. 13-15 DAYS / 4. 16-22 DAYS / 5. 23-27 DAYS / 6. EVERY DAY
...
22. Has your weight influenced how you think about (judge) yourself as a
person?
0. NOT AT ALL / 1-2. SLIGHTLY / 3-4. MODERATELY / 5-6. MARKEDLY
23. Has your shape influenced how you think about (judge) yourself as a
person?
0. NOT AT ALL / 1-2. SLIGHTLY / 3-4. MODERATELY / 5-6. MARKEDLY
6https://www.corc.uk.net/media/1273/ede-q_quesionnaire.pdf
7https://www.corc.uk.net/media/1951/ede_170d.pdf
The primary objective of this task was to explore the possibility of automatically estimating the severity
of multiple symptoms related to eating disorders. The algorithms are required to estimate the user's
response to each individual question based on their writing history. To evaluate the performance of the
participating systems, we collected questionnaires completed by Social Media users along with their
corresponding writing history. The user-completed questionnaires serve as the ground truth against
which the responses provided by the systems are evaluated.</p>
      <p>During the training phase, participants were provided with data from 28 users from the 2022 edition
and 46 users from the 2023 edition. This training data included the writing history of the users as well
as their responses to the EDE-Q questions. In the test phase, there were 18 new users for whom the
participating systems had to generate results. The results were expected to follow this specific
file structure:
username1 answer1 answer2 ... answer12 answer19 ... answer28
username2 answer1 answer2 ... answer12 answer19 ... answer28
...</p>
<p>Each line has the username and 22 values (there are no answers for questions 13 to 18). These values
correspond to the responses to the questions above (the possible values are 0, 1, 2, 3, 4, 5, 6).</p>
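<p>Such a result line can be assembled with a small helper; this sketch assumes answers are already available as a list of 22 integers in the order required above.

```python
# Sketch: formatting one Task 3 result line (username plus the 22 predicted
# answers for questions 1-12 and 19-28, each in 0..6).

def format_line(username, answers):
    """answers: list of 22 integers in 0..6, in questionnaire order."""
    assert len(answers) == 22
    return " ".join([username] + [str(a) for a in answers])

line = format_line("username1", [0] * 12 + [6] * 10)
print(line.split()[0], len(line.split()) - 1)  # username1 22
```
</p>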
      <sec id="sec-4-1">
        <title>4.1. Evaluation Metrics</title>
<p>Evaluation is based on the following effectiveness metrics:
• Mean Zero-One Error (MZOE) between the questionnaire filled by the real user and the
questionnaire filled by the system (i.e. fraction of incorrect predictions).</p>
<p>MZOE(f, Q) = |{q_i ∈ Q : R(q_i) ≠ f(q_i)}| / |Q|    (8)
where f denotes the classification done by an automatic system, Q is the set of questions of each
questionnaire, q_i is the i-th question, R(q_i) is the real user's answer for the i-th question and
f(q_i) is the predicted answer of the system for the i-th question. Each user produces a single
MZOE score and the reported MZOE is the average over all MZOE values (mean MZOE
over all users).
• Mean Absolute Error (MAE) between the questionnaire filled by the real user and the
questionnaire filled by the system (i.e. average deviation of the predicted response from the true
response).</p>
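<p>The Mean Zero-One Error described above is straightforward to compute; this sketch assumes the real and predicted answers are aligned lists.

```python
# Sketch of Mean Zero-One Error: the fraction of questions whose predicted
# answer differs from the real user's answer.

def mzoe(real, predicted):
    """real, predicted: equal-length lists of answers (0..6)."""
    wrong = sum(1 for r, p in zip(real, predicted) if r != p)
    return wrong / len(real)

print(mzoe([0, 3, 6, 2], [0, 3, 5, 1]))  # 0.5
```
</p>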
<p>MAE(f, Q) = Σ_{q_i ∈ Q} |R(q_i) − f(q_i)| / |Q|    (9)
MMAE(f, Q) = (1/7) Σ_{j=0..6} ( Σ_{q_i ∈ Q_j} |R(q_i) − f(q_i)| / |Q_j| )    (10)
where Q_j represents the set of questions whose true answer is j (note that j goes from 0 to 6
because those are the possible answers to each question). Again, each user produces a single
MAE score and the reported MAE is the average over all MAE values (mean
MAE over all users).</p>
        <p>The following measures are based on aggregated scores obtained from the questionnaires. Further
details about the EDE-Q instruments can be found elsewhere (e.g. see the scoring section of the
questionnaire).
• Restraint Subscale (RS): Given a questionnaire, its restraint score is obtained as the mean
response to the first five questions. This measure computes the RMSE between the restraint ED
score obtained from the questionnaire filled by the real user and the restraint ED score obtained
from the questionnaire filled by the system.</p>
        <p>
          Each user  is associated with a real subscale ED score (referred to as  ()) and an estimated
subscale ED score (referred to as  ()). This metric computes the RMSE between the real and
an estimated subscale ED scores as follows:
∑︀∈ |() −  ()|
||
Again, each user produces a single   score and the reported   is the average over all
  values (mean   over all users).
• Macroaveraged Mean Absolute Error ( ) between the questionnaire filled by the
real user and the questionnaire filled by the system (see [41]).
• Weight Concern Subscale (WCS): Given a questionnaire, its weight concern score is obtained
as the mean response to the following questions (
          <xref ref-type="bibr" rid="ref12 ref8">22, 24, 8, 25, 12</xref>
          ). This metric computes the
RMSE (equation 14) between the weight concern ED score obtained from the questionnaire filled
by the real user and the weight concern ED score obtained from the questionnaire filled by the
system.
        </p>
        <p>(,  ) =
√︃ ∑︀∈ ( () −  ())2
| |
| |
| |</p>
        <p>
          (
          <xref ref-type="bibr" rid="ref9">9</xref>
          )
(
          <xref ref-type="bibr" rid="ref10">10</xref>
          )
(
          <xref ref-type="bibr" rid="ref11">11</xref>
          )
(
          <xref ref-type="bibr" rid="ref12">12</xref>
          )
(
          <xref ref-type="bibr" rid="ref13">13</xref>
          )
where  is the user set.
• Eating Concern Subscale (ECS): Given a questionnaire, its eating concern score is obtained as
the mean response to the following questions (
          <xref ref-type="bibr" rid="ref7 ref9">7, 9, 19, 21, 20</xref>
          ). This metric computes the RMSE
(equation 12) between the eating concern ED score obtained from the questionnaire filled by the
real user and the eating concern ED score obtained from the questionnaire filled by the system.
        </p>
        <p>
          √︃ ∑︀∈ ( () −  ())2
• Shape Concern Subscale (SCS): Given a questionnaire, its shape concern score is obtained as
the mean response to the following questions (
          <xref ref-type="bibr" rid="ref10 ref11 ref6 ref8">6, 8, 23, 10, 26, 27, 28, 11</xref>
          ). This metric computes
the RMSE (equation 13) between the shape concern ED score obtained from the questionnaire
iflled by the real user and the shape concern ED score obtained from the questionnaire filled by
the system.
        </p>
        <p>√︃ ∑︀∈ (() − ())2</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>Table 11 reports the results obtained by the participants in this task. In order to provide some context,
the table includes the performance of three baseline variants in the top block: “all 0s”, “all 6s”, and
“average”. The “all 0s” variant represents a strategy where the same response (0) is submitted for all
questions. Similarly, the “all 6s” variant submits the response 6 for all questions. The “average” variant
calculates the mean of the responses provided by all participants for each question and submits the
response that is closest to this mean value (e.g. if the mean response provided by the participants equals
3.7 then this average approach would submit a 4).</p>
        <p>The results indicate that the top-performing system in terms of Mean Absolute Error (MAE) was run 4
by SCaLAR-NITK [42]. This team also got the best MZOE (run 2), the best   (run 0), the best
GED (run 3), the best RS (run 4), the best ECS (run 1), and the best SCS (run 0). The best WCS, instead,
was achieved by team RELAI [23] (run 0). In some cases the best participating system was not better
that some of the baselines (e.g., lowest MZOE is the “all 6s” baseline).</p>
        <p>∈ (  () −  ())2
• Global ED (GED): To obtain an overall or ‘global’ score, the four subscales scores are summed
and the resulting total divided by the number of subscales (i.e. four) [40]. This metric computes
the RMSE between the real and an estimated global ED scores as follows:</p>
        <p>
          ∈ (() − ())2
√︃ ∑︀
√︃ ∑︀
| |
| |
(
          <xref ref-type="bibr" rid="ref14">14</xref>
          )
(
          <xref ref-type="bibr" rid="ref15">15</xref>
          )
        </p>
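As a compact illustration, the per-user metrics of Section 4.1 can be sketched in a few lines of Python (function names are illustrative; this sketch skips answer values that no question takes when macroaveraging):

```python
import math

# Illustrative sketches of the per-user Task 3 metrics.
# `real` and `pred` are equal-length lists of integer answers in 0..6.

def mzoe(real, pred):
    # Fraction of questions answered incorrectly (equation 8).
    return sum(r != p for r, p in zip(real, pred)) / len(real)

def mae(real, pred):
    # Mean absolute deviation from the true answers (equation 9).
    return sum(abs(r - p) for r, p in zip(real, pred)) / len(real)

def mae_macro(real, pred):
    # Macroaveraged MAE (equation 10): mean of per-true-answer MAEs.
    # Assumption in this sketch: answer values with no questions are skipped.
    per_value = []
    for j in range(7):
        idx = [i for i, r in enumerate(real) if r == j]
        if idx:
            per_value.append(sum(abs(real[i] - pred[i]) for i in idx) / len(idx))
    return sum(per_value) / len(per_value)

def subscale_rmse(real_scores, pred_scores):
    # RMSE between real and estimated subscale scores over the user set.
    n = len(real_scores)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(real_scores, pred_scores)) / n)
```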
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Participating Teams</title>
      <p>Table 12 reports the participating teams and the runs that they submitted for each eRisk task. The next
paragraphs give a brief summary on the techniques implemented by each of them. Further details are
available at the CLEF 2024 working notes proceedings for the participants.</p>
      <p>APB-UC3M [28]. The APB-UC3M team, affiliated with Universidad Carlos III de Madrid (UC3M) in
Spain, participated in the three tasks of the eRisk 2024 challenge. For Task 1, which involved searching
for symptoms of depression, the team employed sentence similarity models to compare BDI items
with paragraphs, in conjunction with a RoBERTa classifier. They also explored ensemble methods
combining these approaches. In Task 2, focused on the early detection of anorexia, the team used an
ensemble model comprising three classification algorithms. They generated embeddings using BART
and Doc2Vec models and utilized these embeddings as input for three traditional classifiers: Support
Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF). For Task 3, which involved
measuring the severity of signs of eating disorders, the team fine-tuned a neural network model. This
model included an embedding layer, a fully connected layer, and a ReLU activation function, and it was
trained to predict the 22 categories of the Eating Disorder Examination (EDE) interview.
BioNLP-IISERB [34]. The BioNLP-IISERB team, affiliated with the Indian Institute of Science Education
and Research, Bhopal, participated in Task 2 of the eRisk 2024 challenge. The team’s approach involved
a combination of various classification methods and feature engineering techniques to identify signs
of anorexia from the provided texts. They utilized both bag-of-words features and transformer-based
embedding methods. For classification, they employed Random Forest, Adaptive Boosting, Logistic
Regression, Support Vector Machine (SVM), and transformer-based classifiers. Their experimental
analysis revealed that the best performance was achieved using SVMs and an Adaboost classifier,
particularly with TF-IDF and entropy-based weighting strategies. Some experimental runs achieved
F1 scores higher than 0.65, indicating the potential of these frameworks to identify textual patterns
indicative of anorexia. Despite the promising results, the complexity of the task suggests there is room
for future improvements.</p>
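As a rough, self-contained illustration of the bag-of-words side of such a pipeline (a nearest-centroid rule stands in here for the SVM and AdaBoost classifiers the team actually used, to keep the sketch stdlib-only):

```python
import math
from collections import Counter

# Toy TF-IDF featurizer plus a nearest-centroid classifier, standing in
# for the SVM/AdaBoost classifiers used by the team. Illustrative only.

def tfidf_vectors(docs):
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] / len(toks) * idf[t] for t in tf})
    return vecs

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vecs):
    c = Counter()
    for v in vecs:
        for t, w in v.items():
            c[t] += w / len(vecs)
    return dict(c)

def classify(train_docs, labels, test_doc):
    vecs = tfidf_vectors(train_docs + [test_doc])
    train_vecs, test_vec = vecs[:-1], vecs[-1]
    cents = {lab: centroid([v for v, l in zip(train_vecs, labels) if l == lab])
             for lab in set(labels)}
    return max(cents, key=lambda lab: cosine(test_vec, cents[lab]))
```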
      <p>DSGT [27]. The DSGT team, from the Georgia Institute of Technology, participated in Tasks 1 and
3 of the eRisk 2024 challenge. For Task 1, they developed two distinct pipelines to detect signs of
depression. Their approach combined traditional NLP techniques, such as TF-IDF, with vector-based
models. Specifically, they constructed a logistic regression classifier, treating the 21 symptoms as targets
for a multiclass classification problem. The results on the hidden test set demonstrated that vector
models and transformer-based models could achieve notable performance on information retrieval
metrics, even without advanced sentence filtering and fine-tuning. For Task 3, the team employed
simpler models, including XGBoost and Random Forests, which showed better performance on smaller
datasets.</p>
      <p>ELiRF-UPV [38]. The ELiRF-VRAIN team, affiliated with the Valencian Research Institute for Artificial
Intelligence (VRAIN) at Universitat Politècnica de València, participated in Task 2. Their work involved
three distinct approaches: a Support Vector Machine (SVM) and two pre-trained Transformer models.
Among the Transformer models, one approach utilized BERT-like models, while the other employed
LongFormer models to expand the context when making decisions. To balance the training example
set, the authors proposed a data augmentation method, which yielded positive results in augmenting
examples during the training process.</p>
      <p>MeVer-REBECCA [26]. The REBECCA team, affiliated with the Information Technologies Institute at
the Centre for Research and Technology Hellas (CERTH) in Thessaloniki, Greece, participated in Task 1
of the eRisk 2024 challenge, which focused on searching for symptoms of depression. Their approach
involved a combination of ranking sentences using cosine similarity and Transformer embeddings,
with refinement through a Large Language Model (LLM), specifically ChatGPT-4. The process began
with text pre-processing and dataset cleaning, discarding sentences not related to the authors and
considering relevant sentences only if they reflected the author’s state surrounding a symptom. They
conducted keyword matching with sentences indicating self-reference. Following this, the team used
sentence ranking with BGE-M3 (Multi-Linguality, Multi-Functionality, and Multi-Granularity) and the
questionnaire answers.</p>
      <p>MindwaveML [25]. The MindwaveML team, from the University of Bucharest, participated in Task
1. The team leveraged a paraphrasing model to match sentences with BDI texts. Specifically, they
encoded the four alternative responses to each of the 21 BDI symptoms into dense embeddings using the
paraphrase-MiniLM-L12-v2 model. These embeddings captured the semantic information contained in
each response. The sentences from the Reddit corpus were also encoded with paraphrase-MiniLM-L12-v2.
The cosine similarity between each sentence and the BDI responses was then computed. Additionally,
the team incorporated features to ensure that the sentences contained first-person expressions, ensuring
relevance to the individual’s state. The resulting set of features was fed to standard learning algorithms.
NLP-UNED [39]. The NLP-UNED team, from UNED in Madrid, participated in Task 2. Their system
comprised several steps, starting with an initial embedding representation using sentence encoders.
This was followed by a relabelling process based on Approximate Nearest Neighbors (ANN) techniques
to generate a training dataset annotated at the message level instead of the user level. The encoding
process was further refined with fine-tuning based on contrastive learning, aiming to maximize the
distance between embeddings belonging to different classes. For classification, the team also employed
ANN techniques, combined with rules and heuristics to expand the number of messages considered
from each user when making the final decision. Their system achieved the best results in both the
decision-based evaluation and in the ranking-based evaluation.</p>
      <p>NUS-IDS [24]. The NUS-IDS team, affiliated with the National University of Singapore’s Integrative
Sciences and Engineering Programme and the Institute of Data Science, participated in Task 1 of
the eRisk 2024 challenge. The team’s approach involved ranking candidate sentences for depression
symptoms by their average similarity to a predefined set of training sentences. Utilizing methods for
computing dense representations of sentences, the team calculated the score of a test sentence as the
average cosine similarity between the test sentence and each sentence in a set of training sentences
associated with a specific symptom. The authors experimented with different configurations of this
algorithm, employing various models for dense representation computation and different sets of training
sentences. This approach allowed the NUS-IDS team to effectively rank sentences by their relevance
to depression symptoms, leveraging both similarity metrics and the robustness of multiple model
configurations.</p>
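Under the assumption that sentence embeddings are already available (e.g. from a sentence encoder), the core ranking step reduces to a few lines; the names here are illustrative:

```python
import math

# Rank candidate sentences for one symptom by their mean cosine
# similarity to a set of training-sentence embeddings (illustrative sketch).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_by_mean_similarity(candidates, training):
    # candidates: {sentence_id: embedding}; training: list of embeddings
    # associated with one symptom. Returns ids, most relevant first.
    scores = {sid: sum(cosine(e, t) for t in training) / len(training)
              for sid, e in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```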
      <p>RELAI [23]. The RELAI team, from Université du Québec à Montréal, Canada, and McMaster
University, Canada, participated in Tasks 1 and 3. For Task 1, which involved searching for symptoms of
depression, the team approached it as a multilabel classification task. They utilized feed-forward neural
networks with contextual embeddings to mine sentences relevant to each item in a standard depression
questionnaire from a large set of social media sentences. Their methods aimed to be lightweight,
minimizing computational costs and infrastructure needs. In Task 3, the team used BERTopic to extract
the 16 most correlated topics with signs of eating disorders as features for prediction. They employed
feed-forward neural networks with topic probabilities as inputs to automatically fill out a standard eating
disorder questionnaire based on social media writing histories. The authors noted significant room
for improvement, particularly in exploring different representations of writing history and improving
model calibration for classification transformations.</p>
      <p>Riewe-Perła (MHRec) [35]. The Poznań University of Economics and Business team participated
in Task 2. Their approach involved merging language models with recommender systems to analyze
and predict if recommended content originated from individuals with mental health conditions. The
team’s model was built on document embeddings, user embeddings, and a recommendation engine
using the sentence transformer architecture (SBERT). They employed a hybrid recommendation method
(LightFM) that leveraged both document and user embeddings to flag publications indicating mental
health challenges. The system aimed to facilitate fast classification of new messages, determining as
early as possible whether an individual was suffering from anorexia.</p>
      <p>SCaLAR-NITK [42]. Team SCaLAR-NITK, from the National Institute of Technology Karnataka,
Surathkal, participated in Task 3. The team employed a range of standard techniques across 21 different
models—one for each symptom question. Their first approach utilized Support Vector Machine (SVM)
classifiers with input word embeddings constructed using the traditional TF-IDF method. In the second
approach, they again used SVMs but leveraged pre-trained Word2Vec embeddings to model both users
and questions, aggregating the question embeddings with each user publication. To address response
imbalance, they employed back-translation. Their final method followed the second approach but
incorporated Principal Component Analysis (PCA) for dimensionality reduction of embeddings. Their
methods performed well, achieving the best results in 7 out of the 8 evaluated metrics.
SINAI [22]. The SINAI team, a collaborative effort between the Computer Science Department of
Universidad de Jaén (Spain) and Instituto Nacional de Astrofísica, Óptica y Electrónica (Mexico), participated
in Tasks 1 and 2. For Task 1, one of SINAI’s approaches involved training a DistilRoBERTa base model on
labeled sentences, with additional data augmentation using the BDI-Sen dataset. Another approach for
Task 1 utilized GPT-3 prompts to infer connections between PHQ-8 symptoms and BDI symptoms. For
Task 2, the team implemented two transformer-based models trained with causal language modeling,
one trained on positive user data and the other on negative user data. This dual-model solution was
used to produce perplexity estimates.</p>
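The dual-model perplexity idea can be sketched abstractly: given per-token log-probabilities of a user's text under each model, the lower perplexity indicates the closer population. In a real run the log-probabilities would come from the two fine-tuned transformers; here they are plain inputs:

```python
import math

# Classify by comparing perplexity under a "positive" and a "negative"
# causal language model. Token log-probs stand in for real model outputs.

def perplexity(token_logprobs):
    # PPL = exp(-mean log-probability per token).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def classify_by_perplexity(logprobs_pos_model, logprobs_neg_model):
    # The model that "expects" the text more (lower perplexity) wins.
    ppl_pos = perplexity(logprobs_pos_model)
    ppl_neg = perplexity(logprobs_neg_model)
    return "positive" if ppl_pos < ppl_neg else "negative"
```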
      <p>UMU Team [37]. The UMU Team, from the University of Murcia (Spain), participated in Tasks 2
and 3. For Task 2, the team proposed a method that classifies user posts by combining the last-layer
hidden representation of a BERT-based model with sentiment features extracted from the text. They
utilized BERT and RoBERTa models for text representation, along with the Cardiff NLP TweetEval
model for sentiment analysis. This approach aimed to capture both the semantic and emotional aspects
of the users’ posts to detect signs of anorexia. For Task 3, they adopted a fine-tuning approach using a
sentence transformer model to compute the similarity between the text of the user and the responses of
the EDE-Q questionnaire. This method involved measuring the textual closeness between user posts
and the EDE-Q questions to assess the severity of eating disorder symptoms.</p>
      <p>UNSL [36]. The UNSL team, from Universidad Nacional de San Luis (Argentina), participated in Task 2
with a solution named CPI-DMC, focusing on precision and speed independently, as well as a time-aware
approach where both objectives are tackled together. The first approach aimed to balance identifying
positive users and minimizing the decision-making time, consisting of two separate components: a
Classifier with Partial Information (CPI) and another for Deciding the Moment of the Classification
(DMC). The second approach aimed to optimize both objectives simultaneously by incorporating time
into the learning process and using ERDE as the training objective. To implement this, they included a
[TIME] token in the representations, integrating temporal metrics to validate and select the optimal
models. Their methods achieved good results for the ERDE50 metric and ranking-based metrics,
demonstrating consistency in solving early risk detection problems.</p>
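For reference, the ERDE measure that UNSL optimized penalizes correct alerts that arrive late. A sketch of the per-user cost, following the standard eRisk formulation with placeholder cost constants:

```python
import math

# Per-user ERDE_o cost (standard eRisk formulation): late true positives
# are discounted by a sigmoid latency cost centred at delay o.
# c_fp, c_fn, c_tp are task-dependent constants (placeholders here).

def latency_cost(k, o):
    # Grows from ~0 to ~1 as the delay k (writings seen) passes o.
    return 1.0 - 1.0 / (1.0 + math.exp(k - o))

def erde(decision, truth, k, o, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    if decision and truth:        # true positive, penalized by latency
        return latency_cost(k, o) * c_tp
    if decision and not truth:    # false positive
        return c_fp
    if not decision and truth:    # false negative (missed at-risk user)
        return c_fn
    return 0.0                    # true negative
```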
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This paper provided an extended overview of eRisk 2024, the eighth edition of the lab, which focused
on three types of tasks: symptoms search (Task 1 on depression), early detection (Task 2 on anorexia),
and severity estimations (Task 3 on eating disorders). Participants in Task 1 were given a collection
of sentences and had to rank them according to their relevance to each of the BDI-II depression
symptoms. Participants in Task 2 had sequential access to social media posts and had to send alerts
about individuals showing risks of anorexia. In Task 3, participants were given the full user history and
had to automatically estimate the user’s responses to a standard eating disorder questionnaire.
A total of 87 runs were submitted by 17 teams for the proposed tasks. The experimental results
demonstrate the value of extracting evidence from social media, indicating that automatic or
semiautomatic screening tools to detect at-risk individuals could be promising. These findings highlight the
need for the development of benchmarks for text-based risk indicator screening.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by project PLEC2021-007662 (MCIN/AEI/10.13039/ 501100011033, Ministerio
de Ciencia e Innovación, Agencia Estatal de Investigación, Plan de Recuperación, Transformación y
Resiliencia, Unión Europea-Next Generation EU). The first and second authors thank the financial
support supplied by the Xunta de Galicia-Consellería de Cultura, Educación, Formación Profesional e
Universidade (GPC ED431B 2022/33) and the European Regional Development Fund and project
PID2022-137061OB-C21 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, Agencia Estatal
de Investigación, Proyectos de Generación de Conocimiento; supported by “ERDF A way of making
Europe”, by the “European Union”). The CITIC, as a center accredited for excellence within the Galician
University System and a member of the CIGUS Network, receives subsidies from the Department of
Education, Science, Universities, and Vocational Training of the Xunta de Galicia. Additionally, CITIC is
co-financed by the EU through the FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01).
The third author thanks the financial support supplied by the Xunta de Galicia-Consellería de Cultura,
Educación, Formación Profesional e Universidade (accreditation 2019-2022 ED431G-2019/04, ED431C
2022/19) and the European Regional Development Fund, which acknowledges the CiTIUS-Research
Center in Intelligent Technologies of the University of Santiago de Compostela as a Research Center
of the Galician University System. David E. Losada also thanks the financial support obtained from
project SUBV23/00002 (Ministerio de Consumo, Subdirección General de Regulación del Juego) and
project PID2022-137061OB-C22 (Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación,
Proyectos de Generación de Conocimiento; supported by the European Regional Development Fund).
[18] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2023: Early risk
prediction on the internet (extended overview), in: Proceedings of the Working Notes of CLEF
2023 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 18–21, 2023,
2023.
[19] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, eRisk 2023: Depression, pathological
gambling, and eating disorder challenges, in: Advances in Information Retrieval - 45th European
Conference on IR Research, ECIR 2023, Dublin, Ireland, April 2–6, 2023, Proceedings, Part III, 2023,
pp. 585–592.
[20] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2024: Early risk prediction
on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th
International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9–12,
2024, 2024.
[21] A. T. Beck, C. H. Ward, M. Mendelson, J. Mock, J. Erbaugh, An Inventory for Measuring Depression,</p>
      <p>JAMA Psychiatry 4 (1961) 561–571.
[22] A. M. Mármol-Romero, P. A.-O. Adrián Moreno-Muñoz, K. M. Valencia-Segura, E. Martínez-Cámara,
M. García-Vega, A. Montejo-Ráez, SINAI at eRisk@CLEF 2024: Approaching the Search
for Symptoms of Depression and Early Detection of Anorexia Signs using Natural Language
Processing, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum,
Grenoble, France, September 9-12, 2024.
[23] D. Maupomé, Y. Ferstler, S. Mosser, M.-J. Meurs, Automatically finding evidence, predicting
answers in mental health self-report questionnaires , in: Working Notes of CLEF 2024 - Conference
and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[24] B. H. Ang, S. D. Gollapalli, S.-K. Ng, NUS-IDS@eRisk2024: Ranking Sentences for Depression
Symptoms using Early Maladaptive Schemas and Ensembles , in: Working Notes of CLEF 2024
Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[25] R.-M. Hanciu, MindwaveML at eRisk 2024: Identifying Depression Symptoms in Reddit Users, in:
Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France,
September 9-12, 2024.
[26] A. Barachanou, F. Tsalakanidou, S. Papadopoulos, REBECCA at eRisk 2024: Search for symptoms
of depression using sentence embeddings and prompt-based filtering , in: Working Notes of CLEF
2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[27] D. Guecha, A. Potdar, A. Miyaguchi, DS@GT eRisk 2024 Working Notes , in: Working Notes of
CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12,
2024.
[28] A. P. Bascuñana, I. S. Bedmar, APB-UC3M at eRisk 2024: Natural Language Processing and
Deep Learning for the Early Detection of Mental Disorders, in: Working Notes of CLEF 2024
Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[29] D. E. Losada, F. Crestani, A test collection for research on depression and language use, in:</p>
      <p>Proceedings Conference and Labs of the Evaluation Forum CLEF 2016, Evora, Portugal, 2016.
[30] D. Otero, J. Parapar, Á. Barreiro, Beaver: Efficiently building test collections for novel tasks, in:
Proceedings of the First Joint Conference of the Information Retrieval Communities in Europe
(CIRCLE 2020), Samatan, Gers, France, July 6-9, 2020, 2020.
[31] D. Otero, J. Parapar, Á. Barreiro, The wisdom of the rankers: a cost-effective method for building
pooled test collections without participant systems, in: SAC ’21: The 36th ACM/SIGAPP
Symposium on Applied Computing, Virtual Event, Republic of Korea, March 22-26, 2021, 2021, pp.
672–680.
[32] M. Trotzek, S. Koitka, C. Friedrich, Utilizing neural networks and linguistic metadata for early
detection of depression indications in text sequences, IEEE Transactions on Knowledge and Data
Engineering (2018).
[33] F. Sadeque, D. Xu, S. Bethard, Measuring the latency of depression detection in social media, in:</p>
      <p>WSDM, ACM, 2018, pp. 495–503.
[34] P. Sarangi, S. Kumar, S. Agrawal, T. Basu, A natural language processing based framework for
early detection of anorexia via sequential text processing, in: Working Notes of CLEF 2024
Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[35] O. Riewe-Perła, A. Filipowska, Combining Recommender Systems and Language Models in Early
Detection of Signs of Anorexia, in: Working Notes of CLEF 2024 - Conference and Labs of the
Evaluation Forum, Grenoble, France, September 9-12, 2024.
[36] H. Thompson, M. Errecalde, A Time-Aware Approach to Early Detection of Anorexia: UNSL at
eRisk 2024, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum,
Grenoble, France, September 9-12, 2024.
[37] R. Pan, J. A. G. Díaz, T. B. Beltrán, R. Valencia-Garcia, UMUTeam at eRisk@CLEF 2024: Fine-Tuning
Transformer Models with Sentiment Features for Early Detection and Severity Measurement of
Eating Disorders , in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum,
Grenoble, France, September 9-12, 2024.
[38] A. C. Segarra, V. A. Esteve, A. M. Marco, L.-F. H. Oliver, ELiRF-VRAIN at eRisk 2024: Using
LongFormers for Early Detection of Signs of Anorexia, in: Working Notes of CLEF 2024 - Conference
and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.
[39] H. Fabregat, D. Deniz, A. Duque, L. Araujo, J. Martinez-Romo, NLP-UNED at eRisk 2024:
Approximate Nearest Neighbors with Encoding Refinement for Early Detecting Signs of Anorexia, in:
Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, Grenoble, France,
September 9-12, 2024.
[40] C. G. Fairburn, Z. Cooper, M. O’Connor, Eating disorder examination Edition 17.0D (April, 2014).
[41] S. Baccianella, A. Esuli, F. Sebastiani, Evaluation measures for ordinal regression, 2009, pp. 283–287.</p>
      <p>doi:10.1109/ISDA.2009.230.
[42] S. Prasanna, A. S. Gulati, S. Karmakar, M. Y. Hiranmayi, A. K. Madasamy, Measuring the severity
of the signs of eating disorders using machine learning techniques, in: Working Notes of CLEF
2024 - Conference and Labs of the Evaluation Forum, Grenoble, France, September 9-12, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , J. Parapar, eRisk
          <year>2017</year>
          :
          <article-title>CLEF lab on early risk prediction on the internet: Experimental foundations</article-title>
          , in: G. J.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Lawless</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cappellato</surname>
          </string-name>
          , N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer International Publishing, Cham,
          <year>2017</year>
          , pp.
          <fpage>346</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , J. Parapar, eRisk
          <year>2017</year>
          :
          <article-title>CLEF Lab on Early Risk Prediction on the Internet: Experimental foundations</article-title>
          ,
          <source>in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF</source>
          <year>2017</year>
          , Dublin, Ireland,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk: Early Risk Prediction on the Internet</article-title>
          , in: P. Bellot,
          <string-name>
            <given-names>C.</given-names>
            <surname>Trabelsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Murtagh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soulier</surname>
          </string-name>
          , E. SanJuan, L. Cappellato, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer International Publishing, Cham,
          <year>2018</year>
          , pp.
          <fpage>343</fpage>
          -
          <lpage>361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2018: Early Risk Prediction on the Internet (extended lab overview)</article-title>
          , in:
          <source>CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2018</source>
          , Avignon, France,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2019: Early risk prediction on the Internet</article-title>
          , in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Braschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Heinatz</given-names>
            <surname>Bürki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer International Publishing,
          <year>2019</year>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2019: Early risk prediction on the Internet (extended overview)</article-title>
          , in:
          <source>CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2019</source>
          , Lugano, Switzerland,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Early detection of risks on the internet: An exploratory campaign</article-title>
          ,
          in:
          <source>Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019</source>
          , Cologne, Germany, April 14-18, 2019, Proceedings, Part II,
          <year>2019</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2020: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 11th International Conference of the CLEF Association, CLEF 2020</source>
          , Thessaloniki, Greece, September 22-25, 2020, Proceedings,
          <year>2020</year>
          , pp.
          <fpage>272</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2020: Early risk prediction on the internet (extended overview)</article-title>
          , in:
          <source>Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum</source>
          , Thessaloniki, Greece, September 22-25,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>eRisk 2020: Self-harm and depression challenges</article-title>
          , in:
          <source>Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020</source>
          , Lisbon, Portugal, April 14-17, 2020, Proceedings, Part II,
          <year>2020</year>
          , pp.
          <fpage>557</fpage>
          -
          <lpage>563</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2021: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 12th International Conference of the CLEF Association, CLEF 2021</source>
          , Virtual Event, September 21-24, 2021, Proceedings,
          <year>2021</year>
          , pp.
          <fpage>324</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2021: Early risk prediction on the internet (extended overview)</article-title>
          , in:
          <source>Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum</source>
          , Bucharest, Romania, September 21-24,
          <year>2021</year>
          , pp.
          <fpage>864</fpage>
          -
          <lpage>887</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>eRisk 2021: Pathological gambling, self-harm and depression challenges</article-title>
          , in:
          <source>Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021</source>
          , Virtual Event, March 28 - April 1, 2021, Proceedings, Part II,
          <year>2021</year>
          , pp.
          <fpage>650</fpage>
          -
          <lpage>656</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2022: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 13th International Conference of the CLEF Association, CLEF 2022</source>
          , Bologna, Italy, September 5-8,
          <year>2022</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2022: Early risk prediction on the internet (extended overview)</article-title>
          , in:
          <source>Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum</source>
          , Bologna, Italy, September 5-8,
          <year>2022</year>
          , pp.
          <fpage>821</fpage>
          -
          <lpage>850</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>eRisk 2022: Pathological gambling, depression, and eating disorder challenges</article-title>
          , in:
          <source>Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022</source>
          , Stavanger, Norway, April 10-14, 2022, Proceedings, Part II,
          <year>2022</year>
          , pp.
          <fpage>436</fpage>
          -
          <lpage>442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2023: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF 2023</source>
          , Thessaloniki, Greece, September 18-21,
          <year>2023</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2023: Early risk</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>