=Paper=
{{Paper
|id=Vol-3180/paper-66
|storemode=property
|title=Overview of eRisk 2022: Early Risk Prediction on the Internet (Extended Overview)
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-66.pdf
|volume=Vol-3180
|authors=Javier Parapar,Patricia Martin-Rodilla,David E. Losada,Fabio Crestani
|dblpUrl=https://dblp.org/rec/conf/clef/ParaparMLC22
}}
==Overview of eRisk 2022: Early Risk Prediction on the Internet (Extended Overview)==
Overview of eRisk at CLEF 2022: Early Risk Prediction on the Internet (Extended Overview)

Javier Parapar (1), Patricia Martín-Rodilla (1), David E. Losada (2) and Fabio Crestani (3)

(1) Information Retrieval Lab, Centro de Investigación en Tecnoloxías da Información e as Comunicacións (CITIC), Universidade da Coruña. Campus de Elviña s/n, C.P. 15071, A Coruña, Spain
(2) Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela. Rúa de Jenaro de la Fuente Domínguez, C.P. 15782, Santiago de Compostela, Spain
(3) Faculty of Informatics, Università della Svizzera italiana (USI). Campus EST, Via alla Santa 1, 6900 Viganello, Switzerland

Abstract: This paper provides an overview of eRisk 2022, the sixth edition of this lab at the CLEF conference. Since its inception, the primary purpose of the lab has been to investigate topics concerning evaluation techniques, effectiveness metrics, and other processes connected to early risk detection on the internet. Early warning models can be employed in a range of contexts, including health and safety. This year, eRisk proposed three tasks. The first was to discover early indicators of pathological gambling. The second was to identify early signs of depression. The third required participants to automatically complete an eating disorders questionnaire, based on user writings on social media.

Keywords: early risk detection, pathological gambling, early detection of depression, eating disorders

1. Introduction

The major purpose of eRisk is to conduct research on evaluation methodologies, metrics, and other elements related to building research collections and identifying challenges for early risk identification. Early detection technology can be useful in a variety of sectors, notably those involving safety and health.
An automated system may issue early warnings when a person begins to exhibit indications of a mental illness, a sexual abuser begins engaging with a child, or a suspected criminal begins making antisocial threats on the Internet. While our evaluation approaches (new strategies for developing research collections, creative evaluation measures, etc.) can be applied across different domains, eRisk has thus far concentrated on psychological problems (depression, self-harm, pathological gambling, and eating disorders).

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy.
Contact: javier.parapar@udc.es (J. Parapar); patricia.martin.rodilla@udc.es (P. Martín-Rodilla); david.losada@usc.es (D. E. Losada); fabio.crestani@usi.ch (F. Crestani)
Homepages: https://www.dc.fi.udc.es/~parapar (J. Parapar); http://www.incipit.csic.es/gl/persoa/patricia-martin-rodilla (P. Martín-Rodilla); http://tec.citius.usc.es/ir/ (D. E. Losada); https://search.usi.ch/en/people/4f0dd874bbd63c00938825fae1843200/crestani-fabio (F. Crestani)
ORCID: 0000-0002-5997-8252 (J. Parapar); 0000-0002-1540-883X (P. Martín-Rodilla); 0000-0001-8823-7501 (D. E. Losada); 0000-0001-8672-0700 (F. Crestani)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In 2017, we conducted an exploratory task on the early detection of depression [1, 2]. This pilot task focused on the evaluation methods and test dataset described in [3]. In 2018, we continued detecting early signs of depression while also launching a new challenge on detecting early signs of anorexia [4, 5].
In 2019, we ran the continuation of the challenge on early identification of symptoms of anorexia, a challenge on early detection of signs of self-harm, and a third task aimed at estimating a user's responses to a depression questionnaire based on his or her social media interactions [6, 7, 8]. In 2020, we continued with the early detection of self-harm and the task on severity estimation of depression symptoms [9, 10, 11]. Finally, in the last edition, in 2021, we presented two tasks on early detection (pathological gambling and self-harm) and one on the severity estimation of depression [12, 13, 14].

Over the years, we have had the opportunity to compare a wide selection of solutions that use various technologies and concepts (e.g. Natural Language Processing, Machine Learning, or Information Retrieval). We discovered that the link between psychological disorders and language use is complex, and that some contributing systems are not very effective. For example, the majority of participants had performance levels (e.g., F1) below 70%. These figures indicate that further research into early prediction tasks is needed and that the solutions presented thus far leave substantial room for improvement.

In 2022, the lab had three campaign-style tasks [15]. The first task is the second edition of the early alert challenge on the pathological gambling domain; it follows the same organisation as previous early detection challenges. The second task is a continuation of the early detection of depression challenge, whose last edition was in 2018. Finally, we provided a new task on eating disorder severity estimation: participants were required to analyse a user's posts and then estimate the user's answers to a standard eating disorders questionnaire. We describe these tasks in greater detail in the following sections of this overview article. We had 93 teams registered for the lab.
We finally received results from 17 of them: 41 runs for Task 1, 62 runs for Task 2 and 12 for Task 3.

2. Task 1: Early Detection of Signs of Pathological Gambling

This is a follow-up to Task 1 from 2021. The task was to create new models for the early detection of pathological gambling risk. Pathological gambling, also known as ludomania (ICD-10-CM code F63.0), is usually referred to as gambling addiction (an urge to gamble regardless of its negative consequences). According to the World Health Organization [16], adult gambling addiction had prevalence rates ranging from 0.1% to 6.0% in 2017. The task required progressively digesting evidence and detecting early indicators of pathological gambling (also known as compulsive gambling or disordered gambling) as quickly as feasible. Participating systems had to read and analyse Social Media posts in the order in which users wrote them. As a result, systems that perform well in this task could be used to sequentially monitor user activity in blogs, social networks, and other types of online media.

Table 1. Task 1 (pathological gambling). Main statistics of the collection (columns: Train Gamblers, Train Control, Test Gamblers, Test Control)
Num. subjects: 164 / 2,184 / 81 / 1,998
Num. submissions (posts & comments): 54,674 / 1,073,88 / 14,627 / 1,014,122
Avg num. of submissions per subject: 333.37 / 491.70 / 180.58 / 507.56
Avg num. of days from first to last submission: ≈560 / ≈662 / ≈489.7 / ≈664.9
Avg num. of words per submission: 30.64 / 20.08 / 30.4 / 22.2

This task's test collection followed the same format as the collection specified in [3]. It is a collection of writings (posts or comments) from a set of Social Media users. The data source is also the same as in earlier editions of eRisk (Reddit). There are two types of users, pathological gamblers and non-pathological gamblers, and the collection includes a chronological series of posts for each user. We set up a server that distributed user writings to the participating teams incrementally. The lab website (https://early.irlab.org/server.html) contains more information about the server.

This was a train-and-test task. For the training stage, the teams got access to training data, where we published the entire history of writings of the training users. We identified which users had explicitly said that they are pathological gamblers. As a result, the participants could tune their systems using the training data. The training data for Task 1 in 2022 was made up of all Task 1 users from 2021.

During the test stage, participants connected to our server and iteratively received user writings and sent responses. At any point in the user chronology, a participant could stop and deliver an alert. After reading each user post, the teams had to decide whether or not to issue an alert about the user (i.e., whether the system predicts the person will develop the risk). Participants had to make this decision independently for each user in the test split. We regarded alerts as final (i.e. further decisions about that individual were ignored). In contrast, the absence of an alert was regarded as provisional (i.e. the participants could later submit an alert about the user if they detected the appearance of signs of risk). To evaluate the systems, we used the correctness of the decisions and the number of user writings required to produce them (see below).

We set up a REST service to support the testing. The server progressively disseminated user writings to each participant while waiting for responses (no new user data was distributed to a given participant until the service had received all decisions for users and runs from that team in the previous step). The service was open for submissions from January 17th, 2022 to April 22nd, 2022.

We used existing methodologies that optimise the use of assessors' time to produce the ground truth assessments [17, 18]. These methods enable the creation of test collections through the use of simulated pooling algorithms. The key statistics of the test collection used for Task 1 are reported in Table 1. The following sections go over the evaluation methods.

2.1. Decision-based Evaluation

This form of evaluation revolves around the (binary) decisions taken for each user by the participating systems. Besides standard classification measures (Precision, Recall and F1, computed with respect to the positive class), we computed ERDE, the early risk detection error used in previous editions of the lab. A full description of ERDE can be found in [3]. Essentially, ERDE is an error measure that introduces a penalty for late correct alerts (true positives). The penalty grows with the delay in emitting the alert, where the delay is measured as the number of user posts that had to be processed before making the alert. Since 2019, we have complemented the evaluation report with additional decision-based metrics that try to capture other aspects of the problem.
These metrics try to overcome some limitations of ERDE, namely:

• the penalty associated with true positives goes quickly to 1, owing to the functional form of the cost function (a sigmoid);
• a perfect system, which detects the true positive case right after the first round of messages (first chunk), does not get an error equal to 0;
• with a method based on releasing data in chunks (as was done in 2017 and 2018), the contribution of each user to the performance evaluation has a large variance (different for users with few writings per chunk vs. users with many writings per chunk);
• ERDE is not interpretable.

Some research teams have analysed these issues and proposed alternative evaluation approaches. Trotzek and colleagues [19] proposed ERDE_o%, a variant of ERDE that does not depend on the number of user writings seen before the alert but, instead, on the percentage of user writings seen before the alert. In this way, users' contributions to the evaluation are normalised (all users weigh the same). However, ERDE_o% has an important limitation: in real-life applications, the overall number of user writings is not known in advance. Social Media users post content online, and screening tools have to make predictions with the evidence seen so far. In practice, you do not know when (and if) a user's thread of messages is exhausted. Thus, the performance metric should not depend on knowledge of the total number of user writings.

Another alternative evaluation metric for early risk prediction, which fits better with our purposes, was proposed by Sadeque and colleagues [20]: F_latency. This measure is described next. Imagine a user u ∈ U and an early risk detection system that iteratively analyses u's writings (e.g. in chronological order, as they appear in Social Media) and, after analysing k_u user writings (k_u ≥ 1), takes a binary decision d_u ∈ {0, 1}, which represents the decision of the system about the user being a risk case. By g_u ∈ {0, 1} we refer to the user's golden truth label. A key component of an early risk evaluation should be the delay in detecting true positives (we do not want systems to detect these cases too late). A first and intuitive measure of delay can therefore be defined as follows (note that Sadeque et al. (see [20], pg. 497) computed the latency for all users such that g_u = 1; we argue that latency should be computed only over the true positives, because the false negatives (g_u = 1, d_u = 0) are not detected by the system and, therefore, would not generate an alert):

latency_TP = median{ k_u : u ∈ U, d_u = g_u = 1 }   (1)

This measure of latency is calculated over the true positives detected by the system and assesses the system's delay based on the median number of writings that the system had to process to detect such positive cases. It can be included in the experimental report together with standard measures such as Precision (P), Recall (R) and the F-measure (F):

P = |{ u ∈ U : d_u = g_u = 1 }| / |{ u ∈ U : d_u = 1 }|   (2)

R = |{ u ∈ U : d_u = g_u = 1 }| / |{ u ∈ U : g_u = 1 }|   (3)

F = (2 · P · R) / (P + R)   (4)

Furthermore, Sadeque et al. proposed F_latency, a measure that combines the effectiveness of the decision (estimated with the F measure) and the delay in the decision (again, we adopt Sadeque et al.'s proposal but estimate latency only over the true positives). It is calculated by multiplying F by a penalty factor based on the median delay. More specifically, each individual (true positive) decision, taken after reading k_u writings, is assigned the following penalty:

penalty(k_u) = -1 + 2 / (1 + exp(-p · (k_u - 1)))   (5)

where p is a parameter that determines how quickly the penalty should increase. In [20], p was set such that the penalty equals 0.5 at the median number of posts of a user (in our evaluation we set p to 0.0078, a setting obtained from the eRisk 2017 collection). Observe that a decision right after the first writing has no penalty (i.e. penalty(1) = 0). Figure 1 plots how the latency penalty increases with the number of observed writings.

[Figure 1: Latency penalty as a function of the number of observed writings (k_u).]

The system's overall speed factor is computed as:

speed = 1 - median{ penalty(k_u) : u ∈ U, d_u = g_u = 1 }   (6)

where speed equals 1 for a system whose true positives are all detected right at the first writing. A slow system, which detects true positives after hundreds of writings, will be assigned a speed score near 0. Finally, the latency-weighted F score is simply:

F_latency = F · speed   (7)

Since 2019, users' data have been processed by the participants on a post-by-post basis (i.e. we avoided a chunk-based release of data). Under these conditions, the evaluation approach has the following properties:

• smooth growth of penalties;
• a perfect system gets F_latency = 1;
• for each user u, the system can opt to stop at any point k_u; therefore, we no longer have the effect of an imbalanced importance of users;
• F_latency is more interpretable than ERDE.

2.2. Ranking-based Evaluation

This section explains a different type of evaluation that was employed in addition to the decision-based evaluation described above. Following each data release (new user writing), participants were required to send back the following information for each user in the collection: i) a decision for the user (alert/no alert), which was used to compute the decision-based metrics outlined above, and ii) a score representing the user's level of risk (estimated from the evidence seen so far).
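A minimal sketch of how such per-round risk scores can be turned into a ranking and assessed with standard IR measures follows; the score values and the binary-gain NDCG implementation are illustrative assumptions, not the exact evaluation code used by the lab.

```python
from math import log2

def rank_users(scores):
    # scores: {user id: risk score}; highest estimated risk first
    return sorted(scores, key=scores.get, reverse=True)

def precision_at(ranking, positives, k=10):
    # fraction of the top-k ranked users that are true risk cases
    return sum(1 for u in ranking[:k] if u in positives) / k

def ndcg_at(ranking, positives, k=10):
    # NDCG with binary gains: gain 1 for risk cases, 0 otherwise
    dcg = sum(1.0 / log2(i + 2) for i, u in enumerate(ranking[:k]) if u in positives)
    ideal = sum(1.0 / log2(i + 2) for i in range(min(k, len(positives))))
    return dcg / ideal if ideal else 0.0

scores = {"u1": 0.9, "u2": 0.2, "u3": 0.7, "u4": 0.1}
ranking = rank_users(scores)            # ["u1", "u3", "u2", "u4"]
print(precision_at(ranking, {"u1", "u3"}, k=2))
print(ndcg_at(ranking, {"u1", "u3"}, k=2))
```

Recomputing the ranking after every new writing yields the sequence of per-round rankings evaluated below.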
We used these scores to create a ranking of users in decreasing order of estimated risk. For each participating system, we have one ranking at each point in time (i.e., a ranking after one writing, a ranking after two writings, etc.). This replicates a continuous re-ranking strategy based on the evidence seen so far. In practice, such a ranking would be offered to an expert, who could make decisions by inspecting it. Each ranking can be evaluated with standard IR metrics, such as P@10 or NDCG. We therefore report the ranking-based performance of the systems after seeing k writings (for varying k).

2.3. Task 1: Results

Table 2 shows the participating teams, the number of runs submitted, the number of user writings processed, and the approximate lapse of time from the first response to the last. This time lapse is indicative of the degree of automation of each team's algorithms.

Table 2. Task 1 (pathological gambling): participating teams, number of runs, number of user writings processed by the team, and lapse of time taken for the entire process (from 1st to last response).
UNED-NLP — 5 runs — 2001 writings — 17:58:48
SINAI — 3 runs — 46 writings — 4 days 12:54:03
BioInfo_UAVR — 5 runs — 1002 writings — 22:35:47
RELAI — 5 runs — 109 writings — 7 days 15:27:25
BLUE — 3 runs — 2001 writings — 3 days 13:15:25
BioNLP-UniBuc — 5 runs — 3 writings — 00:37:33
UNSL — 5 runs — 2001 writings — 1 day 21:53:51
NLPGroup-IISERB — 5 runs — 1020 writings — 15 days 21:30:48
stezmo3 — 5 runs — 30 writings — 12:30:26

A few of the submitted runs processed the entire thread of messages (2001 writings), but many variants stopped earlier. Some of the teams were still submitting results at the deadline. Three teams processed the thread of messages reasonably fast (around a day for the entire history of user messages). The rest of the teams took several days to run the whole process, and some took even more than a week. Such long lapses suggest that these teams incorporated some form of offline processing. Table 3 reports the decision-based performance achieved by the participating teams.
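As a reference for how latency-weighted F1 figures such as those in Table 3 are derived from Equations (1) to (7), here is a minimal sketch; the example decisions at the bottom are made up for illustration, while p = 0.0078 is the setting stated in Section 2.1.

```python
from math import exp
from statistics import median

def penalty(k, p=0.0078):
    # Equation (5); zero penalty for a decision right after the first writing
    return -1.0 + 2.0 / (1.0 + exp(-p * (k - 1)))

def latency_weighted_f1(runs):
    """runs: iterable of (d_u, g_u, k_u) = decision, gold label, writings read."""
    tp_delays = [k for d, g, k in runs if d == 1 and g == 1]
    fp = sum(1 for d, g, _ in runs if d == 1 and g == 0)
    fn = sum(1 for d, g, _ in runs if d == 0 and g == 1)
    precision = len(tp_delays) / (len(tp_delays) + fp)      # eq. (2)
    recall = len(tp_delays) / (len(tp_delays) + fn)         # eq. (3)
    f1 = 2 * precision * recall / (precision + recall)      # eq. (4)
    speed = 1.0 - median(penalty(k) for k in tp_delays)     # eq. (6)
    return f1 * speed                                       # eq. (7)

# Two true positives (after 1 and 3 writings), one false positive, one false negative:
print(latency_weighted_f1([(1, 1, 1), (1, 1, 3), (1, 0, 2), (0, 1, 5)]))
```

Because the penalty grows slowly around k = 1, runs that alert within a handful of writings keep almost their full F1, which matches the speed values near 1.0 observed for most runs in Table 3.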
In terms of Precision, the best performing run was NLPGroup-IISERB #4, but at the expense of a very low Recall. In terms of F1, ERDE_50 and latency-weighted F1, the best performing run was UNED-NLP #4, which also achieves a quite high Recall (0.938). Many teams achieved perfect Recall at the expense of very low Precision. In terms of ERDE_5, the best performing runs are SINAI #0 and #1 and BLUE #0. The majority of teams made quick decisions. Overall, these findings indicate that some systems achieved a relatively high level of effectiveness with only a few user submissions. Social and public health systems may use the best predictive algorithms to assist human experts in detecting signs of pathological gambling as early as possible.

Table 3. Decision-based evaluation for Task 1 (columns: Team, Run, P, R, F1, ERDE_5, ERDE_50, latency_TP, speed, latency-weighted F1; "–" marks runs that never issued an alert, for which the latency-based values are undefined)
UNED-NLP 0 0.285 0.975 0.441 0.019 0.010 2.0 0.996 0.440
UNED-NLP 1 0.555 0.938 0.697 0.019 0.009 2.5 0.994 0.693
UNED-NLP 2 0.296 0.988 0.456 0.019 0.009 2.0 0.996 0.454
UNED-NLP 3 0.536 0.926 0.679 0.019 0.009 3.0 0.992 0.673
UNED-NLP 4 0.809 0.938 0.869 0.020 0.008 3.0 0.992 0.862
SINAI 0 0.425 0.765 0.546 0.015 0.011 1.0 1.000 0.546
SINAI 1 0.575 0.802 0.670 0.015 0.009 1.0 1.000 0.670
SINAI 2 0.908 0.728 0.808 0.016 0.011 1.0 1.000 0.808
BioInfo_UAVR 0 0.093 0.988 0.170 0.040 0.017 5.0 0.984 0.167
BioInfo_UAVR 1 0.067 1.000 0.126 0.047 0.024 5.0 0.984 0.124
BioInfo_UAVR 2 0.052 1.000 0.099 0.051 0.029 5.0 0.984 0.097
BioInfo_UAVR 3 0.050 1.000 0.095 0.052 0.030 5.0 0.984 0.094
BioInfo_UAVR 4 0.192 0.988 0.321 0.033 0.011 5.0 0.984 0.316
RELAI 0 0.000 0.000 0.000 0.039 0.039 – – –
RELAI 1 0.000 0.000 0.000 0.039 0.039 – – –
RELAI 2 0.052 0.963 0.099 0.036 0.029 1.0 1.000 0.099
RELAI 3 0.051 0.963 0.098 0.037 0.030 1.0 1.000 0.098
RELAI 4 0.000 0.000 0.000 0.039 0.039 – – –
BLUE 0 0.260 0.975 0.410 0.015 0.009 1.0 1.000 0.410
BLUE 1 0.123 0.988 0.219 0.021 0.015 1.0 1.000 0.219
BLUE 2 0.052 1.000 0.099 0.037 0.028 1.0 1.000 0.099
BioNLP-UniBuc 0 0.039 1.000 0.075 0.038 0.037 1.0 1.000 0.075
BioNLP-UniBuc 1 0.039 1.000 0.076 0.038 0.037 1.0 1.000 0.076
BioNLP-UniBuc 2 0.040 1.000 0.077 0.037 0.036 1.0 1.000 0.077
BioNLP-UniBuc 3 0.046 1.000 0.087 0.033 0.032 1.0 1.000 0.087
BioNLP-UniBuc 4 0.046 1.000 0.089 0.032 0.031 1.0 1.000 0.089
UNSL 0 0.401 0.951 0.564 0.041 0.008 11.0 0.961 0.542
UNSL 1 0.461 0.938 0.618 0.041 0.008 11.0 0.961 0.594
UNSL 2 0.398 0.914 0.554 0.041 0.008 12.0 0.957 0.531
UNSL 3 0.365 0.864 0.513 0.017 0.009 3.0 0.992 0.509
UNSL 4 0.052 0.988 0.100 0.051 0.030 5.0 0.984 0.098
NLPGroup-IISERB 0 0.107 0.642 0.183 0.030 0.025 2.0 0.996 0.182
NLPGroup-IISERB 1 0.044 1.000 0.084 0.046 0.033 3.0 0.992 0.083
NLPGroup-IISERB 2 0.043 1.000 0.083 0.041 0.034 1.0 1.000 0.083
NLPGroup-IISERB 3 0.140 1.000 0.246 0.025 0.014 2.0 0.996 0.245
NLPGroup-IISERB 4 1.000 0.074 0.138 0.038 0.037 41.5 0.843 0.116
stezmo3 0 0.116 0.864 0.205 0.034 0.015 5.0 0.984 0.202
stezmo3 1 0.116 0.864 0.205 0.049 0.015 12.0 0.957 0.196
stezmo3 2 0.152 0.914 0.261 0.033 0.011 5.0 0.984 0.257
stezmo3 3 0.139 0.864 0.240 0.047 0.013 12.0 0.957 0.229
stezmo3 4 0.160 0.901 0.271 0.043 0.011 7.0 0.977 0.265

Table 4 presents the ranking-based results. Because some teams only processed a few dozen user writings, we could only compute their user rankings for the initial numbers of processed writings. For participants providing tied scores for users, we used the traditional docid criterion (subject name) to break the ties. Some runs (e.g., UNED-NLP #4, BLUE #0 and #1, and UNSL #0, #1 and #2) show very good shallow ranking-based effectiveness at multiple points (after one writing, after 100 writings, and so forth). Regarding the 100 cut-off, the best performing runs after one writing in terms of NDCG are UNED-NLP #2 and BLUE #0 and #1. In the other scenarios, both UNED-NLP and UNSL obtain very good results.

Table 4. Ranking-based evaluation for Task 1 (columns: Team, Run, then four groups separated by "|": P@10, NDCG@10 and NDCG@100 after 1, 100, 500 and 1000 writings; "–" marks points beyond the number of writings the run processed)
UNED-NLP 0 | 0.90 0.88 0.75 | 0.40 0.29 0.70 | 0.30 0.20 0.56 | 0.30 0.19 0.48
UNED-NLP 1 | 0.90 0.81 0.68 | 0.80 0.73 0.83 | 0.50 0.43 0.80 | 0.50 0.37 0.75
UNED-NLP 2 | 0.90 0.88 0.76 | 0.60 0.58 0.79 | 0.40 0.33 0.55 | 0.30 0.24 0.46
UNED-NLP 3 | 0.90 0.81 0.71 | 0.70 0.66 0.84 | 0.40 0.35 0.78 | 0.50 0.42 0.73
UNED-NLP 4 | 1.00 1.00 0.56 | 1.00 1.00 0.88 | 1.00 1.00 0.95 | 1.00 1.00 0.95
SINAI 0 | 0.10 0.19 0.56 | – | – | –
SINAI 1 | 0.70 0.65 0.62 | – | – | –
SINAI 2 | 1.00 1.00 0.70 | – | – | –
BioInfo_UAVR 0 | 0.00 0.00 0.03 | 0.80 0.87 0.33 | 0.00 0.00 0.00 | 0.10 0.10 0.03
BioInfo_UAVR 1 | 0.00 0.00 0.03 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | 0.00 0.00 0.00
BioInfo_UAVR 2 | 0.00 0.00 0.03 | 0.40 0.30 0.29 | 0.00 0.00 0.02 | 0.10 0.19 0.05
BioInfo_UAVR 3 | 0.00 0.00 0.03 | 0.00 0.00 0.10 | 0.00 0.00 0.00 | 0.10 0.07 0.02
BioInfo_UAVR 4 | 0.00 0.00 0.03 | 0.00 0.00 0.03 | 0.00 0.00 0.03 | 0.00 0.00 0.03
RELAI 0 | 0.30 0.19 0.31 | 0.20 0.18 0.21 | – | –
RELAI 1 | 0.30 0.19 0.31 | 0.20 0.13 0.27 | – | –
RELAI 2 | 0.40 0.34 0.41 | 0.10 0.12 0.36 | – | –
RELAI 3 | 0.40 0.34 0.41 | 0.50 0.47 0.37 | – | –
RELAI 4 | 0.00 0.00 0.01 | 0.00 0.00 0.00 | – | –
BLUE 0 | 1.00 1.00 0.76 | 1.00 1.00 0.81 | 1.00 1.00 0.89 | 1.00 1.00 0.89
BLUE 1 | 1.00 1.00 0.76 | 1.00 1.00 0.89 | 1.00 1.00 0.91 | 1.00 1.00 0.91
BLUE 2 | 1.00 1.00 0.69 | 1.00 1.00 0.40 | 0.00 0.00 0.02 | 0.00 0.00 0.01
BioNLP-UniBuc 0 | 0.00 0.00 0.06 | – | – | –
BioNLP-UniBuc 1 | 0.00 0.00 0.02 | – | – | –
BioNLP-UniBuc 2 | 0.00 0.00 0.04 | – | – | –
BioNLP-UniBuc 3 | 0.10 0.19 0.07 | – | – | –
BioNLP-UniBuc 4 | 0.00 0.00 0.02 | – | – | –
UNSL 0 | 1.00 1.00 0.68 | 1.00 1.00 0.90 | 1.00 1.00 0.93 | 1.00 1.00 0.95
UNSL 1 | 1.00 1.00 0.70 | 1.00 1.00 0.90 | 1.00 1.00 0.92 | 1.00 1.00 0.93
UNSL 2 | 0.90 0.90 0.66 | 1.00 1.00 0.77 | 0.90 0.92 0.78 | 0.90 0.90 0.77
UNSL 3 | 1.00 1.00 0.69 | 0.60 0.58 0.72 | 0.80 0.81 0.77 | 0.80 0.81 0.78
UNSL 4 | 0.10 0.07 0.32 | 0.10 0.07 0.32 | 0.20 0.13 0.33 | 0.30 0.22 0.37
NLPGroup-IISERB 0 | 0.00 0.00 0.02 | 0.00 0.00 0.03 | 0.00 0.00 0.03 | 0.00 0.00 0.03
NLPGroup-IISERB 1 | 0.00 0.00 0.03 | 0.00 0.00 0.03 | 0.00 0.00 0.05 | 0.00 0.00 0.03
NLPGroup-IISERB 2 | 0.00 0.00 0.15 | 0.00 0.00 0.11 | 0.20 0.13 0.12 | 0.00 0.00 0.08
NLPGroup-IISERB 3 | 0.00 0.00 0.01 | 0.10 0.06 0.10 | 0.10 0.07 0.12 | 0.10 0.07 0.12
NLPGroup-IISERB 4 | 0.20 0.38 0.15 | 0.00 0.00 0.06 | 0.00 0.00 0.07 | 0.00 0.00 0.07
stezmo3 0 | 0.10 0.06 0.26 | – | – | –
stezmo3 1 | 0.10 0.06 0.26 | – | – | –
stezmo3 2 | 0.50 0.58 0.61 | – | – | –
stezmo3 3 | 0.50 0.58 0.61 | – | – | –
stezmo3 4 | 0.50 0.58 0.61 | – | – | –

3. Task 2: Early Detection of Depression

This is a continuation of the tasks from 2017 and 2018, and it proposes early risk detection of depression in the same way as the pathological gambling task explained in Section 2. This task's test collection followed the same format as the collection specified in [3]. The data source is also the same as in earlier editions of eRisk. There are two types of users, depressed and non-depressed, and the collection offers a chronological series of posts for each user. In contrast to earlier editions of the challenge, this is the first one to use the REST service rather than the chunk-based release. The lab website (https://early.irlab.org/server.html) contains more information about the server.

This was a train-and-test task. The test phase was conducted in the same manner as in Task 1 (see Section 2). For the training stage, the teams got access to training data, where we published the entire history of writings of the training users. We highlighted those who had expressly stated that they suffer from depression. As a result, the participants could tune their systems using the training data. The training data for Task 2 in 2022 was made up of users from the 2017 and 2018 editions. Again, we followed existing methods to build the assessments using simulated pooling strategies, which optimise the use of assessors' time [17, 18]. Table 5 reports the main statistics of the test collection used for Task 2. The same decision-based and ranking-based measures discussed in Sections 2.1 and 2.2 were used for this task.

Table 5. Task 2 (depression). Main statistics of the test collection (columns: Test Depressed, Test Control)
Num. subjects: 98 / 1,302
Num. submissions (posts & comments): 35,332 / 687,228
Avg num. of submissions per subject: 360.53 / 527.82
Avg num. of days from first to last submission: ≈628.2 / ≈661.7
Avg num. of words per submission: 27.4 / 23.5

3.1. Task 2: Results

Table 6 shows the participating teams, the number of runs submitted, the number of user writings processed, and the approximate lapse of time from the first response to the last. Most of the submitted runs processed the entire thread of messages (about 2000), but a few stopped earlier or were not able to process the users' history in time. Only one team was able to process the entire set of writings in less than a day.

Table 6. Task 2 (depression): participating teams, number of runs, number of user writings processed by the team, and lapse of time taken for the whole process (from 1st to last response).
CYUT — 5 runs — 2000 writings — 7 days 12:02:44
LauSAn — 5 runs — 2000 writings — 2 days 06:44:17
BLUE — 3 runs — 2000 writings — 2 days 17:16:05
BioInfo_UAVR — 5 runs — 503 writings — 09:38:26
TUA1 — 5 runs — 2000 writings — 16:28:49
NLPGroup-IISERB — 5 runs — 632 writings — 11 days 20:35:11
RELAI — 5 runs — 169 writings — 7 days 02:27:10
UNED-MED — 5 runs — 1318 writings — 5 days 13:18:24
Sunday-Rocker2 — 5 runs — 682 writings — 4 days 03:54:25
SCIR2 — 5 runs — 2000 writings — 1 day 04:52:02
UNSL — 5 runs — 2000 writings — 1 day 09:35:12
E8-IJS — 5 runs — 2000 writings — 3 days 02:36:32
NITK-NLP2 — 4 runs — 6 writings — 01:52:57
Table 7: Decision-based evaluation for Task 2 𝑙𝑎𝑡𝑒𝑛𝑐𝑦-𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 𝐹 1 𝑙𝑎𝑡𝑒𝑛𝑐𝑦𝑇 𝑃 𝐸𝑅𝐷𝐸50 𝐸𝑅𝐷𝐸5 𝑠𝑝𝑒𝑒𝑑 𝐹1 Team Run 𝑅 𝑃 CYUT 0 0.165 0.918 0.280 0.053 0.032 3.0 0.992 0.277 CYUT 1 0.162 0.898 0.274 0.053 0.032 3.0 0.992 0.272 CYUT 2 0.106 0.867 0.189 0.056 0.047 1.0 1.000 0.189 CYUT 3 0.149 0.878 0.255 0.075 0.040 7.0 0.977 0.249 CYUT 4 0.142 0.918 0.245 0.082 0.041 8.0 0.973 0.239 LauSAn 0 0.137 0.827 0.235 0.041 0.038 1.0 1.000 0.235 LauSAn 1 0.165 0.888 0.279 0.053 0.040 2.0 0.996 0.278 LauSAn 2 0.174 0.867 0.290 0.056 0.031 4.0 0.988 0.287 LauSAn 3 0.420 0.643 0.508 0.059 0.041 6.0 0.981 0.498 LauSAn 4 0.201 0.724 0.315 0.039 0.033 1.0 1.000 0.315 BLUE 0 0.395 0.898 0.548 0.047 0.027 5.0 0.984 0.540 BLUE 1 0.213 0.939 0.347 0.054 0.033 4.5 0.986 0.342 BLUE 2 0.106 1.000 0.192 0.074 0.048 4.0 0.988 0.190 BioInfo_UAVR 0 0.222 0.949 0.360 0.071 0.031 9.0 0.969 0.349 BioInfo_UAVR 1 0.091 0.969 0.166 0.101 0.054 8.0 0.973 0.162 BioInfo_UAVR 2 0.171 0.969 0.291 0.083 0.035 11.0 0.961 0.279 BioInfo_UAVR 3 0.090 0.990 0.166 0.101 0.052 6.0 0.981 0.162 BioInfo_UAVR 4 0.378 0.857 0.525 0.069 0.031 16.0 0.942 0.494 TUA1 0 0.155 0.806 0.260 0.055 0.037 3.0 0.992 0.258 TUA1 1 0.129 0.816 0.223 0.053 0.041 3.0 0.992 0.221 TUA1 2 0.155 0.806 0.260 0.055 0.037 3.0 0.992 0.258 TUA1 3 0.129 0.816 0.223 0.053 0.041 3.0 0.992 0.221 TUA1 4 0.159 0.959 0.272 0.052 0.036 3.0 0.992 0.270 NLPGroup-IISERB 0 0.682 0.745 0.712 0.055 0.032 9.0 0.969 0.690 NLPGroup-IISERB 1 0.385 0.857 0.532 0.062 0.032 18.0 0.934 0.496 NLPGroup-IISERB 2 0.662 0.459 0.542 0.069 0.058 62.0 0.766 0.416 NLPGroup-IISERB 3 0.653 0.500 0.566 0.067 0.046 26.0 0.903 0.511 NLPGroup-IISERB 4 0.000 0.000 0.000 0.070 0.070 RELAI 0 0.085 0.847 0.155 0.114 0.092 51.0 0.807 0.125 RELAI 1 0.085 0.847 0.155 0.114 0.091 51.0 0.807 0.125 RELAI 2 0.000 0.000 0.000 0.070 0.070 RELAI 3 0.000 0.000 0.000 0.070 0.070 RELAI 4 0.000 0.000 0.000 0.070 0.070 UNED-MED 0 0.119 0.969 0.212 0.091 0.056 18.0 0.934 0.198 UNED-MED 1 0.139 0.980 
0.244 0.079 0.046 13.0 0.953 0.233 UNED-MED 2 0.122 0.939 0.215 0.086 0.057 15.0 0.945 0.204 UNED-MED 3 0.131 0.949 0.231 0.084 0.051 15.0 0.945 0.218 UNED-MED 4 0.084 0.163 0.111 0.079 0.078 251.0 0.252 0.028 Sunday-Rocker2 0 0.091 1.000 0.167 0.080 0.053 4.0 0.988 0.165 Sunday-Rocker2 1 0.355 0.786 0.489 0.068 0.041 27.0 0.899 0.439 Sunday-Rocker2 2 0.092 0.388 0.149 0.088 0.083 117.5 0.575 0.085 Sunday-Rocker2 3 0.283 0.816 0.420 0.071 0.045 37.5 0.859 0.361 Sunday-Rocker2 4 0.108 1.000 0.195 0.082 0.047 6.0 0.981 0.191 SCIR2 0 0.396 0.837 0.538 0.076 0.076 150.0 0.477 0.256 SCIR2 1 0.336 0.878 0.486 0.078 0.078 150.0 0.477 0.232 SCIR2 2 0.235 0.908 0.373 0.051 0.046 3.0 0.992 0.370 SCIR2 3 0.316 0.847 0.460 0.079 0.026 44.0 0.834 0.383 SCIR2 4 0.274 0.847 0.414 0.045 0.031 3.0 0.992 0.411 UNSL 0 0.161 0.918 0.274 0.079 0.042 14.5 0.947 0.260 UNSL 1 0.310 0.786 0.445 0.078 0.037 12.0 0.957 0.426 UNSL 2 0.400 0.755 0.523 0.045 0.026 3.0 0.992 0.519 UNSL 3 0.144 0.929 0.249 0.055 0.035 3.0 0.992 0.247 Table 7: Decision-based evaluation for Task 2 (Continuation) 𝑙𝑎𝑡𝑒𝑛𝑐𝑦-𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 𝐹 1 𝑙𝑎𝑡𝑒𝑛𝑐𝑦𝑇 𝑃 𝐸𝑅𝐷𝐸50 𝐸𝑅𝐷𝐸5 𝑠𝑝𝑒𝑒𝑑 𝐹1 Team Run 𝑅 𝑃 UNSL 4 0.080 0.918 0.146 0.099 0.074 5.0 0.984 0.144 E8-IJS 0 0.684 0.133 0.222 0.061 0.061 1.0 1.000 0.222 E8-IJS 1 0.242 0.959 0.387 0.068 0.036 20.5 0.924 0.357 E8-IJS 2 0.000 0.000 0.000 0.070 0.070 E8-IJS 3 0.000 0.000 0.000 0.070 0.070 E8-IJS 4 0.000 0.000 0.000 0.070 0.070 NITK-NLP2 0 0.138 0.796 0.235 0.047 0.039 2.0 0.996 0.234 NITK-NLP2 1 0.135 0.806 0.231 0.047 0.039 2.0 0.996 0.230 NITK-NLP2 2 0.132 0.786 0.225 0.050 0.040 2.0 0.996 0.225 NITK-NLP2 3 0.149 0.724 0.248 0.049 0.039 2.0 0.996 0.247 Table 7 reports the decision-based performance achieved by the participating teams. In terms of Precision, E8-IJS run #0 obtains the highest values but at the expenses of low Recall. Similarly, Sunday-Rocker systems #0 and #4 obtain and BLUE #2 perfect Recall but with low Precision values. 
When considering the Precision-Recall trade-off, NLPGroup-IISERB run #0 performs best, being the only run with an F1 above 0.7 (the highest F1). Regarding latency-penalised metrics, UNSL #2 and SCIR2 #3 obtained the best ERDE_50 and LauSAn #4 the best ERDE_5 error value. NLPGroup-IISERB #0 also achieves the best latency-weighted F1. This run seems quite balanced overall.

Table 8: Ranking-based evaluation for Task 2. For each checkpoint (after 1, 100, 500 and 1000 writings) the three values are P@10, NDCG@10 and NDCG@100; "-" marks checkpoints for which a run did not submit a ranking.

Team | Run | 1 writing | 100 writings | 500 writings | 1000 writings
CYUT | 0 | 0.50 0.49 0.37 | 0.50 0.52 0.54 | 0.60 0.59 0.58 | 0.70 0.72 0.61
CYUT | 1 | 0.70 0.77 0.37 | 0.60 0.72 0.58 | 0.60 0.72 0.61 | 0.70 0.80 0.62
CYUT | 2 | 0.00 0.00 0.16 | 0.10 0.07 0.25 | 0.10 0.19 0.31 | 0.10 0.12 0.29
CYUT | 3 | 0.10 0.07 0.12 | 0.70 0.70 0.57 | 0.70 0.72 0.59 | 0.80 0.74 0.60
CYUT | 4 | 0.10 0.06 0.12 | 0.60 0.68 0.55 | 0.60 0.69 0.59 | 0.80 0.84 0.61
LauSAn | 0 | 0.60 0.72 0.43 | 0.30 0.41 0.13 | 0.20 0.31 0.12 | 0.10 0.19 0.11
LauSAn | 1 | 0.60 0.66 0.43 | 0.40 0.33 0.30 | 0.50 0.50 0.17 | 0.20 0.15 0.08
LauSAn | 2 | 0.60 0.66 0.43 | 0.40 0.33 0.29 | 0.50 0.50 0.18 | 0.20 0.15 0.13
LauSAn | 3 | 0.60 0.66 0.43 | 0.40 0.33 0.27 | 0.50 0.50 0.22 | 0.20 0.15 0.14
LauSAn | 4 | 0.40 0.38 0.34 | 0.50 0.49 0.41 | 0.40 0.27 0.21 | 0.20 0.22 0.14
BLUE | 0 | 0.80 0.88 0.54 | 0.60 0.56 0.59 | 0.80 0.81 0.66 | 0.80 0.80 0.68
BLUE | 1 | 0.80 0.88 0.54 | 0.70 0.64 0.67 | 0.80 0.84 0.74 | 0.80 0.86 0.72
BLUE | 2 | 0.80 0.75 0.46 | 0.40 0.40 0.30 | 0.30 0.35 0.20 | 0.30 0.38 0.16
BioInfo_UAVR | 0 | 0.00 0.00 0.04 | 0.20 0.15 0.15 | 0.00 0.00 0.09 | - - -
BioInfo_UAVR | 1 | 0.00 0.00 0.02 | 0.20 0.25 0.14 | 0.20 0.12 0.07 | - - -
BioInfo_UAVR | 2 | 0.20 0.13 0.06 | 0.60 0.60 0.36 | 0.70 0.78 0.32 | - - -
BioInfo_UAVR | 3 | 0.10 0.08 0.08 | 0.20 0.26 0.14 | 0.20 0.17 0.08 | - - -
BioInfo_UAVR | 4 | 0.10 0.07 0.05 | 0.00 0.00 0.04 | 0.00 0.00 0.05 | - - -
TUA1 | 0 | 0.80 0.88 0.44 | 0.60 0.72 0.52 | 0.60 0.67 0.52 | 0.70 0.80 0.57
TUA1 | 1 | 0.70 0.77 0.44 | 0.50 0.54 0.39 | 0.50 0.56 0.42 | 0.50 0.65 0.43
TUA1 | 2 | 0.80 0.88 0.44 | 0.60 0.72 0.52 | 0.60 0.67 0.52 | 0.70 0.80 0.57
TUA1 | 3 | 0.60 0.69 0.43 | 0.50 0.54 0.39 | 0.50 0.56 0.42 | 0.50 0.65 0.43
TUA1 | 4 | 0.50 0.37 0.35 | 0.00 0.00 0.36 | 0.00 0.00 0.36 | 0.20 0.12 0.31
NLPGroup-IISERB | 0 | 0.00 0.00 0.02 | 0.90 0.92 0.30 | 0.90 0.92 0.33 | - - -
NLPGroup-IISERB | 1 | 0.30 0.32 0.13 | 0.90 0.81 0.27 | 0.80 0.84 0.33 | - - -
NLPGroup-IISERB | 2 | 0.70 0.79 0.24 | 0.00 0.00 0.00 | 0.00 0.00 0.00 | - - -
NLPGroup-IISERB | 3 | 0.00 0.00 0.06 | 0.10 0.19 0.06 | 0.00 0.00 0.02 | - - -
NLPGroup-IISERB | 4 | 0.00 0.00 0.04 | 0.90 0.93 0.66 | 0.90 0.92 0.69 | - - -
RELAI | 0 | 0.00 0.00 0.07 | 0.10 0.06 0.20 | - - - | - - -
RELAI | 1 | 0.00 0.00 0.07 | 0.20 0.25 0.20 | - - - | - - -
RELAI | 2 | 0.10 0.12 0.09 | 0.00 0.00 0.16 | - - - | - - -
RELAI | 3 | 0.10 0.12 0.09 | 0.50 0.52 0.31 | - - - | - - -
RELAI | 4 | 0.10 0.12 0.07 | 0.00 0.00 0.00 | - - - | - - -
UNED-MED | 0 | 0.70 0.69 0.27 | 0.80 0.84 0.63 | 0.60 0.66 0.60 | 0.50 0.46 0.56
UNED-MED | 1 | 0.50 0.44 0.26 | 0.70 0.76 0.50 | 0.60 0.64 0.47 | 0.80 0.74 0.50
UNED-MED | 2 | 0.70 0.68 0.28 | 0.50 0.51 0.59 | 0.80 0.71 0.61 | 0.50 0.44 0.62
UNED-MED | 3 | 0.80 0.82 0.29 | 0.60 0.44 0.31 | 0.80 0.73 0.36 | 0.40 0.51 0.30
UNED-MED | 4 | 0.00 0.00 0.06 | 0.00 0.00 0.05 | 0.00 0.00 0.04 | 0.10 0.19 0.09
Sunday-Rocker2 | 0 | 0.40 0.47 0.39 | 0.40 0.44 0.29 | 0.50 0.46 0.24 | - - -
Sunday-Rocker2 | 1 | 0.70 0.81 0.39 | 0.90 0.93 0.66 | 0.90 0.88 0.65 | - - -
Sunday-Rocker2 | 2 | 0.10 0.07 0.23 | 0.00 0.00 0.11 | 0.30 0.31 0.17 | - - -
Sunday-Rocker2 | 3 | 0.80 0.88 0.41 | 0.50 0.50 0.23 | 0.60 0.69 0.34 | - - -
Sunday-Rocker2 | 4 | 0.30 0.28 0.31 | 0.30 0.37 0.25 | 0.40 0.30 0.18 | - - -
SCIR2 | 0 | 0.10 0.07 0.08 | 0.00 0.00 0.06 | 0.00 0.00 0.06 | 0.10 0.12 0.06
SCIR2 | 1 | 0.00 0.00 0.05 | 0.10 0.07 0.07 | 0.00 0.00 0.04 | 0.00 0.00 0.05
SCIR2 | 2 | 0.00 0.00 0.06 | 0.00 0.00 0.05 | 0.20 0.13 0.07 | 0.00 0.00 0.06
SCIR2 | 3 | 0.10 0.06 0.05 | 0.00 0.00 0.04 | 0.00 0.00 0.06 | 0.00 0.00 0.02
SCIR2 | 4 | 0.10 0.19 0.09 | 0.10 0.07 0.05 | 0.10 0.10 0.07 | 0.10 0.06 0.05
UNSL | 0 | 0.60 0.40 0.36 | 0.20 0.13 0.46 | 0.30 0.28 0.43 | 0.60 0.72 0.45
UNSL | 1 | 0.80 0.88 0.46 | 0.60 0.73 0.64 | 0.60 0.73 0.66 | 0.60 0.71 0.66
UNSL | 2 | 0.70 0.68 0.50 | 0.50 0.39 0.55 | 0.70 0.73 0.61 | 0.70 0.73 0.61
UNSL | 3 | 0.10 0.06 0.15 | 0.40 0.27 0.43 | 0.30 0.21 0.42 | 0.30 0.21 0.42
UNSL | 4 | 0.10 0.12 0.05 | 0.00 0.00 0.03 | 0.20 0.19 0.07 | 0.00 0.00 0.04
E8-IJS | 0 | 0.00 0.00 0.06 | 0.10 0.07 0.05 | 0.10 0.12 0.08 | 0.00 0.00 0.03
E8-IJS | 1 | 0.40 0.58 0.19 | 0.40 0.41 0.15 | 0.20 0.15 0.09 | 0.30 0.38 0.15
E8-IJS | 2 | 0.00 0.00 0.05 | 0.00 0.00 0.07 | 0.00 0.00 0.05 | 0.10 0.19 0.07
E8-IJS | 3 | 0.00 0.00 0.02 | 0.10 0.10 0.08 | 0.10 0.06 0.02 | 0.00 0.00 0.05
E8-IJS | 4 | 0.00 0.00 0.07 | 0.10 0.10 0.08 | 0.20 0.31 0.11 | 0.10 0.06 0.04
NITK-NLP2 | 0 | 0.40 0.28 0.15 | - - - | - - - | - - -
NITK-NLP2 | 1 | 0.00 0.00 0.01 | - - - | - - - | - - -
NITK-NLP2 | 2 | 0.00 0.00 0.02 | - - - | - - - | - - -
NITK-NLP2 | 3 | 0.00 0.00 0.02 | - - - | - - - | - - -

Table 8 presents the ranking-based results. Contrary to Task 1, no run obtained perfect figures in any of the scenarios. This is worth noting, given that Task 2 has more positive subjects. Overall, runs #0 and #1 from the BLUE team seem to be the most consistent across the different numbers of writings among the best-performing ones. Other systems, such as those from NLPGroup-IISERB, show erratic behaviour, dropping as low as a Precision of 0 when only one writing was processed but obtaining the best results for the same metrics after 100 writings.

4. Task 3: Measuring the severity of the signs of Eating Disorders

The challenge consists in estimating the severity of various symptoms linked with an eating disorder diagnosis. To accomplish this, the participants worked from a thread of user posts. Participants were given the whole history of social media posts and comments for each user, and they had to estimate the individual's replies to a standard eating disorder questionnaire (based on the evidence revealed in the history of posts/comments). The questionnaire is based on the Eating Disorder Examination Questionnaire (EDE-Q)7, a 28-item self-report questionnaire derived from the semi-structured interview Eating Disorder Examination (EDE)8 [21]. We only used questions 1 through 12 and 19 through 28.
This test is intended to measure the extent and severity of multiple eating disorder symptoms. It incorporates four subscales (Restraint, Eating Concern, Shape Concern, and Weight Concern) as well as an overall score. Table 9 lists the questions.

Table 9: Eating Disorder Examination Questionnaire

Instructions: The following questions are concerned with the past four weeks (28 days) only. Please read each question carefully. Please answer all the questions. Thank you.

Questions 1-12 and 19-21 are answered on the scale: 0. NO DAYS, 1. 1-5 DAYS, 2. 6-12 DAYS, 3. 13-15 DAYS, 4. 16-22 DAYS, 5. 23-27 DAYS, 6. EVERY DAY.

1. Have you been deliberately trying to limit the amount of food you eat to influence your shape or weight (whether or not you have succeeded)?
2. Have you gone for long periods of time (8 waking hours or more) without eating anything at all in order to influence your shape or weight?
3. Have you tried to exclude from your diet any foods that you like in order to influence your shape or weight (whether or not you have succeeded)?
4. Have you tried to follow definite rules regarding your eating (for example, a calorie limit) in order to influence your shape or weight (whether or not you have succeeded)?
5. Have you had a definite desire to have an empty stomach with the aim of influencing your shape or weight?
6. Have you had a definite desire to have a totally flat stomach?
7. Has thinking about food, eating or calories made it very difficult to concentrate on things you are interested in (for example, working, following a conversation, or reading)?
8. Has thinking about shape or weight made it very difficult to concentrate on things you are interested in (for example, working, following a conversation, or reading)?
9. Have you had a definite fear of losing control over eating?
10. Have you had a definite fear that you might gain weight?
11. Have you felt fat?
12. Have you had a strong desire to lose weight?
19. Over the past 28 days, on how many days have you eaten in secret (i.e., furtively)? Do not count episodes of binge eating.
20. On what proportion of the times that you have eaten have you felt guilty (felt that you've done wrong) because of its effect on your shape or weight? Do not count episodes of binge eating.
21. Over the past 28 days, how concerned have you been about other people seeing you eat? Do not count episodes of binge eating.

Questions 22-28 are answered on the scale: 0. NOT AT ALL, 1. SLIGHTLY, 2. SLIGHTLY, 3. MODERATELY, 4. MODERATELY, 5. MARKEDLY, 6. MARKEDLY.

22. Has your weight influenced how you think about (judge) yourself as a person?
23. Has your shape influenced how you think about (judge) yourself as a person?
24. How much would it have upset you if you had been asked to weigh yourself once a week (no more, or less, often) for the next four weeks?
25. How dissatisfied have you been with your weight?
26. How dissatisfied have you been with your shape?
27. How uncomfortable have you felt seeing your body (for example, seeing your shape in the mirror, in a shop window reflection, while undressing or taking a bath or shower)?
28. How uncomfortable have you felt about others seeing your shape or figure (for example, in communal changing rooms, when swimming, or wearing tight clothes)?

7 https://www.corc.uk.net/media/1273/ede-q_quesionnaire.pdf
8 https://www.corc.uk.net/media/1951/ede_170d.pdf

Figure 2: User answers mean and standard deviation. (a) Mean. (b) Standard deviation.

The goal of this task is to study the feasibility of automatically evaluating the severity of multiple eating disorder symptoms. Based on the user's writing history, the algorithms must estimate the user's response to each specific question.
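As Section 4.1 details, each EDE-Q subscale score is the mean response over a fixed subset of items, and the global score averages the four subscales. A minimal sketch, assuming answers are keyed by the original question numbers of Table 9 and using the item groupings listed in Section 4.1 (note that question 8 appears in both the shape and weight concern subscales, as in the paper's definitions):

```python
# EDE-Q item groupings as listed in Section 4.1 (question numbers from Table 9).
SUBSCALES = {
    "restraint":      [1, 2, 3, 4, 5],
    "eating_concern": [7, 9, 19, 21, 20],
    "shape_concern":  [6, 8, 23, 10, 26, 27, 28, 11],
    "weight_concern": [22, 24, 8, 25, 12],
}

def subscale_scores(answers):
    """answers: dict mapping question number -> response in 0..6.
    Returns each subscale score as the mean response over its items."""
    return {name: sum(answers[q] for q in items) / len(items)
            for name, items in SUBSCALES.items()}

def global_score(answers):
    """Global ED score: the mean of the four subscale scores."""
    scores = subscale_scores(answers)
    return sum(scores.values()) / len(scores)

# A user answering 6 everywhere scores 6.0 on every subscale and globally.
all_six = {q: 6 for items in SUBSCALES.values() for q in items}
print(global_score(all_six))  # 6.0
```

This is only an illustrative sketch of the scoring arithmetic; the official scoring instructions are in the EDE-Q document referenced in the footnotes.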
We collected surveys filled out by social media users as well as their writing history (we extracted each history of writings right after the user provided us with the filled questionnaire). The quality of the responses produced by the participating systems was evaluated against the user-completed questionnaires (ground truth). This was a test-only task: the participants received no training data. The participants were given a dataset with 28 individuals (each user's writing history was provided) and asked to create a file with the following structure:

username1 answer1 answer2 ... answer22
username2 answer1 answer2 ... answer22

Each line has a username followed by 22 values. These values correspond to the responses to the questions above (the possible values are 0, 1, 2, 3, 4, 5 and 6). Figure 2 illustrates the mean and standard deviation of the user answers over all questions.

4.1. Task 3: Evaluation Metrics

Evaluation is based on the following effectiveness metrics:

• Mean Zero-One Error ($MZOE$) between the questionnaire filled by the real user and the questionnaire filled by the system (i.e. the fraction of incorrect predictions):

$MZOE(f, Q) = \frac{|\{q_i \in Q : R(q_i) \neq f(q_i)\}|}{|Q|}$  (8)

where $f$ denotes the classification done by an automatic system, $Q$ is the set of questions of each questionnaire, $q_i$ is the i-th question, $R(q_i)$ is the real user's answer for the i-th question and $f(q_i)$ is the predicted answer of the system for the i-th question. Each user produces a single $MZOE$ score, and the reported $MZOE$ is the average over all users.

• Mean Absolute Error ($MAE$) between the questionnaire filled by the real user and the questionnaire filled by the system (i.e. the average deviation of the predicted responses from the true responses):

$MAE(f, Q) = \frac{\sum_{q_i \in Q} |R(q_i) - f(q_i)|}{|Q|}$  (9)

Again, each user produces a single $MAE$ score, and the reported $MAE$ is the average over all users.
• Macroaveraged Mean Absolute Error ($MAE_{macro}$) between the questionnaire filled by the real user and the questionnaire filled by the system (see [22]):

$MAE_{macro}(f, Q) = \frac{1}{7} \sum_{j=0}^{6} \frac{\sum_{q_i \in Q_j} |R(q_i) - f(q_i)|}{|Q_j|}$  (10)

where $Q_j$ is the set of questions whose true answer is $j$ (note that $j$ goes from 0 to 6 because those are the possible answers to each question). Again, each user produces a single $MAE_{macro}$ score, and the reported $MAE_{macro}$ is the average over all users.

The following measures are based on aggregated scores obtained from the questionnaires. Further details about the EDE-Q instrument can be found elsewhere (e.g. see the scoring section of the questionnaire9).

• Restraint Subscale (RS): Given a questionnaire, its restraint score is obtained as the mean response to the first five questions. Each user $u_i$ is associated with a real subscale ED score (referred to as $R_{RS}(u_i)$) and an estimated subscale ED score (referred to as $f_{RS}(u_i)$). This metric computes the RMSE between the real and estimated subscale ED scores as follows:

$RMSE(f, U) = \sqrt{\frac{\sum_{u_i \in U} (R_{RS}(u_i) - f_{RS}(u_i))^2}{|U|}}$  (11)

where $U$ is the user set.

9 https://www.corc.uk.net/media/1951/ede_170d.pdf

• Eating Concern Subscale (ECS): Given a questionnaire, its eating concern score is obtained as the mean response to questions 7, 9, 19, 21 and 20. This metric computes the RMSE (Equation 12) between the eating concern ED score obtained from the questionnaire filled by the real user and the one obtained from the questionnaire filled by the system.
$RMSE(f, U) = \sqrt{\frac{\sum_{u_i \in U} (R_{ECS}(u_i) - f_{ECS}(u_i))^2}{|U|}}$  (12)

• Shape Concern Subscale (SCS): Given a questionnaire, its shape concern score is obtained as the mean response to questions 6, 8, 23, 10, 26, 27, 28 and 11. This metric computes the RMSE (Equation 13) between the shape concern ED score obtained from the questionnaire filled by the real user and the one obtained from the questionnaire filled by the system.

$RMSE(f, U) = \sqrt{\frac{\sum_{u_i \in U} (R_{SCS}(u_i) - f_{SCS}(u_i))^2}{|U|}}$  (13)

• Weight Concern Subscale (WCS): Given a questionnaire, its weight concern score is obtained as the mean response to questions 22, 24, 8, 25 and 12. This metric computes the RMSE (Equation 14) between the weight concern ED score obtained from the questionnaire filled by the real user and the one obtained from the questionnaire filled by the system.

$RMSE(f, U) = \sqrt{\frac{\sum_{u_i \in U} (R_{WCS}(u_i) - f_{WCS}(u_i))^2}{|U|}}$  (14)

• Global ED (GED): To obtain an overall or "global" score, the four subscale scores are summed and the resulting total is divided by the number of subscales (i.e. four) [21]. This metric computes the RMSE between the real and estimated global ED scores as follows:

$RMSE(f, U) = \sqrt{\frac{\sum_{u_i \in U} (R_{GED}(u_i) - f_{GED}(u_i))^2}{|U|}}$  (15)

4.2. Task 3: Results

Table 10 presents the results achieved by the participants in this task. To put things in perspective, the table also reports (lower block) the performance achieved by three baseline variants: all 0s and all 6s, which consist of sending the same response (0 or 6) for all the questions, and average, which is the performance achieved by a method that, for each question, sends as a response the answer closest to the mean of the responses sent by all participants (e.g. if the mean response provided by the participants equals 3.7, this average approach would submit a 4). Table 11 reports the names of the runs with the corresponding team and run ID.
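To make the definitions in Section 4.1 concrete, the following is a minimal sketch of the per-questionnaire metrics (MZOE, MAE, MAE_macro) and of the RMSE used for the subscale and global scores. The answer vectors are assumed to be aligned lists of integer responses in 0..6; how empty answer classes $Q_j$ are handled in Equation 10 is not specified in the text, so skipping them here is an assumption:

```python
import math

def mzoe(real, pred):
    """Fraction of questions answered incorrectly (Eq. 8)."""
    return sum(r != p for r, p in zip(real, pred)) / len(real)

def mae(real, pred):
    """Mean absolute deviation from the true answers (Eq. 9)."""
    return sum(abs(r - p) for r, p in zip(real, pred)) / len(real)

def mae_macro(real, pred):
    """MAE macro-averaged over the 7 possible answer values (Eq. 10).
    Assumption: answer values j with no questions are skipped."""
    total = 0.0
    for j in range(7):
        qj = [(r, p) for r, p in zip(real, pred) if r == j]
        if qj:
            total += sum(abs(r - p) for r, p in qj) / len(qj)
    return total / 7

def score_rmse(real_scores, pred_scores):
    """RMSE between real and estimated (subscale or global) scores
    over the user set (Eqs. 11-15)."""
    n = len(real_scores)
    return math.sqrt(sum((r - p) ** 2
                         for r, p in zip(real_scores, pred_scores)) / n)

# Toy example with a 4-question questionnaire.
real = [0, 3, 6, 2]
pred = [0, 4, 6, 0]
print(mzoe(real, pred))  # 0.5  (2 of 4 answers wrong)
print(mae(real, pred))   # 0.75 ((0 + 1 + 0 + 2) / 4)
```

In the shared task each metric is computed per user and then averaged over the 28 test users, as stated above.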
Table 10: Task 3 results. Participating teams and runs with the corresponding scores for each metric.

team | run | MZOE | MAE | MAE_macro | GED | RS | ECS | SCS | WCS
NLPGroup-IISERB | 1 | 0.92 | 2.58 | 2.09 | 2.04 | 2.16 | 1.89 | 2.74 | 2.33
NLPGroup-IISERB | 2 | 0.92 | 2.18 | 1.76 | 1.74 | 2.00 | 1.73 | 2.03 | 1.92
NLPGroup-IISERB | 3 | 0.93 | 2.60 | 2.10 | 2.04 | 2.13 | 1.90 | 2.74 | 2.35
NLPGroup-IISERB | 4 | 0.81 | 3.36 | 2.96 | 3.68 | 3.69 | 3.18 | 4.28 | 3.82
RELAI | 1 | 0.82 | 3.31 | 2.91 | 3.59 | 3.65 | 3.05 | 4.19 | 3.74
RELAI | 2 | 0.82 | 3.30 | 2.89 | 3.56 | 3.65 | 3.03 | 4.17 | 3.71
RELAI | 3 | 0.83 | 3.15 | 2.70 | 3.26 | 3.04 | 2.72 | 4.04 | 3.61
RELAI | 4 | 0.82 | 3.32 | 2.91 | 3.59 | 3.66 | 3.05 | 4.19 | 3.74
RELAI | 5 | 0.82 | 3.19 | 2.74 | 3.34 | 3.15 | 2.80 | 4.08 | 3.64
SINAI | 1 | 0.85 | 2.65 | 2.29 | 2.63 | 3.29 | 2.35 | 2.98 | 2.40
SINAI | 2 | 0.87 | 2.60 | 2.23 | 2.42 | 3.01 | 2.21 | 2.85 | 2.31
SINAI | 3 | 0.86 | 2.62 | 2.22 | 2.54 | 3.15 | 2.32 | 2.93 | 2.36
all 0 | - | 0.81 | 3.36 | 2.96 | 3.68 | 3.69 | 3.18 | 4.28 | 3.82
all 6 | - | 0.67 | 2.64 | 3.04 | 3.25 | 3.52 | 3.72 | 2.81 | 3.28
average | - | 0.88 | 2.72 | 2.22 | 2.69 | 2.76 | 2.20 | 3.35 | 2.85

Table 11: Task 3. Teams, run IDs and run names.

team | run ID | name
NLPGroup-IISERB | 1 | IISERB_bert
NLPGroup-IISERB | 2 | IISERB_jl
NLPGroup-IISERB | 3 | IISERB_pretrained_bert
NLPGroup-IISERB | 4 | IISERB_sp
RELAI | 1 | RELAI_A-W2V-ED
RELAI | 2 | RELAI_B-Glove200-ED
RELAI | 3 | RELAI_B-Glove200-ED-AllFeatures
RELAI | 4 | RELAI_B-W2V-ED
RELAI | 5 | RELAI_B-W2V-ED-AllFeartures
SINAI | 1 | SINAI_0.4
SINAI | 2 | SINAI_0.35
SINAI | 3 | SINAI_0.375

4.2.1. Distribution of MAE and MZOE

Figure 3 plots, for each system, the distribution of MAE across the set of test users, while Figure 4 summarizes the MZOE values obtained by each system. Each box plot gives an idea of the distribution of the effectiveness metric over the available users.

Figure 3: MAE evaluation per user.

5. Participating Teams

Table 12 reports the participating teams and the runs they submitted for each eRisk task. The next paragraphs give a brief summary of the techniques implemented by each team. Further details are available in the CLEF 2022 Working Notes proceedings.
NLPGroup-IISERB [23]. The NLP-IISERB Lab participated in the three tasks proposed at eRisk this year. For Task 1 and Task 2, the team submitted five runs using different text mining frameworks in which AB, LR, RF and SVM classifiers were tested, as well as BERT, Bio-BERT, RoBERTa and Longformer models from the HuggingFace library, all of them with a variety of feature engineering techniques. The results achieved for Task 1 and Task 2 show competitive Precision, Recall and F1 figures. NLPGroup-IISERB run #4 achieves the best Precision score (1.0) among all 41 submissions for Task 1 of the eRisk 2022 challenge. The team observed from their empirical analysis that the classical bag-of-words (BOW) model performs better than all the deep learning models on the given data except the Longformer model. The Longformer model performed as well as the BOW model for Task 1, but the team could not explore its performance on Task 2 owing to time limitations. Regarding Task 3, NLP-IISERB obtained some of the top results in the competition, presenting four runs. NLPGroup-IISERB run #2, a combination of cosine similarity and a BERT model fine-tuned on the anorexia dataset from eRisk 2018 shared task 2, performed best among all the runs for Task 3 in terms of all the evaluation metrics except MZOE. The proposed models performed well in terms of the GED score, indicating that they reasonably identify eating disorders and their side effects.

Figure 4: MZOE evaluation per user.

RELAI [24]. Their working notes present the similarity-based approaches proposed by the RELAI (Université du Québec and McMaster University) team for Task 3 of eRisk 2022. The proposed methods rely on feature sets dedicated to each item in the questionnaire. The feature sets are compared to the written production of users based on pre-trained word vectors. The developed models try to measure the severity of the signs of ED in an unsupervised manner.
Thus, the philosophy of the approach is to dedicate a set of characteristics to each element of the questionnaire. They compare pre-trained vectors with those generated for each item and try to measure severity in an unsupervised way. Two similarity-based models (each with four variations) and 22 feature sets designed using expert knowledge were developed. Two kinds of pre-trained word vectors were used as word representations. The first is 300-dimensional word2vec vectors trained on publicly available textual content such as Wikipedia and the UMBC WebBase corpus using the Continuous Bag Of Words (CBOW) model. The second is GloVe word vectors trained on two billion Twitter posts (tweets).

BioInfo_UAVR [25]. The University of Aveiro team participated only in Tasks 1 and 2 this year. Their approach centred on finding the best feature engineering technique, that is, finding the most useful textual features. They tested bag-of-words with tf-idf, GloVe word embeddings, and a contextualized language model, each alone and combined with sentiment analysis. The results show that these techniques achieved very high Recall at the expense of very low Precision. In addition, the use of sentiment analysis did not improve performance.

Table 12: eRisk 2022 participants (number of runs per task; "-" means the team did not participate in that task).

team | Task 1 | Task 2 | Task 3
NLPGroup-IISERB | 5 | 5 | 4
RELAI | 5 | 5 | 5
BioInfo_UAVR | 5 | 5 | -
BLUE | 3 | 3 | -
UNSL | 5 | 5 | -
SINAI | 3 | - | 3
UNED-NLP | 5 | - | -
BioNLP-UniBuc | 5 | - | -
stezmo3 | 5 | - | -
CYUT | - | 5 | -
LauSAn | - | 5 | -
TUA1 | - | 5 | -
UNED-MED | - | 5 | -
Sunday-Rocker2 | - | 5 | -
E8-IJS | - | 5 | -
NITK-NLP2 | - | 5 | -
SCIR2 | - | 5 | -

BLUE [26]. The BLUE team is a joint collaboration between the University of Bucharest (Romania) and the Universitat Politècnica de València (Spain). This team participated in the two early detection challenges (T1 and T2) and employed a transformer-based architecture for user-level classification. More specifically, the interaction between users' posts was analysed, and some components were designed to mitigate noise in the dataset.
Within this process, the system learns to ignore uninformative posts. This team also made important efforts to facilitate interpretability.

UNSL [27]. The UNSL team includes researchers affiliated with the Universidad Nacional de San Luis (Argentina), the Consejo Nacional de Investigaciones Científicas y Técnicas de Argentina, the Instituto de Matemática Aplicada San Luis (Argentina), and the IDIAP Research Institute (Switzerland). The group participated in Tasks 1 and 2 on early risk detection. Their proposal builds upon their models from previous editions. In particular, they proposed two new stopping policies for their EarlyModel. The historic stop policy uses a rolling window over the model's last decision probabilities: if the probabilities in the window exceed a threshold, the model emits an alert. The learned decision tree stop policy, which the authors used only in Task 2, learned a decision tree over the depression dataset from 2018.

SINAI [28]. The SINAI team is affiliated with the Universidad de Jaén, Spain. They participated in Tasks 1 and 3. For Task 1, they devised an approach based on regression over RoBERTa sentence embeddings using different features: volumetry, lexical diversity, text complexity and emotions. For Task 3, they used the last 28 days of each user's history and computed embedding similarities between the EDE-Q questions and the user's writings. For the day-based questions, they counted the number of days with similarities above a threshold; for the scale-based questions, they defined similarity ranges.

UNED-NLP [29]. UNED-NLP is a joint collaboration between the Universidad Nacional de Educación a Distancia (UNED) and the Instituto Mixto de Investigación de la Escuela Nacional de Sanidad (IMIENS), Spain. The team participated in Task 1. The group explored the use of approximate nearest neighbours (ANN) over dense representations of the users' posts based on the Universal Sentence Encoder.
Post-level labels were re-computed from the user-level annotations in the training data.

BioNLP-UniBuc [30]. The team from the University of Bucharest participated in Task 1. After processing the provided XML files, the resulting training dataset had an unbalanced number of examples for the two classes, so they used stratified 5-fold cross-validation to keep the class distribution consistent across the train/validation splits. For feature extraction, the authors used bag-of-words and tf-idf models with additional processing for extracting relevant features, such as removing rare or overly frequent words and constructing 2-grams and 3-grams. The team submitted three runs combining classification methods with deep learning models. Among the deep learning models, the best outcome was obtained by a model with a linear projection of hidden dimension 128, an embedding size of 64 tokens, and four attention heads.

stezmo3 [31]. The ZHAW team is affiliated with the Zurich University of Applied Sciences. Their experiments focused on reproducing the solutions proposed by UNSL for eRisk 2021 and, additionally, incorporating some innovations related to the use of GloVe to support feature extraction. This team participated in the eRisk 2022 T1 task, on early identification of pathological gambling. All proposed variants are based on GloVe features extracted from user posts, and the classification of user posts was done with SVMs or XGBoost.

CYUT [32]. The CYUT team belongs to the Chaoyang University of Technology, Taiwan. They participated in the early risk detection of depression task (Task 2). Their runs exploit the massively pre-trained RoBERTa model to address the problem with out-of-the-box and improved representations. In their best run (#2), they use the output vector of the last hidden layer as the post representation for linear classification.
LauSAn [33]. The University of Zurich took part only in the early detection of depression and focused expressly on the time sensitivity of the task. Their approach is based on two simple strategies for adapting standard text classification models to early detection: concatenating the posts in different ways during training and inference (to optimise training), and continuously changing the decision threshold at inference time (to optimise the results on time-sensitive metrics). An ablation study confirmed that both strategies were effective. In fact, the team achieved some of the best results on all time-sensitive evaluation metrics.

TUA1 [34]. The University of Tokushima participated only in the early detection of depression. They proposed a novel and very interesting approach called TAM: Time-Aware Affective Memories. A TAM network maintains a memory of a user's "affective state", which gets updated as new user postings become available. All of this is fed to a Transformer decoder, which then predicts the user's risk of depression. A study of the latency penalty complements the approach, to see how it could be used to reduce the ERDE score. Their results show that the approach is very efficient and particularly good for very early-stage detection of depression.

UNED-MED [35]. The team from the Spanish National University of Distance Education (UNED) proposed two rather standard approaches. The first is based on feature-driven classifiers employing features extracted from textual data, such as tf-idf term weights, first-person pronoun use, sentiment analysis and depression terminology. The second is based on a deep learning classifier with pre-trained embeddings. The main innovation is to enlarge the training data (to make it more balanced) with data extracted from the same source as the eRisk data. Yet the results show only modest performance.
An explanation suggested by the team is that the depression detection task was more challenging this year. This might be possible, but it is something the organisers will need to crosscheck across all participants.

Sunday-Rocker2 [36]. The team from the University of Bucharest applied a variety of techniques to the T2 challenge, from tf-idf and linguistic features extracted from Reddit writings to the MixUp technique (an approach to sentence classification that augments the data through interpolation). The main interest lies in the novelty of the approaches and the way the features were selected, with some strongly linguistics-based features, such as self-preoccupation (based on the occurrence of first-person pronouns) and similar features. The best results were obtained with a voting classifier with hard voting applied to textual features, and with an SVM applied to both textual and numerical features.

E8-IJS [37]. This Slovenian team includes researchers from the Jožef Stefan Institute (Ljubljana) and the Faculty of Computer and Information Science. Their participation in eRisk 2022 focused on the task of early identification of depression cases. To that end, these participants utilized a classical supervised learning approach (Logistic Regression), and the main goal of their experiments was to compare different input representations for the logistic classifier. This included experiments with i) tf-idf representations reduced to a latent space via Latent Semantic Analysis and ii) Sentence-BERT representations of the input data. Their classification methods work at the post level.

NITK-NLP2 [38]. This team is affiliated with the National Institute of Technology Karnataka, Surathkal (India). These participants employed BERT-based and DeBERTa-based models to classify users' posts.
Their solutions included data augmentation methods to deal with imbalance, and the team focused on comparing the relative performance of the two transformer-based methods.

SCIR2. This team is affiliated with the Harbin Institute of Technology in Heilongjiang, China. Their experiments focused on the use of RoBERTa models, together with techniques aimed at reducing the number of user posts employed for analysis. This team participated in the early detection of depression task (T2).

6. Conclusions

The purpose of this paper was to present an overview of eRisk 2022. This sixth edition of the lab focused on two sorts of tasks. On the one hand, two tasks focused on the early identification of pathological gambling and depression (Tasks 1 and 2, respectively), in which participants were given sequential access to each user's social media posts and were required to issue alerts about at-risk individuals. On the other hand, one task (Task 3) addressed measuring the severity of the signs of eating disorders, in which participants were provided with the whole user history and their systems were required to automatically predict the user's replies to a standard eating disorder questionnaire. The proposed tasks received 115 runs from 17 different teams. Although the efficacy of the proposed solutions is still limited, the experimental results demonstrate that evidence extracted from social media is valuable, and that automatic or semi-automatic screening methods to discover at-risk individuals might be developed. These findings encourage us to investigate the establishment of text-based risk indicator screening benchmarks.

Acknowledgements

This work was supported by projects PLEC2021-007662 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Plan de Recuperación, Transformación y Resiliencia, Unión Europea-Next Generation EU) and RTI2018-093336-B-C21, RTI2018-093336-B-C22 (Ministerio de Ciencia e Innovación & ERDF).
The first and second authors thank the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G/01, GPC ED431B 2022/33) and the European Regional Development Fund for the financial support that acknowledges the CITIC Research Center in ICT of the University of A Coruña as a Research Center of the Galician University System. The third author also thanks the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G-2019/04, ED431C 2018/29) and the European Regional Development Fund for the financial support that acknowledges the CiTIUS Research Center in Intelligent Technologies of the University of Santiago de Compostela as a Research Center of the Galician University System.

References

[1] D. E. Losada, F. Crestani, J. Parapar, eRisk 2017: CLEF lab on early risk prediction on the internet: Experimental foundations, in: G. J. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2017, pp. 346–360. [2] D. E. Losada, F. Crestani, J. Parapar, eRisk 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental foundations, in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2017, Dublin, Ireland, 2017. [3] D. E. Losada, F. Crestani, A test collection for research on depression and language use, in: Proceedings Conference and Labs of the Evaluation Forum CLEF 2016, Evora, Portugal, 2016. [4] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk: Early Risk Prediction on the Internet, in: P. Bellot, C. Trabelsi, J. Mothe, F. Murtagh, J. Y. Nie, L. Soulier, E. SanJuan, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2018, pp. 343–361. [5] D. E. Losada, F. Crestani, J.
Parapar, Overview of eRisk 2018: Early Risk Prediction on the Internet (extended lab overview), in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2018, Avignon, France, 2018. [6] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2019: Early risk prediction on the Internet, in: F. Crestani, M. Braschler, J. Savoy, A. Rauber, H. Müller, D. E. Losada, G. Heinatz Bürki, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, 2019, pp. 340–357. [7] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk at CLEF 2019: Early risk prediction on the Internet (extended overview), in: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2019, Lugano, Switzerland, 2019. [8] D. E. Losada, F. Crestani, J. Parapar, Early detection of risks on the internet: An exploratory campaign, in: Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part II, 2019, pp. 259–266. [9] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2020: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22-25, 2020, Proceedings, 2020, pp. 272–287. [10] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk at CLEF 2020: Early risk prediction on the internet (extended overview), in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, 2020. [11] D. E. Losada, F. Crestani, J. Parapar, eRisk 2020: Self-harm and depression challenges, in: Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part II, 2020, pp. 557–563. [12] J. Parapar, P. Martín-Rodilla, D. E. Losada, F.
Crestani, Overview of eRisk 2021: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 12th International Conference of the CLEF Association, CLEF 2021, Virtual Event, September 21-24, 2021, Proceedings, 2021, pp. 324–344. [13] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk at CLEF 2021: Early risk prediction on the internet (extended overview), in: Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21-24, 2021, 2021, pp. 864–887. [14] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, eRisk 2021: Pathological gambling, self-harm and depression challenges, in: Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Proceedings, Part II, 2021, pp. 650–656. [15] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, eRisk 2022: Pathological gambling, depression, and eating disorder challenges, in: Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part II, 2022, pp. 436–442. [16] M. Abbott, The epidemiology and impact of gambling disorder and other gambling-related harm, in: WHO Forum on alcohol, drugs and addictive behaviours, Geneva, Switzerland, 2017. [17] D. Otero, J. Parapar, Á. Barreiro, Beaver: Efficiently building test collections for novel tasks, in: Proceedings of the First Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2020), Samatan, Gers, France, July 6-9, 2020, 2020. [18] D. Otero, J. Parapar, Á. Barreiro, The wisdom of the rankers: a cost-effective method for building pooled test collections without participant systems, in: SAC ’21: The 36th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, Republic of Korea, March 22-26, 2021, 2021, pp. 672–680. [19] M. Trotzek, S. Koitka, C.
Friedrich, Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences, IEEE Transactions on Knowledge and Data Engineering (2018). [20] F. Sadeque, D. Xu, S. Bethard, Measuring the latency of depression detection in social media, in: WSDM, ACM, 2018, pp. 495–503. [21] C. G. Fairburn, Z. Cooper, M. O’Connor, Eating disorder examination, Edition 17.0D, April 2014. [22] S. Baccianella, A. Esuli, F. Sebastiani, Evaluation measures for ordinal regression, in: Ninth International Conference on Intelligent Systems Design and Applications (ISDA 2009), 2009, pp. 283–287. doi:10.1109/ISDA.2009.230. [23] H. Srivastava, L. Ns, S. S, T. Basu, NLP-IISERB@eRisk2022: Exploring the potential of bag of words, document embeddings and transformer based framework for early prediction of eating disorder, depression and pathological gambling over social media, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [24] S. H. H. Saravani, D. Maupomé, F. Rancourt, T. Soulas, L. Normand, S. Besharati, S. M. Anaelle Normand, M.-J. Meurs, Measuring the severity of the signs of eating disorders using similarity-based models, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [25] R. Ferreira, A. Trifan, J. L. Oliveira, Early risk detection of mental illnesses using various types of textual features, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [26] A.-M. Bucur, A. Cosma, L. P. Dinu, P. Rosso, An end-to-end set transformer for user-level classification of depression and gambling disorder, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [27] J. M. Loyola, H. Thompson, S. Burdisso, M. Errecalde, Decision policies with history for early classification, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [28] A.
M. Mármol-Romero, S. M. Jiménez-Zafra, F. M. P. del Arco, M. D. Molina-González, M.-T. Martín-Valdivia, A. Montejo-Ráez, SINAI at eRisk@CLEF 2022, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [29] H. Fabregat, A. Duque, L. Araujo, J. Martinez-Romo, UNED-NLP at eRisk 2022: Analyzing gambling disorders in social media using approximate nearest neighbors, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [30] T.-A. Dumitrascu, A. M. Enescu, CLEF eRisk 2022: Detecting early signs of pathological gambling using ML and DL models with dataset chunking, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [31] S. Stalder, E. Zankov, ZHAW at eRisk 2022: Predicting signs of pathological gambling - GloVe for snowy days, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [32] S.-H. Wu, Z.-J. Qiu, CYUT at eRisk 2022: Early detection of depression based on concatenating representation of multiple hidden layers of RoBERTa model, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [33] A. Säuberli, S. Cho, L. Stahlhut, Lausan at eRisk 2022: Simply and effectively optimizing text classification for early detection, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [34] K. Xin, D. Rongyu, Y. Haitao, TUA1 at eRisk 2022: Exploring affective memories for early detection of depression, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [35] E. Campillo-Ageitos, J. Martinez-Romo, L.
Araujo, UNED-MED at eRisk 2022: Depression detection with tf-idf, linguistic features and embeddings, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [36] R.-A. Gînga, A.-A. Manea, B.-M. Dobre, Sunday Rockers at eRisk 2022: Early detection of depression, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [37] I. Tavchioski, B. Škrlj, S. Pollak, B. Koloski, Early detection of depression with linear models using hand-crafted and contextual features, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022. [38] S. Devaguptam, T. Kogatam, N. Kotian, A. K. M., Early detection of depression using BERT and DeBERTa, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5-8, 2022.