=Paper=
{{Paper
|id=Vol-2696/paper_253
|storemode=property
|title=Overview of eRisk at CLEF 2020: Early Risk Prediction on the Internet (Extended Overview)
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_253.pdf
|volume=Vol-2696
|authors=David E. Losada,Fabio Crestani,Javier Parapar
|dblpUrl=https://dblp.org/rec/conf/clef/LosadaCP20a
}}
==Overview of eRisk at CLEF 2020: Early Risk Prediction on the Internet (Extended Overview)==
David E. Losada (1), Fabio Crestani (2), and Javier Parapar (3)

(1) Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Spain. david.losada@usc.es
(2) Faculty of Informatics, Università della Svizzera italiana (USI), Switzerland. fabio.crestani@usi.ch
(3) Information Retrieval Lab, Centro de Investigación en Tecnologías de la Información y las Comunicaciones, Universidade da Coruña, Spain. javierparapar@udc.es

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. This paper provides an overview of eRisk 2020, the fourth edition of this lab under the CLEF conference. The main purpose of eRisk is to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. This edition of eRisk had two tasks. The first task focused on the early detection of signs of self-harm. The second task challenged the participants to automatically fill a depression questionnaire based on user interactions in social media.

1 Introduction

The main purpose of eRisk is to explore issues of evaluation methodologies, performance metrics and other aspects related to building test collections and defining challenges for early risk detection. Early detection technologies are potentially useful in different areas, particularly those related to safety and health. For example, early alerts could be sent when a person starts showing signs of a mental disorder, when a sexual predator starts interacting with a child, or when a potential offender starts publishing antisocial threats on the Internet.

Although the evaluation methodology (strategies to build new test collections, novel evaluation metrics, etc.) can be applied to multiple domains, eRisk has so far focused on psychological problems (essentially, depression, self-harm and eating disorders). In 2017 [3, 4], we ran an exploratory task on early detection of depression. This pilot task was based on the evaluation methodology and test collection presented in [2]. In 2018 [6, 5], we ran a continuation of the task on early detection of signs of depression together with a new task on early detection of signs of anorexia. In 2019 [7, 8], we had a continuation of the task on early detection of signs of anorexia, a new task on early detection of signs of self-harm, and a third task oriented to estimating a user's answers to a depression questionnaire based on his or her interactions on social media.

Over these years, we have been able to compare a number of solutions that employ multiple technologies and models (e.g. Natural Language Processing, Machine Learning, or Information Retrieval). We learned that the interaction between psychological problems and language use is challenging and, in general, the effectiveness of most contributing systems is modest. For example, most challenges had levels of performance (e.g. in terms of F1) below 70%. This suggests that this kind of early prediction task requires further research and that the solutions proposed so far still have much room for improvement.

In 2020, the lab had two campaign-style tasks. The first task had the same orientation as previous early detection tasks.
It focused on the early detection of signs of self-harm. The second task was a continuation of 2019's third task. It was oriented to analyzing a user's history of posts and extracting useful evidence for estimating the user's depression level. More specifically, the participants had to process the user's posts and, next, estimate the user's answers to a standard depression questionnaire. These tasks are described in the next sections of this overview paper.

2 Task 1: Early Detection of Signs of Self-Harm

This is the continuation of eRisk 2019's T2 task. The challenge consists of sequentially processing pieces of evidence and detecting early traces of self-harm as soon as possible. The task is mainly concerned with evaluating Text Mining solutions and, thus, it concentrates on texts written in Social Media. Texts had to be processed in the order they were posted. In this way, systems that effectively perform this task could be applied to sequentially monitor user interactions in blogs, social networks, or other types of online media.

The test collection for this task had the same format as the collection described in [2]. The source of data is also the same used in previous eRisk editions. It is a collection of writings (posts or comments) from a set of Social Media users. There are two categories of users, self-harm and non-self-harm, and, for each user, the collection contains a sequence of writings (in chronological order). In 2019, we moved from a chunk-based release of data (used in 2017 and 2018) to an item-by-item release of data. We set up a server that iteratively gave user writings to the participating teams. In 2020, the same server was used to provide the users' writings during the test stage. More information about the server can be found at the lab website (http://early.irlab.org/server.html).

Table 1. Task 1 (self-harm). Main statistics of the train and test collections

                                               Train                   Test
                                       Self-Harm   Control    Self-Harm   Control
  Num. subjects                               41       299          104       319
  Num. submissions (posts & comments)      6,927   163,506       11,691    91,136
  Avg num. of submissions per subject      169.0     546.8        112.4     285.6
  Avg num. of days from first to last      ≈ 495     ≈ 500        ≈ 270     ≈ 426
  Avg num. of words per submission          24.8      18.8         21.4      11.9

The 2020 task was organized into two different stages:

– Training stage. Initially, the teams that participated in this task had access to a training stage where we released the whole history of writings for a set of training users (we provided all writings of all training users), and we indicated which users had explicitly mentioned that they had self-harmed. The participants could therefore tune their systems with the training data. In 2020, the training data for Task 1 was composed of all of 2019's T2 users.

– Test stage. The test stage consisted of a period of time during which the participants had to connect to our server and iteratively get user writings and send responses. Each participant had the opportunity to stop and make an alert at any point of the user chronology. After reading each user post, the teams had to choose between: i) emitting an alert on the user, or ii) making no alert on the user. Alerts were considered final (i.e. further decisions about this individual were ignored), while no-alerts were considered non-final (i.e. the participants could later submit an alert for this user if they detected the appearance of risk signs). This choice had to be made for each user in the test split. A minimal simulation of this decision protocol is sketched below.
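The following sketch illustrates the sequential alert protocol described above, outside of the actual REST server: writings are revealed round by round, a system may emit an alert for a user at any round, and once emitted the alert is final. The function names and the toy keyword-based risk score are assumptions for illustration only, not the official eRisk client or a lab baseline.

```python
from typing import Callable, Dict, List, Tuple

def run_sequential_alerts(
    user_posts: Dict[str, List[str]],
    risk_score: Callable[[List[str]], float],
    threshold: float = 0.5,
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Simulate the Task 1 protocol: writings are revealed one round at a time,
    a system may alert at any round, and alerts are final."""
    decisions = {u: 0 for u in user_posts}          # 0 = no alert (non-final)
    alerted_at: Dict[str, int] = {}                 # k_u: writings seen when alerting
    max_rounds = max(len(p) for p in user_posts.values())

    for k in range(1, max_rounds + 1):              # round k releases each user's k-th writing
        for user, posts in user_posts.items():
            if decisions[user] == 1 or k > len(posts):
                continue                            # alert already emitted, or no more writings
            evidence = posts[:k]                    # only the writings seen so far
            if risk_score(evidence) >= threshold:   # hypothetical per-user risk estimator
                decisions[user] = 1
                alerted_at[user] = k
    return decisions, alerted_at

# Toy usage with a naive keyword score (an assumption, not an eRisk baseline)
posts = {"user_a": ["feeling fine", "I cut myself again last night"],
         "user_b": ["great weekend", "new bike day"]}
score = lambda texts: 1.0 if any("cut myself" in t for t in texts) else 0.0
print(run_sequential_alerts(posts, score))
```

The number of writings seen at alert time (the k_u recorded above) is precisely the quantity penalised by the delay-aware measures discussed in the next section.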
The systems were evaluated based on the accuracy of the decisions and the number of user writings required to take those decisions (see below).

A REST server was built to support the test stage. The server iteratively gave user writings to the participants and waited for their responses (no new user data was provided until the system said alert/no alert). This server was running from March 2nd, 2020 to May 24th, 2020 (in the initial configuration, the test period was shorter but, because of the COVID-19 situation, we decided to extend the test stage in order to facilitate participation).

Table 1 reports the main statistics of the train and test collections used for T1. Evaluation measures are discussed in the next section.

2.1 Decision-based Evaluation

This form of evaluation revolves around the (binary) decisions taken for each user by the participating systems. Besides standard classification measures (Precision, Recall and F1, computed with respect to the positive class), we computed ERDE, the early risk detection error used in the previous editions of the lab. A full description of ERDE can be found in [2]. Essentially, ERDE is an error measure that introduces a penalty for late correct alerts (true positives). The penalty grows with the delay in emitting the alert, and the delay is measured here as the number of user posts that had to be processed before making the alert.

Since 2019, we have complemented the evaluation report with additional decision-based metrics that try to capture other aspects of the problem. These metrics try to overcome some limitations of ERDE (an illustrative implementation of an ERDE-style measure is sketched after this list), namely:

– the penalty associated with true positives goes quickly to 1. This is due to the functional form of the cost function (a sigmoid).
– a perfect system, which detects the true positive case right after the first round of messages (first chunk), does not get an error equal to 0.
– with a method based on releasing data in a chunk-based way (as was done in 2017 and 2018), the contribution of each user to the performance evaluation has a large variance (different for users with few writings per chunk vs. users with many writings per chunk).
– ERDE is not interpretable.
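For concreteness, the snippet below sketches an ERDE-style per-user error along the lines described above: false positives and false negatives get fixed costs, and true positives are penalised by a sigmoid function of the delay. The functional form follows the usual formulation in [2], but the cost constants (c_fp, c_fn, c_tp) used here are placeholders rather than the lab's official configuration.

```python
import math

def latency_cost(k: int, o: int) -> float:
    """Sigmoid latency cost lc_o(k) = 1 - 1/(1 + e^(k - o)), written in a
    numerically stable way; it approaches 1 as the delay k grows past o."""
    return 1.0 / (1.0 + math.exp(o - k))

def erde(decision: int, truth: int, k: int, o: int,
         c_fp: float = 0.1, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """ERDE_o-style error for a single user.
    decision/truth are 1 (risk) or 0 (non-risk); k is the number of writings
    seen before deciding. Cost constants are illustrative placeholders."""
    if decision == 1 and truth == 0:      # false positive
        return c_fp
    if decision == 0 and truth == 1:      # false negative (missed case)
        return c_fn
    if decision == 1 and truth == 1:      # true positive, penalised by delay
        return latency_cost(k, o) * c_tp
    return 0.0                            # true negative

# The reported ERDE_5 and ERDE_50 scores average this per-user error over all
# test users, with o = 5 and o = 50 respectively.
```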
Some research teams have analysed these issues and proposed alternative ways of evaluation. Trotzek and colleagues [10] proposed ERDE_o%. This is a variant of ERDE that does not depend on the number of user writings seen before the alert but, instead, on the percentage of user writings seen before the alert. In this way, users' contributions to the evaluation are normalized (all users weigh the same). However, there is an important limitation of ERDE_o%. In real-life applications, the overall number of user writings is not known in advance. Social Media users post contents online and screening tools have to make predictions with the evidence seen so far. In practice, you do not know when (and if) a user's thread of messages is exhausted. Thus, the performance metric should not depend on such knowledge about the total number of user writings.

Another proposal of an alternative evaluation metric for early risk prediction was made by Sadeque and colleagues [9]. They proposed F_latency, which fits better with our purposes. This measure is described next.

Imagine a user u ∈ U and an early risk detection system that iteratively analyzes u's writings (e.g. in chronological order, as they appear in Social Media) and, after analyzing k_u user writings (k_u ≥ 1), takes a binary decision d_u ∈ {0, 1}, which represents the decision of the system about the user being a risk case. By g_u ∈ {0, 1}, we refer to the user's golden truth label. A key component of an early risk evaluation should be the delay in detecting true positives (we do not want systems to detect these cases too late). Therefore, a first and intuitive measure of delay can be defined as follows (note that Sadeque et al. [9, pg. 497] computed the latency over all users such that g_u = 1; we argue that latency should be computed only over the true positives, since the false negatives (g_u = 1, d_u = 0) are not detected by the system and, therefore, would not generate an alert):

$$\text{latency}_{TP} = \text{median}\{k_u : u \in U, d_u = g_u = 1\} \qquad (1)$$

This measure of latency goes over the true positives detected by the system and assesses the system's delay based on the median number of writings that the system had to process to detect such positive cases. This measure can be included in the experimental report together with standard measures such as Precision (P), Recall (R) and the F-measure (F):

$$P = \frac{|\{u \in U : d_u = g_u = 1\}|}{|\{u \in U : d_u = 1\}|} \qquad (2)$$

$$R = \frac{|\{u \in U : d_u = g_u = 1\}|}{|\{u \in U : g_u = 1\}|} \qquad (3)$$

$$F = \frac{2 \cdot P \cdot R}{P + R} \qquad (4)$$

Furthermore, Sadeque et al. proposed a measure, F_latency, which combines the effectiveness of the decision (estimated with the F measure) and the delay (again, we adopt Sadeque et al.'s proposal but estimate latency only over the true positives). This is based on multiplying F by a penalty factor derived from the median delay. More specifically, each individual (true positive) decision, taken after reading k_u writings, is assigned the following penalty:

$$\text{penalty}(k_u) = -1 + \frac{2}{1 + e^{-p \cdot (k_u - 1)}} \qquad (5)$$

where p is a parameter that determines how quickly the penalty should increase. In [9], p was set such that the penalty equals 0.5 at the median number of posts of a user (in the evaluation we set p to 0.0078, a setting obtained from the eRisk 2017 collection). Observe that a decision right after the first writing has no penalty (penalty(1) = 0). Figure 1 plots how the latency penalty increases with the number of observed writings. The system's overall speed factor is computed as:

$$\text{speed} = 1 - \text{median}\{\text{penalty}(k_u) : u \in U, d_u = g_u = 1\} \qquad (6)$$

speed equals 1 for a system whose true positives are detected right at the first writing. A slow system, which detects true positives after hundreds of writings, will be assigned a speed score near 0. Finally, the latency-weighted F score is simply:

$$F_{latency} = F \cdot \text{speed} \qquad (7)$$

Fig. 1. Latency penalty as a function of the number of observed writings (k_u).

Since 2019, users' data has been processed by the participants on a post-by-post basis (i.e. we avoided a chunk-based release of data). Under these conditions, the evaluation approach has the following properties:

– smooth growth of penalties;
– a perfect system gets F_latency = 1;
– for each user u the system can opt to stop at any point k_u and, therefore, there is no longer an effect of imbalanced importance across users;
– F_latency is more interpretable than ERDE.

A minimal implementation of these latency-aware measures is sketched below.
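The sketch below implements Equations (1)-(7) directly, using p = 0.0078 as stated above. Variable and function names are ours; the only inputs are, per user, the binary decision, the golden truth label, and the number of writings read before deciding.

```python
import math
from statistics import median
from typing import Dict

def latency_weighted_f(decisions: Dict[str, int],
                       truths: Dict[str, int],
                       n_seen: Dict[str, int],
                       p: float = 0.0078) -> Dict[str, float]:
    """Decision-based measures of Section 2.1 (Eqs. 1-7).
    decisions[u], truths[u] are in {0, 1}; n_seen[u] is k_u, the number of
    writings read before deciding. p = 0.0078 follows the lab setting."""
    users = list(truths)
    tp = [u for u in users if decisions[u] == 1 and truths[u] == 1]
    pred_pos = sum(decisions[u] == 1 for u in users)
    real_pos = sum(truths[u] == 1 for u in users)

    precision = len(tp) / pred_pos if pred_pos else 0.0                    # Eq. (2)
    recall = len(tp) / real_pos if real_pos else 0.0                       # Eq. (3)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)                                # Eq. (4)

    def penalty(k: int) -> float:                                          # Eq. (5)
        return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))

    latency_tp = median(n_seen[u] for u in tp) if tp else None             # Eq. (1)
    speed = (1.0 - median(penalty(n_seen[u]) for u in tp)) if tp else 0.0  # Eq. (6)

    return {"P": precision, "R": recall, "F1": f1,
            "latency_TP": latency_tp, "speed": speed,
            "F_latency": f1 * speed}                                       # Eq. (7)
```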
2.2 Ranking-based Evaluation

This section discusses an alternative form of evaluation, which was used as a complement to the evaluation described above. After each release of data (i.e. each new user writing), the participants had to send back the following information for each user in the collection: i) a decision for the user (alert/no alert), which was used to compute the decision-based metrics discussed above, and ii) a score that represents the user's level of risk (estimated from the evidence seen so far).

We used these scores to build a ranking of users in decreasing estimation of risk. For each participating system, we have one ranking at each point (i.e., a ranking after 1 writing, a ranking after 2 writings, etc.). This simulates a continuous re-ranking approach based on the evidence seen so far. In a real-life application, this ranking would be presented to an expert user who could take decisions (e.g. by inspecting the rankings). Each ranking can be scored with standard IR metrics, such as P@10 or NDCG. We therefore report the ranking-based performance of the systems after seeing k writings (with varying k). A sketch of this ranking-based scoring is given below.
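As an illustration, the snippet below scores one such ranking with P@k and NDCG@k, treating the true risk cases as the relevant items. Binary gains and a log2 discount are standard choices; the exact NDCG variant used by the lab is not spelled out in this overview, so treat these details as assumptions.

```python
import math
from typing import Dict, List

def precision_at_k(ranking: List[str], truths: Dict[str, int], k: int = 10) -> float:
    """Fraction of true risk cases among the top-k ranked users."""
    return sum(truths[u] for u in ranking[:k]) / k

def ndcg_at_k(ranking: List[str], truths: Dict[str, int], k: int = 10) -> float:
    """NDCG with binary gains (1 = risk case) and a log2 rank discount."""
    gains = [truths[u] for u in ranking[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(truths.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: rank users by the risk scores submitted after k writings (toy data)
scores = {"u1": 0.91, "u2": 0.15, "u3": 0.64}
truths = {"u1": 1, "u2": 0, "u3": 0}
ranking = sorted(scores, key=scores.get, reverse=True)   # decreasing estimated risk
print(precision_at_k(ranking, truths, k=2), ndcg_at_k(ranking, truths, k=2))
```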
2.3 Task 1: Results

Table 2 shows the participating teams, the number of runs submitted and the approximate lapse of time from the first response to the last response. This lapse of time is indicative of the degree of automation of each team's algorithms. A few of the submitted runs processed the entire thread of messages (nearly 2000), but many variants opted for stopping earlier. Six teams processed the thread of messages in a reasonably fast way (less than a day or so for processing the entire history of user messages). The rest of the teams took several days to run the whole process. Some teams took even more than a week. This suggests that they incorporated some form of offline processing.

Table 2. Task 1. Participating teams: number of runs, number of user writings processed by the team, and lapse of time taken for the whole process.

  team           #runs   #user writings processed   lapse of time (from 1st to last response)
  UNSL             5       1990                     10 hs
  INAOE-CIMAT      5       1989                     7 days + 7 hs
  BiTeM            5          1                     1 min
  EFE              3       1991                     12 hs
  NLP-UNED         5        554                     1 day
  BioInfo@UAVR     3        565                     2 days + 21 hs
  SSN NLP          5        222                     3 hs
  Anji             5       1990                     1 day + 3 hs
  Hildesheim       5        522                     72 days + 20 hs
  RELAI            5       1990                     2 days + 8 hs
  Prhlt-upv        5        627                     1 day + 8 hs
  iLab             5        954                     20 hs

Table 3 reports the decision-based performance achieved by the participating teams. In terms of Precision, F1, ERDE measures and latency-weighted F1, the best performing runs were submitted by the iLab team. The first two iLab runs had extremely high precision (.833 and .913, respectively) and the first one (run #0) had the highest latency-weighted F1 (.658). These runs had low levels of recall (.577 and .404) and they only analyzed a median of 10 user writings. This suggests that a reasonably high level of precision can be reached based on a few user writings. The main limitation of these best performing runs is the low level of recall achieved. In terms of ERDE, the best performing runs show low levels of error (.134 and .071). ERDE measures set a strong penalty on late decisions, and the two best runs show a good balance between the accuracy of the decisions and the delays (the latency of the true positives was 2 and 45, respectively, for the two runs that achieved the lowest ERDE_5 and ERDE_50). Other teams submitted high-recall runs but their precision was very low and, thus, these automatic methods are hardly usable to filter out non-risk cases. Most teams submitted quick decisions. Only iLab and prhlt-upv have some runs that analysed more than a hundred submissions before emitting the alerts (latencies of the true positives higher than 100).

Overall, these results suggest that, with a few dozen user writings, some systems led to reasonably high effectiveness. The best predictive algorithms could be used to support expert humans in the early detection of signs of self-harm.

Table 4 reports the ranking-based performance achieved by the participating teams. Some teams only processed a few dozen user writings and, thus, we could only compute their rankings of users for the initial points. Some teams (e.g., INAOE-CIMAT or BioInfo@UAVR) have the same levels of ranking-based effectiveness over multiple points (after 1 writing, after 100 writings, and so forth). This suggests that these teams did not change the risk scores estimated from the initial stages (or their algorithms were not able to enhance their estimations as more evidence was seen). Other participants (e.g., EFE, iLab or Hildesheim) behave as expected: the rankings of estimated risk get better as they are built from more user evidence. Notably, some iLab variants led to almost perfect P@10 and NDCG@10 performance after analyzing more than 100 writings. The NDCG@100 scores achieved by this team after 100 or 500 writings were also quite effective (above .81 for all variants). This suggests that, with enough pieces of evidence, the methods implemented by this team are highly effective at prioritizing at-risk users.

3 Task 2: Measuring the Severity of the Signs of Depression

This task is a continuation of 2019's T3 task. The task consists of estimating the level of depression from a thread of user submissions. For each user, the participants were given the user's full history of postings (in a single release of data) and the participants had to fill a standard depression questionnaire based on the evidence found in the history of postings. In 2020, the participants had the opportunity to use 2019's data as training data (filled questionnaires and SM submissions from the 2019 users, i.e. a training set composed of 20 users).

The questionnaires are derived from Beck's Depression Inventory (BDI) [1], which assesses the presence of feelings like sadness, pessimism, loss of energy, etc., for the detection of depression. The questionnaire contains 21 questions (see Figs. 2 and 3). The task aims at exploring the viability of automatically estimating the severity of the multiple symptoms associated with depression. Given the user's history of writings, the algorithms had to estimate the user's response to each individual question. We collected questionnaires filled by Social Media users together with their history of writings (we extracted each history of writings right after the user provided us with the filled questionnaire). The questionnaires filled by the users (ground truth) were used to assess the quality of the responses provided by the participating systems.

The participants were given a dataset with 70 users and they were asked to produce a file with the following structure:

  username1 answer1 answer2 ... answer21
  username2 ...
  ...

A minimal sketch of how such a file could be produced is given below.
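The following is a purely illustrative way to serialize Task 2 predictions in the one-line-per-user format shown above. Whitespace separation, the file name and the example usernames/answers are assumptions; the official submission guidelines govern the exact format.

```python
from typing import Dict, List

def write_task2_submission(path: str, answers: Dict[str, List[str]]) -> None:
    """Write one line per user: the username followed by its 21 BDI answers.
    Whitespace separation is assumed here; check the lab's guidelines."""
    with open(path, "w", encoding="utf-8") as out:
        for user, values in answers.items():
            assert len(values) == 21, f"{user}: expected 21 answers"
            out.write(user + " " + " ".join(values) + "\n")

# Example with a hypothetical user; answers use 1a/1b-style values for Q16 and Q18
write_task2_submission("task2_run.txt", {
    "subject001": ["0", "1", "0", "2", "0", "1", "0", "0", "0", "1",
                   "0", "1", "0", "0", "1", "1a", "0", "2b", "1", "0", "0"],
})
```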
Table 3. Task 1. Decision-based evaluation

  team name     run   P     R     F1    ERDE_5  ERDE_50  latency_TP  speed  latency-weighted F1
  Hildesheim     0   .248   1     .397   .292    .196       1         1       .397
  Hildesheim     1   .246   1     .395   .304    .185       5         .984    .389
  Hildesheim     2   .297   .740  .424   .237    .226       1         1       .424
  Hildesheim     3   .270   .942  .420   .400    .251      33.5       .874    .367
  Hildesheim     4   .256   .990  .406   .409    .210      12         .957    .389
  UNSL           0   .657   .423  .515   .191    .155       2         .996    .513
  UNSL           1   .618   .606  .612   .172    .124       2         .996    .609
  UNSL           2   .606   .548  .576   .267    .142      11         .961    .553
  UNSL           3   .598   .529  .561   .267    .149      12         .957    .537
  UNSL           4   .545   .519  .532   .271    .151      12         .957    .509
  EFE            0   .730   .519  .607   .257    .142      11         .961    .583
  EFE            1   .625   .625  .625   .268    .117      11         .961    .601
  EFE            2   .496   .615  .549   .283    .140      11         .961    .528
  iLab           0   .833   .577  .682   .252    .111      10         .965    .658
  iLab           1   .913   .404  .560   .248    .149      10         .965    .540
  iLab           2   .544   .654  .594   .134    .118       2         .996    .592
  iLab           3   .564   .885  .689   .287    .071      45         .830    .572
  iLab           4   .828   .692  .754   .255    .255     100         .632    .476
  prhlt-upv      0   .469   .654  .546   .291    .154      41         .845    .462
  prhlt-upv      1   .710   .212  .326   .251    .235     133         .526    .172
  prhlt-upv      2   .271   .577  .369   .339    .269      51.5       .806    .298
  prhlt-upv      3   .846   .212  .338   .248    .232     133         .526    .178
  prhlt-upv      4   .765   .375  .503   .253    .194      42         .841    .423
  INAOE-CIMAT    0   .488   .567  .524   .203    .145       4         .988    .518
  INAOE-CIMAT    1   .500   .548  .523   .193    .144       4         .988    .517
  INAOE-CIMAT    2   .848   .375  .520   .207    .160       5         .984    .512
  INAOE-CIMAT    3   .525   .702  .601   .174    .119       3         .992    .596
  INAOE-CIMAT    4   .788   .394  .526   .198    .160       4         .988    .519
  BioInfo@UAVR   0   .609   .375  .464   .260    .178      14         .949    .441
  BioInfo@UAVR   1   .591   .654  .621   .273    .120      11         .961    .597
  BioInfo@UAVR   2   .629   .375  .470   .259    .177      13         .953    .448
  RELAI          0   .341   .865  .489   .188    .136       2         .996    .487
  RELAI          1   .350   .885  .501   .190    .130       2         .996    .499
  RELAI          2   .438   .740  .550   .245    .132       8         .973    .535
  RELAI          3   .291   .894  .439   .306    .168       7         .977    .428
  RELAI          4   .381   .846  .525   .260    .141       7         .977    .513
  SSN NLP        0   .264   1     .419   .206    .170       1        1.0      .419
  SSN NLP        1   .283   1     .442   .205    .158       1        1.0      .442
  SSN NLP        2   .287   .990  .445   .228    .159       2         .996    .443
  SSN NLP        3   .688   .423  .524   .233    .171      15.5       .944    .494
  SSN NLP        4   .287   .952  .441   .263    .214       4         .988    .436
  BiTeM          0   .333   .01   .02    .245    .245       1        1.0      .019
  BiTeM          1   0      0     0
  BiTeM          2   0      0     0
  BiTeM          3   0      0     0
  BiTeM          4   0      0     0
  NLP-UNED       0   .237   .913  .376   .423    .199      11         .961    .362
  NLP-UNED       1   .246   1     .395   .210    .185       1        1.0      .395
  NLP-UNED       2   .246   1     .395   .210    .185       1        1.0      .395
  NLP-UNED       3   .246   1     .395   .210    .185       1        1.0      .395
  NLP-UNED       4   .246   1     .395   .210    .185       1        1.0      .395
  Anji           0   .266   1     .420   .205    .167       1        1.0      .420
  Anji           1   .266   1     .420   .211    .167       1        1.0      .420
  Anji           2   .269   1     .424   .213    .164       1        1.0      .424
  Anji           3   .333   .038  .069   .248    .243       7         .977    .067
  Anji           4   .258   .990  .410   .208    .174       1        1.0      .410

Table 4. Task 1. 
Ranking-based evaluation 1 writing 100 writings 500 writings 1000 writings team run P N DCG N DCG P N DCG N DCG P N DCG N DCG P N DCG N DCG @10 @10 @100 @10 @10 @100 @10 @10 @100 @10 @10 @100 Hildesheim 0 .1 .10 .26 .4 .43 .42 .5 .53 .42 Hildesheim 1 .4 .44 .30 .5 .48 .49 .5 .54 .57 Hildesheim 2 .2 .15 .24 1 1 .69 1 1 .68 Hildesheim 3 .2 .14 .20 .1 .07 .13 .1 .06 .11 Hildesheim 4 .2 .16 .18 1 1 .62 1 1 .69 UNSL 0 .9 .92 .47 1 1 .60 1 1 .60 1 1 .60 UNSL 1 .8 .87 .55 1 1 .76 1 1 .75 1 1 .75 UNSL 2 .7 .80 .42 .8 .84 .70 .8 .87 .74 .9 .94 .73 UNSL 3 .7 .79 .43 .8 .84 .70 .8 .87 .74 .9 .94 .73 UNSL 4 .5 .63 .36 .8 .86 .62 .8 .86 .62 .8 .86 .62 EFE 0 .7 .65 .59 1 1 .78 1 1 .79 1 1 .79 EFE 1 .6 .54 .58 1 1 .78 1 1 .80 1 1 .80 EFE 2 .6 .64 .55 .9 .92 .71 .9 .92 .73 .9 .92 .72 iLab 0 .8 .88 .63 1 1 .82 1 1 .83 iLab 1 .7 .69 .60 1 1 .82 .9 .94 .81 iLab 2 .7 .69 .60 1 1 .82 .9 .94 .81 iLab 3 .9 .94 .66 1 1 .83 1 1 .84 iLab 4 .8 .88 .63 1 1 .82 1 1 .83 prhlt-upv 0 .2 .13 .30 .9 .93 .68 1 1 .68 prhlt-upv 1 .9 .90 .63 .9 .92 .70 .9 .81 .75 prhlt-upv 2 .5 .41 .42 .6 .69 .48 .6 .69 .48 prhlt-upv 3 .9 .90 .63 .9 .92 .70 .9 .81 .75 prhlt-upv 4 .8 .75 .49 1 1 .70 .9 .90 .69 INAOE-CIMAT 0 .3 .25 .30 .3 .26 .24 .3 .26 .24 .3 .26 .24 INAOE-CIMAT 1 .3 .25 .30 .3 .26 .24 .3 .26 .24 .3 .26 .24 INAOE-CIMAT 2 .3 .25 .30 .3 .26 .24 .3 .26 .24 .3 .26 .24 INAOE-CIMAT 3 .3 .25 .30 .3 .26 .24 .3 .26 .24 .3 .26 .24 INAOE-CIMAT 4 .3 .25 .30 .3 .26 .24 .3 .26 .24 .3 .26 .24 BioInfo@UAVR 0 .6 .62 .33 .6 .62 .31 .6 .62 .31 BioInfo@UAVR 1 .6 .62 .33 0 0 .07 0 0 .04 BioInfo@UAVR 2 .6 .62 .33 .6 .62 .31 .6 .62 .31 RELAI 0 .7 .80 .52 .8 .87 .52 .8 .87 .52 .8 .87 .50 RELAI 1 .3 .28 .43 .6 .69 .47 .6 .69 .47 .7 .75 .47 RELAI 2 .2 .20 .27 .7 .81 .63 .8 .87 .70 .8 .87 .72 RELAI 3 .2 .20 .27 .9 .94 .51 1 1 .59 1 1 .60 RELAI 4 .2 .20 .27 .7 .68 .59 1 1 .71 .9 .81 .66 SSN NLP 0 .7 .68 .50 .5 .38 .43 SSN NLP 1 .7 .68 .50 .5 .38 .43 SSN NLP 2 .7 .68 .50 .5 .38 .43 SSN NLP 3 0 0 .22 .1 .12 .16 SSN NLP 4 .7 .68 .50 .5 .38 .43 BiTeM 0 BiTeM 1 BiTeM 2 BiTeM 3 BiTeM 4 NLP-UNED 0 .7 .69 .49 .6 .73 .26 .6 .73 .24 NLP-UNED 1 .6 .62 .27 .2 .27 .18 .2 .27 .16 NLP-UNED 2 .6 .62 .27 .2 .27 .18 .2 .27 .16 NLP-UNED 3 .6 .62 .27 .2 .27 .18 .2 .27 .16 NLP-UNED 4 .6 .62 .27 .2 .27 .18 .2 .27 .16 Anji 0 .7 .73 .58 .6 .57 .46 .4 .32 .36 .4 .32 .36 Anji 1 .9 .81 .54 .8 .62 .69 .8 .62 .70 .8 .62 .69 Anji 2 .8 .88 .51 .7 .76 .58 .5 .34 .47 .6 .48 .50 Anji 3 .3 .25 .31 .3 .28 .27 .3 .26 .27 .3 .26 .27 Anji 4 .3 .22 .25 .6 .44 .59 .6 .44 .61 .6 .44 .60 Instructions: This questionnaire consists of 21 groups of statements. Please read each group of statements carefully, and then pick out the one statement in each group that best describes the way you feel. If several statements in the group seem to apply equally well, choose the highest number for that group. 1. Sadness 0. I do not feel sad. 1. I feel sad much of the time. 2. I am sad all the time. 3. I am so sad or unhappy that I can’t stand it. 2. Pessimism 0. I am not discouraged about my future. 1. I feel more discouraged about my future than I used to be. 2. I do not expect things to work out for me. 3. I feel my future is hopeless and will only get worse. 3. Past Failure 0. I do not feel like a failure. 1. I have failed more than I should have. 2. As I look back, I see a lot of failures. 3. I feel I am a total failure as a person. 4. Loss of Pleasure 0. I get as much pleasure as I ever did from the things I enjoy. 1. I don’t enjoy things as much as I used to. 2. 
I get very little pleasure from the things I used to enjoy. 3. I can’t get any pleasure from the things I used to enjoy. 5. Guilty Feelings 0. I don’t feel particularly guilty. 1. I feel guilty over many things I have done or should have done. 2. I feel quite guilty most of the time. 3. I feel guilty all of the time. 6. Punishment Feelings 0. I don’t feel I am being punished. 1. I feel I may be punished. 2. I expect to be punished. 3. I feel I am being punished. 7. Self-Dislike 0. I feel the same about myself as ever. 1. I have lost confidence in myself. 2. I am disappointed in myself. 3. I dislike myself. 8. Self-Criticalness 0. I don’t criticize or blame myself more than usual. 1. I am more critical of myself than I used to be. 2. I criticize myself for all of my faults. 3. I blame myself for everything bad that happens. 9. Suicidal Thoughts or Wishes 0. I don’t have any thoughts of killing myself. 1. I have thoughts of killing myself, but I would not carry them out. 2. I would like to kill myself. 3. I would kill myself if I had the chance. 10. Crying 0. I don’t cry anymore than I used to. 1. I cry more than I used to. 2. I cry over every little thing. 3. I feel like crying, but I can’t. 11. Agitation 0. I am no more restless or wound up than usual. 1. I feel more restless or wound up than usual. 2. I am so restless or agitated that it’s hard to stay still. 3. I am so restless or agitated that I have to keep moving or doing something. 12. Loss of Interest 0. I have not lost interest in other people or activities. 1. I am less interested in other people or things than before. 2. I have lost most of my interest in other people or things. 3. It’s hard to get interested in anything. 13. Indecisiveness 0. I make decisions about as well as ever. 1. I find it more difficult to make decisions than usual. 2. I have much greater difficulty in making decisions than I used to. 3. I have trouble making any decisions. 14. Worthlessness 0. I do not feel I am worthless. 1. I don’t consider myself as worthwhile and useful as I used to. 2. I feel more worthless as compared to other people. 3. I feel utterly worthless. 15. Loss of Energy 0. I have as much energy as ever. 1. I have less energy than I used to have. 2. I don’t have enough energy to do very much. 3. I don’t have enough energy to do anything. Fig. 2. Beck’s Depression Inventory (part 1) 16. Changes in Sleeping Pattern 0. I have not experienced any change in my sleeping pattern. la. I sleep somewhat more than usual. lb. I sleep somewhat less than usual. 2a. I sleep a lot more than usual. 2b. I sleep a Iot less than usual. 3a. I sleep most of the day. 3b. I wake up 1-2 hours early and can’t get back to sleep. 17. Irritability 0. I am no more irritable than usual. 1. I am more irritable than usual. 2. I am much more irritable than usual. 3. I am irritable all the time. 18. Changes in Appetite 0. I have not experienced any change in my appetite. la. My appetite is somewhat less than usual. lb. My appetite is somewhat greater than usual. 2a. My appetite is much less than before. 2b. My appetite is much greater than usual. 3a. I have no appetite at all. 3b. I crave food all the time. 19. Concentration Difficulty 0. I can concentrate as well as ever. 1. I can’t concentrate as well as usual. 2. It’s hard to keep my mind on anything for very long. 3. I find I can’t concentrate on anything. 20. Tiredness or Fatigue 0. I am no more tired or fatigued than usual. 1. I get more tired or fatigued more easily than usual. 2. 
I am too tired or fatigued to do a lot of the things I used to do. 3. I am too tired or fatigued to do most of the things I used to do. 21. Loss of Interest in Sex 0. I have not noticed any recent change in my interest in sex. 1. I am less interested in sex than I used to be. 2. I am much less interested in sex now. 3. I have lost interest in sex completely Fig. 3. Beck’s Depression Inventory (part 2) Each line has a user identifier and 21 values. These values correspond to the responses to the questions of the depression questionnaire (the possible values are 0, 1a, 1b, 2a, 2b, 3a, 3b -for questions 16 and 18- and 0, 1, 2, 3 -for the rest of the questions-). 3.1 Task 2: Evaluation Metrics For consistency purposes, we employed the same evaluation metrics utilised in 2019. These metrics assess the quality of a questionnaire filled by a system in comparison with the real questionnaire filled by the actual Social Media user: – Average Hit Rate (AHR): Hit Rate (HR) averaged over all users. HR is a stringent measure that computes the ratio of cases where the automatic questionnaire has exactly the same answer as the real questionnaire. For example, an automatic questionnaire with 5 matches gets HR equal to 5/21 (because there are 21 questions in the form). – Average Closeness Rate (ACR): Closeness Rate (CR) averaged over all users. CR takes into account that the answers of the depression questionnaire represent an ordinal scale. For example, consider the #17 question: 17. Irritability 0. I am no more irritable than usual. 1. I am more irritable than usual. 2. I am much more irritable than usual. 3. I am irritable all the time. Imagine that the real user answered ”0”. A system S1 whose answer is ”3” should be penalised more than a system S2 whose answer is ”1”. For each question, CR computes the absolute difference (ad) between the real and the automated answer (e.g. ad=3 and ad=1 for S1 and S2, respectively) and, next, this absolute difference is transformed into an effectiveness score as follows: CR = (mad − ad)/mad, where mad is the maximum absolute difference, which is equal to the number of possible answers minus one. NOTE: in the two questions (#16 and #18) that have seven possible answers {0, 1a, 1b, 2a, 2b, 3a , 3b} the pairs (1a, 1b), (2a, 2b), (3a, 3b) are considered equivalent because they reflect the same depression level. As a consequence, the difference between 3b and 0 is equal to 3 (and the difference between 1a and 1b is equal to 0). – Average DODL (ADODL): Difference between overall depression levels (DODL) averaged over all users. The previous measures assess the systems’ ability to answer each question in the form. DODL, instead, does not look at question-level hits or differences but computes the overall depression level (sum of all the answers) for the real and automated questionnaire and, next, the absolute difference (ad overall) between the real and the automated score is computed. Depression levels are integers between 0 and 63 and, thus, DODL is nor- malised into [0,1] as follows: DODL = (63 − ad overall)/63. – Depression Category Hit Rate (DCHR). 
In the psychological domain, it is customary to associate depression levels with the following categories:

– minimal depression (depression levels 0-9)
– mild depression (depression levels 10-18)
– moderate depression (depression levels 19-29)
– severe depression (depression levels 30-63)

The last effectiveness measure consists of computing the fraction of cases where the automated questionnaire led to a depression category that is equivalent to the depression category obtained from the real questionnaire.

3.2 Task 2: Results

Table 5 presents the results achieved by the participants in this task. To put things in perspective, the table also reports (lower block) the performance achieved by three baseline variants: all 0s and all 1s, which consist of sending the same response (0 or 1) for all the questions, and random, which is the average performance (averaged over 1000 repetitions) achieved by an algorithm that randomly chooses among the possible answers.

Table 5. Task 2. Performance Results

  Run                                  AHR      ACR      ADODL    DCHR
  BioInfo@UAVR                         38.30%   69.21%   76.01%   30.00%
  iLab run1                            36.73%   68.68%   81.07%   27.14%
  iLab run2                            37.07%   69.41%   81.70%   27.14%
  iLab run3                            35.99%   69.14%   82.93%   34.29%
  prhlt logreg features                34.01%   67.07%   80.05%   35.71%
  prhlt svm use                        36.94%   69.02%   81.72%   31.43%
  prhlt svm features                   34.56%   67.44%   80.63%   35.71%
  svm features                         34.56%   67.44%   80.63%   35.71%
  relai context paral user             36.80%   68.37%   80.84%   22.86%
  relai context sim answer             21.16%   55.40%   73.76%   27.14%
  relai lda answer                     28.50%   60.79%   79.07%   30.00%
  relai lda user                       36.39%   68.32%   83.15%   34.29%
  relai sylo user                      37.28%   68.37%   80.70%   20.00%
  Run1 resultat CNN Methode max        34.97%   67.19%   76.85%   25.71%
  Run2 resultat CNN Methode suite      32.79%   66.08%   76.33%   17.14%
  Run3 resultat BILSTM Methode max     34.01%   67.78%   79.30%   22.86%
  Run4 resultat BILSTM Methode suit    33.54%   67.26%   78.91%   20.00%
  all 0s                               36.26%   64.22%   64.22%   14.29%
  all 1s                               29.18%   73.38%   81.95%   25.71%
  random (avg 1000 repetitions)        23.94%   58.44%   75.22%   26.53%

Although the teams could use training data from 2019 (while 2019's participants had no training data), the performance scores tend to be lower than 2019's performance scores (only ADODL had higher performance). This could be due to various reasons, including the intrinsic difficulty of the task and the lack of discussion of psychological concerns on Social Media by the 2020 users.

In terms of AHR, the best performing run (BioInfo@UAVR) only got 38.30% of the answers right. The scores of the distance-based measure (ACR) are below 70%. Most of the questions have four possible answers and, thus, a random algorithm would get an AHR near 25% (actually, slightly less than 25%, because a couple of questions have more than four possible answers). This suggests that the analysis of the user posts was useful for extracting some signals or symptoms related to depression. However, ADODL and, particularly, DCHR show that the participants, although effective at answering some depression-related questions, do not fare well at estimating the overall level of depression of the individuals. For example, the best performing run gets the depression category right for only 35.71% of the individuals.

Overall, these experiments indicate that we are still far from a really effective depression screening tool. In the near future, it will be interesting to further analyze the participants' estimations in order to investigate which particular BDI questions are easier or harder to automatically answer based on Social Media activity. An illustrative implementation of the Task 2 evaluation measures (AHR, ACR, ADODL and DCHR) is sketched below.
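The sketch below computes the per-user ingredients of the four measures defined in Section 3.1. It assumes that, for questions 16 and 18, an exact Hit requires the exact answer string, while Closeness and the overall depression level use the numeric part of the answer (so 1a and 1b count as the same level), which is how the equivalence note above reads; the function and variable names are ours.

```python
from typing import Dict, List

CATEGORIES = [(0, 9), (10, 18), (19, 29), (30, 63)]   # minimal, mild, moderate, severe

def level(answer: str) -> int:
    """Numeric depression level of an answer; '1a'/'1b' both count as 1 (Q16, Q18)."""
    return int(answer[0])

def category(total: int) -> int:
    """Index of the depression category for an overall level in 0..63."""
    return next(i for i, (lo, hi) in enumerate(CATEGORIES) if lo <= total <= hi)

def evaluate_user(pred: List[str], real: List[str]) -> Dict[str, float]:
    """Per-user HR, CR, DODL and category hit for two 21-answer questionnaires."""
    assert len(pred) == len(real) == 21
    hits, closeness = 0, 0.0
    for p, r in zip(pred, real):
        mad = 3                                   # max absolute difference per question
        ad = abs(level(p) - level(r))             # 1a/1b, 2a/2b, 3a/3b collapse to one level
        hits += (p == r)                          # exact match for the Hit Rate
        closeness += (mad - ad) / mad             # Closeness Rate contribution
    pred_total, real_total = sum(map(level, pred)), sum(map(level, real))
    return {
        "HR": hits / 21,
        "CR": closeness / 21,
        "DODL": (63 - abs(pred_total - real_total)) / 63,
        "category_hit": float(category(pred_total) == category(real_total)),
    }

# AHR, ACR, ADODL and DCHR are these per-user values averaged over all test users.
```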
4 Participating Teams Table 6 reports the participating teams and the runs that they submitted for each eRisk task. The next paragraphs give a brief summary on the techniques implemented by each of them. Further details are available at the CLEF 2020 working notes proceedings. Table 6. eRisk 2020. Participants T1 T2 team #runs #runs UNSL 5 INAOE-CIMAT 5 BiTeM 5 EFE 3 NLP-UNED 5 BioInfo@UAVR 3 1 SSN NLP 5 Anji 5 Hildesheim 5 RELAI 5 5 Prhlt-upv 5 4 iLab 5 3 USDB 4 EFE. This is a team from Dept. of Computer Engineering, Ferdowsi Univer- sity of Mashhad (Iran). They implemented three variants for Task 1 that repre- sent the texts using Word2Vec representations and performed experiments us- ing Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) models, and Support Vector Machines (SVMs). The entire system is an ensemble multi-level method based on SVM, CNN, and LSTM, which are fine-tuned by attention layers. USDB. This is a joint collaboration between the LRDSI Laboratory (BL- IDA 1 University, Algeria) and the Information Assurance and Security Research Group (Faculty of Computing, Universiti Teknologi Malaysia, Malaysia). This team participated in Task 2 and transformed the user’s texts into distributed representations following the Skip-gram model. Next, sentences are encoded us- ing a CNN or Bi-LSTM model (or with Recurrent Neural Networks and Long Bi-LSTM). For each user post, the models generate 21 outputs, which are the answers to the BDI questions. Finally, the user’s overall questionnaire is obtained by selecting, for each BDI question, the most frequent answer. NLP-UNED. This is a joint effort by the NLP & IR Group, at Universidad Nacional de Educación a Distancia (UNED), Spain and the Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS), Spain. These researchers, which participated in Task 1, implemented a machine learning approach using textual features and a SVM classifier. In order to extract relevant features, this team followed a sliding window approach that handles the last messages pub- lished by any given user. The features considered a wide range of variables, such as title length, words in the title, punctuation, emoticons, and other feature sets obtained from sentiment analysis, first person pronouns, and NSSI words. Hildesheim. This team, from the Institute for Information Science and Nat- ural Language Processing (University of Hildesheim, Germany), implemented four variants that apply different methods for Task 1 and a fifth ensemble sys- tem that combines the four variants. The four methods utilize different types of features, such as time intervals between posts, and the sentiment and seman- tics of the writings. To this aim, a neural network approach using bag-of-words vectors and contextualized word embeddings was employed. BioInfo@UAVR. This team comes from the Bioinformatics group of the In- stitute of Electronics and Engineering Informatics (University of Aveiro, Portu- gal). They participated in both tasks. Their approach built upon the algorithms proposed by them for eRisk 2019. For Task 1, they considered a bag of words approach with tf-idf features and employed linear Support Vector Machines with Stochastic Gradient Descent and Passive Aggressive classifiers. For Task 2, the method is based on training a machine learning model using an external dataset and, next, employing the learnt classifier against the eRisk 2020 data. 
These authors considered psycholinguistic and behavioural features in their attempt to associate the BDI responses with the user’s posts. INAOE-CIMAT. This is a joint effort by Instituto Nacional de Astrofı́sica, Óptica y Electrónica (INAOE), Mexico and Centro de Investigación en Matemá- ticas (CIMAT), Mexico. This team participated in the first task and proposed a so-called Bag of Sub-Emotions approach that represents the posts of the users using a set of sub-emotions. This representation, which was subsequently com- bined with bag of words representations, captures the emotions and topics that users with signs of self-harm tend to use. At test time, they experimented with five variants that estimate the temporal stability associated to the users’ posts. SSN-NLP. These participants come from the SSN College Of Engineer- ing, Chennai (India) and participated in Task 1. Given the training data, they experimented with five alternative classification approaches (using tf/idf repre- sentations and Bernoulli Naive Bayes, Gradient Boosting, Random Forest, or Extra Trees; or CNNs together with Word2Vec representations). iLab. This is a joint collaboration between Centro Singular de Investigación en Tecnoloxı́as Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Spain and the Department of Computer and Information Sciences, University of Strathclyde, UK. This team participated in both tasks. They used BERT- based classifiers which were trained specifically for each task. A variety of pre- trained models were tested against training data (including BERT, DistillBERT, RoBERTa and XLM-RoBERTa). Rather than using the task’s training data, these participants created four new training datasets from Reddit. The submitted runs for Task 1 were based on XLM-RoBERTa. For Task 2, they employed similar methods as the ones employed for Task 1, but they treated the problem as a multi-class labelling problem (one problem for each BDI question). Prhlt-upv. This team is composed of researchers from Universitat Poli- técnica de Valencia and from University of Bucharest. They employed multi- dimensional representations of language and deep learning models (including hierarchical architectures, pre-trained transformers and language models). This team participated in both tasks. For Task 1, they utilized content features, style features, LIWC features, emotions and sentiments. Different strategies were im- plemented to represent the users’ submissions (e.g., augmenting the data by sampling from the user’s history or computing a rolling average associated to the most recent estimations). For Task 2, they employed simpler learning models (SVMs and logistic regression) and some of the features extracted for Task 1. The problem was tackled as a multi-label multi-class problem where one model was trained for each BDI question. Relai. This team comes from University of Quebec in Montreal, Canada. These researchers participated in both tasks, and addressed them using topic modeling al- gorithms (LDA and Anchor Variant), neural models with three different architectures (Deep Averaging Networks, Contextualizers, and RNNs), and an approach based on writing styles. Some of the variants considered sty- lometry variables, such as Part-of-Speech, frequent n-grams, punctuation, length of words/sentences and usage of uppercase or hyperlinks. 5 Conclusions This paper provided an overview of eRisk 2020. 
This was the fourth edition of this lab and the lab’s activities concentrated on two different types of tasks: early detection of signs of self-harm (T1), where the participants had a sequential ac- cess to the user’s social media posts and they had to send alerts about at-risk individuals, and measuring the severity of the signs of depression (T2), where the participants were given the full user history and their systems had to auto- matically estimate the user’s responses to a standard depression questionnaire. Overall, the proposed tasks received 73 variants or runs from 12 teams. Al- though the effectiveness of the proposed solutions is still modest, the experiments suggest that evidence extracted from Social Media is valuable and automatic or semi-automatic screening tools could be designed to detect at-risk individuals. This promising result encourages us to further explore the creation of bench- marks for text-based screening of signs of risk. Acknowledgements We thank the support obtained from the Swiss National Science Foundation (SNSF) under the project “Early risk prediction on the Internet: an evaluation corpus”, 2015. This work was funded by FEDER/Ministerio de Ciencia, Innovación y Uni- versidades – Agencia Estatal de Investigación/ Project (RTI2018-093336-B-C21). This work has also received financial support from the Consellerı́a de Edu- cación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G- 2019/04, ED431C 2018/29) and the European Regional Development Fund, which acknowledges the CiTIUS-Research Center in Intelligent Technologies of the University of Santiago de Compostela as a Research Center of the Galician University System. References 1. Beck, A.T., Ward, C.H., Mendelson, M., Mock, J., Erbaugh, J.: An Inventory for Measuring Depression. JAMA Psychiatry 4(6), 561–571 (06 1961) 2. Losada, D.E., Crestani, F.: A test collection for research on depression and language use. In: Proceedings Conference and Labs of the Evaluation Forum CLEF 2016. Evora, Portugal (2016) 3. Losada, D.E., Crestani, F., Parapar, J.: eRisk 2017: CLEF lab on early risk pre- diction on the internet: Experimental foundations. In: Jones, G.J., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Ex- perimental IR Meets Multilinguality, Multimodality, and Interaction. pp. 346–360. Springer International Publishing, Cham (2017) 4. Losada, D.E., Crestani, F., Parapar, J.: eRisk 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental foundations. In: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2017. Dublin, Ireland (2017) 5. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2018: Early Risk Predic- tion on the Internet (extended lab overview). In: CEUR Proceedings of the Con- ference and Labs of the Evaluation Forum, CLEF 2018. Avignon, France (2018) 6. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk: Early Risk Predic- tion on the Internet. In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J.Y., Soulier, L., SanJuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. pp. 343–361. Springer In- ternational Publishing, Cham (2018) 7. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2019: Early risk predic- tion on the Internet. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D.E., Heinatz Bürki, G., Cappellato, L., Ferro, N. (eds.) 
Experimental IR Meets Multilinguality, Multimodality, and Interaction. pp. 340–357. Springer International Publishing (2019) 8. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk at CLEF 2019: Early risk prediction on the Internet (extended overview). In: CEUR Proceedings of the Conference and Labs of the Evaluation Forum, CLEF 2019. Lugano, Switzerland (2019) 9. Sadeque, F., Xu, D., Bethard, S.: Measuring the latency of depression detection in social media. In: WSDM. pp. 495–503. ACM (2018) 10. Trotzek, M., Koitka, S., Friedrich, C.: Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences. IEEE Transactions on Knowledge and Data Engineering (04 2018)