<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of eRisk at CLEF 2020: Early Risk Prediction on the Internet (Extended Overview)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David E. Losada</string-name>
          <email>david.losada@usc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Crestani</string-name>
          <email>fabio.crestani@usi.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier Parapar</string-name>
          <email>javierparapar@udc.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS)</institution>
          ,
          <addr-line>Universidade de Santiago de Compostela</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Faculty of Informatics, Università della Svizzera italiana (USI)</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>Information Retrieval Lab, Centro de Investigación en Tecnologías de la Información y las Comunicaciones, Universidade da Coruña</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper provides an overview of eRisk 2020, the fourth edition of this lab under the CLEF conference. The main purpose of eRisk is to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. This edition of eRisk had two tasks. The first task focused on early detection of signs of self-harm. The second task challenged the participants to automatically fill in a depression questionnaire based on user interactions in social media.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The main purpose of eRisk is to explore issues of evaluation methodologies,
performance metrics and other aspects related to building test collections and
defining challenges for early risk detection. Early detection technologies are
potentially useful in different areas, particularly those related to safety and health.
For example, early alerts could be sent when a person starts showing signs of a
mental disorder, when a sexual predator starts interacting with a child, or when
a potential offender starts publishing antisocial threats on the Internet.</p>
      <p>
Although the evaluation methodology (strategies to build new test
collections, novel evaluation metrics, etc.) can be applied to multiple domains, eRisk
has so far focused on psychological problems (essentially, depression, self-harm
and eating disorders). In 2017 [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], we ran an exploratory task on early
detection of depression. This pilot task was based on the evaluation methodology
and test collection presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In 2018 [
        <xref ref-type="bibr" rid="ref5 ref6">6, 5</xref>
        ], we ran a continuation of the
task on early detection of signs of depression together with a new task on early
detection of signs of anorexia. In 2019 [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
], we had a continuation of the task
on early detection of signs of anorexia, a new task on early detection of signs of
self-harm and a third task oriented to estimating a user's answers to a depression
questionnaire based on their interactions on social media.
      </p>
<p>Over these years, we have been able to compare a number of solutions that
employ multiple technologies and models (e.g. Natural Language Processing,
Machine Learning, or Information Retrieval). We learned that the interaction
between psychological problems and language use is challenging and, in general,
the effectiveness of most contributing systems is modest. For example, most
challenges had levels of performance (e.g. in terms of F1) below 70%. This suggests
that this kind of early prediction task requires further research and that the solutions
proposed so far still have much room for improvement.</p>
<p>In 2020, the lab had two campaign-style tasks. The first task had the same
orientation as previous early detection tasks. It focused on early detection of
signs of self-harm. The second task was a continuation of 2019's third task. It
was oriented to analyzing a user's history of posts and extracting useful evidence
for estimating the user's depression level. More specifically, the participants had
to process the user's posts and, next, estimate the user's answers to a standard
depression questionnaire. These tasks are described in the next sections of this
overview paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Task 1: Early Detection of Signs of Self-Harm</title>
<p>This is the continuation of eRisk 2019's T2 task. The challenge consists of
sequentially processing pieces of evidence and detecting early traces of self-harm as
soon as possible. The task is mainly concerned with evaluating Text Mining
solutions and, thus, it concentrates on texts written in Social Media. Texts had to
be processed in the order they were posted. In this way, systems that effectively
perform this task could be applied to sequentially monitor user interactions in
blogs, social networks, or other types of online media.</p>
      <p>
        The test collection for this task had the same format as the collection
described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The source of data is also the same used for previous eRisks. It
is a collection of writings (posts or comments) from a set of Social Media users.
There are two categories of users, self-harm and non-self-harm, and, for each
user, the collection contains a sequence of writings (in chronological order).
      </p>
<p>In 2019, we moved from a chunk-based release of data (used in 2017 and
2018) to an item-by-item release of data. We set up a server that iteratively gave
user writings to the participating teams. In 2020, the same server was used to
provide the users' writings during the test stage. More information about the
server can be found at the lab website[4].
[4] http://early.irlab.org/server.html</p>
<p>The 2020 task was organized into two different stages:
- Training stage. Initially, the teams that participated in this task had access
to a training stage where we released the whole history of writings for a
set of training users (we provided all writings of all training users), and we
indicated which users had explicitly mentioned that they had done self-harm.
The participants could therefore tune their systems with the training data.</p>
<p>In 2020, the training data for Task 1 was composed of all of 2019's T2 users.
- Test stage. The test stage consisted of a period of time during which the
participants had to connect to our server and iteratively get user writings and send
responses. Each participant had the opportunity to stop and make an alert
at any point of the user chronology. After reading each user post, the teams
had to choose between: i) emitting an alert on the user, or ii) making no alert
on the user. Alerts were considered final (i.e. further decisions about this
individual were ignored), while no alerts were considered non-final (i.e.
the participants could later submit an alert for this user if they detected the
appearance of risk signs). This choice had to be made for each user in the
test split. The systems were evaluated based on the accuracy of the decisions
and the number of user writings required to take the decisions (see below). A
REST server was built to support the test stage. The server iteratively gave
user writings to the participants and waited for their responses (no new user
data was provided until the system said alert/no alert). This server was running
from March 2nd, 2020 to May 24th, 2020[5].</p>
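<p>The alert/no-alert protocol described above can be sketched as a simple round-by-round simulation. This is an illustrative sketch only: the data structures and the toy keyword classifier are our own assumptions and do not model the actual eRisk REST API.</p>
```python
def run_test_stage(user_streams, classify):
    """Iteratively release one writing per user per round.
    An alert (1) is final; a 'no alert' (0) can be revised in later rounds."""
    decisions = {u: 0 for u in user_streams}
    history = {u: [] for u in user_streams}
    n_rounds = max(len(w) for w in user_streams.values())
    for r in range(n_rounds):
        for user, writings in user_streams.items():
            # once a team alerts on a user, further writings are ignored
            if decisions[user] == 1 or r >= len(writings):
                continue
            history[user].append(writings[r])
            if classify(history[user]):
                decisions[user] = 1
    return decisions

# Toy stand-in for a participant's model (illustrative only)
def risky(posts):
    return any("harm" in p for p in posts)
```
<p>A real participant would replace the toy classifier with its trained model; the point of the sketch is that the decision for each user is re-evaluated after every new writing until an alert is emitted.</p>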
<p>Table 1 reports the main statistics of the train and test collections used for
T1. Evaluation measures are discussed in the next section.</p>
      <sec id="sec-2-1">
        <title>Decision-based Evaluation</title>
        <p>
This form of evaluation revolves around the (binary) decisions taken for each user
by the participating systems. Besides standard classification measures (Precision,
Recall and F1[6]), we computed ERDE, the early risk detection error used in the
previous editions of the lab. A full description of ERDE can be found in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
[5] In the initial configuration, the test period was shorter but, because of the COVID-19
situation, we decided to extend the test stage in order to facilitate participation.
[6] Computed with respect to the positive class.
        </p>
        <p>Essentially, ERDE is an error measure that introduces a penalty for late correct
alerts (true positives). The penalty grows with the delay in emitting the alert,
and the delay is measured here as the number of user posts that had to be
processed before making the alert.</p>
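<p>As a reference, this behaviour can be sketched as follows. This is a minimal reading of the ERDE_o definition given in [2]; the cost values here are illustrative defaults (in eRisk, the false-positive cost is set to the proportion of positive users in the collection).</p>
```python
import math

def erde(decisions, truths, delays, o=5, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """Sketch of ERDE_o: mean per-user cost, where a true positive pays a
    sigmoid penalty that grows with the number of posts seen (the delay)."""
    def cost(d, g, k):
        if d == 1 and g == 0:
            return c_fp                                  # false positive
        if d == 0 and g == 1:
            return c_fn                                  # false negative (missed case)
        if d == 1 and g == 1:
            # late correct alerts approach the cost of a miss
            return (1.0 - 1.0 / (1.0 + math.exp(k - o))) * c_tp
        return 0.0                                       # true negative
    return sum(cost(d, g, k)
               for d, g, k in zip(decisions, truths, delays)) / len(decisions)
```
<p>The parameter o controls where the penalty starts to saturate, which is why the lab reports both ERDE5 and ERDE50.</p>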
<p>Since 2019, we have complemented the evaluation report with additional
decision-based metrics that try to capture additional aspects of the problem. These
metrics try to overcome some limitations of ERDE, namely:
- the penalty associated with true positives goes quickly to 1. This is due to the
functional form of the cost function (sigmoid);
- a perfect system, which detects the true positive case right after the first
round of messages (first chunk), does not get an error equal to 0;
- with a method based on releasing data in a chunk-based way (as was done
in 2017 and 2018), the contribution of each user to the performance evaluation
has a large variance (different for users with few writings per chunk vs. users
with many writings per chunk);
- ERDE is not interpretable.</p>
        <p>
Some research teams have analysed these issues and proposed alternative
ways of evaluation. Trotzek and colleagues [
          <xref ref-type="bibr" rid="ref10">10</xref>
] proposed ERDE_o%. This is a
variant of ERDE that does not depend on the number of user writings seen
before the alert but, instead, on the percentage of user writings seen
before the alert. In this way, users' contributions to the evaluation are normalized
(all users weigh the same). However, there is an important limitation
of ERDE_o%. In real-life applications, the overall number of user writings is not
known in advance. Social Media users post contents online and screening tools
have to make predictions with the evidence seen. In practice, you do not know
when (and if) a user's thread of messages is exhausted. Thus, the performance
metric should not depend on such lack of knowledge about the total number of
user writings.
        </p>
        <p>
          Another proposal of an alternative evaluation metric for early risk prediction
was done by Sadeque and colleagues [
          <xref ref-type="bibr" rid="ref9">9</xref>
]. They proposed F_latency, which fits better
with our purposes. This measure is described next.
        </p>
        <p>
Imagine a user u ∈ U and an early risk detection system that iteratively
analyzes u's writings (e.g. in chronological order, as they appear in Social Media)
and, after analyzing k_u user writings (k_u ≥ 1), takes a binary decision d_u ∈
{0, 1}, which represents the decision of the system about the user being a risk
case. By g_u ∈ {0, 1}, we refer to the user's golden truth label. A key component
of an early risk evaluation should be the delay in detecting true positives (we do
not want systems to detect these cases too late). Therefore, a first and intuitive
measure of delay can be defined as follows[7]:
[7] Observe that Sadeque et al. (see [
          <xref ref-type="bibr" rid="ref9">9</xref>
], pg. 497) computed the latency for all users such
that g_u = 1. We argue that latency should be computed only for the true positives.
The false negatives (g_u = 1, d_u = 0) are not detected by the system and, therefore,
they would not generate an alert.
        </p>
<p>latency_TP = median{k_u : u ∈ U, d_u = g_u = 1}    (1)</p>
        <p>This measure of latency goes over the true positives detected by the system
and assesses the system's delay based on the median number of writings that
the system had to process to detect such positive cases. This measure can be
included in the experimental report together with standard measures such as
Precision (P), Recall (R) and the F-measure (F):</p>
<p>P = |{u ∈ U : d_u = g_u = 1}| / |{u ∈ U : d_u = 1}|    (2)</p>
<p>R = |{u ∈ U : d_u = g_u = 1}| / |{u ∈ U : g_u = 1}|    (3)</p>
<p>F = 2 · P · R / (P + R)    (4)</p>
<p>[8] Again, we adopt Sadeque et al.'s proposal but we estimate latency only over the true
positives.
[9] In the evaluation we set p to 0.0078, a setting obtained from the eRisk 2017 collection.</p>
<p>Furthermore, Sadeque et al. proposed a measure, F_latency, which combines
the effectiveness of the decision (estimated with the F-measure) and the delay[8].
This is based on multiplying F by a penalty factor based on the median delay.
More specifically, each individual (true positive) decision, taken after reading k_u
writings, is assigned the following penalty:</p>
<p>
penalty(k_u) = -1 + 2 / (1 + exp(-p · (k_u - 1)))    (5)
where p is a parameter that determines how quickly the penalty should increase.
In [
          <xref ref-type="bibr" rid="ref9">9</xref>
], p was set such that the penalty equals 0.5 at the median number of posts
of a user[9]. Observe that a decision right after the first writing has no penalty
(penalty(1) = 0). Figure 1 plots how the latency penalty increases with the
number of observed writings.
        </p>
<p>The system's overall speed factor is computed as:
speed = 1 - median{penalty(k_u) : u ∈ U, d_u = g_u = 1}    (6)
speed equals 1 for a system whose true positives are detected right at the first
writing. A slow system, which detects true positives after hundreds of writings,
will be assigned a speed score near 0.</p>
<p>Finally, the latency-weighted F score is simply:
F_latency = F · speed    (7)
Since 2019, users' data has been processed by the participants on a post-by-post
basis (i.e. we avoided a chunk-based release of data). Under these conditions,
the evaluation approach has the following properties:
- smooth growth of penalties;
- a perfect system gets F_latency = 1;
- for each user u the system can opt to stop at any point k_u and, therefore,
we no longer have the effect of an imbalanced importance of users;
- F_latency is more interpretable than ERDE.</p>
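<p>The decision-based measures above (latency_TP, the per-decision penalty, speed and the latency-weighted F score) can be sketched as follows. This is a minimal implementation under the definitions in this section; the function and variable names are our own.</p>
```python
import math
from statistics import median

def penalty(k, p=0.0078):
    # penalty for a true positive emitted after k writings; penalty(1) == 0
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))

def latency_weighted_f1(decisions, truths, delays, p=0.0078):
    """decisions, truths: per-user d_u and g_u; delays: writings read (k_u).
    Returns (F_latency, latency_TP)."""
    tp_delays = [k for d, g, k in zip(decisions, truths, delays) if d == g == 1]
    if not tp_delays:
        return 0.0, None
    prec = len(tp_delays) / sum(decisions)                    # Precision
    rec = len(tp_delays) / sum(truths)                        # Recall
    f1 = 2 * prec * rec / (prec + rec)                        # F-measure
    speed = 1.0 - median(penalty(k, p) for k in tp_delays)    # speed factor
    return f1 * speed, median(tp_delays)
```
<p>With p = 0.0078 (the value used in the evaluation), a perfect system that alerts on every positive user right after the first writing gets speed = 1 and, hence, F_latency = F.</p>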
      </sec>
      <sec id="sec-2-2">
        <title>Ranking-based Evaluation</title>
        <p>This section discusses an alternative form of evaluation, which was used as a
complement of the evaluation described above. After each release of data (new
user writing) the participants had to send back the following information (for
each user in the collection): i) a decision for the user (alert/no alert), which was
used to compute the decision-based metrics discussed above, and ii) a score that
represents the user's level of risk (estimated from the evidence seen so far). We
used these scores to build a ranking of users in decreasing estimation of risk.
For each participating system, we have one ranking at each point (i.e., ranking
after 1 writing, ranking after 2 writings, etc.). This simulates a continuous
reranking approach based on the evidence seen so far. In a real life application,
this ranking would be presented to an expert user who could take decisions (e.g.
by inspecting the rankings).</p>
<p>Each ranking can be scored with standard IR metrics, such as P@10 or
NDCG. We therefore report the ranking-based performance of the systems after
seeing k writings (with varying k).</p>
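<p>These per-round rankings can be scored as follows. This is a minimal sketch (helper names are our own) of P@k and NDCG@k over the golden labels sorted by decreasing estimated risk.</p>
```python
import math

def precision_at_k(ranked_labels, k=10):
    """ranked_labels: golden labels (1 = at-risk user), ordered by the
    decreasing risk scores submitted after a given number of writings."""
    return sum(ranked_labels[:k]) / k

def ndcg_at_k(ranked_labels, k=10):
    # discounted cumulative gain of the top-k, normalized by the ideal ordering
    def dcg(labels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal else 0.0
```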
      </sec>
      <sec id="sec-2-3">
        <title>Task 1: Results</title>
<p>Table 6 shows the participating teams, the number of runs submitted and the
approximate lapse of time from the first response to the last response. This lapse
of time is indicative of the degree of automation of each team's algorithms. A
few of the submitted runs processed the entire thread of messages (nearly 2000),
but many variants opted for stopping earlier. Six teams processed the thread
of messages in a reasonably fast way (less than a day or so for processing the
entire history of user messages). The rest of the teams took several days to run
the whole process. Some teams took even more than a week. This suggests that
they incorporated some form of offline processing.</p>
        <p>Table 3 reports the decision-based performance achieved by the participating
teams.</p>
<p>In terms of Precision, F1, ERDE measures and latency-weighted F1, the
best performing runs were submitted by the iLab team. The first two iLab runs
had extremely high precision (.833 and .913, respectively) and the first one (run
#0) had the highest latency-weighted F1 (.658). These runs had low levels of
recall (.577 and .404) and they only analyzed a median of 10 user writings. This
suggests that you can get to a reasonably high level of precision based on a few
user writings. The main limitation of these best performing runs is the low level
of recall achieved. In terms of ERDE, the best performing runs show low levels
of error (.134 and .071). ERDE measures set a strong penalty on late decisions
and the two best runs show a good balance between the accuracy of the decisions
and the delays (the latency of the true positives was 2 and 45, respectively, for the
two runs that achieved the lowest ERDE5 and ERDE50).</p>
<p>Other teams submitted high-recall runs but their precision was very low and,
thus, these automatic methods are hardly usable to filter out non-risk cases.</p>
        <p>Most teams submitted quick decisions. Only iLab and prhlt-upv have some
runs that analysed more than a hundred submissions before emitting the alerts
(mean latencies higher than 100).</p>
<p>Overall, these results suggest that with a few dozen user writings some
systems achieved reasonably high effectiveness. The best predictive algorithms could
be used to support expert humans in the early detection of signs of self-harm.</p>
        <p>Table 4 reports the ranking-based performance achieved by the participating
teams. Some teams only processed a few dozens of user writings and, thus, we
could only compute their rankings of users for the initial points.</p>
<p>Some teams (e.g., INAOE-CIMAT or BioInfo@UAVR) have the same levels
of ranking-based effectiveness over multiple points (after 1 writing, after 100
writings, and so forth). This suggests that these teams did not change the risk
scores estimated in the initial stages (or their algorithms were not able to
enhance their estimations as more evidence was seen).</p>
<p>Other participants (e.g., EFE, iLab or hildesheim) behave as expected: the
rankings of estimated risk get better as they are built from more user evidence.
Notably, some iLab variants led to almost perfect P@10 and NDCG@10
performance after analyzing more than 100 writings. The NDCG@100 scores
achieved by this team after 100 or 500 writings were also quite high (above
.81 for all variants). This suggests that, with enough pieces of evidence, the
methods implemented by this team are highly effective at prioritizing at-risk
users.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Task 2: Measuring the Severity of the Signs of Depression</title>
<p>This task is a continuation of 2019's T3 task. The task consists of estimating
the level of depression from a thread of user submissions. For each user, the
participants were given the user's full history of postings (in a single release of
data) and the participants had to fill in a standard depression questionnaire based
on the evidence found in the history of postings. In 2020, the participants had
the opportunity to use 2019's data as training data (filled questionnaires and
SM submissions from the 2019 users, i.e. a training set composed of 20 users).</p>
      <p>
The questionnaires are derived from Beck's Depression Inventory (BDI) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
which assesses the presence of feelings like sadness, pessimism, loss of energy, etc.,
for the detection of depression. The questionnaire contains 21 questions (see Figures
2 and 3).
      </p>
<p>The task aims at exploring the viability of automatically estimating the
severity of the multiple symptoms associated with depression. Given the user's history
of writings, the algorithms had to estimate the user's response to each individual
question. We collected questionnaires filled in by Social Media users together with
their history of writings (we extracted each history of writings right after the
user provided us with the filled questionnaire). The questionnaires filled in by the
users (ground truth) were used to assess the quality of the responses provided
by the participating systems.</p>
<p>The participants were given a dataset with 70 users and they were asked to
produce a file with the following structure:
username1 answer1 answer2 .... answer21
username2 ....
....
This questionnaire consists of 21 groups of statements. Please read each group of statements
carefully, and then pick out the one statement in each group that best describes the way you feel.</p>
      <p>If several statements in the group seem to apply equally well, choose the highest
number for that group.
1. Sadness
0. I do not feel sad.
1. I feel sad much of the time.
2. I am sad all the time.
3. I am so sad or unhappy that I can't stand it.
2. Pessimism
0. I am not discouraged about my future.
1. I feel more discouraged about my future than I used to be.
2. I do not expect things to work out for me.
3. I feel my future is hopeless and will only get worse.
3. Past Failure
0. I do not feel like a failure.
1. I have failed more than I should have.
2. As I look back, I see a lot of failures.
3. I feel I am a total failure as a person.
4. Loss of Pleasure
0. I get as much pleasure as I ever did from the things I enjoy.
1. I don't enjoy things as much as I used to.
2. I get very little pleasure from the things I used to enjoy.
3. I can't get any pleasure from the things I used to enjoy.
5. Guilty Feelings
0. I don't feel particularly guilty.
1. I feel guilty over many things I have done or should have done.
2. I feel quite guilty most of the time.
3. I feel guilty all of the time.
6. Punishment Feelings
0. I don't feel I am being punished.
1. I feel I may be punished.
2. I expect to be punished.
3. I feel I am being punished.
7. Self-Dislike
0. I feel the same about myself as ever.
1. I have lost confidence in myself.
2. I am disappointed in myself.
3. I dislike myself.
8. Self-Criticalness
0. I don't criticize or blame myself more than usual.
1. I am more critical of myself than I used to be.
2. I criticize myself for all of my faults.
3. I blame myself for everything bad that happens.
9. Suicidal Thoughts or Wishes
0. I don't have any thoughts of killing myself.
1. I have thoughts of killing myself, but I would not carry them out.
2. I would like to kill myself.
3. I would kill myself if I had the chance.
10. Crying
0. I don't cry anymore than I used to.
1. I cry more than I used to.
2. I cry over every little thing.
3. I feel like crying, but I can't.
11. Agitation
0. I am no more restless or wound up than usual.
1. I feel more restless or wound up than usual.
2. I am so restless or agitated that it's hard to stay still.
3. I am so restless or agitated that I have to keep moving or doing something.
12. Loss of Interest
0. I have not lost interest in other people or activities.
1. I am less interested in other people or things than before.
2. I have lost most of my interest in other people or things.
3. It's hard to get interested in anything.
13. Indecisiveness
0. I make decisions about as well as ever.
1. I find it more difficult to make decisions than usual.
2. I have much greater difficulty in making decisions than I used to.
3. I have trouble making any decisions.
14. Worthlessness
0. I do not feel I am worthless.
1. I don't consider myself as worthwhile and useful as I used to.
2. I feel more worthless as compared to other people.
3. I feel utterly worthless.
15. Loss of Energy
0. I have as much energy as ever.
1. I have less energy than I used to have.
2. I don't have enough energy to do very much.
3. I don't have enough energy to do anything.</p>
      <p>Fig. 2. Beck's Depression Inventory (part 1)</p>
<p>Each line has a user identifier and 21 values. These values correspond to the
responses to the questions of the depression questionnaire (the possible values
are 0, 1a, 1b, 2a, 2b, 3a, 3b for questions 16 and 18, and 0, 1, 2, 3 for the rest
of the questions).</p>
      <sec id="sec-4-1">
<title>Task 2: Evaluation Metrics</title>
<p>For consistency purposes, we employed the same evaluation metrics utilised in
2019. These metrics assess the quality of a questionnaire filled in by a system in
comparison with the real questionnaire filled in by the actual Social Media user:
- Average Hit Rate (AHR): Hit Rate (HR) averaged over all users. HR is
a stringent measure that computes the ratio of cases where the automatic
questionnaire has exactly the same answer as the real questionnaire. For
example, an automatic questionnaire with 5 matches gets HR equal to 5/21
(because there are 21 questions in the form).
- Average Closeness Rate (ACR): Closeness Rate (CR) averaged over all
users. CR takes into account that the answers of the depression questionnaire
represent an ordinal scale. For example, consider question #17:
17. Irritability
0. I am no more irritable than usual.
1. I am more irritable than usual.
2. I am much more irritable than usual.
3. I am irritable all the time.</p>
<p>Imagine that the real user answered "0". A system S1 whose answer is "3"
should be penalised more than a system S2 whose answer is "1".
For each question, CR computes the absolute difference (ad) between the real
and the automated answer (e.g. ad=3 and ad=1 for S1 and S2, respectively)
and, next, this absolute difference is transformed into an effectiveness score
as follows: CR = (mad - ad)/mad, where mad is the maximum absolute
difference, which is equal to the number of possible answers minus one.
NOTE: in the two questions (#16 and #18) that have seven possible answers
{0, 1a, 1b, 2a, 2b, 3a, 3b}, the pairs (1a, 1b), (2a, 2b), (3a, 3b) are considered
equivalent because they reflect the same depression level. As a consequence,
the difference between 3b and 0 is equal to 3 (and the difference between 1a
and 1b is equal to 0).
- Average DODL (ADODL): Difference between Overall Depression Levels
(DODL) averaged over all users. The previous measures assess the systems'
ability to answer each question in the form. DODL, instead, does not look
at question-level hits or differences but computes the overall depression level
(sum of all the answers) for the real and the automated questionnaire and, next,
the absolute difference (ad_overall) between the real and the automated
score is computed.</p>
        <p>
Depression levels are integers between 0 and 63 and, thus, DODL is
normalised into [0, 1] as follows: DODL = (63 - ad_overall)/63.
- Depression Category Hit Rate (DCHR). In the psychological domain,
it is customary to associate depression levels with the following categories:
minimal depression (depression levels 0-9),
mild depression (depression levels 10-18),
moderate depression (depression levels 19-29),
severe depression (depression levels 30-63).
The last effectiveness measure consists of computing the fraction of cases
where the automated questionnaire led to a depression category that is
equivalent to the depression category obtained from the real questionnaire.
        </p>
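<p>The four measures can be sketched per user as follows. This is a minimal implementation under the definitions above, assuming answers are encoded as strings, with the letter variants of questions 16 and 18 mapped to their shared numeric level; function names are our own.</p>
```python
def level(answer):
    # '1a' and '1b' reflect the same depression level, so only the digit counts
    return int(answer[0])

def evaluate_user(real, auto):
    """real, auto: 21 answer strings each ('0'..'3', or e.g. '1a' for Q16/Q18).
    Returns (HR, CR, DODL, DCHR hit); averaging each component over all
    users yields AHR, ACR, ADODL and DCHR."""
    hr = sum(r == a for r, a in zip(real, auto)) / 21
    # CR = (mad - ad) / mad, with mad = 3 for every question
    cr = sum((3 - abs(level(r) - level(a))) / 3 for r, a in zip(real, auto)) / 21
    real_dl, auto_dl = sum(map(level, real)), sum(map(level, auto))
    dodl = (63 - abs(real_dl - auto_dl)) / 63
    def category(dl):
        # BDI depression categories as listed above
        if dl >= 30: return "severe"
        if dl >= 19: return "moderate"
        if dl >= 10: return "mild"
        return "minimal"
    return hr, cr, dodl, category(real_dl) == category(auto_dl)
```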
      </sec>
      <sec id="sec-4-2">
        <title>Task 2: Results</title>
        <p>Table 5 presents the results achieved by the participants in this task. To put
things in perspective, the table also reports (lower block) the performance achieved
by three baseline variants: all 0s and all 1s, which consist of sending the same
response (0 or 1) for all the questions, and random, which is the average
performance (averaged over 1000 repetitions) achieved by an algorithm that randomly
chooses among the possible answers.</p>
<p>Although the teams could use training data from 2019 (while 2019's
participants had no training data), the performance scores tend to be lower than 2019's
scores (only ADODL was higher). This could be due
to various reasons, including the intrinsic difficulty of the task and the possibility
that the 2020 users discussed their psychological concerns less on Social Media.</p>
<p>In terms of AHR, the best performing run (BioInfo@UAVR) only got 38.30%
of the answers right. The scores of the distance-based measure (ACR) are below
70%. Most of the questions have four possible answers and, thus, a random
algorithm would get an AHR near 25%[10]. This suggests that the analysis of the user
posts was useful for extracting some signals or symptoms related to depression.
However, ADODL and, particularly, DCHR show that the participants, although
effective at answering some depression-related questions, do not fare well at
estimating the overall level of depression of the individuals. For example, the
best performing run gets the depression category right for only 35.71% of the
individuals.</p>
<p>Overall, these experiments indicate that we are still far from a truly effective
depression screening tool. In the near future, it will be interesting to further
analyze the participants' estimations in order to investigate which particular
BDI questions are easier or harder to automatically answer based on Social
Media activity.
[10] Actually, slightly less than 25%, because a couple of questions have more than four
possible answers.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Participating Teams</title>
<p>EFE. This is a team from the Dept. of Computer Engineering, Ferdowsi
University of Mashhad (Iran). They implemented three variants for Task 1 that
represent the texts using Word2Vec representations and performed experiments
using Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM)
models, and Support Vector Machines (SVMs). The entire system is an ensemble
multi-level method based on SVM, CNN, and LSTM, which are fine-tuned by
attention layers.</p>
      <p>USDB. This is a joint collaboration between the LRDSI Laboratory
(BLIDA 1 University, Algeria) and the Information Assurance and Security Research
Group (Faculty of Computing, Universiti Teknologi Malaysia, Malaysia). This
team participated in Task 2 and transformed the user's texts into distributed
representations following the Skip-gram model. Next, sentences were encoded
using a CNN or Bi-LSTM model (or with Recurrent Neural Networks and Long
Bi-LSTM). For each user post, the models generated 21 outputs, which are the
answers to the BDI questions. Finally, the user's overall questionnaire was obtained
by selecting, for each BDI question, the most frequent answer.</p>
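The per-question majority vote described above can be sketched in a few lines of Python. This is a toy illustration, not the team's code; `aggregate_questionnaire` and the 3-question example are hypothetical:

```python
from collections import Counter

def aggregate_questionnaire(per_post_answers):
    """Pick, for each BDI question, the most frequent answer across
    the per-post predictions (a simple majority vote)."""
    n_questions = len(per_post_answers[0])
    return [
        Counter(post[q] for post in per_post_answers).most_common(1)[0][0]
        for q in range(n_questions)
    ]

# Toy run: 3 posts, 3 questions (a real questionnaire has 21).
posts = [[0, 1, 2], [0, 2, 2], [1, 2, 2]]
print(aggregate_questionnaire(posts))  # [0, 2, 2]
```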
<p>NLP-UNED. This is a joint effort by the NLP &amp; IR Group at Universidad
Nacional de Educacion a Distancia (UNED), Spain, and the Instituto Mixto de
Investigacion - Escuela Nacional de Sanidad (IMIENS), Spain. These researchers,
who participated in Task 1, implemented a machine learning approach using
textual features and an SVM classifier. In order to extract relevant features, this
team followed a sliding window approach that handles the last messages
published by any given user. The features considered a wide range of variables, such
as title length, words in the title, punctuation, emoticons, and other feature sets
obtained from sentiment analysis, first person pronouns, and NSSI words.</p>
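A sliding window over a user's latest posts, with a handful of surface features, might look like the following sketch. The concrete features and the `window_features` helper are hypothetical simplifications of the team's feature set:

```python
def window_features(messages, window=10):
    """Toy feature extractor over the last `window` posts of a user.
    Features shown (length, punctuation, first-person pronouns) are
    illustrative stand-ins for the richer set used by the team."""
    recent = messages[-window:]
    text = " ".join(recent)
    tokens = text.lower().split()
    return {
        "n_messages": len(recent),
        "avg_len": sum(len(m) for m in recent) / max(len(recent), 1),
        "first_person": sum(tokens.count(p) for p in ("i", "me", "my")),
        "exclamations": text.count("!"),
    }

print(window_features(["I feel fine", "my day was bad!"]))
```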
      <p>Hildesheim. This team, from the Institute for Information Science and
Natural Language Processing (University of Hildesheim, Germany), implemented
four variants that apply different methods for Task 1 and a fifth ensemble
system that combines the four variants. The four methods utilize different types
of features, such as time intervals between posts, and the sentiment and
semantics of the writings. To this aim, a neural network approach using bag-of-words
vectors and contextualized word embeddings was employed.</p>
      <p>BioInfo@UAVR. This team comes from the Bioinformatics group of the
Institute of Electronics and Engineering Informatics (University of Aveiro,
Portugal). They participated in both tasks. Their approach built upon the algorithms
proposed by them for eRisk 2019. For Task 1, they considered a bag of words
approach with tf-idf features and employed linear Support Vector Machines with
Stochastic Gradient Descent and Passive Aggressive classifiers. For Task 2, the
method is based on training a machine learning model on an external dataset
and, next, applying the learnt classifier to the eRisk 2020 data. These
authors considered psycholinguistic and behavioural features in their attempt to
associate the BDI responses with the user's posts.</p>
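For intuition, a bag-of-words tf-idf representation of the kind fed to such linear classifiers can be computed in pure Python. This is a minimal sketch, not the team's implementation; real systems would rely on an optimized library:

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal bag-of-words tf-idf: term frequency multiplied by the log
    of the inverse document frequency, over whitespace-tokenized docs."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    return [
        {t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]

vectors = tfidf(["a b", "a c"])
print(vectors[0])  # a term appearing in every document gets weight 0
```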
<p>INAOE-CIMAT. This is a joint effort by Instituto Nacional de Astrofisica,
Optica y Electronica (INAOE), Mexico and Centro de Investigacion en
Matematicas (CIMAT), Mexico. This team participated in the first task and proposed a
so-called Bag of Sub-Emotions approach that represents the posts of the users
using a set of sub-emotions. This representation, which was subsequently
combined with bag of words representations, captures the emotions and topics that
users with signs of self-harm tend to use. At test time, they experimented with
five variants that estimate the temporal stability associated with the users' posts.</p>
<p>SSN-NLP. These participants come from the SSN College of
Engineering, Chennai (India) and participated in Task 1. Given the training data, they
experimented with five alternative classification approaches (using tf-idf
representations and Bernoulli Naive Bayes, Gradient Boosting, Random Forest, or
Extra Trees; or CNNs together with Word2Vec representations).</p>
<p>iLab. This is a joint collaboration between Centro Singular de Investigacion
en Tecnoloxias Intelixentes (CiTIUS), Universidade de Santiago de Compostela,
Spain and the Department of Computer and Information Sciences, University
of Strathclyde, UK. This team participated in both tasks. They used BERT-based
classifiers which were trained specifically for each task. A variety of pre-trained
models were tested against training data (including BERT, DistilBERT,
RoBERTa and XLM-RoBERTa). Rather than using the task's training data,
these participants created four new training datasets from Reddit. The submitted
runs for Task 1 were based on XLM-RoBERTa. For Task 2, they employed similar
methods to the ones employed for Task 1, but they treated the problem as a
multi-class labelling problem (one problem for each BDI question).</p>
      <p>Prhlt-upv. This team is composed of researchers from Universitat
Politecnica de Valencia and from University of Bucharest. They employed
multidimensional representations of language and deep learning models (including
hierarchical architectures, pre-trained transformers and language models). This
team participated in both tasks. For Task 1, they utilized content features, style
features, LIWC features, emotions and sentiments. Different strategies were
implemented to represent the users' submissions (e.g., augmenting the data by
sampling from the user's history or computing a rolling average over
the most recent estimations). For Task 2, they employed simpler learning models
(SVMs and logistic regression) and some of the features extracted for Task 1.
The problem was tackled as a multi-label multi-class problem where one model
was trained for each BDI question.</p>
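The rolling-average smoothing of recent estimations mentioned above admits a compact sketch; the window size and the `rolling_risk` name are illustrative assumptions:

```python
from collections import deque

def rolling_risk(scores, window=5):
    """Smooth a stream of per-post risk estimations with a rolling
    average over the `window` most recent predictions."""
    buf, smoothed = deque(maxlen=window), []
    for s in scores:
        buf.append(s)  # deque drops the oldest score once full
        smoothed.append(sum(buf) / len(buf))
    return smoothed

print(rolling_risk([1.0, 0.0, 1.0], window=2))  # [1.0, 0.5, 0.5]
```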
      <p>Relai. This team comes from University of Quebec in Montreal, Canada.
These researchers participated in both tasks, and addressed them using topic
modeling algorithms (LDA and Anchor Variant), neural models with three
different architectures (Deep Averaging Networks, Contextualizers, and RNNs),
and an approach based on writing styles. Some of the variants considered
stylometry variables, such as Part-of-Speech, frequent n-grams, punctuation, length
of words/sentences and usage of uppercase or hyperlinks.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>This paper provided an overview of eRisk 2020. This was the fourth edition of
this lab and the lab's activities concentrated on two different types of tasks: early
detection of signs of self-harm (T1), where the participants had sequential
access to the user's social media posts and had to send alerts about at-risk
individuals, and measuring the severity of the signs of depression (T2), where
the participants were given the full user history and their systems had to
automatically estimate the user's responses to a standard depression questionnaire.</p>
      <p>Overall, the proposed tasks received 73 variants or runs from 12 teams.
Although the effectiveness of the proposed solutions is still modest, the experiments
suggest that evidence extracted from Social Media is valuable and automatic or
semi-automatic screening tools could be designed to detect at-risk individuals.
This promising result encourages us to further explore the creation of
benchmarks for text-based screening of signs of risk.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
<p>We thank the support obtained from the Swiss National Science Foundation
(SNSF) under the project "Early risk prediction on the Internet: an evaluation
corpus", 2015.</p>
<p>This work was funded by FEDER/Ministerio de Ciencia, Innovacion y
Universidades - Agencia Estatal de Investigacion/ Project (RTI2018-093336-B-C21).
This work has also received financial support from the Conselleria de
Educacion, Universidade e Formacion Profesional (accreditation 2019-2022
ED431G2019/04, ED431C 2018/29) and the European Regional Development Fund,
which acknowledges the CiTIUS-Research Center in Intelligent Technologies of
the University of Santiago de Compostela as a Research Center of the Galician
University System.</p>
    </sec>
  </body>
  <back>
  </back>
</article>