<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of eRisk at CLEF 2020: Early Risk Prediction on the Internet (Extended Overview)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David E. Losada</string-name>
          <email>david.losada@usc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Crestani</string-name>
          <email>fabio.crestani@usi.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier Parapar</string-name>
          <email>javierparapar@udc.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS)</institution>
          ,
          <addr-line>Universidade de Santiago de Compostela</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Faculty of Informatics, Università della Svizzera italiana (USI)</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>Information Retrieval Lab, Centro de Investigación en Tecnologías de la Información y las Comunicaciones, Universidade da Coruña</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper provides an overview of eRisk 2020, the fourth edition of this lab under the CLEF conference. The main purpose of eRisk is to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. This edition of eRisk had two tasks. The first task focused on early detection of signs of self-harm. The second task challenged the participants to automatically fill in a depression questionnaire based on user interactions in social media.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The main purpose of eRisk is to explore issues of evaluation methodologies,
performance metrics and other aspects related to building test collections and
defining challenges for early risk detection. Early detection technologies are
potentially useful in different areas, particularly those related to safety and health.
For example, early alerts could be sent when a person starts showing signs of a
mental disorder, when a sexual predator starts interacting with a child, or when
a potential offender starts publishing antisocial threats on the Internet.</p>
      <p>
Although the evaluation methodology (strategies to build new test
collections, novel evaluation metrics, etc.) can be applied to multiple domains, eRisk
has so far focused on psychological problems (essentially, depression, self-harm
and eating disorders). In 2017 [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], we ran an exploratory task on early
detection of depression. This pilot task was based on the evaluation methodology
and test collection presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In 2018 [
        <xref ref-type="bibr" rid="ref5 ref6">6, 5</xref>
        ], we ran a continuation of the
task on early detection of signs of depression together with a new task on early
detection of signs of anorexia. In 2019 [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
], we had a continuation of the task
on early detection of signs of anorexia, a new task on early detection of signs of
self-harm and a third task oriented to estimating a user's answers to a depression
questionnaire based on their interactions on social media.
      </p>
<p>Over these years, we have been able to compare a number of solutions that
employ multiple technologies and models (e.g. Natural Language Processing,
Machine Learning, or Information Retrieval). We learned that the interaction
between psychological problems and language use is challenging and, in general,
the effectiveness of most contributing systems is modest. For example, most
challenges had levels of performance (e.g. in terms of F1) below 70%. This suggests
that this kind of early prediction task requires further research and that the solutions
proposed so far still have much room for improvement.</p>
<p>In 2020, the lab had two campaign-style tasks. The first task had the same
orientation as previous early detection tasks. It focused on early detection of
signs of self-harm. The second task was a continuation of 2019's third task. It
was oriented to analyzing a user's history of posts and extracting useful evidence
for estimating the user's depression level. More specifically, the participants had
to process the user's posts and, next, estimate the user's answers to a standard
depression questionnaire. These tasks are described in the next sections of this
overview paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Task 1: Early Detection of Signs of Self-Harm</title>
<p>This is the continuation of eRisk 2019's T2 task. The challenge consists of
sequentially processing pieces of evidence and detecting early traces of self-harm as
soon as possible. The task is mainly concerned with evaluating Text Mining
solutions and, thus, it concentrates on texts written in Social Media. Texts had to
be processed in the order they were posted. In this way, systems that effectively
perform this task could be applied to sequentially monitor user interactions in
blogs, social networks, or other types of online media.</p>
      <p>
        The test collection for this task had the same format as the collection
described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The source of data is also the same used for previous eRisks. It
is a collection of writings (posts or comments) from a set of Social Media users.
There are two categories of users, self-harm and non-self-harm, and, for each
user, the collection contains a sequence of writings (in chronological order).
      </p>
<p>In 2019, we moved from a chunk-based release of data (used in 2017 and
2018) to an item-by-item release of data. We set up a server that iteratively gave
user writings to the participating teams. In 2020, the same server was used to
provide the users' writings during the test stage. More information about the
server can be found at the lab website[4].
[4] http://early.irlab.org/server.html</p>
<p>The 2020 task was organized into two different stages:
- Training stage. Initially, the teams that participated in this task had access
to a training stage where we released the whole history of writings for a
set of training users (we provided all writings of all training users), and we
indicated which users had explicitly mentioned that they had done self-harm.
The participants could therefore tune their systems with the training data.</p>
<p>In 2020, the training data for Task 1 was composed of all of 2019's T2 users.
- Test stage. The test stage consisted of a period of time during which the
participants had to connect to our server and iteratively get user writings and send
responses. Each participant had the opportunity to stop and make an alert
at any point of the user chronology. After reading each user post, the teams
had to choose between: i) emitting an alert on the user, or ii) making no alert
on the user. Alerts were considered final (i.e. further decisions about this
individual were ignored), while no alerts were considered non-final (i.e.
the participants could later submit an alert for this user if they detected the
appearance of risk signs). This choice had to be made for each user in the
test split. The systems were evaluated based on the accuracy of the decisions
and the number of user writings required to take the decisions (see below). A
REST server was built to support the test stage. The server iteratively gave
user writings to the participants and waited for their responses (no new user
data was provided until the system said alert/no alert). This server was running
from March 2nd, 2020 to May 24th, 2020[5].</p>
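<p>The alert/no-alert protocol described above can be sketched as a simple round-by-round simulation. This is an illustrative sketch only: the data structures and the toy keyword classifier are our own assumptions and do not model the actual eRisk REST API.</p>
```python
def run_test_stage(user_streams, classify):
    """Iteratively release one writing per user per round.
    An alert (1) is final; a 'no alert' (0) can be revised in later rounds."""
    decisions = {u: 0 for u in user_streams}
    history = {u: [] for u in user_streams}
    n_rounds = max(len(w) for w in user_streams.values())
    for r in range(n_rounds):
        for user, writings in user_streams.items():
            # once a team alerts on a user, further writings are ignored
            if decisions[user] == 1 or r >= len(writings):
                continue
            history[user].append(writings[r])
            if classify(history[user]):
                decisions[user] = 1
    return decisions

# Toy stand-in for a participant's model (illustrative only)
def risky(posts):
    return any("harm" in p for p in posts)
```
<p>A real participant would replace the toy classifier with its trained model; the point of the sketch is that the decision for each user is re-evaluated after every new writing until an alert is emitted.</p>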
<p>Table 1 reports the main statistics of the train and test collections used for
T1. Evaluation measures are discussed in the next section.</p>
      <sec id="sec-2-1">
        <title>Decision-based Evaluation</title>
        <p>
This form of evaluation revolves around the (binary) decisions taken for each user
by the participating systems. Besides standard classification measures (Precision,
Recall and F1[6]), we computed ERDE, the early risk detection error used in the
previous editions of the lab. A full description of ERDE can be found in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
[5] In the initial configuration, the test period was shorter but, because of the COVID-19
situation, we decided to extend the test stage in order to facilitate participation.
[6] Computed with respect to the positive class.
        </p>
        <p>Essentially, ERDE is an error measure that introduces a penalty for late correct
alerts (true positives). The penalty grows with the delay in emitting the alert,
and the delay is measured here as the number of user posts that had to be
processed before making the alert.</p>
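<p>As a reference, this behaviour can be sketched as follows. This is a minimal reading of the ERDE_o definition given in [2]; the cost values here are illustrative defaults (in eRisk, the false-positive cost is set to the proportion of positive users in the collection).</p>
```python
import math

def erde(decisions, truths, delays, o=5, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """Sketch of ERDE_o: mean per-user cost, where a true positive pays a
    sigmoid penalty that grows with the number of posts seen (the delay)."""
    def cost(d, g, k):
        if d == 1 and g == 0:
            return c_fp                                  # false positive
        if d == 0 and g == 1:
            return c_fn                                  # false negative (missed case)
        if d == 1 and g == 1:
            # late correct alerts approach the cost of a miss
            return (1.0 - 1.0 / (1.0 + math.exp(k - o))) * c_tp
        return 0.0                                       # true negative
    return sum(cost(d, g, k)
               for d, g, k in zip(decisions, truths, delays)) / len(decisions)
```
<p>The parameter o controls where the penalty starts to saturate, which is why the lab reports both ERDE5 and ERDE50.</p>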
<p>Since 2019, we have complemented the evaluation report with additional
decision-based metrics that try to capture additional aspects of the problem. These
metrics try to overcome some limitations of ERDE, namely:
- the penalty associated with true positives goes quickly to 1. This is due to the
functional form of the cost function (sigmoid);
- a perfect system, which detects the true positive case right after the first
round of messages (first chunk), does not get an error equal to 0;
- with a method based on releasing data in a chunk-based way (as was done
in 2017 and 2018), the contribution of each user to the performance evaluation
has a large variance (different for users with few writings per chunk vs. users
with many writings per chunk);
- ERDE is not interpretable.</p>
        <p>
Some research teams have analysed these issues and proposed alternative
ways of evaluation. Trotzek and colleagues [
          <xref ref-type="bibr" rid="ref10">10</xref>
] proposed ERDE_o%. This is a
variant of ERDE that does not depend on the number of user writings seen
before the alert but, instead, on the percentage of user writings seen
before the alert. In this way, users' contributions to the evaluation are normalized
(all users weigh the same). However, there is an important limitation
of ERDE_o%. In real-life applications, the overall number of user writings is not
known in advance. Social Media users post contents online and screening tools
have to make predictions with the evidence seen. In practice, you do not know
when (and if) a user's thread of messages is exhausted. Thus, the performance
metric should not depend on such lack of knowledge about the total number of
user writings.
        </p>
        <p>
          Another proposal of an alternative evaluation metric for early risk prediction
was done by Sadeque and colleagues [
          <xref ref-type="bibr" rid="ref9">9</xref>
]. They proposed F_latency, which fits better
with our purposes. This measure is described next.
        </p>
        <p>
Imagine a user u ∈ U and an early risk detection system that iteratively
analyzes u's writings (e.g. in chronological order, as they appear in Social Media)
and, after analyzing k_u user writings (k_u ≥ 1), takes a binary decision d_u ∈
{0, 1}, which represents the decision of the system about the user being a risk
case. By g_u ∈ {0, 1}, we refer to the user's golden truth label. A key component
of an early risk evaluation should be the delay in detecting true positives (we do
not want systems to detect these cases too late). Therefore, a first and intuitive
measure of delay can be defined as follows[7]:
[7] Observe that Sadeque et al. (see [
          <xref ref-type="bibr" rid="ref9">9</xref>
], pg. 497) computed the latency for all users such
that g_u = 1. We argue that latency should be computed only for the true positives.
The false negatives (g_u = 1, d_u = 0) are not detected by the system and, therefore,
they would not generate an alert.
        </p>
<p>latency_TP = median{k_u : u ∈ U, d_u = g_u = 1}    (1)</p>
        <p>This measure of latency goes over the true positives detected by the system
and assesses the system's delay based on the median number of writings that
the system had to process to detect such positive cases. This measure can be
included in the experimental report together with standard measures such as
Precision (P), Recall (R) and the F-measure (F):</p>
<p>P = |{u ∈ U : d_u = g_u = 1}| / |{u ∈ U : d_u = 1}|    (2)</p>
<p>R = |{u ∈ U : d_u = g_u = 1}| / |{u ∈ U : g_u = 1}|    (3)</p>
<p>F = 2 · P · R / (P + R)    (4)</p>
<p>[8] Again, we adopt Sadeque et al.'s proposal but we estimate latency only over the true
positives.
[9] In the evaluation we set p to 0.0078, a setting obtained from the eRisk 2017 collection.</p>
<p>Furthermore, Sadeque et al. proposed a measure, F_latency, which combines
the effectiveness of the decision (estimated with the F-measure) and the delay[8].
This is based on multiplying F by a penalty factor based on the median delay.
More specifically, each individual (true positive) decision, taken after reading k_u
writings, is assigned the following penalty:</p>
<p>
penalty(k_u) = -1 + 2 / (1 + exp(-p · (k_u - 1)))    (5)
where p is a parameter that determines how quickly the penalty should increase.
In [
          <xref ref-type="bibr" rid="ref9">9</xref>
], p was set such that the penalty equals 0.5 at the median number of posts
of a user[9]. Observe that a decision right after the first writing has no penalty
(penalty(1) = 0). Figure 1 plots how the latency penalty increases with the
number of observed writings.
        </p>
<p>The system's overall speed factor is computed as:
speed = 1 - median{penalty(k_u) : u ∈ U, d_u = g_u = 1}    (6)
speed equals 1 for a system whose true positives are detected right at the first
writing. A slow system, which detects true positives after hundreds of writings,
will be assigned a speed score near 0.</p>
<p>Finally, the latency-weighted F score is simply:
F_latency = F · speed    (7)
Since 2019, users' data has been processed by the participants on a post-by-post
basis (i.e. we avoided a chunk-based release of data). Under these conditions,
the evaluation approach has the following properties:
- smooth growth of penalties;
- a perfect system gets F_latency = 1;
- for each user u the system can opt to stop at any point k_u and, therefore,
we no longer have the effect of an imbalanced importance of users;
- F_latency is more interpretable than ERDE.</p>
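<p>The decision-based measures above (latency_TP, the per-decision penalty, speed and the latency-weighted F score) can be sketched as follows. This is a minimal implementation under the definitions in this section; the function and variable names are our own.</p>
```python
import math
from statistics import median

def penalty(k, p=0.0078):
    # penalty for a true positive emitted after k writings; penalty(1) == 0
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))

def latency_weighted_f1(decisions, truths, delays, p=0.0078):
    """decisions, truths: per-user d_u and g_u; delays: writings read (k_u).
    Returns (F_latency, latency_TP)."""
    tp_delays = [k for d, g, k in zip(decisions, truths, delays) if d == g == 1]
    if not tp_delays:
        return 0.0, None
    prec = len(tp_delays) / sum(decisions)                    # Precision
    rec = len(tp_delays) / sum(truths)                        # Recall
    f1 = 2 * prec * rec / (prec + rec)                        # F-measure
    speed = 1.0 - median(penalty(k, p) for k in tp_delays)    # speed factor
    return f1 * speed, median(tp_delays)
```
<p>With p = 0.0078 (the value used in the evaluation), a perfect system that alerts on every positive user right after the first writing gets speed = 1 and, hence, F_latency = F.</p>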
      </sec>
      <sec id="sec-2-2">
        <title>Ranking-based Evaluation</title>
        <p>This section discusses an alternative form of evaluation, which was used as a
complement of the evaluation described above. After each release of data (new
user writing) the participants had to send back the following information (for
each user in the collection): i) a decision for the user (alert/no alert), which was
used to compute the decision-based metrics discussed above, and ii) a score that
represents the user's level of risk (estimated from the evidence seen so far). We
used these scores to build a ranking of users in decreasing estimation of risk.
For each participating system, we have one ranking at each point (i.e., ranking
after 1 writing, ranking after 2 writings, etc.). This simulates a continuous
reranking approach based on the evidence seen so far. In a real life application,
this ranking would be presented to an expert user who could take decisions (e.g.
by inspecting the rankings).</p>
<p>Each ranking can be scored with standard IR metrics, such as P@10 or
NDCG. We therefore report the ranking-based performance of the systems after
seeing k writings (with varying k).</p>
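<p>These per-round rankings can be scored as follows. This is a minimal sketch (helper names are our own) of P@k and NDCG@k over the golden labels sorted by decreasing estimated risk.</p>
```python
import math

def precision_at_k(ranked_labels, k=10):
    """ranked_labels: golden labels (1 = at-risk user), ordered by the
    decreasing risk scores submitted after a given number of writings."""
    return sum(ranked_labels[:k]) / k

def ndcg_at_k(ranked_labels, k=10):
    # discounted cumulative gain of the top-k, normalized by the ideal ordering
    def dcg(labels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal else 0.0
```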
      </sec>
      <sec id="sec-2-3">
        <title>Task 1: Results</title>
<p>Table 6 shows the participating teams, the number of runs submitted and the
approximate lapse of time from the first response to the last response. This lapse
of time is indicative of the degree of automation of each team's algorithms. A
few of the submitted runs processed the entire thread of messages (nearly 2000),
but many variants opted for stopping earlier. Six teams processed the thread
of messages in a reasonably fast way (less than a day or so for processing the
entire history of user messages). The rest of the teams took several days to run
the whole process. Some teams took even more than a week. This suggests that
they incorporated some form of offline processing.</p>
        <p>Table 3 reports the decision-based performance achieved by the participating
teams.</p>
<p>In terms of Precision, F1, ERDE measures and latency-weighted F1, the
best performing runs were submitted by the iLab team. The first two iLab runs
had extremely high precision (.833 and .913, respectively) and the first one (run
#0) had the highest latency-weighted F1 (.658). These runs had low levels of
recall (.577 and .404) and they only analyzed a median of 10 user writings. This
suggests that you can get to a reasonably high level of precision based on a few
user writings. The main limitation of these best performing runs is the low level
of recall achieved. In terms of ERDE, the best performing runs show low levels
of error (.134 and .071). ERDE measures set a strong penalty on late decisions
and the two best runs show a good balance between the accuracy of the decisions
and the delays (the latency of the true positives was 2 and 45, respectively, for the
two runs that achieved the lowest ERDE5 and ERDE50).</p>
<p>Other teams submitted high-recall runs but their precision was very low and,
thus, these automatic methods are hardly usable to filter out non-risk cases.</p>
        <p>Most teams submitted quick decisions. Only iLab and prhlt-upv have some
runs that analysed more than a hundred submissions before emitting the alerts
(mean latencies higher than 100).</p>
<p>Overall, these results suggest that with a few dozen user writings some
systems achieved reasonably high effectiveness. The best predictive algorithms could
be used to support expert humans in the early detection of signs of self-harm.</p>
        <p>Table 4 reports the ranking-based performance achieved by the participating
teams. Some teams only processed a few dozens of user writings and, thus, we
could only compute their rankings of users for the initial points.</p>
<p>Some teams (e.g., INAOE-CIMAT or BioInfo@UAVR) have the same levels
of ranking-based effectiveness over multiple points (after 1 writing, after 100
writings, and so forth). This suggests that these teams did not change the risk
scores estimated in the initial stages (or their algorithms were not able to
enhance their estimations as more evidence was seen).</p>
<p>Other participants (e.g., EFE, iLab or hildesheim) behave as expected: the
rankings of estimated risk get better as they are built from more user evidence.
Notably, some iLab variants led to almost perfect P@10 and NDCG@10
performance after analyzing more than 100 writings. The NDCG@100 scores
achieved by this team after 100 or 500 writings were also quite high (above
.81 for all variants). This suggests that, with enough pieces of evidence, the
methods implemented by this team are highly effective at prioritizing at-risk
users.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Task 2: Measuring the Severity of the Signs of Depression</title>
<p>This task is a continuation of 2019's T3 task. The task consists of estimating
the level of depression from a thread of user submissions. For each user, the
participants were given the user's full history of postings (in a single release of
data) and the participants had to fill in a standard depression questionnaire based
on the evidence found in the history of postings. In 2020, the participants had
the opportunity to use 2019's data as training data (filled questionnaires and
SM submissions from the 2019 users, i.e. a training set composed of 20 users).</p>
      <p>
The questionnaires are derived from Beck's Depression Inventory (BDI) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
which assesses the presence of feelings like sadness, pessimism, loss of energy, etc.,
for the detection of depression. The questionnaire contains 21 questions (see Figures
2 and 3).
      </p>
<p>The task aims at exploring the viability of automatically estimating the
severity of the multiple symptoms associated with depression. Given the user's history
of writings, the algorithms had to estimate the user's response to each individual
question. We collected questionnaires filled in by Social Media users together with
their history of writings (we extracted each history of writings right after the
user provided us with the filled questionnaire). The questionnaires filled in by the
users (ground truth) were used to assess the quality of the responses provided
by the participating systems.</p>
<p>The participants were given a dataset with 70 users and they were asked to
produce a file with the following structure:
username1 answer1 answer2 .... answer21
username2 ....
....
This questionnaire consists of 21 groups of statements. Please read each group of statements
carefully, and then pick out the one statement in each group that best describes the way you feel.</p>
      <p>If several statements in the group seem to apply equally well, choose the highest
number for that group.
1. Sadness
0. I do not feel sad.
1. I feel sad much of the time.
2. I am sad all the time.
3. I am so sad or unhappy that I can't stand it.
2. Pessimism
0. I am not discouraged about my future.
1. I feel more discouraged about my future than I used to be.
2. I do not expect things to work out for me.
3. I feel my future is hopeless and will only get worse.
3. Past Failure
0. I do not feel like a failure.
1. I have failed more than I should have.
2. As I look back, I see a lot of failures.
3. I feel I am a total failure as a person.
4. Loss of Pleasure
0. I get as much pleasure as I ever did from the things I enjoy.
1. I don't enjoy things as much as I used to.
2. I get very little pleasure from the things I used to enjoy.
3. I can't get any pleasure from the things I used to enjoy.
5. Guilty Feelings
0. I don't feel particularly guilty.
1. I feel guilty over many things I have done or should have done.
2. I feel quite guilty most of the time.
3. I feel guilty all of the time.
6. Punishment Feelings
0. I don't feel I am being punished.
1. I feel I may be punished.
2. I expect to be punished.
3. I feel I am being punished.
7. Self-Dislike
0. I feel the same about myself as ever.
1. I have lost confidence in myself.
2. I am disappointed in myself.
3. I dislike myself.
8. Self-Criticalness
0. I don't criticize or blame myself more than usual.
1. I am more critical of myself than I used to be.
2. I criticize myself for all of my faults.
3. I blame myself for everything bad that happens.
9. Suicidal Thoughts or Wishes
0. I don't have any thoughts of killing myself.
1. I have thoughts of killing myself, but I would not carry them out.
2. I would like to kill myself.
3. I would kill myself if I had the chance.
10. Crying
0. I don't cry anymore than I used to.
1. I cry more than I used to.
2. I cry over every little thing.
3. I feel like crying, but I can't.
11. Agitation
0. I am no more restless or wound up than usual.
1. I feel more restless or wound up than usual.
2. I am so restless or agitated that it's hard to stay still.
3. I am so restless or agitated that I have to keep moving or doing something.
12. Loss of Interest
0. I have not lost interest in other people or activities.
1. I am less interested in other people or things than before.
2. I have lost most of my interest in other people or things.
3. It's hard to get interested in anything.
13. Indecisiveness
0. I make decisions about as well as ever.
1. I find it more difficult to make decisions than usual.
2. I have much greater difficulty in making decisions than I used to.
3. I have trouble making any decisions.
14. Worthlessness
0. I do not feel I am worthless.
1. I don't consider myself as worthwhile and useful as I used to.
2. I feel more worthless as compared to other people.
3. I feel utterly worthless.
15. Loss of Energy
0. I have as much energy as ever.
1. I have less energy than I used to have.
2. I don't have enough energy to do very much.
3. I don't have enough energy to do anything.</p>
      <p>Fig. 2. Beck's Depression Inventory (part 1)</p>
<p>Each line has a user identifier and 21 values. These values correspond to the
responses to the questions of the depression questionnaire (the possible values
are 0, 1a, 1b, 2a, 2b, 3a, 3b for questions 16 and 18, and 0, 1, 2, 3 for the rest
of the questions).</p>
      <sec id="sec-4-1">
<title>Task 2: Evaluation Metrics</title>
<p>For consistency purposes, we employed the same evaluation metrics utilised in
2019. These metrics assess the quality of a questionnaire filled in by a system in
comparison with the real questionnaire filled in by the actual Social Media user:
- Average Hit Rate (AHR): Hit Rate (HR) averaged over all users. HR is
a stringent measure that computes the ratio of cases where the automatic
questionnaire has exactly the same answer as the real questionnaire. For
example, an automatic questionnaire with 5 matches gets HR equal to 5/21
(because there are 21 questions in the form).
- Average Closeness Rate (ACR): Closeness Rate (CR) averaged over all
users. CR takes into account that the answers of the depression questionnaire
represent an ordinal scale. For example, consider question #17:
17. Irritability
0. I am no more irritable than usual.
1. I am more irritable than usual.
2. I am much more irritable than usual.
3. I am irritable all the time.</p>
<p>Imagine that the real user answered "0". A system S1 whose answer is "3"
should be penalised more than a system S2 whose answer is "1".
For each question, CR computes the absolute difference (ad) between the real
and the automated answer (e.g. ad=3 and ad=1 for S1 and S2, respectively)
and, next, this absolute difference is transformed into an effectiveness score
as follows: CR = (mad - ad)/mad, where mad is the maximum absolute
difference, which is equal to the number of possible answers minus one.
NOTE: in the two questions (#16 and #18) that have seven possible answers
{0, 1a, 1b, 2a, 2b, 3a, 3b}, the pairs (1a, 1b), (2a, 2b), (3a, 3b) are considered
equivalent because they reflect the same depression level. As a consequence,
the difference between 3b and 0 is equal to 3 (and the difference between 1a
and 1b is equal to 0).
- Average DODL (ADODL): Difference between Overall Depression Levels
(DODL) averaged over all users. The previous measures assess the systems'
ability to answer each question in the form. DODL, instead, does not look
at question-level hits or differences but computes the overall depression level
(sum of all the answers) for the real and the automated questionnaire and, next,
the absolute difference (ad_overall) between the real and the automated
score is computed.</p>
        <p>
Depression levels are integers between 0 and 63 and, thus, DODL is
normalised into [0, 1] as follows: DODL = (63 - ad_overall)/63.
- Depression Category Hit Rate (DCHR). In the psychological domain,
it is customary to associate depression levels with the following categories:
minimal depression (depression levels 0-9),
mild depression (depression levels 10-18),
moderate depression (depression levels 19-29),
severe depression (depression levels 30-63).
The last effectiveness measure consists of computing the fraction of cases
where the automated questionnaire led to a depression category that is
equivalent to the depression category obtained from the real questionnaire.
        </p>
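<p>The four measures can be sketched per user as follows. This is a minimal implementation under the definitions above, assuming answers are encoded as strings, with the letter variants of questions 16 and 18 mapped to their shared numeric level; function names are our own.</p>
```python
def level(answer):
    # '1a' and '1b' reflect the same depression level, so only the digit counts
    return int(answer[0])

def evaluate_user(real, auto):
    """real, auto: 21 answer strings each ('0'..'3', or e.g. '1a' for Q16/Q18).
    Returns (HR, CR, DODL, DCHR hit); averaging each component over all
    users yields AHR, ACR, ADODL and DCHR."""
    hr = sum(r == a for r, a in zip(real, auto)) / 21
    # CR = (mad - ad) / mad, with mad = 3 for every question
    cr = sum((3 - abs(level(r) - level(a))) / 3 for r, a in zip(real, auto)) / 21
    real_dl, auto_dl = sum(map(level, real)), sum(map(level, auto))
    dodl = (63 - abs(real_dl - auto_dl)) / 63
    def category(dl):
        # BDI depression categories as listed above
        if dl >= 30: return "severe"
        if dl >= 19: return "moderate"
        if dl >= 10: return "mild"
        return "minimal"
    return hr, cr, dodl, category(real_dl) == category(auto_dl)
```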
      </sec>
      <sec id="sec-4-2">
        <title>Task 2: Results</title>
        <p>Table 5 presents the results achieved by the participants in this task. To put
things in perspective, the table also reports (lower block) the performance achieved
by three baseline variants: all 0s and all 1s, which consist of sending the same
response (0 or 1) for all the questions, and random, which is the average
performance (averaged over 1000 repetitions) achieved by an algorithm that randomly
chooses among the possible answers.</p>
<p>Although the teams could use training data from 2019 (while 2019's
participants had no training data), the performance scores tend to be lower than 2019's
scores (only ADODL was higher). This could be due
to various reasons, including the intrinsic difficulty of the task and the possibility
that the 2020 users discussed their psychological concerns less on Social Media.</p>
<p>In terms of AHR, the best performing run (BioInfo@UAVR) only got 38.30%
of the answers right. The scores of the distance-based measure (ACR) are below
70%. Most of the questions have four possible answers and, thus, a random
algorithm would get an AHR near 25%[10]. This suggests that the analysis of the user
posts was useful for extracting some signals or symptoms related to depression.
However, ADODL and, particularly, DCHR show that the participants, although
effective at answering some depression-related questions, do not fare well at
estimating the overall level of depression of the individuals. For example, the
best performing run gets the depression category right for only 35.71% of the
individuals.</p>
<p>Overall, these experiments indicate that we are still far from a truly effective
depression screening tool. In the near future, it will be interesting to further
analyze the participants' estimations in order to investigate which particular
BDI questions are easier or harder to automatically answer based on Social
Media activity.
[10] Actually, slightly less than 25%, because a couple of questions have more than four
possible answers.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Participating Teams</title>
<p>EFE. This is a team from the Dept. of Computer Engineering, Ferdowsi
University of Mashhad (Iran). They implemented three variants for Task 1 that
represent the texts using Word2Vec representations and performed experiments
using Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM)
models, and Support Vector Machines (SVMs). The entire system is an ensemble
multi-level method based on SVM, CNN, and LSTM, which are fine-tuned by
attention layers.</p>
      <p>USDB. This is a joint collaboration between the LRDSI Laboratory
(BLIDA 1 University, Algeria) and the Information Assurance and Security Research
Group (Faculty of Computing, Universiti Teknologi Malaysia, Malaysia). This
team participated in Task 2 and transformed the user's texts into distributed
representations following the Skip-gram model. Next, sentences were encoded
using a CNN or Bi-LSTM model (or with Recurrent Neural Networks and Long
Bi-LSTM). For each user post, the models generated 21 outputs, which are the
answers to the BDI questions. Finally, the user's overall questionnaire was obtained
by selecting, for each BDI question, the most frequent answer.</p>
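The per-question majority vote described above can be sketched in a few lines of Python. This is a toy illustration, not the team's code; `aggregate_questionnaire` and the 3-question example are hypothetical:

```python
from collections import Counter

def aggregate_questionnaire(per_post_answers):
    """Pick, for each BDI question, the most frequent answer across
    the per-post predictions (a simple majority vote)."""
    n_questions = len(per_post_answers[0])
    return [
        Counter(post[q] for post in per_post_answers).most_common(1)[0][0]
        for q in range(n_questions)
    ]

# Toy run: 3 posts, 3 questions (a real questionnaire has 21).
posts = [[0, 1, 2], [0, 2, 2], [1, 2, 2]]
print(aggregate_questionnaire(posts))  # [0, 2, 2]
```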
<p>NLP-UNED. This is a joint effort by the NLP &amp; IR Group at Universidad
Nacional de Educacion a Distancia (UNED), Spain, and the Instituto Mixto de
Investigacion - Escuela Nacional de Sanidad (IMIENS), Spain. These researchers,
who participated in Task 1, implemented a machine learning approach using
textual features and an SVM classifier. In order to extract relevant features, this
team followed a sliding window approach that handles the last messages
published by any given user. The features considered a wide range of variables, such
as title length, words in the title, punctuation, emoticons, and other feature sets
obtained from sentiment analysis, first person pronouns, and NSSI words.</p>
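A sliding window over a user's latest posts, with a handful of surface features, might look like the following sketch. The concrete features and the `window_features` helper are hypothetical simplifications of the team's feature set:

```python
def window_features(messages, window=10):
    """Toy feature extractor over the last `window` posts of a user.
    Features shown (length, punctuation, first-person pronouns) are
    illustrative stand-ins for the richer set used by the team."""
    recent = messages[-window:]
    text = " ".join(recent)
    tokens = text.lower().split()
    return {
        "n_messages": len(recent),
        "avg_len": sum(len(m) for m in recent) / max(len(recent), 1),
        "first_person": sum(tokens.count(p) for p in ("i", "me", "my")),
        "exclamations": text.count("!"),
    }

print(window_features(["I feel fine", "my day was bad!"]))
```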
      <p>Hildesheim. This team, from the Institute for Information Science and
Natural Language Processing (University of Hildesheim, Germany), implemented
four variants that apply different methods for Task 1 and a fifth ensemble
system that combines the four variants. The four methods utilize different types
of features, such as time intervals between posts, and the sentiment and
semantics of the writings. To this aim, a neural network approach using bag-of-words
vectors and contextualized word embeddings was employed.</p>
      <p>BioInfo@UAVR. This team comes from the Bioinformatics group of the
Institute of Electronics and Engineering Informatics (University of Aveiro,
Portugal). They participated in both tasks. Their approach built upon the algorithms
proposed by them for eRisk 2019. For Task 1, they considered a bag of words
approach with tf-idf features and employed linear Support Vector Machines with
Stochastic Gradient Descent and Passive Aggressive classifiers. For Task 2, the
method is based on training a machine learning model on an external dataset
and, next, applying the learnt classifier to the eRisk 2020 data. These
authors considered psycholinguistic and behavioural features in their attempt to
associate the BDI responses with the user's posts.</p>
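For intuition, a bag-of-words tf-idf representation of the kind fed to such linear classifiers can be computed in pure Python. This is a minimal sketch, not the team's implementation; real systems would rely on an optimized library:

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal bag-of-words tf-idf: term frequency multiplied by the log
    of the inverse document frequency, over whitespace-tokenized docs."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    return [
        {t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]

vectors = tfidf(["a b", "a c"])
print(vectors[0])  # a term appearing in every document gets weight 0
```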
<p>INAOE-CIMAT. This is a joint effort by Instituto Nacional de Astrofisica,
Optica y Electronica (INAOE), Mexico and Centro de Investigacion en
Matematicas (CIMAT), Mexico. This team participated in the first task and proposed a
so-called Bag of Sub-Emotions approach that represents the posts of the users
using a set of sub-emotions. This representation, which was subsequently
combined with bag of words representations, captures the emotions and topics that
users with signs of self-harm tend to use. At test time, they experimented with
five variants that estimate the temporal stability associated with the users' posts.</p>
<p>SSN-NLP. These participants come from the SSN College of
Engineering, Chennai (India) and participated in Task 1. Given the training data, they
experimented with five alternative classification approaches (using tf-idf
representations and Bernoulli Naive Bayes, Gradient Boosting, Random Forest, or
Extra Trees; or CNNs together with Word2Vec representations).</p>
<p>iLab. This is a joint collaboration between Centro Singular de Investigacion
en Tecnoloxias Intelixentes (CiTIUS), Universidade de Santiago de Compostela,
Spain and the Department of Computer and Information Sciences, University
of Strathclyde, UK. This team participated in both tasks. They used BERT-based
classifiers which were trained specifically for each task. A variety of pre-trained
models were tested against training data (including BERT, DistilBERT,
RoBERTa and XLM-RoBERTa). Rather than using the task's training data,
these participants created four new training datasets from Reddit. The submitted
runs for Task 1 were based on XLM-RoBERTa. For Task 2, they employed similar
methods to the ones employed for Task 1, but they treated the problem as a
multi-class labelling problem (one problem for each BDI question).</p>
      <p>Prhlt-upv. This team is composed of researchers from Universitat
Politecnica de Valencia and from University of Bucharest. They employed
multidimensional representations of language and deep learning models (including
hierarchical architectures, pre-trained transformers and language models). This
team participated in both tasks. For Task 1, they utilized content features, style
features, LIWC features, emotions and sentiments. Different strategies were
implemented to represent the users' submissions (e.g., augmenting the data by
sampling from the user's history or computing a rolling average over
the most recent estimations). For Task 2, they employed simpler learning models
(SVMs and logistic regression) and some of the features extracted for Task 1.
The problem was tackled as a multi-label multi-class problem where one model
was trained for each BDI question.</p>
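The rolling-average smoothing of recent estimations mentioned above admits a compact sketch; the window size and the `rolling_risk` name are illustrative assumptions:

```python
from collections import deque

def rolling_risk(scores, window=5):
    """Smooth a stream of per-post risk estimations with a rolling
    average over the `window` most recent predictions."""
    buf, smoothed = deque(maxlen=window), []
    for s in scores:
        buf.append(s)  # deque drops the oldest score once full
        smoothed.append(sum(buf) / len(buf))
    return smoothed

print(rolling_risk([1.0, 0.0, 1.0], window=2))  # [1.0, 0.5, 0.5]
```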
      <p>Relai. This team comes from University of Quebec in Montreal, Canada.
These researchers participated in both tasks, and addressed them using topic
modeling algorithms (LDA and Anchor Variant), neural models with three
different architectures (Deep Averaging Networks, Contextualizers, and RNNs),
and an approach based on writing styles. Some of the variants considered
stylometry variables, such as Part-of-Speech, frequent n-grams, punctuation, length
of words/sentences and usage of uppercase or hyperlinks.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>This paper provided an overview of eRisk 2020. This was the fourth edition of
this lab and the lab's activities concentrated on two different types of tasks: early
detection of signs of self-harm (T1), where the participants had sequential
access to the user's social media posts and had to send alerts about at-risk
individuals, and measuring the severity of the signs of depression (T2), where
the participants were given the full user history and their systems had to
automatically estimate the user's responses to a standard depression questionnaire.</p>
      <p>Overall, the proposed tasks received 73 variants or runs from 12 teams.
Although the effectiveness of the proposed solutions is still modest, the experiments
suggest that evidence extracted from Social Media is valuable and automatic or
semi-automatic screening tools could be designed to detect at-risk individuals.
This promising result encourages us to further explore the creation of
benchmarks for text-based screening of signs of risk.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
<p>We thank the support obtained from the Swiss National Science Foundation
(SNSF) under the project "Early risk prediction on the Internet: an evaluation
corpus", 2015.</p>
<p>This work was funded by FEDER/Ministerio de Ciencia, Innovacion y
Universidades - Agencia Estatal de Investigacion/ Project (RTI2018-093336-B-C21).
This work has also received financial support from the Conselleria de
Educacion, Universidade e Formacion Profesional (accreditation 2019-2022
ED431G2019/04, ED431C 2018/29) and the European Regional Development Fund,
which acknowledges the CiTIUS-Research Center in Intelligent Technologies of
the University of Santiago de Compostela as a Research Center of the Galician
University System.</p>
    </sec>
  </body>
  <back>
  </back>
</article>