Early Risk Detection of Self-Harm and Depression Severity using BERT-based Transformers
iLab at CLEF eRisk 2020

Rodrigo Martínez-Castaño1,2, Amal Htait2, Leif Azzopardi2, and Yashar Moshfeghi2

1 Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Spain
rodrigo.martinez@usc.es
2 Department of Computer and Information Sciences, University of Strathclyde, UK
{amal.htait,leif.azzopardi,yashar.moshfeghi}@strath.ac.uk

Abstract. This paper briefly describes our research group's efforts in tackling Task 1 (Early Detection of Signs of Self-Harm) and Task 2 (Measuring the Severity of the Signs of Depression) of the CLEF eRisk Track. Core to how we approached these problems was the use of BERT-based classifiers which were trained specifically for each task. Our results on both tasks indicate that this approach delivers high performance across a series of measures, particularly for Task 1, where our submissions obtained the best performance for precision, F1, latency-weighted F1 and ERDE at 5 and 50. This work suggests that BERT-based classifiers, when trained appropriately, can accurately infer which social media users are at risk of self-harming, with precision up to 91.3% for Task 1. Given these promising results, it will be interesting to further refine the training regime, classifier and early detection scoring mechanism, as well as to apply the same approach to other related tasks (e.g., anorexia, depression, suicide).

Keywords: Self-Harm · Depression · Classification · Social Media · Early Detection · BERT · XLM-RoBERTa

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.
Complementary content: https://github.com/brunneis/ilab-erisk-2020

1 Introduction

The eRisk CLEF track aims to explore the development of methods for early risk detection on the Internet, their evaluation, and the application of such methods for improving the health and well-being of individuals [8–11]. Early detection technologies can be employed in different areas, particularly those related to health and safety. For instance, in [9] the authors examined whether it was possible to identify the grooming activities of paedophiles given their posts to online forums, while in [10, 11] they explored whether it was possible to detect users that were depressed or anorexic from their posts and, crucially, how quickly this could be detected. This year the focus is on detecting the early signs of self-harm from people's posts to social media (Task 1), and on whether it is possible to infer how depressed people are given such posts (Task 2) [12]. Below is an elaborated description of each task.

Task 1: Early Detection of Signs of Self-Harm. This first task consists of triggering alerts for users that present early signs of committing self-harm. A tagged set of users and their posts to Reddit (https://reddit.com/) groups was provided for training purposes. The different methods were benchmarked using a system that simulates a real-time scenario, introduced in [11]. The posts from the users of the test dataset are served in rounds, one post at a time (simulating their live posting to the Reddit groups). The task then is to provide a decision about each user given their posts, and to do so as early as possible (i.e., with the fewest posts).
For the evaluation, the correctness of the prediction (i.e., whether the user is at risk of self-harm or not) is not the only factor taken into account, but also the delay in emitting the alerts. Clearly, the sooner a person who is likely to self-harm is identified, the sooner an intervention can be provided.

Task 2: Measuring the Severity of the Signs of Depression. This task consists of automatically estimating the level of several symptoms associated with depression. For that, a questionnaire with 21 questions related to different feelings and aspects of well-being (e.g., sadness, pessimism, fatigue) is provided. Each question has between four and seven possible answers, which correspond to different levels of severity (or relevance) of the symptom or behaviour. A sample of users with their answers to the questionnaire and their writings on Reddit was given. To benchmark the different approaches, a new set of users and their writings is provided, for which every team has to predict the answers.

Thus, the goal of this paper is to explore the potential of a BERT-based classifier coupled with a novel scoring mechanism for the early detection of self-harm and depression.

This paper is structured as follows. In Section 2 we describe our general approach for both tasks, which uses BERT-based models for sentence classification. In Section 3 and Section 4 we explain how the classifiers were trained and applied for Task 1 and Task 2 respectively. Section 5 covers the analysis of our results, where our approach performs the best across a number of metrics for both tasks. Finally, in Section 6 we summarise the contributions of these working notes.

2 Approach

A breakthrough in the use of machine learning for Natural Language Processing (NLP) appeared with the generative pre-training of language models on a diverse corpus of unlabelled text, such as ELMo [15], BERT [4], OpenAI GPT [16], XLM [6], and RoBERTa [7]. Such a technique demonstrated large gains on a variety of NLP tasks (e.g., sequence or token classification, question answering, semantic similarity assessment, document classification). In particular, BERT (Bidirectional Encoder Representations from Transformers) [4, 3], the model by Google AI, proved to be one of the most powerful tools for text classification [13, 14, 5].

BERT is based on the Transformer architecture [18] and was trained jointly for masked word prediction and next sentence prediction. As input, BERT takes two concatenated segments of text which are delimited with special tokens and whose combined length must not exceed a defined maximum. The model was pre-trained on a huge dataset of unlabelled text. It is typically used within a text classifier for sentence tokenisation and text representation. A standard BERT classifier is presented in Figure 1, where a sentence is tokenised, represented as embeddings and then classified. The outputs are normalised between 0 and 1 using the softmax function, representing the probability that the input sentence belongs to a certain class (e.g., the probability that the sentence was written by a self-harmer).

Fig. 1. BERT-based classification architecture: the input sentence (e.g., "The power to regenerate after hurting myself") is tokenised, encoded by BERT into embeddings and passed through a softmax classification layer, producing the class probabilities (e.g., 80% positive, self-harmer).
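For illustration, the classification step of Figure 1 can be sketched with the Hugging Face Transformers library (on which our implementation builds, as described below). This is a minimal sketch rather than our exact code: the checkpoint name and the binary label layout are placeholders for a model fine-tuned on task-specific data.

```python
# Minimal sketch: scoring one post with an XLM-RoBERTa classification head,
# analogous to the pipeline in Figure 1. "xlm-roberta-base" is a placeholder;
# in practice a checkpoint fine-tuned for the task would be loaded instead.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "xlm-roberta-base"  # placeholder for a fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR, num_labels=2)
model.eval()

def classify(sentence: str) -> float:
    """Return the softmax probability of the positive (self-harm) class."""
    inputs = tokenizer(sentence, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(classify("The power to regenerate after hurting myself"))
```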
As for RoBERTa [7] (a replication study of BERT pre-training by Facebook AI), it shares a similar architecture with BERT but follows a different pre-training approach: RoBERTa was trained on over ten times more data, the next sentence prediction objective was removed, and the masked word prediction task was improved by applying a dynamic masking pattern to the training data. In a further attempt to improve the language model, Facebook AI presented XLM-RoBERTa [2], which pre-trains multilingual language models and led to significant performance gains in text classification.

For our participation in the eRisk challenges of 2020, a variety of pre-trained language models were tested: BERT, DistilBERT, RoBERTa, and XLM-RoBERTa, among others. However, the best performance on our training data was achieved with XLM-RoBERTa. In our work, we used Ernie (https://github.com/labteral/ernie/), a Python library for sentence classification built on top of Hugging Face Transformers (https://github.com/huggingface/transformers/), the main library that implements state-of-the-art general-purpose Transformer-based architectures.

Most of the pre-trained language models, including XLM-RoBERTa, have a maximum input length of 512 tokens. In our work, we experimented with input sizes of between 32 and 128 tokens due to GPU memory restrictions. The best results were achieved with an input size of 128 tokens. Note that Reddit posts are usually shorter than 128 tokens; therefore, using an input size larger than 128 would not substantially increase performance, but it would significantly increase the required computational resources. In the few cases where the Reddit posts were longer, we split them based on punctuation marks in an attempt to preserve the context of the users' writings. When training the classifiers, the weights of the pre-trained base models (e.g., XLM-RoBERTa) are updated, in addition to the classification head.

For our participation in the eRisk challenges of 2020, both Task 1 and Task 2, we used the previously explained approach for sentence classification. However, in each task the training schedule and training data were varied and tailored to fit the task scenario, as explained in the following sections.

3 Task 1 - Early Risk Detection of Self-Harm

We trained a number of different language models based on the original BERT architecture with a classification head to predict whether or not a sentence was written by a subject that self-harms. Those models are the basis for predicting whether a user is likely to self-harm and thus for triggering an alert, given a stream of texts. All of our final models were based on XLM-RoBERTa, which demonstrated better performance for this task.

3.1 Data

To train our models, we avoided using the training dataset provided by the eRisk organisers for two reasons. First, in the early stages of our experimentation, we found that the results obtained with our BERT-based approach on it were not promising enough to beat the approaches used in 2019. Second, the training dataset matches the test data of the eRisk 2019 task; leaving it out of the training stage allowed us to compare our results with those obtained by last year's participants in our search for models with greater performance.

The data collected and used for training our models were obtained from the Pushshift Reddit Dataset [1] through its public API (https://pushshift.io/api-parameters/), which exposes a constantly updated and almost complete repository of all the public Reddit data.
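As an illustration of this collection step, the following sketch pages backwards through a subreddit with the public Pushshift API. It is not our exact crawler; the endpoint and parameter names follow the public Pushshift documentation at the time and should be treated as assumptions.

```python
# Illustrative sketch: paging backwards through r/selfharm submissions via the
# public Pushshift API and collecting the set of author names. Endpoint and
# parameters ("subreddit", "size", "sort", "before") follow the Pushshift docs
# and may have changed since.
import time
import requests

API = "https://api.pushshift.io/reddit/search/submission/"

def fetch_subreddit(subreddit: str, batch: int = 100):
    """Yield submissions from newest to oldest until the API returns no more."""
    before = None
    while True:
        params = {"subreddit": subreddit, "size": batch, "sort": "desc"}
        if before is not None:
            params["before"] = before
        rows = requests.get(API, params=params, timeout=30).json().get("data", [])
        if not rows:
            break
        for row in rows:
            yield row
        before = rows[-1]["created_utc"]  # continue from the oldest item seen
        time.sleep(1)                     # stay well under the API rate limit

# Walking the full history of a subreddit can take a long time; this is a sketch.
authors = {row["author"] for row in fetch_subreddit("selfharm")
           if row.get("author") not in (None, "[deleted]")}
```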
We downloaded all the available submissions and comments written in the most popular subreddit about self-harm (r/selfharm). From those posts, we extracted 42,839 authors. In addition, we collected all of the posts made by those authors in any other subreddit (the selfharm-users-texts dataset). Then, we obtained an equivalent number of random users, from which we also extracted all their posts (the random-users-texts dataset).

We filtered the obtained datasets in several ways. First, we checked that there were no user collisions between the two collections. After identifying some of the main self-harm-related subreddits (r/selfharm, r/Cutters, r/MadeOfStyrofoam, r/SelfHarmScars, r/StopSelfHarm, r/CPTSD and r/SuicideWatch), we removed from random-users-texts the users having at least one post in any of them. All the users with more than 5,000 submissions were removed, since those with an extremely high number of posts seem more likely to be bots; besides, the vast majority of the users had posted far fewer times, so we presumed we would have a better chance of profiling the average user below that threshold. We also pruned the least active users, those with fewer than 50 submissions.

The number of sentences was expanded by splitting the users' texts that were too long for the parameters used in our models; otherwise, the sentences would have been truncated during training, potentially losing valuable information. We split the long posts into groups of contiguous sentences of approximately the maximum length in tokens used in our models, following the punctuation mark hierarchy (e.g., prioritising splits on full stops over commas). As mentioned before, a maximum length of 128 tokens was set so that the models could be fine-tuned on commercial GPUs.

We created several datasets, mainly derived from selfharm-users-texts and random-users-texts, for training our model candidates. These datasets are presented in Table 1 and explained below:

– A manually created dataset:
  • real-selfharmers-texts: This dataset was created with the aim of obtaining a bigger but similar dataset to the one provided by the eRisk organisers. We manually tagged 354 users as real self-harmers from the users of the selfharm-users-texts dataset. Then, we kept the last 1,000 submissions and comments of every user. We also pruned each writing sequence just before the user's first writing in r/selfharm. After that, we kept the users with at least 10 writings remaining, ending up with a total of 120 real self-harmers. For the negative class, we took a sample of random users from the random-users-texts dataset in the same proportion as in the provided training data: ∼7.3 random users per self-harmer.

– Datasets automatically generated from selfharm-users-texts and random-users-texts after removing the users from real-selfharmers-texts. In Figure 2, we show the distribution of posts per user for the original datasets (selfharm-users-texts and random-users-texts) and the derived ones used to train the final classifiers:
  • users-texts-200k: This dataset was generated by randomly sampling 200K writings from both selfharm-users-texts (as self-harmers) and random-users-texts (as non-self-harmers), with 100K from each dataset. Note that we experimented, by replicating last year's task, with different sampling sizes such as 2K, 20K, 100K, 300K, 400K and 500K writings, but the best results were achieved with a sampling size of 200K writings.
  • users-texts-2m: This dataset is a variant of users-texts-200k: a balanced dataset with ten times more sentences, totalling 2M writings. Note that, during our experimentation replicating last year's task, using a training set larger than 200K did not improve the results, except for the ERDE5 metric with the 2M writings.
  • users-submissions-200k: This dataset was generated following a similar procedure to users-texts-200k, with 200K randomly sampled writings, but avoiding comments and therefore sampling users' submissions exclusively.

Table 1. Some statistics of the datasets used to train the classifiers.

Dataset                  Class      Users   Subreddits   Sentences  Years
real-selfharmers-texts   selfharm     120        1,346       8,943  2013-2020
                         random       875        5,585      87,260  2009-2020
users-texts-200k         selfharm   9,487        9,797     107,277  2006-2020
                         random    14,280        9,793     107,152  2006-2020
users-texts-2m           selfharm  10,454       26,931   1,075,476  2006-2020
                         random    17,548       26,409   1,076,707  2005-2020
users-submissions-200k   selfharm  10,319       13,681     131,233  2006-2020
                         random    15,937       14,913     128,064  2005-2020

Fig. 2. Distribution of the number of posts per user in the datasets selfharm-users-texts and random-users-texts and in the datasets derived from them.

3.2 Method

For our participation in Task 1 of eRisk, we trained three models for binary sentence classification, all of them based on the XLM-RoBERTa-base language model (since it behaved better than other variants we tried, such as BERT, DistilBERT, XLNet, etc.):

– xlmrb-selfharm-200k, trained with the dataset users-texts-200k.
– xlmrb-selfharm-2m, trained with the dataset users-texts-2m.
– xlmrb-selfharm-sub-200k, trained with the dataset users-submissions-200k.

For those models we set a maximum length of 128 tokens per sentence, a learning rate of 2e-5 and a validation split of 20%.

In order to predict whether or not a user is at risk of self-harm, we averaged the predicted probability over the known writings of every user. We omitted the prediction of sentences with fewer than 10 tokens, as we concluded that the performance on shorter sentences is poor. Since the provided training set was the test set of last year's task, we used it to compare the performance of our models with that of the previous year's participants. We defined several parameters to determine whether the system should trigger an alert given the list of a user's known texts: the minimum average probability threshold (θ), the minimum number of texts necessary to trigger an alert, and the maximum number of texts that the system will take into account to make its decision on a subject. Given a growing list of texts from a user, the system triggers an alert if the average probability of the known texts for that user is greater than or equal to θ, and the number of known texts is greater than or equal to the minimum and lower than or equal to the maximum.

The parameters were adjusted in five variants by finding their optimal values for F1 and the eRisk-related metrics (latency-weighted F1, ERDE5 and ERDE50) on the real-selfharmers-texts dataset. For example, in Figure 3 it can be observed that the best value for latency-weighted F1 with any θ is obtained when waiting for at least 10-12 texts for xlmrb-selfharm-200k. We chose the model with the best performance for each target metric. The selected parameters for each variant can be observed in Table 2, and the results obtained with the real-selfharmers-texts dataset are shown in Table 3.
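The alert rule just described can be summarised with the following minimal sketch (an assumed simplification of our submission code). The per-post probabilities are assumed to come from the sentence classifier, with posts shorter than 10 tokens already filtered out; parameter names mirror Table 2.

```python
# Minimal sketch of the alert rule described above. `probs` holds the
# positive-class probabilities of a user's writings seen so far (oldest first),
# with posts under 10 tokens filtered out upstream.
from typing import List

def should_alert(probs: List[float], theta: float, min_posts: int, max_posts: int) -> bool:
    """Trigger an alert when the average probability reaches theta and the
    number of known writings lies within the [min_posts, max_posts] window."""
    n = len(probs)
    if n < min_posts or n > max_posts:
        return False
    return sum(probs) / n >= theta

# Example with the Run 0 parameters from Table 2:
# should_alert(user_probs, theta=0.75, min_posts=10, max_posts=50)
```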
After choosing the parameters with the real-selfharmers-texts dataset, we tested the classifiers on last year's test data for the same task, as shown in Table 4, where we compare the results obtained with those of the best performer of 2019 for that task, UNSL. That team obtained the best results in 2019 for precision, F1, ERDE5, ERDE50 and latency-weighted F1. With the classifiers used in our submission, we improved on their results for F1, ERDE5, ERDE50 and latency-weighted F1.

Table 2. Combinations of models and parameters for the five submitted runs.

Run  Model                    Target metric        θ     Min. posts  Max. posts
0    xlmrb-selfharm-200k      latency-weighted F1  0.75      10          50
1    xlmrb-selfharm-2m        latency-weighted F1  0.76      10          50
2    xlmrb-selfharm-2m        ERDE5                0.69       2           5
3    xlmrb-selfharm-sub-200k  ERDE50               0.64      45          45
4    xlmrb-selfharm-200k      F1                   0.68     100         100

Fig. 3. Latency-weighted F1 when varying the minimum number of texts (1-15) required to trigger an alert, for the model xlmrb-selfharm-200k on the real-selfharmers-texts dataset. A maximum of 50 posts is taken into account.

Table 3. Results obtained by our five final variants with the real-selfharmers-texts dataset when using the optimal parameters.

Run  P      R      F1     ERDE5  ERDE50  LatencyTP  Speed  Latency-weighted F1
0    0.646  0.608  0.627  0.125  0.052       10     0.965  0.605
1    0.736  0.533  0.618  0.123  0.059       10     0.965  0.597
2    0.401  0.708  0.517  0.057  0.050        2     0.996  0.515
3    0.350  0.825  0.491  0.143  0.044       45     0.830  0.408
4    0.720  0.600  0.655  0.124  0.124      100     0.632  0.414

Table 4. Results obtained by our five final variants with the 2019 dataset, compared to the results obtained by UNSL.

Team       Run  P     R     F1    ERDE5  ERDE50  LatencyTP  Speed  Latency-weighted F1
UNSL 2019  0    0.71  0.41  0.52  0.090  0.073        2     1.00   0.52
UNSL 2019  4    0.31  0.88  0.46  0.082  0.049        3     0.99   0.45
iLab       0    0.68  0.66  0.67  0.125  0.046       10     0.97   0.64
iLab       1    0.69  0.59  0.63  0.124  0.054       10     0.97   0.61
iLab       2    0.33  0.71  0.45  0.062  0.057        2     1.00   0.44
iLab       3    0.34  0.83  0.48  0.144  0.045       45     0.83   0.40
iLab       4    0.68  0.66  0.67  0.125  0.125      100     0.63   0.42

4 Task 2

4.1 Data

For our participation in Task 2 of eRisk, we used the training dataset provided by the task's organisers. Both the training and test datasets consist of Reddit posts written by users who have answered the questionnaire. The training dataset includes a total of 10,941 posts by 20 users, and the test dataset includes 35,562 posts by 70 users.

An approach analogous to the one employed for Task 1, with random posts from users connected solely by a common subreddit, was not possible this time. Therefore, and due to the small dataset available for training (only 20 different users), we used the full provided training dataset to train the classifiers. For each question of the questionnaire, we modified the training dataset by assigning the same class to all the texts posted by a given user (i.e., each class matches one of the available answers). Thus, we obtained a different training set for each question of the questionnaire and, therefore, a different multi-class classifier per question.

4.2 Method

For this task, we applied a similar method to the one employed in Task 1, but we treated the problem as a multi-class labelling problem. We created three variants, differing only in the base language model and the pre-processing of the training data, as can be observed in Table 5. For runs 1 and 2, we expanded the training data by splitting texts longer than 128 tokens in the same way as in Task 1.
However, for Run 3, sentences longer than 128 tokens were truncated during the training phase.

Table 5. Base language models and training set variants used for Task 2.

Run  Base LM           Strategy
1    XLM-RoBERTa-base  split
2    RoBERTa-base      split
3    RoBERTa-base      truncate

For each variant, we fine-tuned the base language model with a multi-class classification head for every question. As shown in Table 6, we balanced the class weights of every question model for all the variants. The RoBERTa-based classifiers were trained for 4 epochs, whereas we ran 5 epochs for the XLM-RoBERTa-based ones; those numbers of epochs were found to be optimal in all the models we created during our experimentation for Task 1. We set the maximum sentence length to 128 tokens and the learning rate to 2e-5 to train all the models, and assigned 20% of the training data for validation.

Table 6. Class weights for each question used to train the classifiers in all the variants (columns correspond to the possible answers).

Question      0   1 (1a)   2 (1b)   3 (2a)       2b      3a      3b
1         1.000    1.079   14.399    0.000        -       -       -
2         1.000    1.291    9.003    1.935        -       -       -
3         1.000    2.151    1.956    4.001        -       -       -
4         2.660    1.000    4.523  247.375        -       -       -
5         1.000    2.332  220.406    2.751        -       -       -
6         1.000   93.442    9.658   62.294        -       -       -
7         1.000    2.630    2.820    2.707        -       -       -
8         1.000    1.084    6.410    5.449        -       -       -
9         1.000    1.223    5.250    0.000        -       -       -
10        1.000   10.119    3.634   30.020        -       -       -
11        1.000    1.981    1.548    3.308        -       -       -
12        1.000    1.332   41.777    2.451        -       -       -
13        1.000   10.514   10.041    5.239        -       -       -
14        1.000    3.386    3.300    6.204        -       -       -
15        1.622    3.072    1.000    5.106        -       -       -
16        1.000    9.361    7.254    2.280    1.439   0.000   1.261
17        1.076    1.000    2.042    0.000        -       -       -
18        1.000    3.427    1.584   18.572  189.781  20.243  18.021
19        1.000    2.531    1.102   20.576        -       -       -
20        1.594    1.000    6.826    4.764        -       -       -
21        1.170    1.000    1.790    3.227        -       -       -

For a given user and variant, we predict the questionnaire answers in the following way: given a question and its associated classifier, we obtain the softmax prediction vector for every text written by that user and sum them. The class with the highest accumulated value is the answer we predict for that question. As in Task 1, during prediction, if an input text is longer than 128 tokens, we split it and average the predictions of the chunks.
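The per-question answer aggregation described above can be sketched as follows. This is illustrative rather than our exact code; `predict_proba` is a placeholder for one of the fine-tuned per-question classifiers, returning a softmax vector over that question's possible answers.

```python
# Sketch of the answer aggregation described above: sum the softmax vectors
# over all of a user's texts and select the answer with the highest
# accumulated probability mass.
import numpy as np

def predict_answer(user_texts, predict_proba):
    """Return the index of the predicted answer for one question and one user."""
    totals = None
    for text in user_texts:
        probs = np.asarray(predict_proba(text))   # softmax vector for this text
        totals = probs if totals is None else totals + probs
    return int(np.argmax(totals))                  # answer with the highest total
```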
5 Results

Table 7 shows the performance of our runs for Task 1, while Table 8 shows the performance of our runs for Task 2. In each table, the best scores among all the participants are marked with an asterisk. Runs from other teams have also been included to show the best performing runs for each task on each metric.

For Task 1, the evaluation metrics used were [11]:

– The standard classification measures precision (P), recall (R) and F1, computed with respect to the positive class, since these are the only cases that trigger alerts.
– ERDE (Early Risk Detection Error) [8], an error measure that introduces a penalty for late correct alerts (true positives), depending on the number of user writings seen before the alert. Two cut-offs on the number of user writings are considered in this challenge: 5 and 50 (ERDE5 and ERDE50). Contrary to the other metrics, the lower the value of ERDE, the better the performance of the system.
– LatencyTP measures the delay in detecting true positives, defined as the median number of writings used to detect the positive cases.
– Speed is the system's overall speed factor: it is equal to 1 for a system whose true positives are detected right at the first writing, and close to 0 for a slow system which detects true positives only after hundreds of writings.
– Latency-weighted F1 [17] is equal to F1 · speed; a perfect system gets a latency-weighted F1 of 1.

For Task 2, the following metrics were used [11]:

– AHR (Average Hit Rate) is the average of the Hit Rate (HR) across all users, where HR is the ratio of cases in which the automated questionnaire has exactly the same answer as the actual questionnaire.
– ACR (Average Closeness Rate) is the average of the Closeness Rate (CR) across all users, where CR is equal to (mad - ad)/mad, mad being the maximum absolute difference (the number of possible answers minus one) and ad the absolute difference between the real and the automated answer.
– ADODL (Average DODL) is the average of the Difference between Overall Depression Levels (DODL) across all users. DODL computes the overall depression level (the sum of all the answers) for the real and the automated questionnaire and then the absolute difference (ad_overall) between the two scores. DODL is normalised into [0, 1] as follows: DODL = (63 - ad_overall)/63.
– DCHR (Depression Category Hit Rate) computes the fraction of cases in which the automated questionnaire leads to a depression category (out of 4: non-existence, mild, moderate and severe) equivalent to the depression category obtained from the real questionnaire.

Table 7. The performance of each run we submitted on Task 1: Early Detection of Signs of Self-Harm. Note that for each metric marked with an asterisk our run gave the highest performance among all participants.

Run  P       R      F1      ERDE5   ERDE50  LatencyTP  Speed  Latency-weighted F1
0    0.833   0.577  0.682   0.252   0.111       10     0.965  0.658*
1    0.913*  0.404  0.560   0.248   0.149       10     0.965  0.540
2    0.544   0.654  0.594   0.134*  0.118        2     0.996  0.592
3    0.564   0.885  0.689   0.287   0.071*      45     0.830  0.572
4    0.828   0.692  0.754*  0.255   0.255      100     0.632  0.476

Table 8. The performance of each run we submitted on Task 2: Measuring the Severity of the Signs of Depression, along with the runs from other teams that scored higher.

Team          Run  AHR      ACR      ADODL    DCHR
BioInfo@UAVR  0    38.30%*  69.21%   76.01%   30.00%
prhlt-upv     0    34.01%   67.07%   80.05%   35.71%*
prhlt-upv     1    34.56%   67.44%   80.63%   35.71%*
RELAI         0    36.39%   68.32%   83.15%*  34.29%
iLab          0    36.73%   68.68%   81.07%   27.14%
iLab          1    37.07%   69.41%*  81.70%   27.14%
iLab          2    35.99%   69.14%   82.93%   34.29%

For Task 1, our team's performance on each of the key metrics was the best compared to the other teams this year. Given our training schedule, which tried to maximise the performance of each run for a specific metric, we can see that no single run was the best across all the metrics; rather, there is a trade-off between them. For example, Run 1 obtains a precision score of 0.913, but has the lowest recall, while Run 4 obtains the highest F1, but not the best precision or recall. Of most interest is the performance on the eRisk-specific metrics, where our runs notably obtained the best results. With Run 0 we obtained a latency-weighted F1 of 0.66, where the second-best result was obtained by the team UNSL with their run 1 at 0.61. For ERDE5, Run 2 scored 0.134, whereas the second-best team was again UNSL with their run 1 at 0.172 (lower is better). For ERDE50, our Run 3 obtained a score of 0.071, whereas all the other runs ranged between 0.11 and 0.25.

For Task 2, our team's performance was the best for ACR and competitive for the other metrics. For AHR, ADODL and DCHR, our performances were within 1-2% of the best performances submitted.
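For reference, the per-user Task 2 closeness metrics can be computed roughly as follows. This is a sketch based only on the definitions given above; the official evaluation script may aggregate the per-question closeness or handle the sub-answers of questions 16 and 18 slightly differently.

```python
# Sketch of the per-user Task 2 closeness metrics, following the definitions
# above. `real` and `pred` are the 21 integer answer values of the actual and
# automated questionnaires; `n_options[i]` is the number of possible answers
# for question i, so mad = n_options[i] - 1.
def closeness_rate(real, pred, n_options):
    """Per-user CR, averaged over the 21 questions (assumed aggregation)."""
    crs = []
    for r, p, n in zip(real, pred, n_options):
        mad = n - 1                              # maximum absolute difference
        crs.append((mad - abs(r - p)) / mad)     # CR = (mad - ad) / mad
    return sum(crs) / len(crs)

def dodl(real, pred):
    """Normalised agreement between overall depression levels."""
    ad_overall = abs(sum(real) - sum(pred))
    return (63 - ad_overall) / 63                # DODL = (63 - ad_overall) / 63
```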
Interestingly, while the ADODL scores were around 81-83%, this did not translate into a better classification of the depression category, as measured by DCHR, which was 34% at best. This disparity may be due to how we employed the BERT-based classifiers (i.e., we built separate models to predict the answer to each question). It may be more appropriate to jointly predict the answers to all the questions and the final depression category: the questions are likely to have highly correlated answers, and information useful for inferring the answer to one question may help to infer the others when taken together.

6 Summary

In this paper we have described how we employed a BERT-based classifier for the tasks of the CLEF eRisk Track: Task 1, early risk detection of self-harm, and Task 2, inferring answers to a depression survey. Our results on both tasks indicated that this approach works very well and obtains very good performance (the best on Task 1 and very competitive performance on Task 2). These results are perhaps not too surprising, given the impact that BERT-based models have been making in improving many other tasks. However, a key difference in this work is how we trained the models. In future work, we will explore and compare different training schedules and classifier extensions for these tasks, but also for other related tasks (e.g., classifying whether someone is likely to suffer from anorexia or depression).

Acknowledgements

The first author would like to thank the following funding bodies for their support: FEDER / Ministerio de Ciencia, Innovación y Universidades, Agencia Estatal de Investigación / Project (RTI2018-093336-B-C21), Consellería de Educación, Universidade e Formación Profesional, and the European Regional Development Fund (ERDF) (accreditation 2019-2022 ED431G-2019/04, ED431C 2018/29, ED431C 2018/19). The second and third authors would like to thank the UKRI's EPSRC Project Cumulative Revelations in Personal Data (Grant Number: EP/R033897/1) for its support. We would also like to thank David Losada for arranging this collaboration.

References

1. Baumgartner, J., Zannettou, S., Keegan, B., Squire, M., Blackburn, J.: The Pushshift Reddit dataset. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 14, pp. 830–839 (2020)
2. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., Stoyanov, V.: Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019)
3. Devlin, J., Chang, M.W.: Open sourcing BERT: State-of-the-art pre-training for natural language processing. Google AI Blog, November 2 (2018)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Gao, Z., Feng, A., Song, X., Wu, X.: Target-dependent sentiment classification with BERT. IEEE Access 7, 154290–154299 (2019)
6. Lample, G., Conneau, A.: Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291 (2019)
7. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
8. Losada, D.E., Crestani, F.: A test collection for research on depression and language use. In: International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 28–39. Springer (2016)
9. Losada, D.E., Crestani, F., Parapar, J.: CLEF 2017 eRisk overview: Early risk prediction on the Internet: Experimental foundations. CEUR Workshop Proceedings 1866 (2017)
10. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2018: Early risk prediction on the Internet (extended lab overview). CEUR Workshop Proceedings 2125 (2018)
11. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2019: Early risk prediction on the Internet. Lecture Notes in Computer Science 11696, 340–357 (2019)
12. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2020: Early risk prediction on the Internet. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020) (2020)
13. Nikolov, A., Radivchev, V.: Nikolov-Radivchev at SemEval-2019 Task 6: Offensive tweet classification with BERT and ensembles. In: Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 691–695 (2019)
14. Parikh, P., Abburi, H., Badjatiya, P., Krishnan, R., Chhaya, N., Gupta, M., Varma, V.: Multi-label categorization of accounts of sexism using a neural framework. arXiv preprint arXiv:1910.04602 (2019)
15. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
16. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018)
17. Sadeque, F., Xu, D., Bethard, S.: Measuring the latency of depression detection in social media. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 495–503 (2018)
18. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)