<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Early Risk Detection of Self-Harm and Depression Severity using BERT-based Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rodrigo Martínez-Castaño</string-name>
          <email>rodrigo.martinez@usc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amal Htait</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leif Azzopardi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashar Moshfeghi</string-name>
          <email>yashar.moshfeghi@strath.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS)</institution>
          ,
          <addr-line>Universidade de Santiago de Compostela</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer and Information Sciences, University of Strathclyde</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper briefly describes our research groups' efforts in tackling Task 1 (Early Detection of Signs of Self-Harm) and Task 2 (Measuring the Severity of the Signs of Depression) from the CLEF eRisk Track. Core to how we approached these problems was the use of BERT-based classifiers which were trained specifically for each task. Our results on both tasks indicate that this approach delivers high performance across a series of measures, particularly for Task 1, where our submissions obtained the best performance for precision, F1, latency-weighted F1 and ERDE at 5 and 50. This work suggests that BERT-based classifiers, when trained appropriately, can accurately infer which social media users are at risk of self-harming, with precision up to 91.3% for Task 1. Given these promising results, it will be interesting to further refine the training regime, classifier and early detection scoring mechanism, as well as apply the same approach to other related tasks (e.g., anorexia, depression, suicide).</p>
      </abstract>
      <kwd-group>
        <kwd>Self-Harm</kwd>
        <kwd>Early Detection</kwd>
        <kwd>BERT</kwd>
        <kwd>Depression</kwd>
        <kwd>Classification</kwd>
        <kwd>XLM-RoBERTa</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The eRisk CLEF track aims to explore the development of methods for early risk
detection on the Internet, their evaluation, and the application of such methods
for improving the health and well-being of individuals [8-11]. Early detection
technologies can be employed in different areas, particularly those related to
health and safety. For instance, in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the authors examined whether it was possible to
identify the grooming activities of paedophiles given their posts to online forums, while
in [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] they explored whether it was possible to detect users who were
depressed or anorexic from their posts, and crucially how quickly this could be
detected. This year the focus is on detecting early signs of self-harm from
people's posts to social media (Task 1), and on whether it is possible to infer how
depressed people are given such posts (Task 2) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Below is an elaborated
description of each task.
      </p>
      <p>
        Task 1: Early Detection of Signs of Self-Harm. This first task consists
of triggering alerts for users that present early signs of self-harm. A
tagged set of users and their posts to Reddit (https://reddit.com/) groups was provided for training
purposes. The different methods were benchmarked using a system that
simulates the real-time scenario introduced in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The posts from the users of the test
dataset are served in rounds, one post at a time (simulating their live posting to
the Reddit groups). The task then is to provide a decision about each user given
their posts, and to do so as early as possible (i.e., with the fewest posts). For
the evaluation, the correctness of the prediction (i.e., whether the user will
self-harm or not) is not the only factor taken into account, but also the delay
in emitting the alerts. Clearly, the sooner a person who is likely to self-harm
is identified, the sooner an intervention can be provided.
      </p>
      <p>Task 2: Measuring the Severity of the Signs of Depression. This
task consists of automatically estimating the level of several symptoms
associated with depression. For that, a questionnaire with 21 questions related to
different feelings and aspects of well-being (e.g., sadness, pessimism, fatigue) is provided.
Each question has between four and seven possible answers which correspond to
different levels of severity (or relevance) of the symptom or behaviour. A sample
of users with their answers to the questionnaire and their writings on Reddit
was given. To benchmark the different approaches, a new set of users and their
writings is provided, for which every team has to predict the answers.</p>
      <p>Thus, the goal of this paper is to explore the potential of a BERT-based
classifier coupled with a novel scoring mechanism for the early detection of
self-harm and depression. This paper is structured as follows. In Section 2 we describe
our general approach for both tasks, which uses BERT-based models for sentence
classification. In Sections 3 and 4 we explain how the classifiers were
trained and applied for Task 1 and Task 2 respectively. Section 5 covers the
analysis of our results, where our approach performs the best across a number
of metrics for both tasks. Finally, in Section 6 we summarise the contributions
of these working notes.</p>
    </sec>
    <sec id="sec-2">
      <title>3 https://reddit.com/</title>
      <sec id="sec-2-1">
        <title>Approach</title>
        <p>
          A breakthrough in the use of machine learning for Natural Language Processing
(NLP) came with the generative pre-training of language models on a diverse
corpus of unlabelled text, as in ELMo [15], BERT [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], OpenAI GPT [16],
XLM [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and RoBERTa [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This technique demonstrated large gains on a
variety of NLP tasks (e.g., sequence or token classification, question answering,
semantic similarity assessment, document classification). In particular, BERT
(Bidirectional Encoder Representations from Transformers) [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ], the model by
Google AI, proved to be one of the most powerful tools for text classification [
          <xref ref-type="bibr" rid="ref13 ref14 ref5">5, 13, 14</xref>
          ]. BERT is based on the Transformer architecture [18] and was trained on
masked word prediction and next sentence prediction simultaneously. As
input, BERT takes two concatenated segments of text which are delimited with
special tokens and whose length respects a defined maximum. The model was
pre-trained on a huge dataset of unlabelled text. It is typically used within a text
classifier for sentence tokenisation and text representation. A standard BERT
classifier is presented in Figure 1, where a sentence is tokenised, represented as
embeddings and then classified. The results are normalised between 0 and 1
using the softmax function, representing the probability that the input sentence
belongs to a certain class (e.g., the probability that the sentence was written by a
self-harmer).
        </p>
        <p>Fig. 1. BERT-based Classification Architecture: the input sentence ("The power to regenerate after hurting myself") is tokenised by the BERT tokeniser, encoded as BERT embeddings ([CLS] ... [SEP]), passed through the classification head, and normalised with softmax (output: 80% positive, self-harmer).</p>
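        <p>As a minimal sketch of the pipeline in Figure 1, the following code uses the Hugging Face Transformers API directly rather than the Ernie wrapper employed in this work; the base checkpoint carries no fine-tuned task weights, so the printed probability is only illustrative.</p>
        <preformat>
# Sketch of the Figure 1 pipeline: tokenise, encode, classify, softmax.
# Uses Hugging Face Transformers directly (not the Ernie wrapper used
# in this work); "xlm-roberta-base" carries no fine-tuned task weights.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)   # binary: self-harmer vs. not

text = "The power to regenerate after hurting myself"
# Tokenisation adds the special boundary tokens and truncates to 128.
inputs = tokenizer(text, truncation=True, max_length=128,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)   # normalised class probabilities
print(f"P(self-harmer) = {probs[0, 1]:.2f}")
        </preformat>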
        <p>
          As for RoBERTa [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (a replication study of BERT pre-training by Facebook
AI), it shares a similar architecture with BERT but follows a different pre-training
approach. RoBERTa was trained on over ten times more data, the next sentence
prediction objective was removed, and the masked word prediction task was
improved with the introduction of a dynamic masking pattern applied to the
training data.
        </p>
        <p>
          In another attempt to improve the language model, Facebook AI presented
XLM-RoBERTa [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], pre-training multilingual language models. This
improvement led to significant performance gains in text classification. For
our participation at the eRisk challenges of 2020, a variety of pre-trained
language models were tested: BERT, DistilBERT, RoBERTa, and XLM-RoBERTa,
among others. However, the best performance was achieved when using
XLM-RoBERTa on our training data. In our work, we used Ernie (https://github.com/labteral/ernie/), a Python library
for sentence classification built on top of Hugging Face Transformers (https://github.com/huggingface/transformers/), the main
library that implements state-of-the-art general-purpose Transformer-based
architectures.
        </p>
        <p>Most pre-trained language models, including XLM-RoBERTa, have a
maximum input length of 512 tokens. In our work, we experimented with input
sentences of between 32 and 128 tokens due to GPU memory restrictions.
The best results were achieved with an input size of 128 tokens. Note that Reddit
posts are usually shorter than 128 tokens. Therefore, using an input size larger
than 128 would not substantially increase performance, but it would significantly
increase the required computational resources. In the few cases where the Reddit
posts were longer, we split them based on punctuation marks in an attempt
to respect the context of the users' writings. When training the
classifiers, the weights of the pre-trained base models (e.g., XLM-RoBERTa) are
updated, in addition to the classification head.</p>
        <p>For our participation at the eRisk challenges of 2020, in both Task 1 and Task 2,
we used the previously explained approach for sentence classification. However,
in each task, the training schedule and training data were varied and
tailored to fit the task scenario, as explained in the following sections.</p>
      <sec id="sec-2-2">
        <title>Task 1 - Early Risk Detection of Self-Harm</title>
        <p>We trained a number of different language models based on the original BERT
architecture with a classification head to predict whether or not a sentence was written
by a subject who self-harms. These models are the basis for predicting whether a user
is likely to self-harm, and thus for triggering an alert, given a stream of texts. All
of our final models were based on XLM-RoBERTa, which demonstrated the best
performance for this task.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4  https://github.com/labteral/ernie/</title>
    </sec>
    <sec id="sec-4">
      <title>Data</title>
      <p>To train our models, we avoided using the training dataset provided by the eRisk
organisers, for two reasons. First, at the beginning of our experimentation,
we found that the results obtained with our BERT-based approach were not
promising enough to beat the existing approaches used in 2019. Second, the
training dataset matches the test data of the eRisk 2019 task. Excluding it
from the training stage allowed us to compare our results with those obtained
by last year's participants in our search for models with greater performance.</p>
      <p>
        The data collected and used for training our models were obtained from the
Pushshift Reddit Dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] through its public API (https://pushshift.io/api-parameters/), which exposes a
constantly updated and almost complete dataset of all the public Reddit
data. We downloaded all the available submissions and comments written to
the most popular subreddit about self-harm (r/selfharm). From those posts, we
extracted 42,839 authors. In addition, we collected all of the posts in any other
subreddit by those authors (the selfharm-users-texts dataset). Then, we
obtained an equivalent number of random users from which we also extracted all
their posts (the random-users-texts dataset). We filtered the obtained datasets in
several ways. First, we checked that there were no user collisions between the
two collections. After identifying some of the main self-harm related subreddits
(r/selfharm, r/Cutters, r/MadeOfStyrofoam, r/SelfHarmScars, r/StopSelfHarm,
r/CPTSD and r/SuicideWatch), we removed from
random-users-texts the users having at least one post in any of them. All the users with more than
5,000 submissions were removed, since those with an extremely high number of
posts seem more likely to be bots. Moreover, the vast majority of the users had
posted fewer times, so we expected to better profile the average
user below that threshold. We also pruned the least active users, those with under 50
submissions. The number of sentences was expanded by splitting the users' texts
that were too long for the parameters we utilised in our models. Otherwise,
the sentences would be truncated during training, potentially losing valuable
information. We split the large posts into groups of contiguous sentences of
approximately the maximum length in tokens utilised in our models, following
the punctuation marks hierarchy (e.g., prioritising splits on full stops over
commas). As mentioned before, a maximum length of 128 tokens was set so that the
models could be fine-tuned on commercial GPUs.
      </p>
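      <p>The splitting procedure can be sketched as follows, assuming a caller-supplied token-counting function (e.g., lambda s: len(tokenizer.encode(s))); the exact separator hierarchy beyond full stops and commas is our assumption.</p>
      <preformat>
# Sketch of splitting a long post into chunks of at most max_tokens,
# preferring breaks at full stops over commas as described above.
# count_tokens is caller-supplied, e.g. lambda s: len(tokenizer.encode(s)).
def split_post(text, count_tokens, max_tokens=128):
    if not count_tokens(text) > max_tokens:
        return [text]
    for sep in (". ", ", "):              # punctuation hierarchy
        parts = text.split(sep)
        if len(parts) > 1:
            # Re-attach the separator so no characters are lost.
            pieces = [p + sep.strip() for p in parts[:-1]] + [parts[-1]]
            chunks, current = [], ""
            for piece in pieces:
                candidate = (current + " " + piece).strip()
                if current and count_tokens(candidate) > max_tokens:
                    chunks.append(current)   # close the current chunk
                    current = piece
                else:
                    current = candidate
            if current:
                chunks.append(current)
            # Recurse on any chunk that is still too long.
            return [c for chunk in chunks
                    for c in split_post(chunk, count_tokens, max_tokens)]
    return [text]  # no separator available; truncated downstream
      </preformat>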
      <p>We created several datasets, mainly derived from selfharm-users-texts
and random-users-texts, for training our model candidates. These datasets
are presented in Table 1 and explained below:
- A manually created dataset:
  real-selfharmers-texts: This dataset was created with the aim of
  obtaining a bigger but similar dataset to the one provided by the eRisk
  organisers. We manually tagged 354 users as real self-harmers from the
  users of the selfharm-users-texts dataset. Then, we kept the last
  1,000 submissions and comments for every user. We also pruned each
  writing sequence just before the user's first post in r/selfharm. After that,
  we kept the users with at least 10 writings remaining, ending up with
  a total of 120 real self-harmers. For the negative class, we took a sample
  of random users from the dataset random-users-texts in the same
  proportion as in the provided training data: 7.3 random users per
  self-harmer.
- Datasets automatically generated from selfharm-users-texts and
  random-users-texts after removing the users from real-selfharmers-texts.
  In Figure 2 we show the distribution of posts per user for the original
  datasets (selfharm-users-texts and random-users-texts) and the
  derived ones utilised to train the final classifiers:
  users-texts-200k: This dataset was generated by randomly sampling
  200K writings from both selfharm-users-texts (as self-harmers) and
  random-users-texts (as non self-harmers), with 100K from each
  dataset. Note that we experimented by replicating last year's task with
  different sampling sizes such as 2K, 20K, 100K, 300K, 400K and 500K
  writings, but the best results were achieved with a sampling size of 200K
  writings.
  users-texts-2m: This dataset is a variant of users-texts-200k; a
  balanced dataset with ten times more sentences, totalling 2M writings.
  Note that, during our experimentation replicating last year's task, using
  a training set larger than 200K did not improve the results, except for
  the ERDE5 metric with the 2M writings.
  users-submissions-200k: This dataset was generated by a similar
  procedure to users-texts-200k, with 200K randomly sampled writings, but
  avoiding comments and therefore sampling users'
  submissions exclusively.</p>
      <sec id="sec-4-1">
        <title>Users Subreddits Sentences Dataset</title>
        <p>real-selfharmers-texts
users-texts-200k
users-texts-2m
users-submissions-200k
120
875</p>
        <p>For our participation in Task 1 of eRisk we trained three models for binary
sentence classification, all of them based on the XLM-RoBERTa-base language
model (since it behaved better than other variants we tried, such as BERT,
DistilBERT, XLNet, etc.):
- xlmrb-selfharm-200k, trained with the dataset users-texts-200k.
- xlmrb-selfharm-2m, trained with the dataset users-texts-2m.
- xlmrb-selfharm-sub-200k, trained with the dataset users-submissions-200k.</p>
        <p>For those models we established a maximum length of 128 tokens per
sentence, a learning rate of 2e-5 and a validation split of 20%.</p>
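        <p>A minimal sketch of this fine-tuning setup, using the Hugging Face Trainer rather than the Ernie wrapper actually employed; the batch size and the dataset objects (train_ds, val_ds) are placeholders, and model can be the classifier loaded in the earlier sketch.</p>
        <preformat>
# Sketch of the stated setup: 128-token inputs, learning rate 2e-5,
# 20% of the data held out for validation. Batch size is an assumption.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="xlmrb-selfharm-200k",
    learning_rate=2e-5,                  # as stated above
    num_train_epochs=5,                  # found optimal in our experiments
    per_device_train_batch_size=16,      # assumed; fits commercial GPUs
    evaluation_strategy="epoch",
)
trainer = Trainer(
    model=model,                 # XLM-RoBERTa classifier (earlier sketch)
    args=args,
    train_dataset=train_ds,      # 80% of the tokenised sentences
    eval_dataset=val_ds,         # 20% validation split
)
trainer.train()
        </preformat>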
        <p>In order to predict whether or not a user is at risk of self-harm, we averaged
the predicted probability over the known writings of that user. We omitted the
predictions for sentences with fewer than 10 tokens, as we concluded that the
performance on shorter sentences is poor. Since the provided training set was the test
set of last year's task, we used it to compare the performance of our models
with the participants of the previous year. We defined several parameters to
determine whether the system should trigger an alert given the list of a user's known texts:
the minimum average probability threshold, the minimum number of texts
necessary to trigger an alert, and the maximum number of texts that the system
will take into account when making its decisions on the subjects. Given a growing list
of texts from a user, the system will trigger an alert if the average probability of
the known texts for that user is greater than or equal to the threshold, and the number of known
texts is greater than or equal to the minimum and lower than or equal to the maximum.</p>
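        <p>This decision rule reduces to a few lines; the sketch below is a minimal rendering of it, with parameter names of our choosing.</p>
        <preformat>
# Minimal sketch of the alert rule described above: trigger when the
# mean per-text probability of the positive class reaches the threshold
# within the allowed window of seen texts. Parameter names are ours.
def should_alert(probs, threshold, min_posts, max_posts):
    n = len(probs)           # texts seen so far for this user
    if n >= min_posts and max_posts >= n:
        return sum(probs) / n >= threshold
    return False

# e.g., with the settings selected for Run 0 (Table 2):
# should_alert(user_probs, threshold=0.75, min_posts=10, max_posts=50)
        </preformat>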
        <p>The parameters were adjusted in five variants by finding their optimal values
for F1 and the eRisk-related metrics (latency-weighted F1, ERDE5 and ERDE50)
on the real-selfharmers-texts dataset. For example, in Figure 3 it can
be observed that the best value for latency-weighted F1 at any threshold is obtained
when waiting for at least 10-12 texts for xlmrb-selfharm-200k. We chose the
model with the best performance for each target metric. The selected parameters
for each variant can be observed in Table 2, and the results obtained on the
real-selfharmers-texts dataset are shown in Table 3.</p>
        <p>After choosing the parameters with the real-selfharmers-texts dataset,
we tested the classifiers on last year's test data for the same task, as shown
in Table 4, where we compare the obtained results with the best performer of
2019 for that task: UNSL. That team obtained the best results for precision, F1,
ERDE5, ERDE50 and latency-weighted F1. With the classifiers used in
our submission, we improved on their results for F1, ERDE5, ERDE50 and
latency-weighted F1.</p>
        <p>Table 2. Selected parameters per run:
Run 0 - xlmrb-selfharm-200k, target metric latency-weighted F1, threshold 0.75, min. 10 posts, max. 50 posts.
Run 1 - xlmrb-selfharm-2m, target metric latency-weighted F1, threshold 0.76, min. 10 posts, max. 50 posts.
Run 2 - xlmrb-selfharm-2m, target metric ERDE5, threshold 0.69, min. 2 posts, max. 5 posts.
Run 3 - xlmrb-selfharm-sub-200k, target metric ERDE50, threshold 0.64, min. 45 posts, max. 45 posts.
Run 4 - xlmrb-selfharm-200k, target metric F1, threshold 0.68, min. 100 posts, max. 100 posts.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Task 2 - Measuring the Severity of the Signs of Depression</title>
        <sec id="sec-4-2-1">
          <title>Data</title>
          <p>For our participation in Task 2 of eRisk, we used the training dataset provided
by the task's organisers. Both the training and test datasets consist of Reddit posts
written by users who have answered the questionnaire. The training dataset
includes a total of 10,941 posts by 20 users, and the test dataset includes 35,562
posts by 70 users.</p>
          <p>An approach analogous to the one employed for Task 1, with random posts
from users connected solely by a common subreddit, was not possible this time.
Therefore, and due to the small training dataset (only 20 different users),
we used the full provided training dataset to train the classifiers. For
each question of the questionnaire, we modified the training dataset by assigning
the same class to all the texts posted by a given user (i.e., each class matches
one of the available answers). Thus, we obtained a different training set for each
question of the questionnaire and, therefore, a different multi-class classifier.
For this task, we applied a similar method to the one employed in Task 1, but
we treated the problem as a multi-class labelling problem. We created three
variants, differing only in the base language model and the pre-processing of the
training data, as can be observed in Table 5. For Runs 1 and 2, we expanded
the training data by splitting texts larger than 128 tokens in the same way as in Task
1. However, for Run 3, sentences larger than 128 tokens were truncated during
the training phase.</p>
          <p>For each variant, we fine-tuned the base language model with a head for
multi-class classification for every question. As shown in Table 6, we balanced
the class weights of every question model for all the variants. The
RoBERTa-based classifiers were trained for 4 epochs, whereas we executed 5 epochs for the
XLM-RoBERTa-based ones. Those numbers of epochs were found to be optimal
in all the models we created during our experimentation for Task 1. We
set the maximum sentence length to 128 tokens and the learning rate to 2e-5
to train all the models. We assigned 20% of the training data for validation.</p>
          <p>For a given user and variant, we predict the questionnaire answers in the
following way: given a question and the associated classifier, we obtain the softmax
prediction vector for every text written by that user and sum them. The
class with the highest accumulated value is the answer we predict for that
question. As in Task 1, during prediction, if an input text is larger than 128
tokens, we split it and average the predictions of the chunks.</p>
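          <p>A minimal sketch of this aggregation, where predict_proba stands in for one fine-tuned per-question classifier returning a softmax vector with one entry per possible answer:</p>
          <preformat>
# Sum the softmax vectors over all of a user's texts; the answer with
# the highest accumulated value is the prediction for this question.
# predict_proba is a stand-in for one per-question classifier.
import numpy as np

def predict_answer(user_texts, predict_proba):
    total = sum(predict_proba(t) for t in user_texts)
    return int(np.argmax(total))
          </preformat>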
        </sec>
        <sec id="sec-4-2-2">
          <title>Evaluation Metrics</title>
          <p>
            For Task 1, the following metrics were used:
- The standard classification measures precision (P), recall (R) and F1
are computed with respect to the positive class, since those are the only cases
that trigger alerts.
- ERDE (Early Risk Detection Error) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] is an error measure that introduces
a penalty for late correct alerts (true positives) and depends on the number
of user writings seen before the alert. Two numbers of user writings are
taken into consideration in this challenge: 5 and 50. Contrary to the other
metrics, the lower the value of ERDE, the better the performance of the
system.
- Latency_TP measures the delay in detecting true positives, defined as the
median number of writings needed to detect the positive cases.
- Speed is the system's overall speed factor: it equals 1 for
a system whose true positives are detected right at the first writing, and
is close to 0 for a slow system, which detects true positives only after hundreds of
writings.
- Latency-weighted F1 [17] is equal to F1 × speed; a perfect system
gets a latency-weighted F1 of 1.
          </p>
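          <p>As a hedged sketch of these measures following the definitions in [8, 17]: the false-positive cost c_fp and the penalty growth rate p are left as parameters, since the organisers fix their exact values in the official evaluation.</p>
          <preformat>
# Per-user ERDE_o cost and latency-weighted F1, following [8, 17].
# c_fp and p are parameters; the official evaluation fixes their values.
import math
from statistics import median

def erde(is_positive, alerted, k, o, c_fp):
    """Cost of one user's decision after seeing k writings."""
    if alerted and not is_positive:
        return c_fp                              # false positive
    if is_positive and not alerted:
        return 1.0                               # false negative (c_fn = 1)
    if is_positive and alerted:                  # late true positive
        x = min(k - o, 50.0)                     # clamp to avoid overflow
        return 1.0 - 1.0 / (1.0 + math.exp(x))
    return 0.0                                   # true negative

def latency_weighted_f1(f1, tp_delays, p):
    """F1 scaled by the speed factor derived from true-positive delays."""
    def penalty(k):
        return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))
    speed = 1.0 - median(penalty(k) for k in tp_delays)
    return f1 * speed        # a first-writing detection gives speed = 1
          </preformat>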
          <p>
            For Task 2, the following metrics were used [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]:
- AHR (Average Hit Rate) is the average of the Hit Rate (HR) across all users,
where HR is the ratio of cases in which the automated questionnaire has exactly
the same answer as the actual questionnaire.
- ACR (Average Closeness Rate) is the average of the Closeness Rate (CR) across
all users, where CR is equal to (mad - ad)/mad; mad is the maximum
absolute difference, which is equal to the number of possible answers minus
one, and ad is the absolute difference between the real and the automated
answer.
- ADODL (Average DODL) is the average of the Difference between Overall
Depression Levels (DODL) across all users. DODL computes the overall
depression level (the sum of all the answers) for the real and automated
questionnaires; next, the absolute difference (ad_overall) between the real and
the automated score is computed. DODL is normalised into [0,1] as follows:
DODL = (63 - ad_overall)/63.
- DCHR (Depression Category Hit Rate) computes the fraction of cases
where the automated questionnaire led to a depression category (out of 4
categories: nonexistence, mild, moderate and severe) that is equivalent to
the depression category obtained from the real questionnaire.
          </p>
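          <p>A minimal sketch of CR and DODL exactly as defined above, with answers encoded as integers:</p>
          <preformat>
# Closeness Rate and DODL as defined above.
def closeness_rate(real, predicted, n_answers):
    mad = n_answers - 1                  # maximum absolute difference
    return (mad - abs(real - predicted)) / mad

def dodl(real_answers, predicted_answers):
    # Overall depression level: sum of the 21 answers, at most 63.
    ad_overall = abs(sum(real_answers) - sum(predicted_answers))
    return (63 - ad_overall) / 63
          </preformat>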
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Results</title>
        <p>The speed factor obtained by our Runs 0-4 was 0.965, 0.965, 0.996, 0.830 and 0.632, respectively.</p>
        <p>For Task 1, our team's performance on each of the key metrics was the best
compared to the other teams this year. Given our training schedule, which tried
to maximise the performance for each metric per run, we can see that no specific
run was the best across all the metrics; rather, there is a trade-off between
metrics. For example, Run 1 obtains a precision score of 0.913 but has the lowest
recall, while Run 4 obtains the highest F1 but not the best precision or recall.
Of most interest is the performance on the eRisk-specific metrics, where our runs
obtained notably the best results. With Run 0 we obtained a latency-weighted
F1 of 0.66, where the second-best result was obtained by the team UNSL with
their run 1 at 0.61. For ERDE5, our Run 2 scored 0.134, whereas the second-best
team was again UNSL with their run 1 at 0.172 (where lower is better). For
ERDE50, our Run 3 obtained a score of 0.071, whereas all the other runs ranged
between 0.11 and 0.25.</p>
        <p>For Task 2, our team's performance was the best for ACR, and competitive
on the other metrics. For AHR, ADODL and DCHR our performances were
within 1-2% of the best submitted. Interestingly, while the ADODL
scores were around 81-83%, this did not translate into a better classification
of depression category as measured by DCHR, which was 34% at best. This
disparity may be due to how we employed the BERT-based classifier (i.e., we
made separate models to predict the result of each question). However, it may
be more appropriate to jointly predict the results of all questions and the final
depression category. This is because the questions will have a high correlation
between answers, and information for inferring the answer to one question may
be useful in inferring others when taken together.</p>
        <sec id="sec-4-3-1">
          <title>Summary</title>
          <p>In this paper we have described how we employed a BERT-based classifier for
the tasks of the CLEF eRisk Track: Task 1, early risk detection of self-harm,
and Task 2, inferring answers to a depression survey. Our results on both tasks
indicated that this approach works very well and obtains very good performance
(the best on Task 1 and very competitive performance on Task 2). These results
are perhaps not too surprising, given the impact that BERT-based models have
been making in improving many other tasks. However, a key difference in this
work is how we trained the model. In future work, we will explore and compare
different training schedules and classifier extensions for these tasks, but also
for other related tasks (e.g., classifying whether someone is likely to suffer from
anorexia or depression).</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>Acknowledgements</title>
          <p>The first author would like to thank the following funding bodies for their
support: FEDER / Ministerio de Ciencia, Innovación y Universidades, Agencia
Estatal de Investigación / Project (RTI2018-093336-B-C21), Consellería de
Educación, Universidade e Formación Profesional and the European Regional
Development Fund (ERDF) (accreditation 2019-2022 ED431G-2019/04, ED431C
2018/29, ED431C 2018/19).</p>
          <p>The second and third authors would like to thank the UKRI's EPSRC Project
Cumulative Revelations in Personal Data (Grant Number: EP/R033897/1) for
their support. We would also like to thank David Losada for arranging this
collaboration.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>Web Search and Data Mining. pp. 495{503 (2018)</title>
        <p>18. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,</p>
      </sec>
      <sec id="sec-4-5">
        <title>L., Polosukhin, I.: Attention is all you need. In: Advances in neural information</title>
        <p>processing systems. pp. 5998{6008 (2017)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baumgartner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zannettou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keegan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Squire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blackburn</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The pushshift reddit dataset</article-title>
          .
          <source>In: Proceedings of the International AAAI Conference on Web and Social Media</source>
          . vol.
          <volume>14</volume>
          , pp.
          <volume>830</volume>
          -
          <issue>839</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khandelwal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudhary</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wenzek</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guzman</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          . arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>02116</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          :
          <article-title>Open sourcing bert: State-of-the-art pre-training for natural language processing</article-title>
          .
          <source>Google AI Blog, November</source>
          <volume>2</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Target-dependent sentiment classification with bert</article-title>
          .
          <source>IEEE Access 7</source>
          ,
          <issue>154290</issue>
          -
          <fpage>154299</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Cross-lingual language model pretraining</article-title>
          . arXiv preprint arXiv:
          <year>1901</year>
          .
          <volume>07291</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          . arXiv preprint arXiv:
          <year>1907</year>
          .
          <volume>11692</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A test collection for research on depression and language use</article-title>
          .
          <source>In: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          . pp.
          <volume>28</volume>
          -
          <fpage>39</fpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parapar</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>CLEF 2017 eRisk overview: Early Risk prediction on the internet: Experimental foundations</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          <year>1866</year>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parapar</surname>
          </string-name>
          , J.:
          <source>Overview of eRisk</source>
          <year>2018</year>
          :
          <article-title>Early Risk Prediction on the Internet (extended lab overview)</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          <volume>2125</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parapar</surname>
          </string-name>
          , J.:
          <source>Overview of eRisk 2019 Early Risk Prediction on the Internet. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11696 LNCS (September)</source>
          ,
          <volume>340</volume>
          -
          <fpage>357</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parapar</surname>
          </string-name>
          , J.:
          <source>Overview of eRisk</source>
          <year>2020</year>
          :
          <article-title>Early Risk Prediction on the Internet</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020)</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radivchev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Nikolov-radivchev at semeval-2019 task 6: Offensive tweet classification with bert and ensembles</article-title>
          .
          <source>In: Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          . pp.
          <volume>691</volume>
          -
          <issue>695</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abburi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badjatiya</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chhaya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
          </string-name>
          , V.
          <article-title>: Multi-label categorization of accounts of sexism using a neural framework</article-title>
          . arXiv preprint arXiv:
          <year>1910</year>
          .
          <volume>04602</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018)</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Sadeque, F., Xu, D., Bethard, S.: Measuring the latency of depression detection in social media. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. pp. 495-503 (2018)</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998-6008 (2017)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>