UPV-Symanto at eRisk 2021: Mental Health Author
Profiling for Early Risk Prediction on the Internet
Angelo Basile1 , Mara Chinea-Rios2 , Ana-Sabina Uban3,4 , Thomas Müller2 ,
Luise Rössler2 , Seren Yenikent1 , Berta Chulví3 , Paolo Rosso3 and
Marc Franco-Salvador2
1 Symanto Research, Nuremberg, Germany
2 Symanto Research, Valencia, Spain
3 PRHLT Research Center, Universitat Politècnica de València
4 Human Language Technologies Research Center, University of Bucharest


                                         Abstract
                                         This paper presents the contributions of the UPV-Symanto team, a collaboration between Symanto
                                         Research and the PRHLT Center, in the eRisk 2021 shared tasks on gambling addiction, self-harm
                                         detection and prediction of depression levels. We have used a variety of models and techniques,
                                         including Transformers, hierarchical attention networks with multiple linguistic features, a dedicated
                                         early alert decision mechanism, and temporal modelling of emotions. We trained the models using
                                         additional training data that we collected and annotated thanks to expert psychologists. Our
                                         emotions-over-time model obtained the best results for the depression severity task in terms of ACR
                                         (and second best according to ADODL). For the self-harm detection task, our Transformer-based model
                                         obtained the best absolute result in terms of ERDE5 and we ranked equal first in terms of speed and latency.

                                         Keywords
                                         risk detection, depression, self-harm, pathological gambling, social media, hierarchical networks, transformer




1. Introduction
The availability of user-generated texts on social media such as Reddit and Twitter makes it
possible to organize an early reaction to risks and threats as these are mentioned in conversations
between users. It has been shown that social media language data can be used for detecting
natural risks such as floods and earthquakes [1], predicting public health issues such as influenza
[2], and analyzing riots and protest events [3]. In this work, we focus on predicting individual
risk of mental disorder, within the context of our participation in the eRisk 2021 CLEF evaluation
campaign [4]. We participate in all three shared tasks proposed by the organizers: Early
Detection of Signs of Pathological Gambling (Task 1), Early Detection of Signs of Self-Harm

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" angelo.basile@symanto.com (A. Basile); ana.uban+acad@gmail.com (A. Uban); luise.roessler@symanto.com
(L. Rössler); seren.yenikent@symanto.com (S. Yenikent); prosso@dsic.upv.es (P. Rosso);
marc.franco@symanto.com (M. Franco-Salvador)
 0000-0002-3312-9359 (A. Basile); 0000-0002-2313-9633 (M. Chinea-Rios); 0000-0003-2197-3947 (A. Uban);
0000-0002-8360-4189 (T. Müller); 0000-0003-4834-5326 (S. Yenikent); 0000-0001-7946-6601 (M. Franco-Salvador)
(Task 2) and Measuring the Severity of Signs of Depression (Task 3). All three tasks are framed
as author profiling tasks, i.e., some personal characteristics of an author have to be inferred
from their writings.
   Task 1 For Task 1, we built a system for classifying Reddit users as potential pathological
gamblers based on their writings. This task was organized as an "only-test" task, with no
training data released by the organizers. The test data was collected from Reddit following the
procedure described in [5]. All the texts contained in the dataset are in English. Before the
system submission deadline, no other information about the test data was known. Since the task
focuses on early detection of signs of pathological gambling, the evaluation metrics take into
account the number of posts processed before providing a positive prediction for each user.
   Task 2 For this task, participants were asked to develop a system for predicting early signs
of self-harm. The task was framed as a binary classification task (self-harm, no self-harm). As
for Task 1, the data consists of Reddit comments in English, collected following [5]. For this
task, the organizers provided a training dataset with posts from 763 labelled users, of which 145
belong to the positive (i.e., self-harm) class. This task is evaluated in the same way as Task
1.
   Task 3 In contrast to both Task 1 and Task 2, Task 3 is not focused on early prediction,
but on estimating the severity of the users’ depression. As for Task 1 and Task 2, the data
source is Reddit and all the texts are in English; for each Reddit user in the dataset (90 in total),
the organizers collected their Reddit post history; furthermore, the organizers provided the
answers to a depression questionnaire as filled by each user included in the dataset. The goal of
Task 3 is to estimate each user’s responses to the questionnaire given their history of Reddit
comments.
   We approached the three tasks using a combination of neural models and manually engineered
features, developed by domain experts. We collected additional data from Reddit and hired
expert psychologists to annotate it. We obtained the best results in Task 2 and Task 3 according
to several key metrics.


2. Data
We train our models in a supervised fashion using all the data released by the organizers. In
addition to that, we augment the released training splits by collecting additional data from
Reddit.1 For Task 1, we build a training and development set from scratch, since no data was
released by the organizers for this task. We follow the strategy described in [5] and run a series
of queries looking for occurrences of the following strings in all of Reddit:

    • I was diagnosed with depression
    • [I am]|[I’m] a problem gambler
    • [I am]|[I’m] addicted to gambling

   We then collect all the comments and submissions from all the Reddit users who posted a
text matching the queries. Furthermore, we collect all the comments and submissions posted to
    1
        We use the PushShift API [6].
a manually compiled list of subreddits.2 We collected in total approximately 16 million texts.
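   As an illustration, the following sketch shows how such a crawl can be implemented against the PushShift API [6]. The endpoint and parameter names are the publicly documented ones, but the exact query phrases, pagination and error handling of our crawler are not reproduced here, so this should be read as an assumption-laden sketch rather than our exact code.

import requests

BASE = "https://api.pushshift.io/reddit/search"

def search_submissions(query, size=100):
    """Return submissions whose text matches `query` (a quoted phrase)."""
    resp = requests.get(f"{BASE}/submission/", params={"q": query, "size": size})
    resp.raise_for_status()
    return resp.json()["data"]

def author_comments(author, size=500):
    """Return up to `size` comments posted by `author`."""
    resp = requests.get(f"{BASE}/comment/", params={"author": author, "size": size})
    resp.raise_for_status()
    return resp.json()["data"]

# Seed users from one of the diagnosis phrases, then pull their post histories.
seeds = search_submissions('"I am addicted to gambling"')
histories = {s["author"]: author_comments(s["author"]) for s in seeds if s.get("author")}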
   Data Annotation Since we ran our initial queries against all of Reddit and since we collected
indiscriminately all the posts and comments posted to specific subreddits, we expected the data
to be noisy and to contain many false positives.3 Considering the large size of the collected
dataset, sampling a random portion of the data for annotation would probably have led to
a highly imbalanced label distribution. For these reasons, we adopted a "search as labelling"
approach: all the data was dumped into a PostgreSQL database [7] and indexed using trigrams,
allowing annotators to use their expertise for finding instances of the positive and negative class
using free text queries. Two psychologists were hired to annotate the collected data. Annotators
used the criteria provided in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5)
for the disorder targeted by each task: gambling disorder, non-suicidal self-injury, and major
depressive disorder [8]. To annotate instances for the control groups, we used two different sets
of labels. We used one (definitely-no-gambler and definitely-no-self-harm) for
instances whose authors can safely be classified as a part of the control group, e.g., requests for
help for someone other than the post author or news articles using a vocabulary that partially
overlaps with the one used in the target group (e.g., articles on financial investment risk): when
such an instance is found, we automatically labeled all the text from its author with the same
label. We believe these instances to be the most challenging for the models. Another set of labels
(maybe-no-gambler and maybe-no-self-harm) is used to annotate those instances that do
not belong to the positive class, but nothing can be safely inferred about the author: in this
case, we don’t label all the texts from the same author as belonging to the negative class. Table
1 shows the label distribution in the annotated dataset. The most common feedback from the
annotators involved the comorbidity issue. Accordingly, in all three tasks, respective disorders
were observed to be co-occurring with signals from other types of disorders. For instance, in
the depression task, it was observed that signals for anxiety disorders, post-traumatic stress
disorder, and self-harm would commonly co-occur. While alcohol and drug addiction were the
major comorbid conditions in the gambling task, self-harm was accompanied by depressive and
anxiety signals. In some cases, the signals from the co-occurring conditions might have been
stronger in text and led to misclassifications. Although this may be considered as a hindering
effect for the training of the models, it in fact showcases the real-time conditions. Comorbidity
of disorders is a common issue in mental health conditions [9]. Thus, this observation in our
dataset is quite expected, and suggests that the models developed for such tasks should reflect
this scenario.
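   The "search as labelling" setup described above can be sketched as follows on a hypothetical posts(id, author, body, label) table: a pg_trgm GIN index makes the annotators' free-text substring queries fast, and a control-group label can be propagated to all texts of an author judged to be definitely negative. Table, column and connection names are illustrative only.

import psycopg2

conn = psycopg2.connect("dbname=erisk")  # hypothetical connection string
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
cur.execute("CREATE INDEX IF NOT EXISTS posts_body_trgm ON posts USING gin (body gin_trgm_ops);")

# An annotator's free-text query; matching posts are then labelled in the tool.
cur.execute("SELECT id, author, body FROM posts WHERE body ILIKE %s LIMIT 50;",
            ("%lost my savings on slots%",))
candidates = cur.fetchall()

# Propagate a control-group label to all texts of an author judged definitely negative.
cur.execute("UPDATE posts SET label = %s WHERE author = %s;",
            ("definitely-no-gambler", "some_author"))
conn.commit()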


3. Task 1: Early Detection of Signs of Pathological Gambling
We approach Task 1 in two steps. First, we train a text classifier on each text of each user
independently, propagating the users’ labels (e.g., pathological gambler or not pathological gambler)
to their writings, assuming that the information contained in a single post can potentially be

    2
      Each subreddit is a forum dedicated to a particular topic. The complete list of crawled subreddits can be found
in Appendix A.
    3
      For example, we noticed that many submissions on depression-related subreddits are from people asking for
help for their loved ones instead of being ill themselves.
Table 1
Overview of the label distribution in the in-house annotated dataset. For Task 2, these data were merged
with the official dataset released by the organizers.
                                   Task         Label            # posts    # users
                                   Task 1       gambler             3143        722
                                                no gambler          1655        178
                                   Task 2       self-harm            104         28
                                                no self-harm         209         18


enough for classifying its author. Second, we build an alert-emitting system which computes
the probability of a user being ill based on the averaged probabilities assigned to each processed
post. To develop the models we use our manually annotated corpus.

3.1. Models
3.1.1. Transformer Model with Alert-Emitting System
For modelling Task 1, we train a Transformer-based text classifier using a pretrained, small
English uncased BERT model [10, 11].4 We report the hyper-parameter settings in Appendix B.1.
   To build the alert-emitting system, we followed the work of the participants that obtained
the best results in the 2020 edition of this shared task [12]. The system emits a risk alert if the
average probability of the positive class is higher than a certain value 𝜃, having considered a
number of user posts between a minimum 𝜓 and a maximum 𝛿. We find the
best values for 𝜃, 𝜓, and 𝛿 according to 5 key metrics using a black-box optimization approach
based on a Gaussian process.5 We tuned an early alert decision maker for each of these metrics:
F1-score, latency-weighted F1-score, ERDE5 , ERDE50 , and an equally-weighted combination
of all of them. Table 2 highlights the results. The models differ only with respect to 𝜃, 𝜓, and 𝛿.
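   The following sketch illustrates one possible reading of this decision rule and of its tuning with a Gaussian-process optimizer (gp_minimize from scikit-optimize). The objective shown (F1 on a development set) stands in for whichever metric a given run optimizes, and all variable names are illustrative.

import numpy as np
from sklearn.metrics import f1_score
from skopt import gp_minimize
from skopt.space import Integer, Real

def decide(post_probs, theta, psi, delta):
    """Emit an alert (1) as soon as the running mean of the positive-class
    probabilities exceeds theta, after seeing at least psi posts; stop and
    predict negative (0) once delta posts have been processed."""
    for k in range(1, min(len(post_probs), delta) + 1):
        if k >= psi and np.mean(post_probs[:k]) > theta:
            return 1, k
    return 0, min(len(post_probs), delta)

def make_objective(dev_probs, dev_labels):
    def objective(params):
        theta, psi, delta = params
        preds = [decide(p, theta, psi, delta)[0] for p in dev_probs]
        return -f1_score(dev_labels, preds)  # gp_minimize minimizes
    return objective

space = [Real(0.0, 1.0, name="theta"),
         Integer(1, 10, name="psi"),
         Integer(1, 100, name="delta")]
# dev_probs: per-user arrays of per-post P(positive); dev_labels: gold user labels
# result = gp_minimize(make_objective(dev_probs, dev_labels), space, n_calls=50)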

Table 2
Development results for Task 1, with the automatically optimized hyper-parameters for the alert-
emitting system: minimum number of posts (𝜓), maximum number of posts (𝛿), minimum threshold
for emitting a prediction (𝜃). The bold values in each column denote the optimized metric.
 model             𝜓    𝛿      𝜃            P           R      F1     latencyF1    ERDE5         ERDE50    norm avg
 UPV-Symanto 0     1    5     0.20        88.32%    96.56% 92.25%       92.61%           3.86%     3.51%    94.38%
 UPV-Symanto 1     1   50     0.21        88.49%    96.31% 92.24%      92.59%            3.96%     3.61%    94.32%
 UPV-Symanto 2     1   89     0.38        43.78%    79.28% 56.41%       56.63%         13.59%     11.03%    72.10%
 UPV-Symanto 3     1    5     0.25        37.26%    88.29% 52.41%       52.61%          11.72%   11.47%     70.46%
 UPV-Symanto 4     1   50     0.35        42.01%    82.88% 55.76%       55.98%          13.61%    10.76%   71.84%




    4
      Specifically, we use the bert_en_uncased_L-2_H-128_A-2 model available from the https://tfhub.dev model
repository.
    5
      We use the implementation available in scikit-optimize.
3.2. Evaluation and Results
The official evaluation setup of Task 1 is composed of two sets of metrics: one for a decision-based
evaluation and one for a ranking-based evaluation. The decision-based evaluation provides an
estimation of the models’ performance at classifying at-risk users, while the ranking-based evaluation
is used to assess how well a model sorts users by their level of risk. For the decision-
based evaluation, the standard classification metrics are used, i.e., precision (P), recall (R) and
F-measure (F1), together with a set of metrics which take into account the time required to
emit an alert. We measure classification performance considering the number of posts processed
before emitting a correct prediction for each user (using ERDE5 and ERDE50 , which penalize
correct alerts emitted after more than 5 and 50 posts, respectively) and considering the number
of writings required by a system for finding true positive instances (using latencyTP and
latency-weighted F1). Table 3 shows the official results as computed by the organizers on the
test set. A detailed description of the metrics can be found in [5] and [13]. The ranking-based
evaluation is conducted using the 𝑃 @𝑁 and 𝑁 𝐷𝐶𝐺@𝑁 metrics.6
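   For reference, the sketch below re-implements ERDE as we understand its definition in [5]: delayed correct alerts incur a sigmoid cost lc_o(k) = 1 - 1/(1 + e^(k-o)), false negatives cost 1, and false positives cost the proportion of positive users in the collection. This is a simplified reconstruction, not the official evaluation script.

import numpy as np

def erde(decisions, delays, truths, o):
    """decisions/truths: 1 = at risk, 0 = control; delays: posts seen before each decision."""
    truths = np.asarray(truths)
    c_fp = truths.mean()  # cost of a false positive (share of positive users)
    costs = []
    for d, k, t in zip(decisions, delays, truths):
        if d == 1 and t == 0:
            costs.append(c_fp)
        elif d == 0 and t == 1:
            costs.append(1.0)
        elif d == 1 and t == 1:
            costs.append(1.0 - 1.0 / (1.0 + np.exp(k - o)))  # penalty grows with delay k
        else:
            costs.append(0.0)
    return float(np.mean(costs))

# erde(decisions, delays, gold, o=5) and erde(decisions, delays, gold, o=50)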

Table 3
Decision-based evaluation for Task 1. For comparison, we include the runs that obtained the best results
in each metric, as reported in [4].
                     run     P      R     F1     ERDE5 ERDE50 latencyTP speed latencyF1
        UPV-Symanto 0       .422   .415   .077     .088        .087          1           1        .077
        UPV-Symanto 1       .420   .457   .074     .097        .091          1           1        .074
        UPV-Symanto 2       .030   .238   .053     .093        .091          1           1        .053
        UPV-Symanto 3       .035   .409   .064     .098        .097          1           1        .064
        UPV-Symanto 4       .028   .256   .051     .098        .095          1           1        .051
         UNSL 2 (Best) .586 .939 .721              .073       .020           11        .961       .693
         RELAI 0 (Best)     .138   .988   .243     .048       .036            1          1        .243




4. Task 2: Early Detection of Signs of Self-Harm
We experimented with two types of models for approaching Task 2. The first model mirrors
our work for Task 1, using a Transformer model to classify each post individually and then
predicting a label for a user based on the probabilities assigned to their writings. A second
model consists of a hierarchical LSTM-based architecture with attention (HAN) using a set of
hand-crafted features. For both types of models, at inference time we use the alert-emitting
system described in Section 3.1.1. Runs UPV-Symanto 0, 2 and 3 are based on HAN, while runs
UPV-Symanto 1 and 4 are based on a Transformer.




    6
     We don’t report here on the official ranking-based evaluation for Task 1, since due to a bug our model always
predicts the negative class and thus all the metrics are equal to 0.
4.1. Models
4.1.1. Transformer Model
The Transformer-based architectures that we use for modelling Task 2 are the same that we
used for Task 1, with the same hyper-parameters described in Appendix B.1.

4.1.2. Hierarchical Attention Network with Composite Features
In Task 2, we used a Hierarchical Attention Network (HAN)[14] with multiple linguistic features.
Here we describe the features used, the experimental setup and the network architecture.
     Content features. We include a general representation of text content by transforming
each text into word sequences. Preprocessing of texts includes lowercasing and tokenizing,
removing punctuation and numbers; function words are not excluded. The 20,000 most frequent
words were selected to form the vocabulary, and words not in the vocabulary were represented
as a special "unknown" token. When passed as input to the neural networks, words within a
sequence were encoded as embeddings of dimension 300. In order to initialize the weights of
the embedding layers, we started from pre-trained GloVe embeddings [15]7 .
   Style features. We aim at representing the stylistic level of texts through including function
word and pronoun features. Function words have traditionally been used as stylistic markers,
whereas increased use of pronouns, especially first person pronouns, has been shown to correlate
with mental disorder risk [16]. We include two separate stylistic features: firstly, we extract
from each text a numerical vector representing function word frequencies as a bag-of-words.
We complement these with features extracted from the LIWC lexicon [17], including pronoun
usage and other syntactical features, as described below.
   LIWC features. The LIWC [17]8 is a lexicon mapping words in the English vocabulary to
lexico-syntactic features of different kinds. It has been widely used in computational studies for
analysing how suffering from mental disorders manifests in authors’ writings. LIWC categories
have the capacity to capture different levels of language, including style (through syntactic
categories), emotions (through affect categories) and topics (through content-oriented categories
such as words referring to cognitive or analytical processes, or words referring to topics such
as money, health or religion). We use LIWC 2015 and include in our analysis all 64 categories in
the lexicon, and represent them as numerical vectors by computing for each category the ratio
of words in a text that are related to the category, according to the lexicon.
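   The minimal sketch below illustrates this ratio computation on a toy lexicon with hypothetical entries; the actual features are obtained analogously from the LIWC 2015 categories, and the same computation is applied to the NRC emotion categories described next.

from collections import Counter

lexicon = {
    "sadness": {"sad", "cry", "hopeless"},   # toy categories, not the real LIWC entries
    "i": {"i", "me", "my", "mine"},
}

def lexicon_features(tokens, lexicon):
    """For each category, the fraction of tokens that belong to that category."""
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return {cat: sum(counts[w] for w in words) / total for cat, words in lexicon.items()}

print(lexicon_features("i feel so sad and hopeless".lower().split(), lexicon))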
   Emotions and sentiment features. We dedicate a few features to represent emotional
content in our texts, since the emotional state of a user is known to be highly correlated with their
mental health. Several of the LIWC categories aim to capture sentiment polarity and emotion
content (negative emotion, positive emotion, affect, sadness, anxiety). We additionally include
a second lexicon, the NRC emotion lexicon [18], which is dedicated exclusively to emotion
representation, containing 8 different emotion categories, as well as the 2 sentiment categories:
anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust.9 We represent
NRC features similarly to LIWC features, by computing for each category the proportion of
words in the text which are associated with that category.

    7
      http://nlp.stanford.edu/data/glove.840B.300d.zip
    8
      http://www.liwc.net/
    9
      Time limitations at the inference stage prevented us from using the more sophisticated sentiment and emotion
model available through Symanto’s Text Analysis API (https://www.symanto.com/api/), which we intend to explore
in future work.

Figure 1: Hierarchical attention network architecture.
   Experimental setup. We trained our models on the eRisk 2020 [19] training data for
the self-harm task. For one of the models (UPV-Symanto 3), we used transfer learning by
first pretraining all the model’s parameters on eRisk data annotated for anorexia [20], and
subsequently training them on the self-harm data.
   During the training phase, we do not consider social media posts individually as datapoints,
since they are too short to be sufficiently predictive. Instead, we generate our datapoints by
grouping sequences of 𝑐 chronologically consecutive posts into larger chunks, to obtain more
consistent samples of text as our datapoints. Features are computed at chunk-level. We use
different values for 𝑐: UPV-Symanto 0 uses chunks of 80 posts, while UPV-Symanto 2 uses 10
posts per chunk, and UPV-Symanto 3 uses 50 posts per chunk.
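   The chunking step itself is simple and can be sketched as follows, where c is the number of consecutive posts per chunk (80, 10 and 50 for the three HAN runs); variable names are illustrative.

def make_chunks(posts, label, c=80):
    """posts: a user's posts sorted chronologically; returns (chunk, user label) pairs."""
    return [(posts[i:i + c], label) for i in range(0, len(posts), c)]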
   Hierarchical Attention Network. Hierarchical attention networks were introduced in [14]
where they were used for review classification, by representing a text as a hierarchical structure
where a document is comprised of sentences and a sentence is comprised of words. We propose
that social media data in our setup is well suited to such a hierarchical representation; in our
case the hierarchy consists of user post histories, which are composed of social media posts,
which are in turn composed of word sequences. Especially since the evolution of the mental
state of a user is in itself a relevant indicator for the development of a disorder, as shown in [21],
user-level representations are expected to be natural and useful for modelling this problem.
   In the hierarchical setup, the posts within a chunk (datapoint) are stacked to form a hierarchical
structure: the word sequences (truncated at 256 words), as well as the numerical and bag-of-words
features extracted for each post in the group, are stacked into bi-dimensional vectors. The
hierarchical network is composed
of two components: a post-level encoder, which produces a representation of a post, and a
user-level encoder, which generates a representation of a user’s post history. Each of the posts
in the input datapoint is encoded with the post-level encoder, and then they are stacked to form
a bi-dimensional representation, which is then concatenated with the other features, and passed
to the user-level encoder. We choose to model the user-level encoder as an LSTM layer with
attention, with 32 units. The output of the user encoder is connected to the output layer which
generates the final prediction. A depiction of the hierarchical architecture is shown in Figure 1.
   We use batch normalization and L2 regularization. Binary cross-entropy is used as a loss
function. More details on the network’s configuration are found in the Appendix.
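   A condensed Keras sketch of this architecture is shown below, using the hyper-parameters of Appendix B.2 where available (128-unit post-encoder LSTM, 32-unit user-encoder LSTM with attention, 20-unit dense projection of the lexicon features). The attention formulation, the omission of the function-word bag-of-words branch and the lexicon feature dimensionality (64 LIWC + 10 NRC categories) are simplifying assumptions, not the exact implementation.

import tensorflow as tf
from tensorflow.keras import layers, Model, regularizers

MAX_POSTS, MAX_WORDS, VOCAB, EMB_DIM, LEX_DIM = 80, 256, 20000, 300, 74

# Post-level encoder: word ids -> (GloVe-initialised) embeddings -> LSTM -> post vector.
post_in = layers.Input(shape=(MAX_WORDS,), dtype="int32")
emb = layers.Embedding(VOCAB + 2, EMB_DIM)(post_in)  # +2: padding and "unknown" tokens
post_vec = layers.LSTM(128)(emb)
post_encoder = Model(post_in, post_vec, name="post_encoder")

# User-level encoder: encode every post in a chunk, append lexicon features,
# then run an attention-weighted LSTM over the post sequence.
chunk_in = layers.Input(shape=(MAX_POSTS, MAX_WORDS), dtype="int32")
lex_in = layers.Input(shape=(MAX_POSTS, LEX_DIM))
posts = layers.TimeDistributed(post_encoder)(chunk_in)
lex = layers.TimeDistributed(layers.Dense(20, activation="relu"))(lex_in)
h = layers.LSTM(32, return_sequences=True,
                kernel_regularizer=regularizers.l2(1e-5))(layers.Concatenate()([posts, lex]))
attn = layers.Softmax(axis=1)(layers.Dense(1, activation="tanh")(h))   # per-post attention weights
user_vec = layers.BatchNormalization()(tf.reduce_sum(h * attn, axis=1))
out = layers.Dense(1, activation="sigmoid")(layers.Dropout(0.3)(user_vec))

model = Model([chunk_in, lex_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")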

Table 4
Development results for Task 2, with the automatically optimized hyper-parameters for the alert-
emitting system: minimum number of posts (𝜓), maximum number of posts (𝛿), minimum threshold for
emitting a prediction (𝜃). The bold values in each column denote the optimized metric. H: Hierarchical
Attention Network; T: Transformer model.
      model            𝜓     𝛿          𝜃           P           R          F1     latencyF1   ERDE5    ERDE50     norm avg
 H    UPV-Symanto 0    5    46        0.40         66.30%      60.60% 63.3%        62.60%     17.90%    11.60%      74.10%
 T    UPV-Symanto 1    1     5        0.05         53.70%      62.50% 57.80%       58.0%      12.80%    12.50%      72.63%
 H    UPV-Symanto 2    1    50        0.05         61.70%      63.50% 62.60%       62.60%     14.0%     11.40%      74.93%
 H    UPV-Symanto 3    4   100        3            53.90%      73.10% 62.00%       61.60%     17.70%    10.5%       73.86%
 T    UPV-Symanto 4    3   100        0.10         44.60%      75.00% 55.90%       55.70%     16.00%    12.30%     70.848%




4.2. Evaluation and Results
The evaluation of task 2 mirrors exactly the setup of task 1. Table 4 reports the development
results and Table 5 highlights the official results on the test set as released by the organizers.

Table 5
Official test results for the decision-based evaluation for Task 2.
       run                        P          R          F1      ERDE5           ERDE50    latencyTP    speed     latencyF1
  H    UPV-Symanto 0             .307       .678        .422        .097         .051         5         1.0        .416
  T    UPV-Symanto 1             .276       .638        .385        .059         .056         1        .996        .385
  H    UPV-Symanto 2             .313       .645        .422        .072         .053         2        .984        .420
  H    UPV-Symanto 3             .301       .770        .433        .089         .044         5        .992        .426
  T    UPV-Symanto 4             .198       .711        .310        .082         .063         3        .961        .307
       UNSL 4 (Best)             .532       .763     .627           .064         .038         3        .992        .622
       Birmingham 2 (Best)       .757       .349     .477           .085          .07         4        .988        .472
       CeDRI 2 (Best)            .105        1.0      .19           .096         .094         1         1.0         .19




5. Task 3: Measuring the Severity of Signs of Depression
Task 3 consists of filling in a questionnaire with 21 questions related to the user’s mental state,
based on the user’s Reddit post history.
Table 6
Ranking-based results for Task 2, computed using 1 and 100 posts per user.
              # writings             run          P@10    NDCG@10       NDCG@100
                           H   UPV-Symanto 0       0.8        0.83          0.53
                           T   UPV-Symanto 1       0.8        0.88           0.5
                  1        H   UPV-Symanto 2       0.8        0.82          0.55
                           H   UPV-Symanto 3       0.6         0.7          0.51
                           T   UPV-Symanto 4       0.9        0.93          0.53
                           H   UPV-Symanto 0       0.9        0.94          0.67
                           T   UPV-Symanto 1       0.8        0.69          0.64
                 100       H   UPV-Symanto 2       0.8        0.83          0.59
                           H   UPV-Symanto 3       0.9        0.94          0.69
                           T   UPV-Symanto 4       0.9        0.81          0.65



5.1. Models
5.1.1. Emotions over Time Model
As one of our models, we chose an approach based on the evolution of emotions and certain
psycho-linguistic features over time. Unlike other models used in this task, this approach models
users not by extracting static features from their writings, but instead as time series describing
their communication style related to emotions and self-expression over time.
   Features. We use, to this effect, some of the features introduced in Section 4.1, namely: the
10 emotion categories in the NRC lexicon, and in addition 3 categories of the LIWC lexicon
related to self-reference: I (usage of first person singular pronoun), we (usage of first person
plural pronoun) and ppron (overall usage of personal pronouns). We compute scores for each of
these categories for each post in the dataset, in a similar way to the feature extraction step for
Task 2: for a given text and feature (lexicon category), we compute the number of words in the
text corresponding to that category, normalized by the text length.
   In order to obtain time series for each of the considered features, we compute the scores
for a given user aggregated at the day level (computed over all texts posted in one day by a
given user). In this way, we allow a fair comparison between users who have different habits in
terms of frequency of posting, but who might nevertheless exhibit similar patterns in terms
of emotion evolution over time. We also apply a rolling average of 100 days over the obtained
scores, so as to reduce noise.
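   A sketch of this preprocessing with pandas is shown below; it assumes a per-user DataFrame with a datetime "date" column and one column per lexicon feature (e.g., the NRC sadness ratio), which is an assumption of this illustration.

import pandas as pd

def user_time_series(posts, feature_cols, window=100):
    """posts: DataFrame with a datetime 'date' column and per-post feature scores."""
    daily = posts.set_index("date")[feature_cols].resample("D").mean()  # day-level aggregation
    return daily.rolling(window, min_periods=1).mean()                  # 100-day smoothing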
   User Similarity over Time. In order to obtain predictions for a given user, we use the
computed time series to define a similarity metric between users, and then predict answers to
the questionnaire by imitating the answers of similar users in the training set. The similarity
metric between users includes two factors: the static scores for the extracted features for the
two users, as well as the correlations over time for the two users.
   1. Static scores. We compute the average score for each of the considered features for a
       given user across all their writings (as a score between 0 and 1). The distance between two
       users is computed as the arithmetic difference between their scores (if 𝑑
       is the distance between two users, their similarity is then 1 − 𝑑).
    2. Correlations over time. These are computed between the time series of feature scores
       corresponding to two given users. Since the time series for any two users are not guaran-
       teed to have the same length, we attempt to "align" the two time series by finding the
       maximum correlation between them. We use a sliding window of the length of the smaller
       of the time series, and compute correlation scores for all possible alignments between the
       two time series, then take the maximum correlation score as the similarity between the
       two time series.
The final similarity score between two users is computed as the sum between the static and
temporal component, both factors contributing with equal weight.
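   The sketch below illustrates the two components for a single feature; for several features, the resulting scores can be averaged. The use of Pearson correlation and the handling of equal-length series are assumptions of this illustration rather than details taken from our implementation.

import numpy as np

def static_similarity(a, b):
    """One minus the difference of the users' average feature scores (scores lie in [0, 1])."""
    return 1.0 - abs(float(np.mean(a)) - float(np.mean(b)))

def max_sliding_correlation(a, b):
    """Slide the shorter series over the longer one and keep the best correlation."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    w = len(short)
    corrs = [np.corrcoef(short, long_[i:i + w])[0, 1] for i in range(len(long_) - w + 1)]
    return max(corrs) if corrs else 0.0

def user_similarity(a, b):
    # Static and temporal components contribute with equal weight.
    return static_similarity(a, b) + max_sliding_correlation(a, b)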
   Predictions for a given user are computed separately for each question in the questionnaire,
as a weighted mean of the answers of the 15 most similar users in the training data, weighting
the answers (as integers) with the corresponding similarity scores, and rounding the result
to the nearest integer (in order to obtain a valid answer to the question10 ). In this way, we
approach the prediction of answers as a regression problem, by considering a continuous
range of possible answers, and are able to obtain good approximations for the overall level
of depression (obtaining high ADODL and ACR scores), even when the exact answer is not
correctly predicted.
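   For a single questionnaire item, this prediction step can be sketched as follows; variable names are illustrative.

import numpy as np

def predict_answer(similarities, answers, k=15):
    """similarities: score of the test user against each training user;
    answers: those users' integer answers to the item."""
    idx = np.argsort(similarities)[-k:]                    # the k most similar training users
    sims, ans = np.asarray(similarities)[idx], np.asarray(answers)[idx]
    return int(round(float(np.sum(sims * ans) / np.sum(sims))))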

5.1.2. Classification with Reddit BERT model
This model is trained in a two-step approach. We first crawl posts from subreddits related to
mental health issues such as depression, self-harm and anxiety. We group the data into 13
categories related to mental health and an additional category consisting of random posts. We
then train a balanced classifier to discriminate between these 14 classes. We train one classifier
based on distilroberta-base [22] and another based on roberta-base [23]; we call these models
subreddit14 and subreddit14-roberta-base, respectively.
   In a second step we extract the [CLS] embedding of the pre-trained model for every post in
the training and test datasets as well as the probability of the depression class. While the first is
used as the main representation for classification, the second gives us a notion of relevance.
   For every user in the training and test sets we then average the embeddings of their posts to
obtain the final user representation. Given this representation, we train a classifier for each of
the 21 questions. Since the dataset contains a small number of users, we find it helpful to create
multiple examples per user by sampling 80% of the user’s posts. This can be understood as a
form of random dropout. Additionally, we find it beneficial to restrict training to the posts where the
pre-trained model predicts a probability of > 0.0711 for the depression class. The motivation
is that many posts are unrelated to the user’s mental state and that this filtering removes the
noise introduced by these irrelevant posts.
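   The following sketch summarizes this pipeline, assuming that per-post [CLS] embeddings and depression-class probabilities have already been extracted; the SVC classifier mirrors the SVM runs reported in Tables 7 and 8, while the number of resampled examples per user is an assumption of the sketch.

import numpy as np
from sklearn.svm import SVC

def user_examples(post_embs, post_probs, threshold=0.07, n_samples=10, frac=0.8, rng=None):
    """post_embs: (n_posts, dim) array of [CLS] embeddings; post_probs: (n_posts,) array of
    P(depression). Returns several averaged representations built from random 80% samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    kept = post_embs[post_probs > threshold]      # keep only depression-relevant posts
    if len(kept) == 0:
        kept = post_embs
    n = max(1, int(frac * len(kept)))
    return [kept[rng.choice(len(kept), n, replace=False)].mean(axis=0)
            for _ in range(n_samples)]            # a form of random dropout

def train_question_classifiers(users, answers_per_question):
    """users: list of (post_embs, post_probs) pairs; answers_per_question: 21 lists of
    gold answers, one per user. Returns one classifier per questionnaire item."""
    classifiers = []
    for answers in answers_per_question:
        X, y = [], []
        for (embs, probs), answer in zip(users, answers):
            for example in user_examples(embs, probs):
                X.append(example)
                y.append(answer)
        classifiers.append(SVC().fit(np.vstack(X), np.array(y)))
    return classifiers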


    10
       For questions where one answer had two variations (𝑎 and 𝑏), we ignored the variation and only considered
the integer value.
    11
       Recall that we trained a balanced classifier on 14 classes so that the average probability assigned to a class is
≈ 0.07
Table 7
Development results for Task 3 (trained on the 2019 data and evaluated on the 2020 data).
  model                                                         AHR      ACR      ADODL      DCHR       MEAN
  RANDOM12                                                       28.81    63.38       80.15     27.29     49.90
  Emotion over time model                                      27.14    74.35       83.19     33.53      54.55
  SVM (subreddit14)                                            38.23    69.30       81.18     24.29      53.25
  SVM (subreddit14-roberta-base)                               39.25    70.25       83.04     35.71      57.06
  SVM (subreddit14, most recent 30 posts)                      35.78    67.96       82.43     35.71      55.47
  Random-Forest (subreddit14, most recent 30 posts)            35.99    68.78       83.52     35.71      56.00
  UPV 2020 System 1 [24]                                        34.56    67.44       80.63    35.71      54.59
  UPV 2020 System 2 (Best) [24]                                 36.94    69.02       81.72    31.53      54.80
  BioInfo@UAVR (Best) [25]                                      38.30    69.21       76.01     30.00     53.38
  iLab run2 (Best) [12]                                         37.07    69.41       81.70     27.14     53.83
  relai_lda_user (Best) [26]                                    36.39    68.32       83.15     34.29     55.54


Table 8
Official Evaluation results for Task 3
  model                                                         AHR      ACR      ADODL      DCHR       MEAN
  Emotion over time model                                       34.17   73.17        82.42     32.50     55.57
  SVM (subreddit14-roberta-base)                                32.20   66.05        77.28     26.25     50.45
  SVM (subreddit14)                                             34.58   67.32        75.62     26.25     50.94
  SVM (subreddit14, most recent 30 posts)                       33.15   66.05        75.42     23.75     49.59
  Random-Forest (subreddit14, most recent 30 posts)             33.09   66.39        76.87     23.75     50.03
  RELAI etm (BEST)                                             38.78     72.56      80.27      35.71     56.83
  CYUT run2 (BEST)                                             32.62     69.46      83.59      41.25     56.73


5.2. Evaluation and Results
Following the setup used in the shared task we use the following metrics: Average Hit Rate
(AHR), Average Closeness Rate (ACR), Average Difference between Overall Depression Levels
(ADODL), Depression Category Hit Rate (DCHR) and the average of the former four metrics
(MEAN).


6. Discussion
In Task 3, we have considered that among the features used for mental disorder detection in
previous literature, LIWC categories, emotions and personal pronoun usage have consistently
been shown to be relevant for this task [27, 28, 29, 30, 31, 32, 33]. In previous work [34] we
noticed that the expression of emotions in relation to the use of personal pronouns
reveals a specific pattern in users diagnosed with a mental disorder. For example, the use of
“I” and of personal pronouns in general correlates differently with the positive
   12
        Random answer drawn from the train distribution of each question. Metrics are averaged over 10 runs.
emotions for depressed and non-depressed people: the more depressed people use “I” and
personal pronouns, the more they express positive emotions like joy, anticipation and trust,
while the opposite happens for non-depressed people. On this basis, in Task 3 we considered
that similarity in the expression of emotions and in the use of personal pronouns could be a
predictor of responses to the Beck Depression Inventory. The good results obtained on ACR
and ADODL reinforce the idea that the expression of emotions, both static and in its
evolution over time, is a strong sign of the development of depression.

6.1. Negative Results
We experimented with modelling all three tasks simultaneously with a single Multi-Task
Learning (MTL) model [35]. We aggregated all the datasets and trained a roberta-base [23]
model, using a masked version of the cross-entropy loss, which does not penalize the model
when training on instances with missing labels. These experiments did not provide good results.
   For Task 1, the promising results obtained during development degraded terribly on the
official test set due to a bug introduced right before the submission.
   The choice of post chunking for training our hierarchical attention network for self-harm
detection (Task 2) was motivated by preliminary experiments showing that classifying individual
posts (using a network with a comparable architecture) does not achieve reasonable performance
on the development data. We also experimented with different number of posts per chunk
(including training and prediction), ranging between 10 and 90 posts per chunk and have
seen that, overall, prediction performance (in terms of F1-score) increases proportionally with
the size of chunks, while the models using larger chunks lose some performance in terms
of latency-based metrics (latency-weighted F1 and ERDE scores). We tuned the early-alert
decision mechanism applied to the trained models’ predictions (in terms of the different metrics
evaluated on the development data) for choosing the configurations of our official runs.
   In Task 3, an interesting question is why we obtain good scores in ACR and ADODL (which
reflect item-by-item answer predictions and the overall depression level, respectively), but fail
to predict the depression category. We observe that we lose a lot of precision when
we move from an ordinal scale, as used by ACR and ADODL, to a categorical scale such as DCHR.
For future work, we consider examining which patterns in ACR and ADODL are
present in users who have been correctly classified and which are present in those who
have been misclassified with respect to the depression categories. It could be possible to apply a
correction factor to certain extreme scores on certain items if we observe that these scores and
items play a major role in the misclassification of users into a given level or category.
Another interesting observation is that the classification-based models perform well on the
development setup but not on test. A possible explanation is that there is a discrepancy in the
proportion of users with minimal and mild depression: 24% on test, 47% on dev and 40% on train
(Appendix C). However, it is unclear why this affects the classification-based models more than
other approaches.
6.2. Motivation and Intended Usage
Newly emerging AI-supported services are in a promising position for the early detection
and prevention of mental health conditions. The necessity of implementing such services in
everyday life is becoming more relevant, especially considering recent events like the COVID-19
pandemic, during which many people suffered from psychological problems [36]. Digital mental
health solutions (e.g., therapy programs, chat bots, smart device applications) are one of the
largest use-case areas in this sense. Early detection is a vital aspect of therapeutic interventions
which makes the process more effective and prevents an aggravation of symptoms [37]. AI-
supported systems could be used as a preliminary analysis tool in clinical settings to enable an
early and preventive way of determining the type of the condition, severity of symptoms, and
recommendations for a successful therapy concept [38]. Furthermore, digital solutions could
address the issues of accessibility and stigmatization while providing individuals with a healthy and
unbiased way of self-help [39]. Besides the application in a clinical environment, AI-supported
tools would be of use for general well-being practices. A holistic understanding of mental
health requires not only the detection of problems but also the positive build-up of human
psychology. Linguistic social media data is able to reflect different components of well-being,
thus providing important insights into the everyday representations of mental health topics [40].
Early detection models developed by considering such insights could help to raise awareness
on self-reflection and foster preventive lifestyle interventions. Academic and organizational
institutions, which possess large application and impact areas on individuals, could use such
tools to predict well-being of students and employees, and spot and support at-risk individuals
[41, 42].


7. Related Work
Apart from the overviews of the previous editions of the eRisk shared task [20, 19], the closest
literature to our work can probably be found in the review of CLPsych 2015 shared task on
predicting depression and PTSD from Twitter data [43].
   There are many studies in both computational linguistics and psychology which approach
the problem of analyzing the language of people suffering from a mental disorder, especially
depression. Many of these studies perform simple quantitative analyses or use traditional
machine learning models (such as logistic regression). Recently, more studies have started
employing deep learning for mental disorder detection, generally using word sequences as
features [44, 45, 46]. A few recent studies also use pre-trained transformers for detection of
mental health disorders [47, 48].
   Hierarchical attention networks have successfully been used for mental health disorder
detection in the past, including previous editions of the eRisk shared task: Mohammadi et al.
[49] use HANs for anorexia detection (obtaining best results at the eRisk 2019 shared task [20]).
Recently, Rao et al. [50] use hierarchical networks for depression detection, and Amini and
Kosseim [51] use them for anorexia detection. All previously mentioned studies use HANs with
standard word embedding features. Hierarchical attention networks using multiple linguistic
features have previously been used for self-harm detection in eRisk 2020 [24], as well as for
studying the detection of other mental health disorders and the models’ explainability,
including depression, anorexia and post-traumatic stress disorder [52].
   Most computational studies model mental disorder symptoms as static phenomena, whereas
the evolution of mental disorder markers, as well as their prevalence in texts posted by a user,
is an important indicator of mental disorder risk. We mention one previous study [21] in which
the authors attempt to classify time series representing the mood of social media users in order
to predict occurrence of anorexia, with promising results. One recent study attempts a more
in-depth analysis of emotions and other psycho-linguistic features over time [34].
   Emotions have previously been shown to be relevant for modelling mental disorders, but
few studies go beyond simple quantitative analyses. We mention an approach focused on a
fine-grained analysis of emotions, presented in three studies on depression, anorexia and self-harm
detection [53, 54, 55]. Starting from Plutchik’s eight basic emotions [56], the authors use word
embedding spaces to automatically identify sub-emotions, which they use as features for their
classifiers, trained to detect depression [53], anorexia [54] and self-harm [55] respectively.
   The "search as labelling" approach that we used to annotate our internal Reddit corpus is
described in [57].


8. Conclusion
In this paper we presented the contributions of the UPV-Symanto team in the eRisk 2021 shared
tasks: gambling addiction and self-harm detection and the prediction of depression levels,
based on social media text data. We have used a variety of models and techniques, including
Transformers, hierarchical attention networks with multiple linguistic features, a dedicated
early alert decision mechanism, and temporal modelling of emotions. We ranked first in terms
of ACR and second in terms of ADODL for Task 3, exceeding the previous state of the art
for this eRisk shared task [19, 4], and obtained the best results for Task 2 in terms of the 𝐸𝑅𝐷𝐸5 score.
We conclude that our methods are promising, encouraging the use of emotion and linguistic
features, temporal modelling, and dedicated early detection mechanisms.


Acknowledgements
The authors from Universitat Politècnica de València thank the EU-FEDER Comunitat Valenciana
2014-2020 grant IDIFEDER/2018/025. The work of Paolo Rosso was carried out in the framework of the
research project PROMETEO/2019/121 (DeepPattern) of the Generalitat Valenciana. We would
like to thank the two anonymous reviewers who helped us improve this paper.
Appendix

A. List of crawled subreddits
For augmenting the training data, we collected all the posts and comments from the following
list of subreddits:

   • r/ADHD/
   • r/Anxiety/
   • r/aspergers/
   • r/bipolar/
   • r/BipolarReddit/
   • r/BPD/
   • r/CPTSD/
   • r/depression/
   • r/GamblingAddiction
   • r/mentalhealth/
   • r/OCD/
   • r/problemgambling/
   • r/schizophrenia/
   • r/selfharm/
   • r/SuicideWatch/


B. Hyper-Parameters
B.1. Transformer-based Model
   • batch size = 8
   • optimizer = Adam
   • dropout = 0.1
   • learning rate = 5e-5
   • early stopping patience = 7
   • epochs = 15
   • maximum sequence length = 512

B.2. Hierarchical Attention Network
   • LSTM units (post encoder) = 128
   • dense BoW units = 20
   • dense lexicon units = 20
   • LSTM units (user encoder) = 32
   • dropout = 0.3
    • 𝑙2 = 0.00001
    • optimizer = Adam
    • learning rate = 1e-4
    • early stopping patience = 5
    • epochs = 25
    • maximum sequence length = 256
    • posts per chunk = 80


C. Task 3 - Risk Category Distribution

Table 9
Risk category distribution for Task 3
                         name           minimal   mild   moderate   severe
                         train (2019)      0.20   0.20       0.20     0.40
                         dev (2020)        0.14   0.33       0.26     0.27
                         test (2021)       0.08   0.16       0.34     0.43



References
 [1] K. Kireyev, L. Palen, K. Anderson, Applications of topics models to analysis of disaster-
     related twitter data, in: NIPS workshop on applications for topic models: text and beyond,
     volume 1, Canada: Whistler, 2009.
 [2] E. Aramaki, S. Maskawa, M. Morita, Twitter catches the flu: Detecting influenza epidemics
     using Twitter, in: Proceedings of the 2011 Conference on Empirical Methods in Natural
     Language Processing, Association for Computational Linguistics, Edinburgh, Scotland,
     UK., 2011, pp. 1568–1576. URL: https://www.aclweb.org/anthology/D11-1145.
 [3] J. Sech, A. DeLucia, A. L. Buczak, M. Dredze, Civil unrest on Twitter (CUT): A dataset
     of tweets to support research on civil unrest, in: Proceedings of the Sixth Workshop on
     Noisy User-generated Text (W-NUT 2020), Association for Computational Linguistics,
     Online, 2020, pp. 215–221. URL: https://www.aclweb.org/anthology/2020.wnut-1.28. doi:10.
     18653/v1/2020.wnut-1.28.
 [4] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2021: Early risk
     prediction on the internet, in: Proceedings of the Twelfth International Conference of the
     Cross-Language Evaluation Forum for European Languages, Springer, 2021.
 [5] D. Losada, F. Crestani, A test collection for research on depression and language use,
     in: Proc. of Experimental IR Meets Multilinguality, Multimodality, and Interaction, 7th
     International Conference of the CLEF Association, CLEF 2016, Evora, Portugal, 2016, pp.
     28–39. URL: https://citius.usc.es/sites/default/files/publicacions_postprints/clef.pdf.
 [6] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, J. Blackburn, The pushshift reddit
     dataset, in: Proceedings of the International AAAI Conference on Web and Social Media,
     volume 14, 2020, pp. 830–839.
 [7] M. Stonebraker, L. A. Rowe, The design of postgres, ACM Sigmod Record 15 (1986)
     340–355.
 [8] American Psychiatric Association, Diagnostic and statistical manual of mental disorders (DSM-5®),
     American Psychiatric Pub, 2013.
 [9] O. Plana-Ripoll, C. B. Pedersen, Y. Holtz, M. E. Benros, S. Dalsgaard, P. De Jonge, C. C.
     Fan, L. Degenhardt, A. Ganna, A. N. Greve, et al., Exploring comorbidity within mental
     disorders among a danish national population, JAMA psychiatry 76 (2019) 259–270.
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/
     anthology/N19-1423. doi:10.18653/v1/N19-1423.
[11] I. Turc, M.-W. Chang, K. Lee, K. Toutanova, Well-read students learn better: On the
     importance of pre-training compact models, arXiv preprint arXiv:1908.08962 (2019).
[12] R. Martínez-Castaño, A. Htait, L. Azzopardi, Y. Moshfeghi, Early risk detection of self-harm
     and depression severity using bert-based transformers: ilab at clef erisk 2020, Early Risk
     Prediction on the Internet (2020). URL: http://ceur-ws.org/Vol-2696/paper_50.pdf.
[13] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk: early risk prediction on the
     internet, in: International Conference of the Cross-Language Evaluation Forum for
     European Languages, Springer, 2018, pp. 343–361.
[14] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for
     document classification, in: Proceedings of the 2016 conference of the North American
     chapter of the association for computational linguistics: human language technologies,
     2016, pp. 1480–1489.
[15] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in:
     Proceedings of the 2014 conference on empirical methods in natural language processing
     (EMNLP), 2014, pp. 1532–1543.
[16] M. Trotzek, S. Koitka, C. M. Friedrich, Linguistic metadata augmented classifiers at the
     clef 2017 task for early detection of depression., in: CLEF (Working Notes), 2017.
[17] J. W. Pennebaker, M. E. Francis, R. J. Booth, Linguistic inquiry and word count: Liwc 2001,
     Mahway: Lawrence Erlbaum Associates 71 (2001) 2001.
[18] S. M. Mohammad, P. D. Turney, Nrc emotion lexicon, National Research Council, Canada
     2 (2013).
[19] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk at clef 2020: Early risk prediction on
     the internet (extended overview) (2020). URL: http://ceur-ws.org/Vol-2696/paper_253.pdf.
[20] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk 2019 early risk prediction on
     the internet, in: International Conference of the Cross-Language Evaluation Forum for
     European Languages, Springer, 2019, pp. 340–357. URL: http://www.dei.unipd.it/~ferro/
     CLEF-WN-Drafts/CLEF2019/paper_248.pdf.
[21] W. Ragheb, J. Azé, S. Bringay, M. Servajean, Attentive multi-stage learning for early risk
     detection of signs of anorexia and self-harm on social media., in: CLEF (Working Notes),
     2019.
[22] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
     faster, cheaper and lighter, ArXiv abs/1910.01108 (2019).
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy-
     anov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
     (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[24] A.-S. Uban, P. Rosso, Deep learning architectures and strategies for early detection of
     self-harm and depression level prediction, in: CEUR Workshop Proceedings, volume 2696,
     Sun SITE Central Europe, 2020, pp. 1–12.
[25] L. Oliveira, Bioinfo@ uavr at erisk 2020: on the use of psycholinguistics features and
     machine learning for the classification and quantification of mental diseases (2020).
[26] D. Maupomé, M. D. Armstrong, R. Belbahar, J. Alezot, R. Balassiano, M. Queudot, S. Mosser,
     M.-J. Meurs, Early mental health risk assessment through writing styles, topics and neural
     models (2020).
[27] M. De Choudhury, S. Counts, E. J. Horvitz, A. Hoff, Characterizing and predicting postpar-
     tum depression from shared facebook data, in: Proceedings of the 17th ACM conference
     on Computer supported cooperative work & social computing, 2014, pp. 626–638.
[28] M. De Choudhury, M. Gamon, S. Counts, E. Horvitz, Predicting depression via social
     media, in: Seventh international AAAI conference on weblogs and social media, 2013.
[29] S. C. Guntuku, D. B. Yaden, M. L. Kern, L. H. Ungar, J. C. Eichstaedt, Detecting depression
     and mental illness on social media: an integrative review, Current Opinion in Behavioral
     Sciences 18 (2017) 43–49.
[30] M. Trotzek, S. Koitka, C. M. Friedrich, Word embeddings and linguistic metadata at the clef
     2018 tasks for early detection of depression and anorexia., in: L. Cappellato, N. Ferro, J. Nie
     and L. Soulier (eds.) CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop
     Proceedings.CEUR-WS.org, volume 2125, 2018.
[31] M. Conway, D. O’Connor, Social media, big data, and mental health: current advances and
     ethical implications, Current opinion in psychology 9 (2016) 77–82.
[32] P. Resnik, A. Garron, R. Resnik, Using topic modeling to improve prediction of neuroticism
     and depression in college students, in: Proceedings of the 2013 conference on empirical
     methods in natural language processing, 2013, pp. 1348–1353.
[33] J. C. Eichstaedt, R. J. Smith, R. M. Merchant, L. H. Ungar, P. Crutchley, D. Preoţiuc-Pietro,
     D. A. Asch, H. A. Schwartz, Facebook language predicts depression in medical records,
     Proceedings of the National Academy of Sciences 115 (2018) 11203–11208.
[34] A. S. Uban, B. Chulvi, P. Rosso, An emotion and cognitive based analysis of mental health
     disorders from social media data, Future Generation Computer Systems (In press) (2021).
[35] R. Caruana, Multitask learning, Machine learning 28 (1997) 41–75.
[36] B. Pfefferbaum, C. S. North, Mental health and the covid-19 pandemic, New England
     Journal of Medicine 383 (2020) 510–512.
[37] S. Graham, C. Depp, E. E. Lee, C. Nebeker, X. Tu, H.-C. Kim, D. V. Jeste, Artificial intelligence
     for mental health and mental illnesses: an overview, Current psychiatry reports 21 (2019)
     1–18.
[38] M. Ewbank, R. Cummins, V. Tablan, A. Catarino, S. Buchholz, A. Blackwell, Understanding
     the relationship between patient language and outcomes in internet-enabled cognitive
     behavioural therapy: A deep learning approach to automatic coding of session transcripts,
     Psychotherapy Research (2020) 1–13.
[39] C. A. Lovejoy, Technology and mental health: the role of artificial intelligence, European
     Psychiatry 55 (2019) 1–3.
[40] H. A. Schwartz, M. Sap, M. L. Kern, J. C. Eichstaedt, A. Kapelner, M. Agrawal, E. Blanco,
     L. Dziurzynski, G. Park, D. Stillwell, et al., Predicting individual well-being through the
     language of social media, in: Biocomputing 2016: Proceedings of the Pacific Symposium,
     World Scientific, 2016, pp. 516–527.
[41] E. Pogrebtsova, G. F. Tondello, H. Premsukh, L. E. Nacke, Using technology to boost
     employee wellbeing? how gamification can help or hinder results., in: PGW@ CHI PLAY,
     2017.
[42] S. Volkova, K. Han, C. Corley, Using social media to measure student wellbeing: a large-
     scale study of emotional response in academic discourse, in: International Conference on
     Social Informatics, Springer, 2016, pp. 510–526.
[43] G. Coppersmith, M. Dredze, C. Harman, K. Hollingshead, M. Mitchell, CLPsych 2015
     shared task: Depression and PTSD on Twitter, in: Proceedings of the 2nd Workshop on
     Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical
     Reality, Association for Computational Linguistics, Denver, Colorado, 2015, pp. 31–39.
     URL: https://www.aclweb.org/anthology/W15-1204. doi:10.3115/v1/W15-1204.
[44] F. Sadeque, D. Xu, S. Bethard, Uarizona at the clef erisk 2017 pilot task: linear and recurrent
     models for early depression detection, in: CEUR workshop proceedings, volume 1866,
     NIH Public Access, 2017.
[45] G. Shen, J. Jia, L. Nie, F. Feng, C. Zhang, T. Hu, T.-S. Chua, W. Zhu, Depression detection
     via harvesting social media: A multimodal dictionary learning solution., in: IJCAI, 2017,
     pp. 3838–3844.
[46] Y.-T. Wang, H.-H. Huang, H.-H. Chen, A neural network approach to early risk detection
     of depression and anorexia on social media text., in: L. Cappellato, N. Ferro, J. Nie and
     L. Soulier (eds.) CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop
     Proceedings.CEUR-WS.org, volume 2125, 2018.
[47] M. Matero, A. Idnani, Y. Son, S. Giorgi, H. Vu, M. Zamani, P. Limbachiya, S. C. Guntuku,
     H. A. Schwartz, Suicide risk assessment with multi-level dual-context language and bert, in:
     Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology,
     2019, pp. 39–44.
[48] A. Zirikly, P. Resnik, O. Uzuner, K. Hollingshead, Clpsych 2019 shared task: Predicting
     the degree of suicide risk in reddit posts, in: Proceedings of the sixth workshop on
     computational linguistics and clinical psychology, 2019, pp. 24–33.
[49] E. Mohammadi, H. Amini, L. Kosseim, Quick and (maybe not so) easy detection of anorexia
     in social media posts., in: CLEF (Working Notes), 2019. URL: http://ceur-ws.org/Vol-2380/
     paper_74.pdf.
[50] G. Rao, Y. Zhang, L. Zhang, Q. Cong, Z. Feng, Mgl-cnn: A hierarchical posts representations
     model for identifying depressed individuals in online forums, IEEE Access 8 (2020) 32395–
     32403.
[51] H. Amini, L. Kosseim, Towards explainability in using deep learning for the detection
     of anorexia in social media, in: International Conference on Applications of Natural
     Language to Information Systems, Springer, 2020, pp. 225–235.
[52] A. S. Uban, B. Chulvi, P. Rosso, On the explainability of automatic predictions of mental
     disorders from social media data, in: International Conference on Applications of Natural
     Language to Information Systems (In press), Springer, 2021.
[53] M. E. Aragón, A. P. López-Monroy, L. C. González-Gurrola, M. Montes, Detecting depres-
     sion in social media using fine-grained emotions, in: Proceedings of the 2019 Conference
     of the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1481–1486.
[54] M. E. Aragón, A. P. López-Monroy, M. Montes-y Gómez, Inaoe-cimat at erisk 2019:
     Detecting signs of anorexia using fine-grained emotions., in: L. Cappellato, N. Ferro, D.
     Losada and H. Müller (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR
     Workshop Proceedings.CEUR-WS.org, volume 2380, 2019.
[55] M. E. Aragón, A. P. López-Monroy, M. Montes-y Gómez, Inaoe-cimat at erisk 2020:
     Detecting signs of self-harm using sub-emotions and words 2696 (2020).
[56] R. Plutchik, The emotions, University Press of America, 1991.
[57] J. Attenberg, F. Provost, Why label when you can search? alternatives to active learning for
     applying human resources to build classification models under extreme class imbalance, in:
     Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery
     and Data Mining, KDD ’10, Association for Computing Machinery, New York, NY, USA,
     2010, p. 423–432. URL: https://doi.org/10.1145/1835804.1835859. doi:10.1145/1835804.
     1835859.