=Paper= {{Paper |id=Vol-2380/paper_126 |storemode=property |title=Attentive Multi-stage Learning for Early Risk Detection of Signs of Anorexia and Self-harm on Social Media |pdfUrl=https://ceur-ws.org/Vol-2380/paper_126.pdf |volume=Vol-2380 |authors=Waleed Ragheb,Jérôme Azé,Sandra Bringay,Maximilien Servajean |dblpUrl=https://dblp.org/rec/conf/clef/RaghebABS19 }} ==Attentive Multi-stage Learning for Early Risk Detection of Signs of Anorexia and Self-harm on Social Media== https://ceur-ws.org/Vol-2380/paper_126.pdf
 Attentive Multi-stage Learning for Early Risk
Detection of Signs of Anorexia and Self-harm on
                  Social Media

      Waleed Ragheb1,2 , Jérôme Azé1,2 , Sandra Bringay1,3 , and Maximilien
                                   Servajean1,3
     1
         LIRMM UMR 5506, CNRS, University of Montpellier, Montpellier, France
             2
               IUT de Béziers, University of Montpellier, Béziers, France
          3
            AMIS, Paul Valéry University - Montpellier 3 , Montpellier, France
                               {first.last}@lirmm.fr



         Abstract. Three tasks are proposed at CLEF eRisk-2019 for predict-
         ing mental disorder using users posts on Reddit. Two tasks (T1 and T2)
         focus on early risk detection of signs of anorexia and self-harm respec-
         tively. The other one (T3) focus on estimation of the severity level of
         depression from a thread of user submissions. In this paper, we present
         the participation of LIRMM (Laboratoire d’Informatique, de Robotique
         et de Microélectronique de Montpellier) in both tasks on early detection
         (T1 and T2). The proposed model addresses this problem by modeling
         the temporal mood variation detected from user posts through multi-
         stage learning phases. The proposed architectures use only textual in-
         formation without any hand-crafted features or dictionaries. The basic
         architecture uses two learning phases through exploration of state-of-the-
         art deep language models. The proposed models perform comparably to
         other contributions.

         Keywords: Classification, LSTM, Attention, Temporal Variation, Bayesian
         Variational Inference, Anorexia, Self-harm


1     Introduction

Anorexia is consider one of the most common eating disorder. It is characterized
by low weight, worry of gaining weight, and a powerful need to be skinny, leading
to food restriction. Many who suffer from eating disorder see themselves as
overweight although they could be thin [8]. Individuals with eating disorders
have also been shown to have lower employment rates, in addition to an overall
loss of earnings. Eating disorder sufferers who are experiencing an overall loss in
earnings associated with their illness are also magnified by the excess of health-
care costs. According to the National Eating Disorder Association (NEDA), up
    Copyright © 2019 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12
    September 2019, Lugano, Switzerland.
to 70 million people worldwide suffer from eating disorders [1]. Eating disorder
symptoms are beginning earlier in both males and females. As estimated, 1.1 to
4.2 percent of women suffer from anorexia at some point in their lifetime [6].
Young people between the ages of 15 and 24 with anorexia have 10 times the
risk of dying compared to their same-aged peers.
     Self-harm is a very common problem, and many people are struggling to deal
with it [9]. Several illnesses are associated with self-harm, including borderline
personality disorder, depression, eating disorders, anxiety or emotional distress
[3]. Self-harm occurs most often during the teenage and young adult begin around
age 14 and carry on into their 20s, though it can also happen later in life [9].
There is also an increased risk of suicide in individuals who self-harm and it is
found in 40% to 60% of suicides [5].
     Social media is becoming increasingly used not only by adults but also at
different age stages. Mental disordered patients also turn to online social media
and web forums for information on specific conditions and emotional support.
Even though social media can be used as a very helpful tool in changing a
person’s life, it may cause such conflicts that can have a negative impact. This
puts responsibilities for content and community management for monitoring
and moderation. With the increasing number of users and their contents, these
operations turn out to be extremely difficult. Many social media try to deal with
this problem by reactive moderation. In reactive moderation, users report any
inappropriate, negative or risky user generated contents. However it may reduce
the workload or the cost of moderating, it is not enough especially for handling
mental disordered user’s threads or posts.
     Previous researches on social media have established the relationship between
an individual’s psychological state and his\her linguistic and conversational pat-
terns [19, 18]. This motivate the task organizers to initiate the pilot task for
detecting depression from user posts on Reddit1 in eRisk-2017 [11]. In eRisk-
2018 the extension of the study was planned to include detection of anorexia. In
eRisk-2019, a continuation of anorexia tasks in addition to two other tasks are
proposed. One task is for early detection of signs of self-harm (T2). In this task
no training dataset is provided. Also, another new task for detection of severity
level of depression (T3) is presented. Tasks organizers proposed new evaluation
measures than what were used before.
     In this paper, we present the participation of LIRMM (Laboratoire
d’Informatique, de Robotique et de Microélectronique de Montpellier) in both
tasks for early detection of anorexia and self-harm in eRisk-2019. The originality
of our approach is to perform the detection through two main learning phases. In
the first learning phase. we proposed Deep Mood Evaluation Module (DMEM)
that uses attention based deep learning models to construct a time series rep-
resenting temporal mood variation through users posts or writings. The second
phase is either to use machine learning or Bayesian inference model to obtain
1
    Reddit is an open-source platform where community members (red-ditors) can sub-
    mit content (posts, comments, or direct links), vote submissions, and the content
    entries are organized by areas of interests (subreddits).
the proper decision. The main idea is to give a decision once the models detect
clear signs of mental disorder from current and previous mood extracted from
the content.
    The rest of the paper is organized as follows. In Section 2, the related work
is introduced. Then in Section 3, a brief tasks (T1 and T2) description of early
risk detection and used datasets are presented. Section 4 presents the proposed
models. The experimental setup and all model variants used are introduced in
Section 5. In Section 6, the evaluation results and discussions are presented. We
conclude the study and experiments in Section 7.


2   Related Work

Recent psychological studies showed the correlation between person’s mental
status and mood variation over time [11]. It is also evident that some mental
disordered may have chronic week-to-week mood instability. It is a common
presenting symptom for people with a wide variety of mental disorders, with
as many as 8 of 10 patients reporting some degree of mood instability during
assessment. These studies suggest that clinicians should screen for temporal
mood variation across most common mental health disorders.

    Concerning text representation, traditional Natural Language Processing
(NLP) modules start with feature extraction from text such as the count or
frequency of specific words, predefined patterns, Part-of-Speech tagging, etc.
These hand-crafted features should be selected carefully and sometimes with an
expert view. However these features are interesting [22], sometimes they loose the
sense of generalization. Another recent trend is the use of word and documents
vectorization methods. These strategies that convert either words, sentences or
even overall documents into vectors take into account all the text not just parts
of it. There are many ways to transform a text to high-dimensional space such
as term frequency and inverse document frequency (TF-IDF), Latent Semantic
Analysis (LSA), Latent Dirichlet Allocation (LDA), etc [14]. This direction was
revolutionized by Mikolov et al. [16, 17] who proposed the Continuous Bag Of
Words (CBOW) and skip-gram models known as Word2vec. It is a probabilistic
based model that makes use of a two layered neural network architecture to
compute the conditional probability of a word given its context. Based on this
work Le et al. [10] propose Paragraph Vector model. The algorithm which is
also known as Doc2vec learns fixed-length feature representations from variable-
length pieces of texts, such as sentences, paragraphs, and documents. Both word
vectors and documents vectors are trained using stochastic gradient descent and
back-propagation shallow neural network language models. The development
of Universal Language Model Fine Tuning (ULMFiT) is considered like mov-
ing from shallow to deep contextual pre-training word representation [7]. This
idea has been proved to achieve Computer Vision (CV)-like transfer learning
for many NLP task. ULMFiT make use of the state-of-the art language model
AWD-LSTM (Average stochastic gradient descent - Weighted Dropout LSTM)
proposed by Merity et al. in 2017 [15]. The same 3-layer LSTM recurrent ar-
chitecture with the same hyperparameters and no additions other than tuned
dropout hyperparameters are used. The classifier layers above the base LM en-
coder is simply a pooling layer (maximum and average pool) followed by three
fully-connected linear layers. The overall models signicantly outperforms the
state-of-the-art on six text classication tasks including three tasks for sentiment
analysis. In this paper, we will use these techniques for text representations.
    Attention mechanism is considered as one of the recent trends in NLP models
[2]. It can be described as mapping a query and a set of key-value pairs to an
output, where the query, keys, values and output are all vectors. The output is
computed as a weighted sum of the values, where the weight assigned to each
value is computed by a compatibility function of the query with the correspond-
ing key. This can be seen as take a collection of vectors, whether it could be
a sequence of vectors representing a sequence of words, or an unordered collec-
tions of vectors representing a collection of attributes and summarize them into
a single vector. This summarization is done by scoring each input sequence with
a probability-like scores obtained from the attention. This helps the model to
pay close attention to the sequence items with higher attention scores. In this
paper, we will evaluate the effect of attention mechanisms on the model.
    In this paper, we will use deep attention based modification of ULMFiT
classifier to construct a time series representing temporal mood variation. We the
used classical machine learning and statistical models to get the final decisions.



3   Tasks Description

In CLEF eRisk 2019, three tasks are presented [13]. The first task (T1) is for
early detection of signs of anorexia. It is a continuation of the same task in eRisk-
2018. The second one (T2) is a new task in 2019 for early detection of signs of
self-harm. No training data is provided for this task. Another task was proposed
(T3) for measuring the severity of the signs of depression. In this section we will
describe the first two tasks (T1 and T2) that we have participated on.
    Both tasks are considered as a binary classification problem. The datasets
are a dated textual data of user posts and comments -posts without titles- on
Reddit. The training and testing datasets are provided in stream of user writings
(posts and comments). The stream is ordered chronologically. A brief statistics
and summary for these datasets are provided in Table 1. Task organizers set
up a server that iteratively gives user writings to the participating teams. The
goal is not only to perform classification but also to do it as early as possible
using minimum amount of writings for each user. A decision must be sent after
processing each user writing to continue receiving more. This decision could be
positive risk case or postponed for future writings. A detailed description of
the tasks and used evaluation metrics can be found in the corresponding task
description paper [13].
       Table 1: Summary of early risk detection tasks (T1 and T2) Datasets

                                                    T1                       T2
                                       Training            Testing        Testing
No. of Users (At-risk/Controlled )   472 (61/411)        815 (73/742)   340(41/299)
No. of writings                        253,341             570,510        170,698
Avg. No. of writings/User               536.74              700.01         502.05
Avg. writings Size (words)               35.38              34.83          33.15
Vocabulary Size                        117,090             210,763        105,448


4     Proposed Models
The temporal aspects of the eRisk tasks inspired us to model the temporal
mood variation through user’s text content. The average number of days ranging
from the first submission to the last submission is approximately 600 days. So,
determining the way in which user’s posts and comments vary from positive
to negative and vice versa through time is worth inspecting. In the proposed
models, the main idea is to process user writings for each user and determine
the probability of how positive or negative it is. A detailed description of our
model can be found in the working notes paper of eRisk 2018 [20]. The proposed
architecture of our models comes in three main steps.
    Step 1 - Text Vectorization Module: It is considered as language mod-
eling step. The input of this step is the textual training datasets and the output
is text vectorization model.
    Step 2 - Mood Evaluation Module: This step is considered as the first
supervised learning phase. Assign to each writing a probability like score repre-
senting how positive (risky) the submission is. The output of this step is a time
series representing the mood variability over time. These time series will be the
training set of the second learning phase.
    Step 3 - Temporal Modeling Module: Another learning phase is to build
machine learning models to learn some patterns from these time series to come
up with the final classification model.
    We tried to encapsulate text vectorization and mood evaluation modules
and proposed Deep Mood Evaluation Module (DMEM). This module is based on
ULMFiT architecture [7] and the idea of transfer learning for language modeling
in addition to using attention layers for classifications. In addition, we tried
Bayesian Variational Inference (BVI) [21] for the second learning phase.

4.1   Deep Mood Evaluation Module (DMEM)
We propose a modification of the basic architecture of the ULMFiT by adding
attention to the model. The proposed architecture will help the model to focus
on the important parts of the text that influence the network decision. Figure
1 shows the proposed model and the separation between encoder layers (text
vectorization module) and classifier layers (mood evaluation module).
    The input sequence is passed to the embedding layer then the three Bi-LSTM
layers to form the output of the encoder. The encoder output has the form of
Xi = {xi1 , xi2 , xi3 , . . . , xiN } where N is the sequence length. The attention layer
takes the encoded input sequence and computes the attention scores S i . The
attention layer can be viewed as a linear layer without bias.


                                αi = {W i .X i }
                                             exp(αi )                                   (1)
                                S i = log[ PN            ]
                                                      i
                                            j=1 exp(αj )

    Where W i is the weight of the attention layer of the ith sequence. The atten-
tion scores S i is used to compute the scored sequence Oi = {oi1 , oi2 , oi3 , . . . , oiN }
which has the same length as the input sequence.

                                      Oi = S i     Xi                                   (2)

    Since the input sequence to the attention layer (encoder output) resulted
                                                               i
from Bi-LSTM layers, the last element in the scored output SN    can be used
for representing the whole sequence. The whole sequence is represented by the
weighted sum of all output sequences Ōi .
                                    X
                              Ōi =       Si X i                          (3)
                                         




                    Fig. 1: Deep Mood Evaluation Module (DMEM)


   For classification layers, a simple concatenation between the maximum and
average pooling in addition to the scored output is inputted to a group of two
different sizes fully connected linear layers. The output of the last linear layer is
passed to the Softmax to form the network decision.
    Training the over whole models comes into three main steps proposed in [7].

 1. The LM is initialized by training the encoder on a general-domain corpus
    (Wikitext-103 dataset [23]). This helps to capture general features of the
    language. Preserve low-level representations and adapt high-level ones
 2. The pre-trained LM is fine-tuned using the training datasets for both tasks.
 3. The classifier and the encoder is fine-tuned on the target task using different
    strategies for each layer group.

The training of the architecture is done using slanted triangular learning rates
(STLR), discriminative fine-tuning (Discr) and layers gradual unfreezing pro-
posed for ULMFiT with the same hyperparameter settings [7]. We train the
model on the forward language models for both the general-domain and task
specific datasets. Training the attention layer uses the same learning rates and
cycles used in the classification layers group.


4.2   Bayesian Variational Inference (BVI)

We can represent the problem of classifying users from the already classified
(observed) writings as a variant of independent Bayesian classifier combination
[21]. Figure 2 shows the graphical model for the proposed BVI where the observed
random variable Wik represents if the ith writing for the k th user if it is classified
as positive or negative such that:




    Fig. 2: Graphical Model for BVI: The shaded node represents observed values,
circular nodes are variables with a distribution and rectangular nodes are instantiated
                                        variables
                                   Wik ∼ Bernoulli(πuk )
                                                                                 (4)
                                     πi ∼ Beta(λ, γ)

   The hidden variable uk represents if the user will be classified as at-risk
(anorexia, self-harm) or not. So we can say:


                                     uk ∼ Bernoulli(κ)
                                                                                 (5)
                                      κ ∼ Beta(α, β)

The variables λ, γ, α and β are the hyper-parameters reflecting our a priori
belief about the proportion of positive and negative users.
    We are interested in the posterior distribution of the random variable Uk ,
that defines if the user is positive or negative, which is unfortunately intractable.
We use a variational inference approach to compute an approximation such as
in [21]. The approximation is obtained by solving the following equation for all
variables Zi conditionned on the observed data X:

                     log qi (Zi |X) = Ej6=i [log p(Z, X)] + const.               (6)

   So, we start from a number of positive and negative user writings (N d ) where
d ∈ {+, −} for positives and negatives respectively. More specifically:
                           X                             X
                 N+ =            1[Wik = 1],    N− =           1[Wik = 0]        (7)
                           k,i                           k,i

   Then, the expected number of positive and negative writings for positive
users can be represented by N1+ and N1− respectively. The same for negative
users is N0+ and N0− . These values are computed as:
                   X
           Nrd =                          k
                         E[1[uk = d]].[1[wi = r]],     d ∈ {+, −}, r ∈ {0, 1}    (8)
                   k,i


   We can estimate the expectation of the log of the probability to observe
positive writings independently of the user category as E[ln(κ)] and for negative
writings as E[ln(1 − κ)] such that:

                                     +               +   −
                   E[ln(κ)] = ψ(α + N ) − ψ(α + β + N + N )
                                     −               +   −
                                                                                 (9)
               E[1 − ln(κ)] = ψ(β + N ) − ψ(α + β + N + N )

    Where ψ is the digamma function defined as the logarithmic derivative of the
gamma function. In addition, we can estimate the expectation of the log prob-
ability for positive users to write positive writings as E[ln(π1 )] and for negative
users as E[ln(π0 )] where:
                                         +                +    −
                     E[ln(πi )] = ψ(λ + Ni ) − ψ(λ + γ + Ni + Ni )
                                        −                +    −
                                                                                 (10)
                E[1 − ln(πi )] = ψ(γ + Ni ) − ψ(λ + γ + Ni + Ni )

      So, the expectation of a user to be positive or negative can be obtained as:

                           Mk
                           X
              ln(ρkj ) =        Wik E[ln(πj )] + (1 − Wik ) E[ln(1 − πj )]
                           i
                                   + (α − 1) E[ln(κ)] + (β − 1) E[ln(1 − κ)]     (11)
                              ρki
              E[1[Uk = j]] = P k
                              j ρi

Where E[1[Uk = j]] is a normalized value for the two types of users (at-risk or
controlled). We can evaluate an optimal value for it iteratively by first initializing
all factors, then updating each in turn using the expectations with respect to
the current values of the other factors [21].


5      Experimental Setup

For each task, each team could participate with different five runs. We create
different variants of our proposed architecture. In this section, we will present
all these variants, training procedures and model hyperparameters.


5.1     Proposed Model Variants

All the proposed model variants for both tasks are based on two supervised
learning phases (step 2 and step 3 in temporal mood variation model). For self-
harm detection task (T2), as there is no training data, we train our models
on the depression and anorexia datasets of eRisk-2018 [12]. We assumed that
if a person with a clear signs of depression and/or anorexia could think about
harm himself. We used the DMEM module as the first learning phase an all
the variants and tried different machine learning and statistical methods as the
second learning phase. Table 2 shows the used model for the second learning
phase in all the runs for both tasks. MLP stands for Multi-Layer Perceptrons
and RF is for Random Forest. All models that do not employ another learning
phase are marked by dashes. In these runs, we used simple counting thresholds
for successive positive classified writings.


5.2     Model Training and hyperparameters

We processed the training and testing streams of user writings by moving
window concatenation of size (N ). In other words, to give a decision about the
                 Table 2: Summary of the proposed model variants

                                             2nd Learning Phase
                Run ID     Model Name         T1            T2
                  0         LIRMMA           MLP           —–
                  1         LIRMMB            RF           —–
                  2         LIRMMC            —–           MLP
                  3         LIRMMD            —–           RF
                  4         LIRMME           BVI           BVI




current writing at time (t), we process all user writing starting from (t − N + 1).
This gives more information about the context of a writing and reduce the effect
of noisy and irrelevant ones. Experiments show that (N = 5) to be a reasonable
choice for the window size.


    For DMEM, we use the same set of hyperparameter of AWD-LSTM proposed
by [15] replacing the LSTM with Bi-LSTM and keep the same embedding size of
400 and 1150 hidden activations. We used weighted dropout of 0.2 and 0.25 as
the input embedding dropout and the learning rate is 0.004. We fine-tuned the
LM by eithr anorexia or depression training datasets provided. We train the LM
for 14 epochs using batch size of 128 and limit the number of vocabulary to all
token that appear more than twice. For classifier, we used masked self-attention
layers and concatenation of maximum and average pooling. For the linear
block, we used hidden linear layer of size 100 and apply dropout of 0.4. We
used Adam optimizer [4] with β1 = 0.8 and β2 = 0.99. The base learning rate
is 0.01. We used the same batch size used in training LMs. For training the
classifier, we create each batch using weight random sampling to handle the
problem of imbalance in the datasets. We train the classifier on training set for
30 epochs and select the best model on validation set to get the final model.
For T2 training, we combine the training datasets for depression and anorexia
of eRisk-2018.


    In the second learning phase, the used architecture of the MLP had two
hidden layers with ten neurons each. Concerning the RF classifier, ten estimators
were used. These models are used to classify time series of (N ) points. For
MLP, RF and BVI models in T1, positive users were reported for those with
classification probability higher than 0.8. This value increases to 0.9 in T2. We
set both thresholds to 0.6 in the last rounds. For some model variants (LIRMMC
and LIRMMD in T1 and LIRMMA and LIRMMB in T2), we apply counting
of successive positive writings and give a decision after either 5 or 10 following
writings respectively.
6   Results & Discussions
In eRisk-2019 two different types are used for model evaluation. The first one is
decision-based evaluations; where the classical classification measures - precision
(P), Recall (R) and (F1) - are computed for positive (at-risk) user. In addition
to these and due to the drawbacks of ERDE measure, a new latency weighted
F1 measure is introduced [13]. The other complementary evaluation is ranking-
based evaluation. Beside the fired decision, scores are computed and used to
build a ranking of users in decreasing estimation of risk. We participated only
for decision-based evaluation. Tables 3 and 4 show the evaluation results of
all our proposed variants for both tasks. It is clear that using MLP for the
second learning phase is the best choice for both tasks. However, the usage of
high threshold in T2 make the models predict most of the positive user in late
writings. Also, applying BVI gets more comparable results than the runs with
simple counting of positive writings. But it needs more precise choice of threshold
for early detection in both tasks.


              Table 3: Results of proposed runs for anorexia task (T1)

                       P          R          F1      latency-weighted F1
         LIRMMA       0.74       0.63       0.68             0.63
         LIRMMB       0.77       0.60       0.68             0.62
         LIRMMC       0.66       0.70       0.68             0.60
         LIRMMD       0.74       0.42       0.54             0.48
         LIRMME       0.57       0.75       0.65              —




             Table 4: Results of proposed runs for self-harm task (T2)

                       P          R          F1      latency-weighted F1
         LIRMMA       0.57       0.29       0.39             0.35
         LIRMMB       0.53       0.22       0.31             0.29
         LIRMMC       0.48       0.49       0.48              —
         LIRMMD       0.47       0.44       0.46              —
         LIRMME       0.52       0.41       0.46              —


    Tables 5 and 6 show some statistics of other participants runs compared to
our proposed models. The ranks of the best run for each evaluation metric are
also included. The statistics of the anorexia task are for 54 runs of 13 teams.
The self-harm task statistics on results are for 33 runs of 8 teams. However the
proposed architecture does not include any hand-crafted features, it seems to be
comparable with other contributions for both tasks. Also, combining anorexia
and past eRisk depression training datasets for detecting signs of self-harm is
very competitive.
      Table 5: Statistics on 54 participating runs results and our ranks for T1

                            P           R          F1      latency-weighted F1
    Max                    0.77        0.99       0.71             0.69
    Min                    0.11        0.15       0.20             0.19
    Average                0.45        0.63       0.48             0.46
    Standard Deviation     0.17        0.24       0.17             0.15
    Rank                    1           14          5               5


      Table 6: Statistics on 33 participating runs results and our ranks for T2

                            P           R          F1      latency-weighted F1
    Max                    0.71        1.00       0.52             0.52
    Min                    0.12        0.22       0.22             0.17
    Average                0.29        0.73       0.32             0.29
    Standard Deviation     0.18        0.29       0.11             0.10
    Rank                    3           17          3               4



7   Conclusions

In this paper we present the participation of LIRMM in the CLEF eRisk-2019
T1 and T2 tasks. Both tasks are for early detection of signs of anorexia and
self-harm from users posts on Reddit respectively. We proposed five runs for
each task and the results are interesting and comparable to other contributions.
The proposed framework architecture used the text without any handcrafted
features. It performs the classification through two phases of supervised learning
using state-of-the-art deep language modeling neural network. The first learning
phase builds a time series representing the mood variation using attention-based
modification of the ULMFiT model. The second learning phase is another clas-
sification model that learns patterns from these time series to detect early signs
of such mental disorders. In this phase, We tried set of machine learning (MLP
and RF) and statistical (BVI) models.
     Combining anorexia and previous eRisk depression datasets to detect early
signs of self-harm (T2) is interesting and shows the correlation of such mental
disorders. However, the proposed models need tuning of second learning phase
classification thresholds for earlier risk detection.



Acknowledgments

We would like to acknowledge La Région Occitanie and l'Agglomération Béziers
Méditerranée which finance the thesis of Waleed Ragheb as well as INSERM
and CNRS for their financial support of CONTROV project.
References
 1. The national eating disorders association (NEDA).: Envisioning a world without
    eating disorders. In: The newsletter of the National Eating Disorders Association.
    Issue 22 (2009)
 2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
    to align and translate. In: International Conference on Learning Representations
    (ICLR). vol. abs/1409.0473 (Sep 2014)
 3. Doyle, L., Treacy, M.M.P., Sheridan, A.J.: Self-harm in young people: Prevalence,
    associated factors, and help-seeking in school-going adolescents. International jour-
    nal of mental health nursing 24 6, 485–94 (2015)
 4. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing.
    vol. abs/1611.01734 (2017)
 5. Hawton, K., Zahl, D., Weatherall, R.: Suicide following deliberate self-harm: long-
    term follow-up of patients who presented to a general hospital. British Journal of
    Psychiatry 182(6), 537542 (2003)
 6. Hoek, H.: Review of the worldwide epidemiology of eating disorders. In: Current
    Opinion in Psychiatry. vol. 29 (2016)
 7. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification.
    In: Proceedings of the 56th Annual Meeting of the Association for Computational
    Linguistics (Volume 1: Long Papers). pp. 328–339 (2018)
 8. Joyce, D., L. Sulkowski, M.: The diagnostic and statistical manual of mental disor-
    ders: Fifth edition (dsm-5) model of impairment. In: Assessing Impairment: From
    Theory to Practice. pp. 167–189 (2016)
 9. Klonsky, E.D.: The functions of deliberate self-injury: A review
    of the evidence. Clinical psychology review 27, 226–39 (04 2007).
    https://doi.org/10.1016/j.cpr.2006.08.002
10. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents.
    In: ICML. JMLR Workshop and Conference Proceedings, vol. 32, pp. 1188–1196.
    JMLR.org (2014)
11. Losada, D.E., Crestani, F., Parapar, J.: erisk 2017: Clef lab on early risk prediction
    on the internet: Experimental foundations. In: 8th International Conference of the
    CLEF Association. pp. 346–360. Springer Verlag (2017)
12. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk – Early Risk Predic-
    tion on the Internet. In: Experimental IR Meets Multilinguality, Multimodality,
    and Interaction. Proceedings of the Ninth International Conference of the CLEF
    Association (CLEF 2018). Avignon, France (2018)
13. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2019: Early Risk Pre-
    diction on the Internet. In: Experimental IR Meets Multilinguality, Multimodality,
    and Interaction. 10th International Conference of the CLEF Association, CLEF
    2019. Springer International Publishing, Lugano, Switzerland (2019)
14. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning
    word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting
    of the Association for Computational Linguistics: Human Language Technologies
    - Volume 1. pp. 142–150. HLT ’11, Association for Computational Linguistics,
    Stroudsburg, PA, USA (2011)
15. Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language
    models. In: International Conference on Learning Representations (2018)
16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
    sentations of words and phrases and their compositionality. In: Advances in Neural
    Information Processing Systems 26. pp. 3111–3119. Curran Associates, Inc. (2013)
17. Mikolov, T., Yih, S.W.t., Zweig, G.: Linguistic regularities in continuous space word
    representations. In: Proceedings of the 2013 Conference of the North American
    Chapter of the Association for Computational Linguistics: Human Language Tech-
    nologies (NAACL-HLT-2013). Association for Computational Linguistics (2013)
18. Moulahi, B., Azé, J., Bringay, S.: Dare to care: A context-aware framework to track
    suicidal ideation on social media. In: Bouguettaya A. et al. (eds) Web Information
    Systems Engineering - WISE 2017.,Lecture Notes in Computer Science,. Springer,
    Cham. vol. 10570 (2017)
19. Paul, M.J., Dredze, M.: You are what you tweet: Analyzing twitter for public
    health. In: Adamic, L.A., Baeza-Yates, R.A., Counts, S. (eds.) ICWSM. The AAAI
    Press (2011)
20. Ragheb, W., Moulahi, B., Azé, J., Bringay, S., Servajean, M.: Temporal mood
    variation: at the CLEF erisk-2018 tasks for early risk detection on the internet.
    In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum,
    Avignon, France, September 10-14, 2018. (2018)
21. Simpson, E., Roberts, S., Psorakis, I., Smith, A.: Dynamic bayesian combination
    of multiple imperfect classifiers. In: Decision Making and Imperfection. pp. 1–35.
    Springer Berlin Heidelberg, Berlin, Heidelberg (2013)
22. Trotzek, M., Koitka, S., Friedrich, C.: Linguistic metadata augmented classifiers
    at the clef 2017 task for early detection of depression. In: Working Notes of CLEF
    2017 - Conference and Labs of the Evaluation Forum. vol. CEUR-WS 1866 (2017)
23. Wang, H., Keskar, N.S., Xiong, C., Socher, R.: Identifying generalization proper-
    ties in neural networks. In: International Conference on Learning Representations
    (2019), https://openreview.net/forum?id=BJxOHs0cKm