Deep learning architectures and strategies for early detection of self-harm and depression level prediction

Ana-Sabina Uban1,2 and Paolo Rosso1
1 PRHLT Research Center, Universitat Politècnica de València
2 Human Language Technologies Research Center, University of Bucharest
ana.uban+prof@gmail.com, prosso@dsic.upv.es

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. This paper summarizes the contributions of the PRHLT-UPV team as a participant in the eRisk 2020 tasks on self-harm detection and prediction of depression levels from social media. Computational methods based on machine learning and natural language processing have great potential to assist with early detection of mental disorders in social media users, based on their online activity. We use multi-dimensional representations of language and compare the performance of various deep learning models, exploring avenues rarely approached in previous research, including hierarchical deep learning architectures, pre-trained transformers and language models.

Keywords: deep learning · mental disorders · BERT · hierarchical attention network · self-harm · depression

1 Introduction

Mental health disorders affect hundreds of millions of people worldwide [17]; depression alone is a major risk factor for suicide, and is usually underdiagnosed and undertreated. People affected by mental disorders often turn to social media to talk about their problems. This creates an important opportunity for automatic processing of social media data in order to identify changes in mental health status that may otherwise go undetected before they develop into more serious health consequences. Identifying people who start to develop signs of a mental illness early on is crucial for managing its evolution, and in certain cases it can be life-saving. The recent COVID-19 pandemic is expected to exacerbate this problem, affecting mental health as well as physical health [9].

The CLEF eRisk Lab (https://erisk.irlab.org/), organized every year since 2017, is dedicated specifically to identifying early signs of mental disorders from a user's social media posts, before the user was diagnosed with the disorder, for disorders including depression, anorexia and thoughts of self-harm [10–12]. Each year a new task is organized around predicting a specific disorder: in 2017 and 2018 the shared tasks focused on depression detection; in 2019 a new task on anorexia prediction was organized, together with a second task on predicting self-harm tendencies without any training data; in 2020 self-harm detection was again the topic, this time in a supervised setting. Datasets are collected from Reddit posts and comments selected from specific relevant sub-reddits, annotated by automatically detecting self-stated diagnoses of users. Healthy users are selected from participants in the same sub-reddits, thus making sure that the difference between healthy and diagnosed users is not trivially detectable. For the self-harm task, the dataset includes only posts published before any involvement in self-harm related communities, which requires any model trained on this data to be capable of very early prediction, and at the same time adds difficulty to the task.

The language used by a speaker has been shown to contain strong indicators of an altered mental state.
These indicators can manifest explicitly, at the level of the topics discussed, implicitly, at the level of the emotional charge of the text (e.g., greater negative emotion [5]), or through even more subtle stylistic markers (such as an increased use of first-person pronouns [25]). Textual data from social media, being rich, relatively easy to obtain and a continuously growing source of real-time information, can thus be leveraged to gain valuable insights into an individual's behavior and mental state and their evolution.

Most previous research on automatic mental disorder detection from social media data has focused on depression [6, 8, 1], but other mental illnesses have also been studied, including generalized anxiety disorder [23], schizophrenia [13], post-traumatic stress disorder [3, 4], risk of suicide [16], anorexia [11] and self-harm [11]. The majority of studies on mental disorder detection use simple machine learning models (such as support vector machines (SVMs) and logistic regression) [6, 5]; only a few studies have used more complex deep learning methods [21, 25, 26, 22]. At the level of features, most previous works have used traditional bag-of-words n-grams [3], as well as hand-crafted lexicons [24], LIWC features [5], or Latent Semantic Analysis [20, 24]. Few studies jointly consider several aspects of language [22, 23].

This study summarizes our contributions as participants in the eRisk shared tasks on self-harm detection and assessment of depression levels [12]. We explore the use of deep learning for detecting mental disorders from text data, and compare various architectures, including hierarchical attention networks and transformers. We model our text data using a multi-aspect representation, with features that reflect various complementary levels of language, including content, style and emotion. For predicting the level of depression, we use traditional machine learning models, including SVMs and logistic regression.

2 Task 1. Self-harm detection

The first task in eRisk 2020 consists of detecting whether a user is at risk of developing self-harm tendencies. Training data collected from Reddit was available, consisting of 340 users (of which 41 were labelled as positive) and their Reddit post history. Test data was provided as a stream of user posts, and candidate systems were asked to provide a decision (a binary number: the user is at risk or not), as well as a risk score (a real number), at each time step in the stream. We participated in the task with five different models. We implemented several neural network architectures, and experimented with pre-trained models and with strategies for sampling training data in order to improve results. Details of the architectures used and of the experimental setup are described below.

2.1 Features

Content features. We include a general representation of text content by transforming each text into a word sequence. Preprocessing of texts includes lowercasing and tokenizing, and removing punctuation and numbers; function words are not excluded. The 20,000 most frequent words were selected to form the vocabulary, and words not in the vocabulary were represented by a special "unknown" token. When passed as input to the neural networks, words within a sequence were encoded as embeddings of dimension 100. In order to initialize the weights of the embedding layers, we started from GloVe embeddings pre-trained on Twitter data.
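As an illustration of this preprocessing step, the sketch below builds the 20,000-word vocabulary, maps out-of-vocabulary words to the "unknown" token, and initializes a Keras embedding layer of dimension 100 from pre-trained GloVe Twitter vectors. The file path, reserved indices and helper names are our own assumptions, not the exact implementation.

```python
# Illustrative sketch (not the exact eRisk pipeline): build a 20k-word
# vocabulary and initialize a 100-d embedding layer from GloVe vectors.
import re
from collections import Counter

import numpy as np
from tensorflow.keras import initializers, layers

VOCAB_SIZE = 20000
EMBED_DIM = 100
PAD, UNK = 0, 1  # reserved indices for padding and the "unknown" token

def tokenize(text):
    # lowercase, drop punctuation and numbers, keep word tokens
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(texts):
    counts = Counter(tok for t in texts for tok in tokenize(t))
    most_common = [w for w, _ in counts.most_common(VOCAB_SIZE)]
    return {w: i + 2 for i, w in enumerate(most_common)}  # 0 and 1 reserved

def encode(text, vocab, max_len=512):
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))

def load_glove(path, vocab):
    # path points to e.g. glove.twitter.27B.100d.txt (assumed available locally)
    weights = np.random.normal(0, 0.1, (VOCAB_SIZE + 2, EMBED_DIM))
    with open(path, encoding="utf8") as f:
        for line in f:
            word, *vec = line.rstrip().split(" ")
            if word in vocab:
                weights[vocab[word]] = np.asarray(vec, dtype="float32")
    return weights

def embedding_layer(weights):
    # embedding layer initialized with the pre-trained vectors, kept trainable
    return layers.Embedding(input_dim=weights.shape[0], output_dim=EMBED_DIM,
                            embeddings_initializer=initializers.Constant(weights),
                            trainable=True)
```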
The choice of these pre-trained embeddings was justified by their dimension, which is smaller than for other GloVe embeddings pre-trained on large corpora, leading to fewer model parameters overall (in view of avoiding overfitting problems). Nevertheless, even though the data for this task is also sampled from social media, the two platforms (Reddit and Twitter) also differ significantly; exploring other embedding initializations (especially embeddings pre-trained on longer texts) would be interesting in future experiments.

Style features. We aim to represent the stylistic level of texts by including function word and pronoun features. Function words have traditionally been used as stylistic markers, whereas increased use of pronouns, especially first-person pronouns, has been shown to correlate with mental disorder risk [24]. We include two separate stylistic features: first, we extract from each text a numerical vector representing function word frequencies as a bag of words. Separately, we include a simple scalar feature meant to capture first-person pronoun usage, measured as the proportion of first-person pronouns relative to the total number of words in a text. We complement these with features extracted from the LIWC lexicon, as described below.

LIWC features. LIWC (http://www.liwc.net/) [18] is a lexicon mapping words in the English vocabulary to lexico-syntactic features of different kinds. It has been widely used in computational studies for analysing how suffering from mental disorders manifests in an author's writings. LIWC categories have the capacity to capture different levels of language, including style (through syntactic categories), emotions (through affect categories) and topics (through content-oriented categories, such as words referring to cognitive or analytical processes, or words referring to topics such as money, health or religion). We include in our analysis all 64 categories in the lexicon, and represent them as numerical vectors by computing, for each category, the ratio of words in a text that are related to that category according to the lexicon.

Emotions and sentiment. We dedicate a few features to representing emotional content in our texts, since the emotional state of a user is known to be highly correlated with his/her mental health. Several of the LIWC categories aim to capture sentiment polarity and emotion content (negative emotion, positive emotion, affect, sadness, anxiety). We additionally include a second lexicon: the NRC emotion lexicon [14], which is dedicated exclusively to emotion representation and contains the following categories: anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust. We represent NRC features similarly to LIWC features, by computing for each category the proportion of words in the text which are associated with that category.

2.2 Experimental setup

During the training phase as well as for testing, we do not consider social media posts individually as datapoints, since they are too short to be sufficiently predictive. Instead, we generate our datapoints by grouping sequences of 50 chronologically consecutive posts into larger chunks, obtaining more consistent samples of text. Features are computed at chunk level. As a consequence, prediction is always done on chunks of 50 posts.
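For illustration, the chunk-level lexicon features described above could be computed roughly as follows. The tokenizer, the pronoun list and the lexicon format (category name mapped to a set of words) are simplifying assumptions of ours; the actual LIWC and NRC resources are distributed separately.

```python
# Illustrative sketch of chunk-level lexicon features: for a chunk of 50
# posts we compute, per lexicon category, the fraction of words belonging
# to that category, plus the first-person-pronoun ratio.
import re

FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def category_ratios(tokens, lexicon):
    # lexicon: dict mapping category name -> set of words (e.g. LIWC or NRC)
    total = max(len(tokens), 1)
    return {cat: sum(tok in words for tok in tokens) / total
            for cat, words in lexicon.items()}

def first_person_ratio(tokens):
    total = max(len(tokens), 1)
    return sum(tok in FIRST_PERSON for tok in tokens) / total

def chunk_features(posts, liwc, nrc):
    # posts: list of raw post texts forming one 50-post chunk
    tokens = [tok for post in posts for tok in tokenize(post)]
    feats = {"first_person": first_person_ratio(tokens)}
    feats.update({f"liwc_{c}": v for c, v in category_ratios(tokens, liwc).items()})
    feats.update({f"nrc_{c}": v for c, v in category_ratios(tokens, nrc).items()})
    return feats
```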
When analyzing the input stream of test data, we periodically (after every 20 new posts) form a new chunk from the last 50 posts received, and feed it to the networks to generate predictions. As we will describe in the following section, we use two types of architectures for modelling the input: sequential and hierarchical. We adopt a special strategy for predictions during the first 50 posts in the stream, when a full chunk is not yet available: we pad the input data up to the size used during training (512 words in the case of the sequential setup, and 50 posts of length 256 in the case of the hierarchical setup); we submit the output score provided by the network, but as decisions (user at risk or not) we submit zeros regardless of the output score, so as not to send premature alerts (once a user is declared at risk, the decision cannot be reverted).

Sequence sampling. For one of our runs, we employ a special strategy during the training phase. We attempt to augment the training data by generating "artificial" chunks of user posts, in addition to the ones formed naturally by chunking the user's post history in chronological order. We do this by sampling posts from the post history at random, following an exponential distribution, so as to sample with higher probability from recent posts (which are more likely to contain signs of the disorder). The chronological order of posts is maintained.

Rolling average of predictions. As previously mentioned, for most runs predictions are generated using the last 50 posts seen in the test data stream. For one of our runs, we use a different strategy, computing a rolling average of the most recent three network outputs: in this way, we hope to obtain more robust results that do not depend only on the last batch of 50 user posts, but take into account a larger window of context.

2.3 Architectures

BiLSTM with attention. The first model we consider is a bidirectional LSTM network with attention. Input word sequences are truncated at a maximum of 512 words, with words encoded as embeddings, and passed as input to a BiLSTM layer with 256 units, which is then fed to an attention layer. The bag-of-words features representing the function word distribution are passed through a dense layer of 20 units, and the remaining extracted features (including pronoun, emotion and LIWC category usage) are concatenated into one vector. The output of the BiLSTM with attention is concatenated with the other features, and the final representation is passed through an output layer that generates the final prediction.
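A condensed Keras-style sketch of this architecture is given below, with the attention written out as a simple additive pooling over the BiLSTM states. The sequence length, LSTM size and 20-unit dense layer follow the description above, while the feature dimensions, activation choices and sigmoid output are our assumptions; this is a simplified reconstruction rather than the exact submitted model.

```python
# Simplified reconstruction of the BiLSTM-with-attention model.
import tensorflow as tf
from tensorflow.keras import Model, layers

MAX_LEN, VOCAB, EMBED_DIM = 512, 20002, 100
N_FUNC_WORDS, N_OTHER_FEATS = 300, 75  # feature sizes are illustrative

def attention_pool(seq):
    # additive attention: score each timestep, softmax over time, weighted sum
    scores = layers.Dense(1, activation="tanh")(seq)          # (batch, T, 1)
    weights = layers.Softmax(axis=1)(scores)                   # (batch, T, 1)
    return layers.Lambda(lambda x: tf.reduce_sum(x[0] * x[1], axis=1))([seq, weights])

words = layers.Input(shape=(MAX_LEN,), name="word_ids")
func_bow = layers.Input(shape=(N_FUNC_WORDS,), name="function_words")
other = layers.Input(shape=(N_OTHER_FEATS,), name="pronoun_emotion_liwc")

emb = layers.Embedding(VOCAB, EMBED_DIM)(words)
hidden = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(emb)
text_repr = attention_pool(hidden)

func_repr = layers.Dense(20, activation="relu")(func_bow)     # 20-unit dense layer
merged = layers.Concatenate()([text_repr, func_repr, other])
output = layers.Dense(1, activation="sigmoid")(merged)        # risk score in [0, 1]

model = Model([words, func_bow, other], output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```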
Hierarchical Attention Network. Hierarchical attention networks (HAN) were introduced in [27], where they were used for review classification, representing a text as a hierarchical structure in which a document is composed of sentences and a sentence is composed of words. We propose that social media data in our setup is very well suited to such a hierarchical representation: in our case the hierarchy consists of user post histories, which are composed of social media posts, which are in turn composed of word sequences. Especially since the evolution of the mental state of a user is in itself a relevant indicator of the development of a disorder, as shown in [19], user-level representations are expected to be natural and useful for modelling this problem. One other study has included post-level and user-level attention in their classifier's architecture, obtaining top results in the anorexia detection shared task [15].

In the hierarchical setup, posts within a chunk (datapoint) are stacked to form a hierarchical structure: word sequences (truncated at 256 words) are stacked into two-dimensional arrays, and the numerical and bag-of-words features follow the same hierarchical structure, with a set of features extracted for each post in the chunk and stacked together into two-dimensional arrays. The hierarchical network is composed of two components: a post-level encoder, which produces a representation of a post, and a user-level encoder, which generates a representation of a user's post history. For encoding the word sequence at post level, we use a convolutional layer with 100 filters of length 3. Each of the posts in the input datapoint is encoded with the post-level encoder; the resulting representations are stacked into a two-dimensional representation, which is then concatenated with the other features and passed to the user-level encoder. We choose to model the user-level encoder as an LSTM layer with attention, with 32 units. The output of the user encoder is connected to the output layer, which generates the final prediction. A depiction of the hierarchical architecture is shown in Figure 1.

Fig. 1. Hierarchical attention network architecture.

Transformers. We experiment with state-of-the-art language models based on transformer architectures, which have been shown to obtain high performance on a wide range of NLP tasks with minimal task-specific training. We use pre-trained BERT [7] models for English (the "base" versions of the models) with one trainable output layer, and fine-tune them for our task.

Ensemble. Finally, we use a simple ensemble model for one of our runs: predictions are generated by averaging the outputs of several other models on the received input.

2.4 Models submitted

The models and setups used for each of the five runs submitted by our team are described below.

Run 0. BERT + sequence sampling. For our first run we used the pre-trained and fine-tuned BERT model. During fine-tuning, the sequence sampling strategy for data augmentation was used.

Run 1. BiLSTM. Run 1 consists of the BiLSTM model described in the previous section.

Run 2. Hierarchical CNN + LSTM. For run 2 we used the hierarchical attention network with CNN and LSTM layers. Due to memory limitations, we only generated predictions for the first 50 posts in the stream: all subsequent predictions (for all datapoints in the stream) were based on these outputs.

Run 3. Ensemble. For this run, we used an ensemble of the first three models: BERT, the BiLSTM and the hierarchical attention network. To obtain prediction scores, we averaged the outputs of the three networks for each input datapoint. A user is considered at risk if the obtained output exceeds the 0.5 threshold.

Run 4. Rolling average of BiLSTM. For our last run, we used the rolling average strategy described in the previous section to obtain a smoothed version of the model's outputs. For each time step, we averaged the outputs of the BiLSTM model for the most recent three inputs (chunks of 50 posts).
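For illustration, the interplay between score smoothing, the decision threshold and the no-alert warm-up on the first 50 posts can be sketched as follows; this is a minimal sketch under our own assumptions (function and variable names are ours), not the exact submission code.

```python
# Illustrative sketch of output smoothing and alert decisions: scores from
# the last three chunks are averaged (run 4), a positive decision is emitted
# once the (smoothed) score exceeds 0.5 (as in run 3), and decisions are
# never reverted once a user has been flagged.
from collections import defaultdict, deque

WINDOW = 3          # number of recent chunk scores averaged (run 4)
THRESHOLD = 0.5     # decision threshold on the output score

recent_scores = defaultdict(lambda: deque(maxlen=WINDOW))
flagged = set()     # users already declared at risk

def update(user_id, chunk_score, warmup=False):
    """Return (score, decision) for one user at one time step.

    `warmup` marks the first 50 posts, where only scores are reported
    and the decision is forced to 0 to avoid premature alerts.
    """
    recent_scores[user_id].append(chunk_score)
    score = sum(recent_scores[user_id]) / len(recent_scores[user_id])
    if warmup:
        return score, 0
    if user_id in flagged or score > THRESHOLD:
        flagged.add(user_id)
        return score, 1
    return score, 0
```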
2.5 Results

Table 1 shows the official results obtained for each of our runs. Evaluation measures included the traditional precision, recall and F1-score computed at user level, as well as metrics specifically designed for measuring how early risk was detected: latency-weighted F1, which is the F1-score weighted by a penalizing factor for late predictions, and ERDE [12], a measure of error that increases when predictions are delayed. For comparison, we include the systems that obtained the best scores for each metric.

Run                  Precision  Recall  F1     ERDE5  ERDE50  latency-weighted F1
BERT+seq-sampling    .469       .654    .546   .291   .154    .462
BiLSTM               .710       .212    .326   .251   .235    .172
HAN                  .271       .577    .369   .339   .269    .298
Ensemble             .846       .212    .338   .248   .232    .178
BiLSTM+rolling       .765       .375    .503   .253   .194    .423
iLab/run 1           .913       .404    .560   .248   .149    .540
SSN NLP/run 1        .283       1       .442   .205   .158    .442
iLab/run 4           .828       .692    .754   .255   .255    .476
iLab/run 2           .544       .654    .594   .134   .118    .592
iLab/run 3           .564       .885    .689   .287   .071    .572
iLab/run 0           .833       .577    .682   .252   .111    .658

Table 1. Official results for task 1

Run                  P@10  NDCG@10  NDCG@100
BERT+seq-sampling    1     1        .68
BiLSTM               .9    .81      .75
HAN                  .6    .69      .48
Ensemble             .9    .81      .75
BiLSTM+rolling       .9    .90      .69
iLab/run 3           1     1        .84

Table 2. Ranking metrics for task 1

The best F1-scores were obtained with the BERT model trained with sequence sampling, showing that pre-trained transformers are powerful for downstream tasks such as self-harm detection, and also that the sequence sampling strategy might be an effective method for data augmentation. The second best results were obtained with the last model: the rolling average of outputs brings a significant improvement to predictions compared to the base model (the simple BiLSTM). We attribute the poorer performance of the HAN and ensemble models to the small amount of test data used for predictions (the first 50 posts in the stream).

A second evaluation approach treats the task as a ranking task, using the systems' continuous risk scores to rank users in order of risk. Metrics specific to ranking tasks are used to measure performance, including precision at k (P@10) and normalized discounted cumulative gain at k (NDCG@10, NDCG@100). In Table 2 we show the evaluation results for our systems using the ranking metrics, measured on the first 500 posts in the input stream. Our models perform well on these metrics, with the first system obtaining perfect scores for both metrics measured at 10. For comparison, we include the system that obtained the best scores in terms of all ranking metrics at 500 writings, submitted by the iLab team.

3 Task 2. Predicting levels of depression

The second task consisted of predicting the level of depression of social media users, by predicting their answers to a 21-question depression assessment questionnaire, where each question has between four and six possible answers. Training data consisting of 20 labelled users was available beforehand. The test data consisted of 70 users' social media posts, and the participating systems had to predict their answers to each of the questions. Several evaluation metrics were used, measuring how well the predictions match the true labels at levels ranging from fine-grained to general, including: average hit rate (AHR), average closeness rate (ACR), average difference between overall depression levels (ADODL) and depression category hit rate (DCHR). We participated with three different models in this task. The details of the models and features used are described below.

3.1 Features

For the first two models, we used some of the same features described in the previous sections. We included the lower-dimensionality numerical features, LIWC and emotion categories, represented as continuous vectors. For obtaining user-level representations, we averaged the values of these vectors computed for each of the user's posts.
Since it has been shown that the evolution of certain behaviors and linguistic markers is in itself predictive of whether a disorder will develop, we chose to capture the variation of the extracted features by including in our feature vectors the standard deviations (in addition to the averages) of the distribution of each feature across a user's history of posts.

For our final model, we tried to leverage pre-trained language models in order to obtain semantic representations of the user's social media posts. To this end, we extracted sentence representations from the Universal Sentence Encoder (USE) [2] for each of the posts in a user's history, obtaining a continuous vectorial representation for each post. A user's representation was obtained by averaging the representations of his/her posts.

3.2 Models

We chose to use simpler, traditional machine learning models for this task, with fewer parameters than the neural networks used in task 1, to suit the small size of the training data: we experimented with SVM and logistic regression models, using the features previously described. All models were trained on the available training data, and the trained models were used to make predictions on the new data in the testing phase. For each of the models, we cast the task as a multi-label, multi-class classification problem, training one model for each of the 21 questions, where each question can be assigned one of 4-6 labels (depending on the question).

LogReg-features. The first model used was a logistic regression model with the lexicon-based features represented as numerical vectors.

SVM-features. For the second run, we used an SVM with an RBF kernel, with the same features as for the previous run.

SVM-USE. Our last model was an SVM with an RBF kernel and USE features.
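A minimal scikit-learn sketch of this per-question setup is given below; the feature matrices, the answer encoding and the function names are illustrative assumptions rather than the exact implementation.

```python
# Illustrative sketch of the task 2 setup: one classifier per questionnaire
# question, each treating its 4-6 possible answers as classes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

N_QUESTIONS = 21

def fit_per_question(X_train, answers, use_svm=True):
    """X_train: (n_users, n_features) user representations.
    answers: (n_users, 21) array of answer indices, one column per question."""
    models = []
    for q in range(N_QUESTIONS):
        clf = SVC(kernel="rbf") if use_svm else LogisticRegression(max_iter=1000)
        clf.fit(X_train, answers[:, q])
        models.append(clf)
    return models

def predict_questionnaire(models, X_test):
    # returns an (n_users, 21) array of predicted answer indices
    return np.column_stack([m.predict(X_test) for m in models])
```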
3.3 Results

Run              AHR     ACR     ADODL   DCHR
LogReg-features  34.01%  67.07%  80.05%  35.71%
SVM-features     34.56%  67.44%  80.63%  35.71%
SVM-USE          36.94%  69.02%  81.72%  31.53%
BioInfo@UAVR     38.30%  69.21%  76.01%  30.00%
iLab run2        37.07%  69.41%  81.70%  27.14%
relai lda user   36.39%  68.32%  83.15%  34.29%

Table 3. Official results for task 2

Table 3 shows the official results for task 2 for all evaluation metrics. Our best models in terms of DCHR were the models using lexicon-based features, which obtained the maximum score on this metric among all participating teams. The model using USE features performs better than the other two on the rest of the metrics. The good scores obtained with simple models and features suggest the problem may not be well suited to complex representations and architectures, possibly due to the small size of the training data. For comparison, we include in the table the results of the systems that obtained the best scores in terms of the other metrics (aside from DCHR). Overall, scores for this task were modest for all participating teams, suggesting that predicting the level of depression is a difficult task.

4 Conclusion

In this paper we presented the contributions of the PRHLT-UPV team to the eRisk 2020 shared tasks: self-harm detection and prediction of depression levels, based on social media text data. We used multi-dimensional features to represent various levels of language, including content, style and emotion. In the first task, where more training data was available, we experimented with different deep learning architectures, including hierarchical attention networks and transformers, as well as with different strategies concerning the experimental setup, such as sequence sampling for data augmentation and a rolling average for smoothing model outputs. For the second task we used traditional models such as SVM and logistic regression, with style and emotion features, as well as semantic sentence representations from pre-trained language models. We obtained the best scores in terms of detecting the general depression category in the second task.

Acknowledgements

The work of Paolo Rosso was carried out in the framework of the research project PROMETEO/2019/121 (DeepPattern), funded by the Generalitat Valenciana.

References

1. Abd Yusof, N.F., Lin, C., Guerin, F.: Analysing the causes of depressed mood from depression vulnerable individuals. In: Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017). pp. 9–17 (2017)
2. Cer, D., Yang, Y., Kong, S.y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)
3. Coppersmith, G., Dredze, M., Harman, C.: Quantifying mental health signals in Twitter. In: Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. pp. 51–60 (2014)
4. Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K., Mitchell, M.: CLPsych 2015 shared task: Depression and PTSD on Twitter. In: Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. pp. 31–39 (2015)
5. De Choudhury, M., Counts, S., Horvitz, E.J., Hoff, A.: Characterizing and predicting postpartum depression from shared Facebook data. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. pp. 626–638 (2014)
6. De Choudhury, M., Gamon, M., Counts, S., Horvitz, E.: Predicting depression via social media. In: Seventh International AAAI Conference on Weblogs and Social Media (2013)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
8. Eichstaedt, J.C., Smith, R.J., Merchant, R.M., Ungar, L.H., Crutchley, P., Preoţiuc-Pietro, D., Asch, D.A., Schwartz, H.A.: Facebook language predicts depression in medical records. Proceedings of the National Academy of Sciences 115(44), 11203–11208 (2018)
9. Lee, S.A., Mathis, A.A., Jobe, M.C., Pappalardo, E.A.: Clinically significant fear and anxiety of COVID-19: A psychometric examination of the coronavirus anxiety scale. Psychiatry Research p. 113112 (2020)
10. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk: early risk prediction on the internet. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 343–361. Springer (2018)
11. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2019 early risk prediction on the internet. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 340–357. Springer (2019)
12. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2020: Early risk prediction on the internet. In: Arampatzis, A., Kanoulas, E., et al. (eds.)
Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). Springer International Publishing (2020)
13. Mitchell, M., Hollingshead, K., Coppersmith, G.: Quantifying the language of schizophrenia in social media. In: Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. pp. 11–20 (2015)
14. Mohammad, S.M., Turney, P.D.: NRC emotion lexicon. National Research Council, Canada 2 (2013)
15. Mohammadi, E., Amini, H., Kosseim, L.: Quick and (maybe not so) easy detection of anorexia in social media posts. In: CLEF (Working Notes) (2019)
16. O'Dea, B., Wan, S., Batterham, P.J., Calear, A.L., Paris, C., Christensen, H.: Detecting suicidality on Twitter. Internet Interventions 2(2), 183–188 (2015)
17. World Health Organization: Depression: A global crisis. World Mental Health Day, October 10 2012. World Federation for Mental Health, Occoquan, Va, USA (2012)
18. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates 71(2001), 2001 (2001)
19. Ragheb, W., Azé, J., Bringay, S., Servajean, M.: Attentive multi-stage learning for early risk detection of signs of anorexia and self-harm on social media. In: CLEF (Working Notes) (2019)
20. Resnik, P., Garron, A., Resnik, R.: Using topic modeling to improve prediction of neuroticism and depression in college students. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pp. 1348–1353 (2013)
21. Sadeque, F., Xu, D., Bethard, S.: UArizona at the CLEF eRisk 2017 pilot task: linear and recurrent models for early depression detection. In: CEUR Workshop Proceedings. vol. 1866. NIH Public Access (2017)
22. Shen, G., Jia, J., Nie, L., Feng, F., Zhang, C., Hu, T., Chua, T.S., Zhu, W.: Depression detection via harvesting social media: A multimodal dictionary learning solution. In: IJCAI. pp. 3838–3844 (2017)
23. Shen, J.H., Rudzicz, F.: Detecting anxiety through Reddit. In: Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology—From Linguistic Signal to Clinical Reality. pp. 58–65 (2017)
24. Trotzek, M., Koitka, S., Friedrich, C.M.: Linguistic metadata augmented classifiers at the CLEF 2017 task for early detection of depression. In: CLEF (Working Notes) (2017)
25. Trotzek, M., Koitka, S., Friedrich, C.M.: Word embeddings and linguistic metadata at the CLEF 2018 tasks for early detection of depression and anorexia. In: CLEF (Working Notes) (2018)
26. Wang, Y.T., Huang, H.H., Chen, H.H.: A neural network approach to early risk detection of depression and anorexia on social media text. In: CLEF (Working Notes) (2018)
27. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1480–1489 (2016)