UNED-MED at eRisk 2022: depression detection with TF-IDF, linguistic features and Embeddings

Elena Campillo-Ageitos1, Juan Martinez-Romo1,2 and Lourdes Araujo1,2
1 NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Juan del Rosal 16, Madrid 28040, Spain
2 IMIENS: Instituto Mixto de Investigación, Escuela Nacional de Sanidad, Monforte de Lemos 5, Madrid 28019, Spain

Abstract
Mental health problems, such as depression and anxiety, are conditions that can have very serious consequences if left untreated, and cause the patient great suffering. Research suggests that the way people write can reflect mental well-being and mental health risks, and social media provides a rich source of user-generated text to study. Early detection is crucial for mental health problems, and the shared task eRisk was created with this in mind. This paper describes the participation of the group UNED-MED in the 2022 T2 subtask, in which participants were asked to create systems to detect early signs of depression in Reddit users. Our team proposes two approaches: feature-driven classifiers with features based on text statistics, TF-IDF terms, first-person pronoun use, sentiment analysis and depression terminology; and a Deep Learning classifier with pretrained Embeddings. The official task results are modest and reflect the difficulty of working with depression data.

Keywords
early risk detection, depression detection, natural language processing, data extraction, data relabeling

1. Introduction
Mental health problems such as anxiety and depression are conditions that affect millions of people every year. People with depression may not seek medical attention in time, causing them unnecessary suffering. Some patients forego medical attention because they are not aware that they need it, while others avoid it because of the associated stigma.
Whatever the cause, untreated mental illnesses can worsen with time and lead to serious consequences, such as substance abuse, or even death.

Language is a tool we use to communicate with one another. Besides transmitting the intended message, we also transmit information about ourselves: our upbringing, our mood, our emotional well-being, etc. Several studies have shown a correlation between differences in the way people talk and write and having a mental health condition [1, 2]. This use of language can be studied with Natural Language Processing (NLP) to detect untreated mental health problems.

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
ecampillo@lsi.uned.es (E. Campillo-Ageitos); juaner@lsi.uned.es (J. Martinez-Romo); lurdes@lsi.uned.es (L. Araujo)
0000-0003-0255-0834 (E. Campillo-Ageitos); 0000-0002-6905-7051 (J. Martinez-Romo); 0000-0002-7657-4794 (L. Araujo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Social media sites such as Twitter, Tumblr or Reddit present a natural collection of user-generated texts. People use these sites to communicate with their friends, follow celebrities, and express themselves, producing a vast amount of text to which NLP techniques can be applied for various purposes. Recent research has applied these techniques to automatically detect users who suffer from several mental health issues.

In the case of mental health, early treatment is especially important, since it increases the probability of a good prognosis. The longer a patient goes without medical treatment, the more likely they are to suffer from associated risks. Early detection helps identify these cases before they become bigger problems.
Most work in the literature focuses on detecting people who already suffer from these conditions, but we believe focusing on early detection is important to allow for faster diagnosis and intervention. The eRisk shared task was created with this objective in mind: it focuses on the early detection of mental health problems in social networks. Previous editions have focused on anorexia, self-harm, and pathological gambling. The 2022 eRisk shared task proposed three subtasks, and this paper presents our participation in subtask T2: early detection of depression. An overview of the task and the overall results for all participating teams can be found in [3].

The following sections are organized as follows: Section 2 describes the data we used to train our models; Section 3 details the models we proposed for this task; Section 4 explains the experimental setup; Section 5 summarizes our results after the test stage; finally, Section 6 presents our conclusions and ideas for future work.

2. Dataset Description
Our system for the T2 task was trained and evaluated with two different datasets: 1) the official eRisk 2022 shared Task 2 dataset, and 2) the UNED-MED depression Reddit dataset. This section briefly describes each dataset.

2.1. eRisk 2022 shared Task 2 dataset
This is the official eRisk dataset given to the participants. It is a Reddit early risk detection dataset first presented by Losada et al. in [4]. The dataset comprises a collection of documents, each containing the post submission history of one Reddit user. Users are labeled as one of two classes: positive (at risk of depression) and negative (control group). This year, the training data was formed from the training and testing data of the 2017 and 2018 eRisk depression tasks. We decided to use the data from 2017 to train our model, and the data from 2018 for evaluation. Each post in a user document is either a text submission or a comment on another user's submission.
The posts are ordered chronologically, from earliest to latest. The number of posts per user is indeterminate, as is the length of each post. Additional metadata comes with each post, such as the date of submission. The classes are deeply unbalanced, as can be seen in the Original eRisk column of Table 1: there are seven times more negative users than positive users. This can be a challenge when training classifiers, especially when the most important class is the least represented one.

Table 2 shows some statistics about the dataset. The average message length for positive users is 206 characters, while for negative users it is 170. The 25%, 50%, and 75% values show that most posts are short for both groups of users, but positive users tend to write longer posts, even though the longest post was written by a negative user.

A brief exploration of the data reveals some challenges in the textual data. This is no surprise, since these are not formal texts; they are sentences written by people on the Internet to communicate with each other. It is widely observed that Internet speech has deep particularities, such as loose grammar, incorrect spelling (sometimes on purpose, and with meaning), emoji use, etc. A whole paper could be written on social cues observed only in Internet speech. We also find metadata, such as hyperlinks, references to other users, etc.

2.2. UNED-MED 2022 depression Reddit dataset
One of the most challenging aspects of the official dataset is the deep class imbalance. Detecting positive users is arguably much more important than detecting negative users, but they are underrepresented in the training dataset. We mitigated this problem in previous editions of the shared task by applying data oversampling. On this occasion, we obtained additional data from Reddit, following the strategies described in [4]. We used PRAW, the Python Reddit API Wrapper, to extract new data1.
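The extraction step can be sketched as follows. The helper names (`qualifies`, `extract_user_histories`) are our own illustration, not the code actually used, and the PRAW calls require real Reddit API credentials:

```python
# Sketch of the additional-data extraction (hypothetical helper names;
# the real PRAW calls require Reddit API credentials).
try:
    import praw  # Python Reddit API Wrapper
except ImportError:  # PRAW not installed; the pure-Python helper still works
    praw = None

MIN_SUBMISSIONS = 10  # users with fewer submissions were discarded


def qualifies(num_submissions: int) -> bool:
    """Keep a user only if they have at least MIN_SUBMISSIONS posts/comments."""
    return num_submissions >= MIN_SUBMISSIONS


def extract_user_histories(client_id, client_secret, user_agent, usernames):
    """Download each user's posts and comments, sorted oldest first."""
    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret,
                         user_agent=user_agent)
    histories = {}
    for name in usernames:
        redditor = reddit.redditor(name)
        items = list(redditor.submissions.new(limit=None))
        items += list(redditor.comments.new(limit=None))
        if qualifies(len(items)):
            histories[name] = sorted(items, key=lambda i: i.created_utc)
    return histories
```

The manual review of the search results (checking that users speak about themselves and report an official diagnosis) cannot be automated and sits between the search and the history download.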
We searched Reddit submissions with the following search queries:
• diagnosed AND depression
• I AND am AND diagnosed AND depression

We searched on r/all, and on the following subreddits related to mental health and depression:
• r/addiction
• r/adultdepression
• r/anxiety
• r/anxietyhelp
• r/depression
• r/depression_help
• r/mentalhealth
• r/mentalillness
• r/sad
• r/suicidewatch

Results were manually reviewed to make sure users were talking about themselves and had been officially diagnosed with clinical depression. A list was compiled, and we obtained the post and comment history of each user. Users with fewer than ten submissions were discarded. The resulting dataset is a collection of 235 users. Table 2 shows some statistics for the resulting dataset; they resemble those of the positive users of the original eRisk dataset.

1 https://praw.readthedocs.io/en/stable/

Table 1
Breakdown of the Original eRisk 2022 dataset's number of positive and negative users, plus the additional extracted data.

                 Original eRisk   New eRisk   Combined
Positive users        214            235         449
Negative users       1493              0        1493
All users            1707            235        1942

Table 2
Analysis of message text length in the Original eRisk 2022 dataset and the UNED-MED 2022 dataset (represented as New eRisk).

        Original eRisk   Original positive   Original negative   New eRisk
count      1076582            90222              986360           106588
mean        172.91           205.54              169.92           208.87
std         538.48           398.54              549.41           481.77
min              1                1                   1                1
25%             40               40                  40               33
50%             73               90                  72               79
75%            160              213                 156              199
max          38663            18175               38663            31638

3. Proposed Model
We present three early risk models, named after the classifier algorithm each one uses: 1) Random Forest, 2) XGBoost, and 3) CNN. Models 1 and 2 are based on traditional machine learning techniques, while model 3 applies Deep Learning. Features for models 1 and 2 are a combination of TF-IDF and text-based features; features for model 3 are Embeddings.

Each model consists of three stages: 1) data pre-processing, 2) feature extraction, and 3) classification. The models take the messages of a user one by one and predict whether the user is at risk of depression (1) or not (0). As established by the eRisk task, a positive decision is final, but a negative decision may be rectified later.

3.1. Training data
Based on the data described in Section 2, we created three different training sets:
• Original eRisk: formed by combining the eRisk 2022 shared Task 2 train and test datasets.
• Augmented eRisk: created by incorporating the UNED-MED 2022 depression Reddit dataset into the Original eRisk training set.
• Relabeled eRisk: created by applying a relabeling strategy based on sentiment analysis to the Original eRisk training set.

Table 3
Analysis of message text length in the Original eRisk 2022 dataset after applying relabeling.

        Relabel eRisk   Relabel positives   Relabel negatives
count        531394            11521              519873
mean         160.67           203.26              159.73
std          397.83           224.69              400.77
min               1                1                   1
25%              39               61                  39
50%              72              134                  72
75%             159              262                 157
max           38216             4033               38216

3.1.1. Relabeled eRisk
Labels in the Original eRisk dataset are applied at user level, not at post level. This means that every post from a positive user is labeled positive, and vice versa. Our hypothesis is that not all posts by users at risk contain relevant information that can be detected by an early risk system, and that training a system with such posts labeled as positive makes the system perform worse.

We could test this hypothesis in different ways: for example, we could apply unsupervised learning, or treat the problem as a zero-shot classification problem. As a first approach, we chose to re-classify only posts labeled as positive by using sentiment analysis. Posts with a negative sentiment score above a certain threshold keep their positive label, while the rest are re-classified as negative.
This training set was created by applying this strategy to the Original eRisk training set: posts from positive users were relabeled based on the sentiment analysis strategy. We applied a twitter-XLM-roBERTa-base model trained on tweets and fine-tuned for sentiment analysis2 [5]. Table 3 shows statistics of the dataset after applying the relabeling.

3.2. Preprocessing
Standard text preprocessing was applied to the text of each post. Posts were cleaned, tokenized, and stemmed. Stop words were kept as part of the text, since we believe they are important for this particular task. We used the Python library redditcleaner3 to clean the textual data. We removed Markdown formatting, separated contractions, and removed hyperlinks, HTML tags, numbers and multiple spaces. Finally, all text was lowercased.

3.2.1. Windowfying
Some texts are long, while others are exceptionally short. To smooth out this difference and make sure a significant amount of text is processed in each step without compromising speed, we applied a sliding window to the posts. After cleaning, we joined the text of each post with that of its previous w messages, where w is a configurable parameter. The features, explained next, were calculated on this message window instead of only on the text of the individual post.

2 https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment
3 https://pypi.org/project/redditcleaner/

3.3. Features
We used two different feature strategies, depending on whether the classifier algorithm was a traditional machine learning algorithm (models 1 and 2) or a Deep Learning one (model 3).

3.3.1. Traditional features
The features applied to traditional machine learning were an adapted version of the ones used for eRisk 2021 T2 [6].

Text-based features. We applied two features in this category: 1) text length and 2) number of words.
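The windowfying step and these two text-based features can be sketched as follows; the function names are our own illustration, and w is the configurable window size:

```python
def windowfy(posts, w):
    """Join each post (chronological order) with the text of its previous
    w posts, yielding one message window per post."""
    windows = []
    for i, _ in enumerate(posts):
        start = max(0, i - w)  # earlier posts have fewer predecessors
        windows.append(" ".join(posts[start:i + 1]))
    return windows


def text_features(window):
    """Text length (characters) and number of words of a message window."""
    return {"length": len(window), "n_words": len(window.split())}


windows = windowfy(["a b", "c", "d e f"], w=1)
# windows == ["a b", "a b c", "c d e f"]
```

In the full pipeline these raw counts would additionally be normalized by text length and discretized into bins, as described below.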
We showed in Section 2 that positive users are more likely to write longer texts, so we keep track of this information with these two features.

As in our previous eRisk participations, we applied a collection of features tailored to the depression dataset. Features were normalized by text length and discretized into a fixed number of bins.

First-person pronouns. Several works [7, 8] have established that people with mental health problems such as depression tend to use more first-person pronouns when they speak. We created a feature that counts the number of times a first-person singular pronoun is used in a text.

Depression-related words. In previous editions of the shared task, we applied a wordset of self-harm related terms as a feature. This year, we applied a collection of words related to clinical depression, and the moods and topics associated with it, extracted from [9]. This feature counts the number of depression-related words that appear in a text.

We combined these features with the TF-IDF-based features using SciPy sparse matrices.

3.3.2. TF-IDF features
As in previous years, we trained a TF-IDF featurizer on the positive users of the data and used it to obtain TF-IDF features for each message window. The featurizer was trained exclusively on positive data (in the case of the relabeled dataset, only on those messages that remained positive) because we want to detect words used frequently by positive users.

3.3.3. Embeddings
Embeddings were used for the Deep Learning model. We applied Stanford's pre-trained GloVe [10] Wikipedia 2014 100d word Embeddings. Posts were windowfied and then padded to a length long enough to include the longest messages before applying the Embeddings.

Table 4
UNED-MED eRisk 2022 T2 run configurations.
run   dataset           model           training window size   test sliding window size
0     Original eRisk    XGBoost         30                     30
1     Augmented eRisk   Random Forest   10                     10
2     Relabeled eRisk   XGBoost         100                    100
3     Relabeled eRisk   XGBoost         100                    10
4     Original eRisk    CNN             10                     10

3.4. Classifier Algorithms
We worked with traditional machine learning models and with one Deep Learning model. The classifiers predict whether a message window belongs to a user at risk ("positive") or not ("negative"). As the task specifies, a positive decision is final, but a negative decision may be revised later. We worked with two traditional machine learning models, Random Forest and XGBoost, and one Deep Learning model, a Convolutional Neural Network (CNN).

• Random Forest: We used the scikit-learn4 implementation of the Random Forest classifier [11].
• XGBoost5: An ensemble model from the gradient boosting framework, based on gradient-boosted decision trees with a distributed, optimized implementation.
• CNN: We implemented a Convolutional Neural Network using Keras. The network was formed by a convolutional layer with 64 filters, a GlobalMaxPooling layer, a Dense layer with ReLU activation, and an output Dense layer with sigmoid activation.

3.4.1. Training strategy
We applied descending training weights to positive posts, in order to make our system prioritize earlier messages and detect positive users as fast as possible. Messages created by negative users were all assigned the same training weight (1). Messages created by positive users were assigned descending weights, from oldest to most recent message, through a fixed rate (2 to 1). Our working notes from the eRisk 2021 task [6] present a thorough explanation of the algorithm used to calculate these training weights.

4. Experimental Setup
Table 4 shows the parameter configurations for the five different runs. Each run uses a different combination of training data, classifier model, and training and test window sizes. All runs used weighted training.
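The exact weighting algorithm is detailed in our 2021 working notes [6]; the sketch below assumes a simple linear interpolation from 2 (oldest post) down to 1 (most recent) for positive users, which captures the idea but may differ from the original implementation:

```python
def training_weights(n_posts, is_positive, w_max=2.0, w_min=1.0):
    """Per-post sample weights for one user (sketch, assuming a linear rate).

    Negative users: constant weight 1 for every post.
    Positive users: weights descend linearly from w_max (oldest post)
    to w_min (most recent), so earlier evidence counts more in training.
    """
    if not is_positive:
        return [1.0] * n_posts
    if n_posts == 1:
        return [w_max]
    step = (w_max - w_min) / (n_posts - 1)
    return [w_max - i * step for i in range(n_posts)]


weights = training_weights(5, is_positive=True)
# weights == [2.0, 1.75, 1.5, 1.25, 1.0]
```

These values can be passed directly as `sample_weight` to scikit-learn and XGBoost `fit` calls.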
4 https://scikit-learn.org
5 https://xgboost.readthedocs.io/en/stable/

Table 5
eRisk 2022 T2 decision-based evaluation. Our team's results (UNED-MED) are compared to the best results for each metric. Our best results for each metric and the overall best results for the rest of the teams are bolded.

team name          run   P      R      F1     ERDE5   ERDE50   latency_tp   speed   latency-weighted F1
UNED-MED           0     .119   .969   .212   .091    .056     18           .934    .198
UNED-MED           1     .139   .980   .244   .079    .046     13           .953    .233
UNED-MED           2     .122   .939   .215   .086    .057     15           .945    .204
UNED-MED           3     .131   .949   .231   .084    .051     15           .945    .218
UNED-MED           4     .084   .163   .111   .079    .078     251          .252    .028
NLPGroup-IISERB    0     .682   .745   .712   .055    .032     9            .969    .690
BLUE               0     .395   .898   .548   .047    .027     5            .984    .540
UNSL               2     .400   .755   .523   .045    .026     3            .992    .519

5. Results and Discussion
In this section we analyze the task results of our participation. Table 5 shows the metric results of our five runs, plus the results of the best groups according to different metrics. Our results this year were modest, with a best latency-weighted F1 of 0.233, compared to NLPGroup-IISERB's 0.690.

Comparing our runs, run 1 obtained the best results overall in all the available metrics, followed by run 3; the ranking metrics below also show our best results in these two runs. Run 1 used the Random Forest model trained on the augmented dataset, while run 3 was an XGBoost model trained on the relabeled dataset. The runs also differ in the training window size: 10 for run 1, and 100 for run 3.

While modest, we believe these differences show that the strategies to improve the training dataset worked favorably. Our best results were obtained with a model trained on the augmented dataset, which used an increased number of positive users during training. It would be interesting to see how the models would behave if we trained the XGBoost model on the augmented dataset instead, and the Random Forest model on the relabeled dataset.
Unfortunately, due to the number of combinations we wanted to test, we could not fit these combinations into the official task.

Smaller feature window sizes appear to yield better results, as can also be seen in the ranking of our results. Run 1 was trained on windows of size 10 and evaluated on sliding windows of the same size. Run 3 was trained on windows of size 100, but evaluated on sliding windows of size 10. It appears that smaller sizes are better at test time, but may not be necessary during training.

Our worst results were obtained with run 4, the Deep Learning model. We can only speculate why this happened, since Deep Learning models usually perform better than traditional learning models in similar circumstances. Despite the sliding window size being 10, the same as in runs 1 and 3, the latency value is very high compared to our other runs (251, versus less than 20 for all the others). This makes us think that something may have gone wrong with the implementation of this model.

Overall, we believe the depression task has been significantly more challenging than previous editions of the eRisk task.

Table 6
Ranking-based evaluation. Our team's results (UNED-MED) are compared to the best teams' results. The best results overall are bolded.

                          1 writing                    100 writings
team               run    P@10   NDCG@10   NDCG@100    P@10   NDCG@10   NDCG@100
UNED-MED           1      .50    .44       .26         .70    .76       .50
UNED-MED           3      .80    .82       .29         .60    .44       .31
BLUE               0      .80    .88       .54         .60    .56       .59
BLUE               1      .80    .88       .54         .70    .64       .67
BLUE               2      .80    .75       .46         .40    .40       .30
TUA1               0      .80    .88       .44         .60    .72       .52
TUA1               2      .80    .88       .44         .60    .72       .52
Sunday-Rocker2     3      .80    .88       .41         .50    .50       .23
UNSL               1      .80    .88       .46         .60    .73       .64
Sunday-Rocker2     1      .70    .81       .39         .90    .93       .66
NLPGroup-IISERB    1      .30    .32       .13         .90    .81       .27
NLPGroup-IISERB    4      .00    .00       .04         .90    .93       .66
NLPGroup-IISERB    0      .00    .00       .02         .90    .92       .30
CYUT               3      .10    .07       .12         .70    .70       .57
CYUT               4      .10    .06       .12         .60    .68       .55
We observed this too while preparing our systems during the training phase, and we believe it might be due to the nature of the data. eRisk datasets are obtained by searching for users on Reddit who have explicitly said that they were diagnosed with the mental health problem the task is about (depression, in this case). While other problems such as anorexia and self-harm still carry a lot of stigma, openly talking about one's depression is increasingly common these days. Therefore, it is likely that most positive users in previous years, when anorexia or self-harm were detected, had accounts created exclusively to talk about that specific problem (Reddit users call these kinds of accounts "throwaways"), while positive users in the depression dataset are regular users who use their Reddit account to talk about their hobbies, interests, etc.

Table 6 shows our ranked results compared to the teams that obtained the best results in this category. Here we can see better results in some of the categories for runs 1 and 3. Our results are better at the beginning, when only one message has been processed, and they decrease as time goes by. This might indicate that our system performs better when only a limited number of messages per user has been observed, and that observing too many messages yields too many false positives. This is consistent with the results in Table 5, where Recall is very high for four of the five runs, while Precision is very low.

Table 6 (continued)
Ranking-based evaluation. Our team's results (UNED-MED) are compared to the best teams' results. The best results overall are bolded.
                          1 writing                    100 writings
team               run    P@10   NDCG@10   NDCG@100    P@10   NDCG@10   NDCG@100
UNED-MED           1      .60    .64       .47         .80    .74       .50
UNED-MED           3      .80    .73       .36         .40    .51       .30
BLUE               0      .80    .81       .66         .80    .80       .68
BLUE               1      .80    .84       .74         .80    .86       .72
BLUE               2      .30    .35       .20         .30    .38       .16
TUA1               0      .60    .67       .52         .70    .80       .57
TUA1               2      .60    .67       .52         .70    .80       .57
Sunday-Rocker2     3      .60    .69       .34         .00    .00       .00
UNSL               1      .60    .73       .66         .60    .71       .66
Sunday-Rocker2     1      .90    .88       .65         .00    .00       .00
NLPGroup-IISERB    1      .80    .84       .33         .00    .00       .00
NLPGroup-IISERB    4      .90    .92       .69         .00    .00       .00
NLPGroup-IISERB    0      .90    .92       .33         .00    .00       .00
CYUT               3      .70    .72       .59         .80    .74       .60
CYUT               4      .60    .69       .59         .80    .84       .61

6. Conclusions and Future Work
This paper presented the UNED-MED participation in the eRisk 2022 T2 task. We developed several classifier models based on TF-IDF, text-based and specially tailored features, and a Deep Learning classifier model with Embeddings. We also implemented several strategies to reduce the imbalance of the training data: we obtained more data from Reddit, and we relabeled the original training data. The test results show that our systems obtain modest results, and that more effort is needed to achieve state-of-the-art results. In future work, we would like to keep exploring strategies to relabel the data, and perhaps experiment with zero-shot learning, which would allow the system to be ported from one kind of disease to another with minimal effort.

Acknowledgments
This work has been partially supported by the Spanish Ministry of Science and Innovation within the DOTT-HEALTH Project (MCI/AEI/FEDER, UE) under Grant PID2019-106942RB-C32, as well as project EXTRAE II (IMIENS 2019), the research network AEI RED2018-102312-T (IA-Biomed), and a predoctoral contract UNED - Santander.

References
[1] M. De Choudhury, S. Counts, E. Horvitz, Social Media as a Measurement Tool of Depression in Populations, in: Proceedings of the 5th Annual ACM Web Science Conference, WebSci '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 47–56.
[2] J. W.
Pennebaker, M. R. Mehl, K. G. Niederhoffer, Psychological Aspects of Natural Language Use: Our Words, Our Selves, Annual Review of Psychology 54 (2003) 547–577.
[3] J. Parapar, P. Martín Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2022: Early Risk Prediction on the Internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association, CLEF 2022, Springer International Publishing, 2022.
[4] D. E. Losada, F. Crestani, A Test Collection for Research on Depression and Language Use, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, CLEF 2016, Évora (Portugal), 2016, pp. 28–29. URL: https://tec.citius.usc.es/ir/pdf/evora.pdf.
[5] F. Barbieri, L. E. Anke, J. Camacho-Collados, XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond, 2021. URL: https://arxiv.org/abs/2104.12250. doi:10.48550/ARXIV.2104.12250.
[6] E. Campillo-Ageitos, H. Fabregat, L. Araujo, J. Martínez-Romo, NLP-UNED at eRisk 2021: self-harm early risk detection with TF-IDF and linguistic features, in: CLEF, 2021.
[7] J. W. Pennebaker, The Secret Life of Pronouns: What Our Words Say About Us, Bloomsbury Press, 2011.
[8] T. Edwards, N. S. Holtzman, A meta-analysis of correlations between depression and first person singular pronoun use, Journal of Research in Personality 68 (2017) 63–68. URL: http://dx.doi.org/10.1016/j.jrp.2017.02.005.
[9] D. Mowery, H. Smith, T. Cheney, G. Stoddard, G. Coppersmith, C. Bryan, M. Conway, Understanding Depressive Symptoms and Psychosocial Stressors on Twitter: A Corpus-Based Study, Journal of Medical Internet Research 19 (2017) e48. URL: https://doi.org/10.2196/jmir.6895. doi:10.2196/jmir.6895.
[10] J. Pennington, R. Socher, C. D. Manning, GloVe: Global Vectors for Word Representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. URL: http://www.aclweb.org/anthology/D14-1162.
[11] L. Breiman, Random Forests, Machine Learning 45 (2001) 5–32.
URL: https://doi.org/10.1023/a:1010933404324. doi:10.1023/a:1010933404324.