A Data-Driven Approach for Measuring the Severity of the Signs of Depression using Reddit Posts

Paul van Rijen1,2, Douglas Teodoro1,2, Nona Naderi1,2,3, Luc Mottin1,2, Julien Knafou1,2, Matt Jeffryes1,2, Patrick Ruch1,2

1 BiTeM group, HES-SO / HEG Geneva, Information Sciences, Geneva, Switzerland
2 SIB Text Mining, Swiss Institute of Bioinformatics, Geneva, Switzerland
3 University of Toronto, Toronto, Canada

contact: paul.vanrijen@hesge.ch

Abstract. In response to the CLEF eRisk 2019 shared task on measuring the severity of the signs of depression from threads of user submissions on social media, our team developed a data-driven, ensemble model approach. Our system leverages word polarities, token extraction via mutual information, keyword expansion and semantic similarities to classify Reddit posts according to Beck's Depression Inventory (BDI). Individual models were combined at the post level by majority voting. The approach achieved a baseline performance on the assessed metrics, including Average Hit Rate and Depression Category Hit Rate, being equivalent to the median system within one standard deviation.

Keywords: Depression severity assessment, Social networks, Natural language processing, Machine learning.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

Depression is increasingly recognized as a major burden in public healthcare worldwide [1, 2]. In 2015, the World Health Organization (WHO) estimated that 322 million people were living with depression [2]. Depression is ranked as the single largest contributing factor to non-fatal health loss worldwide [1]. Major depressive disorder is associated with increased morbidity, disability and costs, as well as increased mortality due to co-occurring medical conditions, including cardiovascular and pulmonary diseases, and is a leading cause of suicide [2–4]. In addition to the high burden of disease, the majority of patients (50% globally) do not receive appropriate care [2]. Barriers to proper diagnosis and treatment include social stigma and a low detection rate in primary care [5, 6]. Accurate and early detection of depression can help to lower these barriers and thus mitigate the associated health risks.

Social media networks, such as Facebook, Twitter and Reddit, enable people to share their opinions and sentiments about a wide range of topics online [7]. In recent years, various studies have explored the potential of data from social media networks for detecting signs of depression [8, 9]. In addition, the scientific community has put forward various shared tasks, such as CLPsych [10] and CLEF eRisk [11, 12]. In CLEF eRisk 2018, the objective was to predict whether a user was depressed or not given a set of posts in chronological order. Trotzek et al. [13] achieved the top F1-score using a bag-of-words ensemble method.

For 2019, CLEF eRisk includes a task aimed at measuring the severity of the signs of depression from threads of user submissions on social media [14, 15]. The eRisk task 3 involves filling in a Beck's Depression Inventory (BDI) questionnaire [16], which assesses the presence of feelings such as sadness, pessimism and loss of energy in an individual, using a set of social media posts.
Hence, the task changed from a standard classification task, as in the Early Detection of Signs of Anorexia (task 1) and Self-harm (task 2) tasks of CLEF eRisk 2019 [17], to a combination of an information retrieval and an interactive dialogue task, where the system should simulate how a user would answer and fill in the questionnaire [18]. In response to this challenge, our team developed a data-driven, multi-model approach based on word polarities, mutual information and semantic similarities.

2 Methods

2.1 Beck's Depression Inventory

The BDI questionnaire has 21 questions in the following categories: sadness, pessimism, past failure, loss of pleasure, guilty feelings, punishment feelings, self-dislike, self-criticalness, suicidal thoughts or wishes, crying, agitation, loss of interest, indecisiveness, worthlessness, loss of energy, changes in sleeping pattern, irritability, changes in appetite, concentration difficulty, tiredness or fatigue, and loss of interest in sex. The answers vary on a [0-3] scale, where 0 means the absence of the feeling and 1 to 3 its presence, from a milder (1) to a stronger (3) form.

2.2 Task data

As shown in Table 1, the dataset for eRisk 2019 consists of Reddit posts from 20 users and contains the user identifier, the post timestamp, title and post content. The data was annotated at the user level with the depression severity according to Beck's Depression Inventory. The dataset was shared with the participants without the labels for system development and was also used for model evaluation during the test phase.

Table 1. Statistics on the eRisk 2019 T3 dataset.

# users                     20
# posts                     10491
Median # posts per user     327
Std dev # posts per user    446
Min # posts per user        29
Max # posts per user        1510

2.3 BDI questionnaire answering models

In this section, we describe the models used to automatically fill in the BDI questionnaire using the users' Reddit posts.

Model 1 - Word polarity. In this model, we leverage word polarities to first classify Reddit posts as depressive and then associate posts with relevant BDI dimensions.

Resources. For this model, we made use of the Multi-Perspective Question Answering (MPQA) subjectivity lexicon [19, 20] of over 8000 cues that can be used to express private states, including emotions, evaluations and stances. In this lexicon, cues are annotated with positive, negative, or neutral polarity. In addition, the lexicon provides information regarding the part-of-speech of the cues and whether or not they are stemmed. In our model, we only considered single-word cues to determine whether a post is depressive or not.

To associate posts with the BDI dimensions, we created a lexicon that provides cues for each dimension by first randomly selecting three subjects' writings (subject2341, subject5897 and subject9694). The MPQA single-word cues that appeared in these writings were used as cues for the BDI dimensions lexicon. Next, we expanded the list of cues with the following resources: WordNet [21] to find synonyms, a sexual desires vocabulary [22] and the F.E.A.S.T.'s Eating Disorders Glossary [23]. The annotation process of assigning BDI dimensions to each of the cues was done by three team members. In the final version of the lexicon, we included only the annotations that were agreed upon by at least two annotators. Some cues can be associated with multiple BDI dimensions.
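As an illustration, the synonym expansion step could look as follows. This is a minimal sketch assuming NLTK's WordNet interface; the seed cues shown are hypothetical, and the manual annotation and agreement filtering described above are omitted:

```python
# Minimal sketch of the WordNet-based cue expansion (assumes NLTK and a
# downloaded WordNet corpus: nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def expand_cues(seed_cues):
    """Expand a set of single-word cues with their WordNet synonyms."""
    expanded = set(seed_cues)
    for cue in seed_cues:
        for synset in wn.synsets(cue):
            for lemma in synset.lemmas():
                name = lemma.name().lower()
                if "_" not in name:  # keep single-word cues only
                    expanded.add(name)
    return expanded

# Hypothetical seed cues; in our pipeline, the expanded cues were then
# manually assigned to BDI dimensions by three annotators.
print(sorted(expand_cues({"hate", "sad"})))
```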
As an example of a cue with multiple dimensions, ‘hate’ was associated with both the ‘agitation’ and ‘self-dislike’ dimensions, as illustrated by the following posts: Post 1: ‘Hate when people do that.’; Post 2: ‘My life is already disintegrating, and I hate my grades.’. The final lexicon contained 668 words in total for all BDI dimensions. On average, the lexicon included 30 terms per dimension. The majority of cues (583) were annotated with only a single dimension.

Classifier. First, we tagged the words in each post according to their polarity using the MPQA subjectivity lexicon. Since no training data was available during the official phase, we empirically set a threshold of 0.1 on the ratio of negative to positive words for classifying posts as depressive. Considering only the depressive posts, we then tagged words with BDI dimensions using the developed lexicon. Finally, we calculated the questionnaire responses by normalizing the tag counts for each BDI dimension into a [0-3] score.

Model 2 - Mutual information. In this model, we attempt to create a training dataset from Reddit to classify posts as depressive or not. We used the mutual information measure to extract relevant tokens from depressive posts [24]. Kraskov et al. propose a model to estimate the mutual information I(X, Y) from samples of random points distributed according to some joint probability density µ(x, y), based on entropy estimates from k-nearest neighbor distances.

Data. Two subreddit collections, containing 107,129 posts, were extracted as candidates for providing positive and negative depression tokens. The positive collection included 12 mental health related subreddits, such as Anxiety, depression, eating_disorders, self-harm, social anxiety, and SuicideWatch. The negative collection included 32 general subreddits, such as all, AskReddit, explainlikeimfive, funny, movies, and worldnews.

Training collection. Each post of the positive and negative collections was tokenized, stopword-removed, and stemmed, and unigram tokens were extracted and associated with the respective subreddit. Using the mutual information criterion, the 200 most informative tokens from each collection were used to tag each post. If a post from the positive collection contained more positive tokens, it was deemed positive; similarly, if a post from the negative collection contained more negative tokens, it was deemed negative. The training set was then created from the positive and negative posts tagged with the 200 most informative tokens extracted from both collections. The final training collection contains 3,318 positive and 58,328 negative posts.

Classifier. A logistic regression classifier was trained on the positive and negative posts to categorize posts as depressive or not. Then, keywords from the BDI categories were expanded using WordNet and used to tag the positively classified posts. This model did not take into account the nuances of the positive answers for a BDI category, i.e., it treated the task as binary, assigning answers of 0 (negative) or 2 (positive) for a post.

Model 3 - Semantic similarity. Word embeddings have been shown to capture semantic similarities, and in recent years various models have been proposed to generate such embeddings, including word2vec [25], GloVe [26], and BERT [27]. Here, we propose to find the user posts that are most semantically similar to the questionnaire responses in order to estimate how a user might answer the questionnaire.
Given word embeddings, we generate the representation of each user post by averaging the embeddings of the words in the post. We use a similar approach to represent the questionnaire response vectors, i.e., we average the embeddings of the words in each questionnaire response. We then compute the similarity between a user post and a questionnaire response using cosine similarity. We use pre-trained GloVe word embeddings [26] (trained on 2 billion tweets, with 200 dimensions; https://nlp.stanford.edu/projects/glove/) to represent the words. To filter out irrelevant posts, in a first step we remove the posts that are not similar to the questionnaire responses, using only the noun and verb vectors and an empirically chosen threshold (0.8). For the remaining posts, we compute the vector-based distance between each post and the questionnaire responses and choose the most similar response for that post. Treating each post as the average of its word embeddings does not consider word order and is unlikely to produce a good representation for longer posts, but it has been shown to provide a relatively strong baseline. This post-level representation could be improved by leveraging state-of-the-art sentence embedding models [28].

Model 4 - Ensemble. Two ensemble approaches were tested, micro- and macro-voting, using models 1 to 3. In micro-voting, models were combined at the post level: if a majority of models classified a post as positive for depression, and if two (majority) or three (strict) models classified the post as positive for a category, the category was deemed positive for that user. In macro-voting, the results were combined using the average category prediction of the three models. The official run (BiTeM run 0) was generated using the strict micro-voting ensemble.

2.4 Tuning individual models

For each of the individual models, we applied a threshold k, which determines the minimal number of positive posts (i.e., posts categorized as depressive) the system requires to consider a user as depressed. Only then were responses given to the questions in the 21 BDI categories. Hence, a positive post can be considered a proxy for a depressive episode. As there was no training data for this task, we used an empirical k=5, that is, five depressive episodes were needed to regard a user as depressed. Then, for each category, if multiple answers (0 to 3) were retrieved for a user deemed positive, the system assigned the response with the highest value for that category.

3 Results

The system effectiveness metrics considered for this task are Average Hit Rate (AHR), Average Closeness Rate (ACR), Average Difference between Overall Depression Levels (ADODL) and Depression Category Hit Rate (DCHR). The AHR measures the ratio of cases where the computed questionnaire produces exactly the same answer as the real questionnaire. The ACR measures the averaged absolute distance, on an ordinal scale, between the automated answer and the real answer. ADODL assesses the system's performance by first calculating the overall depression score (the sum of all answers) and then the absolute difference (ad_overall) between the automated and the real overall depression scores. Depression levels are normalized as follows: DODL = (63 - ad_overall) / 63. DCHR measures the fraction of cases in which the automated questionnaire resulted in the same depression severity categorization as the real questionnaire.
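To make these definitions concrete, below is a minimal sketch of the per-user metric computations. The helper names are ours, a uniform [0-3] answer scale across the 21 questions is assumed, the ACR normalization is our reading of the definition, and the depth-of-depression boundaries are those of Table 2 below; the official evaluation script may differ in details:

```python
# Minimal sketch of the task metrics for one user; `auto` and `real` are
# lists holding the 21 automated and real BDI answers, respectively.
def hit_rate(auto, real):
    """Per-user AHR: fraction of answers matching the real questionnaire exactly."""
    return sum(a == r for a, r in zip(auto, real)) / len(real)

def closeness_rate(auto, real, max_dist=3):
    """Per-user ACR: average closeness between automated and real answers
    (assumed normalization: 1 - absolute distance / maximal distance)."""
    return sum(1 - abs(a - r) / max_dist for a, r in zip(auto, real)) / len(real)

def dodl(auto, real):
    """DODL = (63 - ad_overall) / 63, where ad_overall is the absolute
    difference between the automated and real overall depression scores."""
    return (63 - abs(sum(auto) - sum(real))) / 63

def depression_category(score):
    """Depth-of-depression category for an overall score (boundaries of Table 2)."""
    if score <= 9:
        return "minimal"
    if score <= 18:
        return "mild"
    if score <= 29:
        return "moderate"
    return "severe"

def dchr_hit(auto, real):
    """1 if the automated questionnaire yields the real depression category."""
    return int(depression_category(sum(auto)) == depression_category(sum(real)))
```

Averaging each helper's output over the 20 users would then yield AHR, ACR, ADODL and DCHR, respectively.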
Table 2 shows the four depth-of-depression categories and the associated depression levels used in this task [15].

Table 2. Depth-of-depression categories.

Depression category    Depression levels
Minimal                0-9
Mild                   10-18
Moderate               19-29
Severe                 30-63

3.1 Official results

Our team submitted one official run to Task 3 of the eRisk challenge. This run combined the results of the three models described above using voting. The voting was performed in a micro-average fashion, i.e., the results of each model were combined at the post level. Table 3 shows the results of our model. Overall, it achieved a baseline performance for all the metrics, being equivalent to the median model of the participants within one standard deviation.

Table 3. Official evaluation results. BiTeM results in comparison with the overall task systems.

Run           AHR (%)   ACR (%)   ADODL (%)   DCHR (%)
BiTeM run 0   32.14     62.62     72.62       25.00
Median        35.71     66.51     74.81       25.00
Std dev       5.91      5.44      4.04        9.93
Min           22.38     56.19     66.19       5.00
Max           41.43     71.27     81.03       45.00

3.2 Unofficial evaluation results

After the official phase, our team conducted further experiments. In addition to the micro and macro ensemble models, we evaluated the performance of the three models separately. For each of these models, we applied a k=5 threshold, i.e., if the model classified at least 5 posts as positive for a category, then the category was deemed positive for that user. Table 4 describes the results of the various models. Overall, both ensemble micro models outperform the individual models and the ensemble macro model. The ensemble micro majority model outperforms the model used in the official phase on the ACR and ADODL metrics, but at a significant penalty in DCHR.

Table 4. Unofficial evaluation results. Individual models and micro and macro ensembles. *Model used during the official phase.

Run              Voting     AHR (%)   ACR (%)   ADODL (%)   DCHR (%)
Model 1                     29.76     56.75     71.98       30.00
Model 2                     26.42     60.00     73.81       25.00
Model 3                     19.29     47.70     59.44       20.00
Ensemble micro   majority   25.71     66.59     77.38       10.00
Ensemble micro   *strict    32.14     62.62     72.62       25.00
Ensemble macro   average    25.00     63.73     74.37       20.00

3.3 Binary classifier

We repeated the unofficial experiments under the assumption that the questionnaire is binary, i.e., whether the user expressed any feelings of depression or not. Table 5 shows the model results under this assumption. Similar to the non-binary results, the ensemble micro majority model outperforms the model used in the official phase, but again at a significant penalty in the DCHR metric.

Table 5. Evaluation considering the questionnaire as binary. *Model used during the official phase. **Baseline.

Run              Voting     AHR (%)   ACR (%)   ADODL (%)   DCHR (%)
Model 1                     41.91     80.63     84.44       35.00
Model 2                     48.81     82.94     88.33       45.00
Model 3                     50.00     83.33     87.46       25.00
Ensemble micro   majority   56.90     85.63     89.60       15.00
Ensemble micro   *strict    43.81     81.27     84.76       35.00
Ensemble macro   average    48.33     82.78     87.38       25.00
**All 0's                   34.52     78.17     78.17       30.00
**All 1's                   65.48     65.48     88.49       30.00

3.4 Impact of k on the individual models' performance

Fig. 1 shows how the ADODL metric varies as a function of k, i.e., the number of positive posts necessary to confirm a category as positive. Models 1 and 2 present their highest ADODL around k=5, whereas model 3 peaks at k=3. As there was no training set during the official submission phase, we set these values empirically to k=5 for all the individual models based on a manual analysis of some results. We suspected that k=1 would create too many false positives.
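The sweep behind Fig. 1 can be sketched as follows, reusing the dodl helper from the earlier sketch; the data structures (per-user counts of positively classified posts per BDI category) and the fixed positive answer value of 2 are hypothetical simplifications of our models' actual outputs:

```python
# Sketch of the k-threshold sweep: a category is confirmed as positive only
# if at least k posts were classified as positive for it.
def fill_questionnaire(positive_counts, k, answer=2):
    """Hypothetical answer assignment for one user; `positive_counts` maps
    each of the 21 BDI categories to its number of positive posts."""
    return [answer if positive_counts[cat] >= k else 0
            for cat in sorted(positive_counts)]

def sweep_k(users, ks=range(1, 11)):
    """Average ADODL over users for each candidate threshold k; `users` is a
    list of (positive_counts, real_answers) pairs."""
    return {k: sum(dodl(fill_questionnaire(counts, k), real)
                   for counts, real in users) / len(users)
            for k in ks}
```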
A similar pattern is also seen for the other metrics (not shown here for brevity).

Fig. 1. Variation of the models' performance as a function of k.

4 Discussion

Social media can provide valuable resources for assessing individuals' mental health, which could be useful for early detection and the consequent provision of healthcare. We developed a simple model for measuring the severity of the signs of depression from Reddit posts based on word polarities, mutual information and semantic similarities. The ensemble model used in the official phase achieved modest results. This could be explained by the significant negative effect of weak individual models on the construction of the ensemble model.

Nevertheless, both micro ensemble models significantly improved upon the individual models' results for all the metrics apart from DCHR. Indeed, for the DCHR metric, model 1 presented the best performance on the standard questionnaire, predicting the correct depression severity category for 30% of the users. The ensemble macro model did not improve upon the ensemble micro models. One possible cause is the relatively small set of candidate models whose results were averaged to calculate the ensemble category predictions.

Answering the BDI questionnaire without training data proved to be a challenging task. Indeed, even when considering the questionnaire as binary, the participant models were outperformed by a naïve all-positive answer baseline on some of the metrics. Model 2 performed best on DCHR, correctly predicting depression in 45% of the cases when considering the questionnaire as binary. This remarkable improvement over its 25% DCHR performance on the standard questionnaire (Table 4) could be explained by the fact that this model, in contrast to models 1 and 3, treated the task as binary already in its conception, not taking the nuances of positive answers into account.

Finally, as expected, tuning some of the model parameters would significantly improve performance. Indeed, most of the individual and ensemble model parameters, such as the cut-off, k, and the voting weights, were set empirically during the official phase, and the results reported here do not tune them based on the gold standard answers. As shown in Fig. 1, tuning only k, for example, would result in an average improvement of up to 13% on the ADODL metric if we consider k=1 as the baseline. This effect is also seen for the AHR, ACR and DCHR metrics, which could see average relative performance increases of up to 32%, 14% and 33%, respectively, with tuning.

5 Conclusion

Task T3 of CLEF eRisk 2019 aimed to measure the severity of the signs of depression using user threads available on social media. The organizers provided a dataset containing Reddit posts from 20 users, and the goal was to automatically fill in the 21 questions of Beck's Depression Inventory for each of the users. Our team developed a data-driven ensemble model combining sentiment lexicons, mutual information and embedding similarities in order to overcome the lack of training samples. The model achieved a baseline performance, being equivalent to the median system of the overall challenge. Nevertheless, answering the BDI questionnaire without training data proved to be a challenging task, with an average hit rate of less than 42% for the top system (32% in our case). Indeed, for some metrics, our system was outperformed by a naïve all-positive answer baseline in a binary classification.
As next steps, we aim to leverage the post-level evidence created during this task to improve the performance of our classification model.

References

1. Marcus, M., Yasamy, M.T., Van Ommeren, M., Chisholm, D., Saxena, S.: Depression: A global public health concern. World Health Organization Paper on Depression. 6–8 (2012).
2. World Health Organization: Depression and other common mental disorders: global health estimates. World Health Organization (2017).
3. Forte, A., Baldessarini, R.J., Tondo, L., Vázquez, G.H., Pompili, M., Girardi, P.: Long-term morbidity in bipolar-I, bipolar-II, and unipolar major depressive disorders. Journal of Affective Disorders. 178, 71–78 (2015). https://doi.org/10.1016/j.jad.2015.02.011.
4. Kessler, R.C., Berglund, P., Demler, O., Jin, R., Koretz, D., Merikangas, K.R., Rush, A.J., Walters, E.E., Wang, P.S.: The Epidemiology of Major Depressive Disorder: Results From the National Comorbidity Survey Replication (NCS-R). JAMA. 289, 3095–3105 (2003). https://doi.org/10.1001/jama.289.23.3095.
5. Rodrigues, S., Bokhour, B., Mueller, N., Dell, N., Osei-Bonsu, P.E., Zhao, S., Glickman, M., Eisen, S.V., Elwy, A.R.: Impact of Stigma on Veteran Treatment Seeking for Depression. American Journal of Psychiatric Rehabilitation. 17, 128–146 (2014). https://doi.org/10.1080/15487768.2014.903875.
6. Vermani, M., Marcus, M., Katzman, M.A.: Rates of Detection of Mood and Anxiety Disorders in Primary Care: A Descriptive, Cross-Sectional Study. Prim Care Companion CNS Disord. 13, (2011). https://doi.org/10.4088/PCC.10m01013.
7. Gramlich, J.: 5 Facts about Americans and Facebook. Pew Research Center.
8. Choudhury, M.D., Gamon, M., Counts, S., Horvitz, E.: Predicting Depression via Social Media. In: Seventh International AAAI Conference on Weblogs and Social Media (2013).
9. Guntuku, S.C., Yaden, D.B., Kern, M.L., Ungar, L.H., Eichstaedt, J.C.: Detecting depression and mental illness on social media: an integrative review. Current Opinion in Behavioral Sciences. 18, 43–49 (2017). https://doi.org/10.1016/j.cobeha.2017.07.005.
10. Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K., Mitchell, M.: CLPsych 2015 shared task: Depression and PTSD on Twitter. In: Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. pp. 31–39 (2015).
11. Losada, D.E., Crestani, F., Parapar, J.: eRISK 2017: CLEF lab on early risk prediction on the internet: experimental foundations. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 346–360. Springer (2017).
12. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk: Early Risk Prediction on the Internet. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 343–361. Springer (2018).
13. Trotzek, M., Koitka, S., Friedrich, C.M.: Word Embeddings and Linguistic Metadata at the CLEF 2018 Tasks for Early Detection of Depression and Anorexia.
14. Losada, D.E., Crestani, F., Parapar, J.: Early Detection of Risks on the Internet: An Exploratory Campaign. In: European Conference on Information Retrieval. pp. 259–266. Springer (2019).
15. CLEF eRisk: Early risk prediction on the Internet | CLEF 2019 workshop, https://early.irlab.org/.
16. Beck, A.T., Steer, R.A., Carbin, M.G.: Psychometric properties of the Beck Depression Inventory: Twenty-five years of evaluation. Clinical Psychology Review. 8, 77–100 (1988).
17. Naderi, N., Gobeill, J., Teodoro, D., Pasche, E., Ruch, P.: A Baseline Approach for Early Detection of Signs of Anorexia and Self-harm in Reddit Posts. In: Proceedings of the CLEF 2019 Workshop.
18. Sutcliffe, R.F., Peñas, A., Hovy, E.H., Forner, P., Rodrigo, Á., Forascu, C., Benajiba, Y., Osenova, P.: Overview of QA4MRE Main Task at CLEF 2013. In: CLEF (Working Notes) (2013).
19. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level sentiment analysis. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (2005).
20. MPQA Resources, http://mpqa.cs.pitt.edu/#subj_lexicon.
21. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA (1998).
22. Feeling sexual excitement or desire - synonyms and related words | Macmillan Dictionary, https://www.macmillandictionary.com/thesaurus-category/british/feeling-sexual-excitement-or-desire.
23. Eating Disorders Glossary, http://glossary.feast-ed.org/.
24. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev. E. 69, 066138 (2004). https://doi.org/10.1103/PhysRevE.69.066138.
25. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. (2013).
26. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014).
27. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
28. Ethayarajh, K.: Unsupervised random walk sentence embeddings: A strong but simple baseline. In: Proceedings of The Third Workshop on Representation Learning for NLP. pp. 91–100 (2018).