USDB at eRisk 2020: Deep learning models to
    measure the Severity of the Signs of Depression
                 using Reddit Posts

Amina MADANI¹, Fatima BOUMAHDI¹, Anfel BOUKENAOUI¹, Mohamed Chaouki KRITLI¹,
                         and Hamza HENTABLI²

          ¹ LRDSI Laboratory, BLIDA 1 University, Blida, Algeria
            {a_madani,f_boumahdi}@esi.dz, anfelboukenaoui@gmail.com,
            kritlichawki56@gmail.com
          ² Information Assurance and Security Research Group, Faculty of Computing,
            Universiti Teknologi Malaysia, Malaysia
            hentabli_hamza@yahoo.fr


        Abstract. In this paper, we describe the participation of our USDB
        group (University of Saad Dahleb Blida) in the shared task T2 of the
        eRisk Lab at the CLEF 2020 workshop. This task focused on measuring
        the severity of the signs of depression from a thread of user posts. In
        response to this task, we study the performance of two deep learning
        models (CNN and BiLSTM) in order to provide further perspectives for
        depression research.

        Keywords: Depression severity · Social networks · Natural language
        processing · Deep learning · CNN · BiLSTM · Sentiment analysis.


1     Introduction

Depression identification has been studied in many fields, including psychiatry,
psychology, medicine and even sociolinguistics. Depression comes in different
degrees, and it is usually assessed with one of the questionnaires commonly used
by psychologists, such as the Center for Epidemiologic Studies Depression Scale
(CES-D) [26], Beck’s Depression Inventory (BDI) [4] and Zung’s Self-Rating
Depression Scale (SDS) [38]. However, these assessments rely on the patient’s
own observations or on those of a third party rather than on empirical data. The
results are therefore subjective and can easily be manipulated, often to obtain
antidepressants or to hide one’s own depression from peers [21].
    Twitter, Facebook and Reddit are social media platforms that allow people
to share their opinions and their personal thoughts. It has been shown that
such data can be used to study clinical matters, especially mental illnesses
such as depression.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25
    September 2020, Thessaloniki, Greece.

    Many studies have analyzed the usefulness of these data for detecting signs
of depression. Furthermore, the scientific community has set up shared tasks
such as eRisk (Early Risk Prediction on the Internet) at CLEF (Conference and
Labs of the Evaluation Forum).
    In this paper, we describe the participation of our team, USDB, in Task 2
of CLEF eRisk 2020. The goal of the task was to measure the severity of the
signs of depression from a set of user posts on Reddit. Participants had to
answer the standard BDI questionnaire for each of 70 users, using the text of
their posts.
    Our team proposed an approach that automatically fills in the BDI
questionnaire using two deep learning models: a Convolutional Neural Network
(CNN) [15] and a Bidirectional Long Short-Term Memory network (BiLSTM)
[3,35,36], which is a type of recurrent neural network. To generate different
runs, we then apply two statistical methods to the model outputs. To the best
of our knowledge, this is the first time that neural networks have been used to
measure the severity of the signs of depression.
    The rest of this paper is organized as follows. Section 2 presents related
work on the same problem. Section 3 describes our proposed approach. The
following three sections describe the dataset, the evaluation metrics and the
results obtained. Finally, we conclude the paper and discuss future
perspectives for our proposed approach.


2   Related work

The eRisk shared task has been organised every year since 2017. In 2017 and
2018, the challenge consisted of the early detection of the risk of depression.
Several researchers focused on this problem and proposed interesting models
that they evaluated on the eRisk dataset [9,28,30,19,22,27,33,31,7].
    In 2019, a new task was added: measuring the severity of the signs of
depression, which consists of estimating the level of depression from a thread
of user submissions. The results of all participating teams can be found in
[17]. The works of several of these teams are described in [29,2,6,32].
    In [29], the authors proposed a rule-based method that combines machine
learning with psycholinguistic and behavioural patterns. They divided the 21
questions into 6 groups: Depression, Guilt, Appetite, Anxiety, Fatigue and
Sleep. They produced the responses of each user from the occurrences of the
features considered for each group.
    The authors of [2] extracted features from users’ texts using the GPT-1
(Generative Pre-trained Transformer version 1) language model [25,37] and the
Linguistic Inquiry and Word Count (LIWC) tool [23]. They predicted responses in
two ways, unsupervised and supervised. In the unsupervised setting, they used
GPT-1 to build a vector representation of the user and of each possible
response, and chose the response whose vector had the highest cosine similarity
with the user’s vector. In the supervised setting, they used data from a PhD
study in which psychology students answered the BDI and other questionnaires
and also completed a writing exercise about a negative personal problem. They
then trained support vector machines on these data. They also submitted another
supervised approach that used GPT-1 features and AutoSklearn [10]. The authors
of [2] concluded that the task is difficult without a training dataset, since
it must then be treated as unsupervised, and that annotated data are needed to
improve the quality of depression prediction.
    The authors of [6] used SS3 [5], a word-based classifier that estimates
risk from term statistics. To train SS3, they adapted the dataset of the eRisk
2018 depression detection task. They transformed the output of SS3 from a
2-dimensional vector into a BDI depression rank. All the questions were
answered with 0 for persons whose depression rank was less than or equal to 0.
For the others, they used different methods (based on textual hints, word
matching, etc.). SS3 obtained the best AHR and ACR values, and the second-best
ADODL and DCHR.
    To automatically fill in the BDI questionnaire, the authors of [32] developed
four models. They submitted only the results of the fourth model. The models
are:
 – Word polarity model using the Multi Perspective Question Answering (MPQA)
   subjectivity lexicon [34,1].
 – Mutual information model by creating a training dataset from Reddit and
   using the mutual information measure to extract important tokens from
   depressive messages [14].
 – Semantic similarity model, which is based on a post-level representation.
   Pre-trained GloVe word embeddings [24] are used to represent the words.
 – In the fourth model, the results of the three models are combined using
   voting.
    According to [16], no team was able to reach the best results on every
evaluation measure, because of the difficulty of the challenge and probably the
similarity of the approaches.


3   Method
In this section, we introduce the architecture of our proposed approach (see
Fig. 1). First, we preprocess the posts. We keep the keywords that carry the
most important information by removing special characters, punctuation, URLs
and stop-words. All remaining words are stemmed and lemmatized to remove noise
from the posts.
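
The cleaning step described above can be illustrated with the following minimal sketch. It is only an approximation under stated assumptions: the stop-word list and the function name `preprocess` are hypothetical, and stemming and lemmatization are left out because they would require an external NLP library.

```python
import re

# Tiny illustrative stop-word list (hypothetical); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "i", "is", "am", "are", "to", "of", "and", "at", "in", "it"}

# Matches http(s) links and bare "www." links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def preprocess(post):
    """Lower-case a post, drop URLs, special characters and stop-words; return tokens."""
    text = URL_PATTERN.sub(" ", post.lower())
    # Replace every character that is not a letter, digit or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = preprocess("I am so tired of it all... see https://example.com/x !!!")
print(tokens)
```

Removing URLs before punctuation matters here: stripping punctuation first would break a link into several meaningless tokens.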
    Posts differ in length, so padding is necessary to give all inputs the
same size. We fixed the sequence length of posts to 250, and shorter input
sequences are padded with zeros.
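
A minimal sketch of this length normalisation is given below. The length 250 and the zero padding value come from the text; appending the zeros at the end of the sequence and truncating longer posts are assumptions made for illustration, and the function name is hypothetical.

```python
MAX_LEN = 250  # sequence length stated in the text
PAD_ID = 0     # padding value stated in the text


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return exactly max_len ids: clip long inputs (assumption), zero-pad short ones."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


short_post = pad_or_truncate([5, 3, 8])
long_post = pad_or_truncate(list(range(1, 301)))
print(len(short_post), len(long_post))
```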
    Next, we map words to distributed representations in a vector space using
the Skip-gram model [20], which is trained to predict the context words of a
given target word. For a given sequence of words, the objective is to find word
representations that are useful for predicting the surrounding words in a post.
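
To make the Skip-gram training signal concrete, the sketch below enumerates the (target, context) pairs that such a model learns to predict. The window size and the function name are assumptions chosen for illustration; the embedding training itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["feel", "tired", "every", "day"], window=1)
print(pairs)
```

Each pair is one training example: given the target word, the model should assign a high probability to the context word.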
    After that, the sentences are encoded by means of a CNN or a BiLSTM
model. Our first method is based on a CNN model, one of the most popular
deep neural networks. Our model consists of:

 – two one-dimensional convolutional layers,
 – a max-pooling layer: keeps only the most important values in each feature
   map obtained from the previous layers,
 – two other one-dimensional convolutional layers,
 – a fully connected input layer: flattens the outputs of the convolutional
   layers to map them into a single vector,
 – a fully connected layer: applies weights to predict the correct label,
 – a fully connected output layer: gives the final probabilities.

    In these layers, each linear activation is passed through a ReLU (Rectified
Linear Unit). The rectified linear activation function outputs its input
directly if it is positive; otherwise, it outputs zero.
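
The basic operations of these layers can be illustrated on a toy example. The sketch below is deliberately simplified and is not the trained network: it uses a single scalar channel and one hand-picked filter, whereas the real layers apply many learned filters to word-embedding sequences. All names and values are hypothetical.

```python
def relu(x):
    """Rectified linear unit: return x if it is positive, otherwise zero."""
    return x if x > 0 else 0.0


def conv1d(sequence, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation form) followed by ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        linear = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(relu(linear))
    return out


def max_pool1d(feature_map, pool_size=2):
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [max(feature_map[i:i + pool_size])
            for i in range(0, len(feature_map) - pool_size + 1, pool_size)]


signal = [1.0, -2.0, 3.0, 0.0, -1.0, 2.0]
features = conv1d(signal, kernel=[1.0, 1.0])
pooled = max_pool1d(features, pool_size=2)
print(features, pooled)
```

The negative sums are clipped to zero by ReLU, and pooling keeps the strongest response in each pair of positions.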
    Our second method uses Recurrent Neural Networks (RNN) [8,12] with a
BiLSTM [11] to make predictions on sequences of text. LSTM [13] is used in many
problems because of its ability to remember information over long periods of
time. An LSTM uses three gates to capture long-term dependencies: the forget
gate chooses which components of the previous cell state must be forgotten, the
input gate controls which new information is added to the cell state, and the
output gate determines which information from the current cell state is output.
The BiLSTM model maintains two separate states for the input using two
different LSTMs. The first LSTM reads the sentence forward from its beginning,
while the second LSTM is fed the input sequence backward. A BiLSTM thus
captures information from the inputs on both sides of each position and can
learn faster than an LSTM model.
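
The gate mechanism and the two reading directions can be sketched with a single-unit LSTM on scalar inputs. This is an illustration under strong simplifying assumptions, not the trained model: real layers use weight matrices and vector states, and the parameter values below are arbitrary and hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM; params maps a gate name to (w_x, w_h, b)."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the previous cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_lstm(xs, params):
    """Run the LSTM over a sequence from a zero state; return all hidden states."""
    h, c = 0.0, 0.0
    hidden = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        hidden.append(h)
    return hidden


def run_bilstm(xs, params_fwd, params_bwd):
    """Pair forward and backward hidden states at each position."""
    forward = run_lstm(xs, params_fwd)
    backward = run_lstm(xs[::-1], params_bwd)[::-1]  # re-align with original order
    return list(zip(forward, backward))


params = {"forget": (0.5, 0.1, 0.0), "input": (0.6, -0.2, 0.0),
          "candidate": (1.0, 0.3, 0.0), "output": (0.4, 0.2, 0.0)}
states = run_bilstm([0.5, -1.0, 2.0], params, params)
print(states)
```

At every position the forward state summarises the words to the left and the backward state the words to the right, which is what lets the model use context from both sides.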
    Finally, both deep learning models end with a fully connected softmax
layer that takes the learned representation as its input feature vector. The
outputs are decimal probabilities for each possible answer. Each post receives
exactly one answer for each question.
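
As a small illustration of this output step, the sketch below turns raw scores into probabilities with softmax and keeps the most probable answer. Taking the arg-max is an assumption consistent with one answer per question; the scores and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax: probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_answer(logits):
    """Index of the most probable answer (0..3 for a standard BDI question)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


probabilities = softmax([2.0, 1.0, 0.1, -1.0])
print(probabilities, predict_answer([0.1, 0.2, 3.0, 0.5]))
```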
    For each post, each of our two models generates 21 outputs, which are the
answers to the BDI questions. Finally, to determine the level of depression of
a user, we use two statistical methods to generate 4 runs. The first two runs
use the CNN model, while the other two use the BiLSTM model. For run1
(CNN_max) and run3 (BiLSTM_max), we compute, for each question, the frequency
of each generated answer and choose the most frequent one, on the assumption
that it is the most relevant. For run2 (CNN_suite) and run4 (BiLSTM_suite), we
compute, for each question, the longest sequence of identical consecutive
answers among all generated answers, and choose the answer that forms this
longest sequence as the response to the question.
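
The two aggregation strategies can be sketched as follows for one question of one user. This is an interpretation of the description above rather than the exact implementation: the function names are hypothetical, and the tie-breaking rule (the first answer encountered wins) is an assumption, since the text does not specify one.

```python
from collections import Counter
from itertools import groupby


def most_frequent_answer(answers):
    """'max' strategy: the answer predicted for the largest number of posts."""
    return Counter(answers).most_common(1)[0][0]


def longest_run_answer(answers):
    """'suite' strategy: the answer forming the longest run of consecutive predictions."""
    best_answer, best_length = None, 0
    for answer, group in groupby(answers):
        run_length = sum(1 for _ in group)
        if run_length > best_length:  # strict: the earliest longest run wins ties
            best_answer, best_length = answer, run_length
    return best_answer


# Per-post predictions for one question, in posting order (made-up values).
per_post = [1, 1, 0, 2, 2, 2, 1, 0, 1, 1]
print(most_frequent_answer(per_post), longest_run_answer(per_post))
```

In this made-up example the two strategies disagree: answer 1 is the most frequent overall, while answer 2 forms the longest uninterrupted run.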
Fig. 1. The architecture of our proposed approach.
4    Dataset description

eRisk 2020 Task 2 is a continuation of 2019’s task 3. This year, a dataset with
70 files was provided. Each file contains a set of posts of one user on Reddit. In
Table 1, we present the characteristics of the dataset.


                Table 1. Statistics on the eRisk 2020 Task 2 dataset.

                      Number of users                    70
                      Number of posts                 35562
                      Average number of posts per user 508
                      Min number of posts per user       25
                      Max number of posts per user     1355


    Based on a user’s history of posts, Task 2 aimed to estimate the user’s
level of depression by automatically answering each question of the BDI
questionnaire. The questionnaire has 21 questions covering different kinds of
feelings, such as sadness, pessimism, crying, loss of energy, etc.
    The possible responses are 0, 1a, 1b, 2a, 2b, 3a, 3b for questions 16 and 18
and 0, 1, 2, 3 for the rest of the questions.
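
When an overall depression score is needed, the answer codes have to be turned into numbers. The sketch below assumes that the letter variants of questions 16 and 18 carry the score of their leading digit (so that 1a and 1b both count as 1); the function name is hypothetical.

```python
def answer_score(answer):
    """Numeric severity of a BDI answer code, e.g. '2b' -> 2 and '3' -> 3."""
    return int(str(answer)[0])


answers = ["0", "1a", "3b", "2"]
total = sum(answer_score(a) for a in answers)
print(total)
```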
    The 2019 questionnaires and their golden truth responses were provided
and can therefore be used as a training dataset.


5    Evaluation metrics

Four metrics are used to evaluate the results. For all of them, higher values
are better:

 – Average Hit Rate (AHR): the Hit Rate (HR) is the percentage of questions
   for which our answer is the same as the golden truth response; AHR is the
   HR averaged over all users.
 – Average Closeness Rate (ACR): the Closeness Rate (CR) is derived from the
   absolute difference between our answer and the true one, so that answers
   closer to the truth score higher; ACR is the CR averaged over all users.
 – Average Difference between Overall Depression Levels (ADODL):
   for a user, the absolute difference between the generated overall depression
   score (sum of all the answers) and the real score is computed and converted
   into a similarity score, which is then averaged over all users.
 – Depression Category Hit Rate (DCHR): the sum of the answers to the 21
   questions places a user in one of four depression categories: minimal
   depression, mild depression, moderate depression and severe depression.
   DCHR is the fraction of users for whom the category obtained from the
   automated answers is the same as the category of the real questionnaire.

    More details about these measures, together with examples, are given in [17].
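
For illustration, the four metrics can be sketched as below on numeric answers (0 to 3). The normalisations (a maximum difference of 3 per question and of 63 per questionnaire) and the category bands (0-9 minimal, 10-18 mild, 19-29 moderate, 30-63 severe) reflect our reading of the task overview [17] and should be checked against it; all function names are hypothetical.

```python
MAX_ANSWER_DIFF = 3   # answers are scored from 0 to 3
MAX_TOTAL_DIFF = 63   # 21 questions x 3 points


def hit_rate(pred, gold):
    """Fraction of questions answered exactly as in the golden truth."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def closeness_rate(pred, gold):
    """Mean of (3 - |pred - gold|) / 3 over the questions."""
    return sum((MAX_ANSWER_DIFF - abs(p - g)) / MAX_ANSWER_DIFF
               for p, g in zip(pred, gold)) / len(gold)


def dodl(pred, gold):
    """(63 - |sum(pred) - sum(gold)|) / 63 for one user."""
    return (MAX_TOTAL_DIFF - abs(sum(pred) - sum(gold))) / MAX_TOTAL_DIFF


def category(total):
    """Depression category of an overall score (assumed bands, see above)."""
    if total <= 9:
        return "minimal"
    if total <= 18:
        return "mild"
    if total <= 29:
        return "moderate"
    return "severe"


def evaluate(users):
    """users: list of (pred, gold) pairs of 21 numeric answers each."""
    n = len(users)
    return {
        "AHR": sum(hit_rate(p, g) for p, g in users) / n,
        "ACR": sum(closeness_rate(p, g) for p, g in users) / n,
        "ADODL": sum(dodl(p, g) for p, g in users) / n,
        "DCHR": sum(category(sum(p)) == category(sum(g)) for p, g in users) / n,
    }


# One made-up user: 20 of 21 answers correct, one answer off by two points.
gold = [1] * 21
pred = [1] * 20 + [3]
scores = evaluate([(pred, gold)])
print(scores)
```

The example shows why the metrics can disagree: a single wrong answer lowers AHR, costs less in ACR and ADODL because the error is small, and leaves DCHR untouched because the overall category does not change.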
6     Results
This year, 5 teams submitted 17 different runs to eRisk Task 2. Our team
submitted 4 runs to this task. Table 2 shows our results compared with the best
results for each metric. The results of all participants can be found in [18].
We observe that no single team was able to achieve the best results on all
four evaluation metrics.
    Comparing only our runs, run 2 and run 4 did not perform well on the
dataset. Run 1 performed best on AHR, with 34.97% of the answers correct, and
on DCHR, predicting the correct depression severity category for 25.71% of the
users. Run 3 had the highest ACR (67.78%) and ADODL (79.30%) scores of our four
runs. For each model, the max run scores higher than the suite run on all four
metrics. We therefore observe that the first statistical method, based on the
frequency of answers, works better than the second method, based on the longest
sequence of identical answers.
    Most importantly, we believe that combining the CNN model with the BiLSTM
model could improve the feature extraction process and thereby lead to better
predictions.

    Table 2. Evaluation of our runs along with the best results achieved in task 2.

              Run                  AHR      ACR      ADODL    DCHR
              Run1: CNN_max        34.97%   67.19%   76.85%   25.71%
              Run2: CNN_suite      32.79%   66.08%   76.33%   17.14%
              Run3: BiLSTM_max     34.01%   67.78%   79.30%   22.86%
              Run4: BiLSTM_suite   33.54%   67.26%   78.91%   20.00%
              Best scores          38.30%   69.41%   83.15%   35.71%


7    Conclusion
The aim of this article was to use deep learning models to automatically
measure the severity of the signs of depression from an individual’s posts. We
described the participation of our research group in Task 2 of CLEF eRisk 2020,
using the CNN model, the BiLSTM model and statistical methods to generate runs.
    We conclude that none of our runs predicted overall depression better than
the others on every metric, because this task is difficult and a training
dataset with 20 users is not sufficient.
    For future work, we plan mainly to combine the CNN model with the BiLSTM
model and to analyze the obtained results in more detail.

References
 1. Mpqa resources. http://mpqa.cs.pitt.edu/#subj_lexicon
 2. Abed-Esfahani, P., Howard, D., Maslej, M., Patel, S., Mann, V., Goegan, S.,
    French, L.: Transfer learning for depression: Early detection and severity prediction
    from social media postings. In: CLEF (Working Notes) (2019)
 3. Baziotis, C., Pelekis, N., Doulkeridis, C.: Datastories at semeval-2017 task 4: Deep
    lstm with attention for message-level and topic-based sentiment analysis. In: Pro-
    ceedings of the 11th international workshop on semantic evaluation (SemEval-
    2017). pp. 747–754 (2017)
 4. Beck, A., Ward, C., Mendelson, M., Mock, J., Erbaugh, J.: An inventory for
    measuring depression. Archives of general psychiatry 4 (1961)
 5. Burdisso, S.G., Errecalde, M., Montes-y Gómez, M.: A text classification frame-
    work for simple and effective early depression detection over social media streams.
    Expert Systems with Applications 133, 182–197 (2019)
 6. Burdisso, S.G., Errecalde, M., Montes-y Gómez, M.: Unsl at erisk 2019: a uni-
    fied approach for anorexia, self-harm and depression detection in social media. In:
    CLEF (Working Notes) (2019)
 7. Cacheda, F., Iglesias, D.F., Nóvoa, F.J., Carneiro, V.: Analysis and experiments
    on early detection of depression. CLEF (Working Notes) 2125 (2018)
 8. Elman, J.L.: Finding structure in time. Cognitive science 14(2), 179–211 (1990)
 9. Fatima, B., Amina, M., Nachida, R., Hamza, H.: A mixed deep learning based
    model to early detection of depression. Journal of Web Engineering pp. 429–456
    (2020)
10. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Ef-
    ficient and robust automated machine learning. In: Advances in neural information
    processing systems. pp. 2962–2970 (2015)
11. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional
    lstm and other neural network architectures. Neural networks 18(5-6), 602–610
    (2005)
12. Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimen-
    sional recurrent neural networks. In: Advances in neural information processing
    systems. pp. 545–552 (2009)
13. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation
    9(8), 1735–1780 (1997)
14. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information.
    Phys. Rev. E 69(6) (2004)
15. LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time
    series. The handbook of brain theory and neural networks 3361(10), 1995 (1995)
16. Losada, D.E., Crestani, F., Parapar, J.: Early detection of risks on the internet:
    an exploratory campaign. In: European Conference on Information Retrieval. pp.
    259–266. Springer (2019)
17. Losada, D.E., Crestani, F., Parapar, J.: Overview of erisk 2019 early risk prediction
    on the internet. In: International Conference of the Cross-Language Evaluation
    Forum for European Languages. pp. 340–357. Springer (2019)
18. Losada, D.E., Crestani, F., Parapar, J.: Overview of erisk 2020 early risk prediction
    on the internet. In: International Conference of the Cross-Language Evaluation
    Forum for European Languages. Springer (2020)
19. Maupomé, D., Meurs, M.J.: Using topic extraction on social media content for the
    early detection of depression. CLEF (Working Notes) 2125 (2018)
20. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
    sentations of words and phrases and their compositionality. In: Advances in neural
    information processing systems. pp. 3111–3119 (2013)
21. Nadeem, M.: Identifying depression on twitter. arXiv preprint arXiv:1607.07384
    (2016)
22. Paul, S., Jandhyala, S.K., Basu, T.: Early detection of signs of anorexia and de-
    pression over social media using effective machine learning frameworks. In: CLEF
    (Working Notes) (2018)
23. Pennebaker, J.W., Chung, C.K., Ireland, M., Gonzales, A., Booth, R.J.: The devel-
    opment and psychometric properties of liwc2007: Liwc. net. Google Scholar (2007)
24. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word repre-
    sentation. In: Proceedings of the 2014 conference on empirical methods in natural
    language processing (EMNLP). pp. 1532–1543 (2014)
25. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language un-
    derstanding by generative pre-training (2018)
26. Radloff, L.S.: The ces-d scale: A self-report depression scale for research in the
    general population. Applied psychological measurement 1(3), 385–401 (1977)
27. Ragheb, W., Moulahi, B., Azé, J., Bringay, S., Servajean, M.: Temporal mood
    variation: at the clef erisk-2018 tasks for early risk detection on the internet (2018)
28. Stankevich, M., Isakov, V., Devyatkin, D., Smirnov, I.: Feature engineering for
    depression detection in social media. In: ICPRAM. pp. 426–431 (2018)
29. Trifan, A., Oliveira, J.L.: Bioinfo@ uavr at erisk 2019: delving into social media
    texts for the early detection of mental and food disorders. In: CLEF (Working
    Notes) (2019)
30. Trotzek, M., Koitka, S., Friedrich, C.M.: Utilizing neural networks and linguistic
    metadata for early detection of depression indications in text sequences. IEEE
    Transactions on Knowledge and Data Engineering (2018)
31. Trotzek, M., Koitka, S., Friedrich, C.M.: Word embeddings and linguistic metadata
    at the clef 2018 tasks for early detection of depression and anorexia. In: CLEF
    (Working Notes) (2018)
32. Van Rijen, P., Teodoro, D., Naderi, N., Mottin, L., Knafou, J., Jeffryes, M., Ruch,
    P.: A data-driven approach for measuring the severity of the signs of depression
    using reddit posts. In: CLEF (Working Notes) (2019)
33. Wang, Y.T., Huang, H.H., Chen, H.H.: A neural network approach to early risk
    detection of depression and anorexia on social media text. In: CLEF (Working
    Notes) (2018)
34. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-
    level sentiment analysis. In: Proceedings of human language technology conference
    and conference on empirical methods in natural language processing. pp. 347–354
    (2005)
35. Zhang, Y., Wang, J., Zhang, X.: Ynu-hpcc at semeval-2018 task 1: Bilstm with
    attention based sentiment analysis for affect in tweets. In: Proceedings of The 12th
    International Workshop on Semantic Evaluation. pp. 273–278 (2018)
36. Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., Xu, B.: Attention-based bidirec-
    tional long short-term memory networks for relation classification. In: Proceedings
    of the 54th Annual Meeting of the Association for Computational Linguistics (Vol-
    ume 2: Short Papers). pp. 207–212 (2016)
37. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., Fidler,
    S.: Aligning books and movies: Towards story-like visual explanations by watching
    movies and reading books. In: Proceedings of the IEEE international conference
    on computer vision. pp. 19–27 (2015)
38. Zung, W.W., Richards, C.B., Short, M.J.: Self-rating depression scale in an out-
    patient clinic: further validation of the sds. Archives of general psychiatry 13(6),
    508–515 (1965)