A RoBERTa-based Model on Measuring the Severity of the Signs of Depression

Shih-Hung Wu (1) and Zhao-Jun Qiu (1)
(1) Chaoyang University of Technology, Taichung, Taiwan (R.O.C.)

Abstract
In this paper, we describe our approach to the CLEF 2021 lab eRisk Task 3: Measuring the severity of the signs of depression. The main purpose of this task is to automatically measure the severity of a user's depression by analyzing the user's postings on social media. We adopt a deep learning pre-trained language model, RoBERTa, as the basis of our system, propose two different post-processing approaches, and submit three runs. The two post-processing weighting mechanisms are designed to push the system toward predicting higher levels of severity, based on our observation that systems in last year's eRisk lab tended to predict lower levels of severity. With a fixed weighting approach, our second run gives the best Average Difference between Overall Depression Levels (ADODL) and Depression Category Hit Rate (DCHR) this year.

Keywords
Deep Learning, RoBERTa

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
EMAIL: shwu@cyut.edu.tw (A. 1); s10827617@gm.cyut.edu.tw (A. 2)
ORCID: 0000-0002-1769-0613 (A. 1); 0000-0002-4616-9624 (A. 2)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
Social media is popular, and with the spread of mobile networks people use it ever more frequently. According to DIGITAL 2021: GLOBAL OVERVIEW REPORT [1], social media users have reached more than half of the global population. Expressing emotions through social media has become a daily habit, and researchers can analyze these postings with natural language processing technology to obtain useful results. In eRisk Task 3: Measuring the severity of the signs of depression, systems try to predict the severity of a user's depressive symptoms by analyzing the user's postings on social media. Similar studies have been conducted on other social media, such as predicting depression in medical records from Facebook language [2] and forecasting the onset and course of mental illness with Twitter data [3], which have shown the importance of evaluating user depression levels through social media postings.

The main mission of eRisk 2021 Task 3 is to explore the feasibility of automatically estimating the severity of multiple symptoms associated with depression. The organizers estimate a user's level of depression from the user's response to each question of the Beck Depression Inventory (BDI) questionnaire, which assesses the presence of feelings such as sadness, pessimism, and lack of energy. The questionnaire has 21 questions, each with four answers (from 0 to 3) or seven answers (0, 1a, 1b, 2a, 2b, 3a, 3b). System performance is assessed by the overlap between the questionnaire filled out by real users and the questionnaire filled out by the system (the number of correct predictions) [4].

This is the third time that the task of depression prediction has been held in the eRisk lab.
In the past eRisk tasks on depression prediction, many teams have come up with different ways to study this topic: the USDB team used two different deep learning models (CNN and BiLSTM) [12], the iLab team focused on the pre-processing of the training data [13], the RELAI team used topic models (LDA and Anchor) [14], and the BioInfo@UAVR team used a classifier following Yates et al. [15] that they had trained before to predict whether users were depressed [16]. Most of the previous works focus on training different models or on pre-processing the data. Our approach this time focuses mainly on post-processing, after using the state-of-the-art deep learning pre-trained language model RoBERTa as the basis of our system. We propose two different post-processing approaches and submit three runs. According to our observation of last year's eRisk results, systems tend to predict lower levels of severity; our post-processing weighting mechanisms are therefore designed to push the system toward predicting higher levels of severity.

The rest of this article is organized as follows: Section 2 describes the data provided by eRisk Task 3 and how the systems are evaluated. The methodology is described in Section 3, which reports our research process and our experimental settings. The last two sections discuss the results we obtained, as well as future directions of the study.

2. Data and Observation
The organizers of the 2021 eRisk T3 provide the test datasets of 2019 and 2020 as the training data. The 2020 dataset has a total of 70 users; during the system development phase, we use the 2020 data as the training set to train the model and the 2019 dataset (20 users) as the validation set to test the model. The dataset includes a severity questionnaire assessing depressive symptoms, as well as each user's daily postings on social media. The questionnaire consists of 21 questions with four answers each, except that questions 16 and 18 have seven answers [4].

eRisk T3 uses four different metrics to evaluate the systems: Average Hit Rate (AHR), Average Closeness Rate (ACR), Average Difference between Overall Depression Levels (ADODL), and Depression Category Hit Rate (DCHR) [8]. During system development, we focus on the AHR and DCHR metrics, since the other two metrics are closely related to them; we believe that optimizing these two will also improve the other two. (A small sketch of the four metrics is given at the end of this section.)
• Average Hit Rate (AHR): for each user, if the system predicts the user's actual answer on, say, ten of the 21 questions, the hit rate is 10/21; AHR is the average hit rate over all users.
• Depression Category Hit Rate (DCHR): the fraction of users for whom the depression category estimated from the questionnaire filled out by the system matches the category obtained from the actual questionnaire.

Figure 1: CLEF 2020 eRisk Task 2 performance results [5]

According to the 2020 CLEF eRisk results in Fig. 1 [5], the all-0 and all-1 baseline predictions reached 36% and 29% in AHR, respectively; however, these percentages dropped to 14% and 25% when evaluated with DCHR. From this we speculate that the real data tend toward a higher level of severity: users tend to answer each individual question with a lower level of severity, but the overall severity is not that low.
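To make the evaluation concrete, the following is a minimal sketch of the four metrics, assuming each questionnaire is represented as a list of 21 integer answers (with the 1a/1b-style options of Q16 and Q18 mapped to their 0-3 severity values for scoring) and the standard BDI depression bands; the exact definitions should be checked against the official eRisk evaluation described in [4] and [8].

```python
# Sketch of the eRisk T3 metrics; pred/real are lists of 21 integer answers per user.

def hit_rate(pred, real):
    return sum(p == r for p, r in zip(pred, real)) / len(real)

def closeness_rate(pred, real, max_diff=3):
    return sum((max_diff - abs(p - r)) / max_diff for p, r in zip(pred, real)) / len(real)

def dodl(pred, real, max_score=63):
    return (max_score - abs(sum(pred) - sum(real))) / max_score

def depression_category(score):
    # Assumed BDI bands: minimal, mild, moderate, severe.
    if score <= 9:
        return "minimal"
    if score <= 18:
        return "mild"
    if score <= 29:
        return "moderate"
    return "severe"

def evaluate(all_pred, all_real):
    n = len(all_real)
    return {
        "AHR":   sum(hit_rate(p, r) for p, r in zip(all_pred, all_real)) / n,
        "ACR":   sum(closeness_rate(p, r) for p, r in zip(all_pred, all_real)) / n,
        "ADODL": sum(dodl(p, r) for p, r in zip(all_pred, all_real)) / n,
        "DCHR":  sum(depression_category(sum(p)) == depression_category(sum(r))
                     for p, r in zip(all_pred, all_real)) / n,
    }
```

Under these definitions, a system can score reasonably well on AHR by always predicting the frequent low-severity answers while still missing the overall depression category, which is exactly the behavior discussed above.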
Table 1 shows the statistics of the training data. Each user's postings are labelled according to the user's answers in the questionnaire, and the table shows that most of the distributions are slightly biased toward lower levels of severity. We therefore expect that weighting the results during post-processing, so that the predictions become more prone to higher levels of severity, will give a better overall result.

Table 1
(a) The percentage of the posting distribution for the 2020 dataset; each posting is labelled based on the user's answer to the questionnaire. For example, if the users who answered 0 to question 1 wrote 350 postings and the total number of postings of all users is 1000, the percentage is 350/1000 = 35%.

Answer  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10 Q11 Q12 Q13 Q14 Q15 Q17 Q19 Q20 Q21
0       35  26  24  40  43  62  31  29  57  45  38  39  43  46  20  38  35  27  71
1       51  44  50  29  38  22  20  35  34  32  36  32  27  27  43  39  28  34  22
2        9  22  20  29  14  10  35  32   6   7  17  11  20  23  27  15  25  26   1
3        6   8   5   8   5   6  14   3   3  15   9  18  10   4  10   9  13  13   6

(b) The percentage of the posting distribution on questions 16 and 18

Answer  Q16  Q18
0         0    0
1a       11   37
1b       49   22
2a        9   12
2b       16   14
3a        3    6
3b       13    8

3. System Architecture
Fig. 2 shows our system flowchart. As mentioned in the previous sections, we use a pre-trained model to produce Run1 and weight the Run1 results to obtain the other two runs. The pre-processing is quite simple: our system just deletes URLs, special characters, and redundant white space from the users' postings, and sends the result to the BERT or RoBERTa model. We build one model for each question, so there are 21 models. Each posting is labelled with the answer the user gave to the question; this labelling assumes that every posting carries the same information about the user's choice. We train a classifier with the BERT/RoBERTa models on the sentences in the training set: for each question, one classifier decides which answer a sentence supports. Since each author writes many sentences, our system aggregates the votes of the individual sentences to produce the system output. In the first run, the system works on a majority-vote principle: the answer with the most votes is selected. In the second run, we emphasize the votes for the more severe answers, i.e., those tending toward more depression; the weights are 1 to 7 for the votes of answers 0 to 6, respectively. In the third run, we further lower the weight of the votes for answer 0 by rules.

Figure 2: Our system flowchart. The XML files are converted to CSV, pre-processed (URLs, special characters, and white space deleted), and fed to the RoBERTa model; the last four hidden layers are the input of a fine-tuned linear classifier (details in Fig. 4). The per-posting prediction results are counted and post-processed as Run1 (no weighting), Run2 (fixed weighting), and Run3 (threshold method).

3.1. Data Processing
The training data contains data from 70 volunteers, with questionnaire results as well as posts or comments from their social networks. The XML file format is shown in Fig. 3. We extract the TEXT content into a CSV file as our training data: we remove URLs, paths, and special characters, and each comment is organized into one line, saved in the first column. We aggregate all the posts of each anonymous ID and associate them with the user's questionnaire results; the answers to the questions follow, in order, after the first column. We obtain a total of 33,155 comments and use 80% (26,524) for training and 20% (6,631) for validation during our system development phase.

Figure 3: The XML format of each post in the dataset [4], where ID is the anonymous user ID, TITLE is the post title, INFO is the source, and TEXT is the content of the post.
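The following is a minimal sketch of this conversion step, not our exact script; the element name WRITING, the file names, and the hypothetical answer list are assumptions based on the format shown in Fig. 3.

```python
# Sketch: clean the TEXT fields of one user's XML file and write them to a CSV,
# one posting per row, followed by the user's questionnaire answers.
import csv
import re
import xml.etree.ElementTree as ET

def clean(text):
    text = re.sub(r"http\S+", " ", text)          # delete URLs
    text = re.sub(r"[^A-Za-z0-9' ]", " ", text)   # delete special characters
    return re.sub(r"\s+", " ", text).strip()      # collapse white space

def xml_to_rows(xml_path, answers):
    """answers: the user's 21 questionnaire answers, appended after the text column."""
    rows = []
    for post in ET.parse(xml_path).getroot().iter("WRITING"):  # assumed tag name
        text = clean(post.findtext("TEXT", default=""))
        if text:
            rows.append([text] + answers)
    return rows

# Hypothetical usage with a made-up file name and answer vector.
rows = xml_to_rows("subject0001.xml", answers=["1", "0", "2"] + ["0"] * 18)
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```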
3.2. Pre-trained Model
Since BERT has given good results in several natural language processing applications in recent years [17], we adopt a BERT pre-trained model as the basis of our system. At first, we chose the pre-trained model "bert-base-uncased" and made improvements by referring to the methods of [6] and [7]. Instead of using the final output of the BERT model directly, our system extracts the output of the last four hidden layers as the input vector of a linear classifier, as shown in Fig. 4. The hyper-parameters of our model are: hidden size = 768, learning rate = 1e-5, weight decay = 1e-2, epochs = 5. Fig. 5 shows the results of the model on Q3-Q6; since more epochs do not give better results, we limited our fine-tuning to 5 epochs.

Figure 4: Our system extracts the output vectors from the last four hidden layers of the model and concatenates the four vectors as the input vector of the linear classifier.

Figure 5: To decide the number of epochs, we test our system on Q3-Q6.

Later we use RoBERTa as the core of the system [18]. RoBERTa is optimized on the basis of BERT: the authors expanded the training dataset, trained with longer sequences, and dynamically generated the masks used by masked language modeling. We compared the two models on our training data, and Fig. 6 shows that RoBERTa is consistently more accurate than BERT in our system.

Figure 6: Q1-Q21 accuracy of our system during development.
BERT:    0.68, 0.63, 0.66, 0.65, 0.66, 0.71, 0.62, 0.63, 0.72, 0.65, 0.64, 0.63, 0.63, 0.64, 0.63, 0.59, 0.65, 0.60, 0.61, 0.61, 0.77
RoBERTa: 0.69, 0.64, 0.68, 0.68, 0.67, 0.72, 0.63, 0.67, 0.75, 0.67, 0.66, 0.65, 0.66, 0.66, 0.64, 0.59, 0.66, 0.62, 0.63, 0.64, 0.77
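As an illustration of the classifier described above, the following is a minimal sketch (using the Hugging Face transformers library, not our exact code) of a per-question model that concatenates the first-token vectors of the last four hidden layers of RoBERTa and feeds them to a linear classifier, with the hyper-parameters listed in Section 3.2; the class and variable names are our own.

```python
# Sketch: RoBERTa encoder whose last four hidden layers feed a linear classifier head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LastFourLayersClassifier(nn.Module):
    def __init__(self, num_labels, model_name="roberta-base", hidden_size=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        self.classifier = nn.Linear(4 * hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # out.hidden_states holds the embedding layer plus every encoder layer;
        # keep the last four and concatenate their first-token (<s>) vectors.
        last_four = out.hidden_states[-4:]
        cls = torch.cat([h[:, 0, :] for h in last_four], dim=-1)  # (batch, 4 * 768)
        return self.classifier(cls)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = LastFourLayersClassifier(num_labels=4)  # 7 labels for Q16 and Q18
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-2)

batch = tokenizer(["an example posting"], return_tensors="pt", truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```

One such model is fine-tuned for each of the 21 questions, and its per-posting predictions are aggregated as described in Section 3.3.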
3.3. Post-processing
In the first run, our system outputs the original prediction results as a baseline for the follow-up runs; there is no weighting of the predicted results. The prediction for each question is a simple majority vote: the system outputs the answer with the largest cumulative count. Fig. 7 shows the prediction distribution of each question. We find that the predictions, affected by the training set, tend to favor lower severity.

Figure 7: The answer prediction distribution of our Run1

In the second run, the predictions are adjusted from the first run with a fixed-weight mechanism that favors higher-severity answers. For example, Q1 has four levels of severity (from 0 to 3), so we use 1 to 4 as the severity weights: if a posting is predicted as 1 by our model, we count it twice; if it is predicted as 2, we count it three times; and if it is predicted as 3, we count it four times. The system then sets the prediction of the question to the answer with the maximum count after weighting. Fig. 8 shows the Run2 prediction distribution; the distribution is pushed toward higher levels of severity.

Figure 8: The answer prediction distribution of our Run2

In Run3, the weighting is adjusted based on the percentages of the training data distribution. We use the distribution percentages in Table 1 as threshold values, and the answer with the highest percentage for each question is selected as the default answer. Our system then checks the answers in order from the most severe to the slightest: as soon as the percentage of an answer in the prediction results is greater than its percentage in the training data distribution, that answer is taken as the final answer. For example, for Q1 the distribution in the training data is (0: 35%, 1: 51%, 2: 9%, 3: 6%), and the answer with the highest percentage is 1, so 1 is the default answer. Suppose the system's predicted distribution for some user is (0: 36%, 1: 51%, 2: 10%, 3: 3%). Our system first checks the percentage of answer 3; in this case 3% does not exceed the 6% threshold, so it is not chosen. The system then checks the percentage of answer 2, which is 10% and does exceed the 9% threshold, so the system outputs answer 2. In short, our system tends to choose the answer with higher severity. (A minimal sketch of the three aggregation schemes is given at the end of this section.)

Figure 9: The answer prediction distribution of our Run3

Fig. 9 shows the Run3 prediction distribution. Unlike the first two runs, this distribution is spread more broadly and is less concentrated. However, the overall performance is not the best.
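To make the three post-processing schemes concrete, the following is a minimal sketch based on the descriptions above; the names per_post_preds and train_pct are hypothetical, and details such as where the Run3 scan stops relative to the default answer are assumptions.

```python
# Sketch: aggregate the per-posting predictions of one question into a single answer.
from collections import Counter

def run1_majority(per_post_preds):
    """Run1: plain majority vote over the per-posting predicted answer indices."""
    return Counter(per_post_preds).most_common(1)[0][0]

def run2_fixed_weighting(per_post_preds):
    """Run2: answer k contributes weight k + 1, pushing the vote toward higher severity."""
    weighted = Counter()
    for ans in per_post_preds:
        weighted[ans] += ans + 1
    return weighted.most_common(1)[0][0]

def run3_threshold(per_post_preds, train_pct):
    """Run3: threshold method. train_pct is the training distribution of the question
    from Table 1, e.g. Q1: {0: 35, 1: 51, 2: 9, 3: 6} (percentages)."""
    counts = Counter(per_post_preds)
    pred_pct = {a: 100 * counts[a] / len(per_post_preds) for a in train_pct}
    default = max(train_pct, key=train_pct.get)    # most frequent answer in training data
    for ans in sorted(train_pct, reverse=True):    # from most severe to slightest
        if ans <= default:                         # assumption: stop the scan at the default
            break
        if pred_pct[ans] > train_pct[ans]:         # prediction share exceeds its threshold
            return ans
    return default
```

With the Q1 example above, run3_threshold({0: 35, 1: 51, 2: 9, 3: 6}-style thresholds and a predicted distribution of (36%, 51%, 10%, 3%) skips answer 3 and returns answer 2, matching the worked example in the text.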
4. Results and Discussion
Table 2 shows the results of our three runs in task 3 this year [8], and we compare them to the best results of the last two years [5][9]. This year, nine teams submitted 36 runs, among which we obtained the best results in ADODL and DCHR, and our ADODL is better than the best results of the past two years. The best DCHR and AHR results of 2019 are still hard to match. The approach behind the best DCHR can be seen in [10]: the authors used unsupervised methods, and they also noted, through the simulation in Fig. 10 comparing against randomly generated submissions, that although their results were the best, they did not perform much better than random. The approach behind the best AHR can be seen in [11]: the authors first decided each user's depression level and then decided the answers to the questionnaire, which is totally different from our approach.

Table 2
System performance of our runs and the best results in the recent three years
                                 AHR      ACR      ADODL    DCHR
CYUT Run1                        32.02%   66.33%   75.34%   20.00%
CYUT Run2                        32.62%   69.46%   83.59%   41.25%
CYUT Run3                        28.39%   63.51%   80.10%   38.75%
Best result in this year [8]     35.36%   73.17%   83.59%   41.25%
Best result in year 2020 [5]     38.30%   69.41%   83.15%   35.71%
Best result in year 2019 [9]     41.43%   71.27%   81.03%   45.00%

Figure 10: Histograms of randomly generated submissions with team submissions marked by vertical lines (2019) [10]

Table 2 shows that the overall performance of Run2 is better than that of the other two runs. We further show the number of correct predictions for the individual questions Q1-Q21 in Fig. 11. From Fig. 11 we can see that Run2 gives better predictions on 9 of the 21 questions. We can also see that Q16 and Q18 are really hard to predict; on Q16, Run1 predicts correctly only once. This observation suggests that our weighting post-processing is valid and is most effective on Q16. However, over-weighting also degrades the Run3 results; on Q1, for example, Run3 obviously makes the predictions worse. We find that an appropriate adjustment gives better results, which might be due to the unbalanced distribution of the training data; adjusting to fit the training data distribution is an effective post-processing step.

Figure 11: The number of correct predictions of the three runs

5. Conclusion and Future Works
The goal of eRisk T3 is to automatically assess the severity of depression by analyzing the postings of a person. We used a deep learning approach based on the pre-trained model RoBERTa to build our system and submitted three runs with different post-processing weighting mechanisms. Run2 gives the best ADODL and DCHR this year.

In our experiments, we assume that every posting gives the same information about the user's choice. We believe that this is not a good assumption, since a user might write postings with different emotions at different times, which can be very different from what the user would answer to each of the questions. This is one point that we will improve in the future. Even for a user whose questionnaire shows a higher level of severity, only some of the sentences will show that higher level of severity; therefore, the sentences should be filtered with other tools, and only those that show a higher level of severity should be associated with the higher scores. In the future, we plan to refine the data by comparing depression-related articles with non-depressive articles, extracting the content that shows depression and removing irrelevant content, thereby reducing the impact of useless content on the model.

6. Acknowledgement
This study was supported by the Ministry of Science and Technology under grant number MOST 110-2221-E-324-011.

7. References
[1] Simon Kemp: DIGITAL 2021: GLOBAL OVERVIEW REPORT (2021). URL: https://datareportal.com/reports/digital-2021-global-overview-report
[2] Eichstaedt, J.C., Smith, R.J., Merchant, R.M.: Facebook language predicts depression in medical records. Proceedings of the National Academy of Sciences (PNAS) 115(44), 11203–11208 (2018).
[3] Reece, A.G., Reagan, A.J., Lix, K.L.M., et al.: Forecasting the onset and course of mental illness with Twitter data. Scientific Reports 7, 13006 (2017). URL: https://doi.org/10.1038/s41598-017-12961-9
[4] CLEF eRisk: Early risk prediction on the Internet. URL: https://erisk.irlab.org/ (2021)
[5] Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk at CLEF 2020: Early Risk Prediction on the Internet (Extended Overview) (2020). URL: http://ceur-ws.org/Vol-2696/paper_253.pdf
[6] Chris McCormick: BERT Word Embeddings Tutorial (2019). URL: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
[7] Use the huggingface pre-trained model to solve 80% of NLP problems (2020, October 18). URL: https://www.bilibili.com/video/BV1Dz4y1d7am/?spm_id_from=333.788.videocard.15
[8] Parapar, J., Martin-Rodilla, P., Losada, D.E., Crestani, F.: Overview of eRisk 2021: Early Risk Prediction on the Internet. In: Proceedings of the Twelfth International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham (2021).
[9] Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk at CLEF 2019: Early Risk Prediction on the Internet (extended overview) (2019). URL: http://ceur-ws.org/Vol-2380/paper_248.pdf
[10] Abed-Esfahani, P., Howard, D., Maslej, M., Patel, S., Mann, V., Goegan, S., French, L.: Transfer Learning for Depression: Early Detection and Severity Prediction from Social Media Postings (2019).
URL: http://ceur-ws.org/Vol-2380/paper_102.pdf
[11] Burdisso, S.G., Errecalde, M., Montes-y-Gómez, M.: UNSL at eRisk 2019: a Unified Approach for Anorexia, Self-harm and Depression Detection in Social Media (2019). URL: http://ceur-ws.org/Vol-2380/paper_103.pdf
[12] Madani, A., Boumahdi, F., Boukenaoui, A., Kritli, M.C., Hentabli, H.: USDB at eRisk 2020: Deep Learning Models to Measure the Severity of the Signs of Depression using Reddit Posts (2020). URL: http://ceur-ws.org/Vol-2696/paper_39.pdf
[13] Martínez-Castaño, R., Htait, A., Azzopardi, L., Moshfeghi, Y.: Early Risk Detection of Self-Harm and Depression Severity using BERT-based Transformers: iLab at CLEF eRisk 2020 (2020). URL: http://ceur-ws.org/Vol-2696/paper_50.pdf
[14] Maupomé, D., Armstrong, M.D., Belbahar, R.: Early Mental Health Risk Assessment through Writing Styles, Topics and Neural Models (2020). URL: http://ceur-ws.org/Vol-2696/paper_53.pdf
[15] Yates, A., Cohan, A., Goharian, N.: Depression and self-harm risk assessment in online forums. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2968–2978. Association for Computational Linguistics (2017).
[16] Trifan, A., Salgado, P., Oliveira, J.L.: BioInfo@UAVR at eRisk 2020: on the use of psycholinguistics features and machine learning for the classification and quantification of mental diseases (2020). URL: http://ceur-ws.org/Vol-2696/paper_43.pdf
[17] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7 (2019).
[18] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. Computing Research Repository (2019). arXiv:1907.11692, version 1.