=Paper=
{{Paper
|id=Vol-2936/paper-85
|storemode=property
|title=Towards transfer learning using BERT for early detection of self-harm of social media users
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-85.pdf
|volume=Vol-2936
|authors=Qamar Un Nisa,Rafi Muhammad
|dblpUrl=https://dblp.org/rec/conf/clef/NisaM21
}}
==Towards transfer learning using BERT for early detection of self-harm of social media users==
Towards transfer learning using BERT for early detection of self-harm of social media users

Qamar un Nisa 1 and Rafi Muhammad 1
1 National University of Computer & Emerging Sciences - FAST, ST 4 Sector 17D, Shah Latif Town, Karachi 75030, Pakistan

Abstract
Online social media channels are everywhere, and almost everyone uses them to post parts of their lives that they are not comfortable sharing in person. Large amounts of data can be collected from these channels and used to study and develop systems that detect depression and a tendency toward self-harm in internet users. It is generally believed that a user's social posts, taken in chronological order, can be used to predict depression or signs of self-harm. Deep learning models for Natural Language Processing (NLP) already produce strong results on language-based analysis such as sentiment, offensiveness, or depression detection over collections of text, and BERT and other transformer-based architectures have achieved state-of-the-art results on many NLP tasks. We apply transfer learning together with a supervised learning algorithm, Logistic Regression, to detect signs of self-harm for a user as early as possible. BERT is used to generate embeddings of sentences but can only process inputs up to 512 tokens long, so we created a custom algorithm that breaks user writings into shorter sentences, generates embeddings for each, and concatenates them for supervised learning. Depression leads to various forms of self-harm, and our goal is to detect the tendency toward self-harm in individuals as early as possible so that there can be an intervention.

Keywords
BERT transformers, Word Embeddings, Early Detection of Depression, Transfer Learning

CLEF 2021 - Conference and Labs of the Evaluation Forum, September 21-24, 2021, Bucharest, Romania
EMAIL: k180914@nu.edu.pk (A. 1); muhammad.rafi@nu.edu.pk (A. 2)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Globally, depression is a common illness affecting more than 264 million people [1]. If left unchecked, depression can worsen over time, resulting in self-harm or, worse, suicide; according to the World Health Organization, one person dies by suicide every 40 seconds [2]. Various helplines and awareness campaigns have been introduced in the hope of lowering the suicide rate, but not everyone has that awareness or access to a professional. People aged 15 to 29 are the most affected by depression that leads to suicide, making it the second leading cause of death for people in this age group.

Technological evolution has changed people's habits through advanced devices, social media networks, and lifestyle; millennials prefer to text instead of talking on a call or face to face. According to a survey of 500 millennials by OpenMarket, "75% of millennials would rather lose the ability to talk versus text" [3]. With the rise of social media, there are many websites and channels where people openly talk about their struggles and battles with depression. There are support groups where people share their stories about depression, attempted suicide, and other self-harm incidents. People prefer to text or write on social media about depression instead of seeing a professional for help, in some cases even preferring to talk to bots. Because of social media channels, we now have textual data that can be salvaged for early detection of depression.
There are hidden patterns and styles in textual data that can help identify an author or even the author's gender. Shlomo Argamon et al. [4] found that the gender of a writer can be identified from how many pronouns, facts, and noun specifiers the writer uses: female writers use more pronouns such as "her, she, him", and male writers use more facts and noun specifiers [4]. Similarly, depressed individuals may be detectable with sophisticated natural language processing techniques, hopefully before their condition gets worse or they do themselves any harm.

The idea of Early Depression Detection (EDD) is to monitor the texts of a user over time, in chronological order, and to assess with different learning approaches whether the user is depressed, early enough to intervene and get help to the user. Because few public datasets exist for early detection of depression in internet users based on their comments and texts on social media, there have not been diverse solutions to this specific problem that can be checked against other datasets. Researchers have used natural language approaches together with ensemble classifiers for EDD, but very few have tried to solve the problem with state-of-the-art deep learning models. In this study we deal with the problem by examining previous techniques that yielded good results and combining them with newer models into a working solution. Granted, a machine learning model cannot be as effective as a professional counsellor or psychiatrist, but not everyone has access to a professional, and with the rise of social media and textual data, a sophisticated model can help internet users. This paper describes an experiment on the early depression detection problem of classifying a user's Reddit texts in chronological order and predicting signs of self-harm. The approach uses BERT to create word embeddings and then Logistic Regression for prediction.

2. Related Work

Detection of depression through textual data is a relatively new and challenging field of research, having emerged only 7-8 years ago. Assessing depression online through a series of texts can have immense impact, as many people suffer from depression without access to mental health professionals. Little data is publicly available for Early Depression Detection (EDD), partly because the data is collected from social media sites that prohibit redistribution under GDPR. A team of engineering students classified anxiety from EEG signals using IoT and machine learning, but that approach requires every person to wear an EEG headset [5]; detection of depression through social media is therefore essential. Because of the incremental classification problem posed by EDD, there are no established baseline strategies for detection or modelling. There have been various approaches to early detection of depression, some using feature extraction, others using ensembles of classifiers. Some authors have worked on the problem without considering user history and progression over time, while others have used a time-aware methodology.
David E. Losada and Fabio Crestani (2016) collected data from different social sites, largely the popular social forum Reddit, to create a dataset for the detection of depression [6]. Losada's dataset is used as the baseline dataset for EDD at the eRisk lab and in other research projects. Losada also worked on a baseline method and an evaluation metric called Early Risk Detection Error (ERDE) for models that detect depression in texts over time. Different authors have since applied machine learning, natural language processing, or hybrid approaches to the problem.

2.1. Deep Learning

Deep learning is a type of machine learning inspired by the structure and working of the human brain, with neurons connected to one another; in deep learning this structure is called a neural network. Supervised machine learning algorithms need sets of features for the data to be classified, but in deep learning features are identified and implicitly learned by the neural network itself. Deep learning frameworks such as ULMFiT and ELMo have been used to generate word embeddings that are then passed to text classifiers [7], or to perform the text classification itself [8].

Elena Fano et al. (2019) [7] used the deep learning framework ELMo to produce word embeddings and classified the output using multi-layer perceptrons. M. Trotzek et al. (2018) argued that linguistic metadata is important in EDD text sequences and used neural networks for both word embeddings and classification [8]. Trotzek used pre-trained word vectors from Wikipedia trained with fastText and GloVe, and also trained a fastText model on a Reddit dataset of 1.7 billion user texts and posts [9]. Trotzek pre-processed the data so that it retained punctuation, emoticons, and special-character words, since linguistic research suggests that a depressed user has certain textual attributes that are important to preserve. For classification, Trotzek used a convolutional neural network (CNN); in addition to the Reddit text, metadata features were fed to the CNN as a secondary input.

Matero et al. (2019) argued that depression detection from a single source of user writings is limited and proposed to predict from three types of features, or dimensions, instead of textual data alone: open-vocabulary features (word embeddings) from BERT; theoretical dimensions such as affect, intensity, valence, and dominance, noting, based on prior research, that age and gender play a role too since the suicide rate differs between genders; and meta features of the data, such as n-grams and statistics, to draw more inference from the data. Matero proposed a dual-context model, with one context from the Reddit [6] writings and the other from the meta and theoretical dimensions [10]. Matero did not use a time-aware classification methodology, so the model accounts for different features but not the user's text history, and used logistic regression to classify suicide risk, evaluating with accuracy and F1 score but not ERDE, since no time-aware methodology was used.

3. Problem Statement - eRisk 2021 Task 2: Early Detection of Signs of Self-Harm

Consider a collection of social media posts of a user U, where the user writes several posts P over a time span n, stored in chronological order. Signals such as how often a user posts per day or week, and how the posts evolve over time (for example, each day the posts P grow more depressing and gradually contain explicit phrases such as "I had a panic attack today" or "World would be better without me"), can be used to infer depression and signs of self-harm early in a user:

U_posts = {P1, P2, P3, ..., Pn}

The idea is to propose a model that can identify a depressed user before the situation becomes critical and the user causes self-harm to themselves; with a sophisticated system, there can be an intervention to help the user. Each post P of a user is passed through the model sequentially and, using a time-aware methodology that considers the user's past posts, the model predicts whether the user is depressed or not. Due to the nature of the task, the model must predict in time: a correct but late prediction is not useful in the given scenario. We participated in Task 2 [11] of eRisk 2021; this task was also featured in 2019 [12] as Task 2 and in 2020 [26] as Task 1. A minimal sketch of this round-by-round, time-aware decision protocol is shown below.
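To make the protocol concrete, the following sketch (ours, not part of the eRisk submission) shows one way to frame the sequential decision loop. The `embed` featurizer and `model` classifier are placeholders for any pipeline with a predict_proba-style interface; once a user is flagged positive, the decision is final, mirroring the eRisk rules.

```python
# Hedged sketch of the time-aware decision loop for eRisk-style tasks.
# `embed` and `model` are placeholders for any featurizer/classifier pair.

def run_stream(users_posts, embed, model, threshold=0.5):
    """users_posts: dict user_id -> list of posts in chronological order."""
    decisions = {u: 0 for u in users_posts}          # 0 = no alert yet
    history = {u: [] for u in users_posts}           # accumulated writings
    max_rounds = max(len(p) for p in users_posts.values())
    for k in range(max_rounds):                      # one post per user per round
        for user, posts in users_posts.items():
            if decisions[user] == 1 or k >= len(posts):
                continue                             # positive decisions are final
            history[user].append(posts[k])
            features = embed(history[user])          # time-aware: uses all past posts
            p_risk = model.predict_proba([features])[0][1]
            if p_risk >= threshold:
                decisions[user] = 1                  # alert: signs of self-harm
    return decisions
```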
4. Dataset

The dataset follows the same format as [6]: a collection of each user's texts, both their own posts with titles and their comments on other users' posts, collected from the social forum Reddit. The dataset is anonymous; users are denoted by user numbers, and the time and date of each post or comment are included. There are two classes of users, "self-harm" and "not self-harm", denoted by 1 and 0 in the data. The dataset released by CLEF eRisk 2021 [11] is used for the experiment and research; a hedged parsing sketch is shown after Figure 1.

Table 1: Statistics and details about the 2021 Task 2 dataset [11].

                         Training Data               Test Data
                         self-harm  not self-harm    self-harm  not self-harm
  Number of users            41         299             104        319
  Number of posts          6927      163506           11691      91136
  Min. posts per user         8          10               9          9
  Max. posts per user       997        1992             942       1990
  Mean posts per user       168         546             112        285

Figure 1: Histogram of all user writings (training and testing). Out of 763 users in total, 376 have between 8 and 150 writings. This also suggests that, if computing resources or time are limited, processing the first 150 writings per user in the training stage is sufficient, since the majority of users have no more than 150 writings, and an optimal solution should already give accurate results on the first 150 user writings.
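The eRisk collections derived from [6] are distributed as one XML file per subject. The tag names below (INDIVIDUAL, WRITING, TITLE, DATE, TEXT) follow our reading of that format and should be treated as an assumption rather than a specification; a minimal loader might look like this:

```python
# Hedged sketch: load one subject's writings from an eRisk-style XML file.
# Tag names are assumed from the Losada & Crestani collection format [6].
import xml.etree.ElementTree as ET

def load_subject(path):
    root = ET.parse(path).getroot()                  # <INDIVIDUAL> element (assumed)
    subject_id = root.findtext("ID", default="")
    writings = []
    for w in root.iter("WRITING"):                   # one element per post/comment
        writings.append({
            "title": (w.findtext("TITLE") or "").strip(),
            "date": (w.findtext("DATE") or "").strip(),
            "text": (w.findtext("TEXT") or "").strip(),
        })
    writings.sort(key=lambda x: x["date"])           # keep chronological order
    return subject_id, writings
```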
5. Methodology

In past research, models that apply additional natural language processing to the data, such as bag of words or word embeddings, and then pass the result through a classifier or ensemble of classifiers have had better outcomes. Word embeddings are the most important part of text classification tasks, and many pre-trained deep learning frameworks provide state-of-the-art embeddings. Elena Fano [7] used GloVe and ELMo for word embeddings. BERT has a limitation of processing only 512 tokens, so it has not been adopted universally or used efficiently. Matero et al. [10] used BERT for word embeddings but truncated the sentences in training because of the 512-token limit, so not all user writings were used for training. Trotzek [8] also experimented with BERT word embeddings but noted the same limitation of processing only the first 512 tokens of a text, and instead used fastText, which has no such limitation and provided better results. In this study we use BERT [13] to generate word embeddings and then train a supervised machine learning model on the extracted features, using a Logistic Regression classifier similar to [6].

5.1. Data Processing

The training data for Task 2 is a collection of Reddit user writings in XML format. Since the data was collected from a real social networking forum, it had to be cleaned and prepared to accommodate the 512-token limitation of BERT. First, all URLs are removed from each user post, and all words in the post are counted. An empty string, StringS, is initialized for each user to accumulate the user's writings. If a post has fewer than 300 words, we check whether concatenating it with StringS would exceed 300 words; if not, the two are concatenated and the next post is processed through the same flow. If a post contains more than 300 words, it is broken into smaller sentences split on punctuation marks, and the same word-counting and concatenation process is applied to each piece. If, after splitting on punctuation, a piece still exceeds 300 words, it is a long unpunctuated paragraph, so we split it on blank spaces and concatenate with StringS until StringS has 300 words. Whenever concatenating a new piece with StringS would exceed 300 words, StringS is saved as a row in the dataframe and emptied for the next sentence. The whole flow is shown in detail in the flowchart in Figure 2, and a code sketch of the same procedure follows Table 2. Each row in the newly created data holds a UserId, the newly processed sentence, and the class label of the user being depressed or not. The newly generated data for 340 users has 16060 rows; this prepared data is then used to generate word embeddings by transfer learning on BERT.

The motivation behind this approach is that when a sentence is passed to BERT, even a 15-word sentence still yields a 768-dimensional vector. By concatenating smaller sentences in chronological order instead of removing them, every post of a user is kept, as each post is important. Other teams have truncated posts or removed small posts from their data; with this approach we lose no data while maintaining the chronological order of posts and the context of sentences.

Figure 2: Flowchart of the data preparation stage, breaking user posts into small sentences of at most 300 words. Data preparation is used when training the classifier, not during testing, as seen in Figure 3.

Table 2: Statistics and details about the 2021 Task 2 training data, prepared to be processed with BERT.

  Data                            Stats   Comments
  Total number of users           340     -
  Minimum sentences per user      1       21 users have just 1 sentence
  Maximum sentences per user      817     1 user has 817 generated sentences; the 2nd highest is 491
  Mean sentences per user         47      the average number of sentences is 47
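The sketch below reimplements the preparation flow described above and in Figure 2. Function and variable names are ours, and details the paper does not specify, such as the exact punctuation set, are assumptions.

```python
# Hedged re-implementation of the Figure 2 preparation flow:
# pack each user's posts, in order, into chunks of at most MAX_WORDS words.
import re

MAX_WORDS = 300

def split_long(post):
    """Split a post on punctuation; fall back to whitespace for long runs."""
    pieces = re.split(r"(?<=[.!?;])\s+", post)       # assumed punctuation set
    out = []
    for piece in pieces:
        words = piece.split()
        if len(words) <= MAX_WORDS:
            out.append(piece)
        else:                                        # unpunctuated long paragraph
            out.extend(" ".join(words[i:i + MAX_WORDS])
                       for i in range(0, len(words), MAX_WORDS))
    return out

def prepare_user(posts, user_id, label):
    rows, buffer = [], ""                            # buffer plays the role of StringS
    for post in posts:
        post = re.sub(r"https?://\S+", "", post)     # drop URLs first
        pieces = [post] if len(post.split()) <= MAX_WORDS else split_long(post)
        for piece in pieces:
            if len((buffer + " " + piece).split()) > MAX_WORDS:
                if buffer:
                    rows.append((user_id, buffer.strip(), label))
                buffer = piece                       # start a fresh chunk
            else:
                buffer = (buffer + " " + piece).strip()
    if buffer:
        rows.append((user_id, buffer.strip(), label))
    return rows                                      # (UserId, sentence, label) rows
```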
5.2. BERT Word Embeddings

Bidirectional Encoder Representations from Transformers (BERT) was created by Google [13] to outperform natural language processing models such as ELMo, ULMFiT, and Generative Pre-Training, which are unidirectional or only shallowly bidirectional for downstream NLP tasks. BERT is described as a method of pre-training language representations: a large general-purpose corpus is pre-trained once and then used for downstream NLP tasks such as the early detection task in our study. BERT has a pre-trained vocabulary of 30,000 word tokens, each word token having 768 features. Pre-trained BERT models are available via the Hugging Face transformers library in different versions that vary in training parameters and casing: lower-case, cased, multilingual, and so on.

We used 'bert-base-uncased' for transfer learning to generate word embeddings for the prepared sentences and user writings. The prepared training data is passed to BERT in batches of 3000 rows, yielding embeddings of dimensions 16060 x 768; a sketch of this step is given below. Some further processing is applied to the generated embeddings to prepare them for machine learning: all rows of a single user are concatenated so that there is one row per user for classification, resulting in a 340 x 627456 matrix. Because the number of columns is huge, dimensionality reduction was applied for faster processing and training of the model; after reduction, 1655 columns remain.
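A sketch of the embedding step under our reading of the paper: the pooled [CLS] vector from 'bert-base-uncased' is taken per prepared sentence (the paper does not say which token vector was used), per-user rows are concatenated with zero padding up to the longest user, and, since the reduction technique is not named, univariate selection with SelectKBest stands in for it here.

```python
# Hedged sketch: sentence embeddings with bert-base-uncased, per-user
# concatenation with zero padding, then a stand-in dimensionality reduction.
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
from sklearn.feature_selection import SelectKBest, f_classif

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed_sentences(sentences, batch_size=32):
    vecs = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], padding=True,
                          truncation=True, max_length=512, return_tensors="pt")
        out = bert(**batch)
        vecs.append(out.last_hidden_state[:, 0, :])  # [CLS] vector (our choice)
    return torch.cat(vecs).numpy()                   # shape: (n_sentences, 768)

def user_matrix(rows_per_user, max_sentences):
    """Concatenate each user's 768-d rows, zero-padded to the longest user."""
    X = np.zeros((len(rows_per_user), max_sentences * 768), dtype=np.float32)
    for i, rows in enumerate(rows_per_user):
        flat = rows.reshape(-1)
        X[i, :flat.size] = flat
    return X                                         # e.g. 340 x 627456

# Stand-in for the unnamed reduction step: keep the 1655 most informative columns.
def reduce_features(X, y, k=1655):
    return SelectKBest(f_classif, k=k).fit_transform(X, y)
```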
5.3. Depression / Self-harm Classifier

The depression classifier is trained on the processed word embeddings using Logistic Regression, with cross-validation to find the optimal class weight and the optimal value of the regularization parameter C. The optimal values are then applied to the training word embeddings, and the model is saved to be used for testing and prediction. We created a second variation in which the classifier is built with the open-source AutoML library auto-sklearn, which applies different machine learning algorithms to the data and performs hyperparameter tuning based on the accuracy metric [17].

Logistic Regression was chosen as the classifier for the experiment on the test-server data provided by the CLEF eRisk team. Results on the training data with different confidence-score thresholds are shown in Table 3. A confidence threshold of 0.5 on the logistic regression output gave the most optimal results in our experiment when predicting on the test split during training. Since the CLEF eRisk setting states that once a prediction about a user is made it cannot be changed later, we decided to use the more conservative 0.75 threshold for each user's first 5 posts.

AutoML took longer to train the model and produced poor scores: because of the zero padding appended to the sparse training rows, the AutoML model predicts 0 unless the test data has exactly the same dimensions (1655) without the zero padding at the end. The AutoML model predicted just one sample as positive, and that was not a true positive. The AutoML model was intended to be run 2, but due to its poor training results we did not submit it and submitted only runs of the logistic regression model.

Table 3: Training results on the evaluation metrics using Logistic Regression (LR) and the AutoML classifier with different confidence-score thresholds for prediction.

  Classifier  Threshold  F1        P         R         ERDE5     ERDE50    ERDE100
  LR          0.5        0.387368  0.247978  0.884615  0.193439  0.190534  0.190534
  LR          0.75       0.332203  0.256545  0.471154  0.215949  0.212559  0.212559
  LR          0.8        0.253165  0.225564  0.288462  0.237447  0.234808  0.234808
  AutoML      0.5        0.000000  0.000000  0.000000  0.246444  0.246444  0.246444

Figure 3: High-level flow / architecture diagram of the experiment.

5.4. Baseline Classifier

The research in Early Depression Detection of [7] has been used as a benchmark by researchers in the domain [18]. Hence, for the experimental setup of this study, we reproduced the experiment of [7] as a baseline model, implemented as follows (a code sketch follows this section). The CLEF eRisk dataset has negative and positive users in the format discussed in Section 4. All negative and positive writings are appended into one corpus. Feature extraction is performed by vectorizing the corpus with TfidfVectorizer, using the standard English stop list and a minimum document frequency of 20, so that words appearing in fewer than 20 user writings are removed. Vectorization is done with the sklearn library in Python, and the resulting vectors are stored in a sparse 486 x 12968 array. Logistic regression is used for classification, with a vector of ground-truth values for the 486 users denoting depressed as 1 and not depressed as 0. The sklearn implementation of Logistic Regression takes four parameters in this experiment: penalty specifies the normalization, 'L1' in our study; C is the inverse regularization strength accompanying the error term of the model's optimisation, and, as in SVMs, smaller C means stronger regularization; class_weight specifies the weights of the classes, which is useful for data with class imbalance. The values of C and class_weight are optimized following the standard grid-search protocol of Chih-Wei Hsu [34], with exponentially growing sequences of C from 2^-10 up to 2^9 and w = 2^0, 2^1, ..., 2^9. 4-fold cross-validation is run on the eRisk 2021 Task 2 training data over these sequences to find the optimal values. In [7], the values of C and w that maximize F1 score, precision, and recall are C=16 and w=4, but we did not obtain those values in our experiments: based on 4-fold cross-validation we got two sets of values that optimize the F1 score, C=256 & w=64 and C=128 & w=32. These values are used with the other hyperparameters, and logistic regression is applied to the training data to build a self-harm classifier. We thus created two baseline classifiers with different hyperparameters to compare with our models.
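A sketch of the baseline reproduction under the hyperparameters reported above. The grid-search wiring (GridSearchCV, the liblinear solver for L1) is our stand-in where [7] leaves implementation details unspecified.

```python
# Hedged sketch of the TF-IDF + L1 logistic regression baseline from [7].
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def build_baseline(corpus, labels):
    """corpus: one concatenated document per user; labels: 1/0 per user."""
    vec = TfidfVectorizer(stop_words="english", min_df=20)
    X = vec.fit_transform(corpus)                    # sparse, e.g. 486 x 12968
    grid = {
        "C": [2.0 ** e for e in range(-10, 10)],     # exponentially growing C
        "class_weight": [{1: 2.0 ** e} for e in range(0, 10)],  # w = 2^0..2^9
    }
    lr = LogisticRegression(penalty="l1", solver="liblinear")   # solver assumed
    search = GridSearchCV(lr, grid, scoring="f1", cv=4)         # 4-fold CV
    search.fit(X, labels)
    return vec, search.best_estimator_, search.best_params_
```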
6. Evaluation Metrics

Several evaluation metrics are proposed and widely used for this task; they are described below as in [11]. A sketch of how the two early-risk metrics can be computed follows Section 6.3.

6.1. Early Risk Detection Error (ERDE)

In detecting a serious illness or a condition that causes harm to an individual, time is of the essence: detection must happen at the right time so that preventive measures can be taken before the condition worsens. ERDE is a metric [6] that measures the error of a model with this in mind. Too eager a detection leads to false positive classifications, while too late a detection leads to serious harm, so the prediction should come at an optimal time for intervention, and ERDE measures exactly that. The delay of the model is measured by the number of user posts the depression classifier processed before making the classification, denoted k. Since positive cases are rare, a class imbalance problem arises, so each type of error is weighted differently in ERDE. Suppose the EDD system has made a classification decision d about a user being depressed or not after seeing k posts. The decision d leads to one of the four golden-truth judgements: true positive (TP), true negative (TN), false positive (FP), or false negative (FN). Based on these four cases, ERDE is defined as:

  ERDE_o(d, k) = c_fp               if d is a false positive
                 c_fn               if d is a false negative
                 lc_o(k) * c_tp     if d is a true positive
                 0                  if d is a true negative        (1)

In different studies, sometimes the positive class is the majority and sometimes the negative class; in our study the positive class is the minority. To build a measure that takes this class imbalance into account, c_fn is set to 1 here and should be greater than c_fp, which can be set to some proportion of positive examples (e.g., if there are 3% positive examples, set c_fp to 0.03). The third factor, the latency cost lc_o(k) ∈ [0, 1], encodes a penalty when true positives are identified with delay, and is set as in [6]:

  lc_o(k) = 1 - 1 / (1 + e^(k - o))        (2)

This function is parametrized by o, which controls the point where the cost starts to grow quickly. The overall error ERDE_o is the mean of the ERDE values over all classification decisions [11]. ERDE_o is thus a data-stream classification metric in which not only the prediction but also the moment of making it matters, which is crucial in the domain of early detection of depression and self-harm.

6.2. F-latency

In addition to the ERDE metric, CLEF eRisk 2021 [11] proposed using F_latency to obtain more interpretable evaluations. F_latency was proposed by Sadeque and colleagues [16] and is also referred to as latency-weighted F1: the F1 score multiplied by a speed factor, where speed is close to 1 when a user is correctly identified and classified from their very first post and close to 0 for slow systems that have to process a hundred or more writings of a user before identifying true positives.

6.3. Standard Classification Measures

Whenever supervised machine learning models are evaluated, the accuracy of the model is an important metric to consider, but in some cases accuracy alone does not give a complete view of the model, so other standard classification measures, namely precision, recall, and F-measure, are used to evaluate the model further. Precision, recall, and F1 score are also used in the CLEF eRisk tasks related to depression [11].
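The following sketch computes ERDE_o from equations (1)-(2) and a latency-weighted F1 in the spirit of [16]. The exact speed definition used by eRisk (a median-based latency penalty with parameter p) may differ in detail from this rendering, so treat it as an illustration.

```python
# Hedged sketch of ERDE_o (eqs. 1-2) and a latency-weighted F1 after [16].
import math
from statistics import median

def erde(decision, truth, k, o, c_fp, c_fn=1.0, c_tp=1.0):
    """ERDE for one user: decision/truth are 1 (positive) or 0, k = posts read."""
    if decision == 1 and truth == 0:
        return c_fp                                  # false positive
    if decision == 0 and truth == 1:
        return c_fn                                  # false negative
    if decision == 1 and truth == 1:
        lc = 1.0 - 1.0 / (1.0 + math.exp(k - o))     # latency cost, eq. (2)
        return lc * c_tp                             # true positive, delayed
    return 0.0                                       # true negative

def erde_o(decisions, truths, ks, o=50, c_fp=0.03):
    return sum(erde(d, t, k, o, c_fp)
               for d, t, k in zip(decisions, truths, ks)) / len(decisions)

def latency_weighted_f1(decisions, truths, ks, p=0.0078):   # p value assumed
    tp = sum(d == 1 and t == 1 for d, t in zip(decisions, truths))
    fp = sum(d == 1 and t == 0 for d, t in zip(decisions, truths))
    fn = sum(d == 0 and t == 1 for d, t in zip(decisions, truths))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # speed ~ 1 when true positives are flagged on their first post, ~0 when slow
    penalties = [-1 + 2 / (1 + math.exp(-p * (k - 1)))
                 for d, t, k in zip(decisions, truths, ks) if d == 1 and t == 1]
    speed = 1 - median(penalties) if penalties else 0.0
    return f1 * speed
```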
7. Results & Discussion

The experiment delivered good results for ERDE_o [11] and recall, but the accuracy and efficiency of the model are not optimal and leave room for improvement. The confidence threshold of 0.5 on the Logistic Regression output increases the F1 score slightly and increases recall drastically compared to the 0.75 and 0.8 thresholds; ERDE also decreases at the 0.5 threshold. Our CLEF eRisk 2021 Task 2 submission was made on 6 user writings, which is a low number for judging the performance of the whole model, so after the submission we applied the model to more writings to test what the result would have been. We also compared our model with the baseline model built using the approach of [7]. Our runs had better ERDE_o than the baseline model, and our proposed model also had better recall in the training stage than the baseline on the training data. The runs, along with predictions and metric results, are shown in Table 4.

Table 4: Results of the classifier on test-server predictions. LR-T2 run1 & run2 are the runs submitted for CLEF eRisk 2021 Task 2 (Early Detection of Signs of Self-Harm); LR-T2 test run1 & run2 are experiments performed with the same model but run on more user posts after the Task 2 submission.

  Classifier           Number of posts  F1     P      R      ERDE5  ERDE50
  LR-T2 run1           6                0.172  0.124  0.283  0.101  0.097
  LR-T2 run2           6                0.172  0.124  0.283  0.101  0.097
  LR-T2 test run1      104              0.185  0.103  0.848  0.097  0.096
  LR-T2 test run2      107              0.187  0.105  0.855  0.096  0.095
  Baseline-w-64 [7]    All              0.545  0.419  0.778  0.141  0.122
  Baseline-w-32 [7]    All              0.558  0.441  0.759  0.139  0.12

Run 1 does not clean URLs, punctuation marks, and encoding artifacts such as #8217 (the HTML encoding of an apostrophe) from the processed user posts, whereas Run 2 cleans the user posts before generating word embeddings. Run 2 differs only slightly from Run 1 in its results; neither is optimal in precision or F1 score, but the recall of the model is good. Run 1 classified 1241 users as positive, of which only 129 are true positives; Run 2 classified 1232 users as positive, correctly identifying 130. Run 1 read up to 104 user writings to make all of its predictions, instead of reading all of them, and Run 2 read up to 107. In Run 2, 1130 users were classified within their first 5 posts, making the speed of the experiment optimal; details are shown in Figure 4.

Figure 4: Histogram of the number of user writings read per user before classification in Run 2.

7.1. Future Work

Our approach to data preparation is unique, but due to limitations of resources and time we could not train BERT or RoBERTa on our generated data and used transfer learning instead. The proposed approach can be improved in the future through the following methodologies (a fine-tuning sketch follows this list):

- A model can be built from scratch on data prepared with our approach, or BERT or RoBERTa can be fine-tuned on it, to get more accurate results specific to the domain of early risk prediction.
- Similar to team iLab from CLEF eRisk 2020 [14], another experiment can process the test data the same way as the training data in our experiment, pass it through the classification model, average the output probabilities over all generated sentences, and decide whether a user is at risk or not.
- Instead of using BERT only for extracting word embeddings, BERT and its different architectures can be used for classification directly, and the resulting models compared.
- The data can be experimented with and compared across different BERT variants such as RoBERTa, or more state-of-the-art models such as GPT-3.
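As a pointer for the first and third items, a minimal fine-tuning setup with the Hugging Face transformers Trainer might look as follows. Hyperparameters and the dataset wrapper are illustrative assumptions, not choices made in this paper.

```python
# Hedged sketch: fine-tune bert-base-uncased for self-harm classification.
import torch
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

class ChunkDataset(torch.utils.data.Dataset):
    """Wraps the prepared (sentence, label) rows from Section 5.1."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, padding=True, truncation=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def finetune(texts, labels):
    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    args = TrainingArguments(output_dir="out", num_train_epochs=3,      # assumed
                             per_device_train_batch_size=16)            # assumed
    Trainer(model=model, args=args,
            train_dataset=ChunkDataset(texts, labels, tok)).train()
    return model
```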
8. Acknowledgements

I would like to thank my university, the National University of Computer & Emerging Sciences, my research supervisor Muhammad Rafi, and my colleagues at IBM, Asna Javed, Mohammad Ali Khan, and Fawaz Siddiqui, for supporting me in my research and helping out with a few technicalities.

9. References

[1] GBD 2017 Disease and Injury Incidence and Prevalence Collaborators, Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017, The Lancet (2018).
[2] World Health Organization, Mental Health and Substance Use, 2016. URL: https://www.who.int/teams/mental-health-and-substance-use/suicide-data.
[3] J. Loechner, Text vs. Talk Gets Millennials' Attention, 2016. URL: https://www.mediapost.com/publications/article/275332/text-vs-talk-gets-millennials-attention.html.
[4] S. Argamon, J. Fine, Gender, Genre, and Writing Style in Formal Written Texts, Text - Interdisciplinary Journal for the Study of Discourse 23(3) (2003). doi:10.1515/text.2003.014.
[5] A. Arsalan, M. Majid, S. M. Anwar, Electroencephalography Based Machine Learning Framework for Anxiety Classification, in: I. S. Bajwa, T. Sibalija, et al. (Eds.), Intelligent Technologies and Applications, Second International Conference, INTAP 2019, Bahawalpur, Pakistan, 2020, pp. 187-197. doi:10.1007/978-981-15-5232-8_17.
[6] D. E. Losada, F. Crestani, A Test Collection for Research on Depression and Language Use, in: International Conference of the Cross-Language Evaluation Forum for European Languages, 2016. doi:10.1007/978-3-319-44564-9_3.
[7] E. Fano, J. Karlgren, J. Nivre, Uppsala University and Gavagai at CLEF eRisk: Comparing Word Embedding Models, in: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, CEUR-WS.org, 2019.
[8] M. Trotzek, S. Koitka, C. M. Friedrich, Utilizing Neural Networks and Linguistic Metadata for Early Detection of Depression Indications in Text Sequences, IEEE Transactions on Knowledge and Data Engineering 32(3) (2018) 588-601. doi:10.1109/TKDE.2018.2885515.
[9] Reddit, I have every publicly available Reddit comment for research. ~1.7 billion comments @ 250 GB compressed. Any interest in this?, 2016. URL: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/.
[10] M. Matero, A. Idnani, Y. Son, et al., Suicide Risk Assessment with Multi-level Dual-Context Language and BERT, in: Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, 2019. doi:10.18653/v1/W19-3005.
[11] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, eRisk 2021: Pathological Gambling, Self-harm and Depression Challenges, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval, ECIR 2021, Lecture Notes in Computer Science, vol. 12657, 2021.
[12] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2019: Early Risk Prediction on the Internet, in: F. Crestani, M. Braschler, J. Savoy, A. Rauber, H. Müller, D. E. Losada, G. Heinatz Bürki, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, 10th International Conference of the CLEF Association, CLEF 2019, Lugano, Switzerland, September 9-12, 2019.
[13] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019. doi:10.18653/v1/N19-1423.
[14] R. Martinez-Castano, A. Htait, L. Azzopardi, Y. Moshfeghi, Early Risk Detection of Self-harm and Depression Severity using BERT-based Transformers: iLab at CLEF eRisk 2020, in: Early Risk Prediction on the Internet, September 22-25, 2020.
[15] J. M. Loyola, M. L. Errecalde, H. J. Escalante, M. Montes, Learning When to Classify for Early Text Classification, in: Computer Science - CACIC 2017, 2018, pp. 24-34. doi:10.1007/978-3-319-75214-3_3.
[16] F. Sadeque, D. Xu, S. Bethard, Measuring the Latency of Depression Detection in Social Media, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, ACM, 2018, pp. 495-503.
[17] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, F. Hutter, Efficient and Robust Automated Machine Learning, in: Advances in Neural Information Processing Systems 28, NIPS 2015, pp. 2962-2970.