=Paper=
{{Paper
|id=Vol-3180/paper-69
|storemode=property
|title=Early detection of depression using BERT and DeBERTa
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-69.pdf
|volume=Vol-3180
|authors=Sreegeethi Devaguptam,Thanmai Kogatam,Nishka Kotian,Anand Kumar M
|dblpUrl=https://dblp.org/rec/conf/clef/DevaguptamKKM22
}}
==Early detection of depression using BERT and DeBERTa==
Sreegeethi Devaguptam, Thanmai Kogatam, Nishka Kotian and Anand Kumar M

Department of Information Technology, National Institute of Technology Karnataka, Surathkal.

Abstract

In today's world, social media usage has become one of the most fundamental human activities. According to a report by Oberlo, 3.2 billion people are currently on social media, comprising 42% of the world's population. People post about their daily lifestyle, special occasions, views on ongoing issues, and their networks on social media platforms, and they also share things there that they would otherwise not have shared with other people. Social media helps us stay connected, keep informed, and mobilise on social issues. Given the surge in suicide attempts, social media can act as a life saver by helping to detect and trace users who are on the verge of depression and self-harm. Natural language processing methods, with the help of deep learning, are aiding in solving language- and text-related real-world problems such as sentiment analysis, translation of text into different languages, and depression detection. Many transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers), which learn to attend to different features with different weights, are put to use to solve NLP problems. In this paper, a supervised machine learning algorithm with a transfer learning approach is used to detect self-harm tendencies in social media users at the earliest.

Keywords

Natural Language Processing, BERT, DeBERTa, transfer learning, text augmentation, social media.

1. Introduction

Major depressive disorder, often known as clinical depression, is a mood disorder that affects how you feel and creates a persistent feeling of melancholy and loss of interest. Depression has an impact on how a person feels, thinks, and acts, and can result in a number of emotional and physical issues. According to the World Health Organization's 2020 report, about 264 million people worldwide suffer from depression. It is critical to have it checked; otherwise, there is a risk of the depression worsening over time, leading to self-harm or suicide. According to the Substance Abuse and Mental Health Services Administration (2018), major depressive episodes are most common in adolescents aged 12 to 17 years, followed by young adults aged 18 to 25 years, and are rare in people over the age of 50. Various helplines and mental health awareness initiatives have been established to minimise the suicide rate and to assist persons suffering from depression. Because of modern gadgets, social media networks, and lifestyle, people's behaviour has changed: millennials prefer to text rather than converse on the phone or face to face. According to a poll of 500 millennials conducted by OpenMarket, "75% of millennials would rather lose the capacity to converse than text". With the growth of social media, there are now a plethora of websites and channels where individuals freely discuss their depression difficulties and battles. People can share their tales regarding despair, attempted suicide, and other self-harm occurrences in support groups.
People prefer to text or write about sadness on social media rather than seek professional treatment, and some even prefer to chat with bots. Thanks to social media outlets, we now have textual data that can be leveraged for early diagnosis of depression and self-harming inclinations. Textual data has certain hidden patterns and styles that can be used to identify an author or even their gender, and with improved natural language processing methodologies and tools, depressed persons can perhaps be identified before their condition worsens or they self-harm. The goal of Early Depression Detection (EDD) is to track a user's messages over time in chronological sequence and determine whether or not the user is depressed, early enough to intervene and provide assistance. There have not been many diversified solutions to this specific challenge, or validation against other datasets, due to the lack of distinct public datasets relating the early diagnosis of depression in internet users to their writings and social media feeds. For EDD, researchers have employed natural language methods combined with ensemble classifiers, but only a few have attempted to address the problem using cutting-edge deep learning models. Although a machine learning model cannot replace a professional therapist or psychiatrist, not everyone has access to one, and with the proliferation of social media and textual data, a smart model can assist internet users. In this paper, deep learning models are trained for early detection of depression based on texts taken from Reddit in chronological order. The paper follows the approach of fine-tuning BERT and DeBERTa models along with data augmentation to get a balanced dataset.

2. Related Work

Self-harm detection has come into the limelight in the deep learning domain as it directly affects people's lives, which has attracted many scholars and researchers to work on this problem; a large body of work addresses the detection of self-harm [8, 9]. Many of the papers have applied different machine learning (ML) algorithms in combination with transfer learning models [6, 7]. Yueh-Ming Tai et al. [1] collected medical data and eight other factors, such as age, depression status, and family and community history of depression-related problems, from Taiwanese soldiers and people admitted to hospitals with mental disorders. They used an artificial neural network (ANN) approach, Radial Basis Function (RBF) models, to detect self-harm history and suicide-attempt history. Taru Jain et al. [2] took tweets from Twitter as their dataset and trained an adversarial machine learning model; the paper also discusses detection of self-harm risk at the character and word level, and its adversarial training uses augmentation with GANs such as SentiGAN. A. Benton et al. [3] propose a model that considers the demographic attributes along with the mental attributes of a person under study; it deploys a multi-task learning (MTL) framework and logistic regression to detect the onset of different neurological problems in a person.
Pratool Bharti et al. [4] use a watchdog model with three important phases for detecting a person's self-harm risk: an accelerometer is given to every studied user to wear on the wrist, a model is developed to predict whether a person is active or inactive, and machine learning algorithms such as random forest classification are deployed to detect the self-harm risk. Parallel efforts have been made to develop strong strategies to improve the robustness of AI systems. Prior research has included word recognition models [13, 14] and denoising auto-encoders [15] to combat character-level attacks. Adversarial training has been proven to be effective for word-level perturbations [16], and we employ it in our study as well. Other protection strategies include reinforcement learning [17] and detecting adversarial noise before it has an impact on model predictions [18].

3. Methodology

Two transformer-based models were trained to detect early traces of depression. As Figure 1 shows, the dataset provided is quite imbalanced: it contains a much higher number of writings from users who did not have depression than from those who did. We applied text augmentation to increase the size of the positive class and performed downsampling on the negative class. Figure 3 represents the flowchart of our method (a sample paraphrased sentence is used). A description of the techniques used is given below.

3.1. Data Preprocessing

The first stage is data preprocessing. The writings were cleaned by applying several techniques: conversion of text to lowercase, removal of URLs and HTML tags, removal of special characters and numbers, stopword removal, and substitution of emoticons with their corresponding textual descriptions. As most of the writings were within 100 words, as seen in Figure 2, longer posts were split into parts to make sure that the length stayed below 100 (a code sketch of this pipeline is given after Section 3.2).

Figure 1: Pie chart of the number of users in each class
Figure 2: Histogram of post lengths

3.2. Data Augmentation

Data augmentation techniques are used to artificially generate additional data from the available data. Augmentation is very widely used in computer vision tasks and can also be used in NLP, although text augmentation is quite challenging as it requires understanding the context of the sentences. In our method, word-level augmentation was applied using the Synonym augmentor from the nlpaug library, with WordNet as the source database and the maximum number of augmented words set to 20. It simply replaces words with suitable synonyms, which makes it one of the preferable methods in terms of computation cost; word-embedding-based augmentation techniques using GloVe or even BERT may give better results. Augmentation was done only for the train set.
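The cleaning pipeline of Section 3.1 could look roughly as follows. This is a minimal illustrative sketch, not the authors' code: the regular expressions, the NLTK stopword list, and the use of the emoji package to map emoji to textual descriptions are all assumptions.

```python
# Illustrative preprocessing sketch (not the authors' exact code).
# Assumes the NLTK stopword list and the `emoji` package are acceptable stand-ins.
import re
import nltk
from nltk.corpus import stopwords
import emoji

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    text = text.lower()                                 # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML tags
    text = emoji.demojize(text, delimiters=(" ", " "))  # emoji -> textual description
    text = re.sub(r"[^a-z\s]", " ", text)               # drop special chars and numbers
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

def split_long_post(text: str, max_words: int = 100) -> list[str]:
    # Split writings longer than 100 words into chunks of at most 100 words.
    words = clean_text(text).split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```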
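The synonym replacement of Section 3.2 corresponds directly to nlpaug's WordNet-backed SynonymAug. A minimal sketch with the aug_max value stated above; the sample sentence is hypothetical, and the WordNet corpus must be available via NLTK.

```python
# Minimal synonym-augmentation sketch using nlpaug's WordNet source (Section 3.2).
import nlpaug.augmenter.word as naw

# Replace up to 20 words per positive-class writing with WordNet synonyms.
aug = naw.SynonymAug(aug_src="wordnet", aug_max=20)

sample = "i feel hopeless and tired all the time"  # hypothetical example post
augmented = aug.augment(sample)                    # a list of strings in recent nlpaug versions
print(augmented)
```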
3.3. Classification

Two transformer-based models were trained, namely BERT and DeBERTa. These contain stacked transformer blocks, each of which contains a self-attention head followed by a fully connected feed-forward network. After performing the initial steps, the dataset is divided into 80% for training and 20% for validation. The input textual data is first converted into the required input format using a tokenizer, which splits each sequence into tokens available in the tokenizer vocabulary; these tokens can be words or subwords. Additionally, the tokenizer performs truncation and padding to make sure that the maximum length is 100, adds the special tokens, and returns the input ids and the attention mask in tensor format. Once the input data is prepared, it can be used to train the models. A transfer learning technique is used, wherein the BERT model, originally pretrained on BookCorpus and Wikipedia data on two tasks, i.e., masked language modelling and next sentence prediction, is fine-tuned on our self-harm detection dataset to perform sequence classification.

Figure 3: Flowchart for classification using BERT

3.3.1. BERT

BERT and other transformer encoder architectures have proven to be quite effective in a range of NLP tasks. They create natural-language vector-space representations that can be used in deep learning models. An attention mechanism is used to learn contextual relations between the words: the context of a word is understood from both the left and right parts surrounding it. These models are pre-trained on a huge corpus of text before being fine-tuned for specific tasks. We used the BERT-base cased model, which has the trained weights of the original BERT model. As we are dealing with a binary classification problem, i.e., self-harm or not self-harm, we use the binary cross-entropy loss function. For optimization, we used the Adam optimizer (learning_rate=2e-5, epsilon=1e-08), which limits the prediction loss and performs regularization by weight decay.

3.3.2. DeBERTa

DeBERTa improves on the BERT and RoBERTa models. It uses a disentangled attention mechanism in which each word is represented by two vectors, one encoding information about the word's position and the other about its content. The attention weights are calculated from disentangled matrices on the basis of content and relative position. This stems from the idea that the relative positioning of words can provide useful information; for example, the dependency between a pair of words may be higher when they occur adjacent to each other. Second, DeBERTa uses an enhanced mask decoder that takes the absolute word positions into consideration: although relative position and content are considered by the attention mechanism, it does not take into account absolute positions, which can be pivotal in certain cases. Further, DeBERTa uses the virtual adversarial training regularization technique, incorporated right before the softmax layer, to improve performance; this helps prevent over-fitting and improves generalization. We used SparseCategoricalCrossentropy as the loss function, Adam as the optimizer (learning_rate=2e-05, epsilon=1e-06), and accuracy as the metric.
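A condensed sketch of the tokenization and fine-tuning steps just described, using the Hugging Face transformers library with TensorFlow/Keras. The variable names, the single-logit head, and the batch size are assumptions; the maximum length of 100, learning rate 2e-5, epsilon 1e-8, and single training epoch are the values reported in this paper.

```python
# Sketch of tokenization + BERT fine-tuning (Sections 3.3 and 3.3.1).
# Assumes train_texts/train_labels and val_texts/val_labels already exist
# as the output of the preprocessing and augmentation steps above.
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def encode(texts):
    # Truncates/pads every writing to 100 tokens, adds the special tokens,
    # and returns input_ids and attention_mask as TensorFlow tensors.
    return dict(tokenizer(texts, truncation=True, padding="max_length",
                          max_length=100, return_tensors="tf"))

# Single-logit head so the paper's binary cross-entropy loss applies.
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased",
                                                        num_labels=1)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-8),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
)
y_train = tf.constant(train_labels, dtype=tf.float32)[:, None]
y_val = tf.constant(val_labels, dtype=tf.float32)[:, None]
model.fit(encode(train_texts), y_train,
          validation_data=(encode(val_texts), y_val),
          epochs=1, batch_size=16)  # one epoch, as reported; batch size assumed
```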
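The DeBERTa run would differ mainly in the checkpoint, the loss, and the optimizer epsilon. A sketch under the assumption that the microsoft/deberta-base checkpoint and the TensorFlow port of DeBERTa in transformers were used:

```python
# Sketch of the DeBERTa variant (Section 3.3.2); checkpoint name is an assumption.
import tensorflow as tf
from transformers import AutoTokenizer, TFDebertaForSequenceClassification

deberta_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
deberta = TFDebertaForSequenceClassification.from_pretrained(
    "microsoft/deberta-base", num_labels=2)

deberta.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-6),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],  # loss, optimizer settings, and metric as reported
)
```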
4. Results

Table 1 contains the performance evaluation results that were obtained. Run 0 uses the BERT model without text augmentation, run 1 the DeBERTa model without text augmentation, run 2 the BERT model with text augmentation, and run 3 the DeBERTa model with text augmentation. The system fires an alert when the last user post is classified as positive. Comparing the results from these runs, the highest recall was obtained using DeBERTa without augmentation, while the highest F1 was obtained using DeBERTa with augmentation. ERDE values were found to be greatest in run 2. These values are not ideal, and there is scope for improvement. There was also a considerable time lapse of 01:52:57 while processing the writings, which is not optimal, and the low number of writings processed (6) may not give a good picture of the overall model.

Table 1: Decision-based evaluation results

Run  P      R      F1     ERDE5  ERDE50  latencyTP  speed  latency-weighted F1
0    0.138  0.796  0.235  0.047  0.039   2.0        0.996  0.234
1    0.135  0.806  0.231  0.047  0.039   2.0        0.996  0.230
2    0.132  0.786  0.225  0.050  0.040   2.0        0.996  0.225
3    0.149  0.724  0.248  0.049  0.039   2.0        0.996  0.247
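For reference, the ERDE columns above report the early risk detection error used in the eRisk labs [22]. The following is a reconstruction of its standard definition, included for the reader's convenience; the exact cost values c_fp, c_fn, c_tp are fixed by the lab organisers and are not reproduced here. ERDE_o penalises a correct positive decision more the later it is issued, with o (5 or 50) marking the number of writings after which the penalty grows rapidly:

```latex
% Standard ERDE definition (reconstruction; see the eRisk overview [22]).
\[
\mathrm{ERDE}_o(d, k) =
\begin{cases}
  c_{fp} & \text{if } d \text{ is a false positive} \\
  c_{fn} & \text{if } d \text{ is a false negative} \\
  lc_o(k) \cdot c_{tp} & \text{if } d \text{ is a true positive after } k \text{ writings} \\
  0 & \text{if } d \text{ is a true negative}
\end{cases}
\qquad
lc_o(k) = 1 - \frac{1}{1 + e^{\,k - o}}
\]
```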
5. Conclusions and Future Work

In this experiment, a text augmentation method was applied to create additional samples of posts with signs of depression in order to create a balanced dataset; this was done after random downsampling of the negative class. As seen from the results, the majority of the performance metrics did not show a great change between the models trained with and without augmentation, and the two transformer models showed only minor differences in performance. Due to limitations on computing resources, the models were trained for only one epoch, and a substantial amount of data was removed during the downsampling applied to the negative class. We have explored only one of many text augmentation methods; the augmentation process can be refined by creating a pipeline of several different augmentors, as well as by using more advanced techniques that take the context of the sentences into account. These aspects can be considered to improve the experiment in the future. Ultimately, the development of an efficient tool that can accurately identify signs of depression from posts can be very beneficial and could be integrated into social media platforms. To further improve on this work, an explainable AI model can be developed so that the predicted result can be interpreted, giving an idea of why a post is classified into a particular class.

References

[1] Yueh-Ming Tai, Hung-Wen Chiu: Artificial Neural Network Analysis on Suicide and Self-Harm History of Taiwanese Soldiers. Second International Conference on Innovative Computing, Information and Control (ICICIC 2007).
[2] Taru Jain: Adversarial Machine Learning for Self Harm Disclosure Analysis. 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM).
[3] Adrian Benton, Margaret Mitchell, Dirk Hovy: Multitask Learning for Mental Health Conditions with Limited Social Media Data. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 152-162, Apr. 2017.
[4] Pratool Bharti, Anurag Panwar, Ganesh Gopalakrishna, Sriram Chellappan: Watch-Dog: Detecting Self-Harming Activities from Wrist Worn Accelerometers. IEEE Journal of Biomedical and Health Informatics.
[5] Hassan Alhuzali, Tianlin Zhang, Sophia Ananiadou: Predicting Sign of Depression via Using Frozen Pre-trained Models and Random Forest Classifier. CLEF eRisk.
[6] Luís Oliveira: Bioinfo@uavr at eRisk 2020: On the Use of Psycholinguistics Features and Machine Learning for the Classification and Quantification of Mental Diseases (2020).
[7] Rodrigo Martínez-Castaño, Amal Htait, Leif Azzopardi, Yashar Moshfeghi: Early Risk Detection of Self-Harm and Depression Severity Using BERT-Based Transformers (2020).
[8] Sharath Chandra Guntuku, Daniel Preotiuc-Pietro, Johannes C. Eichstaedt, Lyle H. Ungar: What Twitter Profile and Posted Images Reveal About Depression and Anxiety. Proceedings of the International AAAI Conference on Web and Social Media, volume 13, 2019, pp. 236-246.
[9] Mario Ezra Aragón, Adrian Pastor López-Monroy, Luis Carlos González-Gurrola, Manuel Montes-y-Gómez: Detecting Depression in Social Media Using Fine-Grained Emotions. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1481-1486.
[10] Ana-Maria Bucur, Adrian Cosma, Liviu P. Dinu: Early Risk Detection of Pathological Gambling, Self-Harm and Depression Using BERT. CLEF eRisk 2021.
[11] Elena Campillo-Ageitos, Hermenegildo Fabregat, Lourdes Araujo, Juan Martinez-Romo: NLP-UNED at eRisk 2021: Self-Harm Early Risk Detection with TF-IDF and Linguistic Features. CLEF eRisk 2021.
[12] Diana Inkpen, Ruba Skaik, Prasadith Buddhitha, Dimo Angelov, Maxwell Thomas Fredenburgh: uOttawa at eRisk 2021: Automatic Filling of the Beck's Depression Inventory Questionnaire Using Deep Learning. CLEF eRisk 2021.
[13] Danish Pruthi, Bhuwan Dhingra, Zachary C. Lipton: Combating Adversarial Misspellings with Robust Word Recognition. ACL, 2019.
[14] Keisuke Sakaguchi, Kevin Duh, Matt Post, Benjamin Van Durme: Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, pp. 3281-3287.
[15] Keita Kurita, Anna Belova, Antonios Anastasopoulos: Towards Robust Toxic Content Classification, 2019.
[16] Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits: Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 8018-8025, 2020.
[17] Jingjing Xu, Liang Zhao, Hanqi Yan, Qi Zeng, Yun Liang, Xu Sun: LexicalAT: Lexical-Based Adversarial Reinforcement Training for Robust Sentiment Classification. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5518-5527, Nov. 2019.
[18] Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, Wei Wang: Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification. EMNLP-IJCNLP, 2019.
[19] Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen: DeBERTa: Decoding-enhanced BERT with Disentangled Attention. ICLR, 2021.
[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[22] Javier Parapar, Patricia Martín-Rodilla, David E. Losada, Fabio Crestani: Overview of eRisk 2022: Early Risk Prediction on the Internet. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association, CLEF 2022. Springer International Publishing, Bologna, Italy.