<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Pretrained Language Model for Mental Health Risk Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego Maupomé</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fanny Rancourt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raouf Belbahar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marie-Jean Meurs</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université du Québec à Montréal</institution>
          ,
          <addr-line>Montréal, QC</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Early detection of mental health issues is a key contributor to efficient treatment. Natural language processing-based approaches can provide automated means to facilitate access to appropriate services and support for at-risk individuals. Using pretrained language models provides state-of-the-art results in various downstream tasks, as these models leverage significant amounts of textual content. They can be critical in data-scarce research areas, such as early detection of mental health issues. Nonetheless, exposing models to domain-specific language can be beneficial to their performance on downstream tasks. To this end, we release pretrained language models, MentalHealthBERT, leveraging content from Reddit fora discussing anorexia, depression and self-harm. These models are evaluated on risk detection tasks for the respective conditions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Early intervention in mental health and well-being has become a critical principle of mental health care, ushering in an international wave of service reform [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. Given the ever-growing use and diversity of online social media, there has been a vast increase in research interest in the use of Natural Language Processing (NLP) for the development of automated means of analyzing online textual content in the service of mental health care support, and of early intervention in particular [3, 4].</p>
      <p>The inference of such predictive models requires the gathering of annotated data. These data map online textual content to an assessment of certain aspects of the mental health of the authors of this content. Such assessments are difficult to produce. Whereas for other common tasks in NLP, annotation can operate on the observation itself (e.g. the text), annotation relating to mental health generally requires further information about the author of the textual content. That is, the true aspects of interest pertain to the author rather than the text. In particular, clinically grounded assessments require access to the individual. As such, gathering annotated data is expensive and time-consuming.</p>
      <p>In the absence of large quantities of annotated data, it is a well-established principle of machine learning that pretraining on an unsupervised task can help performance on a downstream supervised task. As such, there has been increased interest in the production of pretrained models leveraging large amounts of textual content [5, 6, 7, 8]. Such models are made available for use on a variety of specialized downstream tasks [9, 10]. The core tenet is that large models trained on sufficiently large data sets will learn to produce useful representations of text regardless of what specialized task these representations will serve. Such a framework leverages large quantities of data for models to learn aspects of language that are thought to precede the specifics of the specialized task. While this assumption may hold for many tasks, pretraining data can also issue from different sources than the specialized data. As such, representations produced by general-purpose models might be inadequate. Recent work has pointed to the benefits of domain specificity in large pretrained models. Broadly, the term domain refers to the topics, mode or register of documents. Concerns for domain specificity can take the form of models pretrained entirely on domain-specific data, or of domain adaptation. In either case, gains in downstream task performance have been reported for several tasks and domains from the use of such domain-specific pretraining [11].</p>
      <p>The textual data analyzed for mental health care purposes issues from Internet fora and social media. These data can differ both in register and topics from the news or encyclopedia articles comprising significant parts of large corpora. Nonetheless, there is no established linguistic consensus on what constitutes a domain [see 12, Sec. 3.4.1]. Given this difficulty in defining the notion of domain, it is difficult to delineate given domains or to establish quantitative differences between them. Pragmatically, one might ask whether a more narrow concept of a given domain may provide more benefit to downstream task performance than a broader one. The present work seeks to study this issue in the context of mental health risk assessment. Models are pretrained on data from Internet fora revolving around three different mental health concerns: anorexia, depression and self-harm. The models are evaluated on detection tasks surrounding these concerns and compared to models trained on broader data [8]. Our results corroborate the benefits of such domain-specific models, but only show advantages to pretraining-data specificity in one case: anorexia.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Data</title>
      <sec id="sec-1a-1">
        <title>2.1. Retrieval</title>
        <p>The data were extracted from three Reddit (https://www.reddit.com) fora (known as subreddits): depression, selfharm and AnorexiaNervosa. This extraction was performed using Pushshift (https://pushshift.io) [13]. For all three subreddits, it was limited to posts published from the 1st of January 2019 to the 25th of November 2020; the last post from AnorexiaNervosa was published on the 3rd of December 2020. Further, posts struck off as “removed” were discarded. The fields associated with each post include the title and body of the post, as well as the timestamp, the score (aggregation of up- and down-votes), the number of replies and the identifier of the parent post. No additional filtering was applied.</p>
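        <p>The paper does not detail the retrieval script; the following is a minimal sketch of the kind of paginated Pushshift query such an extraction implies. The endpoint, parameters and field names are assumptions based on the public api.pushshift.io service, not on the released code.</p>
        <preformat>
# Hypothetical sketch of a Pushshift extraction over the stated date range.
import time
import requests

ENDPOINT = "https://api.pushshift.io/reddit/search/submission/"

def fetch_submissions(subreddit, after, before, page_size=100):
    """Yield submissions from one subreddit between two Unix timestamps."""
    while True:
        params = {
            "subreddit": subreddit,
            "after": after,
            "before": before,
            "size": page_size,
            "sort": "asc",
            "sort_type": "created_utc",
        }
        batch = requests.get(ENDPOINT, params=params, timeout=30).json()["data"]
        if not batch:
            return
        for post in batch:
            # Keep the fields described for the corpus: title, body, timestamp,
            # score, number of replies and post identifier.
            yield {
                "id": post["id"],
                "title": post.get("title", ""),
                "body": post.get("selftext", ""),
                "created_utc": post["created_utc"],
                "score": post.get("score"),
                "num_comments": post.get("num_comments"),
            }
        after = batch[-1]["created_utc"]  # paginate past the last retrieved post
        time.sleep(1)  # be conservative with the public rate limit

# Example: r/AnorexiaNervosa from 2019-01-01 to 2020-11-25 (UTC timestamps).
# posts = list(fetch_submissions("AnorexiaNervosa", 1546300800, 1606262400))
        </preformat>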
      </sec>
      <sec id="sec-1a-2">
        <title>2.2. Description</title>
        <p>All subreddits considered are described, as per their respective “About Community” sections, as communities that offer a safe place and peer support for people affected by the aforementioned issues. Summary statistics for the corpus are presented in Table 1.</p>
        <p>[Table 1. Summary statistics per subreddit (AnorexiaNervosa, depression, selfharm): tokens, vocabulary, posts and comments with their average number of tokens, unique authors and community size.]</p>
        <p>The depression forum is by far the biggest community of the three, with more than 736,000 members as of March 2nd 2021. Of those, about 45% authored at least one publication (i.e. a post or a comment) in the selected time frame. A similar proportion can be observed for AnorexiaNervosa. In turn, it jumps to almost two-thirds for selfharm. Across all three subreddits, approximately 40% of the authors published exactly once. Despite having the fewest overall publications, threads on AnorexiaNervosa seem to generate the most engagement, having a higher ratio of comments per thread and remaining active for longer periods. The smaller size of this community is a likely explanation for these observations.</p>
      </sec>
    </sec>
    <sec id="sec-1b">
      <title>3. Ethical Considerations</title>
      <p>All posts collected in the aforementioned subreddits are public, but our collection will not be publicly available. Further, resources discussed in this work will be released upon the signature of a User Agreement. The released model should only be used in combination with other screening tools for prevention purposes, under the supervision of trained mental health professionals. Hence, this system does not aim to diagnose mental health disorders and should not be used to do so.</p>
      <p>However, the misuse of this kind of work can have negative societal impacts. For example, an organization could use our pretrained language models to detect job applicants at risk of mental health disorders before hiring. This practice, violating the terms of the release agreement, would further spur discrimination in hiring processes, in addition to well-documented gender and racial unfairness [<xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>]. While this line of research could potentially advance early intervention and treatment processes, it does not directly address the stigma surrounding mental health issues and underlying the high rate of treatment avoidance and discontinuation [<xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>]. Further, widespread study and deployment of models in this direction could potentially lead to self-censorship, defeating its purpose.</p>
      <p>It is also important to note that demographic data on the authors is missing. As noted by Shatz [<xref ref-type="bibr" rid="ref18">18</xref>], most subreddits do not have data regarding their community demographics. Hence, it is impossible to ensure that the textual productions used to train the released model adequately represent content from diverse individuals. To the best of our knowledge, there is no readily available dataset containing information regarding the authors’ age, gender, ethnicity, or location. From inferred demographics, Amir et al. [<xref ref-type="bibr" rid="ref19">19</xref>] showed that such sensitive attributes affect depression prevalence across social media users. Aguirre et al. [<xref ref-type="bibr" rid="ref20">20</xref>] observed performance gaps related to gender and racial attributes. To address this gap, a data collection combining strict privacy policies and clinical supervision must be achieved. As noted by Aguirre et al. [<xref ref-type="bibr" rid="ref20">20</xref>], storing such sensitive data comes with serious potential harms. Therefore, it is critical to enforce protective measures such as data anonymization.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Pretraining</title>
      <sec id="sec-2-1">
        <title>4.1. Preprocessing</title>
        <p>One key issue in modeling corpora from Internet fora, rather than an edited outlet such as a newspaper or encyclopedia, is the longer vocabulary tail caused by misspellings, neologisms and even usernames. Common practice would be to remove words having fewer than three occurrences [<xref ref-type="bibr" rid="ref21">21</xref>]. Keeping such words would increase the computational burden of the model while having little chance of learning because of the limited number of occurrences. However, this is not suitable for our purposes: important words might be misspelled or obfuscated, and their exclusion would hinder performance [<xref ref-type="bibr" rid="ref22">22</xref>]. Similarly, usernames and neologisms might be composed from familiar, significant words. As such, we preserve the entire vocabulary of each dataset, relying on subword-level tokenization to capture these variations.</p>
        <p>Before learning this tokenization, the data was split
into training and validation sets by stratifying across
length (word count) percentiles. This preserves the key
length statistics, such as the median and interquartile
range. In terms of vocabulary, words in the validation
set not present in the training set make up 0.50%, 0.20%
and 0.10% of occurrences in the anorexia, self-harm and
depression sets, respectively.</p>
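        <p>As a concrete illustration, a length-stratified split of this kind can be obtained by binning documents into word-count percentiles and splitting within bins. The decile binning and 90/10 ratio below are illustrative assumptions; the text specifies only that the split stratifies across length percentiles.</p>
        <preformat>
# Illustrative sketch of a word-count-stratified train/validation split.
import numpy as np
from sklearn.model_selection import train_test_split

def length_stratified_split(documents, valid_fraction=0.1, n_bins=10, seed=0):
    """Split documents while preserving their word-count percentile profile."""
    lengths = np.array([len(doc.split()) for doc in documents])
    # Assign every document to a percentile bin of its length.
    edges = np.percentile(lengths, np.linspace(0, 100, n_bins + 1))
    bins = np.clip(np.digitize(lengths, edges[1:-1]), 0, n_bins - 1)
    # Sample the validation set uniformly within each bin.
    return train_test_split(
        documents, test_size=valid_fraction, stratify=bins, random_state=seed
    )
        </preformat>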
        <p>The data was tokenized by Byte-Pair Encodings (BPEs) [<xref ref-type="bibr" rid="ref23">23</xref>] at the byte level [6], with the merges extracted from all three datasets. This consolidation was done to provide a more robust tokenization scheme, less skewed towards any particular forum, while still learning the words and spellings of online parlance. For comparison, each dataset was also tokenized using merges learned exclusively from itself.</p>
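        <p>The following sketch shows how such byte-level merges could be learned over the three collections; the Hugging Face tokenizers library, the file names and the 30,000-entry vocabulary are assumptions, as the text names neither an implementation nor a merge budget.</p>
        <preformat>
# Sketch: learning byte-level BPE merges from the combined collections.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=[
        "anorexia_train.txt",     # hypothetical file names, one document per line
        "depression_train.txt",
        "selfharm_train.txt",
    ],
    vocab_size=30_000,
    min_frequency=1,              # keep the full vocabulary tail, as discussed above
    special_tokens=["&lt;s&gt;", "&lt;pad&gt;", "&lt;/s&gt;", "&lt;unk&gt;", "&lt;mask&gt;"],
)
tokenizer.save_model("mentalhealthbert-tokenizer")

# For the per-forum comparison, the same call is repeated with a single file.
        </preformat>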
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Training</title>
        <p>Once tokenized, these datasets were used to train Transformers [<xref ref-type="bibr" rid="ref24">24</xref>] using the RoBERTa approach [6]. Models are trained by the Adam optimizer [<xref ref-type="bibr" rid="ref25">25</xref>] with a learning rate of 5E-4 on batches of 256 sequences of a maximum length of 256 tokens. Training takes place over a maximum of 300 epochs, applying early stopping based on validation set perplexity.</p>
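        <p>A minimal pretraining sketch under these settings is given below. It assumes the Hugging Face transformers and datasets libraries, a base-sized configuration, plain-text training files and an early-stopping patience of three evaluations; none of these are stated in the text.</p>
        <preformat>
# Illustrative masked-language-model pretraining following the stated settings.
from datasets import load_dataset
from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
    EarlyStoppingCallback,
)

tokenizer = RobertaTokenizerFast.from_pretrained("mentalhealthbert-tokenizer")
raw = load_dataset(
    "text",
    data_files={"train": "combined_train.txt", "validation": "combined_valid.txt"},
)
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

config = RobertaConfig(vocab_size=tokenizer.vocab_size, max_position_embeddings=258)
model = RobertaForMaskedLM(config)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mentalhealthbert",
    per_device_train_batch_size=256,     # 256 sequences per batch
    learning_rate=5e-4,                  # Adam-family optimizer at 5E-4
    num_train_epochs=300,                # upper bound; early stopping cuts it short
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",   # lower MLM loss means lower perplexity
    greater_is_better=False,
)
trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
        </preformat>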
    <sec id="sec-3">
      <title>5.1. Experiments</title>
      <p>We evaluate the MentalHealthBERT models on the eRisk
datasets [26, 27, 28]. These datasets comprise Reddit
users (subjects) labeled as being at risk (positive) or not
(negative) for depression, self-harm or anorexia,
respectively. For each subject, a history of their writings is
included, spanning a variety of subreddits. The proportion
of positive subjects is fairly small and varies somewhat,
as does the size of the datasets, as shown in Table 2.</p>
      <p>The key issue is utilizing the document-level encoding afforded by MentalHealthBERT in predictions at the history level, which spans a variety of presumably independent writings. In order to make this subject-level prediction, information gathered across a set of writings needs to be aggregated. To achieve this, token embeddings are averaged together within posts and subsequently fed to a feed-forward network with a single hidden layer and hyperbolic tangent activation. The resulting document vectors are then aggregated by averaging into a single vector encoding a history of writings. This vector is then mapped to a binary prediction for the sequence of writings by a feed-forward network with a single hidden layer and hyperbolic tangent activation.</p>
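      <p>A sketch of this aggregation head is given below. The averaging steps, single hidden layers and tanh activations follow the description above; the hidden sizes and the masked mean over padded posts are illustrative assumptions.</p>
      <preformat>
# Sketch of the subject-level aggregation classifier described above.
import torch
import torch.nn as nn

class HistoryClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=768, ffn_size=256):
        super().__init__()
        self.encoder = encoder                      # e.g. a pretrained RoBERTa encoder
        self.post_ffn = nn.Sequential(              # document-level network, one hidden layer
            nn.Linear(hidden_size, ffn_size), nn.Tanh(), nn.Linear(ffn_size, ffn_size)
        )
        self.history_ffn = nn.Sequential(           # history-level network, one hidden layer
            nn.Linear(ffn_size, ffn_size), nn.Tanh(), nn.Linear(ffn_size, 1)
        )

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (posts, tokens) for one subject's history.
        token_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                         # (posts, tokens, hidden)
        mask = attention_mask.unsqueeze(-1).float()
        # 1) average token embeddings within each post;
        post_means = (token_states * mask).sum(1) / mask.sum(1).clamp(min=1)
        # 2) map each post to a document vector;
        doc_vectors = self.post_ffn(post_means)     # (posts, ffn_size)
        # 3) average document vectors into a single history vector;
        history = doc_vectors.mean(0)               # (ffn_size,)
        # 4) map the history vector to a binary logit.
        return self.history_ffn(history)            # (1,)
      </preformat>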
      <sec id="sec-3-1">
        <title>The experiments compare the performance of Mental</title>
        <p>
          HealthBERT to the generic RoBERTa Transformer as well
as the latter further pretrained on our data (domain
adap4.2. Training tation). For MentalHealthBERT, experiments were
carried out using BPEs learned from the combined dataset as
Once tokenized, these datasets were used to train Trans- well as from the individual collections. Additionally, we
formers [
          <xref ref-type="bibr" rid="ref22">24</xref>
          ] using the RoBERTa approach [6]. Models run experiments using MentalRoBERTa [8]. This model
are trained by the Adam optimizer [
          <xref ref-type="bibr" rid="ref23">25</xref>
          ] with a learning was pretrained on Reddit data from several diferent fora
rate of 5E-4 on batches of 256 sequences of a maximum touching on mental health topics5. It should be noted that
length of 256 tokens. Training takes place over a max- the results reported by the authors on the eRisk
depresimum of 300 epochs, applying early stopping based on sion detection task are not comparable to those reported
validation set perplexity. here, as they make use of a custom data split with some
resampling [29]. Models are evaluated per the area under
the precision-recall curve.
5. Mental Health Risk Detection One dificulty of detecting potential threats to mental
health is the small proportion of positive subjects that can
be found in datasets and, indeed, in a real-world setting.
        </p>
        <p>Additionally, for the selected datasets, these proportions
vary widely between the training and testing sets, as
shown in Table 2. Models were evaluated using the latest
set of data for each task: 2022 for Depression, 2021 for
Self-Harm and 2019 for Anorexia. Training and validation</p>
      </sec>
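      <p>For reference, the metric can be computed as below; using scikit-learn's average-precision estimate of the area under the precision-recall curve is an assumption, as no implementation is named.</p>
      <preformat>
# Sketch: area under the precision-recall curve for subject-level predictions.
from sklearn.metrics import average_precision_score

def auprc(labels, scores):
    """labels: 1 for at-risk subjects, 0 otherwise; scores: predicted risk."""
    return average_precision_score(labels, scores)
      </preformat>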
      <sec id="sec-3-2">
        <title>5An exhaustive list of the fora from which pretraining</title>
        <p>data were extracted is not available, but they include
depression, SuicideWatch, Anxiety, offmychest, bipolar,
mentalillness, and mentalhealth.</p>
        <sec id="sec-3-2-1">
          <title>Train</title>
        </sec>
        <sec id="sec-3-2-2">
          <title>Test</title>
          <p>dataset
positive
negative
positive
negative</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>Depression</title>
        </sec>
        <sec id="sec-3-2-4">
          <title>Self-Harm</title>
          <p>Anorexia
214
145
61
1493
618
411
1302
1296
742
sets for each task were obtained by combining the data be due to a deficiency in eating disorder content in
prefrom all previous sets and randomly selecting 80% of training MentalRoBERTa, though we cannot confirm this.
subjects for training and 20% for validation, preserving Tokenization seems to be inconsequential, with a more
equal proportions of positive and negative subjects. marked decrease in performance for the combined
tok</p>
      <p>To address this class imbalance in training, a number of strategies were deployed, including inverse class weighting, class weighting based on effective samples [30] and Focal Loss [31]. These proved to be ineffective in validation. The most effective mechanism proved to be sampling batches with even proportions of positive and negative subjects in training.</p>
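      <p>One way to realize such sampling is sketched below with a weighted sampler that draws positive and negative subjects with equal probability, so batches are balanced in expectation; the batch size and the sampler choice are illustrative assumptions.</p>
      <preformat>
# Sketch: class-balanced batch sampling over subjects.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, labels, batch_size=16):
    labels = torch.as_tensor(labels, dtype=torch.float)
    pos_mass = 0.5 / labels.sum()                   # half the probability mass per class
    neg_mass = 0.5 / (len(labels) - labels.sum())
    weights = torch.where(labels.bool(), pos_mass, neg_mass)
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
      </preformat>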
      <p>The number of writings used to arrive at a prediction for a subject was set to 50. In order to reduce overfitting, a contiguous sample of 50 writings was taken per subject at training time. In validation and testing, only the last 50 writings were taken. The classifiers are trained by the Adam optimizer [<xref ref-type="bibr" rid="ref25">25</xref>] over 10 epochs. Given the relatively modest size of the datasets in terms of positive subjects, only the top two layers of the Transformer encoder were trained, with a learning rate of 1E-5. The remainder of the model had a learning rate set to 1E-4.</p>
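      <p>This per-module treatment can be expressed with optimizer parameter groups, as sketched below; the encoder layout and the classifier head from the earlier sketch are assumptions.</p>
      <preformat>
# Sketch: freeze most of the encoder and use per-group learning rates.
import torch

def build_optimizer(model):
    # Freeze the whole encoder first...
    for param in model.encoder.parameters():
        param.requires_grad = False
    # ...then unfreeze only its top two Transformer layers.
    top_layers = model.encoder.encoder.layer[-2:]
    for param in top_layers.parameters():
        param.requires_grad = True
    return torch.optim.Adam([
        {"params": top_layers.parameters(), "lr": 1e-5},     # top encoder layers
        {"params": model.post_ffn.parameters(), "lr": 1e-4},
        {"params": model.history_ffn.parameters(), "lr": 1e-4},
    ])
      </preformat>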
      <p>Results on the eRisk datasets are presented in Table 3. Results for the base RoBERTa model indicate improvements with domain adaptation, in agreement with the literature [11]. Perhaps counterintuitively, these improvements appear to decrease with the amount of domain adaptation data available. MentalRoBERTa and MentalHealthBERT achieve comparable results in all but the anorexia task, for which MentalHealthBERT and domain-adapted RoBERTa outperform MentalRoBERTa. This may be due to a deficiency in eating disorder content in the pretraining of MentalRoBERTa, though we cannot confirm this. Tokenization seems to be inconsequential, with a more marked decrease in performance for the combined tokenizer in depression. Given the difficulty that specialized tokenization puts in transferring learning, it is difficult to support in light of these results. Finally, there appears to be little difference in performance between domain-adapted RoBERTa and the best MentalHealthBERT model, suggesting no real benefit to training blank models over adapting pretrained models. In light of these results, it is difficult to establish whether the specific domain pretraining of MentalHealthBERT helps downstream performance more so than the more general domain adaptation found in MentalRoBERTa. As mentioned, benefits are only observable for the anorexia task. Given what is known of the pretraining of MentalRoBERTa, it is difficult to establish whether this may be due to any material characteristics of discourse around anorexia or to its relatively smaller weight in pretraining.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>There is increased research interest in the development of NLP approaches to assist in early risk assessment in mental health care. Gathering annotated data is a costly process, making pretraining a crucial step in the modeling process. Thus, pretrained language models can be a valuable resource. However, general-purpose language models, while trained on large amounts of data, may not be suited to specific domains, such as mental health discussions. As such, there is interest in adapting language models to particular domains. In the case of mental health risk assessment from text, domain-specific pretraining resources would contain discourse concerning mental health concerns. However, it is worth considering whether discourse issuing from outlets specific to a particular mental health concern is more adequate than discourse around mental health issues at large. Our experiments have thus made use of data extracted from fora dedicated to specific mental health concerns to pretrain models. These models are compared to general-purpose language models, as well as to language models pretrained on broader mental health content, in a mental health risk assessment task. Our results indicate that domain adaptation does improve classification performance. However, a difference in performance between more narrowly pretrained models is only manifest in anorexia risk detection.</p>
      <p>Further work is needed to understand how textual data from separate mental health topics interact in terms of benefits from pretraining: more experimentation is needed to find whether the detection of certain mental health concerns is improved by pooling pretraining data, and whether these gains in detection performance align with the comorbidity of the underlying disorders. Were this the case, those benefits might be explained by the mention of related concerns in discussions about a specific mental health concern. While the pretraining data for our experiments was extracted from dedicated fora, our experiments do not control for the mention of related disorders or threats to mental health.</p>
    </sec>
    <sec id="sec-5">
      <title>Release of Resources</title>
      <p>Given the sensitive nature of the resources introduced, the models and associated open-source code will be released upon signing of a User Agreement providing details on permitted uses.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>models, while trained on large amounts of data</article-title>
          , may not [3]
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Shing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <article-title>A prioritization be suited to specific domains, such as mental health dis- model for suicidality risk assessment</article-title>
          , in: Proceedcussions.
          <article-title>As such, there is interest in adapting language ings of the 58th Annual Meeting of the Association models to particular domains</article-title>
          .
          <source>for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8124</fpage>
          -
          <lpage>8137</lpage>
          .
          <article-title>In the case of mental health risk assessment from text</article-title>
          , [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Maupomé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Armstrong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rancourt</surname>
          </string-name>
          , M.
          <article-title>- domain-specific pretraining resources would contain dis- J. Meurs, Leveraging textual similarity to predict course concerning mental health concerns. However, it Beck Depression Inventory answers, Proceedings of is worth considering whether discourse issuing from out- the Canadian Conference on Artificial Intelligence lets specific to a particular mental health concern are (</article-title>
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .21428/594757db.5c753c3d.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>more adequate than discourse around mental health is-</article-title>
          [5]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          , M. Gardner, sues at large.
          <article-title>Our experiments have thus made use of data C. Clark</article-title>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Deep contextualextracted from fora dedicated to specific mental health ized word representations</article-title>
          , arXiv:
          <year>1802</year>
          .
          <article-title>05365 [cs] concerns to pretrain models</article-title>
          .
          <source>These models are compared</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>to general-purpose language models as well as language [6</article-title>
          ]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Chen, models pretrained on broader mental health content in O</article-title>
          . Levy,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          <article-title>Stoya mental health risk assessment task. Our results indi- anov, Roberta: A robustly optimized BERT cate that domain adaptation does improve classification pretraining approach</article-title>
          , arXiv:
          <year>1907</year>
          .
          <volume>11692</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>performance. However,</surname>
          </string-name>
          <article-title>a diference in performance be</article-title>
          - arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>tween more narrowly pretrained models is only manifest [7</article-title>
          ]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          , C. D.
          <article-title>Manning, in anorexia risk detection</article-title>
          . ELECTRA:
          <article-title>Pre-training text encoders as discrimiFurther work is needed to understand how textual nators rather than generators</article-title>
          , arXiv:
          <year>2003</year>
          .
          <article-title>10555 [cs] data from separate mental health topics interact in terms (</article-title>
          <year>2020</year>
          ). URL: http://arxiv.org/abs/
          <year>2003</year>
          .10555, arXiv:
          <article-title>of benefits from pretraining: more experimentation is</article-title>
          <year>2003</year>
          .
          <volume>10555</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>needed to find whether the detection of certain mental [8</article-title>
          ]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Ansari,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.</surname>
          </string-name>
          <article-title>Camhealth concerns is improved by pooling pretraining data bria, MentalBERT: Publicly available pretrained and whether these gains in detection performance align language models for mental healthcare</article-title>
          , in: N.
          <article-title>Calzowith the comorbidity of the underlying disorders</article-title>
          . Were lari,
          <string-name>
            <given-names>F.</given-names>
            <surname>Béchet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cieri</surname>
          </string-name>
          ,
          <string-name>
            <surname>T. Dethis</surname>
          </string-name>
          <article-title>the case, those benefits might be explained by the clerck</article-title>
          , S. Goggi,
          <string-name>
            <given-names>H.</given-names>
            <surname>Isahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          , J. Mariani,
          <article-title>mention of related concerns in discussions about a spe- H.</article-title>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <article-title>Proceedings cific mental health concern. While pretraining data for of the Thirteenth Language Resources and Evaluour experiments was extracted from dedicated fora, our ation Conference, European Language Resources experiments do not control for the mention of related Association</article-title>
          , Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>7184</fpage>
          -
          <lpage>7190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>disorders or threats to mental health</article-title>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .lrec-
          <volume>1</volume>
          .
          <fpage>778</fpage>
          . [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. R.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>Release of Resources Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, Given the sensitive nature of the resources introduced</article-title>
          , arXiv:
          <year>1804</year>
          .07461 [cs] (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>the models and associated open-source code will be re</article-title>
          - [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pruksachatkun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nangia</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Singh, leased upon signing of a User Agreement providing de- J.</article-title>
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          <string-name>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>Supertails on permitted uses. GLUE: A stickier Benchmark for general-purpose language understanding systems</article-title>
          , arXiv:
          <year>1905</year>
          .00537 References [cs] (
          <year>2020</year>
          ). [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          , S. Swayamdipta,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schotanus-Dijkstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H. C.</given-names>
            <surname>Drossaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E. K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <surname>Don't Pieterse</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Boon</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          <string-name>
            <surname>Walburg</surname>
          </string-name>
          , E. T. Bohlmeijer, stop pretraining:
          <article-title>Adapt language models to doAn early intervention to promote well-being and mains and tasks</article-title>
          , arXiv:
          <year>2004</year>
          .10964 [cs] (
          <year>2020</year>
          ).
          <article-title>URL: lfourishing and reduce anxiety and depression: A http://arxiv</article-title>
          .org/abs/
          <year>2004</year>
          .10964. randomized controlled trial, Internet Interventions [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          , Domain Adaptation for Parsing,
          <source>Ph.D. the9</source>
          (
          <year>2017</year>
          )
          <fpage>15</fpage>
          -
          <lpage>24</lpage>
          . URL: https://www.sciencedirect. sis, University of Groningen,
          <year>2011</year>
          . com/science/article/pii/S2214782916300288. doi:10. [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Baumgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zannettou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Keegan</surname>
          </string-name>
          , M. Squire,
          <volume>1016</volume>
          /j.invent.
          <year>2017</year>
          .
          <volume>04</volume>
          .002.
          <string-name>
            <surname>J. Blackburn</surname>
          </string-name>
          ,
          <article-title>The Pushshift Reddit dataset</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. D.</given-names>
            <surname>McGorry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <article-title>Early intervention in Proceedings of the international AAAI conference youth mental health: Progress and future directions, on web and social media</article-title>
          , volume
          <volume>14</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>830</fpage>
          -
          <string-name>
            <surname>Evidence-Based Mental</surname>
          </string-name>
          health
          <volume>21</volume>
          (
          <year>2018</year>
          )
          <fpage>182</fpage>
          -
          <lpage>184</lpage>
          .
          <fpage>839</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sánchez-Monedero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dencik</surname>
          </string-name>
          , L. Edwards,
          <article-title>What does it mean to 'solve' the problem of discrimina-</article-title>
          [26]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of tion in hiring? Social, technical and legal perspec- eRisk: Early risk prediction on the Internet, in: tives from the UK on automated hiring systems</article-title>
          ,
          <source>in: International Conference of the Cross-Language Proceedings of the 2020 Conference on Fairness, Evaluation Forum for European Languages</source>
          ,
          <year>2018</year>
          , Accountability, and
          <string-name>
            <surname>Transparency</surname>
          </string-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>458</fpage>
          - pp.
          <fpage>343</fpage>
          -
          <lpage>361</lpage>
          .
          <fpage>468</fpage>
          . [27]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          , Overview of
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Quillian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hexel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Midtbøen</surname>
          </string-name>
          , eRisk
          <year>2019</year>
          :
          <article-title>Early risk prediction on the Internet, Meta-analysis of field experiments shows no change in: International Conference of the Cross-Language in racial discrimination in hiring over time</article-title>
          , Pro- Evaluation
          <source>Forum for European Languages</source>
          ,
          <year>2019</year>
          ,
          <source>ceedings of the National Academy of Sciences 114</source>
          pp.
          <fpage>340</fpage>
          -
          <lpage>357</lpage>
          . (
          <year>2017</year>
          )
          <fpage>10870</fpage>
          -
          <lpage>10875</lpage>
          . [28]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          , Overview of
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O. F.</given-names>
            <surname>Wahl</surname>
          </string-name>
          ,
          <article-title>Stigma as a barrier to recovery from eRisk 2020: Early risk prediction on the Internet, in: mental illness</article-title>
          ,
          <source>Trends in Cognitive Sciences 16 Experimental IR Meets Multilinguality</source>
          ,
          <string-name>
            <surname>Multimodal</surname>
          </string-name>
          (
          <year>2012</year>
          )
          <fpage>9</fpage>
          -
          <lpage>10</lpage>
          . ity, and
          <source>Interaction Proceedings of the Eleventh</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Evans-Lacko</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Thornicroft, Men- International Conference of the CLEF Association tal illness stigma, help seeking, and public health (CLEF</article-title>
          <year>2020</year>
          ),
          <year>2020</year>
          . programs,
          <source>American Journal of Public Health</source>
          <volume>103</volume>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          , Private Correspondence,
          <year>2022</year>
          . (
          <year>2013</year>
          )
          <fpage>777</fpage>
          -
          <lpage>780</lpage>
          . [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jia</surname>
          </string-name>
          , T.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          , Class-
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [18]
          <string-name>
            <surname>I. Shatz</surname>
          </string-name>
          , Fast, Free, and
          <article-title>Targeted: Reddit as a Source balanced loss based on efective number of samples, for Recruiting Participants Online</article-title>
          ,
          <source>Social Science in: Proceedings of the IEEE/CVF Conference on Computer Review</source>
          <volume>35</volume>
          (
          <year>2017</year>
          )
          <fpage>537</fpage>
          -
          <lpage>549</lpage>
          . Computer Vision and Pattern Recognition,
          <year>2019</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Amir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Ayers</surname>
          </string-name>
          , Mental health pp.
          <fpage>9268</fpage>
          -
          <lpage>9277</lpage>
          .
          <article-title>surveillance over social media with digital cohorts</article-title>
          , [31]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
          </string-name>
          , Foin: Proceedings of the Sixth Workshop on Compu-
          <article-title>cal loss for dense object detection</article-title>
          ,
          <source>arXiv:1708.02002 tational Linguistics and Clinical Psychology</source>
          ,
          <year>2019</year>
          , [cs] (
          <year>2018</year>
          ).
          <source>arXiv:1708</source>
          .
          <year>02002</year>
          . pp.
          <fpage>114</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C.</given-names>
            <surname>Aguirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Harrigian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <article-title>Gender and racial fairness in depression research using social media</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2932</fpage>
          -
          <lpage>2949</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Merity</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Pointer sentinel mixture models</article-title>
          ,
          <source>arXiv:1609</source>
          .07843 [cs] (
          <year>2016</year>
          ). arXiv:
          <volume>1609</volume>
          .
          <fpage>07843</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <article-title>What to do about non-standard (or non-canonical) language in nlp (</article-title>
          <year>2016</year>
          ). arXiv:
          <volume>1608</volume>
          .
          <fpage>07836</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Birch,</surname>
          </string-name>
          <article-title>Neural machine translation of rare words with subword unit</article-title>
          ,
          <source>in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1715</fpage>
          -
          <lpage>1725</lpage>
          . URL: https://www. aclweb.org/anthology/P16-1162. doi:
          <volume>10</volume>
          .18653/ v1/
          <fpage>P16</fpage>
          -1162.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , L. u. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          . URL: https://proceedings.neurips.cc/paper/2017/lfie/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <source>arXiv:1412.6980</source>
          (
          <year>2014</year>
          ). arXiv:
          <volume>1412</volume>
          .
          <fpage>6980</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>