
First Insights on a Passive Major Depressive Disorder Prediction System with Incorporated Conversational Chatbot

Computer Science

Engineering

Chalmers University of Technology

Fionnd@student.chalmers.se 2 0 Hong Kong University of Science and Technology 1 Insight Centre for Data Analytics, Data Science Institute, National University of Ireland, Galway 2 University of Gothenburg, Sweden

Almost 50% of cases of major depressive disorder go undiagnosed. In this paper, we propose a passive diagnostic system that combines the areas of clinical psychology, machine learning and conversational dialogue systems. We have trained a dialogue system, powered by sequence-to-sequence neural networks, that can hold a real-time conversation with individuals. In tandem, we have developed specific machine learning classifiers that monitor the conversation and predict the presence or absence of certain crucial depression symptoms. This would facilitate real-time, instant crisis support for those suffering from depression. Our evaluation metrics suggest this could be a positive future direction of research, both in developing more human-like chatbots and in identifying depression in written text. We hope this work may additionally have practical implications in the area of crisis support services for mental health organisations.

Keywords: Depression, Social Media, Conversational Chatbot

Each year, 300 million individuals worldwide will suffer a major depressive episode lasting a minimum of two weeks [1]. Furthermore, fewer than 50% will be correctly diagnosed and offered appropriate treatment. Left untreated, major depressive disorder (MDD) can lead to suicide; an estimated 800,000 people a year lose their lives to suicide [1]. One particular issue that contributes to the low diagnostic and treatment rates is how the etiology of depression can directly interfere with an individual seeking treatment.

Since many mental disorders are not characterised by clear changes in external or physical appearance, detection and diagnosis become more challenging. In the case of depressive disorders, individuals are often not aware that their symptoms are due to a medical disorder and often attribute them to poor mood or external factors [2]. To complicate matters further, MDD often also negatively affects an individual's social interactions. This can hinder an individual from seeking professional support or talking about their experiences. This presents a unique challenge to the medical community: how to identify individuals and support them in coming forward for diagnosis.

In this work, we propose the concept of passive diagnosis, a term for a new field of research that has emerged over the last two or three years. This work is not exclusive to research on mental health disorders, but concerns itself with using machine learning techniques to predict a potential future medical disorder. In the traditional concept of active diagnosis, an individual suffering certain symptoms actively seeks out a medical diagnosis; this process can now be complemented by a passive element. Unlike a medical professional, who has limited time and resources, machine learning algorithms can feasibly observe an individual's health constantly and passively. Once these algorithms detect changes in an individual's health that might be indicative of a disorder, they can inform the individual and an appropriate human professional for further investigation.

An example of this application is DeepCare, developed by Pham et al. [3]. This end-to-end application is designed to diagnose a wide range of disorders. This field of work allows medical professionals to actively provide interventions to those at high risk before the disorder even sets in. Our work can be considered to follow a similar trend, in that we propose a passive diagnostic approach to MDD.

2 Related Work

2.1 Depression Detection

As far back as 1901, Sigmund Freud proposed that language could give us an insight into certain mental illnesses [4]. A combination of NLP and psychology led to a number of publications investigating how different mental illnesses such as bipolar disorder [5], MDD [6] and anorexia [7] can manifest and be predicted through an individual's specific use of certain language characteristics. Examples of this work include higher counts of the word "I" and lower counts of future temporal words in depressed students' written notes [6–8]. Pennebaker et al. explain how many of these characteristics are consistent with the etiology of MDD [4].

This work led different researchers in the machine learning community to investigate whether there was sufficient basis for classifiers to distinguish between individuals suffering certain mental disorders and those not [5, 7]. Although this body of work has a solid machine learning background and high evaluation scores, from our perspective its practical medical application is limited. All the approaches look at MDD as a binary outcome variable, predicting, at time x, whether or not an individual would be diagnosed with MDD.

We understand how this approach makes sense, given that many elementary machine learning classifiers perform best when predicting a simple binary outcome. From the perspective of a medical professional, however, we can rarely place individuals into binary classes. MDD is defined by a specific combination of nine symptoms, and the presence or absence of certain symptoms can have dramatically different effects on the diagnosis [9]. For many professionals, MDD will be viewed as a spectrum, with individuals falling anywhere from low risk to high risk [2]. This limitation has also been noted by [10], whose proposed solution was recording the specific mention of certain symptoms in online text.

Following consultation with medical professionals, we decided to overcome this limitation by building a host of separate classifiers that work on a symptomatic level. The DSM-V lists nine different symptoms that can be present during the occurrence of MDD. We propose that five of these symptoms can be reasonably detected to some degree through an online human-computer interaction. These targeted symptoms are depressed mood most of the day, weight change not attributed to dieting, sleep change characterised by insomnia or hypersomnia, inappropriate guilt, and suicidal ideation. We developed five separate classifiers and allow medical professionals to make an overall diagnosis based on the results of the classifiers and their own domain expertise.

2.2 Conversational Chatbots

Recent approaches to building conversational chatbot systems are dominated by the usage of neural networks. The authors of [11] present an approach for conversational modelling that uses a sequence-to-sequence neural model. Their model predicts the next sentence given the previous sentences for an IT helpdesk domain, as well as for an open domain trained on a subtitles dataset. For open-domain dialogue generation, [12] propose an adversarial training approach utilising reinforcement learning to produce sequences that are indistinguishable from human-generated dialogue utterances.

A heuristic that guides the development of neural baseline systems for the extractive conversational chatbot task is described in [13]. Their system, called FastQA, demonstrates good performance due to its awareness of question words while processing the context. The authors of [14] demonstrate an approach to non-factoid answer generation with a separate component, based on bidirectional LSTMs, that determines the importance of segments in the input.

The increased use of social media as a communication tool between customers and brands has allowed for the development of these systems to handle real-time inquiries [15]. In addition, mental health support services are increasingly incorporating real-time communication tools such as texting and social media messaging as methods for individuals to talk to a counselor [16]. Some work has already explored the possibility of building conversational chatbots that emulate a counselor; this work makes use of audio, visual and text-based interactions [17].

3 Experimental Settings

In this section, we give an overview of the data collection techniques employed and the feature extraction methods used for each of our five classifiers. Additionally, we outline the process employed for developing our conversational system.

3.1 Sequence-to-Sequence Neural Network Toolkit

To train the conversational system, we use the OpenNMT toolkit [18], a generic deep learning framework that mainly specialises in sequence-to-sequence (seq2seq) models and covers a variety of tasks. We used the default neural network training parameters, i.e. 2 hidden layers, 500 hidden bidirectional LSTM⁴ units, input feeding enabled, a batch size of 64, a dropout probability of 0.3 and a dynamic learning rate decay.

Our data was composed of 13,053,384 question-answer pairs, 6,237,118 of which were obtained from the subreddit /r/AskReddit (see examples in Table 1). Subreddit submissions were considered as questions and the first reply as an answer. The remaining pairs were extracted from the OpenSubtitles dataset [19]. The conversational model was trained for 13 epochs and then further tuned on a selected extraction from the eRisk corpus (cf. 3.2).

Data: We consider the existing work published in the eRisk proceedings to have been inadvertently focused on the symptom of depressed mood. We were provided with the eRisk task training set created by [20], consisting of comments and submissions from 486 Reddit users, of which 83 users were labelled as suffering depression. We considered each submission or comment as a single data point labelled as depressed or non-depressed. This led to a collection of 307,065 data points, of which 10.74% had the depressed label.
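As a rough illustration of the labelling step described above, the sketch below propagates a user-level label to every post written by that user and reports the share of the positive class. The record layout and the function names (`label_posts`, `positive_share`) are assumptions made for illustration; they are not taken from the eRisk tooling.

```python
from typing import Dict, List, Set, Tuple


def label_posts(posts_by_user: Dict[str, List[str]],
                depressed_users: Set[str]) -> List[Tuple[str, int]]:
    """Give every post the label of its author (1 = depressed, 0 = non-depressed)."""
    labelled = []
    for user, posts in posts_by_user.items():
        label = 1 if user in depressed_users else 0
        for text in posts:
            labelled.append((text, label))
    return labelled


def positive_share(labelled: List[Tuple[str, int]]) -> float:
    """Fraction of data points that carry the positive (depressed) label."""
    if not labelled:
        return 0.0
    return sum(label for _, label in labelled) / len(labelled)


# Toy data with hypothetical users and posts.
posts_by_user = {"user_a": ["post 1", "post 2", "post 3"], "user_b": ["post 4"]}
labelled = label_posts(posts_by_user, depressed_users={"user_b"})
share = positive_share(labelled)
```

Because labels are inherited from users, a handful of prolific users can dominate either class, which is one reason the class balance is reported per data point rather than per user.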

Linguistic-based features: Five different groups of linguistic features were included, the first of which is the Linguistic Inquiry and Word Count lexicon (LIWC), which is commonly used in the eRisk task [21]. This lexicon scores 78 different linguistic features related to social, clinical, health and cognitive psychology. Scores are percentages of words in a text that reflect a given emotion, scaled between 0 and 1, where 1 indicates that all words in a sentence reflect a given emotion [21]. The authors of [22] found the use of activation and dominance sentiment characteristics to be a strong predictor of MDD in their Twitter dataset.

⁴ LSTM: long short-term memory

The Warriner lexicon [23] contains a more detailed analysis of valence, activation and dominance scores for words grouped by male, female and overall, providing a total of nine features. The NRC Affect Intensity Lexicon contains four different emotion scores: anger, fear, joy and sadness [24], while the SenticNet 5 lexicon provides polarity and intensity in combination with one of aptitude, attention, pleasantness or sensitivity [25]. All features above were included by calculating the mean word score of a post. Drawing on work in psycholinguistics, we included two additional features: counts of the personal pronoun "I" and the Flesch-Kincaid readability scores [6].
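The lexicon-based features above can be sketched as follows. This is a minimal illustration, not the feature extraction code used in this work: the toy lexicon is hypothetical, the mean is taken over words found in the lexicon (an assumption, since averaging over all words is equally plausible), the grade-level variant of the Flesch-Kincaid score is assumed, and syllables are counted with a crude vowel-group heuristic.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; apostrophes are kept inside words."""
    return re.findall(r"[a-z']+", text.lower())


def mean_lexicon_score(text: str, lexicon: Dict[str, float]) -> float:
    """Mean score over the words of a post that appear in the lexicon."""
    scores = [lexicon[w] for w in tokenize(text) if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def count_first_person(text: str) -> int:
    """Occurrences of the standalone pronoun "I" (contractions are not counted)."""
    return sum(1 for w in tokenize(text) if w == "i")


def count_syllables(word: str) -> int:
    """Crude heuristic: number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39*W/S + 11.8*Syl/W - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Hypothetical two-word lexicon and a toy post.
toy_lexicon = {"sad": 0.9, "tired": 0.6}
post = "I feel sad. I am tired."
features = [
    mean_lexicon_score(post, toy_lexicon),
    float(count_first_person(post)),
    flesch_kincaid_grade(post),
]
```

Each real lexicon would contribute one such mean per score dimension, so the same helper can be reused across the Warriner, NRC and SenticNet scores.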

Text embedding: All of our classifiers employ the same text embedding approach, which draws on the work of [7] and utilizes Doc2Vec [26]. The authors of [6] found that, in the context of Reddit data, pre-compiled word embeddings performed significantly worse than embeddings trained directly on their own Reddit data. A comparison on our part using fastText [27] found a similar situation.

In summary, the approach consists of mapping each word to a unique multidimensional vector and trying to predict the next word in the sentence. The Doc2Vec approach also maps each paragraph to a unique vector. Both vectors are then concatenated to predict the next word in a context. A number of different variations of this approach have been proposed [26]. We compared a set of these approaches by the error rate of a logistic regression model trained on the paragraph embeddings.

In our training, the approach with the lowest error rate was a combination of Distributed Bag of Words (DBOW) and Distributed Memory (DM). This combination has been proposed as an optimal method by [26]. It encompasses the DM method, which is explained above, and a DBOW version of Doc2Vec, in which the word vectors are dropped and the model is instead forced to predict words randomly sampled from the paragraph using a sliding window. Parameters for both algorithms include discarding all words that occurred fewer than twice in the corpus, 20 epochs and a final vector output of size 100. The two 100-dimensional vector outputs were concatenated into a joint 200-dimensional vector.

3.3 Suicidal Ideation Classifier

Data: A limited number of publications have gained access to small collections of suicide notes and performed some basic linguistic analysis on them [28, 29]. Our approach is based around a public subreddit titled /r/SuicideNotes (SN). This subreddit describes itself as "A location to immortalize your final words, or read the last words that others have written down." and contained 1,210 submissions as of the end of August 2018.

To enhance the validity of this data, we drew upon the work of [30], who identified users with suicidal tendencies by applying a time series approach to users posting on Reddit. Our approach consisted of selecting each of the 738 users who had posted a note on SN and retrieving all their historical posts using the complete Reddit dataset [31]. We began by removing users who had only ever made a single post (throwaway accounts). We then selected only those users whose last post ever was on SN. This gave us a list of 112 users who met the following two conditions: (i) they had a history of posting to Reddit, and (ii) they had posted to SN and had never interacted with Reddit again.
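The two selection conditions above can be expressed as a short filter. The sketch below assumes each user's history is available as a list of (timestamp, subreddit) pairs; that layout and the name `select_users` are illustrative assumptions, not the format of the Reddit dataset used here.

```python
from typing import Dict, List, Tuple

Post = Tuple[int, str]  # (unix timestamp, subreddit name)


def select_users(history: Dict[str, List[Post]],
                 target: str = "SuicideNotes") -> List[str]:
    """Keep users with more than one post whose most recent post is in `target`."""
    selected = []
    for user, posts in history.items():
        if len(posts) <= 1:
            # Single-post accounts are treated as throwaways and dropped.
            continue
        last_post = max(posts, key=lambda p: p[0])
        if last_post[1] == target:
            selected.append(user)
    return selected


# Toy histories with hypothetical users.
history = {
    "throwaway": [(100, "SuicideNotes")],
    "returned": [(100, "SuicideNotes"), (200, "AskReddit")],
    "kept": [(50, "AskReddit"), (150, "SuicideNotes")],
}
selected = select_users(history)
```

Only `kept` survives both conditions: `throwaway` has no posting history, and `returned` posted elsewhere after the note.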

A total of 1,502 Reddit posts labelled as having come from a suicidal user were extracted. We randomly extracted 1,500 more posts from Reddit to use as a control group. To further validate the nature of the posts, we refer to the work of [28] on syntactic features associated with suicide notes. Notably, we would expect to see a higher count of the personal pronoun "I" in suicide notes compared to control posts, which we do see in Figure 1.

[Figure 1. Bar chart of syntactic features (count of "I", present, achievement, negative) by group (suicide, normal); x-axis: percentage occurrence within the text.]

Feature creation & text embedding: For each post, we extracted 78 linguistic features related to psychology using the LIWC lexicon. We found the optimal Doc2Vec text embedding approach to be a single DBOW method with a setup of 100 dimensions, a minimum word occurrence count of two and 20 epochs.

3.4 Insomnia / Hypersomnia Classifier

Data: Our dataset for this classifier is all posts on the /r/Insomnia subreddit, which describes itself as "Posts and discussion about insomnia and sleep disorders.". We collected 40,000 posts from /r/Insomnia and an additional 40,000 random posts from /r/AskDocs, a subreddit focusing on general medical questions. In the results stage, we found an abnormally high degree of accuracy, suggesting overfitting of the data. We compensated for this by adding 20,000 randomly selected Reddit posts to each class as noise.

Feature creation & text embedding: We followed a similar approach to the suicidal ideation classifier: for each post, we extracted 78 LIWC linguistic features. Text embedding was Doc2Vec, with a DBOW approach providing the lowest error rate. Our total dataset was thus 120,000 posts, each of which had 178 features.

3.5 Weight Change Classifier

Data: Since our conversational system is only designed as an initial classification approach, we expect this classifier to indicate either that the individual is talking about weight change (positive class) or that the individual has made no mention of aspects related to weight change (negative class). Data was collected from the /r/Loseit subreddit, a community dedicated to weight change. 80,000 posts were collected and labelled as belonging to the positive class, and another 80,000 were collected from the /r/AskDocs subreddit and labelled as belonging to the negative class. An additional 40,000 posts were randomly allocated between the classes to prevent overfitting and simulate noise.
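One way to picture the noise step described above is the sketch below, which appends unrelated posts with uniformly random class labels. The function name `add_label_noise`, the fixed seed and the use of a uniform coin flip per post are assumptions for illustration; the text only states that the additional posts were randomly allocated between the classes.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (post text, class label)


def add_label_noise(dataset: List[Example],
                    noise_posts: List[str],
                    seed: int = 0) -> List[Example]:
    """Append unrelated posts with random labels (0 or 1), then shuffle."""
    rng = random.Random(seed)
    noisy = list(dataset)
    for text in noise_posts:
        noisy.append((text, rng.randint(0, 1)))
    rng.shuffle(noisy)
    return noisy


# Toy data with hypothetical posts.
dataset = [("lost two kilos this month", 1), ("my knee hurts when I run", 0)]
noise_posts = ["random post a", "random post b", "random post c"]
noisy_dataset = add_label_noise(dataset, noise_posts)
```

The intent of such noise is to stop a classifier from separating the classes purely on subreddit-specific vocabulary, at the cost of a lower ceiling on measured accuracy.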

Feature creation & text embedding: The exact same process that we applied to our sleep classifier was used here. A 178-feature space was created from 100 Doc2Vec DBOW dimensions and 78 LIWC features.

3.6 Excessive or Inappropriate Guilt

Approach: We found no existing research published on the syntactic features associated with inappropriate guilt in the English language. Additionally, we found no specific way to isolate guilt-related data on Reddit. The developers of Linguistic Inquiry and Word Count (LIWC) suggested that guilt can be recognised in certain cases from a combination of negative emotions and anxiety [21]. Both of these are features extractable with LIWC, and their sum can serve as a noisy proxy for guilt in a post. We presented this count directly as an indicator of guilt and performed no further modelling.

4 Methodology

In this section, we give an overview of the algorithms employed in the development of our classifiers.

4.1 Depressed Mood Classifier

The primary issue affecting the development of this classifier was uneven class balance (majority class = 89%). We overcame this issue by applying the synthetic minority oversampling technique (SMOTE) [32]. This approach consists of oversampling the minority class with synthetic examples to create an artificial dataset with an even class balance. We then applied a Random Forest classifier tuned with a grid search (class_weight='balanced_subsample', bootstrap=False, criterion='entropy', n_estimators=9). All features underwent standard scaling.
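The core idea of SMOTE, generating each synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in plain Python as below. This is a simplified illustration and not the implementation used in this work; the function name `smote`, the default of five neighbours and the fixed seed are assumptions.

```python
import random
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority: List[Vector], n_new: int, k: int = 5, seed: int = 0) -> List[Vector]:
    """Create n_new synthetic minority points by interpolating towards near neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with three two-dimensional points.
minority = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
new_points = smote(minority, n_new=4, k=2)
```

To reach an even class balance, `n_new` would be set to the difference between the majority and minority class sizes, and oversampling would be applied to the training folds only so that synthetic points never leak into the test data.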

4.2 Suicidal Ideation Classifier

This model incorporated a logistic regression classifier. We chose this approach because of its probabilistic output. As [33] suggests, rather than considering a binary outcome as two independent events, we can consider the outcome as an unobserved continuous variable, in this case the propensity of an individual to attempt suicide. We applied an L2 penalty with balanced class weights, and all features underwent standard scaling.

Threshold values: Threshold values allow binary classes to be distinguished when working on a continuous scale. The allocation of a threshold value is often considered important in the medical literature, where there might be a reason to knowingly over- or under-predict certain classes. The most naive approach is often to maximize the area under the curve when sensitivity is plotted against 1 − specificity. Although this can give the most balanced class allocation, we would consider reducing false negative predictions to be of reasonable importance. To do this, we chose to set a specificity value of 0.95, which gives a threshold value of 0.55, a sensitivity value of 0.61 and a Youden's index of 0.50.
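Choosing a threshold for a fixed target specificity can be sketched as follows. The helper names are hypothetical, a score at or above the threshold is assumed to count as a positive prediction, and Youden's index is computed with its standard definition, sensitivity + specificity − 1. The toy example uses a target of 0.8 rather than 0.95 only because it contains five negative cases.

```python
from typing import List, Tuple


def sensitivity_specificity(scores: List[float], labels: List[int],
                            threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def threshold_for_specificity(scores: List[float], labels: List[int],
                              target: float = 0.95) -> Tuple[float, float, float]:
    """Lowest observed score that, used as threshold, reaches the target specificity."""
    for candidate in sorted(set(scores)):
        sensitivity, specificity = sensitivity_specificity(scores, labels, candidate)
        if specificity >= target:
            return candidate, sensitivity, specificity
    raise ValueError("no candidate threshold reaches the target specificity")


def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic."""
    return sensitivity + specificity - 1.0


# Toy predicted probabilities and true labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
threshold, sens, spec = threshold_for_specificity(scores, labels, target=0.8)
j = youden_index(sens, spec)
```

Taking the lowest threshold that satisfies the specificity constraint keeps sensitivity as high as the constraint allows.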

4.3 Sleep Change & Weight Change Classifiers

Both classifiers employed a logistic regression algorithm with L2 penalization. Independent grid searches were applied but resulted in the same set of optimized parameters: an intercept scaling of 1, balanced class weights, and a value of 1 for the scikit-learn C parameter, which is the inverse of the regularization strength.
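For reference, the L2-penalised objective that scikit-learn documents for binary logistic regression, with labels y_i in {−1, 1}, can be written as below. The symbols w (weight vector), b (intercept), x_i (feature vector of post i) and n (number of posts) are notation introduced here, and the class weights, which rescale the individual loss terms, are omitted for brevity.

```latex
\min_{w,\,b} \;\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\Bigl(1 + \exp\bigl(-y_i \,(x_i^{\top} w + b)\bigr)\Bigr)
```

Since C multiplies the data term, smaller values of C correspond to stronger regularization, and C = 1 leaves the penalty and the summed loss unscaled relative to each other.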

Threshold values & scoring: Although false negative predictions for sleep or weight change are not as serious as missing attempts at suicide, we still optimized the threshold value of the logistic regression by setting the specificity value to 0.95. Cutoff values for the sleep change and weight change classifiers are 0.72 and 0.74, respectively.

4.4 Conversational System Interface Design

A demo interface⁵ (Figure 2) was developed which combined the conversational model and classifiers. The interface consisted of a text field where users can write a comment, which is sent to the conversational model. The reply to this input is shown on the screen to the user.

⁵ http://server1.nlp.insight-centre.org/marvin/demo.html

In this section, we begin by reporting on the metric evaluation scores acquired during the training of the models. The second section reports on the overall evaluation process we employed in the project.

5.1 Classifier Evaluation

Metric scores during training for each of the four classifiers we developed are presented in Table 2. In all cases, the scores presented are mean scores following 10-fold cross-validation on a withheld test set composed of a random 20% sample of the original dataset. The depressed mood classifier, however, employed the SMOTE balanced subsample approach.

We recruited seven participants as a convenience sample. All participation was anonymous and voluntary, and no demographic details were collected. Initially, participants were instructed to have a short interaction with the conversational system, beginning by answering the question "How are you?". Participants could end the conversation at any stage by exiting it, but were asked to try to hold the conversation for at least 20 messages.

The following step of the evaluation was to establish ground truths. Participants were asked to complete the Beck's Depression Inventory-II [34], a 21-item multiple-response questionnaire that ranks participants on a scale from no indication of MDD to a strong indication of MDD. The advantage of the Beck's Inventory is that it is a short and highly standardised instrument that has seen applications across a wide range of research studies [35].

5.3 Overall Project Evaluation Results

We propose two metric evaluation approaches for our project: (i) an evaluation of each classifier individually and (ii) an overall evaluation. For the individual classifier evaluation, we established ground truths by mapping each question on the Beck's Inventory to the specific symptom it investigates, as per Table 3. A score above two on a question was considered a positive score (presence of a symptom), as was a classifier returning a mean prediction score above its respective threshold value. Rows two, three and four of Table 3 provide the metric scores for each classifier⁶.
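The per-symptom comparison described above can be sketched as follows. The function names and the toy participant data are hypothetical; the item cutoff of two and the sleep change threshold of 0.72 are the values stated in the text.

```python
from statistics import mean
from typing import List, Tuple


def symptom_ground_truth(item_score: int, cutoff: int = 2) -> bool:
    """A questionnaire item score above the cutoff counts as symptom present."""
    return item_score > cutoff


def symptom_prediction(message_scores: List[float], threshold: float) -> bool:
    """Positive if the mean classifier score over a conversation exceeds the threshold."""
    return mean(message_scores) > threshold


def precision_recall(truth: List[bool], predicted: List[bool]) -> Tuple[float, float]:
    """Precision and recall of the positive class (0.0 when undefined)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Three hypothetical participants, evaluated for the sleep change classifier.
item_scores = [3, 1, 3]
conversation_scores = [[0.8, 0.9], [0.9, 0.8], [0.5, 0.6]]
truth = [symptom_ground_truth(s) for s in item_scores]
predicted = [symptom_prediction(c, threshold=0.72) for c in conversation_scores]
precision, recall = precision_recall(truth, predicted)
```

With samples this small, a single participant moves precision or recall by a large margin, so such scores are best read as indicative only.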

The second evaluation process, which investigated the overall accuracy, considered a score above 19 on the Beck's Inventory as ground truth for the presence of MDD. If two out of the four classifiers provided a positive prediction, this was considered an overall positive prediction of depression. Results are presented in the first row of Table 3.
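The overall decision rule can be sketched as a simple vote over the symptom classifiers. The sketch reads "two out of the four" as "at least two", which is an assumption, and the function and key names are hypothetical.

```python
from typing import Dict


def overall_prediction(symptom_flags: Dict[str, bool], min_positive: int = 2) -> bool:
    """Predict depression when at least `min_positive` symptom classifiers are positive."""
    return sum(symptom_flags.values()) >= min_positive


def overall_ground_truth(inventory_total: int, cutoff: int = 19) -> bool:
    """A total inventory score above the cutoff is treated as MDD present."""
    return inventory_total > cutoff


# One hypothetical participant.
flags = {
    "depressed_mood": True,
    "suicidal_ideation": False,
    "sleep_change": True,
    "weight_change": False,
}
predicted_positive = overall_prediction(flags)
actual_positive = overall_ground_truth(23)
```

Keeping the vote separate from the individual thresholds makes it straightforward to test stricter or looser rules, such as requiring three positive classifiers, without retraining anything.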

6 Conclusions

Over the course of the above text, we began by demonstrating why MDD is worthy of study and how passive diagnosis is an important future issue. Our proposed approach builds on the machine learning community's existing contributions to MDD research; our methodology is the first to view MDD prediction on a symptomatic level. In addition to a theoretical proposal, we hope our work may lead to a future practical application.

Within the scope of the work, we note two key limitations that must be addressed in our future work. First, the ever-present problem of suitable data presents itself. The authors of [20] explore the advantages and disadvantages of different depression-related data collection approaches. None of our labelled data actually employs medical diagnoses. Therefore, we cannot be completely confident that our labelled data is representative of the actual disorder. Our second limitation concerns the evaluation stage of the project. We accept that a sample size of seven individuals is quite limited in its evaluation scope. Nevertheless, we feel this is sufficient as a proof of concept, and despite these two limitations we have demonstrated a future direction for research.

With regard to interpreting the results from our evaluation stage, for the sleep change classifier we see low precision and high recall scores, indicating the possibility that the threshold value has been set too high. No one in our sample identified as having suicidal ideation or negative weight change, and, correspondingly, our classifiers did not predict any false positives in these cases. In conclusion, within the scope of our limited sample size, we are positive regarding the results of four of our classifiers, and suggest a re-evaluation of the threshold value assigned to one. Ultimately, we hope to see that individuals suffering an MDD episode will no longer suffer alone but rather will have more rapid and easy access to diagnostic services and thus receive support in a timely manner.

⁶ Neither suicidal ideation (Question 9) nor weight change (Question 18) is included, as all scores are equal to one.

Acknowledgement. This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 (Insight).

1. WHO: Depression fact sheet (2018), http://www.who.int/news-room/factsheets/detail/depression

2. Belmaker, R., Agam, G.: Major depressive disorder. New England Journal of Medicine 358(1), 55–68 (2008)

3. Pham, T., Tran, T., Phung, D., Venkatesh, S.: Deepcare: A deep dynamic memory model for predictive medicine. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 30–41. Springer (2016)

4. Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology 54(1), 547–577 (2003)

5. Huang, Y.H., Wei, L.H., Chen, Y.S.: Detection of the Prodromal Phase of Bipolar Disorder from Psychological and Phonological Aspects in Social Media. arXiv preprint (2017)

6. Trotzek, M., Koitka, S., Friedrich, C.M.: Utilizing Neural Networks and Linguistic Metadata for Early Detection of Depression Indications in Text Sequences. arXiv preprint (2018)

7. Ramiandrisoa, F., Mothe, J., Benamara, F., Moriceau, V.: Irit at e-risk 2018. In: E-Risk workshop. pp. 367–377. Almquist & Wiksell

8. Rude, S., Gortner, E.M., Pennebaker, J.: Language use of depressed and depression-vulnerable college students. Cognition & Emotion 18(8), 1121–1133 (2004)

9. American Psychiatric Association: Diagnostic and statistical manual of mental disorders. Arlington: American Psychiatric Publishing (2013)

10. Karmen, C., Hsiung, R.C., Wetter, T.: Screening internet forum participants for depression symptoms by assembling and enhancing multiple NLP methods. Computer Methods and Programs in Biomedicine 120(1), 27–36 (2015)

11. Vinyals, O., Le, Q.V.: A neural conversational model. CoRR abs/1506.05869 (2015)

12. Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., Jurafsky, D.: Adversarial learning for neural dialogue generation. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. pp. 2157–2169 (2017)

13. Weissenborn, D., Wiese, G., Seiffe, L.: Making neural QA as simple as possible but not simpler. In: CoNLL (2017)

14. Rücklé, A., Gurevych, I.: Representation learning for answer selection with LSTM-based importance weighting. In: IWCS 2017, 12th International Conference on Computational Semantics, Short papers (2017)

15. Xu, A., Liu, Z., Guo, Y., Sinha, V., Akkiraju, R.: A new chatbot for customer service on social media. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. pp. 3506–3510. ACM (2017)

16. Evans, W.P., Davidson, L., Sicafuse, L.: Someone to listen: Increasing youth help-seeking behavior through a text-based crisis line for youth. Journal of Community Psychology 41(4), 471–487 (2013)

17. Winata, G.I., Kampman, O., Yang, Y., Dey, A., Fung, P.: Nora the empathetic psychologist. In: Proc. Interspeech. pp. 3437–3438 (2017)

18. Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: Open-source toolkit for neural machine translation. CoRR abs/1701.02810 (2017)

19. Lison, P., Tiedemann, J.: OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In: Calzolari, N. (Conference Chair), Choukri, K., Declerck, T., Grobelnik, M., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), Paris, France (May 2016)

20. Losada, D.E., Crestani, F.: A Test Collection for Research on Depression and Language Use. CLEF 2016, Evora (Portugal). Experimental IR Meets Multilinguality, Multimodality, and Interaction pp. 28–29 (2016)

21. Pennebaker, J.W., Booth, R.J., Francis, M.E.: Linguistic inquiry and word count: LIWC [Computer software]. Erlbaum Publishers, Mahwah, NJ (2007)

22. De Choudhury, M., Counts, S., Horvitz, E.: Social media as a measurement tool of depression in populations. Proceedings of the 5th Annual ACM Web Science Conference (WebSci '13) pp. 47–56 (2013)

23. Warriner, A.B., Kuperman, V., Brysbaert, M.: Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45(4), 1191–1207 (2013)

24. Mohammad, S.M.: Word Affect Intensities. arXiv preprint (2017)

25. Cambria, E., Poria, S., Hazarika, D., Kwok, K.: SenticNet 5: Discovering Conceptual Primitives for Sentiment Analysis by Means of Context Embeddings. AAAI pp. 1795–1802 (2018)

26. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning. pp. 1188–1196. Beijing, China (2014)

27. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint (2016)

28. Pestian, J., Nasrallah, H., Matykiewicz, P., Bennett, A., Leenaars, A.: Suicide Note Classification Using Natural Language Processing: A Content Analysis. Biomedical Informatics Insights 2010(3), 19–28 (2010)

29. Pestian, J., Matykiewicz, P., South, B., Uzuner, O., Hurdle, J.: Sentiment Analysis of Suicide Notes: A Shared Task. Biomedical Informatics Insights 5, 3 (2012)

30. De Choudhury, M., Kiciman, E., Dredze, M., Coppersmith, G., Kumar, M.: Discovering shifts to suicidal ideation from mental health content in social media. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. pp. 2098–2110. ACM, San Jose, CA, USA (2016)

31. Michael, J.: Pushshift.io, https://pushshift.io/

32. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)

33. King, G., Zeng, L.: Logistic Regression in Rare Events Data. Political Analysis 9(02), 137–163 (2001)

34. Beck, A.T., Steer, R.A.: Internal consistencies of the original and revised Beck Depression Inventory. Journal of Clinical Psychology 40(6), 1365–1367 (1984)

35. Dozois, D.J., Dobson, K.S., Ahnberg, J.L.: A psychometric evaluation of the Beck Depression Inventory-II. Psychological Assessment 10(2), 83 (1998)