Quick and (maybe not so) Easy Detection of Anorexia in Social Media Posts

Elham Mohammadi, Hessam Amini, and Leila Kosseim

Computational Linguistics at Concordia (CLaC) Laboratory
Department of Computer Science and Software Engineering
Concordia University, Montréal, QC H3G 2W1, Canada
{elham.mohammadi,hessam.amini,leila.kosseim}@concordia.ca

Abstract. This paper presents an ensemble approach for the early detection of anorexia in social media posts. The approach utilizes several attention-based neural sub-models to extract features and predict class probabilities, which are later used as input features to a Support Vector Machine (SVM) that makes the final classification. The model was evaluated on the first task of eRisk 2019, whose aim was the early detection of anorexia in Reddit posts. Our submission, named CLaC, achieved F1 and latency-weighted F1 scores of 0.7073 and 0.6908 respectively, ranking first on these metrics, and achieved competitive results on the other evaluation metrics.

Keywords: Anorexia · Early detection · Social media · Ensemble classifier · Neural networks · Support vector machine

1 Introduction

In the last decade, the use of social media to express personal thoughts, emotions, and ideas has become more and more prevalent. The analysis of online data can be useful for many purposes, such as business and marketing, political planning, stock market prediction [10], and enhancing awareness of emergencies [34]. Another noteworthy line of research has focused on the detection of toxicity, hate speech, aggression, and cyberbullying on online platforms, an effort that could facilitate timely interventions in violent situations [8,31]. In healthcare applications, online posts have been used for detecting disease outbreaks [23], finding smoking patterns [30], and identifying adverse drug reactions [33]. Another useful application is the automatic detection of mental health issues, a relatively recent field which has attracted the attention of many researchers in Natural Language Processing (NLP). Corpora from Twitter, Facebook, blogs and online forums, and Reddit are used as resources to detect various mental health problems, such as anxiety, depression, suicide ideation, and eating disorders [3].

Although automatically monitoring online forums to detect cases of mental health issues is beneficial, the elapsed time between the first signs of a mental issue and the actual detection of a potential victim can play a crucial role. Earlier detection of a harmful behavior can help moderators better handle the situation. However, to the best of our knowledge, not much research has specifically addressed the task of early detection of mental health issues.

The eRisk shared task [17] was created with the goal of addressing issues related to early risk detection of mental health problems. According to [17], early detection can be useful in many applications, from the identification of potential sexual offenders to the detection of victims of suicidal tendencies, making intervention possible before it is too late.
[17] argues that while current risk assessment approaches often aim at detecting harmful behavior after the fact, it is very important to consider the timing of risk detection and to minimize the time between the observation of the first evidence of destructive behavior and the triggering of an alarm. To that end, the organizers of the eRisk shared task have encouraged the development of approaches which model the process rather than the outcome, as well as the development of reliable evaluation metrics and test collections tailored to early risk detection.

The aim of this work is to propose a model for the early detection of anorexia and to evaluate it using the eRisk 2019 data and evaluation metrics [19].

The rest of the paper is organized as follows: Section 2 provides an overview of the related literature. Section 3 gives a brief summary of the task and the dataset used. Section 4 presents the general model architecture that has been developed. Section 5 describes in more detail the model variants that were employed for the experiments. Section 6 summarizes and discusses the results. Section 7 concludes the paper and presents some interesting future directions.

2 Related Work

Many researchers have used corpora from Twitter, Facebook, Reddit, blogs, and online forums as resources to experiment with classification tasks pertaining to mental health issues [3].

Pestian et al. [27] experimented with different machine learning methods for suicide note classification. The features used in the study included words, part-of-speech tags, readability scores, and emotions. The best accuracy of 74% was achieved by a logistic regression model.

DeVault et al. [9] studied the symptoms of psychological distress in dialogues with a virtual agent. The use of a Naïve Bayes classifier for the detection of post-traumatic stress disorder (PTSD) and distress yielded a 20% improvement over the baseline accuracy of 53.5%, and showed that the automatic assessment of psychological distress is indeed possible.

More recently, Jackson et al. [13] used clinical texts obtained through the Clinical Record Interactive Search (https://crisnetwork.co) to extract symptoms of severe mental illness. The authors made use of TextHunter [1] (a natural language processing information extraction tool) and an SVM classifier, and were able to classify 38 symptoms with an F1-score of 85%.

Shen and Rudzicz [28] used different feature sets, including word2vec embeddings, latent Dirichlet allocation topic modelling, lexico-syntactic features, and n-grams (unigrams and bigrams), to detect anxiety in Reddit posts. Initially, the authors compared the results achieved by an SVM and a 2-layer neural network. Though both classifiers performed well, the SVM yielded marginally better results. However, they achieved their best result of 98% accuracy using the neural network with n-gram probabilities and word embeddings combined with Linguistic Inquiry and Word Count (LIWC) features.

Coppersmith et al. [6] explored the automatic detection of post-traumatic stress disorder (PTSD), depression, bipolar disorder, and seasonal affective disorder (SAD) in Twitter data, using LIWC features and character and word n-grams, and found the latter to yield superior performance.

Benton et al. [2] used multi-task learning to predict suicide risk and a variety of mental health conditions from Twitter data, including anxiety, depression, PTSD, and schizophrenia.
It was found that a multi-task framework can be effectively used in cases with limited data.

Apart from individual efforts, shared tasks (e.g. [7,21,20,35]) have also been organized to encourage the development of common benchmarks (datasets and metrics) and the comparison of approaches for the detection of distress in online textual data.

All of the previous work described above used a classic classification approach that does not measure how early the detection is performed. The eRisk shared tasks [17,18,19], on the other hand, focus on the early detection of mental health issues. In the first edition of eRisk [17], the dataset used was a collection of social media posts and comments from depressed and non-depressed authors, recorded chronologically. As evaluation metric, the Early Risk Detection Error (ERDE) was used, an error measure which penalizes late decisions and rewards early ones [16]. As it was the first edition of this shared task, many teams focused on making accurate rather than early decisions, with the highest F1-score being 64% and the lowest ERDE50 score being 9.68% [17] (a detailed description of ERDEo, where o is either 5 or 50, can be found in [18]).

The second eRisk shared task [18] included two tasks: early risk detection of depression and early risk detection of anorexia. Like the year before, ERDE was used as the main evaluation metric, alongside F1, precision, and recall [18]. The best performing systems in both tasks were designed by Trotzek et al. [32]. Their team experimented with different variations of bag-of-words features and a Convolutional Neural Network (CNN) [15], as well as ensemble models. In the depression task, their system achieved an F1-score of 64% and an ERDE50 of 6.44%. In the anorexia task, they achieved an F1-score of 85% and an ERDE50 of 5.96%.

In this work, we present an ensemble approach that can be used for the detection of different types of distress in textual data. We investigate the effectiveness of the model by presenting and analyzing our results in the first task of eRisk 2019 [19].

3 Task and Dataset

Following the success of the eRisk 2018 task 2 [18], the eRisk 2019 task 1 [19] focuses on the early detection of anorexia in online posts. The data used for the task is a collection of Reddit users labelled as anorexic or non-anorexic [16], along with a collection of their Reddit posts, recorded chronologically. For the training phase, the data from the previous year (eRisk 2018 task 2), including both training and test sets, was made available. For the testing phase, posts were released on an item-by-item basis in chronological order for a new collection of Reddit users. The goal was to detect users suffering from anorexia, having observed as few posts from them as possible. As a result, in addition to precision, recall, and F1-score, two other metrics were used: the Early Risk Detection Error (ERDE) measure, which penalizes late decisions, and latency-weighted F1, a modified version of the F1 score that takes into account the delay of the decision (the details of the evaluation metrics for eRisk 2019 task 1 are explained in [19]).

Table 1 shows some statistics of the datasets. As shown in the table, the datasets are highly imbalanced, with about 90% of the users not suffering from anorexia.

Table 1. Distribution of user labels in the datasets. The 2018 datasets refer to the eRisk 2018 task 2 data.

Dataset      Source      Positive   Negative    All
Training     Train 2018  20 (13%)   132 (87%)   152
Validation   Test 2018   41 (13%)   279 (87%)   320
Testing      –           73 (9%)    742 (91%)   815
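For reference, the following minimal Python sketch implements the per-user ERDEo cost following its definition in [16]; the function and variable names are ours, and the collection-level score is simply the mean of this cost over all test users.

```python
import math

def erde(decision, truth, k, o, c_fp, c_tp=1.0, c_fn=1.0):
    """Per-user ERDE_o cost, following Losada and Crestani [16].

    decision/truth: 1 for anorexic, 0 for non-anorexic.
    k: number of writings observed before the decision was emitted.
    o: the deadline parameter (5 or 50 in eRisk).
    c_fp: cost of a false positive (in [16], the proportion of positive
          users in the collection); c_tp and c_fn are usually 1.
    """
    if decision == 1 and truth == 0:   # false positive
        return c_fp
    if decision == 0 and truth == 1:   # false negative
        return c_fn
    if decision == 1 and truth == 1:   # true positive: cost grows with delay
        # lc_o(k) = 1 - 1/(1 + e^(k-o)), written as 1/(1 + e^(o-k))
        # to avoid overflow for large k; near 0 early, near 1 late.
        return c_tp / (1.0 + math.exp(o - k))
    return 0.0                         # true negative

# The reported ERDE_o is the mean of this cost over all test users.
```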
4 System Overview

Fig. 1 shows the architecture of the model that we used for the eRisk 2019 shared task. The full model includes 8 different neural sub-models, followed by a fusion component, which concatenates the neural features and predicted class probabilities from the different sub-models and forwards them to a final SVM classifier. This section provides a more detailed explanation of the different components of the model.

Fig. 1. Architecture of the model. The number of arrows between components corresponds to the number of sub-models in that flow. The rounded-corner boxes represent components that work at the post level, while the sharp-corner boxes are user-level components. The solid lines represent neural connections, while the dotted lines show the flow of data without a neural connection. The bold arrow between the Fusion and the SVM corresponds to the flow of data that exists only in the final model.

4.1 Sub-models

As shown in Fig. 1, each sub-model includes an input layer that receives as input the posts by a user and vectorizes their tokens using an embedding layer. The output of the input layer is then fed to a hidden layer, which is followed by a post-level attention/pooling layer that creates a representation of the post from its constituent tokens. The user-level attention layer is responsible for calculating the vector representation of the user from her/his online posts. Finally, the output (classification) layer predicts the probability distribution of the positive and negative classes (i.e. anorexic versus non-anorexic). Our main focus during the development of the sub-models was to include a diversity of information sources, so that the final ensemble model can incorporate different points of view when performing the final classification.

Input Layer. The inputs to the model are the online posts of each user. Each post is first tokenized, and the tokens are sent to the word embedder to be converted into dense vectors. As shown in Fig. 1, these token vectors are then fed to the hidden layer. Two different pretrained word embeddings were experimented with. The first was the 300d version of GloVe [26], pretrained on 840B tokens of web data from Common Crawl. The second was the original 1024d version of ELMo, pretrained on the One Billion Word Language Model Benchmark [4]. These two word embeddings were used in order to provide our ensemble model with sub-models that utilize both contextual (ELMo) and non-contextual (GloVe) word embedders in their input layer.

Hidden Layer. The hidden layer is responsible for processing the token vectors generated by the input layer. As shown in Fig. 1, we experimented with four hidden architectures in our sub-models: a CNN [15], which processes token n-grams separately, and a Bidirectional Vanilla RNN (BiRNN), a Bidirectional Long Short-Term Memory (BiLSTM) [12], and a Bidirectional Gated Recurrent Unit (BiGRU) [5], all of which process token vectors sequentially, from first to last and vice versa, taking into account the preceding and following tokens, respectively.

Post-level Attention/Pooling Layer. Following [32], for the sub-models that use a CNN in the hidden layer, max pooling is applied to the outputs of the hidden layer after they are passed through a Concatenated Rectified Linear Unit (CReLU, i.e. ReLU applied to the concatenation of each vector and its negation).
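To make this pooling branch concrete, here is a minimal PyTorch sketch of the convolution, CReLU, and max-pooling steps described above. The module is our illustrative reconstruction, not released code; the default sizes mirror the CNN-GloVe configuration (300d embeddings, 100 bigram filters; see Table 2 in Section 5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CnnPostEncoder(nn.Module):
    """Sketch of the CNN branch: convolution over token n-grams,
    CReLU activation, then max pooling over time."""

    def __init__(self, emb_dim=300, n_filters=100, ngram=2):
        super().__init__()
        # 1-D convolution over the token sequence; each filter spans `ngram` tokens.
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=ngram)

    def forward(self, tokens):  # tokens: (batch, seq_len, emb_dim)
        x = self.conv(tokens.transpose(1, 2))          # (batch, n_filters, seq_len - ngram + 1)
        x = torch.cat([F.relu(x), F.relu(-x)], dim=1)  # CReLU: ReLU of x and of -x
        return x.max(dim=2).values                     # max pool over time: (batch, 2 * n_filters)
```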
In the sub-models that use a BiRNN, BiLSTM, or BiGRU in their hidden layer, an attention mechanism computes the representation of a post P by taking a weighted average over the outputs of the hidden layer for each token in the post, where the weight assigned to each token is calculated automatically. The function used by the attention mechanism is shown in Equation 1:

P = \sum_{t=1}^{n} \omega_t y_t    (1)

where y_t represents the output of the recurrent hidden layer at time-step t, and \omega_t is the weight assigned to the output at that time-step. In our model, the attention mechanism uses an N-to-1 feed-forward layer (with weights w, where N is equal to the size of the output vectors of the recurrent hidden layer) to map the output of the hidden layer at each time-step (e.g. y_t) to a scalar (e.g. \nu_t):

\nu_t = y_t \cdot w    (2)

These scalars are then concatenated, and a softmax is applied to the resulting vector. The result of the softmax contains the weights used by the attention mechanism:

\omega = \mathrm{softmax}([\nu_1, \nu_2, \nu_3, \ldots, \nu_n])    (3)

User-level Attention Mechanism. Since the posts of a user do not contribute equally to the detection of her/his mental state [32], a user-level attention mechanism is used to let the system learn to automatically estimate the contribution of each post to the final classification of the user. The user-level attention works like the post-level attention mechanism, but computes a vector representation of a user from the representations of her/his posts (resulting from the post-level attention/pooling).
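The following is a minimal PyTorch sketch of Equations 1–3. It is a simplified illustration on our part (e.g. it omits padding masks), with the feed-forward layer of Equation 2 realized as a bias-free linear layer; the same module can serve at the post level (over token outputs) and at the user level (over post representations).

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Attention of Equations 1-3: an N-to-1 feed-forward layer scores
    each input vector, a softmax turns the scores into weights, and the
    output is the weighted average of the inputs."""

    def __init__(self, n):
        super().__init__()
        self.w = nn.Linear(n, 1, bias=False)   # Equation 2: nu_t = y_t . w

    def forward(self, y):                      # y: (batch, steps, n)
        nu = self.w(y).squeeze(-1)             # (batch, steps)
        omega = torch.softmax(nu, dim=1)       # Equation 3: weights over steps
        return (omega.unsqueeze(-1) * y).sum(dim=1)  # Equation 1: (batch, n)
```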
Output (Classification) Layer. The final layer in the sub-models is a fully-connected feed-forward layer that maps the output of the user-level attention to a vector of size 2 (corresponding to the negative and positive classes). A softmax activation function at the end of this layer outputs the predicted probability distribution over the negative and positive classes.

4.2 Ensemble Model

As shown in Fig. 1, the ensemble model is composed of several neural sub-models, a fusion component, and a final SVM classifier. The fusion component concatenates the outputs of the user-level attention units (subsequently referred to as neural features) and the predicted probability distributions over the two classes, resulting from the softmax activation functions of all its constituent sub-models. The output of the fusion component is taken as the final representation of a user. This representation is finally fed to an SVM classifier to perform the ensemble classification.

5 Experimental Setup

This section describes our experiments with the above model for our participation in the eRisk 2019 shared task [19].

5.1 Sub-models Implementation

PyTorch [24] was used to implement and train the sub-models. The Adam optimizer [14] was used, with the learning rate set to 5 × 10−4. Cross-entropy was used as the loss function; in order to handle the imbalanced distribution of the positive and negative classes in the training set (see Table 1), each class was assigned a weight proportional to the inverse of its number of samples. Due to lack of computational resources, mini-batches with a maximum size of 128 were used at the post level for each user, and only the first 100 tokens of each post were used (this limit truncated only a small number of posts, as the average post length in the eRisk 2018 task 2 data was ∼37.47 tokens). In order to minimize the amount of padding in the batches, posts with a similar number of tokens were assigned to the same batch.

In order to fine-tune the other hyperparameters of the sub-models (including the number and size of convolutional filters, the number of recurrent units, and the number of training epochs), each sub-model was individually trained on the training set and optimized on the validation set (see Table 1), based on F1 score. The specifics of the 8 different sub-models are shown in Table 2. Since each sub-model is composed of a unique pair of hidden layer and word embedding type, they will later be referred to as HiddenLayer-Embedding (see the second column of Table 2).

Table 2. Hyperparameter values used in the 8 sub-models.

#  Name          Hyperparameters
1  CNN-GloVe     100 bigram convolution filters, trained for 10 epochs
2  CNN-ELMo      200 unigram and 50 bigram convolution filters, trained for 6 epochs
3  BiRNN-GloVe   one layer of 64 vanilla RNN units, trained for 14 epochs
4  BiRNN-ELMo    one layer of 50 vanilla RNN units, trained for 13 epochs
5  BiLSTM-GloVe  one layer of 32 bidirectional LSTM units, trained for 31 epochs
6  BiLSTM-ELMo   one layer of 64 bidirectional LSTM units, trained for 14 epochs
7  BiGRU-GloVe   one layer of 64 bidirectional GRUs, trained for 14 epochs
8  BiGRU-ELMo    one layer of 64 bidirectional GRUs, trained for 8 epochs

5.2 Ensemble Classifiers

Scikit-learn [25] was used to develop the SVM classifier used in the ensemble model. Three different versions of the ensemble classifier were developed:

1. Ens-Feat utilizes only the neural features. The SVM classifier in this version uses a sigmoid kernel. The γ and C parameters of the SVM were set to auto (i.e. 1/n_features) and 4, respectively.
2. Ens-Prob uses only the predicted class probabilities from the softmax activation functions at the end of the neural sub-models. It utilizes a polynomial kernel of degree 1. The γ and C parameters of the SVM were set to scale (i.e. 1/(n_features × Var(X))) and 1, respectively.
3. Ens-All utilizes both the neural features and the predicted class probabilities in its SVM classifier, which uses a sigmoid kernel, with γ and C set to auto and 2, respectively.
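To illustrate the fusion and ensemble classification steps, the sketch below assembles the Ens-All input and fits the SVM with the kernel and parameters listed above. The feature dimensions and the randomly generated placeholder arrays are assumptions for illustration only, not our actual extracted features.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical inputs: for each user, the 8 sub-models yield a neural
# feature vector (the user-level attention output) and 2 class probabilities.
# Shapes are placeholders; random data stands in for the real features.
neural_feats = np.random.rand(152, 8 * 64)   # (n_users, concatenated neural features)
class_probs = np.random.rand(152, 8 * 2)     # (n_users, 8 sub-models x 2 classes)
labels = np.random.randint(0, 2, size=152)   # 1 = anorexic, 0 = non-anorexic

# Fusion: the Ens-All user representation concatenates both feature groups.
user_repr = np.concatenate([neural_feats, class_probs], axis=1)

# Ens-All: sigmoid kernel, gamma='auto' (1/n_features), C=2.
clf = SVC(kernel="sigmoid", gamma="auto", C=2)
clf.fit(user_repr, labels)
decisions = clf.predict(user_repr)           # final per-user classification
```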
5.3 Submitted Runs

Based on the results on the validation set, 5 runs were submitted to the shared task server. For the 1st and 2nd runs, CNN-GloVe and CNN-ELMo were used, respectively, as stand-alone models (these two sub-models achieved the most promising results among all the sub-models during the training phase). Ens-Feat, Ens-Prob, and Ens-All comprised the 3rd, 4th, and 5th runs.

6 Results and Discussion

Table 3 shows the official results of our submissions, as well as selected runs from other teams (as reported in [19]) that achieved the best result on one of the official evaluation metrics, or achieved competitive results. For the runs of our team (CLaC), Table 3 also indicates the specific name of the model used.

Table 3. Official results on the first task of the eRisk 2019 shared task. #writings: maximum number of writings (Reddit posts) that were processed for a user, P: Precision, R: Recall, l-w F1: latency-weighted F1 score.

team         model      run  #writings  P       R       F1      ERDE5   ERDE50  l-w F1
CLaC         CNN-GloVe  0    109        0.4463  0.7400  0.5567  0.0672  0.0393  0.5437
CLaC         CNN-ELMo   1    109        0.6061  0.8219  0.6977  0.0573  0.0312  0.6895
CLaC         Ens-Feat   2    109        0.6020  0.8082  0.6900  0.0602  0.0313  0.6766
CLaC         Ens-Prob   3    109        0.6292  0.7671  0.6914  0.0627  0.0355  0.6752
CLaC         Ens-All    4    109        0.6374  0.7945  0.7073  0.0625  0.0343  0.6908
lirmm        –          0    2024       0.74    0.63    0.68    0.09    0.05    0.63
lirmm        –          1    2024       0.77    0.60    0.68    0.09    0.06    0.62
Fazl         –          2    2001       0.09    1.00    0.16    0.17    0.11    0.14
UNSL         –          0    2000       0.42    0.78    0.55    0.06    0.04    0.55
UNSL         –          4    2000       0.31    0.92    0.47    0.06    0.03    0.46
INAOE-CIMAT  –          3    2000       0.67    0.68    0.68    0.09    0.05    0.63

As shown in Table 3, the model Ens-All achieved the highest F1 (0.7073) and latency-weighted F1 (0.6908) scores of all participants' runs. This is in line with our intuition that an ensemble model making use of both the neural features and the predicted class probabilities from the 8 sub-models is better able to detect the correct class after observing a small number of writings. The results also show that the CNN-ELMo model achieves F1 and latency-weighted F1 scores that are competitive with Ens-All, and outperforms Ens-Feat and Ens-Prob on these two metrics. The CNN-ELMo model also achieved our best recall, ERDE5, and ERDE50, showing the potential of this model to be used independently for the task of early risk detection of anorexia.

Table 3 also shows that all our models except CNN-GloVe (run 0) outperformed the runs of the other teams in terms of F1 score and latency-weighted F1 (the teams lirmm and INAOE-CIMAT achieved the next best F1 and latency-weighted F1 scores). Run 1 of team lirmm achieved the highest precision. The best recall was achieved by run 2 of team Fazl. Runs 0 and 4 of team UNSL achieved the best (lowest) ERDE5 and ERDE50, respectively, for which we achieved competitive results with CNN-ELMo.

The number of writings processed by the models submitted by each team shows that our models used a significantly lower number of writings than the other teams (the average number of writings processed by the participating teams was 1273). This shows that our systems have great potential for making early and correct decisions. This is supported by the even larger gap between the latency-weighted F1 scores of our team and the runs submitted by other teams, compared to the gap in F1 score.

Although our systems achieved the best or competitive results on the different evaluation metrics, we suffered from a lack of computational resources when running the models that use the ELMo embedder for around 2000 iterations. The models had to be run approximately 2000 times due to the item-by-item release of the test data chosen for the eRisk 2019 shared task (in the previous eRisk shared tasks, the test data was released in 10 chunks, making the number of iterations equal to 10). Despite this technical drawback, the advantages of using ELMo to extract context-sensitive embeddings greatly outweigh its disadvantages. This can also be observed by comparing the results achieved by CNN-GloVe and CNN-ELMo.

7 Conclusion and Future Work

This paper presents an ensemble approach which can be used to detect distress in the social media posts of a user. The ensemble model utilizes neural features alongside the predicted class probabilities output by 8 different neural sub-models.
Using this model, under the team name CLaC, we participated in the first task of eRisk 2019 [19], which was aimed at the early detection of anorexia in online posts, and ranked first in terms of F1 and latency-weighted F1 scores. Using a similar architecture, we also participated in the CLPsych 2019 shared task [22], whose aim was to assess suicide risk based on online posts. Considering that our ensemble model ranked first in tasks A and C of that shared task, the same model architecture seems applicable to other similar tasks, where the goal is to detect different types of mental health issues from social media posts.

We believe that the user-level attention mechanism played an important role in the good results achieved on these shared tasks. It would be interesting to qualitatively analyze the results of the attention mechanism, to see how they correlate with human perception, i.e. whether the posts to which the attention mechanism assigns more weight are actually the same posts that seem more informative to a health specialist for detecting anorexia.

Also, during the development phase, it was found that removing any of the 8 sub-models (even the sub-models with low individual performance) negatively affected the result of the final ensemble classifier. It would be interesting to measure quantitatively the contribution of each of the 8 neural sub-models to the result of the final classifier. This could then be leveraged to improve the performance of the system.

An additional research direction is the use of linguistic features and metadata. The current model does not explicitly use such features; however, Trotzek et al. [32] showed that they can significantly improve the early detection of anorexia.

Lastly, it would be interesting to experiment with more diverse architectures in the neural sub-models (e.g. by using other hidden layer architectures, such as recursive neural networks [11,29]) as a way of improving the performance of the current ensemble classifier.

Acknowledgment

We would like to thank the reviewers for their comments on an earlier version of this paper. This work was financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

1. Ball, M., Patel, R., Hayes, R.D., Dobson, R.J., Stewart, R.: TextHunter – a user friendly tool for extracting generic concepts from free text in clinical research. In: AMIA Annual Symposium Proceedings. vol. 2014, p. 729. American Medical Informatics Association (2014)
2. Benton, A., Mitchell, M., Hovy, D.: Multitask learning for mental health conditions with limited social media data. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017). pp. 152–162. Association for Computational Linguistics, Valencia, Spain (April 2017)
3. Calvo, R.A., Milne, D.N., Hussain, M.S., Christensen, H.: Natural language processing in mental health applications using non-clinical texts. Natural Language Engineering 23(5), 649–685 (2017)
4. Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., Robinson, T.: One billion word benchmark for measuring progress in statistical language modeling. In: 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014). Singapore (September 2014)
5. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). pp. 1724–1734. Doha, Qatar (October 2014)
6. Coppersmith, G., Dredze, M., Harman, C.: Quantifying mental health signals in Twitter. In: Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych 2014). pp. 51–60. Baltimore, Maryland, USA (June 2014)
7. Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K., Mitchell, M.: CLPsych 2015 shared task: Depression and PTSD on Twitter. In: Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych 2015). pp. 31–39. Association for Computational Linguistics, Denver, Colorado (2015)
8. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. In: Proceedings of the Eleventh International Conference on Web and Social Media. pp. 512–515. Montréal, Canada (May 2017)
9. DeVault, D., Georgila, K., Artstein, R., Morbini, F., Traum, D., Scherer, S., Morency, L.P., et al.: Verbal indicators of psychological distress in interactive dialogue with a virtual human. In: Proceedings of the Special Interest Group on Discourse and Dialogue Conference (SIGDIAL 2013). pp. 193–202 (2013)
10. Giatsoglou, M., Vozalis, M.G., Diamantaras, K., Vakali, A., Sarigiannidis, G., Chatzisavvas, K.C.: Sentiment analysis leveraging emotions and word embeddings. Expert Systems with Applications 69, 214–224 (2017)
11. Goller, C., Kuchler, A.: Learning task-dependent distributed representations by backpropagation through structure. Neural Networks 1, 347–352 (1996)
12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
13. Jackson, R.G., Patel, R., Jayatilleke, N., Kolliakou, A., Ball, M., Gorrell, G., Roberts, A., Dobson, R.J., Stewart, R.: Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project. BMJ Open 7(1) (2017)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: The 3rd International Conference for Learning Representations (ICLR 2015). San Diego, California, USA (May 2015)
15. LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning. In: Shape, Contour and Grouping in Computer Vision, pp. 319–345 (1999)
16. Losada, D.E., Crestani, F.: A test collection for research on depression and language use. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 28–39. Evora, Portugal (September 2016)
17. Losada, D.E., Crestani, F., Parapar, J.: eRisk 2017: CLEF lab on early risk prediction on the internet: Experimental foundations. In: Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 346–360. Dublin, Ireland (September 2017)
18. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk: Early risk prediction on the internet. In: CLEF 2018: Experimental IR Meets Multilinguality, Multimodality, and Interaction. pp. 343–361. Avignon, France (September 2018)
19. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2019: Early risk prediction on the internet. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 10th International Conference of the CLEF Association (CLEF 2019). Lugano, Switzerland (September 2019)
20. Lynn, V., Goodman, A., Niederhoffer, K., Loveys, K., Resnik, P., Schwartz, H.A.: CLPsych 2018 shared task: Predicting current and future psychological health from childhood essays. In: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic (CLPsych 2018). pp. 37–46. Association for Computational Linguistics, New Orleans, LA (2018)
21. Milne, D.N., Pink, G., Hachey, B., Calvo, R.A.: CLPsych 2016 shared task: Triaging content in online peer-support forums. In: Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2016). pp. 118–127. Association for Computational Linguistics, San Diego, CA, USA (June 2016)
22. Mohammadi, E., Amini, H., Kosseim, L.: CLaC at CLPsych 2019: Fusion of neural features and predicted class probabilities for suicide risk assessment based on online posts. In: Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic (CLPsych 2019). Minneapolis, Minnesota, USA (June 2019)
23. Ofoghi, B., Mann, M., Verspoor, K.: Towards early discovery of salient health threats: A social media emotion classification technique. In: Biocomputing 2016: Proceedings of the Pacific Symposium. pp. 504–515. Kohala Coast, Hawaii (January 2016)
24. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS 2017 Autodiff Workshop. Long Beach, California, USA (January 2017)
25. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
26. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). pp. 1532–1543. Doha, Qatar (October 2014)
27. Pestian, J., Nasrallah, H., Matykiewicz, P., Bennett, A., Leenaars, A.: Suicide note classification using natural language processing: A content analysis. Biomedical Informatics Insights 3, BII–S4706 (2010)
28. Shen, J.H., Rudzicz, F.: Detecting anxiety through Reddit. In: Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology – From Linguistic Signal to Clinical Reality (CLPsych 2017). pp. 58–65 (2017)
29. Socher, R., Lin, C.C.Y., Ng, A., Manning, C.: Parsing natural scenes and natural language with recursive neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011). pp. 129–136. Bellevue, Washington, USA (June 2011)
30. Struik, L.L., Baskerville, N.B.: The role of Facebook in Crush the Crave, a mobile- and social media-based smoking cessation intervention: Qualitative framework analysis of posts. Journal of Medical Internet Research 16(7) (2014)
31. Thompson, J.J., Leung, B.H., Blair, M.R., Taboada, M.: Sentiment analysis of player chat messaging in the video game StarCraft 2: Extending a lexicon-based model. Knowledge-Based Systems 137, 149–162 (December 2017)
32. Trotzek, M., Koitka, S., Friedrich, C.M.: Word embeddings and linguistic metadata at the CLEF 2018 tasks for early detection of depression and anorexia. In: Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum. Avignon, France (September 2018)
33. Yang, C.C., Jiang, L., Yang, H., Tang, X.: Detecting signals of adverse drug reactions from health consumer contributed content in social media. In: Proceedings of the ACM SIGKDD Workshop on Health Informatics (HI-KDD 2012). Beijing, China (August 2012)
34. Yin, J., Karimi, S., Lampert, A., Cameron, M., Robinson, B., Power, R.: Using social media to enhance emergency situation awareness. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015). Buenos Aires, Argentina (July 2015)
35. Zirikly, A., Resnik, P., Uzuner, Ö., Hollingshead, K.: CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts. In: Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic (CLPsych 2019). Minneapolis, Minnesota, USA (June 2019)