=Paper=
{{Paper
|id=Vol-2696/paper_53
|storemode=property
|title=Early Mental Health Risk Assessment through Writing Styles, Topics and Neural Models
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_53.pdf
|volume=Vol-2696
|authors=Diego Maupome,Maxime D. Armstrong,Raouf Moncef Belbahar,Josselin Alezot,Rhon Balassiano,Marc Queudot,Sébastien Mosser,Marie-Jean Meurs
|dblpUrl=https://dblp.org/rec/conf/clef/MaupomeABABQ0M20
}}
==Early Mental Health Risk Assessment through Writing Styles, Topics and Neural Models==
Early Mental Health Risk Assessment through Writing Styles, Topics and Neural Models

Diego Maupomé, Maxime D. Armstrong, Raouf Belbahar, Josselin Alezot, Rhon Balassiano, Marc Queudot, Sébastien Mosser[0000-0001-9769-216X], and Marie-Jean Meurs[0000-0001-8196-2153]
University of Quebec in Montreal (UQAM)
meurs.marie-jean@uqam.ca

Abstract. This paper describes the participation of the RELAI team in the eRisk 2020 tasks. The 2020 edition of eRisk proposed two tasks: (T1) early assessment of the risk of self-harm, and (T2) measuring the severity of the signs of depression in social media users. The second task focused on automatically filling a depression questionnaire given user writing history. The RELAI team participated in both tasks, addressing them with topic modeling algorithms (LDA and Anchor), neural models with three different architectures (Deep Averaging Networks (DANs), Contextualizers, and Recurrent Neural Networks (RNNs)), and an approach based on writing styles. For the second task, related to early detection of depression, the system based on LDA performed well according to all the evaluation metrics, and achieved the best performance among participants according to the Average Difference between Overall Depression Levels (ADODL), with a score of 83.15%. Overall, the submitted systems achieved promising results, and suggest that evidence extracted from social media could be useful for early mental health risk assessment.

Keywords: Early Risk Detection · Topic Modeling · Neural Networks · Mental Health Risk Assessment

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

The global goal of the eRisk challenges is the early detection of at-risk people from their textual production on social media, using Natural Language Processing (NLP) techniques. In 2020, two different tasks were put forth: early detection of signs of self-harm (T1), and measuring the severity of the signs of depression (T2), using textual data from related subreddits (https://www.reddit.com). These tasks are follow-ups of tasks 2 and 3 from 2019, respectively. T1 consists in sequentially processing writings from a set of social media users, and detecting signs of self-harm [12], classifying users as at-risk or not. The goal is not only to perform this classification but also to do it as early as possible, i.e., based on as few writings per user as possible. T2 consists in automatically filling Beck's Depression Inventory (BDI) [3] for a set of users, based on a history of their postings on social media. In this work, we describe the participation of the RELAI team from University of Quebec in Montreal (UQAM) at the Conference and Labs of the Evaluation Forum (CLEF) 2020 eRisk tasks for early detection of signs of self-harm and depression [12].

The article is organized as follows. Sections 2 and 3 describe the proposed approaches and the research background they rely on, and present the applied methodologies, experimental setup and results obtained on T1 and T2, respectively. Each Section concludes with a discussion about the results, and suggests possible future improvements.
2 Early Signs of Self-Harm (T1)

Self-harm is thought to affect about 12% of adolescents [9]. In 2014-15, hospitalizations due to self-inflicted injuries in Canada were thrice as numerous as suicides [18]. While it is co-morbid with other mental health disorders [5], its peculiar characteristics and cyclical nature have led non-suicidal self-injury to be included as an independent disorder in the DSM-5 [22]. Further, only a small fraction of young people will seek professional help either before or after engaging in self-harm (9-12%) [9]. This highlights the potential of automatic means of detection on social media [20]. Such is the aim of the current task. We give hereafter a brief description of the corpus, the metrics, as well as our participation.

2.1 Task and Data

As previously mentioned, this task was first introduced in the previous iteration of eRisk (2019). In 2020, the dataset (training and test) consists of users exhibiting signs of self-harm and control users. For information regarding the labeling process, we direct the reader to [11]. Table 1 presents some statistics about the 2020 dataset. The test set is markedly different from the training set both in class proportions and user verbosity. Indeed, the ratio of positive subjects in the test set is roughly double that of the training set. In addition, both positive and negative test users have fewer and shorter documents compared to their training counterparts.

Table 1. Summary statistics of the training and test sets for T1 (Self-Harm). Distributions in training and test sets are reported with % in parentheses.

                Training                          Test
           Self-Harm   Control    Total      Self-Harm   Control    Total
Nb. users  41 (12%)    299 (88%)  340        104 (25%)   319 (75%)  423

One of the chief concerns of the task is the early detection of positive subjects. Therefore, during the test stage, a REST server (https://early.irlab.org/server.html) was set up by the task organizers to iteratively release user writings item by item during a limited period of time. Participants had to send a GET request to retrieve the latest writing of each user. After each request, the processing/prediction pipeline runs and returns to the server, via a POST request, the predictions about each individual. After each release of writings, a decision had to be emitted. Classifying a user as suffering from self-harm (decision: 1) was considered final, while predicting the user as not at risk (decision: 0) was open to updates in the following rounds. In order to evaluate the performance of the systems, and to explore ranking-based measures, the task organizers also asked participants to provide an estimated score of the level of self-harm with each decision.
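To make the interaction with the evaluation server concrete, the following minimal sketch outlines the GET/POST loop described above. It is only illustrative: the endpoint paths, payload fields and the predict function are placeholders rather than the official API specification, which is documented by the task organizers.

```python
import requests

SERVER = "https://early.irlab.org/server"   # placeholder base URL
TOKEN = "TEAM_TOKEN"                        # placeholder team token

def predict(user_history):
    """Placeholder for the actual model: returns (decision, score)."""
    score = min(1.0, len(user_history) / 100)   # dummy risk score
    return int(score > 0.5), score

histories = {}                               # user id -> list of writings seen so far
while True:
    # GET the next round of writings (one new post per user)
    round_writings = requests.get(f"{SERVER}/getwritings/{TOKEN}").json()
    if not round_writings:                   # no more writings to process
        break
    decisions = []
    for item in round_writings:
        histories.setdefault(item["nick"], []).append(item["content"])
        decision, score = predict(histories[item["nick"]])
        # decision 1 (at risk) is final; decision 0 can be revised in later rounds
        decisions.append({"nick": item["nick"],
                          "decision": decision,
                          "score": score})
    # POST the decisions and scores for this round back to the server
    requests.post(f"{SERVER}/submit/{TOKEN}", json=decisions)
```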
2.2 Evaluation Metrics

Several metrics are used to evaluate the systems in T1: standard classification metrics like precision, recall and F1 score, as well as specific, time-aware classification metrics such as ERDE, latency_TP, speed and F_latency (i.e., latency-weighted F1). ERDE (Early Risk Detection Error) is a metric designed for the eRisk tasks, taking into account both the correctness of predictions and the delay taken by the system to make these predictions. The delay in the decision for a given user is defined by k, the number of posts processed by the system before making a decision. latency_TP takes into account the latency for true positive predictions only, because they represent users needing early intervention, as opposed to true negative predictions. This measure is based on the median number of posts the system has to analyze to detect true positives. The last two metrics are speed and F_latency [17]. Computing the speed requires a penalty factor, which takes into account the number of a user's writings needed by the system to make a decision. F_latency is a latency-weighted F1 score, which combines the effectiveness of the system with its delay by multiplying the F1 score by the speed metric.

Since the 2019 edition, the organizers have added a ranking-based evaluation process, which is not based on the binary predictions but rather on their associated scores. Two metrics are proposed for evaluating this ranking: the precision at k (P@k, the percentage of true positive users among the k users predicted by the system as presenting the highest risk), and the Normalized Discounted Cumulative Gain (NDCG, which evaluates a system based on the relevance of the rankings built from its results). These metrics are computed at several points in the user histories; this year, they were reported after processing 1, 100, 500 and 1000 writings.

2.3 Related Work

In 2019, the best precision, F1 and ERDE [7] were obtained by a system based on supervised learning for text classification called Sequential S3 (SS3, for Smoothness, Significance, and Sanction) [6], submitted by the UNSL team. As for the other evaluation metrics such as recall, latency_TP and speed, the system submitted by LTL-INAOE achieved the best results, with an approach based on the similarity between a given piece of text and a set of phrases potentially related to self-harm. Table 2 reports the best results obtained by the participating teams of eRisk 2019 T2. In the next Sections, some details are given about how we approached the problem of self-harm detection.

Table 2. Summary of best results obtained on eRisk 2019 T2 (Self-harm)

System          Run  precision  recall  F1    ERDE_5  ERDE_50  latency_TP  speed  F_latency
SS3 [7]         0    .710       .410    .520  .090    .073     2           1      .520
SS3 [7]         1    .310       .880    .460  .082    .049     3           .990   .450
LTL-INAOE [16]  0    .120       1       .220  .125    .106     1           1      .220

2.4 Topic Models

Topics discussed by users of social media could provide insight into their mental status [13]. We chose to explore this hypothesis using Latent Dirichlet Allocation (LDA) [4], the most widely known topic modeling algorithm, as well as the Anchor variant [2]. For both topic modeling approaches, the best results were obtained when training the model using the entire textual production of each user as a single document, concatenating the posts together.

Latent Dirichlet Allocation. Two different LDA models were tested, one based on word stems and one based on word bigrams. Both models operate on documents with stop-words and short words (3 characters or fewer) removed. The first model further stems the remaining words; the second is trained instead on word bigrams. Once the LDA model is trained, users are mapped to a vector of topics. Finally, a logistic classifier is trained on these vectors. We used different training-validation splits to tune the hyper-parameters, namely the number of topics extracted. The best results are obtained by the stemmed unigram model, using a total of 14 topics.

Anchor Variant. This method tries to find a set of anchor words for each discovered topic; anchor words are assigned high probability in only one topic. The implementation of the Anchor-based system is similar to that of the LDA models previously discussed, using stop-word removal and stemming. Further, tokens used by over 60% of users are disregarded. As with the standard LDA approach, we find our best results in validation using 14 topics.
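As an illustration of the LDA-based pipeline described above, the following sketch builds per-user topic vectors and trains a logistic classifier on them. It assumes gensim and scikit-learn and simplifies the pre-processing; the actual tokenization, stemmer and hyper-parameter tuning used in our runs may differ.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem.porter import PorterStemmer
from sklearn.linear_model import LogisticRegression

stemmer = PorterStemmer()

def preprocess(user_posts):
    """Concatenate a user's posts, drop stop-words and short words, stem the rest."""
    tokens = " ".join(user_posts).lower().split()
    return [stemmer.stem(t) for t in tokens
            if t not in STOPWORDS and len(t) > 3]

def train_lda_classifier(users, labels, num_topics=14):
    """users: list of lists of posts (one list per user); labels: 1 = self-harm, 0 = control."""
    docs = [preprocess(posts) for posts in users]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # Map every user to a dense topic-proportion vector
    features = [[prob for _, prob in
                 lda.get_document_topics(bow, minimum_probability=0.0)]
                for bow in corpus]
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    return dictionary, lda, clf
```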
2.5 Neural Encoders

One of the principal challenges of the task is to combine the analyses of a user's writings in order to arrive at a single prediction for said user. In this respect, the flexibility afforded by the back-propagation framework allowed us to explore several manners in which to structure prediction models. Broadly speaking, we distinguish two modes of aggregation for encoding the documents making up a user into a single user encoding. The first mode, nested aggregation, uses two encoders: the first encoder encodes documents independently of each other, and the second aggregates these encoded documents together. The second mode, flat aggregation, uses a single encoder combining the words from all documents simultaneously. We explored such aggregations with three different architectures as encoders: Deep Averaging Networks (DANs), Contextualizers, and Recurrent Neural Networks (RNNs). Contrary to the other two models, DANs [10] cannot account for the position of items, so we only used nested aggregation with these models, having one DAN encode each of a user's documents independently, and a separate DAN aggregate these encoded documents. In the case of Contextualizers, a flat aggregation is more interesting, as even small parts of documents can be put into the context of other passages. We therefore opted for a positional encoding consisting of a concatenation of three vectors of sinusoids [19] for each word: one corresponding to the position of the word in the document, and the two others corresponding to the position of the document. Rather than simply providing the position of a document in the user's history, we provide a position with two components: one counting the units of time elapsed since the writing of the post, and a second one enumerating the documents written within the same unit of time (a day in our case). As for RNNs, we borrow the inter-document attention RNN approach described in [14], as the conditions are very similar.
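The following sketch illustrates the concatenated sinusoidal positional encoding described above, following the formulation of [19]. The per-component dimension and the choice of measuring elapsed time from the user's first post are illustrative assumptions.

```python
import numpy as np

def sinusoids(position, dim):
    """Standard sinusoidal encoding of a scalar position into `dim` values [19]."""
    i = np.arange(dim // 2)
    angles = position / np.power(10000.0, 2 * i / dim)
    return np.concatenate([np.sin(angles), np.cos(angles)])

def word_position_encoding(word_idx, days_elapsed, doc_idx_in_day, dim=64):
    """Concatenate three sinusoidal vectors: the position of the word in its
    document, the number of time units (days) elapsed (here assumed to be
    measured from the user's first post), and the rank of the document within
    that day. The dimension per component is an illustrative choice."""
    return np.concatenate([
        sinusoids(word_idx, dim),
        sinusoids(days_elapsed, dim),
        sinusoids(doc_idx_in_day, dim),
    ])
```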
2.6 Results and Discussion

Results of the RELAI systems on the test set are presented in Table 3 for the decision-based evaluation and Table 4 for the ranking-based evaluation. Overall, the proposed models appear to have erred on the side of caution, achieving high recall but relatively low precision.

Table 3. Results on the test set of the RELAI systems in T1: Self-harm detection

System                  precision  recall  F1    ERDE_5  ERDE_50  latency_TP  speed  F_latency
Anchor (run 0)          .341       .865    .489  .188    .136     2           .996   .487
LDA (run 1)             .350       .885    .501  .190    .130     2           .996   .499
Contextualizer (run 2)  .438       .740    .550  .245    .132     8           .973   .535
IDA LSTM (run 3)        .291       .894    .439  .306    .168     7           .977   .428
DAN (run 4)             .381       .846    .525  .260    .141     7           .977   .513

In terms of ranking-based evaluation, precision increases from the measurement at 100 writings and remains high throughout for all models, suggesting issues with the policies mapping scores to decisions. The number of positive subjects being over 100, P@10 is less indicative of classification viability. While all models achieve high NDCG@10, the neural models especially, NDCG@100 remains modest throughout. Contrasting this with the high recall and low precision achieved further illustrates the need for policy adjustments.

Table 4. Ranking-based evaluation on the test set of the RELAI systems in T1

                1 writing                    100 writings
Model           P@10  NDCG@10  NDCG@100      P@10  NDCG@10  NDCG@100
Anchor          .7    .80      .52           .8    .87      .52
LDA             .3    .28      .43           .6    .69      .47
Contextualizer  .2    .20      .27           .7    .81      .63
IDA LSTM        .2    .20      .27           .9    .94      .51
DAN             .2    .20      .27           .7    .68      .59

                500 writings                 1000 writings
Model           P@10  NDCG@10  NDCG@100      P@10  NDCG@10  NDCG@100
Anchor          .8    .87      .52           .8    .87      .50
LDA             .6    .69      .47           .7    .75      .47
Contextualizer  .8    .87      .70           .8    .87      .72
IDA LSTM        1     1        .59           1     1        .60
DAN             1     1        .71           .9    .81      .66

3 Early Signs of Depression (T2)

Given a user's writing history and based on evidence found in it, T2 participants had to fill a standard depression questionnaire defined from Beck's Depression Inventory (BDI) [3]. The questionnaire is composed of 21 questions with 4 possible answers (from 0 to 3), except for questions 16 and 18, where there are seven possible answers (0, 1a, 1b, 2a, 2b, 3a, 3b). The answers to each question represent an ordinal scale, each one associated with an integer value. The sum of a subject's answers is considered their score. Additionally, these scores are associated with the following categories: minimal depression (depression levels 0-9), mild depression (10-18), moderate depression (19-29) and severe depression (30-63). In T2, the proposed systems had to estimate each user's response to each individual question. The predictions are therefore much more complex than those expected in T1. We give a brief description of the corpus, the metrics, as well as our participation.

3.1 Task and Data

The second task of the eRisk 2020 lab was introduced in 2019, with the goal of exploring much finer-grained prediction of the severity of depression symptoms [12]. For this purpose, each subject was asked to fill the BDI questionnaire. The systems submitted by participants then had to estimate every user's answer to each question, given the writing history of the users. In order to assess the correctness of the responses provided by the participants, a number of metrics are used. Some are concerned with obtaining the exact answers, whereas others are concerned with proximity in individual answers or in overall BDI score. These metrics are described in the following Section. While participants in the previous iteration did not have training data at their disposal, 20 users were made available to eRisk 2020 participants, with 70 more used for evaluation. Their distributions according to the standard categorization of BDI scores are shown in Table 5. As in T1, the training and test sets differ in this respect.

Table 5. Summary statistics of the training and test sets for T2 (Depression)

                               Training  Test
Total nb. users                20        70
Nb. minimal depression users   4 (20%)   10 (14%)
Nb. mild depression users      4 (20%)   23 (33%)
Nb. moderate depression users  4 (20%)   18 (26%)
Nb. severe depression users    8 (40%)   19 (27%)
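To make the BDI scoring described at the beginning of this Section concrete, the following sketch sums the answer values into an overall score and maps the score to a depression category. The mapping of the lettered answers (1a, 1b, ...) to their numeric values is our reading of the ordinal scale described above.

```python
# Assumed mapping of answer labels to integer values (lettered answers are
# read as their leading digit, per the ordinal scale described above).
ANSWER_VALUES = {"0": 0, "1": 1, "1a": 1, "1b": 1,
                 "2": 2, "2a": 2, "2b": 2,
                 "3": 3, "3a": 3, "3b": 3}

def bdi_score(answers):
    """answers: list of 21 answer labels, e.g. ['0', '2', ..., '1a', ...]."""
    return sum(ANSWER_VALUES[a] for a in answers)

def bdi_category(score):
    """Map an overall BDI score to the categories used in Table 5."""
    if score <= 9:
        return "minimal"
    if score <= 18:
        return "mild"
    if score <= 29:
        return "moderate"
    return "severe"
```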
3.2 Evaluation Metrics

Four metrics evaluate the systems addressing T2. The first one is the Average Hit Rate (AHR). For a given user, the hit rate is simply the rate of matches between the answers produced automatically by the system and the user's actual answers, i.e., the rate of correct guesses over the total number of questions. The AHR is then the mean hit rate across all users. The second metric is the Average Closeness Rate (ACR). The closeness rate is a finer-grained measure of the disparity between the prediction and the ground truth for each answer, as defined by the ordinal scale on which the answers are placed. To calculate it, for each question, one takes the system's and the user's answers and computes the absolute difference (ad) between them. The closeness rate for a user is the mean closeness rate over the questions, and the ACR is the mean closeness rate across users. The third metric is the Average Difference between Overall Depression Levels (ADODL), which is the mean over all users of the Difference between Overall Depression Levels (DODL), i.e., the absolute difference between the ground truth and the predicted overall scores. The last metric is the Depression Category Hit Rate (DCHR), which is the fraction of cases where the system's score and the user's score fall in the same category.

3.3 Related Work

In eRisk 2019, the highest AHR was achieved by the SS3 system, trained using the dataset of the eRisk 2018 depression detection task [7]. Since the model was designed as a "yes or no" classifier, the authors had to make some modifications so that it could return a depression level between 0 and 63 and thus fill a BDI questionnaire. Additionally, a question-centered variant was built, achieving the aforementioned AHR. The best ACR (distance-based variant) was also achieved by a variant of the previous system, using a probability distribution depending on the value of the expected answer. In terms of ADODL and DCHR, the best performances were reached by an unsupervised approach [1] using the distance between the answers and all the sentences of a user's writing history. Table 6 reports the best results obtained by the participating teams of eRisk 2019 T3.

Table 6. Summary of best results obtained on eRisk 2019 T3 (Depression)

Run                                AHR     ACR     ADODL   DCHR
CAMH GPT nearest unsupervised [1]  23.81%  57.06%  81.03%  45.00%
UNSLC [7]                          41.43%  69.13%  78.02%  40.00%
UNSLE [7]                          40.71%  71.27%  80.48%  35.00%

According to [11], these results show that it is possible to automatically extract some depression signals from social media activity, although the performance is still modest and far from that of a truly effective depression screening tool.
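Before turning to our approaches, the following sketch gives one plausible reading of the AHR, ADODL and DCHR metrics reported in Tables 6 and 8. The normalized form of ADODL is an assumption based on the percentage scores reported; the authoritative definitions are those of the task overview [12].

```python
def average_hit_rate(pred_answers, true_answers):
    """Mean fraction of exactly matched answers per user (AHR).
    Both arguments are lists of per-user answer lists."""
    rates = [sum(p == t for p, t in zip(pu, tu)) / len(tu)
             for pu, tu in zip(pred_answers, true_answers)]
    return sum(rates) / len(rates)

def adodl(pred_scores, true_scores, max_score=63):
    """Average Difference between Overall Depression Levels, assuming the
    normalized form (max_score - |difference|) / max_score, which yields
    percentages comparable to Tables 6 and 8."""
    dodls = [(max_score - abs(p - t)) / max_score
             for p, t in zip(pred_scores, true_scores)]
    return sum(dodls) / len(dodls)

def dchr(pred_scores, true_scores):
    """Fraction of users whose predicted and true scores fall in the same
    depression category (reusing bdi_category from the earlier sketch)."""
    hits = [bdi_category(p) == bdi_category(t)
            for p, t in zip(pred_scores, true_scores)]
    return sum(hits) / len(hits)
```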
3.4 Approaching the Task as One of Authorship Attribution

The BDI was filled by only 20 users. Treating each of these users as one observation to be mapped to the answers they gave to the questionnaire would lead to a very limited number of examples. We hence approached the problem as one of authorship attribution, through two different methods which rely on decision models taking two documents and outputting the probability that both documents were written by the same user. The proposed systems exploit these decision models in different ways: one attempts to relate users to each other, and the other attempts to relate a user to the text contained in the BDI questionnaire itself. These methods are referred to as user-based and answer-based, respectively. One key advantage of this authorship framework is that the training of the decision models does not require the annotation provided for the training set; the models can be trained on unannotated data from the same domain. The dataset of the eRisk 2018 depression risk detection task could hence be used for the training of the authorship attribution models, using the 2020 training data for validation. These models include LDA and a Contextualizer, as well as a stylometry-based approach. As for validation, in the user-based approach, some users for whom the BDI is known were used as a knowledge base. The set of users was therefore split in half, using one half as a knowledge base and the other half for validation. In contrast, the answer-based approach allows validating on all 20 users.

3.5 Topic Models

One of the representation models used for this task was an LDA model. Using topic modeling, the strategy is to create topic vectors for users, and then to measure the distance between these vectors in the user-based approach, or between these vectors and the topic representations of the answers in the answer-based approach. As previously mentioned, the LDA model is trained on the eRisk 2018 depression risk detection dataset. As in the self-harm task, each user's posts are grouped together into larger documents. While the number of such groups is the same for all users, its value is set by observing its effect on validation results. The pre-processing involves removing stop-words and short words (3 letters or fewer), and stemming the remaining words. Using the pre-processed documents, a dictionary and a bag-of-words representation are created to train the LDA model. A filter is applied when creating the dictionary, removing words appearing in fewer than 20 documents or in over half of the documents. We found better results when requiring the model to find 30 topics. The trained LDA model is then used to create vectors for the documents of the eRisk 2020 Task 2 dataset, each document being one of the Reddit posts included in the dataset. Finally, the distance between every pair of document vectors is measured using cosine similarity, which naturally falls in the unit interval, as the topic vectors are strictly positive. For both approaches, we aimed to maximize the ADODL metric. For the answer-based approach, the different experiments show that the best ADODL is reached when combining each user's documents into 19 groups, with an LDA model trained for 30 topics. The ADODL attained by the user-based approach is approximately even when concatenating users' posts into 10 to 19 groups. We opted to use 19 groups at test time, as we posit this allows for finer-grained predictions.

3.6 Contextualizer

Contextualizer encoders were also used for this task. This time, the aggregation considerations of the first task were no longer relevant. We tested two different approaches for the authorship decision task: encoding each document separately (parallel) or together (simultaneous). Both encoders were trained for this authorship task, ultimately using the depression questionnaire task as final validation, in both the user-based and answer-based forms. For the parallel encoder, the angular similarity between the document vectors is used. The simultaneous encoder, on the other hand, outputs the probability of the author being the same by design. In order to prevent overfitting, we stopped the training of the authorship models by monitoring their accuracy on unseen pairs of documents, including unseen pairs of familiar documents, unseen documents by familiar users, as well as unseen users. After extensive testing, we selected the parallel encoder for the user-based approach, and the simultaneous one for answer-based prediction.
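The angular similarity used with the parallel encoder can be computed as follows; this is the standard rescaling of cosine similarity into the unit interval, assumed here rather than taken from a specific implementation.

```python
import numpy as np

def angular_similarity(u, v):
    """Angular similarity between two document vectors: 1 when the vectors
    point in the same direction, 0 when they point in opposite directions."""
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cosine = np.clip(cosine, -1.0, 1.0)   # guard against rounding errors
    return 1.0 - np.arccos(cosine) / np.pi
```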
3.7 Stylometry

This approach focuses on the writing style of a document in order to characterize its author. To this end, several linguistic features served as document representations, such as word and sentence lengths, and word and character frequencies. These features were largely inspired by stylometric approaches to authorship attribution in instant messaging [15, 8], as well as in legal proceedings and film reviews [21]. They are presented in Table 7. As with the LDA system, users' documents were concatenated together, in order to have the same number of documents per user while still accounting for all their production. Features are normalized with respect to the length of these groups, whether this length pertains to words, characters or sentences. These features result in document representations of size 585. These vector representations are then compared using cosine similarity. As previously mentioned, validation was performed with the subjects for whom the BDI was available.

Table 7. Linguistic features and their associated number of dimensions used for the stylometry-based authorship model. The frequencies of the most frequent tokens are computed and compared irrespective of what these tokens are or whether they are the same for any two users.

Type           Feature (number of dimensions)
Syntactic      Frequency of select Parts of Speech (46)
               Frequency of most frequent word unigrams (100)
               Frequency of most frequent word bigrams (100)
               Frequency of most frequent character unigrams (100)
               Frequency of most frequent character bigrams (100)
               Frequency of most frequent character trigrams (100)
Lexical        Number of unique words (1)
               Number of alphanumeric characters (1)
               Number of digits (1)
               Number of non-ASCII characters (1)
               Punctuation ratio (1)
               Average length of words (1)
               Number of long words (1)
               Number of short words (1)
Morphological  Number of uppercase words (1)
               Number of uppercase characters (26)
               Average length of sentences (1)
Pragmatical    Number of hyperlinks (1)

3.8 Results and Discussion

The results achieved on the test set are shown in Table 8. The stricter metrics, the hit rates, were fairly low for all five models. The user-based approach produced superior results across metrics and authorship models. This is unsurprising for LDA, where considerable parts of users' activity will likely differ in subject matter from the BDI questionnaire. Given that the Contextualizer encoder matches documents individually to answers, there might be gains in performance to be obtained by considering the highest-scoring document, rather than the average, for each answer. Nevertheless, although the answer-based approach was outperformed by the user-based one, it has the very appealing advantage of not requiring annotated data, i.e., users with known BDIs.

Table 8. Results on the test set for T2 (Depression)

Model                                        AHR     ACR     ADODL   DCHR
LDA (answer-based)                           28.50%  60.79%  79.07%  30.00%
LDA (user-based)                             36.39%  68.32%  83.15%  34.29%
Contextualizer (answer-based, simultaneous)  21.16%  55.40%  73.76%  27.14%
Contextualizer (user-based, parallel)        36.80%  68.37%  80.84%  22.86%
Stylometry (user-based)                      37.28%  68.37%  80.70%  20.00%

4 Conclusion

This paper has described the experiments performed by the RELAI team from UQAM in the context of the eRisk 2020 lab. Five models were submitted for each of the two tasks.

For the first task, related to early detection of self-harm, two topic modeling systems were proposed, one using the standard LDA algorithm, and one relying on its Anchor variant. The three remaining systems were based on neural networks, using three different architectures as encoders: Deep Averaging Networks (DANs), Contextualizers, and Recurrent Neural Networks (RNNs). All models are recall-oriented, which is arguably a safer decision policy.
As evidenced by the ranking-based evaluations, however, tweaking this policy could result in greater precision. Globally, we achieved moderate results: the precision and recall obtained lead to F1 scores between 0.439 and 0.550, which is decent compared to other systems. The Anchor model stood out among our submitted models by providing fast predictions from little content. This could be explained by the presence of discriminative anchor words in the provided user writings, which allows the system to rapidly predict whether a user is at risk or not.

For the second task, related to early detection of depression, we approached the problem as one of authorship attribution, through two different methods: user-based and answer-based. This approach affords the freedom to build decision models in a variety of ways. We relied again on LDA and the Contextualizer, as well as a stylometry-based approach, achieving the best result among participants for ADODL (83.15%) with the LDA model and the user-based approach. This metric is arguably the most relevant when it comes to the overall assessment of depression. Nonetheless, the ACR could be more interesting moving forward, as it pertains to informing a clinician about the exact symptoms a patient is experiencing. Also, the LDA model shows a better balance between the different metrics. Almost all of the other submitted approaches achieved results above the average for each metric; for example, the user-based stylometric model achieved the second-best AHR.

Some other aspects are worth noting. First, our systems are completely independent of the domain; they make decisions based only on features extracted from the provided texts, without requiring heavy feature engineering or domain-specific hand-crafted features. Also, the stark difference between the training and test sets, both in the proportions of users and in the number of writings, could impact the performance of the submitted models for the two tasks. Overall, the test results show the promise of each approach. In future work, we will analyze in more detail the results obtained for each task. We plan to incorporate more carefully selected features into our decision models, which could grant a better ability to identify users at risk. Finally, given the unique nature of T2, we will explore different variations to improve predictions at a finer-grained level.

Reproducibility. The source code of the presented systems is available under the GNU GPL v3 licence to ensure reproducibility. It can be found in the following repository: https://gitlab.ikb.info.uqam.ca/ikb-lab/nlp/eRisk2020

References

1. Abed-Esfahani, P., Howard, D., Maslej, M., Patel, S., Mann, V., Goegan, S., French, L.: Transfer Learning for Depression: Early Detection and Severity Prediction from Social Media Postings. In: CLEF (Working Notes) (2019)
2. Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., Zhu, M.: A practical algorithm for topic modeling with provable guarantees. In: International Conference on Machine Learning. pp. 280-288 (2013)
3. Beck, A.T., Ward, C.H., Mendelson, M., Mock, J., Erbaugh, J.: An Inventory for Measuring Depression. Archives of General Psychiatry 4(6), 561-571 (1961). https://doi.org/10.1001/archpsyc.1961.01710120031004
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3(Jan), 993-1022 (2003)
5. Brown, R.C., Plener, P.L.: Non-suicidal self-injury in adolescence. Current Psychiatry Reports 19(3), 20 (2017)
6. Burdisso, S.G., Errecalde, M., Montes-y-Gómez, M.: A text classification framework for simple and effective early depression detection over social media streams. Expert Systems with Applications 133, 182-197 (2019). https://doi.org/10.1016/j.eswa.2019.05.023
7. Burdisso, S.G., Errecalde, M., Montes-y-Gómez, M.: UNSL at eRisk 2019: a unified approach for anorexia, self-harm and depression detection in social media. In: Working Notes of the Conference and Labs of the Evaluation Forum - CEUR Workshop Proceedings. vol. 2380 (2019)
8. Cristani, M., Roffo, G., Segalin, C., Bazzani, L., Vinciarelli, A., Murino, V.: Conversationally-inspired stylometric features for authorship attribution in instant messaging. In: Proceedings of the 20th ACM International Conference on Multimedia. pp. 1121-1124 (2012)
9. Doyle, L., Treacy, M.P., Sheridan, A.: Self-harm in young people: Prevalence, associated factors, and help-seeking in school-going adolescents. International Journal of Mental Health Nursing 24(6), 485-494 (2015)
10. Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H.: Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 1681-1691 (2015)
11. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2019: Early Risk Prediction on the Internet. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 340-357. Springer (2019)
12. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2020: Early Risk Prediction on the Internet. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). Springer International Publishing (2020)
13. Maupomé, D., Meurs, M.J.: Using Topic Extraction on Social Media Content for the Early Detection of Depression. CLEF (Working Notes) 2125 (2018)
14. Maupomé, D., Queudot, M., Meurs, M.J.: Inter and intra document attention for depression risk assessment. In: Canadian Conference on Artificial Intelligence. pp. 333-341. Springer (2019)
15. Bhargava, M., Mehndiratta, P., Asawa, K.: Stylometric Analysis for Authorship Attribution on Twitter. In: Big Data Analytics: Second International Conference. pp. 37-47 (2013)
16. Ortega-Mendoza, R.M., Farías, D.I.H., Montes-y-Gómez, M.: LTL-INAOE's Participation at eRisk 2019: Detecting Anorexia in Social Media through Shared Personal Information. In: CLEF (Working Notes) (2019)
17. Sadeque, F., Xu, D., Bethard, S.: Measuring the latency of depression detection in social media. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. pp. 495-503 (2018)
18. Skinner, R., McFaull, S., Draca, J., Frechette, M., Kaur, J., Pearson, C., Thompson, W.: Suicide and self-inflicted injury hospitalizations in Canada (1979 to 2014/15). Health Promotion and Chronic Disease Prevention in Canada: Research, Policy and Practice 36(11), 243 (2016)
19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998-6008 (2017)
20. Xian, L., Vickers, S.D., Giordano, A.L., Lee, J., Kim, I.K., Ramaswamy, L.: #selfharm on Instagram: Quantitative Analysis and Classification of Non-Suicidal Self-Injury. In: 2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI). pp. 61-70. IEEE (2019)
21. Sari, Y., Stevenson, M., Vlachos, A.: Topic or Style? Exploring the Most Useful Features for Authorship Attribution. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 343-353 (2018)
22. Zetterqvist, M.: The DSM-5 diagnosis of nonsuicidal self-injury disorder: a review of the empirical literature. Child and Adolescent Psychiatry and Mental Health 9(1), 1-13 (2015)