=Paper=
{{Paper
|id=Vol-2696/paper_41
|storemode=property
|title=NLP-UNED at eRisk 2020: Self-harm Early Risk Detection with Sentiment Analysis and Linguistic Features
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_41.pdf
|volume=Vol-2696
|authors=Elena Campillo Ageitos,Juan Martinez-Romo,Lourdes Araujo
|dblpUrl=https://dblp.org/rec/conf/clef/AgeitosMA20
}}
==NLP-UNED at eRisk 2020: Self-harm Early Risk Detection with Sentiment Analysis and Linguistic Features==
NLP-UNED at eRisk 2020: Self-harm Early Risk Detection with Sentiment Analysis and Linguistic Features

Elena Campillo Ageitos1[0000−0003−0255−0834], Juan Martinez-Romo1,2[0000−0002−6905−7051], and Lourdes Araujo1,2[0000−0002−7657−4794]

1 NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED)
2 Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS)
ecampillo@lsi.uned.es, juaner@lsi.uned.es, lurdes@lsi.uned.es

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. Mental health problems such as depression are conditions that, when they go undetected, can have serious consequences. A less-known mental health problem that has been linked to depression is self-harm. There is evidence suggesting that people's writings can reflect these problems, and research has been done to detect these individuals through their content on social media. Early detection is crucial for mental health problems, and for this purpose a shared task named eRisk was proposed. This paper describes NLP-UNED's participation in the 2020 T1 subtask, in which participants were asked to create systems that detect early signs of self-harm in Reddit users. Our team presents a data analysis of the 2019 T2 subtask and proposes a simple feature-driven classifier with features based on first-person pronoun use, sentiment analysis and self-harm terminology.

Keywords: Early Risk Detection · Self-Harm Detection · Sentiment Analysis · Natural Language Processing

1 Introduction

Mental health problems, such as depression, are conditions that affect more people every day. These conditions may go undetected for many years, causing the people who suffer from them not to receive adequate medical assistance. Untreated mental health issues can lead to serious consequences, such as addictions or even suicide. Self-harm, also known as Non-Suicidal Self-Injury (NSSI from now on), is a lesser-known type of mental health problem that affects primarily young people [7]. Self-harm refers to the act of causing bodily harm to oneself with no suicidal intent, such as cutting, burning or hair pulling, and it has been linked to underlying mental health problems such as depression and anxiety [8]. It is a maladaptive form of coping [12] that causes pain and distress to the self-harmer, and could lead to unintentional suicide. It is important to dedicate efforts to better detecting mental health problems in society, so that the people affected can receive the help they need.

It has been shown that people who suffer from mental health problems show differences in the way they communicate with other people and in the way they write [4][24]. Natural Language Processing (NLP) can be used to analyze these people's writings and detect underlying mental health problems. Social media use has been on the rise in the past decades, and the sheer volume of information available on these platforms can be used for these purposes. Recent research has applied NLP techniques to develop systems that automatically detect users with potential mental health issues.

Early detection is key in the treatment of mental health problems, since a fast intervention improves the probability of a good prognosis. The longer a mental health problem goes undetected, the more likely serious consequences are to derive from it. Most of the efforts in the literature focus on detection, but not on early detection.
Early detection would allow a faster diagnosis, which would help mental health specialists intervene sooner. In light of this problem, the shared task eRisk was created. This task focuses on early detection of several mental health problems, such as depression, anorexia, and self-harm, on temporal data extracted from Reddit.

The 2020 eRisk task [14] proposed two different subtasks: Task 1 focused on early detection of signs of self-harm, while Task 2 focused on measuring the severity of the signs of depression. Our team participated in Task 1: detecting self-harm. The dataset for this subtask is a collection of chronologically ordered posts written by different users on Reddit. Each user is tagged as positive or negative, where positive users show signs of self-harm and negative users do not. The objective of this task was to evaluate the writings sequentially and give a prediction, as early as possible, of whether each user showed signs of self-harm or not.

The task was divided into two stages: (i) training stage: during this phase, a training set was given to prepare and tune each team's systems. The training data was composed of 2019's Task 2 (T2) training and testing data, and each user was labelled as either positive (self-harm) or negative (no self-harm). (ii) test stage: participants connected to a server to obtain the testing data and send their predictions. For each request to the server, an array of users with one writing each was obtained, and a prediction for each user had to be sent before being able to make a new request for new writings. Thus, participants had to create a system that interacted with the server and made predictions for every user, one writing at a time. The objective of the task was to detect positive users as early as possible. After the test stage, each team's participation was evaluated based on precision, recall, F1, and metrics developed specifically for this competition that penalize late decisions: Early Detection Error (ERDE) and latency-weighted F1. More information on these metrics can be found in [13].
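For quick reference, the following is a sketch of these metrics as we understand them from [13] and [14] (our paraphrase, not the official formulation; the cost parameters c_fp, c_fn, c_tp, the deadline o and the constant p are fixed by the organizers):

```latex
% d: binary decision for a user; k: number of writings seen before deciding.
% ERDE assigns a fixed cost to errors and a delay-dependent cost to
% correct positive decisions:
\mathrm{ERDE}_{o}(d,k) =
  \begin{cases}
    c_{fp}                & \text{false positive}\\
    c_{fn}                & \text{false negative}\\
    lc_{o}(k)\cdot c_{tp} & \text{true positive}\\
    0                     & \text{true negative}
  \end{cases}
\qquad
lc_{o}(k) = 1 - \frac{1}{1 + e^{\,k-o}}

% Latency-weighted F1 scales F1 by a speed factor computed from the
% median delay k_u over the true-positive users:
\mathit{penalty}(k) = -1 + \frac{2}{1 + e^{-p\,(k-1)}}, \quad
\mathit{speed} = 1 - \operatorname{median}\{\mathit{penalty}(k_u)\}, \quad
F_{1}^{\mathit{latency}} = F_{1}\cdot\mathit{speed}
```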
This paper presents our participation in the self-harm subtask, together with an exploratory analysis of the 2019 T2 dataset. The rest of the paper is organized as follows: Section 2 reviews the related literature; Section 3 details our proposed model for the task; Section 4 presents the exploratory analysis of the 2019 T2 dataset we performed prior to developing the system; Section 5 describes the experimental setup and submitted runs; Section 6 summarizes our official results for the task, plus some corrections; finally, Section 7 presents our conclusions and ideas for future research.

2 Related Work

Social media has been previously studied in relation to health [22][20]. Mental health, and depression in particular, is a common focus of works attempting to detect individuals who suffer from that illness [3][10][19][26]. Some work focuses on early prediction of mental illness symptoms [4][17], but such works are very few [9]. Studies on self-harm are also scarce. Most work has been done on studying the personalities and behavioral patterns of people who self-harm [1][18], showing common patterns of high negative affectivity, and how self-harm is a maladaptive coping strategy. Some effort has been devoted to studying self-harm behavior in social media in particular [2][6][16][21], but these works focus on studying posting patterns, behaviours, consequences, etc. Their findings show that people who self-harm have different posting patterns from mentally healthy users.

Some researchers have focused on identifying self-harm content on social media [27][25]. These works use a mixture of NLP methods, both supervised and unsupervised, and both traditional and deep learning based. Wang et al. [25] use a mixture of CNN-generated features and features derived from their findings on posting patterns: self-harm content has different language structures and more negative sentiment, its authors are more likely to have more interactions with other users but fewer online friends, and their posting hours are different, with self-harm content usually posted late at night.

Research on predicting future self-harm behavior or finding at-risk individuals is rare. While some efforts have used approaches such as apps and data from wearable devices [11][15], there is little research on predicting this behavior on social media. The eRisk shared task first introduced early risk detection of self-harm as a subtask in 2019, but no training data was given to develop the solutions. Most participants focused on developing their own training data instead of opting for unsupervised methods.

3 Proposed model

We propose a machine learning approach that uses text features to predict whether a message belongs to a positive or negative user. These features are fed to an SVM classifier. A decision module takes the classified messages and decides whether a user is positive or negative. The most challenging part of the eRisk task is the temporal complexity of the problem; the features are calculated taking this into account, and decisions are also made with it in mind. The model can be divided into three distinct stages: 1) pre-processing and feature calculation; 2) message classification, where the supervised part of the model takes place; and 3) user decision, where each user is categorized as positive (1) or negative (0).

3.1 Features window module

The window. One of the biggest challenges of the dataset for this task is that the golden truth is given per user, but each user has an arbitrary number of posts. It is naïve to assume that any and all messages will give us the same amount of relevant information about whether a user self-harms or not. For one, the user status is known because the user has self-reported (in the case of positive users) in a post. While it is unlikely that a person will falsely self-report self-harm, there is no information about when they started doing it, or when or if they stopped. Besides, some users that are classified in the golden truth as negative might self-harm but never have reported it.

Furthermore, this is a fundamentally temporal task. Each message is not created in isolation: there is a context to it. We are limited in the context information we have about each message, but we do know the date of each post, and therefore the order in which they were created. Finally, not all messages are equal in "information quality". These messages are posted on a social network, where writing conventions are loose. Some of them may be very short while others are very long in comparison, some of them might only be a media link, some of them might be a copied text not written by the user, and so on.

To take all those challenges into consideration and create a hopefully better system, each new message is not observed in a vacuum: its context, that is, its surrounding messages, is also taken into account. Since future messages are unknown, only the previous messages can be used. For this, we implemented a sliding window.
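A minimal sketch of the bookkeeping such a sliding window implies (illustrative names, not our exact implementation; `extract_features` stands for the feature computation described next):

```python
from collections import deque

class MessageWindow:
    """Keeps the last w messages of a user and exposes their combined text."""

    def __init__(self, w):
        # w is the configurable window size; w=None plays the role of "all":
        # the deque grows without bound and covers the whole user history.
        self.messages = deque(maxlen=w)

    def push(self, message):
        # When the deque is full, appending silently drops the oldest message.
        self.messages.append(message)

    def combined_text(self):
        # Features are computed over the concatenated window text.
        return " ".join(self.messages)

# Usage: one window per user; on each new writing,
# window.push(text) and then extract_features(window.combined_text()).
```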
For every new received message, our system calculated the features over the text of the current message combined with the previous messages inside a window of size w, where w is a configurable parameter. Depending on the size of this parameter, a longer or shorter user history is taken into account at each step: a size of 1 uses only the current message, while a size of "all" uses the whole user history.

Features. For each window of messages, a set of text features was calculated. These features were a mixture of textual and grammatical features (text length, number of words, etc.) and "special" features. Table 1 shows the list of features.

Table 1. Text features generated for the classification system.

| Feature                   | Description                                                |
|---------------------------|------------------------------------------------------------|
| Title length (combined)   | Length of the title and comment combined                   |
| Number of words in title  | Number of distinct words in the title                      |
| Title length              | Length of the title                                        |
| Number of words in text   | Number of words in the comment                             |
| Text length               | Length of the comment                                      |
| Punctuation               | Number of punctuation marks (',', '.')                     |
| Questions                 | Number of question marks                                   |
| Exclamations              | Number of exclamation marks                                |
| Happy faces               | Number of happy emoticons (':)', ':))', ':D', etc.)        |
| Sad faces                 | Number of sad emoticons (':(', ':((', 'D:', etc.)          |
| Special features:         |                                                            |
| Sentiment analysis        | Emotional score of the title and comment combined          |
| Pronouns                  | Number of first-person pronouns (I, me, mine, my, myself)  |
| NSSI words                | Number of words from the NSSI corpus [8]                   |

For the special features, previous work was done analyzing the 2019 dataset to check whether we could find differences between positive and negative users. It was observed that, in general, positive users did show significant differences from negative users, although the differences between individual messages were large. Section 4 shows the details of this analysis.

First-person pronouns: There is evidence suggesting that people who use more first-person pronouns are, on average, more depressed than people who use the third person [5][8][23]. There is also evidence linking depression and non-suicidal self-harm [8], so tracking this information should prove beneficial for our task. Besides, two sentences talking about self-harm are different depending on who the person is talking about: "I cut myself today" vs. "She is thinking about cutting herself". In the first case, the user shows clear signs of self-harm. In the second case, however, the user is seeking advice about a person they know, but shows no evidence about themselves. We can track this difference by counting first-person pronouns.

Sentiment analysis: As mentioned previously, people who self-harm are expected to show more negative emotions [8]. Tracking sentiment to follow the users' moods therefore makes sense in this context. We focused only on positive versus negative sentiment. This feature gives the sentiment of the window as a numeric score normalized by the length of the texts: a negative score indicates a negative sentiment, while a positive score indicates a positive one.

NSSI words: Finally, some people who self-harm will surely talk about it. There is a subreddit dedicated to self-harm, where users talk about their disorder and support each other. We can suppose that at least some of the users in our dataset will use this subreddit, or will talk about their problem somewhere else. It therefore proves useful to track the usage of the most common words related to self-harm. This feature is linked to the first-person pronoun feature: by tracking not only self-harm words but also who is the subject of those sentences, we know whether the user is more likely to be self-harming or to be talking about somebody else. A list of words related to self-harm (NSSI words from now onwards) was obtained from [8]. This feature gives the number of words from this list that appear in the window, normalized by the length of the texts.
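A simplified sketch of how the three special features can be computed with NLTK (illustrative names; the NSSI set below is a tiny placeholder for the full list from [8], and the normalization by text length follows the description above):

```python
# requires: nltk.download("punkt"); nltk.download("vader_lexicon")
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

FIRST_PERSON = {"i", "me", "mine", "my", "myself"}
NSSI_WORDS = {"cut", "burn", "scratch", "razor"}  # placeholder for the list from [8]

sia = SentimentIntensityAnalyzer()

def special_features(window_text):
    """First-person pronoun count, NSSI word count and sentiment score,
    each normalized by the length of the window text."""
    tokens = [t.lower() for t in word_tokenize(window_text)]
    length = max(len(window_text), 1)  # avoid division by zero on empty windows
    pronouns = sum(1 for t in tokens if t in FIRST_PERSON)
    nssi = sum(1 for t in tokens if t in NSSI_WORDS)
    # VADER's compound score lies in [-1, 1]; negative values mean
    # negative sentiment, positive values mean positive sentiment.
    sentiment = sia.polarity_scores(window_text)["compound"]
    return {
        "pronouns": pronouns / length,
        "nssi_words": nssi / length,
        "sentiment": sentiment / length,
    }
```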
3.2 Message classification module

The features calculated from the window messages are fed to a previously trained SVM classifier. This classifier predicts whether the features belong to a message generated by a positive or a negative user.

3.3 User decision module

In the final step, the outputs from the previous module are fed to the decision module. For every new message we receive, we have to classify each user as "positive" or "negative". A positive decision is final, but a negative one may be revised later. Besides, the task rewards quick decisions, so the earlier we make a positive decision, the better. Following the same reasoning as with the features window module, however, one positive decision should not be enough to classify a user as positive. We must implement a decision policy.

The decision policy works as follows: for every new message, after receiving the output (positive or negative) for the window, the previous n outputs for that user are observed, where n is a configurable parameter. If they are all positive, the user is classified as positive in this iteration. If not, the user is classified as negative.
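A minimal sketch of this decision policy (illustrative names; n is the decision window size listed later in Table 8):

```python
from collections import defaultdict, deque

class DecisionPolicy:
    """Declares a user positive only after n consecutive positive outputs."""

    def __init__(self, n):
        self.n = n
        # Last n classifier outputs per user.
        self.history = defaultdict(lambda: deque(maxlen=n))
        # Positive decisions are final, so remember them.
        self.decided_positive = set()

    def decide(self, user, classifier_output):
        # classifier_output is 1 (positive) or 0 (negative) for the
        # current window of this user's messages.
        if user in self.decided_positive:
            return 1
        self.history[user].append(classifier_output)
        outputs = self.history[user]
        if len(outputs) == self.n and all(outputs):
            self.decided_positive.add(user)
            return 1
        return 0  # negative decisions may be revised in later iterations
```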
4 Data Analysis

Before starting the development of the model, we did an exploratory analysis of the 2019 dataset used in the previous year's eRisk task. The results of our findings are presented in this section. During model development, the data was divided into train and test sets, and the analysis was performed only on the train data. However, we recalculated the analysis with all the 2019 data after the 2020 task was over, for the purposes of these working notes. The categories of the analysis follow the same division as the features explained in Section 3.

Table 2 shows how many positive and negative users exist in the dataset. As stated before, the data is highly skewed towards negative users. All analytical results have to be interpreted with this information in mind, since there are about seven times as many negative users as positive ones.

Table 2. Number of positive and negative users.

| Users    | Total |
|----------|-------|
| Positive | 41    |
| Negative | 299   |
| Total    | 340   |

Table 3 shows how the amount of positive and negative users affects the number of posts that can be found in the dataset. Since there are more negative users, it is not surprising that there is more diversity in the posts from this kind of user. There is a difference of 989 posts between the minimum and maximum for positive users, while the difference is 1982 for negative users. Information about the total and average length of posts is also given in Table 3. Although the total length is greater for negative users, the mean shows that posts made by positive users are longer on average, with the median value also being higher. The longest post belongs to a negative user, however. In addition, Table 3 shows the total and average number of words used per post. In this table we can see that positive users use, on average, more words per post than negative users.

Table 3. Number of posts, post length and number of words per post for positive and negative users.

|                          | Users    | Total    | Mean    | Deviation | Min | Max   | Median |
|--------------------------|----------|----------|---------|-----------|-----|-------|--------|
| Number of posts          | Positive | 6927     | 168.951 | 260.282   | 8   | 997   | 50     |
|                          | Negative | 163506   | 546.843 | 544.145   | 10  | 1992  | 340    |
| Post length              | Positive | 1290174  | 186.253 | 334.899   | 1   | 5880  | 81     |
|                          | Negative | 23605461 | 144.371 | 394.172   | 1   | 37555 | 59     |
| Number of words per post | Positive | 1543682  | 111.425 | 252.631   | 0   | 5880  | 36     |
|                          | Negative | 27929951 | 85.410  | 289.123   | 0   | 37555 | 26     |

Following the data analysis of this section, we decided to explore the use of first-, second- and third-person pronouns and how they differed between positive and negative users. Table 4 shows our findings. These values are normalized by post length. It can be seen that, on average, positive users use more pronouns per post, and the greatest difference can be seen in first-person pronouns.

Table 4. Use of first, second and third-person pronouns per post, normalized by post length.

|               | Users    | Mean      | Deviation | Median    |
|---------------|----------|-----------|-----------|-----------|
| First-person  | Positive | 1.036E-02 | 1.490E-02 | 4.329E-03 |
|               | Negative | 6.486E-03 | 2.299E-02 | 0         |
| Second-person | Positive | 4.211E-03 | 1.074E-02 | 0         |
|               | Negative | 3.169E-03 | 9.594E-03 | 0         |
| Third-person  | Positive | 3.208E-03 | 7.679E-03 | 0         |
|               | Negative | 2.295E-03 | 6.908E-03 | 0         |

The same analysis was performed for the use of NSSI words. Table 5 shows the statistics for the use of NSSI words; these values are also normalized by post length. A notable difference is observed once again between positive and negative users, with positive users using more NSSI words on average. Figures 1 and 2 show the frequency distribution of the NSSI words for positive and negative users, respectively. Table 6 shows the same statistics with the NSSI words divided into their different categories. Finally, Table 7 shows the differences found when applying sentiment analysis to positive and negative users. The values are normalized by post length, and a greater value corresponds to a more positive sentiment. Unfortunately, there are no observable differences between them.

Table 5. Use of NSSI words by positive and negative users.

| Users    | Total | Mean      | Deviation | Median | Min | Max  |
|----------|-------|-----------|-----------|--------|-----|------|
| Positive | 2223  | 1.423E-03 | 5.774E-03 | 0      | 0   | .143 |
| Negative | 22450 | 9.979E-04 | 6.367E-03 | 0      | 0   | .333 |

Fig. 1. Positive users. Frequency distribution of the NSSI words. [Bar chart of counts per NSSI word; the most frequent words for positive users are "bad", "feel", "help", "understand" and "cut".]

Fig. 2. Negative users. Frequency distribution of the NSSI words. [Bar chart of counts per NSSI word; the most frequent words for negative users are "bad", "feel", "help", "tear" and "understand".]

Table 6. Use of NSSI words divided by categories per post, normalized by post length.

|                        | Users    | Mean      | Deviation |
|------------------------|----------|-----------|-----------|
| Methods of NSSI        | Positive | 3.981E-04 | 2.880E-03 |
|                        | Negative | 3.991E-04 | 4.564E-03 |
| Cutting-specific terms | Positive | 6.347E-05 | 1.185E-03 |
|                        | Negative | 4.093E-05 | 1.122E-03 |
| NSSI Terms             | Positive | 1.822E-07 | 1.101E-05 |
|                        | Negative | 3.615E-08 | 1.029E-05 |
| Instruments used       | Positive | 2.275E-05 | 1.186E-03 |
|                        | Negative | 1.267E-05 | 6.094E-04 |
| Reasons for NSSI       | Positive | 1.002E-03 | 4.888E-03 |
|                        | Negative | 5.861E-04 | 4.391E-03 |
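The normalized per-post statistics reported in Tables 4-7 can be computed along these lines (a sketch with illustrative names; `count_feature` stands for any raw per-post count, such as first-person pronouns or NSSI words):

```python
import statistics

def normalized_stats(posts, count_feature):
    """Summarize a per-post count normalized by post length (cf. Tables 4-7)."""
    values = [count_feature(post) / max(len(post), 1) for post in posts]
    return {
        "mean": statistics.mean(values),
        "deviation": statistics.stdev(values),  # needs at least two posts
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

# Example: rate of first-person pronouns per post for one group of users.
FIRST_PERSON = {"i", "me", "mine", "my", "myself"}

def count_first_person(post):
    return sum(word.lower() in FIRST_PERSON for word in post.split())

stats = normalized_stats(["I cut myself today", "She seems fine to me"],
                         count_first_person)
```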
Table 7. Sentiment analysis score per post, normalized by post length.

| Users    | Total     | Mean      | Deviation | Min   | Max  | Median |
|----------|-----------|-----------|-----------|-------|------|--------|
| Positive | 920.074   | 2.693E-03 | 1.525E-02 | -.148 | .229 | 0      |
| Negative | 18269.027 | 2.338E-03 | 1.531E-02 | -.345 | .293 | 0      |

5 Experimental Setup

This section presents the experiments conducted for the official eRisk 2020 task using the model proposed in Section 3.

5.1 Model implementation

The SVM classification model was implemented using a combination of NLTK (https://www.nltk.org/) and Scikit-learn (https://scikit-learn.org/). More specifically, Scikit-learn's LinearSVC implementation of linear support vector classification was used. The amount of positive and negative users available for training was highly unbalanced in favour of the negative users, so the "balanced" class weight option was used during training. Other parameters were left at their default values. NLTK was used for data cleanup and text pre-processing (tokenizing and stemming). Sentiment analysis was also performed with NLTK's Sentiment Intensity Analyzer.

Training and testing. The SVM classifier was trained with data from the eRisk 2019 task. During model evaluation, this data was divided into training and testing sets; for the current task evaluation, a new classifier was trained with the whole 2019 data collection.
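A condensed sketch of this training configuration (the dummy arrays stand in for the real data: one row of window features from Section 3.1 per training message, with the gold label of the user who wrote it):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in data: 13 features per window, as listed in Table 1.
X_train = np.random.rand(100, 13)
y_train = np.random.randint(0, 2, size=100)  # 1 = self-harm, 0 = control

# class_weight="balanced" reweights classes inversely to their frequency,
# compensating for the heavy skew towards negative users; every other
# parameter is left at its scikit-learn default, as described above.
clf = LinearSVC(class_weight="balanced")
clf.fit(X_train, y_train)

# At test time, each incoming window is mapped to the same feature vector
# and classified; the output feeds the decision module of Section 3.3.
prediction = clf.predict(X_train[:1])[0]
```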
5.2 Submitted runs

Our team participated with five different runs. We were interested in observing the differences in performance when combining three factors: 1) the window size during training, 2) the window size during testing, and 3) the decision window size during testing (the number of consecutive positive messages required before declaring a user as positive). Every run used a different combination of these factors. Table 8 shows the configuration of each run.

Table 8. Configuration of the runs. *"All" denotes that the size of the window is the total number of posts for each user.

| Run id | Training window | Testing message window | Decision window |
|--------|-----------------|------------------------|-----------------|
| 0      | 1               | 10                     | 5               |
| 1      | 1               | 10                     | 3               |
| 2      | 1               | 20                     | 3               |
| 3      | All*            | 10                     | 3               |
| 4      | All*            | 20                     | 3               |

6 Results and Discussion

This section shows the official results for the task, plus some additional tests performed independently by our team. The overview of the official results of all teams can be found in [14].

During the evaluation stage of the task, teams had to iteratively request data from a server and send their predictions, one message per user at a time. After implementing our model, a program was written that automatically connected to this server and performed the model calculations. This program was launched on May 30th and left running for 24 hours. Some problems were encountered during the evaluation stage: the program halted for 12 hours and had to be relaunched, which caused the number of processed messages to be smaller than expected. Furthermore, a bug in the code caused an issue with the differentiation of the five distinct runs. After the official results were published and our implementation error was fixed, we reran the predictions in order to show more realistic results in these working notes.

Tables 9, 10 and 12 show the official results our team received from the task organizers. Results from other teams were added for comparison purposes. Table 9 shows the time span and number of messages processed. We include information from the fastest and slowest teams, plus the one that achieved the best results in the official metrics. Our team, which took 1 day to process 554 messages, is amongst the faster teams, especially considering that 12 hours were lost.

Table 9. User writings processed and time elapsed.

| Team       | Runs | User writings processed | Time elapsed        |
|------------|------|-------------------------|---------------------|
| NLP-UNED   | 5    | 554                     | 1 day               |
| SSN NLP    | 5    | 222                     | 3 hours             |
| hildesheim | 5    | 522                     | 72 days + 22 hours  |
| iLab       | 5    | 954                     | 20 hours            |

Table 10 shows the official evaluation metrics for the binary decision task, plus our own calculations for the results of our fixed system. The results of the runs that achieved the best value for each metric are also added, and it is interesting to note that all of them belong to the same team. Table 11 shows additional information about the number of users that were classified as positive or negative by our fixed system.

Table 10. Official eRisk 2020 T1 results compared with the recalculated results and the best results for each metric. All teams' results are available in [14].

T1 official results:

| Team name | Run id | P    | R    | F1   | ERDE_5 | ERDE_50 | latency_tp | speed | latency-weighted F1 |
|-----------|--------|------|------|------|--------|---------|------------|-------|---------------------|
| NLP-UNED  | 0      | .237 | .913 | .376 | .423   | .199    | 11         | .961  | .362                |
| NLP-UNED  | 1      | .246 | 1    | .395 | .210   | .185    | 1          | 1     | .395                |
| NLP-UNED  | 2      | .246 | 1    | .395 | .210   | .185    | 1          | 1     | .395                |
| NLP-UNED  | 3      | .246 | 1    | .395 | .210   | .185    | 1          | 1     | .395                |
| NLP-UNED  | 4      | .246 | 1    | .395 | .210   | .185    | 1          | 1     | .395                |

T1 fixed results:

| Team name | Run id | P    | R    | F1   | ERDE_5 | ERDE_50 | latency_tp | speed | latency-weighted F1 |
|-----------|--------|------|------|------|--------|---------|------------|-------|---------------------|
| NLP-UNED  | 0      | .234 | .875 | .369 | .332   | .204    | 5          | .984  | .363                |
| NLP-UNED  | 1      | .237 | .942 | .379 | .255   | .197    | 3          | .992  | .376                |
| NLP-UNED  | 2      | .238 | .942 | .380 | .255   | .197    | 3          | .992  | .377                |
| NLP-UNED  | 3      | .246 | 1    | .395 | .213   | .185    | 3          | .992  | .392                |
| NLP-UNED  | 4      | .246 | 1    | .395 | .213   | .185    | 3          | .992  | .392                |

T1 best results:

| Team name | Run id | P    | R    | F1   | ERDE_5 | ERDE_50 | latency_tp | speed | latency-weighted F1 |
|-----------|--------|------|------|------|--------|---------|------------|-------|---------------------|
| iLab      | 0      | .833 | .577 | .682 | .252   | .111    | 10         | .965  | .658                |
| iLab      | 1      | .913 | .404 | .560 | .248   | .149    | 10         | .965  | .540                |
| iLab      | 2      | .544 | .654 | .594 | .134   | .118    | 2          | .996  | .592                |
| iLab      | 3      | .564 | .885 | .689 | .287   | .071    | 45         | .830  | .572                |
| iLab      | 4      | .828 | .692 | .754 | .255   | .255    | 100        | .632  | .476                |

Table 11. Fixed results: number of positives and negatives, and confusion matrix.

| Run id | # positives | # negatives | # true positives | # true negatives | # false positives | # false negatives |
|--------|-------------|-------------|------------------|------------------|-------------------|-------------------|
| 0      | 389         | 34          | 91               | 21               | 298               | 13                |
| 1      | 413         | 10          | 98               | 4                | 315               | 6                 |
| 2      | 412         | 11          | 98               | 5                | 314               | 6                 |
| 3      | 423         | 0           | 104              | 0                | 319               | 0                 |
| 4      | 423         | 0           | 104              | 0                | 319               | 0                 |

Participating teams were also required to send, for each iteration, scores that represented the estimated risk of each user. Table 12 shows the official results for our team, next to the best results. Standard IR metrics were calculated after processing 1 message, 100 messages, 500 messages and 1000 messages. Our team only processed 554 messages, so the 1000-message metrics are not given.

Table 12. Official ranking results next to the best results for comparison. Each cell lists P@10 / NDCG@10 / NDCG@100.

| Team     | Run | 1 writing      | 100 writings | 500 writings |
|----------|-----|----------------|--------------|--------------|
| NLP-UNED | 0   | .7 / .69 / .49 | .6 / .73 / .26 | .6 / .73 / .24 |
| NLP-UNED | 1   | .6 / .62 / .27 | .2 / .27 / .18 | .2 / .27 / .16 |
| NLP-UNED | 2   | .6 / .62 / .27 | .2 / .27 / .18 | .2 / .27 / .16 |
| NLP-UNED | 3   | .6 / .62 / .27 | .2 / .27 / .18 | .2 / .27 / .16 |
| NLP-UNED | 4   | .6 / .62 / .27 | .2 / .27 / .18 | .2 / .27 / .16 |
| iLab     | 3   | .9 / .94 / .66 | 1 / 1 / .83    | 1 / 1 / .84    |

The testing window appears to have little effect on the result metrics. This could be because the difference between the tested window sizes (10 and 20) was too small. The decision window size affected the latency, which can be seen more clearly in the fixed results: the run with decision window size 5 had a latency of 5, while the runs with decision window size 3 had a latency of 3. The biggest difference was found for the training window size. Runs 3 and 4, trained with window size "All", obtained better results in the evaluation metrics, but they also classified every user as positive. Runs 0, 1 and 2, which were trained with window size 1, classified 10 or more users as negative. While our system was a simple approach, it achieved modest results.
Latency-weighted F1 is an interpretable metric that estimates the "goodness" of the solution, and our team scored, on average, more than half of what the winning team achieved. This shows that even a simple, feature-driven approach can tackle what looks like a very complex problem with promising results. Furthermore, in this kind of problem, recall is a more important metric than precision. This is because, ideally, this system would be used as a tool to raise an alarm about users, while an expert would review each case individually. For this reason, it is very important to detect each and every one of the true positive cases. Table 10 shows that, while our precision score is low, our recall score is very high. Table 11 shows that this is partially because some runs categorize all users as positive, but we believe some tuning of the decision window size would mitigate this problem. Speed and latency are important metrics in early risk detection, and our system achieved high scores for both of them. It is also important to note that our decisions are fast and not very heavy on computing resources: past messages are not iterated over more than x times, x being the size of the window, so the model can keep running indefinitely with no extra cost.

7 Conclusions and Future Work

In this paper we present the NLP-UNED participation in the eRisk 2020 T1 task. We perform a data analysis of the 2019 T2 self-harm data and use our findings to construct features for a system that makes early predictions of signs of self-harm in users extracted from Reddit data. Our analysis shows that subjects who self-harm, on average, write longer posts, use more first-person pronouns, and mention more words related to NSSI. The official eRisk results show that our system, while simple, manages to achieve modest but fast results; more work is needed to obtain state-of-the-art results. We are interested in finding out whether fine-tuning the window sizes used in our system could significantly improve results. Implementing the same window policy during the training phase as during the testing phase could also yield better results. Finally, there is evidence that self-harm subjects have different posting patterns than people who do not self-harm, so we are interested in exploring the temporal differences in the dataset and creating more features.

Acknowledgments

This work has been partially supported by the Spanish Ministry of Science and Innovation within the projects PROSA-MED (TIN2016-77820-C3-2-R), DOTT-HEALTH (PID2019-106942RB-C32), and EXTRAE II (IMIENS 2019).

References

1. Baetens, I., Claes, L., Muehlenkamp, J., Grietens, H., Onghena, P.: Non-Suicidal and Suicidal Self-Injurious Behavior among Flemish Adolescents: A Web-Survey. Archives of Suicide Research 15(1), 56–67 (2011), https://doi.org/10.1080/13811118.2011.540467
2. Cavazos-Rehg, P.A., Krauss, M.J., Sowles, S.J., Connolly, S., Rosas, C., Bharadwaj, M., Grucza, R., Bierut, L.J.: An analysis of depression, self-harm, and suicidal ideation content on Tumblr. Crisis 38(1), 44–52 (2017), https://psycnet.apa.org/record/2016-36501-001
3. Conway, M., O'Connor, D.: Social media, big data, and mental health: Current advances and ethical implications. Current Opinion in Psychology 9, 77–82 (2016), http://www.sciencedirect.com/science/article/pii/S2352250X16000063
4. De Choudhury, M., Counts, S., Horvitz, E.: Social Media as a Measurement Tool of Depression in Populations. In: Proceedings of the 5th Annual ACM Web Science Conference, pp. 47–56. WebSci '13, Association for Computing Machinery, New York, NY, USA (2013)
5. Edwards, T., Holtzman, N.S.: A meta-analysis of correlations between depression and first person singular pronoun use. Journal of Research in Personality 68, 63–68 (2017), http://dx.doi.org/10.1016/j.jrp.2017.02.005
6. Emma Hilton, C.: Unveiling self-harm behaviour: what can social media site Twitter tell us about self-harm? A qualitative exploration. Journal of Clinical Nursing 26(11-12), 1690–1704 (2017), https://onlinelibrary.wiley.com/doi/abs/10.1111/jocn.13575
7. Gluck, S.: Self-injurers and their common personality traits, https://www.healthyplace.com/abuse/self-injury/self-injurers-and-their-common-personality-traits
8. Greaves, M.M.: A Corpus Linguistic Analysis of Public Reddit and Tumblr Blog Posts on Non-Suicidal Self-Injury. Ph.D. thesis, Oregon State University (2018), https://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/mp48sk29z
9. Guntuku, S.C., Yaden, D.B., Kern, M.L., Ungar, L.H., Eichstaedt, J.C.: Detecting depression and mental illness on social media: an integrative review. Current Opinion in Behavioral Sciences 18, 43–49 (2017), http://dx.doi.org/10.1016/j.cobeha.2017.07.005
10. Karmen, C., Hsiung, R.C., Wetter, T.: Screening internet forum participants for depression symptoms by assembling and enhancing multiple NLP methods. Computer Methods and Programs in Biomedicine 120(1), 27–36 (2015), http://dx.doi.org/10.1016/j.cmpb.2015.03.008
11. Lederer, N., Grechenig, T., Baranyi, R.: UnCUT: Bridging the gap from paper diary cards towards mobile electronic monitoring solutions in borderline and self-injury. In: SeGAH 2014 - IEEE 3rd International Conference on Serious Games and Applications for Health, Books of Proceedings. Institute of Electrical and Electronics Engineers Inc. (2014)
12. Lewis, S., Santor, D.: Self-harm reasons, goal achievement, and prediction of future self-harm intent. The Journal of Nervous and Mental Disease 198(5), 362–369 (2010), http://journals.lww.com/00005053-201005000-00008
13. Losada, D.E., Crestani, F., Parapar, J.: Early detection of risks on the internet: an exploratory campaign. In: 41st European Conference on Information Retrieval, pp. 259–266. Springer (2019), http://dx.doi.org/10.1007/978-3-030-15719-7_35
14. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2020: Early Risk Prediction on the Internet. In: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). Springer International Publishing (2020)
15. Malott, L., Bharti, P., Hilbert, N., Gopalakrishna, G., Chellappan, S.: Detecting self-harming activities with wearable devices. In: 2015 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops), pp. 597–602 (2015)
16. Moreno, M.A., Ton, A., Selkie, E., Evans, Y.: Secret Society 123: Understanding the Language of Self-Harm on Instagram. Journal of Adolescent Health 58(1), 78–84 (2016)
17. Nadeem, M.: Identifying depression on Twitter (2016), http://arxiv.org/abs/1607.07384
18. Nicolai, K.A., Wielgus, M.D., Mezulis, A.: Identifying Risk for Self-Harm: Rumination and Negative Affectivity in the Prospective Prediction of Nonsuicidal Self-Injury. Suicide and Life-Threatening Behavior 46(2), 223–233 (2016)
19. O'Dea, B., Wan, S., Batterham, P.J., Calear, A.L., Paris, C., Christensen, H.: Detecting suicidality on Twitter. Internet Interventions 2(2), 183–188 (2015), http://dx.doi.org/10.1016/j.invent.2015.03.005
20. Park, M., McDonald, D.W., Cha, M.: Perception differences between the depressed and non-depressed users in Twitter. In: Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013, pp. 476–485 (2013)
21. Patchin, J.W., Hinduja, S.: Digital Self-Harm Among Adolescents. Journal of Adolescent Health 61(6), 761–766 (2017)
22. Paul, M.J., Dredze, M.: You Are What You Tweet: Analyzing Twitter for Public Health. In: International AAAI Conference on Weblogs and Social Media (ICWSM) (2011)
23. Pennebaker, J.W.: The Secret Life of Pronouns: What Our Words Say About Us. Bloomsbury Press (2011)
24. Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological Aspects of Natural Language Use: Our Words, Our Selves. Annual Review of Psychology 54(1), 547–577 (2003)
25. Wang, Y., Tang, J., Li, J., Li, B., Wan, Y., Mellina, C., O'Hare, N., Chang, Y.: Understanding and discovering deliberate self-harm content in social media. In: 26th International World Wide Web Conference, WWW 2017, pp. 93–102 (2017)
26. Yang, W., Mu, L.: GIS analysis of depression among Twitter users. Applied Geography 60, 217–223 (2015), http://dx.doi.org/10.1016/j.apgeog.2014.10.016
27. Yates, A., Cohan, A., Goharian, N.: Depression and self-harm risk assessment in online forums. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2968–2978. Association for Computational Linguistics (2017), https://www.aclweb.org/anthology/D17-1322