<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-030-15719-7</article-id>
      <title-group>
        <article-title>NLP-UNED at eRisk 2020: self-harm early risk detection with sentiment analysis and linguistic features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>IR Group</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dpto. Lenguajes y Sistemas Informaticos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Mixto de Investigacion - Escuela Nacional de Sanidad</institution>
          ,
          <addr-line>IMIENS</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Nacional de Educacion a Distancia</institution>
          ,
          <addr-line>UNED</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
<year>2020</year>
      </pub-date>
      <abstract>
<p>Mental health problems such as depression are conditions that, going undetected, can have serious consequences. A less-known mental health problem that has been linked to depression is self-harm. There is evidence suggesting that people's writings can reflect these problems, and research has been done to detect these individuals through their content on social media. Early detection is crucial for mental health problems, and for this purpose a shared task named eRisk was proposed. This paper describes NLP-UNED's participation in the 2020 T1 subtask. Participants were asked to create systems that detected early self-harm signs in Reddit users. Our team shows a data analysis of the 2019 T2 subtask and proposes a simple feature-driven classifier with features based on first-person pronoun use, sentiment analysis and self-harm terminology.</p>
      </abstract>
      <kwd-group>
        <kwd>Early Risk Detection</kwd>
        <kwd>Self-Harm detection</kwd>
        <kwd>Analysis</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Sentiment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        Mental health problems, such as depression, are conditions that affect more
people every day. These conditions may go undetected for many years, leaving
the people who suffer from them without adequate medical assistance. Untreated
mental health issues can lead to serious consequences, such as addictions or even
suicide. Self-harm, also known as Non-Suicidal Self-Injury (NSSI from now on),
is a lesser-known mental health problem that affects primarily young
people [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Self-harm refers to the act of causing bodily harm to oneself with no
suicidal intent, such as cutting, burning, or hair pulling, and it has been linked
to underlying mental health problems such as depression and anxiety [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. It is a
maladaptive form of coping [12] that causes pain and distress to the self-harmer
and can lead to unintentional suicide. It is important to dedicate efforts to
better detecting mental health problems in society so that affected people can
receive the help they need.
      </p>
<p>
        It has been shown that people who suffer from mental health problems show
differences in the way they communicate with other people and in the way they
write [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [24]. Natural Language Processing (NLP) can be used to analyze these
people's writings and detect underlying mental health problems. Social media use
has been on the rise in the past decades, and the sheer volume of information
available on these platforms can be exploited for this purpose. Recent research has
applied NLP techniques to develop systems that automatically detect users with
potential mental health issues.
      </p>
      <p>Early detection is key in the treatment of mental health problems, since a fast
intervention improves the chances of a good prognosis. The longer a mental
health problem goes undetected, the more likely serious consequences are to derive from
it. Most of the efforts in the literature focus on detection, but not on early
detection. Early detection would allow a faster diagnosis, which would help
mental health specialists intervene sooner.</p>
      <p>In light of this problem, the shared task eRisk was created. This task
focuses on early detection of several mental health problems, such as depression,
anorexia, and self-harm, using temporal data extracted from Reddit. The 2020 eRisk
task [14] proposed two different subtasks: Task 1 focused on early detection of
signs of self-harm, while Task 2 focused on measuring the severity of the signs of
depression. Our team participated in Task 1: detecting self-harm. The dataset
for this subtask is a collection of chronologically ordered posts written by different
users on Reddit. Each user is tagged as positive or negative, where positive users
show signs of self-harm and negative users do not. The objective of this task
was to evaluate the writings sequentially and give a prediction of whether a user
showed signs of self-harm as fast as possible.</p>
      <p>The task was divided into two stages: (i) training stage: during this phase, a
training set was given to prepare and tune each team's systems. The training
data was composed of 2019's Task 2 (T2) training and testing data, and each
user was labelled as either positive (self-harm) or negative (no self-harm). (ii)
test stage: participants connected to a server to obtain the testing data and send
their predictions. For each request to the server, an array of users with one writing
each was obtained, and a prediction for each user had to be sent before being
able to make a new request for new writings. Thus, participants had to create
a system that interacted with the server and made predictions for every user,
one writing at a time. The objective of the task was to detect positive users as
early as possible. After the test stage, each team's participation was evaluated
based on precision, recall, F1, and metrics developed for this
competition that penalize late decisions: Early Detection Error (ERDE) and
latency-weighted F1. More information on these metrics can be found in [13].</p>
      <p>This paper presents our participation in the self-harm subtask. We present
an exploratory analysis of the 2019 T2 dataset. The rest of the paper is organized
as follows: Section 2 reviews the related literature; Section 3 details
our proposed model for the task; Section 4 presents an exploratory analysis of
the 2019 T2 dataset performed prior to developing the system; Section 6
summarizes our official results for the task, plus some corrections; finally, Section
7 presents our conclusions and ideas for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
<p>
        Social media has been previously studied in relation to health [22] [20]. Mental
health, and depression in particular, is a common focus of works attempting to
detect individuals who suffer from that illness [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [19] [26]. Some work focuses
on early prediction of mental illness symptoms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [17], but such studies are very
scarce [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
<p>
        Studies performed on self-harm are also scarce. Most work has been done on
studying the personalities and behavioral patterns of people who self-harm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
[18], showing common patterns of high negative affectivity and how self-harm is a
maladaptive coping strategy. Some effort has gone into studying self-harm
behavior on social media in particular [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [16] [21], but these works focus on studying
posting patterns, behaviours, consequences, etc. Their findings show that people
who self-harm have different posting patterns than mentally healthy users.
      </p>
      <p>Some researchers have focused on identifying self-harm content on social media [27]
[25]. These works use a mixture of NLP methods, both supervised and unsupervised,
with traditional as well as deep learning techniques. Wang et al. [25] use a mixture
of CNN-generated features and features derived from their findings on posting
patterns: the language of self-harmers has different structures and more negative
sentiment; they tend to have more interactions with other users but fewer online
friends; and their posting hours differ, with self-harm content usually posted late
at night.</p>
<p>
        Research on predicting future self-harm behavior or finding at-risk
individuals is rare. While some efforts have used methods such as
apps and data from wearable devices [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] [15], there is little research on
predicting this behavior on social media. The eRisk shared task first introduced
early risk detection of self-harm in 2019 as a subtask, but no training data was given to
develop the solutions. Most participants focused on developing their own training
data instead of opting for unsupervised methods.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Proposed model</title>
      <p>We propose a machine learning approach that uses text features to predict
whether a message belongs to a positive or negative user. These features are
fed to an SVM classifier. A decision module takes the classified messages and
decides whether a user is positive or negative.</p>
      <p>The most challenging part of the eRisk task is the temporal complexity of
the problem. The features are calculated taking this into account, and decisions
are also made with that in mind.</p>
      <p>The model can be divided into three distinct stages: 1) pre-processing and
feature calculation; 2) message classification, where the supervised part of the
model takes place; and 3) user decision, where each user is categorized as positive
(1) or negative (0).</p>
      <sec id="sec-4-1">
        <title>Features and window module</title>
        <p>The window: One of the biggest challenges of the dataset for this task is that
the golden truth is given for users, but each user has an arbitrary number of
posts. It is naive to assume that any and all messages will give us the same
amount of relevant information about whether a user self-harms or not. For
one, the user status is known because the user self-reported (in the case
of positive users) in a post. While it is unlikely a person will falsely self-report
self-harm, there is no information about when they started doing it, or when
or whether they stopped. Besides, some users classified in the golden truth as
negative might self-harm but have never reported it.</p>
        <p>Furthermore, this is a fundamentally temporal task. Each message is not
created in isolation: there is a context to them. We are limited in the context
information we have about each message, but we do know the date of each post,
and therefore the order in which they were created.</p>
        <p>Finally, not all messages are equal in "information quality". These messages
are posted on a social network, where writing conventions are loose. Some of
them may be very short while others are very long in comparison; some
might only be a media link; some might be copied text not written by
the user; and so on.</p>
        <p>To take all these challenges into consideration and create a hopefully better
system, each new message is not observed in a vacuum. Its context, that is,
its surrounding messages, is also taken into account. Since future messages
are unknown, only the previous messages can be used.</p>
        <p>For this, we implemented a sliding window. For every new received message,
our system calculates the features of the text combined from the current message
and the previous w messages, where w is a configurable parameter. Depending
on the size of this parameter, a longer or shorter user history is taken
into account at each step: a size of 1 only uses the current message, while a size
of "all" uses the whole user history.</p>
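<p>The sliding window above can be sketched as follows; the function name and the message bookkeeping are illustrative, not the paper's exact implementation:</p>

```python
from collections import deque

def window_text(history, new_message, w):
    """Append the new message and return the combined text of the last w messages.

    history: a deque of the user's earlier messages, oldest first.
    w is the configurable window size: w=1 uses only the current message.
    """
    history.append(new_message)
    return " ".join(list(history)[-w:])

# One window is built per incoming message, mirroring the server loop.
history = deque()
print(window_text(history, "first post", 2))   # first post
print(window_text(history, "second post", 2))  # first post second post
print(window_text(history, "third post", 2))   # second post third post
```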
        <p>Features: For each window of messages, a set of text features is calculated.
These features are a mixture of textual and grammatical features (text length,
number of words, etc.) and "special" features. Table 1 shows the full list of features.</p>
        <p>For these special features, previous work was done analyzing the 2019
dataset to check whether we could find differences between the positive and negative
users. It was observed that, in general, positive users did show significant
differences from negative users, although the variation between single messages was
large. Section 4 shows details of this analysis.</p>
        <p>
          [Table 1, special features:] emotional score of the title and comment combined;
number of first-person pronouns (I, me, mine, my, myself); number of words
from the NSSI corpus [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
<p>
          First-person pronouns: There is evidence suggesting that people who use
more first-person pronouns on average are more depressed than people who use
the third person [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] [23]. There is also evidence linking depression and
non-suicidal self-harm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], so tracking this information should prove beneficial for our
task. Besides, two sentences talking about self-harm differ depending on
whom the person is talking about: "I cut myself today" vs. "She is thinking about
cutting herself". In the first case, the user shows clear signs of self-harm. In
the second case, however, the user is seeking advice about a person they know,
but shows no evidence about themselves. We can track this difference by
counting first-person pronouns.
        </p>
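<p>A minimal sketch of this counting feature; the regex-based tokenization is an assumption, since the paper does not detail it:</p>

```python
import re

# The five first-person singular forms named above.
FIRST_PERSON = {"i", "me", "mine", "my", "myself"}

def first_person_count(text):
    """Count first-person singular pronouns in a window of text."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude word tokenizer
    return sum(1 for t in tokens if t in FIRST_PERSON)

print(first_person_count("I cut myself today"))                     # 2
print(first_person_count("She is thinking about cutting herself"))  # 0
```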
<p>
          Sentiment analysis: As mentioned previously, people who
self-harm are assumed to show more negative emotions [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Tracking sentiment to follow
the users' moods therefore makes sense in this context. We focused only on positive versus
negative sentiment. This feature gives the sentiment of the window as a numeric
score normalized by the length of the texts. A negative score indicates a
negative sentiment, while a positive score indicates a positive emotion.
        </p>
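<p>As a toy illustration of such a length-normalized score (the actual system uses NLTK's Sentiment Intensity Analyzer; the tiny word lists below are invented for the example):</p>

```python
# Illustrative lexicons only; not the lexicon used by the system.
POSITIVE = {"good", "happy", "love", "hope"}
NEGATIVE = {"bad", "sad", "hate", "pain", "hurt"}

def sentiment_score(text):
    """Positive minus negative word count, normalized by window length."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return hits / len(words)

print(sentiment_score("i hate this pain"))  # -0.5
print(sentiment_score("there is hope"))     # ~0.33
```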
<p>
          NSSI words: Finally, some people who self-harm will surely talk about it.
There is a subreddit dedicated to self-harm, where users talk about their
disorder and support each other. We can suppose that at least some of the users
in our dataset use this subreddit, or that they talk about their problem
somewhere else. It is therefore useful to track the usage of the most common words
related to self-harm. This feature is linked to the first-person pronoun one: by
tracking not only self-harm words, but also who is the subject of those sentences,
we know whether the user is likely to be self-harming or is talking about
somebody else. A list of words related to self-harm (NSSI words from now
on) was obtained from [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This feature gives the number of words from this
list that appear in the window, normalized by the length of the texts.
        </p>
      </sec>
      <sec id="sec-4-2">
<title>Message classification module</title>
        <p>The features calculated from the window messages are fed to a previously trained
SVM classifier. This classifier predicts whether the features belong to a message
generated by a positive or a negative user.</p>
      </sec>
      <sec id="sec-4-3">
        <title>User decision module</title>
        <p>In the final step, the outputs from the previous module are fed to the decision
module.</p>
        <p>For every new message we receive, we have to classify each user as "positive"
or "negative". A positive decision is final, but a negative one may be revised
later. Besides, the task rewards quick decisions, so the earlier we make a positive
decision, the better.</p>
        <p>Following the same reasoning as with the features window module, a single
positive output should not be enough to classify a user as positive. We
therefore implement a decision policy.</p>
        <p>The decision policy works as follows: for every new message, after
receiving the output (positive or negative) for the window, the previous n outputs for
that user are observed, where n is a configurable parameter. If they are
all positive, the user is classified as positive in this iteration. If not, they are
classified as negative.</p>
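<p>The decision policy above can be sketched as one function; the function name, and the choice to stay negative while fewer than n outputs exist, are our assumptions:</p>

```python
def decide_user(outputs, n):
    """Return 1 (positive) iff the last n message-level outputs are all positive.

    outputs: the user's classifier outputs so far (0/1, oldest first).
    With fewer than n outputs we assume the user stays negative;
    in the task, a positive decision would then be final.
    """
    if len(outputs) < n:
        return 0
    return int(all(outputs[-n:]))

print(decide_user([1, 0, 1, 1, 1], 3))  # 1 (last three all positive)
print(decide_user([1, 1, 0], 3))        # 0
```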
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Data Analysis</title>
      <p>Before starting the development of the model, we did an exploratory analysis
of the 2019 dataset used in the previous year's eRisk task. Our
findings are presented in this section.</p>
      <p>During the model development, the data was divided into train and test data,
and the analysis was only performed on the train data. However, we recalculated
the analysis with all the 2019 data after the 2020 task was over for the purposes
of these working notes.</p>
      <p>The categories of the analysis follow the same division as the features
explained in section 3.</p>
      <p>Table 2 shows how many positive and negative users exist in the dataset.
As stated before, the data is highly skewed towards negative users. All
analytical results have to be interpreted with this in mind, since
there are five times more negative than positive users.</p>
      <p>Table 3 shows how the number of positive and negative users affects the
number of posts found in the dataset. Since there are more negative
users, it is not surprising that there is more diversity in the posts from this kind
of user. There is a difference of 989 posts between the minimum and maximum
for positive users, while the difference is 1982 for negative users. Information
about the total and average length of posts is also given in Table 3. Although
the total length is greater for negative users, the mean shows that posts made
by positive users are longer on average, with the median value also being higher.
The longest post belongs to a negative user, however. In addition, Table 3 shows
the total and average number of words used per post. In this table we can see that
positive users use, on average, more words per post than negative users.</p>
      <p>Following the data analysis of this section, we decided to explore the use of
first-, second- and third-person pronouns and how they differed between positive
and negative users. Table 4 shows our findings. These values are normalized by
post length. It can be seen that, on average, positive users use more pronouns
per post, and the greatest difference is seen in first-person pronouns.</p>
      <p>The same analysis was performed for the use of NSSI words. Table 5 shows
the statistics on the use of NSSI words. These values are also normalized by post
length. A notable difference is observed once again between positive and negative
users, with positive users using more NSSI words on average. Figures 1 and 2
show the frequency distribution of the NSSI words for positive and negative
users, respectively. Table 6 shows the same statistics with NSSI words divided
into their different categories.</p>
      <p>Finally, Table 7 shows the differences found when applying the sentiment
analysis to positive and negative users. The values are normalized by post
length, and a greater value equals a more positive sentiment. Unfortunately,
there are no observable differences between them.</p>
      <p>Fig. 1. Positive users. Frequency distribution of the NSSI words. [Word-frequency figure; the individual word labels are not recoverable from the extraction.]</p>
      <p>[Flattened statistics table] Users: Total | Mean | Deviation | Min | Max | Median.
Positive: 920.074 | 2.693E-03 | 1.525E-02 | -.148 | .229 | 0.
Negative: 18269.027 | 2.338E-03 | 1.531E-02 | -.345 | .293 | 0.</p>
    </sec>
    <sec id="sec-6">
      <title>Experimental Setup</title>
      <p>This section presents the experiments conducted for the official eRisk 2020 task
using the model proposed in section 3.
The SVM classification model was implemented using a combination of NLTK
1 and Scikit-learn 2. More specifically, Scikit-learn's LinearSVC
implementation of linear Support Vector Classification was used. The number of positive
and negative users available for training was highly unbalanced in favour of the
negative users, so the "class weight balanced" option was used during training. Other
parameters were left at their defaults.</p>
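<p>A minimal sketch of this training configuration; the feature layout and the toy values are invented for illustration:</p>

```python
from sklearn.svm import LinearSVC

# class_weight="balanced" reweights classes inversely to their frequency,
# compensating for the skew towards negative users; other parameters default.
clf = LinearSVC(class_weight="balanced")

# Toy rows: [sentiment, first-person count, NSSI-word count] per window.
X = [[0.1, 1.0, 0.0], [-0.4, 5.0, 3.0], [0.2, 0.0, 0.0], [-0.5, 6.0, 2.0]]
y = [0, 1, 0, 1]  # 1 = window from a positive (self-harm) user
clf.fit(X, y)
print(clf.predict([[-0.45, 5.5, 2.5]]))  # [1]
```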
      <p>NLTK was used for data cleanup and text pre-processing (tokenizing and
stemming). Sentiment analysis was also performed with NLTK's Sentiment
Intensity Analyzer.</p>
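<p>The tokenizing and stemming step can be sketched with NLTK's Porter stemmer; the regex tokenizer is a simplification, since the exact cleanup pipeline is not detailed here:</p>

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, tokenize and stem a message (simplified sketch of the
    NLTK-based cleanup; only crude regex tokenization is used here)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Cutting hurts"))  # ['cut', 'hurt']
```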
      <p>Training and testing: The SVM classifier was trained with data from the eRisk
2019 task. During the model evaluation, this data was divided into training and
testing data; for the current task evaluation, a new classifier was trained
on the whole 2019 data collection.</p>
      <sec id="sec-6-1">
        <title>Submitted runs</title>
        <p>Our team participated with five different runs. We were interested in observing
the differences in performance when combining three factors: 1) the window size
during training, 2) the window size during testing, and 3) the decision window
size during testing (the number of consecutive positive messages required before declaring
a user positive). Every run used a different combination of these factors.
Table 8 shows the configuration of each run.
1 https://www.nltk.org/
2 https://scikit-learn.org/</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Results and Discussion</title>
      <p>This section shows the official results for the task, plus some additional tests
performed independently by our team. An overview of the official results of all
teams can be found in [14].</p>
      <p>During the evaluation stage of the task, teams had to iteratively request
data from a server and send their predictions, one message per user at a time.
After implementing our model, a program was written that automatically
connected to this server and performed the model calculations. This program
was launched on May 30th and left running for 24 hours.</p>
      <p>Some problems were encountered during the evaluation stage of the task.
The program halted for 12 hours and had to be relaunched, which caused the
number of processed messages to be smaller than expected. Furthermore, a bug in
the code caused an issue with the differentiation of the five distinct runs. After
the official results were given and our implementation error was fixed, we reran
the predictions in order to show more realistic results in these working
notes.</p>
      <p>Tables 9, 10 and 12 show the official results our team received from the
task organizers. Results from other teams are added for comparison purposes.</p>
      <p>Table 9 shows the time span and number of messages processed. We include
information from the fastest and slowest teams, plus the one that achieved the
best results in the official metrics. Our team, which took 1 day to process 554
messages, is amongst the fastest teams, especially considering that 12 hours were lost.</p>
      <p>Table 10 shows the official evaluation metrics for the binary decision task,
plus our own calculations for the results of our fixed system. The results of
the runs that achieved the best score for each metric are also added, and it
is interesting to note that all belong to the same team. Table 11 also shows
additional information about the number of users that were classified as positive
or negative by our fixed system.</p>
      <p>Participating teams were also required to send, for each iteration, scores that
represented the estimated risk of each user. Table 12 shows the official results
for our team, and the best results. Standard IR metrics were calculated after
processing 1 message, 100 messages, 500 messages and 1000 messages. Our team
only processed 554 messages, so the 1000-message metrics are not given.</p>
      <p>The testing window appears to have little effect on the result metrics. This
could be because the difference between the window sizes was too small (10
and 20). The decision window size affected the latency, which can be seen more
clearly in the fixed results: the run with window size 5 had a latency of 5,
while the runs with window size 3 had a latency of 3. The biggest difference
was found for the training window size. Runs 3 and 4, trained with window size
"All", obtained better results for the evaluation metrics, but they also classified
every user as positive. Runs 0, 1 and 2, which were trained with window size 1,
classified more than 10 users as negative.</p>
      <p>While our system was a simple approach, it achieved modest results.
Latency-weighted F1 is an interpretable metric that estimates the "goodness" of the
solution, and our team scored, on average, more than half of what the winning team
achieved. This shows that even a simple, feature-driven approach can tackle what
looks like a very complex problem with promising results.</p>
      <p>Furthermore, in this kind of problem, recall is a more important metric than
precision. This is because, ideally, this system would be used as a tool to raise
alarms about users, with an expert then reviewing each case separately. For
this reason, it is very important to detect each and every one of the true positive
cases. Table 10 shows that, while our precision score is low, our recall score is
very high. Table 11 shows that this is partially because some runs categorize all
users as positive, but we believe some tuning of the decision window size would
mitigate this problem.</p>
      <p>Speed and latency are important metrics in early risk detection, and our
system achieved high scores for both of them. It is also important to note that
our decisions are fast and not very heavy on computing resources. Past messages
are not iterated over more than x times, x being the size of the window, so the
model can run indefinitely with no extra cost.</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusions and Future Work</title>
      <p>In this paper we present the NLP-UNED participation in the eRisk 2020 T1 task.
We perform a data analysis of the 2019 T2 self-harm data and use our findings
to construct features for a system that performs early prediction of signs of
self-harm in users extracted from Reddit data. Our analysis shows that subjects
who self-harm, on average, write longer posts, use more first-person pronouns,
and mention more words related to NSSI. The official eRisk results show that
our system, while simple, manages to achieve modest but fast results, though more
work is needed to reach state-of-the-art results.</p>
      <p>We are interested in finding out whether fine-tuning the window sizes used in our
system could significantly improve results. Implementing the same window policy during
the training phase as in the testing phase could yield better results as well. Finally,
there is evidence that self-harm subjects have different posting patterns than
non-self-harmers, so we are interested in exploring the temporal differences in the
dataset and creating more features.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the Spanish Ministry of Science and
Innovation within the projects PROSA-MED (TIN2016-77820-C3-2-R),
DOTTHEALTH (PID2019-106942RB-C32), and EXTRAE II (IMIENS 2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baetens</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Claes</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muehlenkamp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grietens</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Onghena</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Non-Suicidal and Suicidal Self-Injurious Behavior among Flemish Adolescents: A Web-Survey</article-title>
          .
          <source>Archives of Suicide Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>56</fpage>
          –
          <lpage>67</lpage>
          (
          <year>2011</year>
          ), https://doi.org/10.1080/13811118.2011.540467
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cavazos-Rehg</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krauss</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sowles</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Connolly</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bharadwaj</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grucza</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bierut</surname>
            ,
            <given-names>L.J.:</given-names>
          </string-name>
          <article-title>An analysis of depression, selfharm, and suicidal ideation content on Tumblr</article-title>
          .
          <source>Crisis</source>
          <volume>38</volume>
          (
          <issue>1</issue>
          ),
          <fpage>44</fpage>
          -
          <lpage>52</lpage>
          (
          <year>2017</year>
          ), https://psycnet.apa.org/record/2016-36501-001
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Conway</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Social media, big data, and mental health: Current advances and ethical implications</article-title>
          .
          <source>Current Opinion in Psychology</source>
          <volume>9</volume>
          ,
          <fpage>77</fpage>
          -
          <lpage>82</lpage>
          (
          <year>2016</year>
          ), http://www.sciencedirect.com/science/article/pii/S2352250X16000063
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>De Choudhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Counts</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horvitz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Social Media as a Measurement Tool of Depression in Populations</article-title>
          .
          <source>In: Proceedings of the 5th Annual ACM Web Science Conference</source>
          . pp.
          <fpage>47</fpage>
          -
          <lpage>56</lpage>
          . WebSci '13,
          Association for Computing Machinery, New York, NY, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Edwards</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holtzman</surname>
            ,
            <given-names>N.S.</given-names>
          </string-name>
          :
          <article-title>A meta-analysis of correlations between depression and first person singular pronoun use</article-title>
          .
          <source>Journal of Research in Personality</source>
          <volume>68</volume>
          ,
          <fpage>63</fpage>
          -
          <lpage>68</lpage>
          (
          <year>2017</year>
          ), http://dx.doi.org/10.1016/j.jrp.
          <year>2017</year>
          .
          <volume>02</volume>
          .005
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hilton</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <article-title>Unveiling self-harm behaviour: what can social media site Twitter tell us about self-harm? A qualitative exploration</article-title>
          .
          <source>Journal of Clinical Nursing</source>
          <volume>26</volume>
          (
          <issue>11-12</issue>
          ),
          <fpage>1690</fpage>
          -
          <lpage>1704</lpage>
          (
          <year>2017</year>
          ), https://onlinelibrary.wiley.com/doi/abs/10.1111/jocn.13575
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gluck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Self-injurers and their common personality traits</article-title>
          , https://www.healthyplace.com/abuse/self-injury/self-injurers-and-their-common-personality-traits
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Greaves</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          :
          <article-title>A Corpus Linguistic Analysis of Public Reddit and Tumblr Blog Posts on Non-Suicidal Self-Injury</article-title>
          .
          <source>Ph.D. thesis</source>
          , Oregon State University (
          <year>2018</year>
          ), https://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/mp48sk29z
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Guntuku</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yaden</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eichstaedt</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          :
          <article-title>Detecting depression and mental illness on social media: an integrative review</article-title>
          .
          <source>Current Opinion in Behavioral Sciences</source>
          <volume>18</volume>
          ,
          <fpage>43</fpage>
          -
          <lpage>49</lpage>
          (
          <year>2017</year>
          ), http://dx.doi.org/10.1016/j.cobeha.2017.07.005
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Karmen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsiung</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wetter</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Screening internet forum participants for depression symptoms by assembling and enhancing multiple NLP methods</article-title>
          .
          <source>Computer Methods and Programs in Biomedicine</source>
          <volume>120</volume>
          (
          <issue>1</issue>
          ),
          <fpage>27</fpage>
          -
          <lpage>36</lpage>
          (
          <year>2015</year>
          ), http://dx.doi.org/10.1016/j.cmpb.2015.03.008
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lederer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grechenig</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baranyi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>UnCUT: Bridging the gap from paper diary cards towards mobile electronic monitoring solutions in borderline and self-injury</article-title>
          .
          <source>In: SeGAH 2014 - IEEE 3rd International Conference on Serious Games and Applications for Health, Books of Proceedings. Institute of Electrical and Electronics Engineers Inc</source>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>