Detection of early signs of self-harm on Reddit using multi-level machine learning

Hojjat Bagherzadeh1, Ehsan Fazl-Ersi2, and Abedin Vahedian2
1 Dept. of Computer Engineering, Ferdowsi University of Mashhad, Iran bagherzadehhosseinabad.hojjat@mail.um.ac.ir
2 Dept. of Computer Engineering, Ferdowsi University of Mashhad, Iran {fazlersi,vahedian}@um.ac.ir

Abstract. This paper describes the participation of the EFE research team in Task 1 of the CLEF eRisk 2020 competition. The challenge focuses on the early detection of symptoms of self-harm from users' posts on social media. Identifying mental illnesses, especially in their early stages, can help people avoid risky behaviors. Personal notes on social media are often indicative of one's psychological state; therefore, by applying natural language processing techniques to users' posts, one can develop an early risk detection system. The proposed method consists of a Word2Vec representation, an ensemble of an SVM and deep neural networks, and attention layers. The obtained results are very competitive and show the strength of the proposed system in the early detection of self-harm.

Keywords: Early Risk Detection, Self-Harm, Natural Language Processing, SVM, Attention, Word2Vec.

1 Introduction

Self-harm, also known as self-injury, is defined as intentional bodily harm and can affect all people, regardless of age, gender, and race. It can be considered a common mental health issue, one that can lead to several mental illnesses including depression, anxiety, and emotional distress [9]. Previous findings suggest that people's narratives or writing patterns can reflect their mental state [29, 30]. Hence, with the help of sentiment analysis, researchers try to identify mentally ill individuals based on their writings on the Internet; this is the main objective behind the CLEF eRisk challenges [17, 18, 20]. This year's challenge has two tasks.
Task 1 deals with the early detection of self-harm, and Task 2 tries to measure the severity of the signs of depression. The EFE team participated in Task 1 of the competition: given a sequence of writings for each user, the system attempts to detect signs of self-harm as early as possible. Users' writings are processed in the same order in which they were sent, which allows monitoring each user's activity chronologically.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

This paper presents the participation of the EFE team in the self-harm early detection challenge of CLEF 2020 [22]. The method is an ensemble of deep neural networks and a Support Vector Machine (SVM) as classifiers, which take their features from vector representations of the text of users' posts. These vectors are Word2Vec representations of the cleaned text, tweaked by attention layers at different steps. The evaluation results of the proposed method's runs, together with all the other Task 1 runs of the competition, are discussed.

The rest of the paper is organized as follows. Section 2 covers related work, while Section 3 gives a brief description of Task 1 and the datasets used. Section 4 introduces the proposed method. Analysis of the experimental results is presented in Section 5 and, finally, conclusions and suggested directions for future work are presented in Section 6.

2 Related Work

Early risk detection based on sentiment analysis is a trending research field with many growing applications. In recent years, given the popularity of social media networks as a source of news and information, health-related datasets of this kind have been made available, which attracted substantial attention and led to the introduction of online competitions such as CLEF eRisk [1, 17] and the CLPsych Shared Tasks [8, 23, 36].
Research suggests that individuals with mental issues can be identified by what they publicly share on online social media platforms, because of the language patterns they use in their written texts. Thus, with advances in Natural Language Processing (NLP), researchers can now provide tools capable of detecting mental illness in its early stages.

NLP modules basically have two steps. The first step is a vector representation of the text, such as term frequency-inverse document frequency (TF-IDF) [2], pre-defined patterns, or Part-Of-Speech tagging, which requires expert knowledge of the context. There are also more generic text representations, namely Word2Vec [25, 26], Doc2Vec [15], and LDA (Latent Dirichlet Allocation) [3, 24], that are based on counting all the words of the context. All of these try to represent the text as a high-dimensional vector that is suitable for machine-learning engines. The second step is the learning process. SVMs, neural networks, and inference models are some of the many learning models widely used in NLP.

Detecting mental illness, especially self-harm, is a challenging task, as it commonly relies on self-report. Most people who self-harm also suffer from other mental illnesses, such as depression and anxiety, which makes the conditions difficult to distinguish [13, 10]. Wang et al. [34] detected self-harm content on Flickr using word embeddings and deep neural networks. The UNSL team [5], one of the participants in eRisk 2019, designed a special dictionary-based text classifier. Bouarara and his team [4] analyzed users' tweets with a sentiment classification model to detect suicidal or self-harm behaviors and prevent risky attempts. Research findings state that users' writing patterns can express their mental state [7, 6, 32].
Furthermore, building on the research in this field, the EFE team participated in the CLEF eRisk 2020 competition using a combination of two deep neural networks and SVM models trained on Word2Vec text representations, with which promising results were obtained.

3 Dataset and Competition

The training dataset for early self-harm detection (Task 1 of CLEF eRisk 2020) was provided by the competition organizer and is the dataset of Task 2 of the CLEF eRisk 2019 competition [20, 19]. It consists of dated textual data of users' online posts, labeled as self-harm and non-self-harm. The labeling only determines the status of each user and does not suggest a label for each individual writing. Table 1 shows brief statistics of the Task 1 training data [21].

Table 1. Task 1 self-harm train dataset statistics

Train dataset                  Self-harm   Control
No. users                      41          299
Avg. no. writings per user     169         546.8
Avg. no. words per writing     24.8        18.8

The training dataset includes all the writings of each user and a label indicating the self-harm status of each one. The test stage, however, follows an iterative strategy: a new round of writings for each user is released only after the current round's results have been submitted by the competitor. The evaluation measures used in this challenge, besides precision, recall, and F1, are the ERDE metrics. A detailed description of the tasks and evaluation metrics can be found in the corresponding task description paper [21].

4 Proposed Method

The proposed method for eRisk 2020 Task 1 is a multi-level approach combining deep neural networks and an SVM along with an attention mechanism, as shown in Fig. 1. At each round, first, at the word level, text cleaning steps including lowercasing, tokenizing, removing additional phrases, and stemming are applied to obtain a cleaned version of the submitted texts, and then a vector representation of each post is computed using a Word2Vec model [15, 25].
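The word-level cleaning just described (lowercasing, removing web addresses and hashtags, tokenizing, stemming) can be sketched as follows. This is a minimal stand-in for the NLP tooling the paper uses: the stopword list is illustrative and the suffix stripper is a toy replacement for a real stemmer such as Porter's.

```python
import re

# Illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "i"}

def clean_post(text):
    """Lowercase, strip URLs/hashtags, tokenize, drop stopwords, crude stemming."""
    text = text.lower()
    text = re.sub(r"https?://\S+|#\w+", " ", text)  # remove web addresses and hashtags
    tokens = [t for t in re.findall(r"[a-z']+", text) if t not in STOPWORDS]
    stemmed = []
    for t in tokens:
        # toy suffix stripper standing in for a real stemmer
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(clean_post("I visited https://example.com and kept running #offtopic"))
# -> ['visit', 'kept', 'runn']
```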
Then these post representations are fed to the first-level machines, and a score indicating the level of self-harm for each post is obtained. Next, at the user level, these scores are aggregated to create user-level features. Using an attention mechanism and the Chi-squared feature selection technique [25, 33], the most appropriate features are selected as input to the user-level learning machines. Finally, the outputs of the user-level SVM are passed to a scoring fusion function that makes the final decision. Therefore, at each round, a decision about considering the user a self-harm case is made based on the user's writings from the beginning up to that round. Further details of each level are as follows.

Fig. 1. Architecture of the proposed model

4.1 Word level

The word level, or input layer, is where a numerical representation of words and posts is created for use by the learning machines at the next level. First, the text of each post is tokenized and cleaned using NLP tools, which mostly involves converting plurals, removing web addresses and hashtags, lowercasing, stemming, and lemmatizing [27]. Then these clean words are fed to the word embedder, which in this experiment is a customized Word2Vec model with a 100-dimensional vector space, trained on Twitter and Reddit posts. Word2Vec is a two-layer neural network trained to reconstruct or predict the surrounding context of words in a sentence [14]; the inner-layer weights of the trained network are used as the numeric representation of words. Therefore, each post p is represented by [W2V_1^p, W2V_2^p, ..., W2V_m^p], where each W2V_i^p is a 100-dimensional vector that is the Word2Vec representation of the i-th word of the p-th post.

4.2 Post level

At the post level, a Convolutional Neural Network (CNN) [27, 14, 16], a Long Short-Term Memory (LSTM) network [35, 12], and an SVM [31, 28] are the main learning machines.
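The per-word representation [W2V_1^p, ..., W2V_m^p] that these sequence models consume can be sketched as an m × 100 matrix, one row per in-vocabulary word. The vocabulary and vectors below are random stand-ins for the trained Word2Vec model; only the shape handling is the point.

```python
import numpy as np

EMB_DIM = 100  # the paper's Word2Vec space is 100-dimensional

# Toy embedding table standing in for the Word2Vec model trained on
# Twitter/Reddit posts; the vocabulary and vectors here are made up.
rng = np.random.default_rng(0)
vocab = {"feel": 0, "hurt": 1, "today": 2}
embeddings = rng.standard_normal((len(vocab), EMB_DIM))

def post_matrix(tokens):
    """Represent post p as [W2V_1^p, ..., W2V_m^p]: one 100-d row per known word."""
    rows = [embeddings[vocab[t]] for t in tokens if t in vocab]
    return np.stack(rows) if rows else np.zeros((0, EMB_DIM))

m = post_matrix(["feel", "hurt", "today", "some_oov_word"])
print(m.shape)  # one row per in-vocabulary word
```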
The one-dimensional CNN and the LSTM network process the words of each post in their chronological order and give each post a score that estimates the probability of it belonging to the self-harm class. Each Word2Vec representation W2V_i^p of each word of a post is fed directly to the CNN and LSTM networks. The SVM, by contrast, cannot process the words of a post sequentially, so an aggregated version of the word representations, a weighted average of the words of each post as shown in Equation 1, is fed to the SVM. The SVM then, like the other two machines, gives a score to each post.

    R_p = Σ_{i=1}^{n_p} (W_word)_{p,i} × W2V_i^p        (1)

where R_p is the numerical representation of the p-th post, (W_word)_{p,i} is the weight of the i-th word of the p-th post, computed in the training process based on the importance of each word to the degree of positivity of the post, and W2V_i^p is the Word2Vec representation of the i-th word of the p-th post. The output of this level for each post of a user is therefore a three-value vector, with each learning machine producing one value.

4.3 User level

At the user level, the post scores of each user form a 3 × n matrix, where n is the number of posts sent by the user up to the current round. In the training stage, 37 features are generated for each user using common statistical measures such as average, standard deviation, and variance [11]; among these features, the 7 most descriptive ones are selected with the Chi-squared feature selection method [25, 33] and used as inputs to the user-level SVM. In the test stage, at each round, these 7 descriptive features, in the form of a 3 × 7 matrix, are created based on the user's writings from the beginning. Before the statistical measures are applied to the post-level scores, the scores are adjusted by the attention mechanism: considering that not all the posts sent by a user relate to their mental state, post scores are weighted by their correlation to the self-harm category.
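Equation (1) above is simply a weighted sum of word vectors, which is a one-liner in NumPy. The weights in the example are illustrative, not learned values.

```python
import numpy as np

def aggregate_post(word_vectors, word_weights):
    """Equation (1): R_p = sum_i (W_word)_{p,i} * W2V_i^p.

    word_vectors: (n_p, d) array of Word2Vec vectors for one post.
    word_weights: (n_p,) per-word importance weights (learned in training;
                  illustrative here).
    """
    w = np.asarray(word_weights, dtype=float)[:, None]
    return (w * np.asarray(word_vectors, dtype=float)).sum(axis=0)

# two 4-d "word vectors" with weights 0.75 and 0.25
r = aggregate_post([[1.0, 0.0, 2.0, 0.0], [0.0, 4.0, 2.0, 0.0]], [0.75, 0.25])
print(r)  # [0.75 1.   2.   0.  ]
```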
Therefore, the post scores are weighted and then used to generate the selected statistical features, which are fed into the final SVM. At each round, the three-value output of the final SVM is used by the scoring fusion function, which calculates a value indicating the user's level of self-harm based on their writings from the beginning. These values are sent to the final decision system, which alerts with a '1' for users with a self-harm mental status. The final decision system stores the scores of the scoring fusion function at each round and decides to alert '1' for a user whenever the conditions depicted below are met, based on:

• the average of the user's scores up to the current round;
• the number of ascending steps in the user's scores;
• the number of values above the maximum threshold level;
• the average of the user's higher scores up to the current round.

5 Experimental setup and results

The main parts of the proposed system are shown in Fig. 1. The EFE team participated in Task 1 with three runs, each with small differences:

• The first run model is exactly as depicted in Fig. 1; it used eRisk 2018 Tasks 1 and 2 as the training dataset for optimizing the post level, and eRisk 2019 Tasks 1 and 2 as the training dataset for optimizing the user level and configuring the attention mechanism.
• The second run configuration is the same as the first run; the only difference is the training datasets. The eRisk 2018 depression task was used to train the post level of the machine, and the eRisk 2019 self-harm task was used to train the user level of the model.
• The third run model only has the SVM machines; the neural network models of the system are omitted. The datasets for the post and user levels are the same as the run 2 training sets.

The evaluation metrics for this challenge fall into two groups, fully explained in [1]. The first group comprises the precision, recall, and F1 metrics, which measure the accuracy of the model on unbalanced datasets.
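The final decision rules listed above can be sketched as a simple threshold check over the fusion scores accumulated so far. This toy version covers the average, ascending-step, and above-threshold conditions; all threshold values are illustrative, not the paper's tuned parameters.

```python
import numpy as np

def should_alert(fusion_scores, avg_thr=0.5, rise_min=2, high_thr=0.8, high_min=2):
    """Toy final decision system: alert '1' when the average score, the number
    of ascending steps, and the count of scores above a maximum threshold are
    all high enough. Thresholds are illustrative, not tuned values."""
    s = np.asarray(fusion_scores, dtype=float)
    ascending = int((np.diff(s) > 0).sum()) if s.size > 1 else 0
    above = int((s > high_thr).sum())
    return bool(s.mean() > avg_thr and ascending >= rise_min and above >= high_min)

print(should_alert([0.2, 0.4, 0.6, 0.85, 0.9]))  # rising, high scores -> True
print(should_alert([0.1, 0.1, 0.1]))             # flat, low scores -> False
```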
The second group, comprising early risk detection error (ERDE), latency_TP, and latency-weighted F1, measures the accuracy of the model while taking time into account. Table 2 shows the official results of the three runs of the proposed system. As can be seen in the table, the second run has the best performance among the three, which is due to choosing the self-harm dataset for training the user level. However, the first run shows comparable results, which indicates the connection between depression, self-harm, and other mental illnesses.

Table 2. Official results of the proposed method runs in the self-harm task (T1)

Run   P      R      F1     ERDE5   ERDE50   Latency_TP   Speed   Latency-weighted F1
1     0.73   0.519  0.607  0.257   0.142    11           0.961   0.583
2     0.625  0.625  0.625  0.268   0.117    11           0.961   0.601
3     0.496  0.615  0.549  0.283   0.14     11           0.961   0.528

There were 56 runs from 12 teams participating in Task 1 of the 2020 eRisk CLEF challenge. Table 3 shows statistics of the participants' results compared to the proposed method.

Table 3. Statistics of the results of the 56 runs and the rank of the proposed method for Task 1 (*: for these measures, a lower value means better performance)

           P      R      F1     ERDE5*  ERDE50*  Latency  Speed   Latency-weighted F1   Lapse time (min per writing)
Min        0.237  0.01   0.02   0.423   0.269    133      1       0.019                 0.301
Max        0.913  1      0.754  0.134   0.071    1        0.526   0.658                 200
Average    0.437  0.639  0.441  0.247   0.170    15.89    0.943   0.448                 18
Std.       0.235  0.322  0.175  0.056   0.042    29.31    0.107   0.126                 57.39
EFE (rank) 8      30     4      33      11       11       16      3                     2

As shown in Table 3, the method achieved comparable results in the F1 and latency-weighted F1 measures, ranking 4th in F1 and 3rd in latency-weighted F1, with almost the shortest time needed for processing and completing the challenge. Because of the imbalance in the dataset, the key to good performance is maintaining the balance between P and R, which is achieved in run 2 of the proposed model.
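For reference, the ERDE_o metric reported in Tables 2 and 3 penalizes late correct alerts: a true positive issued after k writings costs lc_o(k) · c_tp with lc_o(k) = 1 − 1/(1 + e^(k−o)), so detections made well before the deadline o cost almost nothing. Below is a sketch following the definition in the eRisk overview papers; the cost constants are illustrative, not the official values.

```python
import math

def erde(decision, truth, k, o, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """ERDE_o contribution of a single user.

    decision: did the system alert '1'?   truth: is the user a true case?
    k: number of writings processed when the alert was issued.
    o: the metric's deadline parameter (5 or 50 in Table 2).
    Cost constants here are illustrative.
    """
    if decision and truth:           # true positive: cost grows with delay
        return (1.0 - 1.0 / (1.0 + math.exp(k - o))) * c_tp
    if decision and not truth:       # false positive: fixed cost
        return c_fp
    if not decision and truth:       # false negative: fixed cost
        return c_fn
    return 0.0                       # true negative: no cost

print(erde(True, True, k=1, o=50))    # early TP: near-zero cost
print(erde(True, True, k=200, o=50))  # very late TP: cost near c_tp
```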
Given that this was our first attempt at participating in such a competition, we paid a lot of attention to giving early answers, and as a result the best possible outcomes of the model were not obtained. This also explains why the latency-weighted F1 rank is third while the F1 rank is fourth.

6 Conclusion and future work

In this article, using the presented model, the EFE team participated in Task 1 of eRisk 2020 [21]. The task was to detect signs of self-harm in users, based on their writings, as early as possible. By engaging in this challenge, the capability of social media content as a potential source for applications related to health and safety issues has been demonstrated. The proposed system is an ensemble multi-level method based on SVM, CNN, and LSTM networks, fine-tuned by the attention layers. The test results show the positive effect of using attention mechanisms in the post layer and the user layer, especially since not all posts sent by a person reflect his or her mental state. Another main difficulty is that there is always a trade-off between early decision making and more precise decision making: on the one hand, there is the need to detect the signs of mental illness in the user as early as possible; on the other hand, the more writings the system processes about the user, the more accurate the answer will be. Finally, considering that this was the EFE team's first attempt at such challenges, we found that much can still be done to improve the system for real-world situations. Future research directions for improving the model include better encoding of text into numerical representations and better attention mechanisms at the different levels of the system. Another research interest is to find an optimal operating point at which both accuracy and fast answers are maintained.

References

1. CLEF eRisk: Early risk prediction on the Internet, https://erisk.irlab.org/
2. Beel, J., Langer, S., Gipp, B.: TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users' Personal Document Collections (2017)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14, pp. 601–608. MIT Press (2002)
4. Bouarara, H.A.: Detection and Prevention of Twitter Users with Suicidal Self-Harm Behavior. International Journal of Knowledge-Based Organizations 10(1), 49–61 (nov 2019)
5. Burdisso, S.G., Errecalde, M., Montes-Y-Gómez, M.: UNSL at eRisk 2019: a Unified Approach for Anorexia, Self-harm and Depression Detection in Social Media. Tech. rep. (2019)
6. Choudhury, M.D., De, S.: Mental Health Discourse on reddit: Self-Disclosure, Social Support, and Anonymity (2014)
7. Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K.: From ADHD to SAD: Analyzing the Language of Mental Health on Twitter through Self-Reported Diagnoses. pp. 1–10 (2015)
8. Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K., Mitchell, M.: CLPsych 2015 Shared Task: Depression and PTSD on Twitter. pp. 31–39 (2015)
9. Edmondson, A.J., Brennan, C.A., House, A.O.: Non-suicidal reasons for self-harm: A systematic review of self-reported accounts. Journal of Affective Disorders 191, 109–117 (2016)
10. Gratz, K.L.: Risk factors for deliberate self-harm among female college students: The role and interaction of childhood maltreatment, emotional inexpressivity, and affect intensity/reactivity. American Journal of Orthopsychiatry 76(2), 238–250 (apr 2006)
11. Holosko, M., Thyer, B.: Commonly Used Statistical Terms. In: Pocket Glossary for Commonly Used Research Terms, pp. 145–156. SAGE Publications, Inc. (jan 2014)
12. Jianqiang, Z., Xiaolin, G., Xuejun, Z.: Deep Convolution Neural Networks for Twitter Sentiment Analysis. IEEE Access 6, 23253–23260 (jan 2018)
13. Kairam, S., Kaye, J., Guerra-Gomez, J.A., Shamma, D.A.: Snap decisions? How users, content, and aesthetics interact to shape photo sharing behaviors. In: Conference on Human Factors in Computing Systems - Proceedings. pp. 113–124. Association for Computing Machinery (may 2016)
14. Kshirsagar, R., Morris, R., Bowman, S.: Detecting and Explaining Crisis (2017)
15. Le, Q.V., Mikolov, T.: Distributed Representations of Sentences and Documents. 31st International Conference on Machine Learning, ICML 2014 4, 2931–2939 (may 2014)
16. Liao, S., Wang, J., Yu, R., Sato, K., Cheng, Z.: CNN for situations understanding based on sentiment analysis of twitter data. In: Procedia Computer Science. vol. 111, pp. 376–381. Elsevier B.V. (jan 2017)
17. Losada, D.E., Crestani, F., Parapar, J.: eRISK 2017: CLEF lab on early risk prediction on the internet: Experimental foundations. In: Lecture Notes in Computer Science. vol. 10456 LNCS, pp. 346–360. Springer Verlag (2017)
18. Losada, D.E., Crestani, F., Parapar, J.: eRISK 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental Foundations. In: Jones, G.J.F., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. pp. 346–360. Springer International Publishing, Cham (2017)
19. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2019 Early Risk Prediction on the Internet. In: Lecture Notes in Computer Science. vol. 11696 LNCS, pp. 340–357. Springer Verlag (sep 2019)
20. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk at CLEF 2019: Early Risk Prediction on the Internet. In: Cappellato, L., Ferro, N., Losada, D.E., Müller, H. (eds.) Conference and Labs of the Evaluation Forum. CEUR-WS.org (2019)
21. Losada, D.E., Crestani, F., Parapar, J.: eRisk 2020: Self-harm and depression challenges. In: Lecture Notes in Computer Science. vol. 12036 LNCS, pp. 557–563. Springer (apr 2020)
22. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2020: Early Risk Prediction on the Internet. In: Arampatzis, A., Kanoulas, E., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association. Springer International Publishing (2020)
23. Lynn, V., Goodman, A., Niederhoffer, K., Loveys, K., Resnik, P., Schwartz, H.: CLPsych 2018 Shared Task: Predicting Current and Future Psychological Health from Childhood Essays. pp. 37–46 (2018)
24. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning Word Vectors for Sentiment Analysis. Tech. rep. (2011)
25. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems (oct 2013)
26. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic Regularities in Continuous Space Word Representations. Tech. rep. (2013)
27. Mohan, V.: Preprocessing Techniques for Text Mining - An Overview (feb 2015)
28. Monika, R., Deivalakshmi, S., Janet, B.: Sentiment Analysis of US Airlines Tweets Using LSTM/RNN. In: Proceedings of the 2019 IEEE 9th International Conference on Advanced Computing, IACC 2019. pp. 92–95. Institute of Electrical and Electronics Engineers Inc. (dec 2019)
29. Moulahi, B., Azé, J., Bringay, S.: Dare to care: A context-aware framework to track suicidal ideation on social media. pp. 346–353 (oct 2017)
30. Paul, M., Dredze, M.: You Are What You Tweet: Analyzing Twitter for Public Health. Artificial Intelligence 38, 265–272 (jan 2011)
31. Ragheb, W., Azé, J., Bringay, S., Servajean, M.: Attentive Multi-stage Learning for Early Risk Detection of Signs of Anorexia and Self-harm on Social Media. Tech. rep. (2019)
32. Schwartz, H., Eichstaedt, J., Kern, M., Park, G., Sap, M., Stillwell, D., Kosinski, M., Ungar, L.: Towards Assessing Changes in Degree of Depression through Facebook (jan 2014)
33. Sun, J., Zhang, X., Liao, D., Chang, V.: Efficient method for feature selection in text classification. In: Proceedings of 2017 International Conference on Engineering and Technology, ICET 2017. pp. 1–6. Institute of Electrical and Electronics Engineers Inc. (mar 2018)
34. Wang, Y., Tang, J., Li, J., Li, B., Wan, Y., Mellina, C., O'Hare, N., Chang, Y.: Understanding and Discovering Deliberate Self-Harm Content in Social Media. In: Proceedings of the 26th International Conference on World Wide Web. pp. 93–102. WWW '17, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (2017)
35. Wang, Y., Zhang, C., Zhao, B., Xi, X., Geng, L., Cui, C.: Sentiment Analysis of Twitter Data Based on CNN. Journal of Data Acquisition and Processing 33(5), 921–927 (sep 2018)
36. Zirikly, A., Resnik, P., Uzuner, Ö., Hollingshead, K.: CLPsych 2019 Shared Task: Predicting the Degree of Suicide Risk in Reddit Posts. Tech. rep. (2019)