=Paper=
{{Paper
|id=Vol-2936/paper-80
|storemode=property
|title=CeDRI at eRisk 2021: A Naive Approach to Early Detection of Psychological Disorders
in Social Media
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-80.pdf
|volume=Vol-2936
|authors=Rui Pedro Lopes
|dblpUrl=https://dblp.org/rec/conf/clef/Lopes21
}}
==CeDRI at eRisk 2021: A Naive Approach to Early Detection of Psychological Disorders in Social Media==
Rui Pedro Lopes

Research Center for Digitalization and Intelligent Robotics (CeDRI), Instituto Politécnico de Bragança, Portugal

===Abstract===
This paper describes the participation of the CeDRI team in two eRisk 2021 tasks: Task 1, Early Detection of Signs of Pathological Gambling, and Task 2, Early Detection of Signs of Self-Harm. The main difference between the two is that the first is a “test only” challenge, for which no training data is supplied, whereas the second provides labeled data that can be used for training. Both tasks were addressed with the same algorithms, using a custom training set for Task 1 and the provided data for Task 2. The algorithms were a TfIdf vectorizer with a Logistic Regression layer, a Word2Vec vectorizer with an LSTM, and a Word2Vec vectorizer with a CNN. All vectorizers and neural networks were trained solely with the training data. As expected, the algorithms did not achieve state-of-the-art results, but the experience prompted reflection on several aspects related to the importance of proper dataset preparation and processing.

Keywords: Early Risk Detection, Tf-Idf, Word2Vec, Recurrent Neural Networks, Dataset Heuristics, DL4J.

===1. Introduction===
The term social network refers to a person’s connections to other people. Creating and maintaining social networks provides opportunities to connect with others who have similar interests. Although the concept was initially applied to “real-world”, physical contexts, it has expanded to include platforms that support online communication, such as Instagram, Twitter or Reddit. Digital platforms further enhance these opportunities, allowing people to form relationships with others they have never met in person. Geographical barriers are attenuated or eliminated, making it possible to actively engage with people around the world. Users can explore their curiosity, pick up hobbies, or just spend time online.
The possibility to write, participate or communicate without restrictions also provides a means to unburden oneself or receive emotional support. Some people resort to social networks to talk about their state of mind, their feelings, distress and other problems. In contrast to verbal and direct communication, the content available in social networks is persistent, allowing asynchronous access to the data and providing a good basis for psychological and health-related studies and analysis [1, 2, 3].

According to several findings, people’s mental state can be inferred from their social network narratives [4, 5]. Based on this, the CLEF eRisk challenges explore issues of evaluation methodologies, performance metrics and other aspects related to building test collections and defining challenges for early risk detection [6, 7, 8, 9].

This year’s challenge has three tasks. Task 1, on early risk detection of pathological gambling, and Task 2, on early risk detection of self-harm, consist of sequentially processing pieces of evidence to detect early traces of pathological gambling and self-harm, respectively, as soon as possible. Task 3, measuring the severity of the signs of depression, consists of estimating the level of depression from a thread of user submissions. The CeDRI team participated in Task 1 and Task 2, where users’ posts are processed in the same order in which they were sent, to monitor the users’ activity chronologically.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania. Email: rlopes@ipb.pt (R. P. Lopes). ORCID: 0000-0002-9170-5078 (R. P. Lopes). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.
This paper presents the participation of the CeDRI team in the pathological gambling and self-harm early detection challenges of CLEF 2021. In Task 1, two runs were executed: a Long Short-Term Memory (LSTM) network with Word2Vec embeddings and a logistic regression layer with a Tf-Idf vectorizer (as reported in Section 3.2 and Table 5). Task 2 used three runs: an LSTM and a Convolutional Neural Network (CNN), both with Word2Vec embeddings, and a logistic regression layer with a Tf-Idf vectorizer. Although the results were very close across runs, the best result in Task 1 was a latency-weighted F1 of 0.141 (with the LSTM) and in Task 2 a latency-weighted F1 of 0.206 (with the CNN).

The rest of the paper is organized as follows. Section 2 covers the considerations regarding the datasets, while Section 3 introduces the proposed methods. The results of the experiments are analyzed in Section 4 and, finally, the conclusions and suggested directions for future work are presented in Section 5.

===2. Dataset===
Machine learning is characterized by three main approaches to learning [10]:
* supervised - maps an input to an output based on example input-output pairs;
* unsupervised - patterns are learned without any explicit feedback;
* reinforcement - learns from a series of reinforcements, such as rewards and punishments.

These are applied in several areas and for several purposes, such as classification, prediction, estimation, affinity grouping, clustering and profiling. eRisk Tasks 1 and 2 are mainly classification problems, widely approached with supervised learning methods. In these problems, a learning agent is shown what to do through an annotated set of training examples, and an automated learning algorithm is expected to generalize from these examples. For this, it is fundamental to make sure that the training data is adequate and well labeled.

====2.1. Text pre-processing====
Social network posts often include tokens that do not represent words, such as URLs, HTML entities, user handles, and others. Some of these do not bring relevant information to infer the psychological condition of the user and may affect the performance of classification. The pre-processing applied in both tasks included the following operations:
* unescape HTML entities (e.g. &amp;lt; becomes <)
* remove handles (@abcd @pqrs)
* remove URLs (https://erisk.irlab.org)
* normalize lengthening (111111 -> 11; kkkkkkkkkkk -> kk)
* remove numbers
* convert to lowercase (Tomorrow -> tomorrow)
* strip punctuation
* tokenize
* perform stemming

The vocabulary is substantially reduced, as well as the word variations (Table 1). The same pre-processing approach was applied in both tasks (Sections 2.2 and 2.3).

{| class="wikitable"
|+ Table 1: Pre-processing sample.
! Original text !! Pre-processed text
|-
| We will be having our next meeting this evening at 5:00pm EST (9:00pm GMT). Meetings are 1 hour. Participants must use Skype audio and video. If you’d like to join, [DM me](http://www.reddit.com/message/compose/?to=JeffW55&subject=ProblemGamblingSupportGroup) with your Skype name so you can be added to the call. Thanks. Jeff
| [next, meet, even, pm, pm, gmt, meet, hour, particip, must, skype, audio, video, you’d, like, join, me, gambl, support, group, skype, name, ad, call, thank, jeff]
|}

====2.2. Task 1: pathological gambling dataset====
The challenge consists of sequentially processing pieces of evidence to detect early traces of pathological gambling in texts written on social media. This was a “test only” task, so no training data was provided. The test collection is a set of writings (posts or comments) from a set of social media users, labeled in two categories: pathological gamblers and non-pathological gamblers. For each user, the collection contains a sequence of writings in chronological order [11].
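The pre-processing operations listed in Section 2.1 could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used for the submitted runs: the regular expressions and the `preprocess` function name are hypothetical, and the stemming step is left out because it requires a third-party stemmer.

```python
import html
import re
import string

# Hypothetical patterns; the paper does not give the exact rules it used.
HANDLE_RE = re.compile(r"@\w+")                 # user handles such as @abcd
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URLs
LENGTHENING_RE = re.compile(r"(.)\1{2,}")       # 3+ repeats of one character
NUMBER_RE = re.compile(r"\d+")                  # digit sequences
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(text):
    """Return the list of cleaned, lowercased tokens for one post."""
    text = html.unescape(text)                # unescape HTML entities
    text = HANDLE_RE.sub(" ", text)           # remove handles
    text = URL_RE.sub(" ", text)              # remove URLs
    text = LENGTHENING_RE.sub(r"\1\1", text)  # kkkkkk -> kk, 111111 -> 11
    text = NUMBER_RE.sub(" ", text)           # remove numbers
    text = text.lower()                       # convert to lowercase
    text = text.translate(PUNCT_TABLE)        # strip ASCII punctuation
    return text.split()                       # whitespace tokenization
    # Stemming would follow here, using a stemmer of choice.


print(preprocess("Tomorrow @abcd 111111 kkkkkkkkkkk https://erisk.irlab.org"))
```

The order of the steps matters in this sketch: URLs and handles are removed before punctuation is stripped, otherwise their fragments would survive as separate tokens.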
Since the challenge did not provide labeled data, a custom dataset, based on Reddit, was built. For that, the Python Pushshift.io API Wrapper (PSAW, https://github.com/dmarx/psaw) was used to retrieve posts from the Pushshift initiative (https://pushshift.io), in Comma Separated Values (CSV) format. This avoided the limit of 1000 posts that applies when downloading from Reddit directly.

The dataset was built from the r/GamblingAddiction and r/problemgambling communities. In addition, a random set of posts was downloaded to complement the dataset with non-gambling related content (Table 2).

A considerable number of posts was available after downloading: a total of 73064 on gambling issues and 47103 on random subjects. However, extracting data from the CSV files failed for many posts, leaving only 7079 and 2306 usable posts, respectively. This was due to incompatibilities between the post text and the CSV encoding, related to commas (‘,’) in the text and unterminated ‘"’ characters, which made extracting the columns very difficult and error-prone.

{| class="wikitable"
|+ Table 2: Summary of the training data set for the eRisk 2021 Pathological gambling task
! Reddit Community !! Number of Posts !! Usable Posts !! Dataset !! Label
|-
| r/GamblingAddiction || 16528 || 1467 || 1467 || True
|-
| r/problemgambling || 56536 || 5612 || 839 || True
|-
| random || 47103 || 2306 || 2306 || False
|}

To keep the classes balanced, the dataset was built with 2306 posts labeled False and 2306 posts labeled True. Each post was stored in a single file, prefixed with pos or neg followed by a number (e.g. pos_1762.txt, neg_2032.txt). It was decided not to associate or track the users, so each post is individual and not related to any other.

After building the dataset, the most frequent tokens in the gambling related posts (Figure 1a) and in the non-gambling related posts (Figure 1b) were calculated. As expected, tokens like gambi, monei, or stop appear in the vocabulary for gambling posts.
For random posts, like, know and would are very frequent.

Figure 1: The ten most frequent words. (a) Gambling related posts. (b) Non-gambling related posts.

Next, the same operation was performed for bi-grams, to better understand the context of the words (Figure 2). Here, feel like is common to both types of posts, whereas credit card and gambli addict, for example, clearly indicate the type of post.

Figure 2: The ten most frequent bi-grams. (a) Gambling related posts. (b) Non-gambling related posts.

====2.3. Task 2: self-harm dataset====
The training data consists of XML files for 340 subjects, 41 of whom belong to the self-harm group and 299 to the control group (Table 3). The total number of writings in the self-harm group is 7,192, in contrast to 163,506 in the control group. The difference between the groups is also very significant in the average number of writings per subject: 175.4 in the self-harm group and 546.8 in the control group. The average length of the subjects’ writings is 179.1 and 129.2, respectively, and the average number of tokens is 15.12 and 10.6. Although the control subjects write more posts, these are, on average, shorter. The dataset also includes the test writings, in the same format. They are included in Table 3 for completeness.
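The token, bi-gram and tri-gram frequency counts used to inspect the datasets can be approximated with a simple counter over the pre-processed tokens. The sketch below is an illustration only; the function names `ngrams` and `most_frequent_ngrams` are hypothetical and the sample documents are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_frequent_ngrams(documents, n=2, top_k=10):
    """Count n-grams over all documents and return the top_k most common."""
    counts = Counter()
    for tokens in documents:
        # n-grams are built per document, so they never span two posts.
        counts.update(ngrams(tokens, n))
    return counts.most_common(top_k)


# Made-up, already pre-processed posts.
docs = [["feel", "like", "gambl"], ["feel", "like", "stop"]]
print(most_frequent_ngrams(docs, n=2, top_k=3))
```

Calling the same function with `n=1` gives plain token frequencies and with `n=3` gives tri-grams.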
{| class="wikitable"
|+ Table 3: Summary of the data set for the eRisk 2021 Self-harm task
! rowspan="2" |
! colspan="2" | Train
! colspan="2" | Test
! colspan="2" | Full
|-
! Self-harm !! Control !! Self-harm !! Control !! Self-harm !! Control
|-
| Subjects || 41 || 299 || 104 || 319 || 145 || 618
|-
| Min Posts || 8 || 10 || 9 || 9 || 8 || 0
|-
| Max Posts || 997 || 1992 || 942 || 1990 || 997 || 1992
|-
| Total Posts || 7192 || 163506 || 11691 || 92146 || 18883 || 255652
|-
| Avg Posts || 175.4 || 546.8 || 112.4 || 288.9 || 130.2 || 413.68
|-
| Min Length || 0 || 0 || 0 || 0 || 0 || 0
|-
| Max Length || 5880 || 54796 || 6627 || 56651 || 6627 || 56651
|-
| Total Length || 1288542 || 21129774 || 1823906 || 7339145 || 3112448 || 28468919
|-
| Avg Length || 179.1 || 129.2 || 156 || 79.6 || 164.8 || 111.4
|-
| Min Tokens || 0 || 0 || 0 || 0 || 0 || 0
|-
| Max Tokens || 546 || 3342 || 559 || 1334 || 559 || 3342
|-
| Total Tokens || 108752 || 1730021 || 148204 || 568180 || 256956 || 2298201
|-
| Avg Tokens || 15.12 || 10.6 || 12.7 || 6.2 || 13.6 || 9
|}

In addition, not all posts are in the same language. Using OpenNLP’s language detection model, a total of 81 different languages were counted. Table 4 shows the 15 most frequent languages within the writings.

{| class="wikitable"
|+ Table 4: Different languages in the training set
! Code !! Language !! Count
|-
| eng || English || 107123
|-
| tur || Turkish || 44288
|-
| cmn || Mandarin Chinese || 2353
|-
| war || Waray || 1625
|-
| lat || Latin || 1423
|-
| min || Minangkabau || 1385
|-
| plt || Plateau Malagasy || 1123
|-
| afr || Afrikaans || 1081
|-
| vol || Volapük || 983
|-
| mri || Maori || 973
|-
| por || Portuguese || 781
|-
| epo || Esperanto || 596
|-
| nob || Norwegian Bokmål || 499
|-
| ron || Romanian || 391
|-
| ceb || Cebuano || 364
|}

The dataset uses binary labels on the subjects, as having (positive) or not having (negative) self-harm (ground truth). As seen in Table 3, each subject has an arbitrary number of posts, and it is not expected that all of them will be strictly related to whether a user self-harms or not. The main approach in this work is to use machine learning on text to predict whether a message belongs to a positive or negative user, so the classifier should not be trained with just the ground truth. Some selection of the posts has to be made, so that only the self-harm related writings are kept as positive samples in the training set.

Based on Non-Suicidal Self-Injury (NSSI) words [12], a selection was made on the posts to extract individual writings to be used as positive examples.
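A keyword-based selection of this kind, followed by class balancing, might look like the sketch below. It is only an assumption about how such a filter could be written: the `select_training_posts` name, the placeholder keyword, the choice to draw negatives from control-group subjects and the random sampling are all hypothetical, and the actual NSSI word list from [12] is not reproduced here.

```python
import random


def select_training_posts(posts, keywords, seed=0):
    """Pick keyword-matching posts of positive subjects and as many negatives.

    posts    -- list of (tokens, subject_is_positive) pairs
    keywords -- iterable of (stemmed) words marking a post as relevant
    """
    keywords = set(keywords)
    # Positive examples: posts by positive subjects containing a keyword.
    positives = [tokens for tokens, is_positive in posts
                 if is_positive and keywords.intersection(tokens)]
    # Negative candidates: assumed here to be all posts by control subjects.
    negatives = [tokens for tokens, is_positive in posts if not is_positive]
    # Balance the two classes by sampling the same number from each.
    k = min(len(positives), len(negatives))
    rng = random.Random(seed)
    return rng.sample(positives, k), rng.sample(negatives, k)


# Made-up data with a placeholder keyword.
posts = [
    (["keyword", "today"], True),
    (["nice", "weather"], True),
    (["watch", "movi"], False),
    (["keyword", "game"], False),
]
pos, neg = select_training_posts(posts, {"keyword"})
print(len(pos), len(neg))
```

In this sketch a keyword match in a control subject's post does not make it a positive example, since the subject-level ground truth is still used to decide which group a post may come from.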
The examples were written to two directories (pos/ for positive and neg/ for negative) with the following naming scheme: subject280_2.txt, where the first number is the subject number and the second is that subject’s post number. After selecting writings based on NSSI words, and excluding all languages except English, a total of 391 positively labeled writings remained. For balance, the same number of negatively labeled writings was selected.

Token frequencies were also extracted from both the positive and the negative writings. In this case, the bi-grams (Figure 3) and tri-grams (Figure 4) are presented.

Figure 3: The ten most frequent bi-grams. (a) Self-harm related posts. (b) Non-self-harm related posts.

Figure 4: The ten most frequent tri-grams. (a) Self-harm related posts. (b) Non-self-harm related posts.

===3. Proposed methods===
This section presents the models and experiments conducted for the eRisk 2021 tasks. First, the classification methods require that text be converted to vectors.

====3.1. Vectorizers====
All the methods rely on the vectorization of the subjects’ writings. Two vectorizers were trained, based on TfIdf and Word2Vec, both with the same text pre-processing techniques (Section 2.1).
* TfIdf:
** minimum word frequency = 2;
* Word2Vec:
** minimum word frequency = 5;
** number of iterations = 1;
** number of epochs = 5;
** layer size = 128;
** window size = 5;

====3.2. Classifiers====
Three classification models were built for the tasks.
The first is a simple Logistic Regression layer using the TfIdf vectorizer, used in both Tasks 1 and 2:
* output dimension = 2;
* weight initialization algorithm = XAVIER;
* activation = SOFTMAX;
* optimization algorithm = STOCHASTIC_GRADIENT_DESCENT;
* updater = Nesterovs(0.1, 0.9);
* batch size = 32;

Another classifier was built using a CNN with Word2Vec vectors as input, used only in Task 2:
* weight initialization algorithm = RELU;
* activation = LEAKYRELU;
* updater = Adam(0.01);
* convolution mode = SAME;
* l2 = 0.0001;
* convolution layer 1 = [128, 100], kernel size = [3, 128];
* convolution layer 2 = [128, 100], kernel size = [4, 128];
* convolution layer 3 = [128, 100], kernel size = [5, 128];
* merge(cl1, cl2 and cl3);
* global pooling with dropout = 0.5;
* loss function = MCXENT;
* dense layer = [100, 2], activation = SOFTMAX;

Finally, a classifier based on LSTM with Word2Vec vectors as input was used in both Tasks 1 and 2:
* updater = Adam(5e-3);
* l2 = 1e-5;
* weight initialization algorithm = XAVIER;
* lstm layer = [128, 256], activation = TANH;
* lstm output layer = [256, 2], activation = SOFTMAX, loss function = MCXENT;

===4. Analysis of the results===
In Task 1, according to the eRisk 2021 evaluation report, the maximum number of user writings was 2000. Of these, only 271 were processed, in 1 day, 5 hours, 44 minutes and 10 seconds, until the servers were shut down. The unavailability, at the time, of an additional GPU made processing much slower and, as such, 1 day was not enough to process the whole set. Two runs were executed, based on LSTM and TfIdf (Table 5). The final results are far from the best in all metrics. Nevertheless, the LSTM performed marginally better in F1 and latency-weighted F1 than TfIdf, a much simpler classifier.

In Task 2, the maximum number of user writings was 1999. Of these, only 369 were processed, taking 1 day, 9 hours, 51 minutes and 27 seconds.
As before, and although a GPU was available for this task, the system was not able to process all the writings of the test users before the server was shut down. Three runs were executed, based on LSTM, CNN and TfIdf (Table 6).

{| class="wikitable"
|+ Table 5: Task 1 runs
! Run !! Method !! P !! R !! F<sub>1</sub> !! ERDE<sub>5</sub> !! ERDE<sub>50</sub> !! latency<sub>TP</sub> !! speed !! latency-weighted F<sub>1</sub>
|-
| 0 || LSTM || .076 || 1 || .142 || .079 || .060 || 2 || .996 || .141
|-
| 1 || TfIdf || .070 || 1 || .131 || .066 || .065 || 1 || 1 || .131
|}

{| class="wikitable"
|+ Table 6: Task 2 runs
! Run !! Method !! P !! R !! F<sub>1</sub> !! ERDE<sub>5</sub> !! ERDE<sub>50</sub> !! latency<sub>TP</sub> !! speed !! latency-weighted F<sub>1</sub>
|-
| 0 || LSTM || .11 || .993 || .199 || .109 || .09 || 2 || .996 || .198
|-
| 1 || CNN || .116 || 1.0 || .207 || .113 || .085 || 2 || .996 || .206
|-
| 2 || TfIdf || .105 || 1 || .19 || .096 || .094 || 1 || 1.0 || .19
|}

The CNN performed marginally better than the LSTM in some metrics, with TfIdf obtaining the lowest F1. Moreover, the algorithms seem to be highly inclined to emit positive decisions, with perfect or near-perfect recall but extremely low precision. Although it is not clear, this may be due to the fact that the posts are processed individually, without any consideration of the previous writings. Some window or accumulator approach could be used to understand if this is the issue.

Overall, the three methods can be improved. Their results were rather close, which indicates that the main issue is the selection of the training dataset. A deeper understanding of the dataset is necessary and, after that, new methods can be devised and tested.

===5. Conclusions===
This paper describes the CeDRI submission to CLEF eRisk 2021 Tasks 1 and 2, on detecting early signs of pathological gambling and self-harm in social media posts. Three methods were presented that classify each writing independently of the others, using only information about the text. The first task is “test only”, so it was necessary to build a training set based on posts collected from Reddit. Task 2 required processing and filtering the writings in order to isolate the posts that refer to self-harm from the others, and using these to train the classifiers.
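Because each writing is classified independently, a single positive post is enough to flag a user. One way the window or accumulator idea raised in the analysis could be explored is sketched below. This is purely hypothetical and was not part of the submitted runs: the `WindowedAlert` class, the window size and the threshold are illustrative assumptions, and the per-post probabilities would come from any of the classifiers.

```python
from collections import deque


class WindowedAlert:
    """Flag a user only when recent post-level scores are high on average."""

    def __init__(self, window_size=5, threshold=0.7):
        self.scores = deque(maxlen=window_size)  # keeps the last N scores
        self.threshold = threshold
        self.alerted = False

    def update(self, post_probability):
        """Add the score of the newest post and return the user decision."""
        if self.alerted:
            return True  # a positive decision is final
        self.scores.append(post_probability)
        window_full = len(self.scores) == self.scores.maxlen
        mean_score = sum(self.scores) / len(self.scores)
        if window_full and mean_score >= self.threshold:
            self.alerted = True
        return self.alerted


# Made-up post-level probabilities for one user, in chronological order.
alert = WindowedAlert(window_size=3, threshold=0.6)
decisions = [alert.update(p) for p in [0.9, 0.1, 0.2, 0.9, 0.9, 0.1]]
print(decisions)
```

With these values the isolated high score on the first post does not trigger a decision; the alert is raised only once two high scores fall inside the same window. The trade-off is a longer delay before any positive decision, which affects the latency-based metrics.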
Due to the simplicity of the classifiers used, state-of-the-art results were not expected. The main purpose was to understand the effectiveness of building training sets based on simple heuristic filters. For future work, the inclusion of more features, such as Part of Speech (PoS) frequency, post date and time, and others, should be studied.

===Acknowledgments===
This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UIDB/05757/2020.

===References===
[1] D. Marengo, C. Montag, C. Sindermann, J. D. Elhai, M. Settanni, Examining the links between active Facebook use, received likes, self-esteem and happiness: A study using objective social media data, Telematics and Informatics 58 (2021) 101523. URL: https://linkinghub.elsevier.com/retrieve/pii/S0736585320301829. doi:10.1016/j.tele.2020.101523.

[2] L. Faelens, K. Hoorelbeke, B. Soenens, K. Van Gaeveren, L. De Marez, R. De Raedt, E. H. Koster, Social media use and well-being: A prospective experience-sampling study, Computers in Human Behavior 114 (2021) 106510. URL: https://linkinghub.elsevier.com/retrieve/pii/S0747563220302624. doi:10.1016/j.chb.2020.106510.

[3] X. Chen, Z. Pan, A review on assessment, early warning and auxiliary diagnosis of depression based on different modal data, in: Z. Pan, X. Hei (Eds.), Twelfth International Conference on Graphics and Image Processing (ICGIP 2020), SPIE, Xi’an, China, 2021, p. 75. URL: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/11720/2589413/A-review-on-assessment-early-warning-and-auxiliary-diagnosis-of/10.1117/12.2589413.full. doi:10.1117/12.2589413.

[4] B. Moulahi, J. Azé, S. Bringay, DARE to Care: A Context-Aware Framework to Track Suicidal Ideation on Social Media, in: A. Bouguettaya, Y. Gao, A. Klimenko, L. Chen, X. Zhang, F. Dzerzhinskiy, W. Jia, S. V. Klimenko, Q. Li (Eds.), Web Information Systems Engineering – WISE 2017, volume 10570 of Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 346–353. URL: http://link.springer.com/10.1007/978-3-319-68786-5_28. doi:10.1007/978-3-319-68786-5_28.

[5] Z. Zhang, G. Bors, “Less is more”: Mining useful features from Twitter user profiles for Twitter user classification in the public health domain, Online Information Review 44 (2019) 213–237. URL: https://www.emerald.com/insight/content/doi/10.1108/OIR-05-2019-0143/full/html. doi:10.1108/OIR-05-2019-0143.

[6] D. E. Losada, F. Crestani, J. Parapar, eRISK 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental Foundations, in: G. J. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, volume 10456 of Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 346–360. URL: http://link.springer.com/10.1007/978-3-319-65813-1_30. doi:10.1007/978-3-319-65813-1_30.

[7] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2018: Early Risk Prediction on the Internet (extended lab overview) (2018) 20.

[8] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2019 Early Risk Prediction on the Internet, in: F. Crestani, M. Braschler, J. Savoy, A. Rauber, H. Müller, D. E. Losada, G. Heinatz Bürki, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 340–357. doi:10.1007/978-3-030-28577-7_27.

[9] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2020: Early Risk Prediction on the Internet, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, volume 12260 of Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 272–287. URL: https://link.springer.com/10.1007/978-3-030-58219-7_20. doi:10.1007/978-3-030-58219-7_20.

[10] S. J. Russell, P. Norvig, Artificial intelligence: a modern approach, Pearson series in artificial intelligence, fourth ed., Pearson, Hoboken, 2021.

[11] D. Losada, F. Crestani, A Test Collection for Research on Depression and Language Use, in: Proc. of Experimental IR Meets Multilinguality, Multimodality, and Interaction, 7th International Conference of the CLEF Association, CLEF 2016, Evora, Portugal, 2016, pp. 28–39.

[12] M. M. Greaves, C. Dykeman, A Corpus Linguistic Analysis of Public Reddit Blog Posts on Non-Suicidal Self-Injury, arXiv:1902.06689 [cs] (2019). URL: http://arxiv.org/abs/1902.06689.