
Early detection of anorexia using RNN-LSTM and SVM classifiers

Akshaya Ranganathan

Haritha A

haritha16038g@cse.ssn.edu.in

Thenmozhi D

Chandrabose Aravindan

aravindancg@ssn.edu.in

Department of CSE, SSN College of Engineering, Chennai

2018

Social media text analysis has enabled a variety of applications in the medical domain, a major example being the detection and treatment of harmful mental disorders. Anorexia is a deadly psychiatric eating disorder, typically characterized by alarmingly low body weight and a distorted body image, with an unreasonable sense of being overweight. With developments in the field of Natural Language Processing, such highly lethal disorders can be identified and mitigated in their early stages, sparing the victim considerable mental and physical harm. Task 1 of the CLEF 2019 eRisk lab focuses on the early prediction of anorexia from posts sourced from social media platforms. Our team, SSN-NLP, used variations of two major models for sentiment classification: a deep learning RNN-LSTM and a traditional SGD classifier. CLEF eRisk released user-specific data consisting of successive posts extracted from Reddit, which we used in its entirety for training, testing, evaluation and scoring. With the help of RAKE (Rapid Automatic Keyword Extraction), numeric scores were obtained to estimate the level of anorexia/self-harm. SSN-NLP submitted 5 model variants to the server, which repeatedly accepted submissions and released user writings to the participating teams. According to the ERDE-50 and F1 scores, our 2-layer LSTM with normed Bahdanau attention performed best, with scores of 0.07 and 0.33 respectively.

Keywords: Anorexia, Early detection, Deep learning, Machine learning, LSTM, Natural language processing, SVM

Anorexia nervosa is a potentially life-threatening psychiatric disorder characterized by extreme unhappiness with one's body image and an intense desire to lose weight, even when body weight is already below what is considered normal. In the age of Instagram celebrities showing off their perfectly toned bodies, internet culture has created harsh rules that people, especially teenagers, are expected to adhere to. According to a study by the National Eating Disorders Association¹, at any given point in time 0.3-0.4% of women and 0.1% of men suffer from anorexia nervosa. DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) gives definitions and diagnostic material for mental disorders. According to DSM-5, anorexia nervosa is characterized by the following criteria²:

1. Restriction of energy intake relative to requirements, leading to significantly low body weight in the context of age, sex, developmental trajectory, and physical health.
2. Intense fear of gaining weight or becoming fat, even though underweight.
3. Disturbance in the way in which one's body weight or shape is experienced, undue influence of body weight or shape on self-evaluation, or denial of the seriousness of the current low body weight.

However, another serious type of anorexia is atypical anorexia, in which a person maintains a healthy weight despite consistent weight loss. Types of anorexia include:

1. Binge/purge type: A person tries to purge by over-exercising or even vomiting after eating, in an attempt to compensate for the weight gained by eating.
2. Restrictive type: A person imposes harsh restrictions on the quantity of food consumed, which in most cases is barely sufficient for survival.

eRisk 2019 primarily focuses on early detection of risk on the internet. The primary goal is to use text mining solutions for early detection in various areas, such as detecting people with suicidal tendencies or a tendency to fall prey to criminal organizations [4]. The aim of Task 1 of eRisk 2019 is to detect symptoms of anorexia as early as possible. Early detection technologies using text processing can be employed in different areas, particularly those related to health and safety. A few applications of early detection include the areas of sexual predators, mental disorders and cyber-bullying.

Prediction is broadly divided into two stages: the training stage and the test stage. In the training stage, eRisk released chunks of training data as well as the test data of eRisk 2018. The chunks consisted of user writings posted on Reddit together with their classification results. Users are classified as anorexic or non-anorexic. During the testing stage, an automatic server repeatedly accepted our submissions and released test data batch by batch. The task evaluates the earliness of predictions in addition to their correctness, and it aims to obtain a scoring system based on the level of alert.

Related Works

Extant methods to detect anorexia can be categorized into two types. The first is the analysis of changes in behavioral patterns by general physicians, as well as by friends and family of the patient, through structured mental analysis. According to a study that weighs the importance of the primary care physician in detecting eating disorders, a series of questions is used, and the presence of anorexia is assessed by examining the answers to each of them [16]. A few examples include "What did you eat yesterday?", "Do you ever binge eat (eat more than you want) or use laxatives, diuretics, or diet pills?" and "Do you think you are thin (too thin)?". The second method [9] involves the use of sentiment analysis on social media posts. For example, one research work showed that students with signs of depression use more personal pronouns like 'I' and more words with negative valence (e.g. gloomy, sad). eRisk aims at early detection of anorexic tendencies by analyzing the posts of users on Reddit. One such approach involves the Bag of Words (BoW) model [13], which uses a vocabulary comprising all the unique words in the text and performs vectorization by assigning a specific weight to each word. The term weighting for the BoW model is split into three components: a term frequency component, a document frequency component, and a normalization component. Another approach uses UMLS-based MetaMap [9] assistance for keyword detection. Traditional learning algorithms (e.g. SVM, logistic regression, RF) were then applied to the information collected by the methods mentioned above. Yet another approach used TF-IDF, similar to the works mentioned before, but adopted a deep learning approach using CNN-LSTM [3].

Our work uses Recurrent Neural Networks with Long Short-Term Memory (LSTM) to analyze patterns and make predictions on sequences of text. Rapid Automatic Keyword Extraction (RAKE) was used to identify the most frequently occurring keywords relating to anorexia in the training data. The results were combined to devise a prediction and risk-based scoring system.

¹ https://www.nationaleatingdisorders.org/statistics-research-eating-disorders
² https://www.nationaleatingdisorders.org/learn/by-eating-disorder/anorexia

Dataset Analysis

Dataset analysis - Task 1

This year's Task 1 was an extension of Task 2 of CLEF eRisk 2018; the training data [14] for this year's task was a combination of both the test and training data of the previous year. Reddit, much like Twitter, offers an API with Python support that can be used to scrape the required data effectively. Twitter sentiment analysis [2] has proven to be a powerful indicator of mental illnesses like depression and PTSD. While the training data was categorized into negative and positive examples, the labels of the test data had to be extracted from the file riskgolden-truth-test.txt and mapped onto the actual writings of the users. Each document had an XML tree structure comprising the tags INDIVIDUAL, ID, WRITING, TITLE, DATE and TEXT. The training examples covered the writings of 152 users, compared with 320 users for the test examples; from all of these, only the TEXT and TITLE elements were extracted to be fed as training data. Table 1 summarizes the dataset.

Table 1. Summary of the dataset.
Attribute                  Value
Number of users
Positives/Negatives
Number of documents
Avg documents per user     558.26
Avg words per document     184.54

Data were represented as positive-examples and negative-examples chunks, each containing XML files of the writings of a certain subject. Using the XML ElementTree library of Python, the TEXT elements of each file were consolidated (see Fig. 1). To flatten out the discrepancies in the data set, all special characters, erroneous blank spaces and empty strings (NULL) were removed using regular expressions. The data were cleaned to match the input expected by the Neural Machine Translation model. Cleaned text and the respective labels were stored as comma-separated values using a file writer in Python. A vocabulary file comprising all unique words in the training set was built to be fed into the deep learning model.
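The extraction and cleaning steps described above can be sketched as follows. This is a minimal illustration rather than the code used for the submission: the sample XML content, the helper names `clean` and `extract_writings`, and the exact regular expressions are assumptions; only the tag names and the use of ElementTree and regular expressions come from the text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample in the tag layout described in the text
# (INDIVIDUAL, ID, WRITING, TITLE, DATE, TEXT); the content is invented.
SAMPLE_XML = """<INDIVIDUAL>
  <ID>subject0001</ID>
  <WRITING>
    <TITLE> First   post!! </TITLE>
    <DATE>2018-01-01 10:00:00</DATE>
    <TEXT>I skipped lunch again... #tired</TEXT>
  </WRITING>
  <WRITING>
    <TITLE></TITLE>
    <DATE>2018-01-02 11:30:00</DATE>
    <TEXT>   </TEXT>
  </WRITING>
</INDIVIDUAL>"""


def clean(text):
    """Replace special characters with spaces and collapse whitespace."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text or "")
    return re.sub(r"\s+", " ", text).strip()


def extract_writings(xml_string):
    """Return (subject id, cleaned TITLE + TEXT per writing), skipping empty ones."""
    root = ET.fromstring(xml_string)
    subject_id = (root.findtext("ID") or "").strip()
    writings = []
    for writing in root.iter("WRITING"):
        title = clean(writing.findtext("TITLE"))
        body = clean(writing.findtext("TEXT"))
        merged = " ".join(part for part in (title, body) if part)
        if merged:  # drop writings that are empty after cleaning
            writings.append(merged)
    return subject_id, writings


subject, posts = extract_writings(SAMPLE_XML)
# Vocabulary of unique (lower-cased) words, as fed to the model.
vocabulary = sorted({word.lower() for post in posts for word in post.split()})
```

The second writing in the sample is dropped because both its title and text are empty after cleaning, which mirrors the removal of empty strings mentioned in the text.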

Data augmentation

Because positive examples were sparse in the training set, data augmentation was carried out using the following mechanism.

Synonym generation using POS tagging: Using the POSTagger module³, the various parts of speech were identified in each positive example of the text. After identification, the NLTK WordNet⁴ module found synonyms for adjectives (JJ) and adverbs (RB), and the dataset was populated with the replaced text, which led to a significant increase in the number of tuples in our dataset. As shown in Fig. 3, the POS tagger splits each sentence into the relevant parts of speech, and WordNet (Fig. 2) generates synonyms for each word. Multiple sentences of anorexia-positive users were added to the dataset by replacing each adverb and adjective in a sentence with entries from its list of most relevant synonyms. Take the example sentence "My body is so heavy that I actively need to exercise every moment of the day." The POS tagger identifies "heavy" and "actively" as adjective and adverb respectively. The synset lookup gives weighty, hefty, big and massive as synonyms for heavy, and effectively, usefully and productively as synonyms for actively. Sentences with combinations of these synonyms are then generated. Nearly 45,000 sentences were added to our dataset through this methodology.

³ https://github.com/nltk/nltk
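The combination step of this augmentation can be sketched as below. To keep the example self-contained, the tagger and WordNet outputs are replaced by hand-written stand-ins taken from the example sentence; in a real pipeline both would come from NLTK. The names `TAGGED`, `SYNONYMS` and `augment` are hypothetical, and the POS tags shown are illustrative.

```python
from itertools import product

# Stand-in for POS tagger output on the example sentence (tags illustrative).
TAGGED = [
    ("My", "PRP$"), ("body", "NN"), ("is", "VBZ"), ("so", "RB"),
    ("heavy", "JJ"), ("that", "IN"), ("I", "PRP"), ("actively", "RB"),
    ("need", "VBP"), ("to", "TO"), ("exercise", "VB"), ("every", "DT"),
    ("moment", "NN"), ("of", "IN"), ("the", "DT"), ("day", "NN"),
]

# Stand-in for the WordNet lookup, using the synonyms listed in the text.
SYNONYMS = {
    "heavy": ["weighty", "hefty", "big", "massive"],
    "actively": ["effectively", "usefully", "productively"],
}

REPLACEABLE_TAGS = {"JJ", "RB"}  # adjectives and adverbs


def augment(tagged_tokens, synonyms):
    """Yield every sentence obtained by swapping adjectives/adverbs for synonyms."""
    options = []
    for word, tag in tagged_tokens:
        if tag in REPLACEABLE_TAGS and word.lower() in synonyms:
            options.append([word] + synonyms[word.lower()])
        else:
            options.append([word])
    original = tuple(word for word, _ in tagged_tokens)
    for combination in product(*options):
        if combination != original:  # keep only genuinely new sentences
            yield " ".join(combination)


new_sentences = list(augment(TAGGED, SYNONYMS))
```

With five choices for "heavy" (the original plus four synonyms) and four for "actively", the single example sentence yields 19 new sentences, which illustrates how quickly the number of positive examples grows.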

Proposed methodologies and Implementation

Deep learning approach - Neural Machine Translation

Task 1's primary goal was to classify each user as anorexia-positive or anorexia-negative. We used a deep learning approach based on Neural Machine Translation (NMT) to solve this classification problem. The basic architecture of NMT is a sequence-to-sequence (Seq2Seq) model built on the concept of an encoder-decoder [15]. The encoder converts the input sequence into a thought vector, while the decoder maps it to a target language. In our case, the decoder maps the input sequences to two classes: positive and negative indication of anorexia. Our deep learning approach for sentiment classification was implemented using TensorFlow code based on the tutorial code released for Neural Machine Translation⁵ [7], which was developed from Seq2Seq models [12, 1, 6].

⁴ https://github.com/wordnet/wordnet
⁵ https://github.com/tensorflow/nmt

NMT was implemented with LSTM (Long Short-Term Memory), which is trained to remember only the important parts of each input sentence and to forget the rest. The output is thus a combination of the prediction for the current input sentence and the memory of important parts of previous sentences. LSTM captures long-term dependencies using three gates:

- Forget gate: decides what part of the previous cell state must be forgotten.
- Input gate: responsible for adding information to the cell state.
- Output gate: responsible for selecting the useful information to output from the current cell state.

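\begin{align*}
i_t &= \sigma\left(w_x^{(i)} x + w_h^{(i)} h_{t-1} + b^{(i)}\right) \tag{1}\\
f_t &= \sigma\left(w_x^{(f)} x + w_h^{(f)} h_{t-1} + b^{(f)} + 1\right) \tag{2}\\
o_t &= \sigma\left(w_x^{(o)} x + w_h^{(o)} h_{t-1} + b^{(o)}\right) \tag{3}\\
\tilde{c}_t &= \tanh\left(w_x^{(c)} x + w_h^{(c)} h_{t-1} + b^{(c)}\right) \tag{4}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{5}\\
h_{b/f} &= o_t \odot \tanh(c_t) \tag{6}
\end{align*}

where the $w$ terms are the weight matrices, $h_{t-1}$ is the hidden layer state at time $t-1$, $i_t$, $f_t$ and $o_t$ are the input, forget and output gates respectively at time $t$, and $h_{b/f}$ is the hidden state of the backward or forward LSTM cells. Here $\sigma$ denotes the logistic sigmoid and $\odot$ element-wise multiplication. Four different NMT variations were implemented for runs 1-4 of our submissions.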

- Model 1: 2-layer bidirectional LSTM with scaled Luong attention
- Model 2: 4-layer bidirectional LSTM with scaled Luong attention
- Model 3: 2-layer bidirectional LSTM with normed Bahdanau attention
- Model 4: 4-layer bidirectional LSTM with normed Bahdanau attention
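The gate equations given before the list of models can be sketched for a single time step in plain Python. This is only an illustration of the arithmetic, not the TensorFlow NMT implementation used for the runs; the function names, the list-based matrices and the all-zero toy weights are assumptions. The constant 1 added inside the forget gate follows the forget-gate equation in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w_x, x, w_h, h_prev, b):
    """Return w_x*x + w_h*h_prev + b for list-of-lists weight matrices."""
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h_prev))
        + bias
        for row_x, row_h, bias in zip(w_x, w_h, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps "i", "f", "o", "c" to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return affine(w_x, x, w_h, h_prev, b)

    i_t = [sigmoid(z) for z in pre("i")]         # input gate
    f_t = [sigmoid(z + 1.0) for z in pre("f")]   # forget gate with the +1 term
    o_t = [sigmoid(z) for z in pre("o")]         # output gate
    c_tilde = [math.tanh(z) for z in pre("c")]   # candidate cell state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy setup (hypothetical): input size 3, hidden size 2, all weights zero.
zero = ([[0.0] * 3] * 2, [[0.0] * 2] * 2, [0.0] * 2)
params = {name: zero for name in "ifoc"}
h_t, c_t = lstm_step([0.2, -0.1, 0.4], [0.0, 0.0], [1.0, -2.0], params)
```

With zero weights the forget gate evaluates to sigmoid(1), roughly 0.73, so most of the previous cell state is retained by default; this is the effect of the +1 term in the forget-gate equation.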

Traditional Learning Approach

TF-IDF is used to assign weights to words in order to find the important ones. TF stands for term frequency, a measure of how often a word occurs in a given document [10]. It is calculated by dividing the number of occurrences of a given word by the total number of words in the document. However, words like "a" and "the" occur many times without being significant, so we also calculate the inverse document frequency (IDF), which down-weights words that appear in many documents.
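The weighting described here can be sketched as follows. TF is computed as in the text (occurrences divided by document length); the text does not spell out the IDF formula, so the common form log(N / df) is assumed, and the toy documents and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # TF = count / document length; IDF = log(N / df) (assumed form).
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    "the scale said the same".split(),
    "the diet starts again".split(),
]
weights = tf_idf(docs)
```

In this toy corpus "the" occurs in both documents, so its IDF is log(1) = 0 and its weight vanishes, while a word such as "scale" that occurs in only one document keeps a positive weight.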

\[ \text{Weights} = \text{TF} \times \text{IDF} \tag{7} \]

Stochastic Gradient Descent [8] is essentially gradient descent with a batch size of 1, and it works effectively when redundant data is present. The SGD classifier of sklearn performs stochastic gradient descent optimization on an SVM classification model. Stochastic gradient descent has proven useful especially for large datasets and has found increased usage in several text mining applications [10]. After data augmentation, the dataset was cleaned and fed to the model. The training accuracy of the model was found to be 90%.

- Model 0: SVM classifier with SGD optimization using TF-IDF

The motive behind Task 1 of eRisk 2019 was to facilitate the early prediction of anorexia. This year, the organizers added another element to the submissions: a score of positivity or negativity. The score is a numeric estimate of the level of anorexia/self-harm. Using this score, CLEF 2019 adopts ranking-based measures for the evaluation of participants. The Rapid Automatic Keyword Extraction (RAKE) module [11] was used to identify the most frequently occurring keywords in our training set and to calculate the score based on these keywords. The input parameters for RAKE comprise a list of stop words (or stoplist), usually provided by NLTK for the English language, a set of phrase delimiters, and a set of word delimiters. RAKE uses the stop words and phrase delimiters to segment the text into candidate keywords. The number of times each word occurs in the document gives its frequency score, and the number of times each keyword occurs together with other keywords gives its co-occurrence score.

\[ \text{final score} = \frac{\text{co-occurrence score}}{\text{frequency score}} \tag{8} \]

RAKE eliminates words that occur very frequently in the document but are of trivial relevance. Using co-occurring keywords, we successfully mined pairs like body-mass, anorexia-nervosa, purge-eating and binge-eating.
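A compact sketch of the RAKE scoring just described is given below. It follows the usual formulation, in which the co-occurrence score of a word is its degree (the summed length of the candidate phrases it appears in, counting the word itself) and a phrase is scored by summing its word scores. The tiny stoplist, the delimiter set and the sample text are hypothetical; the submitted system relied on an existing RAKE module rather than this code.

```python
import re
from collections import defaultdict

# Tiny hypothetical stoplist; a real run would use a full English stoplist.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "my", "i", "to", "after"}


def candidate_phrases(text, stopwords):
    """Split on phrase delimiters, then break each fragment at stop words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9']+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text, stopwords=STOPWORDS):
    """Return {phrase: score}, scoring each word as co-occurrence / frequency."""
    phrases = candidate_phrases(text, stopwords)
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including the word itself
    word_score = {word: degree[word] / frequency[word] for word in frequency}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}


scores = rake_scores("Binge eating and purging. My body mass is low after binge eating.")
```

On the sample text, the two-word candidates "binge eating" and "body mass" receive higher scores than the single words "purging" and "low", which is how multi-word keyword pairs like those listed in the text surface.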
The fundamental difference between RAKE and TF-IDF scores is that RAKE finds word phrases in a single document and assigns relevance scores to them, while TF-IDF uses multiple documents to assign a score to a single word. Since our work required a single but voluminous training document, RAKE outperformed its TF-IDF counterpart. To achieve stable prediction scores, we used a function that checks the following:

- If a user classified as anorexia-positive has stopped posting altogether, the score was significantly increased, causing a high level of alert.
- If a user was classified positive in both the current and previous runs, the score was boosted so as to confirm the decision of positive anorexia as early as possible.
- If a user was classified positive in the previous run but negative in the current run, the score was balanced out, waiting for further writings to make the ultimate decision.
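The three checks above can be sketched as a small function. The text does not give the actual adjustment amounts, so the multipliers `boost` and `alert_boost`, and the choice of averaging with the previous score to "balance out" a changed decision, are purely illustrative assumptions, as is the function name.

```python
def adjust_score(score, previous_score, is_positive, was_positive,
                 stopped_posting, boost=1.5, alert_boost=2.0):
    """Stabilise a user's risk score across runs (multipliers are hypothetical)."""
    if is_positive and stopped_posting:
        # Positive user who stopped posting: raise a high level of alert.
        return score * alert_boost
    if is_positive and was_positive:
        # Positive in both runs: boost to confirm the decision early.
        return score * boost
    if was_positive and not is_positive:
        # Decision flipped to negative: balance out and wait for more writings.
        return (score + previous_score) / 2.0
    return score


confirmed = adjust_score(4.0, 6.0, True, True, False)
flipped = adjust_score(2.0, 6.0, False, True, False)
```

Ordering the checks this way gives the "stopped posting" case priority over the plain confirmation case, which is one reasonable reading of the rules listed in the text.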

Results and Evaluation

Decision based evaluation

According to the task, several methods of evaluation were considered [5]. Evaluation of results was initially based on the Early Risk Detection Error (ERDE), which measures both the correctness of a decision and the delay taken to arrive at it.
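The text does not reproduce the ERDE formula, so the sketch below assumes the standard definition from the eRisk overview papers: a false positive costs c_fp, a false negative costs c_fn, a true positive costs c_tp scaled by a latency factor lc_o(k) = 1 - 1/(1 + e^(k - o)), and a true negative costs nothing. The cost values, the sample records and the function names are hypothetical.

```python
import math


def latency_cost(k, o):
    """lc_o(k) = 1 - 1/(1 + e^(k - o)), written in an overflow-safe form."""
    return 1.0 / (1.0 + math.exp(o - k))


def erde(decision, truth, k, o, c_fp, c_fn=1.0, c_tp=1.0):
    """Error for one user, given the decision, ground truth and writings seen k."""
    if decision == 1 and truth == 0:
        return c_fp                        # false positive
    if decision == 0 and truth == 1:
        return c_fn                        # false negative
    if decision == 1 and truth == 1:
        return latency_cost(k, o) * c_tp   # true positive, penalised by delay
    return 0.0                             # true negative


def mean_erde(records, o, c_fp):
    """Average ERDE over (decision, truth, writings seen) records."""
    return sum(erde(d, g, k, o, c_fp) for d, g, k in records) / len(records)


# Hypothetical users: (decision, ground truth, writings seen before deciding).
records = [(1, 1, 5), (1, 0, 3), (0, 1, 10), (0, 0, 7)]
erde_5 = mean_erde(records, o=5, c_fp=0.1)
```

Under this definition a correct positive decision made exactly at the o-th writing already incurs half of c_tp, and the cost approaches the full c_tp for later decisions, which is why ERDE rewards early alerts.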

Precision (P), recall (R) and the F-measure (F) are defined in terms of true positives (TP), false positives (FP) and false negatives (FN):

\begin{align*}
P &= \frac{TP}{TP + FP} \tag{9}\\
R &= \frac{TP}{TP + FN} \tag{10}\\
F &= \frac{2 \cdot P \cdot R}{P + R} \tag{11}
\end{align*}

However, ERDE has certain drawbacks. For example, a system that detects all the true positives still does not get an error of zero. As an alternative, a modification, $ERDE_o^{\%}$, was suggested. This measure considers the percentage of a user's writings seen before making a decision, as opposed to the number of writings. However, it has a major flaw: in real life, the total number of user writings may not be known. Another method, based on $F_{latency}$, was proposed. For a user $u \in U$, $k_u$ writings are seen before making a decision $d_u$, and $g_u$ stands for the ground truth of the decision. The delay in finding true positives is measured as

\[ \text{latency}_{TP} = \operatorname{median}\{k_u : u \in U,\ d_u = g_u = 1\} \]

\[ \text{speed} = 1 - \operatorname{median}\{\text{penalty}(k_u) : u \in U,\ d_u = g_u = 1\} \]

Based on the speed and the F1 score, the latency-weighted F1 score is calculated.

\[ F_{latency} = F \cdot \text{speed} \]

The maximum precision attained by our system is 0.48, whereas the overall maximum among all systems [5] is 0.71. The maximum recall of our system is 0.26, as opposed to an overall maximum of 1. Our maximum F1 score is 0.34, whereas the maximum over all systems is 0.71. Our ERDE5 is relatively low at 0.08, compared with the lowest value of 0.06 among all systems. Our lowest ERDE50 is 0.07, while the overall lowest is 0.03. The speed of a system is 1 if it detects true positives in the first writing of a user; our system's speed is 1. The standard measures of precision and recall, on which the F-measure is based, are calculated as follows:

\[ P = \frac{|\{u \in U : d_u = g_u = 1\}|}{|\{u \in U : d_u = 1\}|}, \qquad R = \frac{|\{u \in U : d_u = g_u = 1\}|}{|\{u \in U : g_u = 1\}|} \]

Yet another factor for evaluating performance is the speed of a system. A speed of 1 indicates that the system predicted true positives in the first writing, as opposed to 0 if the system predicts only after a few hundred writings.

Along with the decision, a score that estimates the level of risk was also calculated for each user. The evaluation algorithm ranks users by decreasing level of risk, and the ranks are recalculated after each set of writings. The rankings are evaluated with the P@10 and NDCG metrics. The relatively long duration between the submissions of the various runs (6 days, 22 hours) can be attributed to the offline processes used by our system. From the released evaluation results, it can be inferred that our models performed very well with respect to early prediction (speed), as the true positives were correctly classified within the first few sets of user writings. Our $F_{latency}$, however, was not on par with a few of the best-performing systems, such as CLAC, which achieved a weighted F1 score of 0.69.

Model 1 (2-layer BLSTM with scaled Luong attention) and Model 4 (4-layer BLSTM with normed Bahdanau attention) showed the best performance, which can be explained by the concepts behind these attention mechanisms. As mentioned in [6], the Luong mechanism simply uses the hidden states at the top LSTM layers in both the encoder and the decoder, which explains why scaled Luong attention worked better with fewer layers (2 layers). The better performance of Bahdanau attention with a deeper network (4 layers) can be explained by the fact that a hidden state in Bahdanau attention goes through a deep-output and a max-out layer before predictions are made [1].
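The decision-based measures defined in this section can be combined in a short sketch. Precision, recall, latency and speed follow the set-based definitions given in the text; the penalty function is not specified there, so the logistic form penalty(k) = -1 + 2/(1 + e^(-p(k-1))) with p = 0.0078 is an assumption, and the sample records and function names are hypothetical.

```python
import math
from statistics import median


def penalty(k, p=0.0078):
    """Assumed penalty: 0 for a decision at the first writing, approaching 1 later."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))


def latency_weighted_f1(records):
    """records: (decision d_u, ground truth g_u, writings seen k_u) per user."""
    tp_delays = [k for d, g, k in records if d == 1 and g == 1]
    predicted_positive = sum(1 for d, _, _ in records if d == 1)
    actual_positive = sum(1 for _, g, _ in records if g == 1)
    if not tp_delays:
        # No true positives: every measure is zero and latency is undefined.
        return {"P": 0.0, "R": 0.0, "F1": 0.0, "latency": None,
                "speed": 0.0, "F_latency": 0.0}
    precision = len(tp_delays) / predicted_positive
    recall = len(tp_delays) / actual_positive
    f1 = 2 * precision * recall / (precision + recall)
    speed = 1.0 - median([penalty(k) for k in tp_delays])
    return {"P": precision, "R": recall, "F1": f1,
            "latency": median(tp_delays), "speed": speed,
            "F_latency": f1 * speed}


# Hypothetical users: two early true positives, one false positive, one miss.
results = latency_weighted_f1([(1, 1, 1), (1, 1, 1), (1, 0, 4), (0, 1, 9)])
```

Because both true positives in the sample are flagged at the first writing, the speed is exactly 1 and the latency-weighted F1 equals the plain F1; later alerts would shrink the speed factor and lower the weighted score.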

Conclusions and Future work

In this paper, we have presented the participation of our team, SSN-NLP, in the eRisk 2019 task on early detection of signs of anorexia. Early risk prediction on the internet is vital to progress in the field of mental health and safety. We treated the task as a classification problem and presented four variations of a deep learning approach based on Neural Machine Translation (NMT), and one traditional learning model, an SVM with an SGD optimizer. The future scope for our model includes complete automation, devoid of any kind of offline processing, and research on other algorithms that could improve our model accuracy.

References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Coppersmith, Glen, Mark Dredze, Craig Harman, Kristy Hollingshead, and Margaret Mitchell: CLPsych 2015 shared task: Depression and PTSD on Twitter. In: Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pp. 31-39 (2015)
3. Liu, Ning, Zheng Zhou, Xin Kang, and Fuji Ren: TUA1 at eRisk 2018 (2018)
4. Losada, David E., Fabio Crestani, and Javier Parapar: Overview of eRisk: Early Risk Prediction on the Internet. In: International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 343-361. Springer, Cham (2018)
5. Losada, David E., Fabio Crestani, and Javier Parapar: Overview of eRisk 2019: Early Risk Prediction on the Internet. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 10th International Conference of the CLEF Association, CLEF 2019 (2019)
6. Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning: Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
7. Luong, T., Brevdo, E., and Zhao, R.: Neural machine translation (seq2seq) tutorial (2017). URL: https://www.tensorflow.org/tutorials/seq2seq (17.02.2018)
8. Robbins, H., Monro, S.: A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407 (1951)
9. Paul, Sayanta, Jandhyala Sree Kalyani, and Tanmay Basu: Early Detection of Signs of Anorexia and Depression Over Social Media using Effective Machine Learning Frameworks (2018)
10. Qaiser, Shahzad, and Ali, Ramsha: Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents. International Journal of Computer Applications 181 (2018). doi:10.5120/ijca2018917395
11. Rose, Stuart, Dave Engel, Nick Cramer, and Wendy Cowley: Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, pp. 1-20 (2010)
12. Sutskever, I., Vinyals, O., and Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104-3112 (2014)
13. Trotzek, M., Koitka, S., and Friedrich, C.M.: Word Embeddings and Linguistic Metadata at the CLEF 2018 Tasks for Early Detection of Depression and Anorexia (2018)
14. Trotzek, Marcel, Sven Koitka, and Christoph M. Friedrich: Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences. IEEE Transactions on Knowledge and Data Engineering (2018)
15. Verma, A. A., and Bhattacharyya, P.: Literature Survey: Neural Machine Translation.
16. Walsh, J. M., et al.: Detection, evaluation, and treatment of eating disorders: the role of the primary care physician. Journal of General Internal Medicine 15(8), pp. 577-590 (2000)