Early Risk Prediction by Means of Deep Learning

Pablo Raez Garcia Retamero

praez@pa.uc3m.es

Isabel Segura Bedmar

isegura@inf.uc3m.es

Universidad Carlos III de Madrid

This work presents our five approaches to early risk detection of anorexia on social media in CLEF eRisk 2019. Our models use different kinds of deep neural networks to classify users who are at risk. We show the effectiveness of our models on the validation and test datasets. The best model obtains an F1 score of 0.57 on the positive class in validation and 0.20 on the test set.

Keywords: Deep Learning, Risk Prediction

1 Introduction

Anorexia is an eating disorder that presents symptoms such as fear of gaining weight or a distorted, delusional perception of one's own body. This disease is often associated with severe psychological alterations that cause changes in emotional behaviour. These psychological alterations are discernible in the behaviour of those affected and are usually reflected in social media as posts and comments. Several anorexia detection methods currently exist [11, 2, 19, 14, 18, 13], which are mainly based on behavioural analysis. Anorexia symptoms are usually very diverse and are often hidden by the subjects of study, which makes a decision harder to reach and delays diagnosis.

Much research has been carried out on the automatic early detection of these symptoms in social media. Despite being a well-known problem, anorexia is still hard to diagnose because of its wide variety of symptoms and the long periods some of them need to show up, as in the case of amenorrhea [5]. Because diagnosis is an arduous task, patients tend to receive treatment in the later stages of anorexia, which in turn makes the therapy longer and more expensive than if the problem had been diagnosed promptly. Accurate automatic detection of anorexia in its early stages would save considerable time and improve the health of patients, who could be treated sooner.

Five different approaches were developed to address this problem. They are explained in further detail in Section 4. The results obtained on both the validation and test datasets are included.

The paper is structured as follows. Section 2 reviews the state of the art of Natural Language Processing techniques applied to the risk prediction domain. Section 3 describes the dataset and tools used. Section 4 describes the methods and the proposed neural architectures. Section 5 presents the results obtained. Finally, Section 6 gathers the conclusions and future work.

2 State of the Art

This section reviews the main works related to early risk prediction on the internet. The use of machine learning techniques to detect mental illnesses such as anorexia is quite recent. Even so, there is a considerable body of literature on the matter [11, 2, 19, 14, 18, 13].

In [19], Deep Learning techniques are applied to the problem of anorexia and depression detection for the CLEF eRisk 2018 tasks [9]. The authors approach the problem by turning it into a sentence classification one, where sentences are classified as positive if they were written by an ill user and negative otherwise. They use the TF-IDF algorithm to obtain the most representative words for each class. Then, the sentences are encoded by means of a Convolutional Neural Network (CNN). They obtained F1 scores of 0.64 and 0.85 as well as ERDE5 values of 8.78 and 11.40 in the depression and anorexia tasks, respectively.

Our first approach is quite similar to the one just described, but we also use word or char embeddings in every model, as well as a fully connected layer after the CNN layers, which has been shown to improve the results of the classifier.

In the approach of [14] to the CLEF eRisk 2018 tasks, different machine learning techniques are presented, such as Linear Regression [12], Support Vector Machines [16], AdaBoost [15], Random Forests [1], and Recursive Neural Networks (RNN) [17]. Texts are represented using different features such as Bag of Words (BoW) and the Unified Medical Language System (UMLS). Experiments show that the best results are obtained with BoW and the AdaBoost and Random Forest classifiers. They obtained F1 scores of 0.58 and 0.67 as well as ERDE5 values of 9.81 and 12.17 in the depression and anorexia tasks, respectively.

The approach of [18] to the CLEF eRisk 2017 task [8] uses several combinations of user-level linguistic metadata, BoW [21], neural word embeddings [3], and CNN [7], obtaining an F1 value of 0.48 and an ERDE5 of 12.73 on the depression task.

There have also been some interesting approaches that are less heavily focused on machine and deep learning techniques, such as the one described in [13], which focuses on Author Profiling (AP). AP consists of analysing texts to predict general or demographic attributes of authors such as gender, age, personality, native language, and political orientation, among others.

3 Materials

This section describes the materials used.

3.1 Dataset

The dataset for this task has the same format as the one described in [10]. The collection provided for training and validation is composed of 152 subjects, of whom 20 are anorexic and 132 are not. The texts from these subjects amount to a total of 253,341 posts and comments, of which 24,874 come from ill subjects and 228,467 from healthy ones. As can be seen, the training set is very unbalanced, which makes the whole task harder to perform.

For every subject, we get all their writings with several information fields: the title of the post (sometimes blank), the date and time, the platform where the post was made (Reddit or other), and the posted text itself.

The test dataset is hosted on a server that iteratively yields user writings to the participating teams. These iterations advance through time, delivering the writings of each user in a more realistic scenario. The server only returns the next writings once all runs of a team for the current timestep have been sent. This dataset comprises 2000 timesteps for over 800 users. The fields of each writing are "id", "nick", "redditor", "title", "content", and "date". The "nick" is used as the subject id, and "title", "content" and "date" are used as their homonyms in the training dataset. "Redditor" and "id" do not correspond to any field of the training dataset. Finally, a further field, "number", indicates the iteration on the test dataset and is used for validation purposes.

3.2 Tools

Google Colab was used to run the experiments. It provides a machine with an Intel(R) Xeon(R) CPU @ 2.20GHz and 12 GB of RAM. The most interesting part for us is the GPU provided, a Tesla K80 with 12 GB of memory as well.

The experiments were developed using Python and its libraries Keras and TensorFlow for the DL models. Other libraries, such as Pandas and NumPy, were used for data processing.

4 Method

This section explains the method followed to develop the approaches, including all the pre- and post-processing of the data.

Fig. 1: Histograms of text length (x-axis: length; y-axis: occurrences). (a) Post length in words, used in models A and C. (b) Total length of each subject's posts in words, used in model B. (c) Post length in characters, used in models D and E.

Different types of neural networks such as RNN and CNN have been used to generate deep learning models, which are further explained below.

As a preprocessing step, all texts are cleaned by removing stop words, numbers, punctuation and words with fewer than three characters. Then a TF-IDF algorithm is used to reduce the volume of words while retaining the most representative ones.
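The cleaning and TF-IDF filtering steps described above can be sketched as follows. This is only an illustrative sketch using the Python standard library: the stop-word list, the scoring rule (maximum TF-IDF of a term over the corpus) and the names `clean` and `top_tfidf_terms` are assumptions, not the exact procedure used in our experiments.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption: the actual list is not specified here).
STOP_WORDS = {"the", "and", "that", "with", "for", "this", "have", "not", "was"}


def clean(text):
    """Lowercase, keep alphabetic tokens only (drops numbers and punctuation),
    then drop stop words and words shorter than three characters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= 3 and t not in STOP_WORDS]


def top_tfidf_terms(documents, k):
    """Return the k terms with the highest TF-IDF score reached in any document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    best = {}
    for doc in documents:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            best[term] = max(best.get(term, 0.0), tf * idf)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


posts = [
    "I am so afraid of gaining weight!!",
    "The weather is nice and the food was great, 10/10.",
    "Afraid of the food, afraid of eating.",
]
docs = [clean(p) for p in posts]
vocabulary = set(top_tfidf_terms(docs, k=5))
# Keep only the most representative words of each post.
filtered = [[t for t in doc if t in vocabulary] for doc in docs]
```

Terms that appear in every document receive an inverse document frequency of zero and are therefore the first to be discarded, which is how the filtering lowers the volume of words.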

For models A, B and C, the posts are tokenized and then cropped or padded to a fixed length, which is 50 words per post in models A and C. This padding and cropping takes place because the input to the neural networks must have a fixed shape. Longer texts are cropped because excessive padding of the shorter ones would add too much noise to the networks, and most texts have fewer than 50 words, as shown in Figure 1a. The length selected for the B model is 350. Unlike the one selected for the previous models, this length is chosen because the network becomes prohibitively large beyond it. The ideal value would have been 1000, as can be seen in Figure 1b.
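A minimal sketch of the cropping and padding step is shown below. It assumes that index 0 is reserved for padding, that padding is appended on the right and that unknown tokens are dropped; these choices, like the helper names `build_vocab` and `pad_or_crop`, are illustrative assumptions rather than a description of our exact pipeline.

```python
def build_vocab(token_lists):
    """Map each token to a positive integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def pad_or_crop(tokens, vocab, max_len=50, pad_id=0):
    """Encode tokens as ids, crop to max_len and right-pad with pad_id."""
    ids = [vocab[t] for t in tokens if t in vocab][:max_len]
    return ids + [pad_id] * (max_len - len(ids))


posts = [["afraid", "gaining", "weight"], ["food", "afraid"]]
vocab = build_vocab(posts)
# Every post now has exactly 50 positions (models A and C); model B would use max_len=350.
batch = [pad_or_crop(p, vocab, max_len=50) for p in posts]
```

Every row of `batch` has the same length, so the rows can be stacked into a single fixed-shape input for an embedding layer.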

Fig. 2: Threshold selection. F1 scores (objective class F1, macro average and weighted average) as a function of the threshold value. (a) A model. (b) C model.

For models D and E, instead of tokenizing the texts and fixing them to a certain number of words, another preprocessing step is added, based on splitting the words into characters. This operation is needed in order to use char embeddings, which have proved useful in NLP tasks [20]. Char embeddings have several advantages over word embeddings. They do not face problems when processing unseen words, as every word is formed of characters. They are also robust against misspelled words. Furthermore, char embeddings are usually low dimensional, which in turn improves the speed of the models. Each text is then fixed to a length of 400 characters. This length was picked by hand with regard to Figure 1c, in the same way as for models A, B and C. Finally, the characters are fed into the different neural networks.
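The character-level encoding can be illustrated with the following sketch. The alphabet (lowercase letters plus space), the reserved ids for padding and unknown characters, and the function name `encode_chars` are assumptions made for the example; only the fixed length of 400 characters comes from the description above.

```python
import string

# Assumed character vocabulary: lowercase letters and space.
# Id 0 is reserved for padding and id 1 for any character outside the alphabet.
ALPHABET = string.ascii_lowercase + " "
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_chars(text, max_len=400, pad_id=0, unk_id=1):
    """Turn a text into a fixed-length list of character ids."""
    ids = [CHAR_TO_ID.get(ch, unk_id) for ch in text.lower()[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))


# A misspelled word is still fully representable: it maps to known character ids.
correct = encode_chars("anorexia", max_len=8)
misspelled = encode_chars("anorexai", max_len=8)
```

The vocabulary here has fewer than thirty entries, which is why the corresponding embedding table can be much smaller than a word-level one.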

Finally, after the processing performed by the different models, the output of the networks is compared with a threshold to determine whether it is a risk situation or not. This threshold was obtained empirically for each model by running several tests in which the threshold value ranged from 0.1 to 0.9. Then, the threshold with the highest F1 for the positive class was selected. The thresholds are shown in Table 1. An illustrative example of this process can be seen in Figure 2, where the evaluation of the A and C thresholds is shown.
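The threshold search can be sketched as a simple sweep. The step of 0.1 between candidate values, the use of `>=` when comparing a score with the threshold and the tie-breaking rule (the lowest threshold wins) are assumptions of this sketch.

```python
def f1_positive(y_true, y_pred):
    """F1 score of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_threshold(y_true, scores, candidates=None):
    """Return (threshold, f1) with the highest positive-class F1 on validation data."""
    if candidates is None:
        # 0.1, 0.2, ..., 0.9 (assumed step size).
        candidates = [round(0.1 * i, 1) for i in range(1, 10)]
    best_threshold, best_f1 = candidates[0], -1.0
    for threshold in candidates:
        y_pred = [1 if s >= threshold else 0 for s in scores]
        f1 = f1_positive(y_true, y_pred)
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Toy validation labels and network outputs (made-up numbers for illustration only).
labels = [0, 0, 1, 1, 0]
outputs = [0.05, 0.35, 0.45, 0.85, 0.25]
threshold, f1 = select_threshold(labels, outputs)
```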

Several experiments were performed to find the best hyper-parameter configuration for each of the models, which can be found in Table 3. The tuned hyper-parameters of each model can be found in detail in Table 2.

Some of the hyper-parameters checked concern the models themselves, such as load emb, emb size, trainable emb, cnn size, rnn size, dropout, dnn size, and batch size. Others are specific to the type of network used. For the CNN, cnn filter determines the size of the kernel used. For the RNN, cell type determines the type of cell used (GRU or LSTM), bidirectional indicates whether the layers are bidirectional, and attention determines whether an attention mechanism is used.

4.1 A model

This model is a simple first approach that classifies the different records independently. The posts are treated as independent and are labelled 0 or 1 depending on whether the user who wrote them belongs to the control class or to the positive class.

This model takes the different texts as input, which go through a word embedding layer whose output is fed to a one-dimensional CNN. Finally, the output of the CNN is fed into a fully connected layer placed just before the output layer (see Fig. 4a).

4.2 B model

This model follows a similar approach to the previous one. In this case, however, instead of taking the texts as independent pieces of information, all the texts of the same user are processed together. The input to the network is therefore all the tokenized text a user has ever posted, and the target value is whether or not the subject is at risk of suffering from anorexia.

This model takes the text input which, as in the previous model, goes through a word embedding layer, whose output is in turn fed to an RNN layer. The result is then fed to a fully connected layer placed just before the output layer (see Fig. 4b).

Fig. 3: Writings per user. Histogram of the number of writings per user (x-axis: number of writings; y-axis: occurrences) for the train and test sets.

4.3 C model

This model is more sophisticated in the sense that it uses previously trained A models to generate what we call "writing embeddings" by means of transfer learning [6]. These embeddings are then fed to an RNN layer, which allows us to process a varying number of texts. This is crucial because the number of texts per user in the dataset is highly variable, as can be seen in Figure 3.
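To illustrate why a recurrent layer copes with a varying number of texts per user, the sketch below runs a plain tanh recurrence over sequences of "writing embeddings" of different lengths and always returns a state of the same size. It is a deliberately simplified stand-in: the models in this work use GRU or LSTM cells with trained weights, whereas the cell type, the toy dimensions and the hand-set weights here are assumptions chosen only for the example.

```python
import math


def rnn_encode(sequence, w_in, w_rec, bias):
    """Run a plain tanh recurrence over a list of vectors and return the final state.

    w_in has shape (hidden, input), w_rec has shape (hidden, hidden),
    bias has shape (hidden,). The same weights are reused at every step,
    so the sequence may contain any number of vectors.
    """
    hidden = [0.0] * len(bias)
    for vector in sequence:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[j], vector))
                + sum(w * h for w, h in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden


# Toy weights: 2-dimensional writing embeddings, 3-dimensional user state.
W_IN = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
W_REC = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
BIAS = [0.0, 0.0, 0.0]

user_with_one_writing = [[0.2, 0.4]]
user_with_four_writings = [[0.2, 0.4], [0.9, 0.1], [0.3, 0.3], [0.0, 0.7]]

state_a = rnn_encode(user_with_one_writing, W_IN, W_REC, BIAS)
state_b = rnn_encode(user_with_four_writings, W_IN, W_REC, BIAS)
# Both users end up with a fixed-size representation of length 3.
```

Whatever the number of writings, the final state has a fixed size and can therefore be passed to a fully connected layer.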

It is composed of the whole best A model without its last two layers. The outputs of this cropped model are used as "writing embeddings", which represent each text as a 32-dimensional vector. The "writing embeddings" are then fed into the RNN layers, whose output is passed through a fully connected layer before the output layer (see Fig. 5a).

4.4 D model

This model follows the same idea as the A model, which is to classify the different texts independently. It differs from the A model in that it does not use word embeddings, but char embeddings instead.

This model takes as input the characters of every post, which go through a char embedding layer. This layer mainly differs from the word embedding one in the dimension of the vectors, which is much smaller, and in the vocabulary, which is also much smaller. The output of this layer is then fed into a one-dimensional CNN in the same way as in the A model. Finally, the output of the CNN is fed into a fully connected layer and then into the output layer (see Fig. 4c).

Fig. 4: (a) A model. (b) B model. (c) D model.

4.5 E model

This model follows the same idea as the C model, which is to use a pre-trained model to generate "writing embeddings". It differs from the C model in that, instead of a model pre-trained on word embeddings, it uses a model pre-trained on char embeddings. This model also takes advantage of the RNN layers, which allow it to process users regardless of the number of texts each of them has, which, as mentioned above, varies widely (see Fig. 3).

This model uses the weights of the best D model, but without its last two layers. The outputs of the cropped D model, which take the form of a 64-dimensional vector, are fed into the RNN layers. Finally, as in the previous models, the output of the RNN layers is fed into a fully connected layer and then passed to the output layer (see Fig. 5b).

5 Results

In this section, the results obtained by the five different approaches are shown. We divide this section into the validation results and the test results. The test evaluation took place by means of the test server presented in Section 3. We also include the best results obtained in the challenge.

The common measure of performance in terms of precision and recall is the F1-score [4]. This metric is the harmonic mean of precision and recall. As we are mostly concerned with the performance on the positive class, only the F1 of that class is shown in the validation results. We also add the macro F1, because it is a good measure of performance with unbalanced classes, where the most important class is the least represented one. Finally, we add the weighted F1 for comparison.
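The three metrics can be computed as in the sketch below, which follows the usual definitions: the macro F1 is the unweighted mean of the per-class F1 scores, and the weighted F1 averages them by the number of true instances of each class. The function names and the toy labels are illustrative assumptions.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, label):
    """F1 of one class, i.e. the harmonic mean of its precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f1_report(y_true, y_pred, positive=1):
    """Return the positive-class, macro and weighted F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {label: f1_per_class(y_true, y_pred, label) for label in labels}
    support = Counter(y_true)
    macro = sum(scores.values()) / len(labels)
    weighted = sum(scores[label] * support[label] for label in labels) / len(y_true)
    return {"positive": scores.get(positive, 0.0), "macro": macro, "weighted": weighted}


# Unbalanced toy example: 8 control users and 2 positive users.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
report = f1_report(y_true, y_pred)
```

In this toy example the weighted F1 is dominated by the majority class, while the positive-class F1 and the macro F1 expose the weaker performance on the minority class.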

The best results of each model can be seen in Table 4, with the best values in bold. These metrics are very limited in comparison with the ones provided by the challenge organisers. Even so, the validation metrics are promising, especially those obtained by the C approach, although further work must be performed to improve the overall results.

The results obtained in the official evaluation are shown in Table 5. The best results obtained in the challenge for each metric are also shown in that table.

6 Conclusions and Future Work

Five different approaches to the CLEF eRisk 2019 task 1 have been described. All the approaches use some kind of neural network, and two of them benefit from concepts such as transfer learning. Several hyper-parameters of those models were fine-tuned in order to achieve the best possible performance. Although our official results are very low, we can conclude that our models provide promising results for the early detection of anorexia in social media, obtaining an F1 score of up to 0.57 on the positive class in validation. The test results were weaker in terms of F1. Even so, for ERDE5 and ERDE50, results close to the best ones were obtained.

Still, further work is needed. We would like to feed the different approaches with more kinds of embeddings, such as concept embeddings, and to test the use of word embeddings and char embeddings in the same model.

Acknowledgments

This work was supported by the Research Program of the Ministry of Economy and Competitiveness - Government of Spain (DeepEMR project TIN2017-87548-C2-1-R).

Fig. 5: Structure of models C and E. (a) C model. (b) E model.

References

1. Breiman, L.: Random forests. Machine Learning 45(1), 5-32 (2001)

2. Coppersmith, G., Dredze, M., Harman, C.: Quantifying mental health signals in Twitter. In: Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. pp. 51-60 (2014)

3. Goldberg, Y., Levy, O.: word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014)

4. Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: European Conference on Information Retrieval. pp. 345-359. Springer (2005)

5. Gutiérrez-Barquín, I.E.: Alteraciones menstruales y anorexia nerviosa. Trastornos de la conducta alimentaria (3), 277-284 (2006)

6. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2014)

7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097-1105 (2012)

8. Losada, D.E., Crestani, F., Parapar, J.: eRisk 2017: CLEF lab on early risk prediction on the internet: experimental foundations. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 346-360. Springer (2017)

9. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk: Early risk prediction on the internet. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 343-361. Springer (2018)

10. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2019: Early Risk Prediction on the Internet. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 10th International Conference of the CLEF Association, CLEF 2019. Springer International Publishing, Lugano, Switzerland (2019)

11. Mohr, D.C., Zhang, M., Schueller, S.M.: Personal sensing: Understanding mental health using ubiquitous sensors and machine learning. Annual Review of Clinical Psychology 13, 23-47 (2017)

12. Montgomery, D.C., Peck, E.A., Vining, G.G.: Introduction to linear regression analysis, vol. 821. John Wiley & Sons (2012)

13. Ortega-Mendoza, R.M., Lopez-Monroy, A.P., Franco-Arcega, A., Montes-y-Gomez, M.: PEIMEX at eRisk2018: Emphasizing personal information for depression and anorexia detection

14. Paul, S., Kalyani, J.S., Basu, T.: Early detection of signs of anorexia and depression over social media using effective machine learning frameworks

15. Schapire, R.E.: Explaining AdaBoost. In: Empirical Inference, pp. 37-52. Springer (2013)

16. Schölkopf, B., Smola, A.J.: Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press (2001)

17. Socher, R., Huang, E.H., Pennington, J., Manning, C.D., Ng, A.Y.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Advances in Neural Information Processing Systems. pp. 801-809 (2011)

18. Trotzek, M., Koitka, S., Friedrich, C.M.: Word embeddings and linguistic metadata at the CLEF 2018 tasks for early detection of depression and anorexia

19. Wang, Y.T., Huang, H.H., Chen, H.H.: A neural network approach to early risk detection of depression and anorexia on social media text

20. Zhang, X., LeCun, Y.: Text understanding from scratch. arXiv preprint arXiv:1502.01710 (2015)

21. Zhang, Y., Jin, R., Zhou, Z.H.: Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics 1(1-4), 43-52 (2010)