136 Analysis of Global Word Representations for Depression Detection Niveditha Sekara, S Chandrakalaa* and G Prakashb* a Intelligent Systems Lab, School of Computing, SASTRA Deemed to be University, Thanjavur, India b Department of CSE, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India Abstract Social media such as Twitter, Facebook, Google plus, Reddit, Tumblr have been a widely used platform for people to communicate, share views and feelings with others freely. The information obtained from this short text messages helps in predicting their emotions, views, sentiment, opinion and it is applied in different fields like marketing, election, product review, sentiment analysis, emotion detection etc. Behavioral analysis from text data is another widely popular field. This paper gives an analysis of global word representations and overview of the work done on depression detection related tasks. Major steps such as pre- processing of data, feature extraction, representation and classification methods are summarized. Keywords Social media, depression detection, behavioral analysis, emotion detection, GloVe representation, deep learning 1. Introduction1 few people are not aware of abnormalities in their mental health to consult a Psychiatrist. In order to address this, the Depression can be Behavioral analysis is the study of human detected from the social media data of the behavior. It involves observing the behavior, users itself [1]. Since most of the people identifying the mental state, analyzing and around the world are using social media like understanding the change in human behavior. Facebook, Twitter, Instagram etc. Depression Behavioral analysis is also called as can be detected from their text messages, emotional/sentimental analysis. Among status updates, posts they are sharing, self- several emotions, the crucial ones are with reported surveys and the communities or pages negative emotions. Some negative emotions they are following [2-4]. are stress, depression, frustration, hate, envy, This analysis can be done from text anger, anxiety, boredom and panic. These data, speech/audio data and visual data [5], [6]. emotions may affect the mental health as well The data for this analysis can be collected as physical health of a person. In which, from any social media. Since most of the user depression is a persistent mood disorder and in prefers to share short text messages on the the worst case, it can be a life-threatening one. events happening around them or information So it is essential to identify the people at the about them, it is more informative to analyze risk of depression. Face-to-face interviews and the social media text data. This sentimental a set of questionnaire are used by Psychiatrist, analysis is very popular since it is needed in to understand the behavioral health of the wide application areas of marketing, artificial person. It provides a more accurate result, but intelligence, political science, human- computer interaction, psychology, stock ISIC'21: International Semantic Intelligence Conference, 1 market prediction etc. Figure 1 shows the flow February 25-27, 2021, New Delhi, India EMAIL: nivedithasekarit@gmail.com (Niveditha Sekar); diagram of depression detection system. chandrakala@cse.sastra.edu (S Chandrakala); gprakas_74@rediffmail.com (G Prakash) * Corresponding Author 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) 137 information in a consistent manner. The opposite of this is also true i.e. users who are under stress or depression are not much interested to have communication on social media [5], [10]. This low activeness in social media results in lesser tweets and thereby it is difficult to identify the emotion of the user with accuracy. The main task in the emotion analysis is to understand the semantic nature of the short text messages. Most of the features identified from the short text or tweet are sparse features. It is really challenging to detect the emotion from those sparse features since they contribute very less value in the detection of emotion [11]. In a word-level representation, most of the words identified are ambiguous, and they also contain stop words. Hence, it is difficult to identify their emotion class label by a classifier [11]. It is Figure 1: Flow diagram of Depression also difficult to identify the original meaning detection system of the sentence, when it has a sarcastic tone. Since the sentences may sound joyful, but they 2. Challenges in short text data actually express sadness. It leads to false positive in the result [8], [12]. analysis Text data which is collected from any social media does not have a structure. Each user 3. Short text datasets for expresses his/her view in different ways and depression detection their text includes new words, short form of words, errors in the spelling of the words etc. The short text dataset can be collected through [7], [8]. It is difficult to detect depression from the Twitter public API [5], [13], [14] or a single tweet of a user. Thus, we need to through the short text datasets, which are observe a history of tweets of a particular user already available [15-17]. Twitter public API [9]. Also, there is a word limit for twitter provides a means to access the twitter software tweets. Within 140 characters it is hard to platform. Several software libraries are express one’s feelings and also it is hard for available for each programming language the analyst to interpret their feelings [10]. In namely tweepy for Python and rtweet for R. order to identify their emotion, they have to Twitter API is of two types, they are Twitter analyze the comment and retweets about that REST API and Twitter Streaming API. Twitter particular tweet. This is a long chain process, Streaming API [18], [19] will provide live to detect the emotion of a particular user. A tweets until you stop, whereas REST API will large collection of tweets, from the history of provide historical data. Table 1 lists few short that particular user have to be taken into text datasets, which are used in depression account and also the comments, retweets for detection literature. each tweet by the user have to be considered for this emotion detection. A typical social media user used to share information about them in any of this form text messages, photos or videos. They share 138 Table 1 Summary of few Short Text Datasets for Depression Detection Ref. Dataset name Description [7], [9], CLPsych dataset 1,746 twitter users examples, in which 246 are [15] PTSD users and 327 are depressed users. [9] BellLetsTalk campaign All tweets with #BellLetsTalk hashtag are collected, dataset in which 95 people disclosed that they are depressed. [16] CLEF/eRisk 2017 dataset 887 Reddit users examples, in which 135 are depressed. [10] Sina weibo dataset 23,304 users tweets are crawled, in which 11,074 users are stressed. [17] LiveJournal dataset This dataset consists of 2,132 posts. In which, 758 are depressed posts. [11] SemEval 2007 dataset This dataset consists of 1,250 news headlines. They are labelled into 6 emotions. [11] ISEAR dataset This dataset contains 7,666 sentences. They are labelled into 7 emotions. 4. Pre-processing the short text  Also negative references are replaced by their full words i.e. “can’t” is replaced by data “cannot”.  Emoticons and emojis are replaced with Before feature selection, the short text data is their words. pre-processed to refine the unstructured and noisy data. Pre-processing phase is an important phase, as it helps in improving the 5. Feature extraction and performance. representation  In the pre-processing phase, all the non- ASCII, non-English characters, URLs and 5.1. Feature extraction @username are removed. Since they are not contributing any valued information to From the pre-processed data, the features are the depression detection system. extracted, represented and are given as input to the classification methods. There are several  All the acronyms are expanded to its full features or attributes involved in the process of form like “idk” as “I don’t know”. depression detection. Some of the features  This phase performs tokenizing, stemming used for this depression detection are user- and removing stop words [17], [20], [21]. level feature, tweet-level feature, temporal Tokenizing process will split the texts into feature, non-temporal feature, social sequence of tokens. Stemming process interaction feature, content feature, posting will reduce the length of a word, by behavior feature, term frequency feature, Bag- reducing the word to its word stem like Of-Words (BOW) feature, hashtags, negation, “rained”, ”raining” as “rain”. Stop words LIWC feature, word N-gram feature, Part-of- are removed, some of them are “a”, “the”, speech (POS) feature, topic, tweet frequency, “and” etc. RT [24] etc. Several feature extraction  In each word, if a letter is appearing techniques are available as built-in commands continuously more than twice then it is in R language, SciPy, Numpy etc. replaced with its appropriate word [22], The tweet-level attributes will give [23] like “Noooooo” as “No”. information from the tweet, image, retweets, 139 comments and likes. The user level attributes model which will map words with similar will provide more information on the emotion context into a feature vector [28]. The GloVe of the user; it includes the behavior of the user representation model proves to be effective from their social interaction and from their and is showing improved performance when posts. The social interaction attributes have combined with Deep Convolutional Neural information about the content and the structure Network, than the state-of-art approaches [28], in which the user communicates with his [40]. friends [5], [10]. Tweets are classified in time series for temporal feature, whereas history of tweets is used in non-temporal feature. Term 6. Depression detection methods frequency feature gives the frequency count of individual word or n-gram of words. POS The extracted features and derived feature finds the adjectives since they provide representations are fed as input for further more information. Negation feature gives the modeling. Depression can be detected from the actual opinion orientation like “not happy” is short text data with the help of various equivalent to “sad” [25]. Bag-Of-Words will modeling methods, such as Discriminative provide the occurrence of each word in a model based methods, Ensemble model based document. Word N-grams feature is similar to methods, Probabilistic model based methods, Bag-Of-Words. N-gram includes phonemes, ANN based methods, Deep learning based syllables, letters, words [16]. To reduce methods and Unsupervised learning based dimension or attributes Principal Component methods. Analysis (PCA) is used [26]. 6.1. Discriminative model based 5.2. Representation methods There are several feature representation SVM is a discriminative classifier. SVM is models are available. Some of the most suited for text data, because of the sparse representation models are Word2Vec nature of the text. Text data can be categorized representation, FastText, Global vector for into two categories. They are user-level word representation (GloVe) model, word N- attributes and tweet-level attributes. In tweet- gram feature representation, twitter specific level category, first the features are extracted, feature representation, word sentiment polarity next the features are segregated into different score representation, word representation classes like depressed words, non-depressed features, temporal feature vector, non-temporal words, polarity words, stop words etc. In user- feature vector etc. level category, the user tweet history is Word2Vec representation uses considered. All the tweets of the user are continuous skip gram and BOW features. considered like a single tweet and then tweet- Based on non-temporal feature, overall level detection is performed. It uses (BOW) to emotion score is calculated. For temporal get the vocabulary. Then it is trained using feature, if a user did not tweet anything for a SVM in original dataset, dataset balanced by day its score is taken as zero. In such a way, under-sampling and dataset balanced by over- emotion score vector is calculated [27]. In sampling. It is observed that user-level word embedding, all the words are mapped classification gives high performance with into a multi-dimensional vector, where respect to recall measure in comparison with semantically related words are neighbors. The the tweet-level classification even for the word sentiment polarity score representation, limited number of feature. Also it is difficult to finds either the word has a strong relationship detect whether the user is depressed or not with positive sentiment or non-positive from a single tweet/post, hence user-level sentiment. To identify this, it uses the lexicon category is used [9]. It is also observed that based sentiment feature and Senti-wordnet. when Linear SVM is applied on BOW feature, FastText is similar to skip-gram it provides good performance in terms of representation, where each ngram has its Recall measure [15]. SVM gives good vector. Vector representation helps to improve accuracy when compared with Naïve Bayes the performance, as it provides the hidden and Logistic regression methods [29]. Table 2 details [36]. The GloVe model is a regression 140 gives the summary of few discriminative the best fit for the depression case. Here Naïve modeling. Bayes theorem shows better performance with respect to accuracy when compared with other 6.2. Ensemble model based classifier algorithms. When evaluating with respect to Precision and F1-score Logistic methods Regression gives good performance [15]. Also, Naïve Bayes is the best classification Random Forest (RF) classifier is an ensemble approach when compared with BP neural classifier. It is a multitude of decision tree, for network and Decision tree. Also, Naïve Bayes more accurate results. To detect depression gives high precision and recall value [26]. from the text data, temporal feature and non- Table 4 gives the summary of few temporal feature are used. Feature vector from probabilistic modeling. non-temporal feature is referred as EMO. EMO, LIWC and combination of EMO+LIWC feature sets are given as input to Random 6.4. Artificial Neural Network Forest classifier. It is observed that RF gives (ANN) based methods high precision and recall than SVM [30]; also it provides more information with temporal Artificial Neural Network (ANN) is combined features [27]. RF classifier is also used to with several unsupervised learning model to classify the online post and communities into detect depression from social media text data. depressive and non-depressive. On top of the Some of the unsupervised learning models are extracted LIWC feature, RF is applied to Biterm Topic model, Word2vec, Replicated classify them. Hierarchical HMM is used for Softmax Machine. BTM identifies words that determining the degree of depression in the appear together. It will identify two words that social communities. RF, Logistic Regression, appear together, if the size of window is given and Gaussian NB are applied with different as two. BTM uses topic to represent the hidden representation methods such as Word2Vec, aspects of the document. Word2vec is a word FastText with Skip-gram, and GloVe. RF embedding process, identifies both semantic provides better performance than the other and syntactic regularities in the sentence. And models when combined with FastText [36]. it will group them in clusters, if the vectors Table 3 gives the summary of few ensemble have similar semantic meanings i.e. it modeling. computes the association with words and groups them together. RSM is similar to term 6.3. Probabilistic model based frequency counter, it will count the occurrences of a particular word in the methods vocabulary collected. RSM also identifies the hidden topical structure. On top of this Naïve Bayes is a probability based classifier. unsupervised learning model, Stochastic Naïve Bayes algorithm has assumes that each Gradient Descent (SGD) model is applied. feature is independent. Bag-Of-Words (BOW) SGD acts as transfer learning approach, as this approach will provide the words with its will transfer the high-level semantic features to occurrence frequency. BOW feature is given ANN. In order to filter the noisy feature and to as input to different classification algorithms maintain the stability of this model, Sparse like DT, NB, Linear SVM and Logistic Encoding method is applied. The transfer Regression. Each tweet is treated as a learning approach used in this Hybrid Neural document. Here Bag-Of-Words finds the Network is called as Latent Semantic Machine occurrence frequency of words related to (LSM). It accepts the raw features from the depression. Decision tree will provide results unsupervised learning models and derives for most of the cases, but it may be unstable them into high level semantic feature mixture, when there is a change in data. Linear SVM is which will be fed into the Neural network. It is also used for this purpose, where a straight line observed that HNN+BTM with one LSM and is used to differentiate classes. It uses a HNN+BTM with two LSM performed better maximum-margin hyperplane to perform this in terms of F1 measure, than HNN with other identification of classes. Logistic Regression unsupervised learning models. It is also uses the probability of words belonging to a observed that HNN+RSM and particular class and curve is drawn to identify 141 HNN+Word2vec with sparse encoding give MultiChannelPoolingCNN is same as better performance than HNN+RSM and MultiChannelCNN but with two different HNN+Word2vec without sparse encoding. max-pooling sizes 2 and 5. MutiChannel CNN The selection of the unsupervised learning and bi-directional GRU combined to give models for extracting the source features more accuracy than CNN [38]. These CNN added more value to this HNN model [11]. variants are compared with the RNN model Feed Forward (FF) is a type of ANN. and it is observed that CNN with global max The Reddit dataset is pre-processed and fed to pooling layer gives high performance than FF neural network method. This FF modeling RNN based model by providing the highest is used for multiclass classification, which precision and recall [7]. involves “selfharm”, “suicidewatch”, “anxiety”, “depression”, etc. It is observed that CNN with Factor Graph Model (FGM). FF classifier gives more accurate results when CNN is combined with the Factor Graph compared with SVM and linear regression Model (FGM) to extract more tweet level and [31]. Table 5 gives the summary of few ANN user level information. In this approach, CNN based modeling. method is applied on the dataset along with the Cross Auto Encoders (CAE). CNN will provide the user-level attributes, which is 6.5. Deep learning based methods obtained from tweet-level. Then this will be given as input to the next phase FGM. FGM 6.5.1. Convolutional Neural considers three factors and three aspects of this Networks (CNN) attributes to map this into states. The three factors are attribute factor, dynamic factor and CNN with global max pooling layer. The pre- social factor. To depict the correlation of the processing of twitter data provides a stress state and time with attribute, attribute vocabulary for further phases. The words are factor is used. Dynamic factor is used to give encoded into a sequence of fixed length and correlation of the stress state and dynamic occurrence of the word is limited to two times time. Social factor is used to depict the in that sequence. Then unsupervised training correlation between the stress state and time models are used to transform the encoded with polarity comments. The three main words into a low dimensional vector. For this aspects, which FGM is taking into account, are many models are available like Skip-gram and the following user-level attributes: posting CBOW. Skip-gram concentrates on the behavior, content, and social interaction. contextual words and is able to detect rare Based on these factors and aspects the user- words whereas the CBOW concentrates on the level attributes are mapped with the respective current words and is much of a continuous stress state level. This CNN+FGM give better Skip-gram. Skip-gram and CBOW are the two performance, by providing the highest layers of Word2Vec model. This unsupervised precision and recall, when compared with the training model is performed with different traditional methods like SVM, RF, LR [10]. sense and it involves two tasks. They are predicting word and sense from the input. For DCNN with Global vector for word this, it first identifies the word that occurs representation (GloVe) model. DCNN method together, example “happy” it can come with helps to identify whether the tweets express words like “journey”, “morning”, “birthday”. positive or non-positive emotion. Before Then Rectified Liner Unit (ReLU) is used, this applying Deep Convolutional Neural network, will identify the label for the missing data and the tweets are preprocessed, features are sense of the sentence, thereby produce the extracted and represented into feature vector label output. using GloVe model. The GloVe model is a On top of these embeddings, variants of regression model which combines the CNN are applied. CNNWithMAX means following two methods local context window Convolution with 250 layers is applied and and global matrix factorization. Deep then the global max pooling layer is applied to Convolutional Neural Networks (DCNN) is extract the global information. In applied on the vector, generated by GloVe MultiChannelCNN, three times CNN is model. The twitter specific feature vector, applied, with the filter of length 3, 4, and 5. unigram and bigram feature vector, word 142 sentiment polarity score feature vector are LSTM and Gate Recurrent Unit (GRU). combined into a single feature vector. In the LSTM and GRU are best suited for predicting first Convolutional layer, on top of the long-term data involving delay. Combining combined feature vector, Convolutional filter GRU with LSTM helps in handling the is applied to get new vector. The vector is difficulties in LSTM, which is the training mapped to a fixed length vector. Again speed. GloVe representation is used to utilize convolutional layer is applied to get new both local and global details of the data. vector. This GloVe+DCNN model uses three Among RNN, LSTM, GRU and LSTM-GRU, k-max pooling layer and three convolutional LSTM-GRU provides better performance [37]. layers to give the probability of positive or Table 6 gives the summary of few deep negative sentiment in the tweet. It is observed learning based modeling. that GloVe+DCNN provide high precision and recall when compared with BoW/GloVe with SVM or LR [28]. 6.6. Unsupervised learning based methods 6.5.2. Recurrent Neural Networks K-means is an unsupervised learning method. (RNN) Before applying k-means to the observations, the collected data is pre-processed. Then the RNN is widely used in NLP. Word2Vec model data is analyzed, by calculating the word is used to represent the vocabulary. It also frequency. Words in the vocabulary are helps to determine or predict the word and represented into vector, using one-hot sense from the input. Rectified Liner Unit encoding or word embedding process. (ReLU) is also used. ReLU helps in Word2Vec model can also be used to generate identifying the missing label for the data and vectors. Then k-means clustering is applied, also identifies the sense of the sentence. This words with similar meaning are grouped embedding is given as input to RNN model. together in clusters. Based on cosine similarity RNN is applied with Bidirectional LSTM and it is easy to accumulate semantically similar context-aware attention. LSTM prevents error words in the clusters [32]. For Latent Dirichlet from exploding and vanishing gradient Allocation (LDA) method, the extracted N- problems. Bidirectional RNN connects the gram features are fed as input. LDA is applied output from two hidden layers of opposite on term-document matrix and gives output as direction to the same output. Bidirectional topic-document matrix, which is fed into LSTM helps to concatenate both forward and Multilayer Perceptron (MLP). MLP works backward representation. Context-aware with 30 topics as input and two hidden layer of attention provides the weighted sum of all 60 and 30 units. It gives comparatively words in a sequence and also it helps to focus moderate performance with respect to on the more important words. It is observed precision and recall, which is due to the that optimized embedding performed better, unsupervised nature of the topic extraction than the trainable random embedding for [33]. Table 7 gives the summary of few RNN. Also, when compared with CNN based unsupervised learning based modeling. models, RNN shows low performance with respect to precision and recall [7]. Table 2 Summary of few Discriminative Model Based Methods Ref. Dataset Feature Classifier Performance [9] CLPsych2015, Bag-Of-Words, User-level, SVM Precision-0.58, BellLetsTalk 2015 tweet-level feature Recall-0.77 [15] CLPsych2015 Bag-Of-Words SVM Precision-0.83, Recall-0.83 143 Ref. Dataset Feature Classifier Performance [29] Twitter, 20newsgroups N-gram , negation, Part- SVM Precision-0.83, Of-Speech (POS) Recall-0.79 Table 3 Summary of few Ensemble Model Based Methods Ref. Dataset Feature Classifier Performance [27] Twitter Streaming API Non-Temporal(EMO), RF Precision-0.90, Temporal(EMO-TS), LIWC Recall-0.86 feature [30] LiveJournal LIWC feature RF Precision-0.89, Recall-0.90 [36] Twitter dataset Skip-gram RF Precision-0.82, Recall-0.81 Table 4 Summary of few Probabilistic Model Based Methods Ref. Dataset Feature Classifier Performance [15] CLPsych2015 Bag-Of-Words Naïve Bayes Precision-0.82, Recall – 0.82 [26] Questionnaire data Words, pronoun and Naïve Bayes Precision-0.75, punctuation feature Recall-0.62 Table 5 Summary of few Artificial Neural Network (ANN) Based Methods Ref. Dataset Feature Classifier Performance [31] Reddit dataset Linguistic features Feed Forward Precision-0.92, (FF) Recall-0.88 [11] SemEval and ISEAR words feature ANN with F1-0.60 dataset, SinaNews unsupervised dataset learning model 144 Table 6 Summary of Few Deep Learning Based Methods Ref. Dataset Feature Classifier Performance [7] CLPsych2015, Continuous Bag- Of- CNN with Precision-0.87, BellLetsTalk2015 Words, Skip gram global max Recall-0.87 pooling layer [10] Sina Weibo’s REST APIs, User-level feature, Tweet- CNN with FGM Precision-0.90, Tencent Weibo level feature Recall-0.96 [28] Stanford Twitter Word N-gram feature, DCNN with Precision-0.88, Sentiment Test dataset, emoticons, hashtags, GloVe model Recall-0.87 SemEval 2014 Task9 negation, Part- Of-Speech dataset, Stanford Twitter (POS) and capitalized Sentiment Gold dataset, words. Sentiment Evaluation Dataset, Sentiment Strength Twitter dataset [7] CLPsych2015, Continuous Bag- Of- RNN with Precision-0.63, BellLetsTalk2015 Words, Skip gram bidirectional Recall-0.65 LSTM [37] IMDB dataset words feature LSTM and GRU F1-0.86 Table 7 Summary of few Unsupervised Learning Based Methods Ref. Dataset Feature Classifier Performance [32] Twitter data Word frequency K-means Cosine similarity helps in easy clustering [33] eRisk 2018 pilot N-gram features Latent Dirichlet Precision-0.32, task dataset Allocation (LDA) Recall-0.62 7. Analysis of Global word interdisciplinary research in the field of depression related problems. The people under representations depression can be alerted when early signs of depression are found. eRisk 2017 dataset 7.1. Dataset overview focussed on the early risk prediction with multiple actors (Ex: Children sexual abuse) The dataset used in the following experimental and with single actors (Ex: Depression, bipolar analyses is “CLEF/eRisk 2018 dataset”. The disorder, teenage distress) from online text aim of the CLEF eRisk is to identify the data. eRisk 2018 dataset is formed with the people, liable to depression from the data 2017 dataset, it involved in the early prediction available on the Internet. It paved a way of of Depression and Anorexia among the social 145 media users. Both eRisk 2017 and eRisk 2018 two hidden layers of varying units. The output uses the same source of data, i.e. it collects the layer of the RNN helps in labeling the users as social media texts from a particular collection a depressed person or non-depressed person. of users. The data is arranged in chronological The performance of GloVe and RNN model is order of 10 chunks from oldest to newest of depicted in Table 8. each user. It provides data for both training and testing. The training data is divided into Analysis with GloVe and CNN. The GloVe depressed and control groups i.e., non- model is observed to be effective for sentiment depressed. The eRisk 2017 dataset is a analysis from text data mining [28], the GloVe collection of writings from 887 social media representation model is combined with CNN users, where 135 are depressed. The eRisk to analyze the result. The dataset taken for this 2018 dataset is an extended collection of 2017 analysis consists of a few empty writings, dataset, which consists of writings from 1,707 which are ignored. Then the dataset is pre- users, where 214 users are depressed. processed while preserving the emoticons and symbols since they provide valuable 7.2. Methodologies used information. Each user’s writing in each chunk is analyzed and formed a matrix of words with Analysis with TF-IDF representation and a pre-trained set of word embeddings. This LDA. The eRisk dataset is pre-processed as an pre-processed tokenized input is given to a initial step. The TF-IDF vectorizer is well single Convolutional layer of 100 filters with suited for text dataset. As this will provide the CReLU activation. A Single max pooling layer unique list of words used in the dataset, along is applied to classify each user as depressed or with their frequency of occurrence. It helps in not. The GloVe model is combined with classifying the words under a particular set of different layers of CNN and LSTM network topics. The TF-IDF vectorizer of Scikit-learn and the performance is observed high for converts the writings of social media users into GloVe with the multiple layers of CNN and a matrix of TF-IDF features. The terms bi-LSTM [39]. The performance of GloVe extracted using the TF-IDF vectorizer is and CNN model is depicted in Table 8. formed as a matrix and given as input to Latent Dirichlet Allocation (LDA). The output 7.3. Performance analysis of LDA is the topic matrix. As each document is composed of different topics or attributes. The Classification report and Confusion And each topic is composed of different matrix is used to analyze the performance of words. This topic matrix is given as input to the above methodologies. The following the MLP model, it consists of two intermediate Table 8 shows Precision, Recall and F1 of layers of 50 & 20 units. By this approach, each these three methods. user is labeled as depressed or not. The The TF-IDF representation focuses performance of the TF-IDF and LDA model is mainly on the frequency of word occurrence in depicted in Table 8. documents. Then maps the word into an appropriate topic, thereby it classifies them. In Analysis with GloVe and RNN. GloVe this case, whenever some word related to combines the advantage of methods local depression comes, it classifies them as context window and global matrix depressed, which is not the ideal way. The factorization, to provide meaningful word Global vector for word representation insights. The GloVe model is providing considers the frequency of word occurrence promising results for text classification. It is and the frequency of co-occurrence of words combined with the RNN model, as RNN thereby provides more valuable information in model widely used for text classification and classifying them. The GloVe representation Natural Language Processing (NLP). The founds to be significantly better than the most eRisk dataset is pre-processed and tokenized commonly used word representation TF-IDF. and then given as input to the GloVe The RNN and CNN classifier works well with representation model. To provide meaningful text representation and its performances are statistics, GloVe forms the word-to-word co- analyzed with GloVe representation. From the occurrence matrix. The Resultant of GloVe is table, it is found that GloVe representation is given as input to the RNN model. It involves better than TF-IDF representation. Also, 146 GloVe representation performs well with CNN that GloVe representation with CNN classifier than RNN. This is because the RNN model provides comparatively better results. gives better result with word embeddings of higher length. From the analysis, it is found Table 8 Summary of Performance Analysis Methodology Precision Recall F1 TF-IDF representation 0.62 0.28 0.49 with LDA GloVe with RNN [34] 0.80 0.20 0.31 GloVe with CNN [35] 0.42 0.66 0.51 GloVe with multiple CNN and 0.60 0.54 0.55 bi-LSTM [39] 8. Future directions References The performance of depression detection [1] Juyoung Song, Tae Min Song, Dong-Chul system can be improved or made more Seo, and Jae Hyun Jin.: “Data Mining of meaningful with the following directions for Web-Based Documents on Social future research. Networking Sites that Included Suicide-  The depression detection task can also be Related Words among Korean done by extracting the emotions from the Adolescents”. Journal of Adolescent speech data. Health, 59(6):668–673 2016.  It can also be extended by grouping users, [2] Budhaditya Saha, Thin Nguyen, Dinh based on gender, age, locations and other Phung, and Svetha Venkatesh.: “A demographic attributes. Framework for Classifying Online Mental  The Spatiotemporal features from video Health-Related Communities with an data can also be included, as they Interest in Depression”. IEEE Journal of contribute more information. Biomedical and Health Informatics,  Daily variation of a user’s depression can 20(4):1008–1015 2016. also be monitored. [3] Sharath Chandra Guntuku, David B Yaden, Margaret L Kern, Lyle H Ungar,  It can be extended by including the and Johannes C Eichstaedt.: “Detecting medical context, so the clinical depression Depression and Mental Illness on Social can be detected from social media data. Media: An Integrative Review”. Current Opinion in Behavioral Sciences, 18:43–49 9. Conclusion 2017. [4] Elizabeth M Seabrook, Margaret L Kern, This paper provided an overview of the Ben D Fulcher, and Nikki S Rickard.: depression detection system and the analysis “Predicting Depression from Language- of global word representations from the short Based Emotion Dynamics: Longitudinal text data. Datasets and Machine learning Analysis of Facebook and Twitter Status methods, used in recent years for the Updates”. Journal of Medical Internet depression detection are summarized. The research, 20(5) 2018. global word representations model proved to [5] Guangyao Shen, Jia Jia, Liqiang Nie, Fuli be effective is analyzed with different Feng, Cunjun Zhang, Tianrui Hu, classifiers. Various challenges and future TatSeng Chua, and Wenwu Zhu.: directions are summarized for future research. “Depression Detection via Harvesting Social Media: A Multimodal Dictionary Learning Solution”. In Proceedings of the 147 Twenty-Sixth International Joint [15] Moin Nadeem.: “Identifying Depression Conference on Artificial Intelligence on Twitter”. arXiv preprint (IJCAI-17), pages 3838–3844, 2017. arXiv:1607.07384 2016. [6] Andrew G Reece and Christopher M [16] Maxim Stankevich, Vadim Isakov, Danforth.: “Instagram Photos Reveal Dmitry Devyatkin, and Ivan Smirnov.: Predictive Markers of Depression”. EPJ “Feature Engineering for Depression Data Science, 6(1):15 2017. Detection in Social Media” 2018. [7] Ahmed Husseini Orabi, Prasadith [17] Maryam Mohammed Aldarwish and Buddhitha, Mahmoud Husseini Orabi, Hafiz Farooq Ahmad.: “Predicting and Diana Inkpen.: “Deep Learning for Depression Levels Using Social Media Depression Detection of Twitter Users”. Posts”. In Autonomous Decentralized In Proceedings of the Fifth Workshop on System (ISADS), 2017 IEEE 13th Computational Linguistics and Clinical International Symposium on, pages 277– Psychology: From Keyboard to Clinic, 280 2017. pages 88–97 2018. [18] Felix Ming Fai Wong, Chee Wei Tan, [8] Harshali P Patil and Mohammad Atique.: Soumya Sen, and Mung Chiang.: “Sentiment Analysis for Social Media: A “Quantifying Political Leaning from Survey”. In Information Science and Tweets, Retweets, and Retweeters”. IEEE Security (ICISS), 2015 2nd International Transactions on Knowledge and Data Conference on, pages 1–4 2015. Engineering, 28(8):2158–2172 2016. [9] Zunaira Jamil.: “Monitoring Tweets for [19] Zhenhua Zhang, Qing He, Jing Gao, and Depression to Detect At-risk Users”, Ming Ni.: “A Deep Learning Approach 2017. for Detecting Traffic Accidents from [10] Huijie Lin, Jia Jia, Jiezhong Qiu, Social Media Data”. Transportation Yongfeng Zhang, Guangyao Shen, Research Part C: Emerging Technologies, Lexing Xie, Jie Tang, Ling Feng, and Tat- 86:580–596 2018. Seng Chua.: “Detecting Stress Based on [20] C Sindhu, Dyawanapally Veda Vyas, and Social Interactions in Social Networks”. Kommareddy Pradyoth.: “Sentiment IEEE Transactions on Knowledge and Analysis Based Product Rating Using Data Engineering, 29(9):1820–1833 2017. Textual Reviews”. In Electronics, [11] Xiangsheng Li, Yanghui Rao, Haoran Communication and Aerospace Xie, Raymond Yiu Keung Lau, Jian Yin, Technology (ICECA), 2017 International and Fu Lee Wang.: “Bootstrapping Social conference of, volume 2, pages 727–731. Emotion Classification with Semantically [21] Anukarsh G Prasad, S Sanjana, Skanda M Rich Hybrid Neural Networks”. IEEE Bhat, and B S Harish.: “Sentiment Transactions on Affective Computing, Analysis for Sarcasm Detection on 8(4):428–442 2017. Streaming Short Text Data”. In [12] Monireh Ebrahimi, Amir Hossein Knowledge Engineering and Applications Yazdavar, and Amit Sheth.: “Challenges (ICKEA), 2017 2nd International of Sentiment Analysis for Dynamic Conference on, pages 1–5 2017. Events”. IEEE Intelligent Systems, [22] Sonia Xylina Mashal and Kavita Asnani.: 32(5):70–75 2017. “Emotion Intensity Detection for Social [13] Zhaoxia Wang, Chee Seng Chong, Landy Media Data”. In Computing Lan, Yinping Yang, Seng Beng Ho, and Methodologies and Communication Joo Chuan Tong.: “Fine-Grained (ICCMC), 2017 International Conference Sentiment Analysis of Social Media with on, pages 155–158 2017. Emotion Sensing”. In Future [23] Ahmed Al-Saffar, Suryanti Awang, Hai Technologies Conference (FTC), pages Tao, Nazlia Omar, Wafaa Al-Saiagh, and 1361–1364 2016. Mohammed Al-bared.: “Malay Sentiment [14] Sara Rosenthal, Noura Farra, and Preslav Analysis Based on Combined Nakov.: “SemEval-2017 task 4: Classification Approaches and Senti- Sentiment Analysis in Twitter”. In Lexicon Algorithm”. PloS One, Proceedings of the 11th International 13(4):e0194852 2018. Workshop on Semantic Evaluation [24] Sho Tsugawa, Yusuke Kikuchi, Fumio (SemEval-2017), pages 502–518 2017. Kishino, Kosuke Nakajima, Yuichi Itoh, 148 and Hiroyuki Ohsaki.: “Recognizing Informed Deep Learning”. Scientific Depression from Twitter Activity”. In reports, 7:45141 2017. Proceedings of the 33rd Annual ACM [32] Long Ma, Zhibo Wang, and Yanqing Conference on Human Factors in Zhang.: “Extracting Depression Computing Systems, pages 3187–3196 Symptoms from Social Networks and 2015. Web Blogs via Text Mining”. In [25] Amit G Shirbhate and Sachin N International Symposium on Deshmukh.: “Feature Extraction for Bioinformatics Research and Sentiment Classification on Twitter Applications, pages 325–330 2017. Data”. International Journal of Science [33] Maupome´.: “Using Topic Extraction on and Research (IJSR) ISSN (Online), Social Media Content for the Early pages 2319–7064 2016. Detection of Depression” 2018. [26] Changye Zhu, Baobin Li, Ang Li, and [34] Paul, S., Kalyani, J.S. and Basu, T.,: Tingshao Zhu.: “Predicting Depression “Early Detection of Signs of Anorexia from Internet Behaviors by Time- and Depression Over Social Media using Frequency Features”. In Web Intelligence Effective Machine Learning (WI), 2016 IEEE/WIC/ACM Frameworks”. 2018. International Conference on, pages 383– [35] Trotzek, M., Koitka, S. and Friedrich, 390 2016. C.M.,: “Word Embeddings and Linguistic [27] Xuetong Chen, Martin D Sykora, Thomas Metadata at the CLEF 2018 Tasks for W Jackson, and Suzanne Elayan.: “What Early Detection of Depression and About Mood Swings: Identifying Anorexia” 2018. Depression on Twitter with Temporal [36] Kaibi, Ibrahim, and Hassan Satori.: “A Measures of Emotions”. In Companion of comparative evaluation of word the The Web Conference 2018 on The embeddings techniques for twitter Web Conference 2018, pages 1653–1660 sentiment analysis”. In 2019 2018. International Conference on Wireless [28] Zhao Jianqiang, Gui Xiaolin, and Zhang Technologies, Embedded and Intelligent Xuejun.: “Deep Convolution Neural Systems (WITS), pp. 1-4. IEEE 2019. Networks for Twitter Sentiment [37] Ni, Ru, and Huan Cao.: “Sentiment Analysis”. IEEE Access, 6:23253–23260 Analysis based on GloVe and LSTM- 2018. GRU”. In 2020 39th Chinese Control [29] Anees Ul Hassan, Jamil Hussain, Conference (CCC), pp. 7492-7497. IEEE Musarrat Hussain, Muhammad Sadiq, and 2020. Sungyoung Lee.: “Sentiment Analysis of [38] Cheng, Yan, Leibo Yao, Guoxiong Xiang, Social Networking Sites (SNS) Data Guanghe Zhang, Tianwei Tang, and Using Machine Learning Approach for Linhui Zhong.: “Text sentiment the Measurement of Depression”. In orientation analysis based on multi- Information and Communication channel CNN and bidirectional GRU with Technology Convergence (ICTC), 2017 attention mechanism”. IEEE Access 8: International Conference on, pages 138– 134964-134975 2020. 140 2017. [39] Goularas, Dionysis, and Sani Kamis.: [30] Iram Fatima, Hamid Mukhtar, Hafiz “Evaluation of deep learning techniques Farooq Ahmad, and Kashif Rajpoot.: in sentiment analysis from Twitter data”. “Analysis of User-Generated Content In 2019 International Conference on Deep from Online Social Communities to Learning and Machine Learning in Characterise and Predict Depression Emerging Applications (Deep-ML), pp. Degree”. Journal of Information Science, 12-17. IEEE 2019. page 0165551517740835 2017. [40] Abid, Fazeel, Muhammad Alam, and [31] George Gkotsis, Anika Oellrich, Sumithra Adnan Abid.: “Representation of Words Velupillai, Maria Liakata, Tim JP Over Vectors in Recurrent Convolutional Hubbard, Richard JB Dobson, and Rina Attention Architecture for Sentiment Dutta.: “Characterisation of Mental Analysis”. In 2019 International Health Conditions in Social Media using Conference on Innovative Computing (ICIC), pp. 1-8. IEEE 2019.