136


Analysis of Global Word Representations for Depression
Detection
Niveditha Sekara, S Chandrakalaa* and G Prakashb*
a
    Intelligent Systems Lab, School of Computing, SASTRA Deemed to be University, Thanjavur, India
b
    Department of CSE, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India


                   Abstract
                    Social media such as Twitter, Facebook, Google plus, Reddit, Tumblr have been a widely
                   used platform for people to communicate, share views and feelings with others freely. The
                   information obtained from this short text messages helps in predicting their emotions, views,
                   sentiment, opinion and it is applied in different fields like marketing, election, product
                   review, sentiment analysis, emotion detection etc. Behavioral analysis from text data is
                   another widely popular field. This paper gives an analysis of global word representations and
                   overview of the work done on depression detection related tasks. Major steps such as pre-
                   processing of data, feature extraction, representation and classification methods are
                   summarized.

                   Keywords
                   Social media, depression detection, behavioral analysis, emotion detection, GloVe
                   representation, deep learning

1.      Introduction1                                                                        few people are not aware of abnormalities in
                                                                                             their mental health to consult a Psychiatrist. In
                                                                                             order to address this, the Depression can be
Behavioral analysis is the study of human                                                    detected from the social media data of the
behavior. It involves observing the behavior,                                                users itself [1]. Since most of the people
identifying the mental state, analyzing and                                                  around the world are using social media like
understanding the change in human behavior.                                                  Facebook, Twitter, Instagram etc. Depression
Behavioral analysis is also called as                                                        can be detected from their text messages,
emotional/sentimental       analysis.     Among                                              status updates, posts they are sharing, self-
several emotions, the crucial ones are with                                                  reported surveys and the communities or pages
negative emotions. Some negative emotions                                                    they are following [2-4].
are stress, depression, frustration, hate, envy,                                                     This analysis can be done from text
anger, anxiety, boredom and panic. These                                                     data, speech/audio data and visual data [5], [6].
emotions may affect the mental health as well                                                The data for this analysis can be collected
as physical health of a person. In which,                                                    from any social media. Since most of the user
depression is a persistent mood disorder and in                                              prefers to share short text messages on the
the worst case, it can be a life-threatening one.                                            events happening around them or information
So it is essential to identify the people at the                                             about them, it is more informative to analyze
risk of depression. Face-to-face interviews and                                              the social media text data. This sentimental
a set of questionnaire are used by Psychiatrist,                                             analysis is very popular since it is needed in
to understand the behavioral health of the                                                   wide application areas of marketing, artificial
person. It provides a more accurate result, but                                              intelligence, political science, human-
                                                                                             computer interaction, psychology, stock
ISIC'21: International Semantic Intelligence Conference,
1


                                                                                             market prediction etc. Figure 1 shows the flow
February 25-27, 2021, New Delhi, India
EMAIL: nivedithasekarit@gmail.com (Niveditha Sekar);                                         diagram of depression detection system.
chandrakala@cse.sastra.edu (S Chandrakala);
gprakas_74@rediffmail.com (G Prakash)
* Corresponding Author
                2021 Copyright for this paper by its authors. Use permitted under Creative
             Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
                                                                                                  137


                                                      information in a consistent manner. The
                                                      opposite of this is also true i.e. users who are
                                                      under stress or depression are not much
                                                      interested to have communication on social
                                                      media [5], [10]. This low activeness in social
                                                      media results in lesser tweets and thereby it is
                                                      difficult to identify the emotion of the user
                                                      with accuracy. The main task in the emotion
                                                      analysis is to understand the semantic nature
                                                      of the short text messages. Most of the features
                                                      identified from the short text or tweet are
                                                      sparse features. It is really challenging to
                                                      detect the emotion from those sparse features
                                                      since they contribute very less value in the
                                                      detection of emotion [11]. In a word-level
                                                      representation, most of the words identified
                                                      are ambiguous, and they also contain stop
                                                      words. Hence, it is difficult to identify their
                                                      emotion class label by a classifier [11]. It is
Figure 1: Flow diagram of Depression                  also difficult to identify the original meaning
detection system                                      of the sentence, when it has a sarcastic tone.
                                                      Since the sentences may sound joyful, but they
2. Challenges in short text data                      actually express sadness. It leads to false
                                                      positive in the result [8], [12].
   analysis
Text data which is collected from any social
media does not have a structure. Each user
                                                      3. Short text datasets                      for
expresses his/her view in different ways and          depression detection
their text includes new words, short form of
words, errors in the spelling of the words etc.       The short text dataset can be collected through
[7], [8]. It is difficult to detect depression from   the Twitter public API [5], [13], [14] or
a single tweet of a user. Thus, we need to            through the short text datasets, which are
observe a history of tweets of a particular user      already available [15-17]. Twitter public API
[9]. Also, there is a word limit for twitter          provides a means to access the twitter software
tweets. Within 140 characters it is hard to           platform. Several software libraries are
express one’s feelings and also it is hard for        available for each programming language
the analyst to interpret their feelings [10]. In      namely tweepy for Python and rtweet for R.
order to identify their emotion, they have to         Twitter API is of two types, they are Twitter
analyze the comment and retweets about that           REST API and Twitter Streaming API. Twitter
particular tweet. This is a long chain process,       Streaming API [18], [19] will provide live
to detect the emotion of a particular user. A         tweets until you stop, whereas REST API will
large collection of tweets, from the history of       provide historical data. Table 1 lists few short
that particular user have to be taken into            text datasets, which are used in depression
account and also the comments, retweets for           detection literature.
each tweet by the user have to be considered
for this emotion detection.
      A typical social media user used to share
information about them in any of this form text
messages, photos or videos. They share
                                                                                                       138


Table 1
Summary of few Short Text Datasets for Depression Detection
    Ref.         Dataset name                  Description
    [7], [9],    CLPsych dataset               1,746 twitter users examples, in which 246 are
    [15]                                       PTSD users and 327 are depressed users.
    [9]          BellLetsTalk campaign         All tweets with #BellLetsTalk hashtag are collected,
                 dataset                       in which 95 people disclosed that they are
                                               depressed.
    [16]         CLEF/eRisk 2017 dataset       887 Reddit users examples, in which 135 are
                                               depressed.
    [10]         Sina weibo dataset            23,304 users tweets are crawled, in which
                                               11,074 users are stressed.
    [17]         LiveJournal dataset           This dataset consists of 2,132 posts. In
                                               which, 758 are depressed posts.
    [11]         SemEval 2007 dataset          This dataset consists of 1,250 news
                                               headlines. They are labelled into 6 emotions.
    [11]         ISEAR dataset                 This dataset contains 7,666 sentences. They are
                                               labelled into 7 emotions.


4. Pre-processing the short text                            Also negative references are replaced by
                                                             their full words i.e. “can’t” is replaced by
data                                                         “cannot”.
                                                            Emoticons and emojis are replaced with
Before feature selection, the short text data is             their words.
pre-processed to refine the unstructured and
noisy data. Pre-processing phase is an
important phase, as it helps in improving the            5. Feature     extraction                    and
performance.                                             representation
    In the pre-processing phase, all the non-
     ASCII, non-English characters, URLs and             5.1.    Feature extraction
     @username are removed. Since they are
     not contributing any valued information to          From the pre-processed data, the features are
     the depression detection system.                    extracted, represented and are given as input to
                                                         the classification methods. There are several
    All the acronyms are expanded to its full
                                                         features or attributes involved in the process of
     form like “idk” as “I don’t know”.
                                                         depression detection. Some of the features
    This phase performs tokenizing, stemming
                                                         used for this depression detection are user-
     and removing stop words [17], [20], [21].
                                                         level feature, tweet-level feature, temporal
     Tokenizing process will split the texts into
                                                         feature,     non-temporal      feature,    social
     sequence of tokens. Stemming process
                                                         interaction feature, content feature, posting
     will reduce the length of a word, by
                                                         behavior feature, term frequency feature, Bag-
     reducing the word to its word stem like
                                                         Of-Words (BOW) feature, hashtags, negation,
     “rained”, ”raining” as “rain”. Stop words
                                                         LIWC feature, word N-gram feature, Part-of-
     are removed, some of them are “a”, “the”,
                                                         speech (POS) feature, topic, tweet frequency,
     “and” etc.
                                                         RT [24] etc. Several feature extraction
    In each word, if a letter is appearing              techniques are available as built-in commands
     continuously more than twice then it is             in R language, SciPy, Numpy etc.
     replaced with its appropriate word [22],                  The tweet-level attributes will give
     [23] like “Noooooo” as “No”.                        information from the tweet, image, retweets,
                                                                                                  139


comments and likes. The user level attributes       model which will map words with similar
will provide more information on the emotion        context into a feature vector [28]. The GloVe
of the user; it includes the behavior of the user   representation model proves to be effective
from their social interaction and from their        and is showing improved performance when
posts. The social interaction attributes have       combined with Deep Convolutional Neural
information about the content and the structure     Network, than the state-of-art approaches [28],
in which the user communicates with his             [40].
friends [5], [10]. Tweets are classified in time
series for temporal feature, whereas history of
tweets is used in non-temporal feature. Term
                                                    6.    Depression detection methods
frequency feature gives the frequency count of
individual word or n-gram of words. POS             The    extracted    features  and     derived
feature finds the adjectives since they provide     representations are fed as input for further
more information. Negation feature gives the        modeling. Depression can be detected from the
actual opinion orientation like “not happy” is      short text data with the help of various
equivalent to “sad” [25]. Bag-Of-Words will         modeling methods, such as Discriminative
provide the occurrence of each word in a            model based methods, Ensemble model based
document. Word N-grams feature is similar to        methods, Probabilistic model based methods,
Bag-Of-Words. N-gram includes phonemes,             ANN based methods, Deep learning based
syllables, letters, words [16]. To reduce           methods and Unsupervised learning based
dimension or attributes Principal Component         methods.
Analysis (PCA) is used [26].
                                                    6.1. Discriminative model based
5.2.    Representation                                   methods
There are several feature representation            SVM is a discriminative classifier. SVM is
models are available. Some of the                   most suited for text data, because of the sparse
representation      models     are    Word2Vec      nature of the text. Text data can be categorized
representation, FastText, Global vector for         into two categories. They are user-level
word representation (GloVe) model, word N-          attributes and tweet-level attributes. In tweet-
gram feature representation, twitter specific       level category, first the features are extracted,
feature representation, word sentiment polarity     next the features are segregated into different
score representation, word representation           classes like depressed words, non-depressed
features, temporal feature vector, non-temporal     words, polarity words, stop words etc. In user-
feature vector etc.                                 level category, the user tweet history is
       Word2Vec          representation      uses   considered. All the tweets of the user are
continuous skip gram and BOW features.              considered like a single tweet and then tweet-
Based on non-temporal feature, overall              level detection is performed. It uses (BOW) to
emotion score is calculated. For temporal           get the vocabulary. Then it is trained using
feature, if a user did not tweet anything for a     SVM in original dataset, dataset balanced by
day its score is taken as zero. In such a way,      under-sampling and dataset balanced by over-
emotion score vector is calculated [27]. In         sampling. It is observed that user-level
word embedding, all the words are mapped            classification gives high performance with
into a multi-dimensional vector, where              respect to recall measure in comparison with
semantically related words are neighbors. The       the tweet-level classification even for the
word sentiment polarity score representation,       limited number of feature. Also it is difficult to
finds either the word has a strong relationship     detect whether the user is depressed or not
with positive sentiment or non-positive             from a single tweet/post, hence user-level
sentiment. To identify this, it uses the lexicon    category is used [9]. It is also observed that
based sentiment feature and Senti-wordnet.          when Linear SVM is applied on BOW feature,
FastText      is     similar    to     skip-gram    it provides good performance in terms of
representation, where each ngram has its            Recall measure [15]. SVM gives good
vector. Vector representation helps to improve      accuracy when compared with Naïve Bayes
the performance, as it provides the hidden          and Logistic regression methods [29]. Table 2
details [36]. The GloVe model is a regression
                                                                                                   140


gives the summary of few discriminative               the best fit for the depression case. Here Naïve
modeling.                                             Bayes theorem shows better performance with
                                                      respect to accuracy when compared with other
6.2. Ensemble             model           based       classifier algorithms. When evaluating with
                                                      respect to Precision and F1-score Logistic
     methods                                          Regression gives good performance [15].
                                                      Also, Naïve Bayes is the best classification
Random Forest (RF) classifier is an ensemble
                                                      approach when compared with BP neural
classifier. It is a multitude of decision tree, for
                                                      network and Decision tree. Also, Naïve Bayes
more accurate results. To detect depression
                                                      gives high precision and recall value [26].
from the text data, temporal feature and non-
                                                      Table 4 gives the summary of few
temporal feature are used. Feature vector from
                                                      probabilistic modeling.
non-temporal feature is referred as EMO.
EMO, LIWC and combination of EMO+LIWC
feature sets are given as input to Random             6.4. Artificial Neural Network
Forest classifier. It is observed that RF gives            (ANN) based methods
high precision and recall than SVM [30]; also
it provides more information with temporal            Artificial Neural Network (ANN) is combined
features [27]. RF classifier is also used to          with several unsupervised learning model to
classify the online post and communities into         detect depression from social media text data.
depressive and non-depressive. On top of the          Some of the unsupervised learning models are
extracted LIWC feature, RF is applied to              Biterm Topic model, Word2vec, Replicated
classify them. Hierarchical HMM is used for           Softmax Machine. BTM identifies words that
determining the degree of depression in the           appear together. It will identify two words that
social communities. RF, Logistic Regression,          appear together, if the size of window is given
and Gaussian NB are applied with different            as two. BTM uses topic to represent the hidden
representation methods such as Word2Vec,              aspects of the document. Word2vec is a word
FastText with Skip-gram, and GloVe. RF                embedding process, identifies both semantic
provides better performance than the other            and syntactic regularities in the sentence. And
models when combined with FastText [36].              it will group them in clusters, if the vectors
Table 3 gives the summary of few ensemble             have similar semantic meanings i.e. it
modeling.                                             computes the association with words and
                                                      groups them together. RSM is similar to term
6.3. Probabilistic          model         based       frequency counter, it will count the
                                                      occurrences of a particular word in the
     methods                                          vocabulary collected. RSM also identifies the
                                                      hidden topical structure. On top of this
Naïve Bayes is a probability based classifier.
                                                      unsupervised learning model, Stochastic
Naïve Bayes algorithm has assumes that each
                                                      Gradient Descent (SGD) model is applied.
feature is independent. Bag-Of-Words (BOW)
                                                      SGD acts as transfer learning approach, as this
approach will provide the words with its
                                                      will transfer the high-level semantic features to
occurrence frequency. BOW feature is given
                                                      ANN. In order to filter the noisy feature and to
as input to different classification algorithms
                                                      maintain the stability of this model, Sparse
like DT, NB, Linear SVM and Logistic
                                                      Encoding method is applied. The transfer
Regression. Each tweet is treated as a
                                                      learning approach used in this Hybrid Neural
document. Here Bag-Of-Words finds the
                                                      Network is called as Latent Semantic Machine
occurrence frequency of words related to
                                                      (LSM). It accepts the raw features from the
depression. Decision tree will provide results
                                                      unsupervised learning models and derives
for most of the cases, but it may be unstable
                                                      them into high level semantic feature mixture,
when there is a change in data. Linear SVM is
                                                      which will be fed into the Neural network. It is
also used for this purpose, where a straight line
                                                      observed that HNN+BTM with one LSM and
is used to differentiate classes. It uses a
                                                      HNN+BTM with two LSM performed better
maximum-margin hyperplane to perform this
                                                      in terms of F1 measure, than HNN with other
identification of classes. Logistic Regression
                                                      unsupervised learning models. It is also
uses the probability of words belonging to a
                                                      observed        that       HNN+RSM           and
particular class and curve is drawn to identify
                                                                                                 141


HNN+Word2vec with sparse encoding give              MultiChannelPoolingCNN is same as
better performance than HNN+RSM and                 MultiChannelCNN but with two different
HNN+Word2vec without sparse encoding.               max-pooling sizes 2 and 5. MutiChannel CNN
The selection of the unsupervised learning          and bi-directional GRU combined to give
models for extracting the source features           more accuracy than CNN [38]. These CNN
added more value to this HNN model [11].            variants are compared with the RNN model
      Feed Forward (FF) is a type of ANN.           and it is observed that CNN with global max
The Reddit dataset is pre-processed and fed to      pooling layer gives high performance than
FF neural network method. This FF modeling          RNN based model by providing the highest
is used for multiclass classification, which        precision and recall [7].
involves       “selfharm”,       “suicidewatch”,
“anxiety”, “depression”, etc. It is observed that          CNN with Factor Graph Model (FGM).
FF classifier gives more accurate results when      CNN is combined with the Factor Graph
compared with SVM and linear regression             Model (FGM) to extract more tweet level and
[31]. Table 5 gives the summary of few ANN          user level information. In this approach, CNN
based modeling.                                     method is applied on the dataset along with the
                                                    Cross Auto Encoders (CAE). CNN will
                                                    provide the user-level attributes, which is
6.5. Deep learning based methods                    obtained from tweet-level. Then this will be
                                                    given as input to the next phase FGM. FGM
6.5.1. Convolutional Neural                         considers three factors and three aspects of this
     Networks (CNN)                                 attributes to map this into states. The three
                                                    factors are attribute factor, dynamic factor and
 CNN with global max pooling layer. The pre-        social factor. To depict the correlation of the
processing of twitter data provides a               stress state and time with attribute, attribute
vocabulary for further phases. The words are        factor is used. Dynamic factor is used to give
encoded into a sequence of fixed length and         correlation of the stress state and dynamic
occurrence of the word is limited to two times      time. Social factor is used to depict the
in that sequence. Then unsupervised training        correlation between the stress state and time
models are used to transform the encoded            with polarity comments. The three main
words into a low dimensional vector. For this       aspects, which FGM is taking into account, are
many models are available like Skip-gram and        the following user-level attributes: posting
CBOW. Skip-gram concentrates on the                 behavior, content, and social interaction.
contextual words and is able to detect rare         Based on these factors and aspects the user-
words whereas the CBOW concentrates on the          level attributes are mapped with the respective
current words and is much of a continuous           stress state level. This CNN+FGM give better
Skip-gram. Skip-gram and CBOW are the two           performance, by providing the highest
layers of Word2Vec model. This unsupervised         precision and recall, when compared with the
training model is performed with different          traditional methods like SVM, RF, LR [10].
sense and it involves two tasks. They are
predicting word and sense from the input. For             DCNN with Global vector for word
this, it first identifies the word that occurs      representation (GloVe) model. DCNN method
together, example “happy” it can come with          helps to identify whether the tweets express
words like “journey”, “morning”, “birthday”.        positive or non-positive emotion. Before
Then Rectified Liner Unit (ReLU) is used, this      applying Deep Convolutional Neural network,
will identify the label for the missing data and    the tweets are preprocessed, features are
sense of the sentence, thereby produce the          extracted and represented into feature vector
label output.                                       using GloVe model. The GloVe model is a
       On top of these embeddings, variants of      regression model which combines the
CNN are applied. CNNWithMAX means                   following two methods local context window
Convolution with 250 layers is applied and          and global matrix factorization. Deep
then the global max pooling layer is applied to     Convolutional Neural Networks (DCNN) is
extract     the    global     information.    In    applied on the vector, generated by GloVe
MultiChannelCNN, three times CNN is                 model. The twitter specific feature vector,
applied, with the filter of length 3, 4, and 5.     unigram and bigram feature vector, word
                                                                                                142


sentiment polarity score feature vector are                LSTM and Gate Recurrent Unit (GRU).
combined into a single feature vector. In the       LSTM and GRU are best suited for predicting
first Convolutional layer, on top of the            long-term data involving delay. Combining
combined feature vector, Convolutional filter       GRU with LSTM helps in handling the
is applied to get new vector. The vector is         difficulties in LSTM, which is the training
mapped to a fixed length vector. Again              speed. GloVe representation is used to utilize
convolutional layer is applied to get new           both local and global details of the data.
vector. This GloVe+DCNN model uses three            Among RNN, LSTM, GRU and LSTM-GRU,
k-max pooling layer and three convolutional         LSTM-GRU provides better performance [37].
layers to give the probability of positive or       Table 6 gives the summary of few deep
negative sentiment in the tweet. It is observed     learning based modeling.
that GloVe+DCNN provide high precision and
recall when compared with BoW/GloVe with
SVM or LR [28].                                     6.6. Unsupervised learning based
                                                         methods
6.5.2. Recurrent Neural Networks
                                                    K-means is an unsupervised learning method.
     (RNN)                                          Before applying k-means to the observations,
                                                    the collected data is pre-processed. Then the
RNN is widely used in NLP. Word2Vec model
                                                    data is analyzed, by calculating the word
is used to represent the vocabulary. It also
                                                    frequency. Words in the vocabulary are
helps to determine or predict the word and
                                                    represented into vector, using one-hot
sense from the input. Rectified Liner Unit
                                                    encoding or word embedding process.
(ReLU) is also used. ReLU helps in
                                                    Word2Vec model can also be used to generate
identifying the missing label for the data and
                                                    vectors. Then k-means clustering is applied,
also identifies the sense of the sentence. This
                                                    words with similar meaning are grouped
embedding is given as input to RNN model.
                                                    together in clusters. Based on cosine similarity
RNN is applied with Bidirectional LSTM and
                                                    it is easy to accumulate semantically similar
context-aware attention. LSTM prevents error
                                                    words in the clusters [32]. For Latent Dirichlet
from exploding and vanishing gradient
                                                    Allocation (LDA) method, the extracted N-
problems. Bidirectional RNN connects the
                                                    gram features are fed as input. LDA is applied
output from two hidden layers of opposite
                                                    on term-document matrix and gives output as
direction to the same output. Bidirectional
                                                    topic-document matrix, which is fed into
LSTM helps to concatenate both forward and
                                                    Multilayer Perceptron (MLP). MLP works
backward      representation.     Context-aware
                                                    with 30 topics as input and two hidden layer of
attention provides the weighted sum of all
                                                    60 and 30 units. It gives comparatively
words in a sequence and also it helps to focus
                                                    moderate performance with respect to
on the more important words. It is observed
                                                    precision and recall, which is due to the
that optimized embedding performed better,
                                                    unsupervised nature of the topic extraction
than the trainable random embedding for
                                                    [33]. Table 7 gives the summary of few
RNN. Also, when compared with CNN based
                                                    unsupervised learning based modeling.
models, RNN shows low performance with
respect to precision and recall [7].

Table 2
Summary of few Discriminative Model Based Methods

 Ref.   Dataset                  Feature                  Classifier     Performance
 [9]    CLPsych2015,             Bag-Of-Words, User-level, SVM           Precision-0.58,
        BellLetsTalk 2015        tweet-level feature                     Recall-0.77

 [15]   CLPsych2015              Bag-Of-Words             SVM            Precision-0.83,
                                                                         Recall-0.83
                                                                                         143


 Ref.   Dataset                 Feature                  Classifier    Performance
 [29]   Twitter, 20newsgroups N-gram , negation, Part-   SVM           Precision-0.83,
                              Of-Speech (POS)                          Recall-0.79


Table 3
Summary of few Ensemble Model Based Methods

 Ref.   Dataset                 Feature                  Classifier    Performance
 [27]   Twitter Streaming API   Non-Temporal(EMO),     RF              Precision-0.90,
                                Temporal(EMO-TS), LIWC                 Recall-0.86
                                feature
 [30]   LiveJournal             LIWC feature             RF            Precision-0.89,
                                                                       Recall-0.90


 [36]   Twitter dataset         Skip-gram                RF            Precision-0.82,
                                                                       Recall-0.81


Table 4
Summary of few Probabilistic Model Based Methods

 Ref.   Dataset                 Feature                  Classifier    Performance
 [15]   CLPsych2015             Bag-Of-Words             Naïve Bayes   Precision-0.82,
                                                                       Recall – 0.82

 [26]   Questionnaire data      Words, pronoun and       Naïve Bayes   Precision-0.75,
                                punctuation feature                    Recall-0.62


Table 5
Summary of few Artificial Neural Network (ANN) Based Methods

 Ref.   Dataset                 Feature                  Classifier    Performance
 [31]   Reddit dataset          Linguistic features      Feed Forward Precision-0.92,
                                                         (FF)         Recall-0.88

 [11]   SemEval and ISEAR       words feature            ANN with     F1-0.60
        dataset, SinaNews                                unsupervised
        dataset                                          learning
                                                         model
                                                                                                   144


Table 6
Summary of Few Deep Learning Based Methods

 Ref.   Dataset                 Feature                     Classifier       Performance
 [7]    CLPsych2015,            Continuous Bag- Of-         CNN with         Precision-0.87,
        BellLetsTalk2015        Words, Skip gram            global max       Recall-0.87
                                                            pooling layer
 [10]   Sina Weibo’s REST APIs, User-level feature, Tweet- CNN with FGM Precision-0.90,
        Tencent Weibo           level feature                           Recall-0.96


 [28]   Stanford Twitter          Word N-gram feature,      DCNN with        Precision-0.88,
        Sentiment Test dataset, emoticons, hashtags,        GloVe model      Recall-0.87
        SemEval 2014 Task9        negation, Part- Of-Speech
        dataset, Stanford Twitter (POS) and capitalized
        Sentiment Gold dataset, words.
        Sentiment Evaluation
        Dataset, Sentiment
        Strength Twitter dataset
 [7]    CLPsych2015,            Continuous Bag- Of-         RNN with         Precision-0.63,
        BellLetsTalk2015        Words, Skip gram            bidirectional    Recall-0.65
                                                            LSTM
 [37]   IMDB dataset            words feature               LSTM and GRU F1-0.86


Table 7
Summary of few Unsupervised Learning Based Methods

 Ref.   Dataset                 Feature                     Classifier         Performance
 [32]   Twitter data            Word frequency              K-means            Cosine similarity
                                                                               helps in easy
                                                                               clustering
 [33]   eRisk 2018 pilot        N-gram features             Latent Dirichlet Precision-0.32,
        task dataset                                        Allocation (LDA) Recall-0.62


7. Analysis of             Global      word           interdisciplinary research in the field of
                                                      depression related problems. The people under
representations                                       depression can be alerted when early signs of
                                                      depression are found. eRisk 2017 dataset
7.1. Dataset overview                                 focussed on the early risk prediction with
                                                      multiple actors (Ex: Children sexual abuse)
The dataset used in the following experimental        and with single actors (Ex: Depression, bipolar
analyses is “CLEF/eRisk 2018 dataset”. The            disorder, teenage distress) from online text
aim of the CLEF eRisk is to identify the              data. eRisk 2018 dataset is formed with the
people, liable to depression from the data            2017 dataset, it involved in the early prediction
available on the Internet. It paved a way of          of Depression and Anorexia among the social
                                                                                                  145


media users. Both eRisk 2017 and eRisk 2018          two hidden layers of varying units. The output
uses the same source of data, i.e. it collects the   layer of the RNN helps in labeling the users as
social media texts from a particular collection      a depressed person or non-depressed person.
of users. The data is arranged in chronological      The performance of GloVe and RNN model is
order of 10 chunks from oldest to newest of          depicted in Table 8.
each user. It provides data for both training
and testing. The training data is divided into       Analysis with GloVe and CNN. The GloVe
depressed and control groups i.e., non-              model is observed to be effective for sentiment
depressed. The eRisk 2017 dataset is a               analysis from text data mining [28], the GloVe
collection of writings from 887 social media         representation model is combined with CNN
users, where 135 are depressed. The eRisk            to analyze the result. The dataset taken for this
2018 dataset is an extended collection of 2017       analysis consists of a few empty writings,
dataset, which consists of writings from 1,707       which are ignored. Then the dataset is pre-
users, where 214 users are depressed.                processed while preserving the emoticons and
                                                     symbols since they provide valuable
7.2. Methodologies used                              information. Each user’s writing in each chunk
                                                     is analyzed and formed a matrix of words with
Analysis with TF-IDF representation and              a pre-trained set of word embeddings. This
LDA. The eRisk dataset is pre-processed as an        pre-processed tokenized input is given to a
initial step. The TF-IDF vectorizer is well          single Convolutional layer of 100 filters with
suited for text dataset. As this will provide the    CReLU activation. A Single max pooling layer
unique list of words used in the dataset, along      is applied to classify each user as depressed or
with their frequency of occurrence. It helps in      not. The GloVe model is combined with
classifying the words under a particular set of      different layers of CNN and LSTM network
topics. The TF-IDF vectorizer of Scikit-learn        and the performance is observed high for
converts the writings of social media users into     GloVe with the multiple layers of CNN and
a matrix of TF-IDF features. The terms               bi-LSTM [39]. The performance of GloVe
extracted using the TF-IDF vectorizer is             and CNN model is depicted in Table 8.
formed as a matrix and given as input to
Latent Dirichlet Allocation (LDA). The output        7.3. Performance analysis
of LDA is the topic matrix. As each document
is composed of different topics or attributes.       The Classification report and Confusion
And each topic is composed of different              matrix is used to analyze the performance of
words. This topic matrix is given as input to        the above methodologies. The following
the MLP model, it consists of two intermediate       Table 8 shows Precision, Recall and F1 of
layers of 50 & 20 units. By this approach, each      these three methods.
user is labeled as depressed or not. The                      The TF-IDF representation focuses
performance of the TF-IDF and LDA model is           mainly on the frequency of word occurrence in
depicted in Table 8.                                 documents. Then maps the word into an
                                                     appropriate topic, thereby it classifies them. In
Analysis with GloVe and RNN. GloVe                   this case, whenever some word related to
combines the advantage of methods local              depression comes, it classifies them as
context      window   and     global     matrix      depressed, which is not the ideal way. The
factorization, to provide meaningful word            Global vector for word representation
insights. The GloVe model is providing               considers the frequency of word occurrence
promising results for text classification. It is     and the frequency of co-occurrence of words
combined with the RNN model, as RNN                  thereby provides more valuable information in
model widely used for text classification and        classifying them. The GloVe representation
Natural Language Processing (NLP). The               founds to be significantly better than the most
eRisk dataset is pre-processed and tokenized         commonly used word representation TF-IDF.
and then given as input to the GloVe                 The RNN and CNN classifier works well with
representation model. To provide meaningful          text representation and its performances are
statistics, GloVe forms the word-to-word co-         analyzed with GloVe representation. From the
occurrence matrix. The Resultant of GloVe is         table, it is found that GloVe representation is
given as input to the RNN model. It involves         better than TF-IDF representation. Also,
                                                                                             146


GloVe representation performs well with CNN        that GloVe representation with CNN classifier
than RNN. This is because the RNN model            provides comparatively better results.
gives better result with word embeddings of
higher length. From the analysis, it is found
Table 8
Summary of Performance Analysis

Methodology                            Precision         Recall                 F1
TF-IDF representation
                                         0.62             0.28                 0.49
with LDA
GloVe with RNN [34]                      0.80             0.20                 0.31

GloVe with CNN [35]                      0.42             0.66                 0.51
GloVe with multiple CNN and
                                         0.60             0.54                 0.55
bi-LSTM [39]

8.    Future directions                            References
The performance of depression detection            [1] Juyoung Song, Tae Min Song, Dong-Chul
system can be improved or made more                    Seo, and Jae Hyun Jin.: “Data Mining of
meaningful with the following directions for           Web-Based Documents on Social
future research.                                       Networking Sites that Included Suicide-
 The depression detection task can also be            Related     Words      among      Korean
    done by extracting the emotions from the           Adolescents”. Journal of Adolescent
    speech data.                                       Health, 59(6):668–673 2016.
 It can also be extended by grouping users,       [2] Budhaditya Saha, Thin Nguyen, Dinh
    based on gender, age, locations and other          Phung, and Svetha Venkatesh.: “A
    demographic attributes.                            Framework for Classifying Online Mental
 The Spatiotemporal features from video               Health-Related Communities with an
    data can also be included, as they                 Interest in Depression”. IEEE Journal of
    contribute more information.                       Biomedical and Health Informatics,
 Daily variation of a user’s depression can           20(4):1008–1015 2016.
    also be monitored.                             [3] Sharath Chandra Guntuku, David B
                                                       Yaden, Margaret L Kern, Lyle H Ungar,
 It can be extended by including the
                                                       and Johannes C Eichstaedt.: “Detecting
    medical context, so the clinical depression
                                                       Depression and Mental Illness on Social
    can be detected from social media data.
                                                       Media: An Integrative Review”. Current
                                                       Opinion in Behavioral Sciences, 18:43–49
9.    Conclusion                                       2017.
                                                   [4] Elizabeth M Seabrook, Margaret L Kern,
This paper provided an overview of the                 Ben D Fulcher, and Nikki S Rickard.:
depression detection system and the analysis           “Predicting Depression from Language-
of global word representations from the short          Based Emotion Dynamics: Longitudinal
text data. Datasets and Machine learning               Analysis of Facebook and Twitter Status
methods, used in recent years for the                  Updates”. Journal of Medical Internet
depression detection are summarized. The               research, 20(5) 2018.
global word representations model proved to        [5] Guangyao Shen, Jia Jia, Liqiang Nie, Fuli
be effective is analyzed with different                Feng, Cunjun Zhang, Tianrui Hu,
classifiers. Various challenges and future             TatSeng Chua, and Wenwu Zhu.:
directions are summarized for future research.         “Depression Detection via Harvesting
                                                       Social Media: A Multimodal Dictionary
                                                       Learning Solution”. In Proceedings of the
                                                                                              147


     Twenty-Sixth       International      Joint   [15] Moin Nadeem.: “Identifying Depression
     Conference on Artificial Intelligence              on       Twitter”.     arXiv      preprint
     (IJCAI-17), pages 3838–3844, 2017.                 arXiv:1607.07384 2016.
[6] Andrew G Reece and Christopher M               [16] Maxim Stankevich, Vadim Isakov,
     Danforth.: “Instagram Photos Reveal                Dmitry Devyatkin, and Ivan Smirnov.:
     Predictive Markers of Depression”. EPJ             “Feature Engineering for Depression
     Data Science, 6(1):15 2017.                        Detection in Social Media” 2018.
[7] Ahmed Husseini Orabi, Prasadith                [17] Maryam Mohammed Aldarwish and
     Buddhitha, Mahmoud Husseini Orabi,                 Hafiz Farooq Ahmad.: “Predicting
     and Diana Inkpen.: “Deep Learning for              Depression Levels Using Social Media
     Depression Detection of Twitter Users”.            Posts”. In Autonomous Decentralized
     In Proceedings of the Fifth Workshop on            System (ISADS), 2017 IEEE 13th
     Computational Linguistics and Clinical             International Symposium on, pages 277–
     Psychology: From Keyboard to Clinic,               280 2017.
     pages 88–97 2018.                             [18] Felix Ming Fai Wong, Chee Wei Tan,
[8] Harshali P Patil and Mohammad Atique.:              Soumya Sen, and Mung Chiang.:
     “Sentiment Analysis for Social Media: A            “Quantifying Political Leaning from
     Survey”. In Information Science and                Tweets, Retweets, and Retweeters”. IEEE
     Security (ICISS), 2015 2nd International           Transactions on Knowledge and Data
     Conference on, pages 1–4 2015.                     Engineering, 28(8):2158–2172 2016.
[9] Zunaira Jamil.: “Monitoring Tweets for         [19] Zhenhua Zhang, Qing He, Jing Gao, and
     Depression to Detect At-risk Users”,               Ming Ni.: “A Deep Learning Approach
     2017.                                              for Detecting Traffic Accidents from
[10] Huijie Lin, Jia Jia, Jiezhong Qiu,                 Social Media Data”. Transportation
     Yongfeng Zhang, Guangyao Shen,                     Research Part C: Emerging Technologies,
     Lexing Xie, Jie Tang, Ling Feng, and Tat-          86:580–596 2018.
     Seng Chua.: “Detecting Stress Based on        [20] C Sindhu, Dyawanapally Veda Vyas, and
     Social Interactions in Social Networks”.           Kommareddy Pradyoth.: “Sentiment
     IEEE Transactions on Knowledge and                 Analysis Based Product Rating Using
     Data Engineering, 29(9):1820–1833 2017.            Textual Reviews”. In Electronics,
[11] Xiangsheng Li, Yanghui Rao, Haoran                 Communication         and       Aerospace
     Xie, Raymond Yiu Keung Lau, Jian Yin,              Technology (ICECA), 2017 International
     and Fu Lee Wang.: “Bootstrapping Social            conference of, volume 2, pages 727–731.
     Emotion Classification with Semantically      [21] Anukarsh G Prasad, S Sanjana, Skanda M
     Rich Hybrid Neural Networks”. IEEE                 Bhat, and B S Harish.: “Sentiment
     Transactions on Affective Computing,               Analysis for Sarcasm Detection on
     8(4):428–442 2017.                                 Streaming Short Text Data”. In
[12] Monireh Ebrahimi, Amir Hossein                     Knowledge Engineering and Applications
     Yazdavar, and Amit Sheth.: “Challenges             (ICKEA), 2017 2nd International
     of Sentiment Analysis for Dynamic                  Conference on, pages 1–5 2017.
     Events”. IEEE Intelligent Systems,            [22] Sonia Xylina Mashal and Kavita Asnani.:
     32(5):70–75 2017.                                  “Emotion Intensity Detection for Social
[13] Zhaoxia Wang, Chee Seng Chong, Landy               Media       Data”.      In     Computing
     Lan, Yinping Yang, Seng Beng Ho, and               Methodologies      and     Communication
     Joo    Chuan     Tong.:     “Fine-Grained          (ICCMC), 2017 International Conference
     Sentiment Analysis of Social Media with            on, pages 155–158 2017.
     Emotion      Sensing”.       In      Future   [23] Ahmed Al-Saffar, Suryanti Awang, Hai
     Technologies Conference (FTC), pages               Tao, Nazlia Omar, Wafaa Al-Saiagh, and
     1361–1364 2016.                                    Mohammed Al-bared.: “Malay Sentiment
[14] Sara Rosenthal, Noura Farra, and Preslav           Analysis      Based      on     Combined
     Nakov.:     “SemEval-2017        task    4:        Classification Approaches and Senti-
     Sentiment Analysis in Twitter”. In                 Lexicon      Algorithm”.     PloS    One,
     Proceedings of the 11th International              13(4):e0194852 2018.
     Workshop on Semantic Evaluation               [24] Sho Tsugawa, Yusuke Kikuchi, Fumio
     (SemEval-2017), pages 502–518 2017.                Kishino, Kosuke Nakajima, Yuichi Itoh,
                                                                                           148


     and Hiroyuki Ohsaki.: “Recognizing              Informed Deep Learning”. Scientific
     Depression from Twitter Activity”. In           reports, 7:45141 2017.
     Proceedings of the 33rd Annual ACM         [32] Long Ma, Zhibo Wang, and Yanqing
     Conference on Human Factors in                  Zhang.:       “Extracting      Depression
     Computing Systems, pages 3187–3196              Symptoms from Social Networks and
     2015.                                           Web Blogs via Text Mining”. In
[25] Amit G Shirbhate and Sachin N                   International        Symposium          on
     Deshmukh.: “Feature Extraction for              Bioinformatics         Research        and
     Sentiment Classification on Twitter             Applications, pages 325–330 2017.
     Data”. International Journal of Science    [33] Maupome´.: “Using Topic Extraction on
     and Research (IJSR) ISSN (Online),              Social Media Content for the Early
     pages 2319–7064 2016.                           Detection of Depression” 2018.
[26] Changye Zhu, Baobin Li, Ang Li, and        [34] Paul, S., Kalyani, J.S. and Basu, T.,:
     Tingshao Zhu.: “Predicting Depression           “Early Detection of Signs of Anorexia
     from Internet Behaviors by Time-                and Depression Over Social Media using
     Frequency Features”. In Web Intelligence        Effective         Machine         Learning
     (WI),        2016       IEEE/WIC/ACM            Frameworks”. 2018.
     International Conference on, pages 383–    [35] Trotzek, M., Koitka, S. and Friedrich,
     390 2016.                                       C.M.,: “Word Embeddings and Linguistic
[27] Xuetong Chen, Martin D Sykora, Thomas           Metadata at the CLEF 2018 Tasks for
     W Jackson, and Suzanne Elayan.: “What           Early Detection of Depression and
     About Mood Swings: Identifying                  Anorexia” 2018.
     Depression on Twitter with Temporal        [36] Kaibi, Ibrahim, and Hassan Satori.: “A
     Measures of Emotions”. In Companion of          comparative      evaluation    of     word
     the The Web Conference 2018 on The              embeddings techniques for twitter
     Web Conference 2018, pages 1653–1660            sentiment      analysis”.          In 2019
     2018.                                           International Conference on Wireless
[28] Zhao Jianqiang, Gui Xiaolin, and Zhang          Technologies, Embedded and Intelligent
     Xuejun.: “Deep Convolution Neural               Systems (WITS), pp. 1-4. IEEE 2019.
     Networks      for    Twitter  Sentiment    [37] Ni, Ru, and Huan Cao.: “Sentiment
     Analysis”. IEEE Access, 6:23253–23260           Analysis based on GloVe and LSTM-
     2018.                                           GRU”. In 2020 39th Chinese Control
[29] Anees Ul Hassan, Jamil Hussain,                 Conference (CCC), pp. 7492-7497. IEEE
     Musarrat Hussain, Muhammad Sadiq, and           2020.
     Sungyoung Lee.: “Sentiment Analysis of     [38] Cheng, Yan, Leibo Yao, Guoxiong Xiang,
     Social Networking Sites (SNS) Data              Guanghe Zhang, Tianwei Tang, and
     Using Machine Learning Approach for             Linhui      Zhong.:    “Text     sentiment
     the Measurement of Depression”. In              orientation analysis based on multi-
     Information       and     Communication         channel CNN and bidirectional GRU with
     Technology Convergence (ICTC), 2017             attention mechanism”. IEEE Access 8:
     International Conference on, pages 138–         134964-134975 2020.
     140 2017.                                  [39] Goularas, Dionysis, and Sani Kamis.:
[30] Iram Fatima, Hamid Mukhtar, Hafiz               “Evaluation of deep learning techniques
     Farooq Ahmad, and Kashif Rajpoot.:              in sentiment analysis from Twitter data”.
     “Analysis of User-Generated Content             In 2019 International Conference on Deep
     from Online Social Communities to               Learning and Machine Learning in
     Characterise and Predict Depression             Emerging Applications (Deep-ML), pp.
     Degree”. Journal of Information Science,        12-17. IEEE 2019.
     page 0165551517740835 2017.                [40] Abid, Fazeel, Muhammad Alam, and
[31] George Gkotsis, Anika Oellrich, Sumithra        Adnan Abid.: “Representation of Words
     Velupillai, Maria Liakata, Tim JP               Over Vectors in Recurrent Convolutional
     Hubbard, Richard JB Dobson, and Rina            Attention Architecture for Sentiment
     Dutta.: “Characterisation of Mental             Analysis”.       In 2019     International
     Health Conditions in Social Media using         Conference on Innovative Computing
                                                     (ICIC), pp. 1-8. IEEE 2019.