

Semantic Analysis of Sentiments through Web-Mined Twitter
Corpus
Satish Chandraa, Mahendra Kumar Gourisariaa, Harshvardhan GMa, Siddharth Swarup
Rautaraya, Manjusha Pandeya and Sachi Nandan Mohantyb
a School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar-751024, Odisha, India
b Dept. of Computer Science & Engineering, ICFAITech, ICFAI Foundation for Higher Education, Hyderabad-500082, India


Abstract
A huge amount of textual data is generated due to the boom of microblogging. Microblogging sites such as Facebook, Twitter and Google+ are used by millions of people to express their views and emotions on different subjects. In this paper, we discuss sentiment analysis on a Twitter dataset containing tweets from many different users. Sentiment analysis is useful for gauging public opinion from large volumes of text data that are highly unstructured and heterogeneous. Different classification techniques, namely Support Vector Machine, Logistic Regression, Logistic Regression with a Stochastic Gradient Descent optimizer, Decision Tree classification, Naïve Bayes, Bidirectional LSTM and Random Forest classification, have been applied to analyze the sentiment of people, i.e., whether their tweets are positive or negative. The corpus has been analyzed by plotting descriptive insights such as the word cloud and the frequency of positive and negative tweets. The best classifier was selected by comparing the results on accuracy, recall, precision, F1 score, AUC score and the ROC curve.

                 Keywords
                 Sentiment Analysis, Twitter, Natural Language Processing, Word2Vec, Support Vector
                 Machine, Logistic Regression, Random Forest.


ISIC'21: International Semantic Intelligence Conference, February 25–27, 2021, Delhi, India
EMAIL: schandra1.sc@gmail.com (S. Chandra); mkgourisaria2010@gmail.com (M. K. Gourisaria); harrshvardhan@gmail.com (H. GM); siddharthfcs@kiit.ac.in (S.S. Rautaray); manjushafcs@kiit.ac.in (M. Pandey); sachinandan09@gmail.com (S.N. Mohanty)
ORCID: 0000-0002-6881-2668 (S. Chandra); 0000-0002-1785-8586 (M. K. Gourisaria); 0000-0003-3592-2931 (H. GM); 0000-0002-3864-2127 (S.S. Rautaray); 0000-0002-6077-5794 (M. Pandey); 0000-0002-4939-0797 (S.N. Mohanty)
©️ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

With the universality of microblogging and social networking sites, Twitter, with 319 million monthly users, has become a valuable resource for individuals and organizations to post blogs and express their views and opinions on different subjects like politics, sports, movies, etc. [1]. Stimulated by the growth of social media, many companies and media organizations are trying to mine Twitter to observe people's views and understand what they feel and think about their products [2]. As a result, sentiment analysis on Twitter is an effective way of reckoning public opinion, and it offers the potential of observing numerous social networking sites in real time.

Twitter limits each tweet to 140 characters [3], which causes individuals to use short phrases in their tweets. Sentiment analysis automatically detects whether a section of text contains emotional or opinionated content, and it also determines the polarity of the text. Generally, the dataset consists of a group of tweets where each tweet is annotated with a sentiment label. Commonly, sentiments are labeled positive, negative or neutral; however, some datasets have mixed or irrelevant tags too, ranging from -5 to 5 and depicting negative to positive polarity [4]. Twitter sentiment analysis is helpful for understanding public temperament about different social or cultural events and for forecasting volatility in the stock exchange [5].


Sentiment analysis on Twitter is something of a challenge due to the short length of tweets. The unstructured and heterogeneous data compelled us to apply a preprocessing step before feature extraction [6]. The various preprocessing steps include URL removal, negation replacement, stopword removal, number removal and acronym expansion. The preprocessing has been done with the help of the Natural Language Toolkit (NLTK). Feature extraction then has two phases: first, normal text is formed by eliminating the Twitter-specific features, and then feature extraction is carried out to extract more features [1].

This research paper is organized as follows. Section 2 briefs related works on sentiment analysis. In Section 3 we discuss the methodology and materials, covering data exploration, data preprocessing and feature extraction; we also describe the different classification algorithms used in the implementation, namely Support Vector Machine, Logistic Regression, Logistic Regression - Stochastic Gradient Descent, Decision Tree, Naïve Bayes, Bidirectional Long Short-Term Memory (BiLSTM) and Random Forest. In Section 4 we show the results, analyses and comparison of models. Section 5 comprises the conclusion and future work.

2. Related works

With the advancement of Natural Language Processing (NLP), research on sentiment analysis ranges from document-level classification [7] to word- and phrase-level classification [8]. A method to retrieve semantic information from a large corpus was presented by Hatzivassiloglou and McKeown. This method separates domain-dependent details and adapts to a novel domain when the corpus is substituted. Their model focuses on adjectives, aiming to identify near-synonyms and antonyms.

To increase the efficiency and accuracy of the model, [9] used an ensemble framework for sentiment analysis. They utilized movie reviews and multi-domain datasets extracted from Amazon product reviews, which include reviews of Books, Electronics, DVD and Kitchen products. They succeeded in framing the ensemble by combining various classification techniques and feature sets. They used two types of feature sets, word relations and part-of-speech information, and three types of classifiers, maximum entropy, Support Vector Machines and Naïve Bayes, to form the ensemble framework. Weighted-combination, fixed-combination and meta-classifier ensemble techniques were used for sentiment analysis, and better accuracy was attained [9].

People on social networking sites give their opinion about anything and everything, and it is a challenge to recognize all types of data for training. Therefore, [2] proposed a model to study sentiment from the hash-tagged (HASH) data set, the iSieve data set and the emoticon (EMOT) dataset. The authors trained their model on a variety of feature extraction techniques, such as lexicon features, part-of-speech (POS) features, n-gram features and microblogging features. They concluded that in the microblogging domain the POS feature may not be useful, and that the benefits of the emoticon dataset are lessened when microblogging features are included [2].

The authors of [10] discussed social network analysis and Twitter as a rich source for sentiment analysis, and proposed a model to implement Twitter sentiment analysis by fetching data from the Twitter APIs. Their analysis is based on different queries about job opportunities, and the dataset has positive, negative and neutral labels. They noted that neutral sentiments are high in comparison to positive or negative ones, which shows that there is a need to improve Twitter sentiment analysis [10].

Twitter has become increasingly popular in the field of politics. A real-time sentiment analyzer for the incumbent, President Barack Obama, and the nine other challengers of the 2012 election cycle was designed by [11]. They used IBM's InfoSphere Streams platform (IBM, 2012) for speed, accuracy and pipelining of real-time data. Using the Twitter firehose, they constructed logical keyword combinations to recover relevant tweets about candidates and events. They achieved an accuracy of 59% [11].

Some researchers have tried to determine the public point of view on different subjects like politics, movies, news, etc. from Twitter posts [12]. The authors of [13] used IMDB, a popular Internet database containing movie information, and Blippr, a social networking site where reviews are in the form of 'Iblips'. Their analysis gave an F-score as high as 0.9 using SVM and demonstrated domain adaptation as a useful technique for sentiment analysis. They introduced a new feature reduction technique, the Relative Information Index (RII), which combines with another popular technique, 'thresholding', to form a good feature reduction technique that not only reduces the features but also improves the F-score [13].

The importance of sentiment analysis has increased so much that it is now in use in various industries, such as hotel management. In this regard, [14] classified public reviews of a hotel into positive and negative. They collected 800 reviews from TripAdvisor and performed the preprocessing step with NLTK in Python. They used various classifiers: Logistic Regression, Random Forest, the Stochastic Gradient Descent classifier, Naïve Bayes and Support Vector Machine. Their analysis was that the Naïve Bayes classifier was the best among them, but the Stochastic Gradient classifier also worked well. The analysis was based on the results of accuracy, recall, precision and F1-score [14].

Table 1
Tabular presentation of the related work

[9] (2011)
    Dataset used: Movie reviews and a multi-domain dataset extracted from Amazon, which includes reviews of Books, DVD, Electronics and Kitchen products.
    Models implemented: Two types of feature sets (word relations and part-of-speech) with Maximum Entropy, Support Vector Machines and Naïve Bayes; weighted-combination, fixed-combination and meta-classifier ensemble techniques were also used.
    Observation/Results: They observed that the ensemble technique was very efficient in obtaining accurate results.

[2] (2011)
    Dataset used: Hash-tagged (HASH) data set, iSieve data set and the emoticon (EMOT) dataset.
    Models implemented: The model was trained on a variety of feature extraction techniques, such as lexicon features, n-gram features, part-of-speech (POS) features and microblogging features.
    Observation/Results: The best result was obtained from n-gram features along with lexicon features; POS features may not be useful in the microblogging domain.

[10] (2019)
    Dataset used: Data obtained from the Twitter API for different job-opportunity queries.
    Models implemented: NLTK was used to find the different categories of tweets: strongly positive, positive, weakly positive, neutral, weakly negative, negative and strongly negative.
    Observation/Results: They concluded that neutral tweets are significantly high in most of the queries, indicating room for improving Twitter sentiment analysis.

[11] (2012)
    Dataset used: Data obtained from the Twitter API during the 2012 US presidential election.
    Models implemented: A real-time sentiment analyzer for the incumbent, President Barack Obama, and the nine other challengers, built on IBM's InfoSphere Streams platform (IBM, 2012) for speed, accuracy and pipelining of real-time data; logical keyword combinations over the Twitter firehose recovered relevant tweets about candidates and events.
    Observation/Results: They achieved an accuracy of 59%.

[13] (2011)
    Dataset used: IMDB, a popular Internet database containing movie information, and Blippr, a social networking site where reviews are in the form of 'Iblips'.
    Models implemented: SVM with a new feature reduction technique, the Relative Information Index (RII), combined with the popular 'thresholding' technique to reduce features while improving the F-score.
    Observation/Results: Their analysis gave an F-score as high as 0.9 using SVM and demonstrated domain adaptation as a useful technique for sentiment analysis.

[14] (2018)
    Dataset used: 800 public hotel reviews collected from TripAdvisor, classified into positive and negative.
    Models implemented: Logistic Regression, Random Forest, Stochastic Gradient Descent classifier, Naïve Bayes and Support Vector Machine.
    Observation/Results: Naïve Bayes was the best classifier among them, but the Stochastic Gradient classifier also worked well; the analysis was based on accuracy, recall, precision and F1-score.


3. Materials and methods

The study of computer algorithms that improve automatically from experience is known as machine learning. The data and desired outputs are fed into the machine learning model, and the machine creates its own programming logic to predict the result. The dataset is split into two parts: a training part, which contains the input feature vectors and their labels, and a testing part. A classification model is developed with a specific algorithm using the training part to learn a pattern. The testing part is used to obtain the accuracy of the model, which tells whether the model is a good fit, underfit or overfit.

3.1. Data exploration

The dataset used in this work was taken from UCI/Kaggle [15] in csv (comma-separated values) format and contains 1.6 million tweets. Preprocessing of the data, including tokenization, stemming and stopword removal, was done to clean the text. A feature vector was created using relevant features. Data mining classification algorithms such as Decision Tree, Logistic Regression, Random Forest, SVM, Naive Bayes and the LR-SGD classifier were used to gather the accuracy by classifying the tweets as positive or negative. Fig. 1 shows the algorithm adopted for sentiment analysis.

Figure 1: Workflow for Twitter sentiment analysis

Exploring the data has a key role in machine learning, as it helps us visualize the types and statistics of the data [16]. Here, the dataset consists of 0.8M positive and 0.8M negative tweets, shown in Fig. 2 (a). As it is text data, the word cloud can also be visualized, as shown in Fig. 2 (b).


Figure 2: (a) Statistics of positive and negative dataset (b) Word cloud of the dataset

3.2. Data preprocessing

As the Twitter datasets are composed of unstructured, heterogeneous, ill-formed words, irregular grammar and non-dictionary terms, the tweets were cleaned by various NLTK methods before feature extraction [1]. The preprocessing steps are [12], sketched in the code below:

• Eliminating all non-English and non-ASCII characters from the text.
• Removing all URL links, as they do not provide any information about the sentiment.
• Removing numbers, as they are not useful in finding sentiment.
• Removing stop words, the most frequent words in a language, such as "as", "an", "about", "any", etc. There are many stopwords in English literature; they do not play any role in finding the sentiments, so they are removed from the dataset.
• The stopword lists also contain "not", but it is not removed from the tweets, as it is crucial in analyzing negative reviews.
• Stemming, the process of bringing words back to their original form, e.g., "loved" becomes "love", and so on.
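The cleaning pipeline above can be sketched with NLTK as follows. This is a minimal illustration, assuming the stopwords and punkt resources have been downloaded; the exact regular expressions, and the way "not" is excluded from the stopword list, are our choices for the example rather than the authors' exact code.

    import re
    from nltk.corpus import stopwords          # requires nltk.download('stopwords')
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize    # requires nltk.download('punkt')

    stemmer = PorterStemmer()
    # Keep "not": it is crucial for analyzing negative reviews (see above).
    stop_words = set(stopwords.words('english')) - {'not'}

    def clean_tweet(text):
        text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # remove URL links
        text = re.sub(r'[^A-Za-z\s]', ' ', text)             # drop numbers, non-English chars
        tokens = word_tokenize(text.lower())
        return ' '.join(stemmer.stem(t) for t in tokens if t not in stop_words)

    print(clean_tweet("I loved this movie!!! http://t.co/xyz 10/10"))  # -> 'love movi'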
3.3. Feature extraction

In feature extraction, the vector space model is used for document representation. A vector is created whose dimension equals the size of the English vocabulary, with every element initialized to 0. If a text contains a vocabulary word, a count is placed in the corresponding dimension, as shown in Eqn. 1, Eqn. 2 and Eqn. 3: every time a text featuring that word is encountered, the count is increased, leaving 0's for the words that were not found even once. The 2nd edition of the Oxford dictionary contains 171,476 words in current use [17], so a vector made over all these words would give a model of high variance, and this is where feature selection comes into account. For proper weighting and feature extraction, the count vectorizer method was used, which keeps track of frequent terms as well as rare words. The vector space model improves the accuracy. The feature extraction method is used for dimensionality reduction by removing non-informative and rare words. A Bag of Words model is created which contains the most frequent words from the feature vector to improve the accuracy [1].

    Love = [0,0,0,1,0,0,0, ..., 0]   (1)
    Good = [0,0,0,0,2,0,0, ..., 0]   (2)
    Day  = [0,0,0,0,0,5,0, ..., 0]   (3)
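As a hedged sketch of this count-vectorizer step, scikit-learn's CountVectorizer builds exactly such a document-term count matrix; the max_features cap standing in for the "most frequent words" pruning is our assumption:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["love love this day", "good day", "bad day"]

    # Keep only the most frequent terms, mirroring the Bag of Words pruning above.
    vectorizer = CountVectorizer(max_features=1000)
    X = vectorizer.fit_transform(corpus)          # sparse matrix of word counts
    print(vectorizer.get_feature_names_out())     # ['bad', 'day', 'good', 'love', 'this']
    print(X.toarray())                            # first row counts 'love' twice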
Other than the Bag-of-Words model, tokenization was also used for the Bidirectional Long Short-Term Memory, in which raw texts are broken up into unique units, i.e., tokens. Each token has its own unique token id. In tokenization, a vector is created with a size equal to the number of unique words in the corpora. A sequence of tokens is created and represented as a vector, as shown in Eqn. 4 and Eqn. 5. As each tweet has a different length, its token sequence also has a different length, which makes it difficult to feed into deep learning algorithms, as they require sequences of the same length [18]. To counter this problem, padding and truncating come into account, where the length of the padded sequence is defined. If a tokenized sequence is longer than the padded length, the tokens beyond that length are truncated, i.e., removed; if it is shorter, the sequence is padded with "0". If the length of the padded sequence is chosen to be 6, then Eqn. 4 is truncated as shown in Eqn. 6, and Eqn. 5 is padded as shown in Eqn. 7.

    What consumes your mind controls your life = [32,13,21,122,781,45,23]   (4)
    practice makes a man perfect = [53,321,32,48,44]   (5)
    What consumes your mind controls your life = [32,13,21,122,781,45]   (6)
    practice makes a man perfect = [53,321,32,48,44,0]   (7)
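A short sketch of this tokenize-then-pad step with the Keras preprocessing utilities (padding and truncating at the end of the sequence, 'post', is our assumption; a padded length of 6 mirrors the example above):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    texts = ["what consumes your mind controls your life",
             "practice makes a man perfect"]

    tokenizer = Tokenizer()                      # assigns every unique word an integer id
    tokenizer.fit_on_texts(texts)
    seqs = tokenizer.texts_to_sequences(texts)   # variable-length sequences of token ids

    # Force every sequence to length 6: longer ones truncated, shorter ones padded with 0.
    padded = pad_sequences(seqs, maxlen=6, padding='post', truncating='post')
    print(padded)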



3.4. Classification Algorithms

Classification algorithms are a central part of supervised machine learning: a classification algorithm assigns a class label to each data point. In this paper, classification algorithms play the crucial role of labeling tweets as positive or negative.

3.4.1. Bidirectional Long Short-Term Memory (BiLSTM)

A traditional neural network cannot remember previous inputs, yet previous information is essential for predicting the next word. A Recurrent Neural Network (RNN) has the potential of remembering everything from the past, as it has loops and hidden layers. The loops in an RNN allow the network to persist information. An RNN translates independent activations into dependent activations by furnishing equal biases and weights to all layers; this reduces the complexity of growing the parameters, and the result of one layer is the input to the following hidden layers [19]. Long Short-Term Memory (LSTM) is a special form of RNN which has the potential to learn long-term dependencies; LSTMs are designed to avoid the long-term dependency problem. In an LSTM, the hidden layer of the RNN is replaced by the Long Short-Term Memory cell. The LSTM memory cell is described by Eqns. 8-12.

Figure 3: Memory cell of LSTM

    i_t = σ(W_yi y_t + W_ki k_(t-1) + W_ci c_(t-1) + b_i)   (8)
    o_t = σ(W_yo y_t + W_ko k_(t-1) + W_co c_(t-1) + b_o)   (9)
    f_t = σ(W_yf y_t + W_kf k_(t-1) + W_cf c_(t-1) + b_f)   (10)
    c_t = f_t c_(t-1) + i_t tanh(W_yc y_t + W_kc k_(t-1) + b_c)   (11)
    k_t = o_t tanh(c_t)   (12)

where σ represents a logistic sigmoid function, and i, o, f and c represent the input gate, output gate, forget gate and cell vectors, all of which have the same dimension as the hidden vector k [19].

Bidirectional Long Short-Term Memory (BiLSTM) is an extension of LSTM, designed by putting together two independent LSTMs. This structure permits the neural network to have both forward and backward information at every time step: the data is run in two ways, one from past to future and one from future to past, so the model is able to preserve information from both the past and the future. Fig. 4 shows the Bidirectional LSTM [20].

Figure 4: A Bidirectional LSTM Network
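A minimal Keras sketch of such a BiLSTM sentiment classifier is given below; the vocabulary cap, embedding size and layer widths are illustrative assumptions, not the configuration reported by the paper:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

    VOCAB_SIZE, SEQ_LEN = 20000, 60      # assumed vocabulary cap and padded tweet length

    model = Sequential([
        Embedding(VOCAB_SIZE, 128, input_length=SEQ_LEN),  # token ids -> dense vectors
        Bidirectional(LSTM(64)),         # one forward and one backward pass per sequence
        Dense(1, activation='sigmoid'),  # probability that the tweet is positive
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    # model.fit(X_train_padded, y_train, validation_split=0.1, epochs=3, batch_size=256)

The Bidirectional wrapper trains two LSTMs over the same padded token sequences, one reading left-to-right and one right-to-left, and concatenates their outputs, which is what gives the model access to both past and future context.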

3.4.2. Logistic regression

Logistic regression is an example of a linear classifier used to classify the class of data. Logistic regression determines the link between the independent and dependent variables by estimating probabilities [16]. It returns a probability by transforming the output with the logistic sigmoid function. Fig. 5 shows the linear regression graph, whose equation is given by Eqn. 13:

    Y = B_0 + B_1 X   (13)

The equation of the sigmoid function [22] is

    P = 1 / (1 + e^(-Y))   (14)

Applying Eqn. 14 to Eqn. 13 and solving for Y gives Eqn. 15, the logistic regression equation:

    ln(P / (1 - P)) = B_0 + B_1 X   (15)

The graph is now converted into the logistic regression curve shown in Fig. 6.

Figure 5: Linear regression graph

Figure 6: Logistic Regression curve
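A hedged sketch of this classifier over the bag-of-words features (X_train, y_train and the test split are assumed to come from the vectorization of Section 3.3):

    from sklearn.linear_model import LogisticRegression

    # y: 0 = negative tweet, 1 = positive tweet
    clf = LogisticRegression(max_iter=1000)   # raised iteration cap for sparse text data
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))          # mean accuracy on held-out tweets
    print(clf.predict_proba(X_test[:1]))      # sigmoid-derived class probabilities (Eqn. 14)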
3.4.3. Logistic Regression-Stochastic Gradient Descent Classifier

Logistic Regression-Stochastic Gradient Descent (LR-SGD) is a type of linear model, also known as Incremental Gradient Descent [14]. The LR-SGD classifier is an effective way of learning linear classifiers under different loss functions and penalties, such as Logistic Regression and Support Vector Machines: the 'log' loss function optimizes Logistic Regression, while the 'hinge' loss function optimizes the Support Vector Machine. The SGD classifier has recently gained much significance in the field of large-scale learning, although it has been around in the machine learning community for a long time [21]. Sparse, large-scale machine learning problems, which are often encountered in sentiment analysis, commonly make use of the SGD classifier, and this fact motivated us to use the LR-SGD classifier on our problem with 1.6M tweets [22]. One of the strengths of the LR-SGD classifier is hyperparameter tuning, which can be used to shape the error function, also called the cost function.
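A sketch of this setup in scikit-learn, where SGDClassifier with the logistic loss is precisely logistic regression trained by stochastic gradient descent (in scikit-learn releases before 1.1 the loss is spelled 'log' rather than 'log_loss'; the penalty and alpha values here are illustrative):

    from sklearn.linear_model import SGDClassifier

    # Logistic regression fitted incrementally by stochastic gradient descent.
    sgd_lr = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-5)
    sgd_lr.fit(X_train, y_train)              # X_train, y_train as in Section 3.3

    # The same estimator with the hinge loss optimizes a linear SVM instead.
    sgd_svm = SGDClassifier(loss='hinge')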

3.4.4. Support Vector Machine

The Support Vector Machine can be regarded as a linear model for regression and classification tasks [23]. The Support Vector Machine finds the optimal separating hyperplane to divide the tweets into two parts [24], and it can be applied to noisy data. The hyperplane separates the tweets very efficiently, as shown in Fig. 7. Support vectors are the


data points from both classes that lie closest to the hyperplane; the distance between them is often called the margin [25]. The Support Vector Machine is easy to implement and scales well to high-dimensional data. It is implemented with kernels that transform non-separable problems into separable problems by adding more dimensions. The most commonly used kernel is the Radial Basis Function (RBF) kernel. Mathematically, it can be defined by Eqn. 16:

    P(y, y_i) = exp(-gamma * sum((y - y_i)^2))   (16)

Figure 7: SVM classifier graph showing hyperplane
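A hedged sketch of the SVM step: for 1.6M sparse tweet vectors a linear SVM is the tractable choice, while SVC with kernel='rbf' corresponds to the RBF kernel of Eqn. 16; which variant the authors ran is not stated beyond the description above, so both are shown:

    from sklearn.svm import LinearSVC, SVC

    # Linear SVM: finds the maximum-margin hyperplane, scales to large sparse data.
    svm_linear = LinearSVC(C=1.0)
    svm_linear.fit(X_train, y_train)

    # RBF-kernel SVM as in Eqn. 16, exp(-gamma * ||y - y_i||^2); practical on subsets.
    svm_rbf = SVC(kernel='rbf', gamma='scale')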
3.4.5. Naïve Bayes Classifier

Naïve Bayes [26] is among the most common supervised machine learning techniques for classification. It is also known as a probabilistic classification technique, as it is based on probability [27]. It depends entirely on the famous probability theorem, Bayes' theorem, which is related to conditional probability: it finds the probability of an event occurring given the probability of another event that has already occurred [27]. Mathematically, it can be stated by Eqn. 17:

    P(M/N) = P(M) P(N/M) / P(N)   (17)

where P(M/N) is the posterior, i.e., the probability of M given N; P(N/M) is the likelihood, i.e., the probability of N when M is true; P(M) is the prior, i.e., the probability of M; and P(N) is the marginal, i.e., the probability of N [28]. After implementing the model in the classifier, the equation is given by Eqn. 18 [16], [29]:

    M = argmax_M P(M) ∏_(i=1..n) P(N_i/M)   (18)
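Eqn. 18 is what scikit-learn's multinomial Naïve Bayes computes from word counts; the multinomial variant is our assumption, since the paper does not name one:

    from sklearn.naive_bayes import MultinomialNB

    nb = MultinomialNB(alpha=1.0)      # Laplace smoothing for words unseen in a class
    nb.fit(X_train, y_train)           # estimates log P(M) and log P(N_i/M) per class
    print(nb.predict(X_test[:5]))      # argmax_M P(M) * prod_i P(N_i/M), as in Eqn. 18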
3.4.6. Decision Tree Classifier

A feasible approach to multistage decision making is the Decision Tree classifier [30]. In the multistage approach, complex decisions are broken up into several simple decisions to obtain the desired solution; a complete multistage recognition scheme has been reviewed by [31]. It is used where data splits regularly. A Decision Tree can be applied both to regression models, to predict a continuous value, and to classification models, to predict probability. As our model is a binary classifier with positive and negative labels, the Decision Tree classifier has been implemented [32]. It is robust, easy and simple to implement and not sensitive to irrelevant features [33]. Fig. 8 (a) shows how the dataset was split into different categories using the Decision Tree classifier, and (b) demonstrates a general Decision Tree.

Figure 8: (a) Partitioning of a two-dimensional feature space (b) Overview of a Decision Tree
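Under the same assumed feature matrices, the corresponding scikit-learn estimator is a one-liner; the split criterion and depth limit here are illustrative regularization choices:

    from sklearn.tree import DecisionTreeClassifier

    # Each internal node tests one feature, partitioning the space as in Fig. 8 (a).
    tree = DecisionTreeClassifier(criterion='gini', max_depth=20)
    tree.fit(X_train, y_train)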


3.4.7. Random Forest Classifier

The Random Forest classifier is a supervised ML technique and a very popular classifier. Just like the Decision Tree, it can be implemented for both classification and regression models. It is an ensemble learning method of classification that builds a set of multiple decision trees from the training data and outputs the mode of their classes [34]. It is used in applications like search engines, image classification, etc. It constructs a decision tree from each sample and gives the output; the best solution is selected by voting. It is easy to implement, fast and scalable, but it can easily overfit the data [34]. Fig. 9 shows the complete sketch of the Random Forest classifier.

Figure 9: Overview of Random Forest classifier
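A sketch of the voting ensemble just described (the number of trees is an assumed hyperparameter):

    from sklearn.ensemble import RandomForestClassifier

    # 100 decision trees, each grown on a bootstrap sample; the majority vote wins.
    forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    forest.fit(X_train, y_train)
    print(forest.predict(X_test[:5]))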
4. Implementation and result

The dataset was collected from Kaggle. Implementation was done in Python, and NLTK was used for cleaning and training the model. The various classifiers used are Logistic Regression, Naïve Bayes, Support Vector Machine, Random Forest, the LR-SGD classifier, Bidirectional Long Short-Term Memory and Decision Tree. The dataset consists of 1.6 million tweets, out of which 1,280,000 were used for training and 320,000 for testing [15].

Evaluating the models is very important for observing the performance and correctness of the different models on the test data and finding the best among them. The performance of a classifier can be described by the confusion matrix on a set of data for which the true values are known. With the help of the confusion matrix, different evaluation metrics such as accuracy, recall, precision, F1-score and AUC score have been evaluated to validate and verify the quality of the results [35], [36]. The confusion matrices for the various classifiers are shown in Fig. 10, Fig. 11, Fig. 12 and Fig. 13. Table 2 compares the different classification models based on these evaluation metrics. Fig. 14 graphically depicts the performance of the different classifiers concerning accuracy, recall, precision, F1-score and AUC score.

Figure 10: Confusion matrix of (a) Logistic Regression (b) Support Vector Machine

Figure 11: Confusion matrix of (a) Naïve Bayes (b) LR-SGD classifier

Figure 12: Confusion matrix of (a) Random Forest (b) Decision Tree classifier

Figure 13: Confusion matrix of Bidirectional Long Short-Term Memory

Table 2
Performance measure of various classifiers

                          Accuracy    Recall    Precision    F1-score    AUC Score
    Bidirectional LSTM    0.7890      0.7889    0.7891       0.7889      0.78904
    Logistic Regression   0.7249      0.7249    0.7272       0.7242      0.72489
    Naïve Bayes           0.7124      0.7124    0.7133       0.7121      0.71239
    LR-SGDC               0.7209      0.7208    0.7274       0.7189      0.72082
    SVM                   0.7245      0.7244    0.7274       0.7236      0.72445
    Decision Tree         0.6849      0.6849    0.6850       0.6849      0.68490
    Random Forest         0.7129      0.7129    0.7131       0.7128      0.71286

Accuracy: the percentage of tweets that have been classified correctly by the model, calculated using Eqn. 19:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)   (19)

Precision: the fraction of tweets predicted positive that are actually positive, calculated using Eqn. 20:

    Precision = TP / (TP + FP)   (20)

Recall: the fraction of actually positive tweets that are predicted positive, calculated using Eqn. 21:

    Recall = TP / (TP + FN)   (21)

F1-score: the harmonic mean of recall and precision, calculated using Eqn. 22:

    F1 score = 2 * (P * R) / (P + R)   (22)

where TP is True Positive, TN is True Negative, FP is False Positive, FN is False Negative, P is Precision and R is Recall.

AUC score: the AUC score can be calculated by finding the area under the ROC curve [11], using Eqn. 23:

    AUC = (SP - PE(NO + 1)/2) / (PE * NO)   (23)

where SP is the sum of positive observations, PE refers to the Positive Examples and NO to the Negative Observations.

Figure 14: Performance graph of different classifiers

The Receiver Operating Characteristic (ROC) curve is a tool which predicts the probabilistic value of a binary outcome [37]. The ROC curve graphically represents the relationship between the true positive rate (sensitivity) and the false positive rate (1 - specificity). It is a significant metric, as it covers the whole spectrum between zero and one. A curve along which the true positive rate exactly equals the false positive rate yields an AUC of 0.5 and represents a random, no-skill classifier [38]. The AUC score can be calculated by finding the area under the ROC curve. The ROC curves for the different classifiers are plotted in Fig. 15.

Figure 15: ROC Curve of various Classifiers
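All of these metrics can be computed with scikit-learn for any of the fitted classifiers above; clf, X_test and y_test are the assumed names from the earlier sketches, and y_score must come from predict_proba or decision_function for the ROC curve:

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, roc_curve, confusion_matrix)

    y_pred = clf.predict(X_test)
    y_score = clf.predict_proba(X_test)[:, 1]   # P(positive), for the ROC curve

    print(confusion_matrix(y_test, y_pred))     # [[TN, FP], [FN, TP]]
    print(accuracy_score(y_test, y_pred))       # Eqn. 19
    print(precision_score(y_test, y_pred))      # Eqn. 20
    print(recall_score(y_test, y_pred))         # Eqn. 21
    print(f1_score(y_test, y_pred))             # Eqn. 22
    print(roc_auc_score(y_test, y_score))       # area under the ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_score)    # the points plotted in Fig. 15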
With the help of the confusion matrices of the various classifiers, showing the values of true negatives, true positives, false negatives and false positives, we have calculated the precision, accuracy, F1-score, recall and ROC-AUC score shown in Table 2. In this paper, we have compared various classifiers like Random Forest, Logistic Regression, Support Vector Machine, Decision Tree, LR-SGDC and Naïve Bayes with the state-of-the-art approach Bi-LSTM. On observing the results in Table 2, it was found that the Bidirectional LSTM was the best classifier with an accuracy of 78.90%, and Logistic Regression came out as runner-up with an accuracy of 72.49%, followed by the Support Vector Machine and the LR-SGDC classifier with accuracies of 72.45% and 72.09% respectively. Random Forest and Naïve Bayes also predicted well, with accuracies of 71.29% and 71.24%. It was also observed that the Decision Tree classifier did not come up to expectation, with an accuracy of just 68.49%. On examining carefully, it can be observed that the prediction of the true positive class with respect to the predicted positive class, i.e., the precision score, of Bi-LSTM was also the highest among all, at 78.91%. The LR-SGDC and SVM classifiers were the runners-up with a precision score of 72.74% each, followed by Logistic Regression with a 72.72% precision score. The Naïve Bayes and Random Forest classifiers also predicted the positive class well, with precision scores of 71.33% and 71.31% respectively. The precision score of the Decision Tree classifier was the lowest, at 68.50%. The prediction of the true positive class with respect to the actual positive class, i.e., the recall score, of Bi-LSTM was the best, at 78.89%, with Logistic Regression as the runner-up at 72.49%, followed by SVM, LR-SGDC, Random Forest and Naïve Bayes with scores of 72.44%, 72.08%, 71.29% and 71.24% respectively. Even here, the Decision Tree was not as good, with a recall score of 68.49%. The F1-score and AUC score of Bi-LSTM were also the best among all the classifiers. All these results can be visualized graphically in Fig. 14. Fig. 15 depicts the ROC curves of all the classifiers implemented in our experiments, which also show that Bi-LSTM is the best classifier. The model can also be very useful for analyzing tweets related to medical data [39], [40], [41], [42], [43], [44].

5. Conclusion

There are various methods of machine learning, symbolic AI and deep learning for the analysis of tweets or reviews, but machine learning techniques are the most common, efficient and simple. In this paper, machine learning techniques were used for the analysis of tweets on a Twitter dataset. The tweets were cleaned in the preprocessing step by removing stopwords, URLs, numbers and various Twitter-specific features with the help of NLTK. To deal with misspellings and non-informative words, feature extraction was done and a Bag of Words model was created with the most frequent words. The tweets were then classified into positive and negative by various classifiers: the LR-SGD classifier, Naïve Bayes, Random Forest, Logistic Regression, SVM, Bidirectional LSTM and Decision Tree. By observing the ROC curve and accuracy score, it was clear that Bidirectional LSTM is the best classifier, with an accuracy of 78.90%. Hence, it was found that Bidirectional LSTM is very useful for sentiment analysis.

The model can be implemented in a website or Android application for classifying the sentiments of people on different subjects. As microblogging sites are blooming, sentiment analysis is very important for many organizations pursuing social intelligence and social media analytics.

The future of this research is to explore data from a wider genre of social networking sites and e-commerce sites where people shop online for many things like books, games, etc. Ratings of these products can be derived by sentiment analysis. It can also be used to build a human confidence model.

6. Conflict of interest

There is no conflict of interest.

7. Acknowledgement

I would like to express my heartiest gratitude to all the co-authors, and special thanks to Prof. Mahendra Kumar Gourisaria and Mr. Harshvardhan GM, who have been a constant source of knowledge, inspiration and support. I would equally like to thank my parents and friends, who inspired me to remain focused and helped me to complete this research paper.


8. References                                                Technol., vol. 6, no. 11, pp. 2344–2350,
                                                             2019.
                                                      [11]   H. Wang, D. Can, A. Kazemzadeh, F.
[1]    Z. Jianqiang, G. Xiaolin, and Z. Xuejun,
                                                             Bar, and S. Narayanan, “A System for
       “Deep Convolution Neural Networks
                                                             Real-time Twitter Sentiment Analysis
       for Twitter Sentiment Analysis,” IEEE
                                                             of 2012 U.S. Presidential Election
       Access, vol. 6, pp. 23253–23260, 2018,
                                                             Cycle,” Proc. 50th Annu. Meet. Assoc.
       doi: 10.1109/ACCESS.2017.2776930.
                                                             Comput. Linguist., no. July, pp. 115–
[2]    E. Kouloumpis, T. Wilson, and J.
                                                             120,              2012,              doi:
       Moore, “Twitter sentiment analysis:
                                                             10.1145/1935826.1935854.
       The good the bad and the omg!. In Fifth
                                                      [12]   M. S. Neethu and R. Rajasree,
       International AAAI conference on
                                                             “Sentiment analysis in twitter using
       weblogs and social media,” in
                                                             machine learning techniques,” 2013 4th
       Proceedings of the Fifth International
                                                             Int. Conf. Comput. Commun. Netw.
       AAAI Cinference on Weblogs and Social
                                                             Technol. ICCCNT 2013, 2013, doi:
       Media, 2011, pp. 538–541.
                                                             10.1109/ICCCNT.2013.6726818.
[3]    A. Hassan, A. Abbasi, and D. Zeng,
                                                      [13]   V. M. K. Peddinti and P. Chintalapoodi,
       “Twitter Sentiment Analysis: A
                                                             “Domain adaptation in sentiment
       Bootstrap Ensemble Framework,” in
                                                             analysis of twitter,” AAAI Work. - Tech.
       2013 International Conference on
                                                             Rep., vol. WS-11-05, pp. 44–49, 2011.
       Social Computing, Sep. 2013, pp. 357–
                                                      [14]   N. Lokeswari and K. Amaravathi,
       364, doi: 10.1109/SocialCom.2013.56.
                                                             “Comparative Study of Classification
[4]    M. Thelwall, K. Buckley, and G.
                                                             Algorithms in Sentiment Analysis,” Int.
       Paltoglou,        “Sentiment       strength
detection for the social web,” J. Am. Soc. Inf. Sci. Technol., vol. 63, no. 1, pp. 163–173, 2012, doi: 10.1002/asi.21662.
[5] A. Mittal and A. Goel, “Stock prediction using twitter sentiment analysis,” 2012.
[6] Z. Jianqiang and G. Xiaolin, “Comparison research on text pre-processing methods on twitter sentiment analysis,” IEEE Access, vol. 5, pp. 2870–2879, 2017, doi: 10.1109/ACCESS.2017.2672677.
[7] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Found. Trends Inf. Retr., vol. 2, no. 1–2, pp. 1–135, 2008, doi: 10.1561/1500000011.
[8] V. Hatzivassiloglou and K. R. McKeown, “Predicting the semantic orientation of adjectives,” in Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, 1997, pp. 174–181, doi: 10.3115/979617.979640.
[9] R. Xia, C. Zong, and S. Li, “Ensemble of feature sets and classification algorithms for sentiment classification,” Inf. Sci., vol. 181, no. 6, pp. 1138–1152, 2011, doi: 10.1016/j.ins.2010.11.023.
[10] A. Baweja and P. Garg, “Sentimental Analysis of Twitter Data for Job Opportunities,” Int. Res. J. Eng.
Res. J. Sci. Eng. Technol., vol. 4, no. 8, pp. 31–39, 2018.
[15] Kaggle.com, “Sentiment140 dataset with 1.6 million tweets,” 2015. [Online]. Available: https://www.kaggle.com/kazanova/sentiment140.
[16] S. Das, R. Sharma, M. K. Gourisaria, S. S. Rautaray, and M. Pandey, “Heart disease detection using core machine learning and deep learning techniques: A comparative study,” Int. J. Emerg. Technol., vol. 11, no. 3, pp. 531–538, 2020.
[17] Wil, “How many words are in the English language?,” English Live, 2018. [Online]. Available: https://wordcounter.io/blog/how-many-words-are-in-the-english-language/.
[18] Z. Jiang, L. Li, D. Huang, and L. Jin, “Training word embeddings for deep learning in biomedical text mining tasks,” in Proc. 2015 IEEE Int. Conf. on Bioinformatics and Biomedicine (BIBM), 2015, pp. 625–628, doi: 10.1109/BIBM.2015.7359756.
[19] Z. Huang, W. Xu, and K. Yu, “Bidirectional LSTM-CRF Models for Sequence Tagging,” 2015. [Online]. Available: http://arxiv.org/abs/1508.01991.
[20] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5–6, pp. 602–610, 2005, doi: 10.1016/j.neunet.2005.06.042.
[21] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[22] M. Thenuwara and H. R. K. Nagahamulla, “Offline handwritten signature verification system using random forest classifier,” in 2017 17th International Conference on Advances in ICT for Emerging Regions (ICTer), 2017, pp. 191–196, doi: 10.1109/ICTER.2017.8257828.
[23] W. Yu, T. Liu, R. Valdez, M. Gwinn, and M. J. Khoury, “Application of support vector machine modeling for prediction of common diseases: The case of diabetes and pre-diabetes,” BMC Med. Inform. Decis. Mak., vol. 10, no. 1, 2010, doi: 10.1186/1472-6947-10-16.
[24] S. Nayak, M. K. Gourisaria, M. Pandey, and S. S. Rautaray, “Heart Disease Prediction Using Frequent Item Set Mining and Classification Technique,” Int. J. Inf. Eng. Electron. Bus., vol. 11, no. 6, pp. 9–15, 2019, doi: 10.5815/ijieeb.2019.06.02.
[25] S. Ghumbre, C. Patil, and A. Ghatol, “Heart Disease Diagnosis using Support Vector Machine,” Int. Conf. Comput. Sci. Inf. Technol., pp. 84–88, 2011.
[26] S. Nayak, M. K. Gourisaria, M. Pandey, and S. S. Rautaray, “Comparative Analysis of Heart Disease Classification Algorithms Using Big Data Analytical Tool,” 2020, pp. 582–588.
[27] V. S and D. S, “Data Mining Classification Algorithms for Kidney Disease Prediction,” Int. J. Cybern. Informatics, vol. 4, no. 4, pp. 13–25, 2015, doi: 10.5121/ijci.2015.4402.
[28] Wikipedia contributors, “Bayes' theorem,” Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bayes'_theorem. Last accessed 2020/8/28.
[29] H. GM, M. K. Gourisaria, M. Pandey, and S. S. Rautaray, “A comprehensive survey and analysis of generative models in machine learning,” Comput. Sci. Rev., vol. 38, p. 100285, Nov. 2020, doi: 10.1016/j.cosrev.2020.100285.
[30] S. R. Safavian and D. Landgrebe, “A Survey of Decision Tree Classifier Methodology,” IEEE Trans. Syst., Man, Cybern., vol. 21, no. 3, pp. 660–674, 1991.
[31] G. R. Dattatreya and L. N. Kanal, “Decision Trees in Pattern Recognition,” in Progress in Pattern Recognition 2, 1985, pp. 189–239.
[32] S. Nayak, M. K. Gourisaria, M. Pandey, and S. S. Rautaray, “Prediction of Heart Disease by Mining Frequent Items and Classification Techniques,” in 2019 International Conference on Intelligent Computing and Control Systems (ICCS), May 2019, pp. 607–611, doi: 10.1109/ICCS45141.2019.9065805.
[33] Wikipedia contributors, “Decision tree learning,” Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Decision_tree_learning. Last accessed 2020/8/30.
[34] A. Gupte, S. Joshi, P. Gadgul, and A. Kadam, “Comparative Study of Classification Algorithms used in Sentiment Analysis,” Int. J. Comput. Sci. Inf. Technol., vol. 5, no. 5, pp. 6261–6264, 2014.
[35] A. Giachanou and F. Crestani, “Like It or Not: A Survey of Twitter Sentiment Analysis Methods,” ACM Comput. Surv., vol. 49, no. 2, pp. 1–41, Nov. 2016, doi: 10.1145/2938640.
[36] G. Gautam and D. Yadav, “Sentiment analysis of twitter data using machine learning approaches and semantic analysis,” in 2014 7th International Conference on Contemporary Computing (IC3), 2014, pp. 437–442, doi: 10.1109/IC3.2014.6897213.
[37] J. Brownlee, “How to Use ROC Curves and Precision-Recall Curves for Classification in Python,” Machine Learning Mastery, 2018.
[38] A. H. Hossny, L. Mitchell, N. Lothian, and G. Osborne, “Feature selection methods for event detection in Twitter: a text mining approach,” Soc. Netw. Anal. Min., vol. 10, no. 1, 2020, doi: 10.1007/s13278-020-00658-3.
[39] S. Dey, M. K. Gourisaria, S. S. Rautaray, and M. Pandey, “Segmentation of Nuclei in Microscopy Images Across Varied Experimental Systems,” 2021, pp. 87–95.
[40] R. Sharma, S. Das, M. K. Gourisaria, S. S. Rautaray, and M. Pandey, “A Model for Prediction of Paddy Crop Disease Using CNN,” 2020, pp. 533–543.
[41] M. K. Gourisaria, S. Das, R. Sharma, S. S. Rautaray, and M. Pandey, “A deep learning model for malaria disease detection and analysis using deep convolutional neural networks,” Int. J. Emerg. Technol., vol. 11, no. 2, pp. 699–704, 2020.
[42] S. S. Rautaray, S. Dey, M. Pandey, and M. K. Gourisaria, “Nuclei segmentation in cell images using fully convolutional neural networks,” Int. J. Emerg. Technol., vol. 11, no. 3, pp. 731–737, 2020.
[43] S. Sharma, M. K. Gourisaria, S. S. Rautaray, M. Pandey, and S. S. Patra, “ECG Classification using Deep Convolutional Neural Networks and Data Analysis,” Int. J. Adv. Trends Comput. Sci. Eng., vol. 9, pp. 5788–5795, 2020.
[44] G. Jee, H. GM, and M. K. Gourisaria, “Juxtaposing inference capabilities of deep neural models over posteroanterior chest radiographs facilitating COVID-19 detection,” J. Interdiscip. Math., pp. 1–27, 2021, doi: 10.1080/09720502.2020.1838061.