<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sentiment Analysis of Online Reviews Using Bag-of-Words and LSTM Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>James Barry</string-name>
          <email>james.barry26@mail.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing, Dublin City University</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper implements a binary sentiment classification task on datasets of online reviews. The datasets include the Amazon Fine Food Reviews dataset and the Yelp Challenge dataset. The paper performs sentiment classification via two approaches: firstly, a non-neural bag-of-words approach using Multinomial Naive Bayes and Support Vector Machine classifiers; secondly, a Long Short-Term Memory (LSTM) Recurrent Neural Network. The experiment is designed to test the role of word order in sentiment classification by comparing bag-of-words approaches, where word order is absent, with an LSTM approach, which can handle sequential data as inputs. For the LSTM approaches, we test the role of various features such as pre-trained Word2vec and GloVe embeddings as well as Word2vec embeddings learned on domain-specific corpora. We also test the effect of initialising our own weights from scratch. The tests are carried out on balanced datasets as well as on datasets which follow their original distribution. This measure enables us to evaluate the effect of ratings distribution on model performance. Our results show that the LSTM approaches using GloVe embeddings and self-learned Word2vec embeddings perform best, whilst the distribution of ratings in the data has a meaningful impact on model performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Sentiment Analysis is a fundamental task in Natural Language Processing (NLP).
Its uses are many: from analysing political sentiment on social media [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
gathering insight from user-generated product reviews [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or even for financial purposes,
such as developing trading strategies based on market sentiment [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The goal
of most sentiment classification tasks is to identify the overall sentiment
polarity of the documents in question, i.e. is the sentiment of the document positive
or negative? For our case, we use online user-generated reviews from the
Amazon Fine Food Reviews [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Yelp Challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] datasets. In order to perform
this sentiment classification task, we use a mixture of baseline machine learning
models and deep learning models to learn and predict the binary sentiment of
reviews. This poses a supervised learning task.
      </p>
      <p>
        Pioneering approaches for sentiment classification include Pang et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] who
use bag-of-words features with machine learning algorithms built on top to create
a sentiment classifier. The popularity of such bag-of-words approaches is mainly
due to their simplicity and efficiency, whilst having the ability to achieve very
high accuracy. Bag-of-words features are created by viewing the document as
an unordered collection of words, which are then used to classify the document.
Despite their overall high success rates, there are some downsides to using
bag-of-words or n-gram approaches. The main pitfall of such approaches is that
they ignore long-range word ordering, such that modifiers and their objects may
be separated by many unrelated words [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. As word order is lost, sentences with
different meanings which use the same words will have similar representations.
      </p>
      <p>
        Another key downside to using bag-of-words approaches is that they are
unable to deal effectively with negation. For example, if the model sees words like
"great" or "inspiring" in a review, it will likely prompt a positive classification.
However, if the actual sentence was "The cast was not great, nor was the movie
inspiring.", it has a completely different meaning which the model will fail to
pick up. Additionally, bag-of-words features have very little understanding of
the semantics of words, which can be measured as the distances between
words in an embedding space [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This is because words are treated as atomic
units, resulting in sparse "one-hot" vectors, and therefore there is no notion of
similarity between words [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
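      <p>To make the absence of similarity concrete, a small illustrative sketch (Python with NumPy; the toy vocabulary is our own, not from the paper): any two distinct one-hot vectors are orthogonal, so their cosine similarity is always zero regardless of meaning.

```python
import numpy as np

# Toy vocabulary: every word becomes a sparse "one-hot" vector.
vocab = ["great", "inspiring", "terrible"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Distinct one-hot vectors are orthogonal: similarity is 0, so "great"
# is no closer to "inspiring" than it is to "terrible".
print(cosine(one_hot["great"], one_hot["inspiring"]))  # 0.0
print(cosine(one_hot["great"], one_hot["terrible"]))   # 0.0
```

Dense embeddings, by contrast, place related words near one another, so the same cosine measure becomes informative.</p>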
      <p>
        The inclusion of word embeddings in NLP tasks enables us to overcome such
problems. Recently, Mikolov et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Pennington et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] developed the
very popular word embedding models Word2vec and GloVe respectively, which
gain an understanding of words in a corpus by analysing the co-occurrences of
words over a large training sample. Such representations can encode fundamental
relationships between words, such that simple algebraic operations can yield
meaningful semantic relationships between words.
      </p>
      <p>
        Furthermore, the addition of word embeddings to the field of NLP has
enabled practitioners to use more advanced learning algorithms which can handle
sequential data as inputs such as Recurrent Neural Networks (RNNs). An
important development in the field of RNNs was the introduction of the Long
Short-Term Memory (henceforth LSTM) RNN by Hochreiter and Schmidhuber
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Their success has been shown in NLP tasks such as handwriting recognition
by Graves and Schmidhuber [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Today, LSTMs are used for many tasks such
as speech recognition, machine translation, handwriting recognition and many
other sequential problems.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Literature</title>
      <p>
        Concerning sentiment classification, Pang et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] incorporated a standard
bag-of-features framework to predict the sentiment class of movie reviews. Their
results showed that machine learning techniques using bag-of-words features
outperformed simple decision-making models which used hand-picked feature
words for sentiment classification. To overcome difficulties with bag-of-words
methods such as negation, Turney [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] developed hand-written algorithms which
can reverse the semantic orientation of a word when it is preceded by a negative
word. While such algorithms are an important development for handling features
like negation, it can be very time-consuming to develop heuristically designed
rules, which may not be able to deal with the multiple scenarios prevalent across
human language.
      </p>
      <p>
        Studies which use neural network architectures include Socher et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] who
use a semi-supervised approach using recursive autoencoders for predicting
sentiment distributions. Socher et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] introduce a Sentiment Treebank and a
Recursive Neural Tensor Network, which when trained on the new Treebank,
outperforms all previous methods on several metrics, forming a state-of-the-art
method for determining the positive/negative classification of single sentences.
Li et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] compare recursive and recurrent neural networks on five NLP tasks
including sentiment classification. Dai and Le [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] perform a document
classification task across a variety of datasets as well as a sentiment analysis task.
They found that LSTMs pre-trained by recurrent language models or sequence
autoencoders perform better than LSTMs initialised from scratch. Le and Mikolov [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
introduce Paragraph Vector, which learns document representations from the semantics of
words.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>Our datasets include the Amazon Fine Food Reviews dataset and the Yelp
Challenge dataset, both of which contain a series of reviews and labeled ratings.
As this project is a sentiment classification task, only the data containing
the raw text reviews and their corresponding ratings were parsed. An example of a
positive and a negative review from the Amazon dataset is given below:</p>
      <p>Positive Review: These bars are great! Great tasting and with quality
wholesome ingredients. The company is great and has outstanding
customer service and stand by their product 110% I highly recommend these
bars in any flavor.</p>
      <p>Negative Review: These get worse with every bite. I even tried putting
peanut butter on top to cover the taste. That didn't work. My five-year-old
likes them. That is the only reason I didn't rate it lower.</p>
      <sec id="sec-3-1">
        <title>Amazon Fine Food Reviews</title>
        <p>The Amazon Fine Food Reviews dataset contains 568,454 reviews. The dataset
contains almost 46 million words and comprises 2.8 million sentences, with an
average of 5 sentences per review.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Yelp Academic Reviews</title>
        <p>We use the Yelp Academic Reviews dataset from the Yelp Dataset Challenge,
which contains written reviews of listed businesses. We parse data from two
fields: "stars" and "text", where "stars" is the customer's rating from 1 to 5
and "text" is the customer's written review. There are 4,153,150 reviews in the
dataset. Out of a sample of 100,000 reviews, the number of sentences is 829,165,
while the number of words is 11.57 million. The average number of sentences in
the reviews is 8. (The Amazon dataset is available at
https://snap.stanford.edu/data/web-FineFoods.html and the Yelp dataset at
https://www.yelp.com/dataset/challenge.)</p>
        <p>[Figure: (a) Distribution of Ratings, Amazon; (b) Distribution of Ratings, Yelp.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>Word Embeddings</title>
        <p>For the pre-trained Word2vec word embeddings, we use the GoogleNews
embeddings, which were trained on 3 billion words from a Google News scrape. The
data contains 3 million 300-dimensional word vectors for the English language.
We also use GloVe embeddings as a comparison. We use the GloVe embeddings
which were trained on a crawl of 42 billion tokens, with a vocabulary of 1.9
million words. Similarly, these vectors also have a dimension of 300.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Data Processing</title>
        <p>The ratings in the Amazon and Yelp datasets were turned into binary positive
and negative reviews, where negative labels were assigned to ratings of 2 stars
and below and positive labels were assigned to ratings of 4 stars and above. Neutral
3-star reviews were excluded so that our data would be highly polarised. From
Figures (a) and (b) we can see that there is a large number of 5-star reviews
in both datasets. In order to test the effect of using a balanced dataset, with
an even number of positive and negative reviews, versus a dataset which follows
the original distribution, we carry out two different sampling techniques. First,
we separate the datasets into an evenly split distribution of 82,000 positive and
82,000 negative reviews (as there are only around 82,000 negative reviews in the
Amazon dataset). For the second test, to analyse the effect of the original
distribution, we randomly sample 164,000 reviews from the datasets, which should
have a proportionally higher number of positive reviews, reflecting the original
distribution. We use 164,000 reviews in both datasets and distributions to ensure
differing results are not attributed to varying dataset sizes. For our experiments,
we partition each of our datasets by an 80:20 training/test split. As the split
is made after the various dataset sampling measures, the distribution of ratings
in the training/test sets should be representative of the specified sampling
approach. By doing so, we avoid the situation whereby models trained on balanced
data are used to predict on the original distribution and vice versa.
(The GoogleNews embeddings are available at
https://github.com/mmihaltz/word2vec-GoogleNews-vectors; the GloVe 42B
embeddings at https://nlp.stanford.edu/projects/glove/.)</p>
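        <p>The processing steps above can be sketched in a few lines of Python. This is a minimal illustration under our own naming, not the paper's code; it assumes reviews are available as (text, stars) pairs.

```python
import random

def binarize(reviews):
    """Map star ratings to binary labels and drop neutral 3-star reviews.
    `reviews` is a list of (text, stars) pairs, as parsed from the datasets."""
    labelled = []
    for text, stars in reviews:
        if stars <= 2:
            labelled.append((text, 0))   # negative: 2 stars and below
        elif stars >= 4:
            labelled.append((text, 1))   # positive: 4 stars and above
    return labelled

def balanced_sample(labelled, n_per_class, seed=0):
    """Draw an even number of positive and negative reviews."""
    rng = random.Random(seed)
    pos = [r for r in labelled if r[1] == 1]
    neg = [r for r in labelled if r[1] == 0]
    return rng.sample(pos, n_per_class) + rng.sample(neg, n_per_class)

def train_test_split(data, test_share=0.2, seed=0):
    """80:20 split made *after* sampling, so both sides follow
    the sampled distribution."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_share))
    return data[:cut], data[cut:]
```

For the original-distribution variant, `balanced_sample` would simply be replaced by a uniform random draw of 164,000 reviews.</p>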
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Approach</title>
      <sec id="sec-4-1">
        <title>Baseline Approach I: Support Vector Machine</title>
        <p>
          Support Vector Machines are a type of machine learning model introduced by
Cortes and Vapnik [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. We use an SVM in our experiment for text classification
as SVMs have been shown to consistently achieve good performance on text
categorisation tasks compared to other models [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. A reason for this is that they possess
the ability to generalise well in high-dimensional feature spaces and eliminate
the need for feature selection, making them a suitable choice of model for text
categorisation tasks. Many classifier learning algorithms, such as SVMs using
a linear kernel, assume that the training data is independently and identically
distributed as part of their derivation [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. This assumption is often violated in
applications such as text classification, as the order of words in a sentence can
have a significant impact on the overall sentiment of the sentence or phrase.
Nevertheless, such classifiers can achieve high accuracy, representing good baseline
metrics for our study.
        </p>
        <p>w := Σⱼ αⱼ cⱼ dⱼ,   αⱼ ≥ 0</p>
        <p>
          Looking at the above solution, the idea behind the training method of the
SVM for this task is to find a maximum-margin separating hyperplane w that
separates the different document classes. The corresponding search is a constrained
optimisation problem, letting cⱼ ∈ {1, −1} (where 1 refers to positive and −1 to
negative) be the correct class of document dⱼ. The αⱼs are obtained by solving
a dual optimisation problem. The documents dⱼ for which αⱼ is greater than zero
are called support vectors, since only those document vectors contribute to
the hyperplane w. We are able to classify test instances by determining which
side of the hyperplane w they lie on [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
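        <p>As a toy illustration of this decision rule (with made-up numbers, not the paper's data): the hyperplane normal w is the weighted sum of the support vectors, and a test document is classified by the side of the hyperplane on which it lies.

```python
import numpy as np

# Hypothetical two-dimensional document vectors d_j with their
# dual coefficients alpha_j (> 0, so both are support vectors)
# and class labels c_j in {1, -1}.
support_vecs = np.array([[1.0, 2.0], [2.0, 0.5]])
alphas = np.array([0.6, 0.4])
classes = np.array([1, -1])

# w = sum_j alpha_j * c_j * d_j
w = (alphas * classes) @ support_vecs

def predict(doc_vec, w, bias=0.0):
    """Classify by which side of the hyperplane the document lies on."""
    return 1 if np.dot(w, doc_vec) + bias > 0 else -1

print(w)                                  # [-0.2  1. ]
print(predict(np.array([0.0, 3.0]), w))   # 1
print(predict(np.array([3.0, 0.0]), w))   # -1
```
</p>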
        <p>
          SVM Implementation: For the bag-of-words approach, the reviews were cleaned
via a text-processing algorithm to remove any unwanted characters, HTML links
or numbers and retrieve only the raw text. The next stage involves converting the
words in the reviews from text to integers so that they have a numeric
representation which can be used in machine learning models. The bag-of-words model
builds a vocabulary from all of the words in the documents. It then models each
document by counting the number of times a word in the vocabulary appears
in the document. Considering the datasets contain a large number of reviews,
resulting in a large vocabulary, we limit the size of the feature vectors by
choosing a maximum vocabulary size of the 5,000 top-occurring words. Bag-of-words
features were created for both training and test sets. We used an 80:20 train/test
split for our experiment. GridSearch cross-validation with 3 folds was used to
find the optimal cost parameter for our Linear Support Vector Classifier (SVC)
on the training sets. A form of feature scaling used in text classification tasks
is converting the words to tf-idf (term frequency-inverse document frequency)
features, whose values correspond to how distinctive
a word is in a corpus [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. We evaluate both standard bag-of-words and tf-idf
features.
        </p>
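        <p>A minimal sketch of this baseline using scikit-learn (the library the classifiers were implemented with). The grid of C values is an assumption, as the paper does not list the searched range, and the function name is our own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_svm(use_tfidf=True, max_vocab=5000, folds=3):
    """Bag-of-words features capped at the `max_vocab` top-occurring
    words; use_idf=False leaves plain term frequencies instead of tf-idf.
    Grid search over the SVC cost parameter C with `folds` CV folds."""
    vec = TfidfVectorizer(max_features=max_vocab, use_idf=use_tfidf)
    pipe = Pipeline([("vec", vec), ("svc", LinearSVC())])
    # Hypothetical C grid -- the searched values are not given in the paper.
    return GridSearchCV(pipe, {"svc__C": [0.01, 0.1, 1.0, 10.0]}, cv=folds)
```

Fitting the returned object on the cleaned training reviews then performs the grid search and refits the best model on the full training set.</p>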
      </sec>
      <sec id="sec-4-2">
        <title>Baseline Approach II: Multinomial Naive Bayes</title>
        <p>As a second baseline classifier, we use the Multinomial Naive Bayes (MNB)
model. It is worth noting that Naive Bayes operates under the conditional
independence assumption: given the class, each of our words is conditionally
independent of the others. This is not true in reality, as the order of words in
a sentence plays an important role in its overall sentiment. That
said, Naive Bayes models using bag-of-words features can still achieve impressive
results, making them a valid baseline classifier.</p>
        <p>For our context, we can state Bayes' theorem as follows:</p>
        <p>P(C(j) | w1(j), ..., wn(j)) = P(C(j)) P(w1(j), ..., wn(j) | C(j)) / P(w1(j), ..., wn(j))</p>
        <p>To carry out this task we want to know P(C(j) | w1(j), ..., wn(j)), that is, the
probability of the class C(j) of document j given its words w1(j), ..., wn(j).</p>
        <p>Multinomial Naive Bayes Implementation: The MNB model was used as a
baseline model in order to compare results with the linear SVM. GridSearch
cross-validation was used to find the optimal hyperparameter value for each MNB
model. As with the SVM approach, both tf-idf and regular features were evaluated.
(Both the Multinomial Naive Bayes and Support Vector Machine classifiers were
implemented in Python using the Scikit-learn library.)</p>
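        <p>The MNB baseline can be sketched in the same way as the SVM one, again with scikit-learn. The smoothing grid is an assumption, as the paper does not list the searched values, and the function name is our own.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def build_mnb(use_tfidf=True, max_vocab=5000, folds=3):
    """Multinomial Naive Bayes over bag-of-words counts,
    optionally rescaled to tf-idf features."""
    steps = [("vec", CountVectorizer(max_features=max_vocab))]
    if use_tfidf:
        steps.append(("tfidf", TfidfTransformer()))
    steps.append(("nb", MultinomialNB()))
    # Hypothetical smoothing grid -- the searched values are not in the paper.
    return GridSearchCV(Pipeline(steps), {"nb__alpha": [0.1, 0.5, 1.0]}, cv=folds)
```
</p>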
      </sec>
      <sec id="sec-4-3">
        <title>Long Short-Term Memory RNN</title>
        <p>
          For our neural network approach, we use LSTM RNNs because they generally
achieve superior performance to traditional RNNs when learning relationships in
sequential data. A problem arises when using traditional RNNs for NLP tasks
because the gradients of the objective function can vanish or explode after a
few iterations of multiplying the weights of the network [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. For such reasons,
simple RNNs have rarely been used for NLP tasks such as text classification [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
In such a scenario we can turn to another model in the RNN family:
the LSTM. LSTMs are better suited to this task due to the presence of
input gates, forget gates, and output gates, which control the flow of information
through the network. The LSTM architecture is outlined below:
iₜ = σ(W(i)xₜ + U(i)hₜ₋₁) : Input gate
fₜ = σ(W(f)xₜ + U(f)hₜ₋₁) : Forget gate
oₜ = σ(W(o)xₜ + U(o)hₜ₋₁) : Output gate
c̃ₜ = tanh(W(c)xₜ + U(c)hₜ₋₁) : New memory cell
cₜ = fₜ ∘ cₜ₋₁ + iₜ ∘ c̃ₜ : Final memory cell
        </p>
        <p>hₜ = oₜ ∘ tanh(cₜ) : Hidden state
1. New Memory Generation: The input word xₜ and the past hidden state
hₜ₋₁ are used to generate a new memory c̃ₜ which includes aspects of the
new word xₜ.
2. Input Gate: The input gate's function is to ensure that new memory is
generated only if the new word is important. The input gate achieves this by
using the input word and the past hidden state to determine whether or not
the input is worth preserving, and thus controls the creation of new memory.</p>
        <p>
          It produces it as an indicator of this information.
3. Forget Gate: The forget gate is similar to the input gate but instead of
determining the usefulness of the input word, it assesses whether the past
memory cell is useful for the computation of the current memory cell. Here,
the forget gate looks at the input word and the past hidden state and
produces ft.
4. Final Memory Generation: For this stage, the model takes the advice
of the forget gate ft and accordingly forgets the past memory ct 1. It also
takes the advice of the input gate it and gates the new memory c~t. The
model sums these two results to produce the nal memory ct.
5. Output/Exposure Gate: This gate's purpose is to separate the nal
memory from the hidden state. Hidden states are present in every gate of an
LSTM and consequently, this gate assesses what part of the memory ct needs
to be exposed/present in the hidden state ht. The signal ot is produced by it
to indicate this and is used to gate the point-wise tanh of the memory [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
LSTM Implementation: For our study, the LSTM implementation is carried out
with four variations. Firstly, we use pre-trained Word2vec embeddings. Secondly,
we use pre-trained GloVe embeddings. Thirdly, we use Word2vec embeddings
which were learned on domain-specific corpora; for this experiment, we run the
Word2vec model to generate word embeddings on each of our datasets, and these
embeddings are used as inputs to the LSTM to learn and predict on that
particular dataset. Lastly, we test how well we would have performed by not using
pre-trained word embeddings, instead keeping the original word indices in the
embedding layer and allowing the model to learn the weights itself. In contrast
to this approach, the Word2vec and GloVe methods allocate a dense numeric
vector to every word in the dictionary. By doing so, the distance (e.g. the
cosine distance) between the vectors will capture part of the semantic relationship
between the words in our vocabulary.
        </p>
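        <p>The gate equations above can be traced in a few lines of NumPy. This is an illustrative single time step, not the Keras implementation used in our experiments; biases are omitted, as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM time step. W and U are dicts of input-to-hidden and
    hidden-to-hidden weight matrices keyed by 'i', 'f', 'o', 'c'."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)      # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev)  # new memory cell
    c_t = f_t * c_prev + i_t * c_tilde                 # final memory cell
    h_t = o_t * np.tanh(c_t)                           # hidden state
    return h_t, c_t
```

Running the step over a sequence of word vectors, carrying (hₜ, cₜ) forward, yields the final hidden state that a classifier layer would consume.</p>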
        <p>LSTM Design: As with the baseline approach, we use a text-processing algorithm
to remove any unwanted characters. We then convert all text samples in the
dataset into vectors of word indices, which involves converting each word to its
integer ID. For this study, we permit the 200,000 most commonly occurring words
in our vocabulary and truncate the sequences to a maximum length of 1,000
words. Once the words in the reviews are converted into their corresponding
integers, for the word embedding approaches we can prepare an embedding
matrix which contains at index i the embedding vector (e.g. Word2vec or GloVe
embedding) for the word at index i. The embedding matrix is then loaded into
a Keras embedding layer and fed through the LSTM. (Keras was used as the deep
learning library to build the LSTM network; the GPU version of TensorFlow was
used to speed up training times significantly.)</p>
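        <p>Preparing the embedding matrix described above can be sketched as follows (NumPy only; `pretrained` stands in for a loaded Word2vec or GloVe word-to-vector mapping, and the function name is our own).

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, dim=300, max_words=200_000):
    """Row i holds the pre-trained vector for the word with integer ID i;
    words absent from the pre-trained vocabulary are left as zero rows."""
    n_rows = min(len(word_index) + 1, max_words)  # +1: index 0 is padding
    matrix = np.zeros((n_rows, dim))
    for word, i in word_index.items():
        if i < n_rows and word in pretrained:
            matrix[i] = pretrained[word]
    return matrix
```

The resulting matrix is what gets loaded as the weights of the Keras embedding layer; for the self-initialised variant, the layer is simply left with random weights and trained end to end.</p>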
        <p>Model hyperparameters: Input length: 1000; Embedding size: 300;
LSTM size: 200; Hidden layer size: 128; Dropout: 0.25; Recurrent Dropout: 0.25;
Activation: ReLU; Optimizer: Nadam; Batch size: 64; Output: Sigmoid.</p>
        <p>
          As with the baseline approaches, we partition the training and test data by an
80:20 split. We perform GridSearch cross-validation with 3 folds to find the
optimal model hyperparameters on the training data. We tested several
parameters, including LSTM and hidden layer sizes, batch size and the dropout value.
After conducting GridSearch cross-validation, the following hyperparameters
were chosen: the optimal LSTM layer size was found to be 200, while the
hidden layer size was found to be 128 units. A dropout value of 25% was selected,
which helps our models to prevent overfitting by randomly turning off nodes in
the network. The activation function we use on the inner layer is ReLU (Rectified
Linear Unit), a function that maps negative values to 0 and positive
values linearly, which helps transmit errors during back-propagation. The
optimizer we use is Nadam, a variation of the popular Adam optimizer
which incorporates Nesterov momentum [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. The output layer of our LSTM
model is a Sigmoid function, which is used to condense the output value of our
network into a probability of classifying the review as positive or negative. The
number of epochs during model training was set to 10, as validation accuracy
was still improving at this value whilst signs of overfitting were not setting in.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>The results of our various models and dataset distributions are shown in this
section. The metrics we use are accuracy and AUC (area under the ROC
curve), which gives a measure of the relative share of true positive and false
positive rates depending on a threshold. The inclusion of this metric will help
shed light on the role of ratings distributions and how they affect the model's
classification ability. For example, a model trained on a dataset containing a
higher proportion of positive reviews may perform better on a superficial level
due to it being more likely to predict the majority class prevalent in the data.</p>
      <p>From Table 2, with respect to the balanced dataset, among the bag-of-words
approaches the SVM using TF-IDF features performed best, achieving an
accuracy of 88.95% on the Amazon dataset. Similarly, the SVM TF-IDF model
performed best on the Yelp dataset with an accuracy of 92.91%. In both cases,
the AUC score is very similar to the accuracy score. The MNB classifiers
performed worse across the range of tests but still achieved satisfactory accuracy
and AUC scores, validating the use of MNB models as a baseline.</p>
      <p>For the LSTM models, we can see that they generally perform better than the
baseline bag-of-words methods, with the exception of the LSTM using Word2vec
embeddings learned on the Amazon dataset, which lags behind the other LSTM
variations and even the SVM models and some MNB models on the Yelp dataset.
The LSTM with GloVe embeddings performs best on the Amazon dataset,
with an accuracy of 95.03% and an AUC score of .9889. Meanwhile, the LSTM
using dataset-specific Word2vec embeddings performs best with 95.75% accuracy
and an AUC score of .9924 on the Yelp dataset.</p>
      <p>The promising results from the LSTM models indicate that LSTMs can
handle sequential information well and that the addition of word embeddings helps
improve model performance, where the models using embeddings narrowly
outperform the models using self-initialised weights. An interesting result is the
difference in performance between the LSTM models using domain-specific Word2vec
embeddings on the Amazon and Yelp datasets, where these features resulted in
the worst-performing and best-performing LSTM model respectively. A reason for
this could be attributes of the corpora involved. The Amazon dataset
had an average of five sentences per review, whilst the Yelp dataset had an
average of eight sentences per review. The longer reviews in the Yelp dataset could
give rise to a scenario where more semantic information can be captured by the
Word2vec model, resulting in better embeddings.</p>
      <p>[Table 2: Accuracy and AUC for each model (SVM, SVM TF-IDF, MNB,
MNB TF-IDF, LSTM Word2vec, LSTM Self Initialised, LSTM GloVe, LSTM
Word2vec-domain) under the balanced and original distributions.]</p>
      <p>With respect to the original datasets: as with the balanced datasets, the
SVMs perform better than the MNB models, with SVMs using TF-IDF features
performing better on the Amazon and Yelp datasets. Whilst the baseline models
look good at an initial glance in terms of accuracy, the AUC scores on the
datasets following the original distribution paint a rather different picture. We
are able to observe that there is a much wider gap between the accuracy and
AUC scores in the original distribution than in the balanced distribution. This
vindicates a prior assumption that the greater success of a number of models on
the original distribution in terms of accuracy can somewhat be attributed to the
increased likelihood of classifying the majority class. This is not as relevant to
the LSTM models on the original datasets, as they still manage to achieve very
successful AUC scores.</p>
      <p>With regard to the LSTMs, the LSTMs which use GloVe embeddings perform
best in terms of accuracy on the Amazon dataset, while Word2vec embeddings
learned on domain-specific corpora perform best on the Yelp dataset and on
the Amazon dataset in terms of AUC. The models using GloVe embeddings
narrowly outperform the models using Word2vec embeddings across all tests,
which echoes the results on the balanced datasets, where GloVe scores, in the
majority of cases, outperformed the scores of models using pre-trained Word2vec
embeddings. Similarly, models using self-initialised weights slightly lag behind
models using pre-trained weights in most cases. Despite this, the features which
consist solely of word indices, and have no prior knowledge of word meaning,
still act as good features for the LSTM. The performance of LSTM models
using self-initialised weights is very stable across both datasets and distributions,
indicating that LSTMs can learn meaningful information from the words in
the corpora without having a semantic understanding of the words. The fact that
these models do not use pre-trained weights and still outperform the baseline
bag-of-words methods, which also have a limited understanding of word meaning,
gives credence to the role of LSTMs in tasks which involve modeling sequential
data such as text.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this study we have compared bag-of-words and neural-network-based
approaches for sentiment classification. Firstly, we used unordered bag-of-words
models; secondly, we used an LSTM model which can handle sequential data
as well as leverage pre-trained word embeddings. From our analysis,
for the baseline approaches, the SVM models outperform the Multinomial Naive
Bayes classifiers. The LSTM models outperform the bag-of-words models across
both metrics for the majority of tests. Nevertheless, bag-of-words models can
still perform very well, particularly with respect to their shorter training period.
LSTM models using pre-trained GloVe embeddings and Word2vec embeddings
learned on domain-specific corpora performed best. In most cases, pre-trained
GloVe embeddings served as better features than pre-trained Word2vec
embeddings. The strong performance of the models using domain-specific Word2vec
embeddings could justify using such an approach provided there is an adequate
amount of text to train on.</p>
      <p>We also compared our results across two different dataset distributions: a
balanced distribution and one which follows the original distribution. While
the greatest accuracy was achieved on the original distribution using domain-specific
Word2vec embeddings, there was less disparity among AUC scores on the
balanced datasets, particularly for the baseline models. Sampling measures
which balance the distribution of ratings can help ensure the models are less
likely to overfit on positive reviews, given their higher share of the data. The
fact that the LSTM models achieved greater AUC scores than the baseline models
highlights their suitability for NLP tasks: they are able to learn more subtle
relationships which the baseline models fail to pick up on, as is evident in their
comparative AUC scores.</p>
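<p>The two evaluation ingredients discussed above, balancing the ratings distribution and scoring by AUC, can be sketched with stdlib Python. Both function names are hypothetical; the downsampler equalises class counts, and the AUC follows the Mann-Whitney formulation (probability that a random positive outranks a random negative):</p>

```python
import random
from collections import defaultdict

def downsample_balance(examples, labels, seed=0):
    """Downsample each class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    n = min(len(xs) for xs in by_label.values())
    out = []
    for y, xs in by_label.items():
        for x in rng.sample(xs, n):
            out.append((x, y))
    rng.shuffle(out)
    return out

def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney statistic: the fraction of
    positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

<p>Because AUC is rank-based, it is insensitive to the decision threshold, which is why it gives a fairer comparison than accuracy when positive reviews dominate the data.</p>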
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bakliwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Puil</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Brien</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tounsi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hughes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Sentiment analysis of political tweets: Towards an accurate classifier</article-title>
          . Association for Computational Linguistics (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>Opinion mining and sentiment analysis</article-title>
          .
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>2</volume>
          (
          <issue>1-2</issue>
          ) (
          <year>2008</year>
          )
          <fpage>1</fpage>
          –
          <lpage>135</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skiena</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Trading strategies to exploit blog and news sentiment</article-title>
          . In: Icwsm. (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>McAuley</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leskovec</surname>
          </string-name>
          , J.:
          <article-title>From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews</article-title>
          .
          <source>In: Proceedings of the 22nd international conference on World Wide Web, ACM</source>
          (
          <year>2013</year>
          )
          <fpage>897</fpage>
          –
          <lpage>908</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Yelp: Yelp Dataset Challenge. https://www.yelp.ie/dataset_challenge (
          <year>2017</year>
          ) [Online; accessed 23-June-2017].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vaithyanathan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Thumbs up?: sentiment classification using machine learning techniques</article-title>
          .
          <source>In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing-</source>
          Volume
          <volume>10</volume>
          , Association for Computational Linguistics (
          <year>2002</year>
          )
          <fpage>79</fpage>
          –
          <lpage>86</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised sequence learning</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . (
          <year>2015</year>
          )
          <fpage>3079</fpage>
          –
          <lpage>3087</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>In: Proceedings of the 31st International Conference on Machine Learning (ICML-14)</source>
          . (
          <year>2014</year>
          )
          <fpage>1188</fpage>
          –
          <lpage>1196</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: EMNLP</source>
          . Volume
          <volume>14</volume>
          . (
          <year>2014</year>
          )
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          –
          <lpage>1780</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liwicki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertolami</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunke</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A novel connectionist system for unconstrained handwriting recognition</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>31</volume>
          (
          <issue>5</issue>
          ) (
          <year>2009</year>
          )
          <fpage>855</fpage>
          –
          <lpage>868</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          :
          <article-title>Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews</article-title>
          .
          <source>In: Proceedings of the 40th annual meeting on Association for Computational Linguistics</source>
          , Association for Computational Linguistics (
          <year>2002</year>
          )
          <fpage>417</fpage>
          –
          <lpage>424</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>E.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised recursive autoencoders for predicting sentiment distributions</article-title>
          .
          <source>In: Proceedings of the conference on empirical methods in natural language processing</source>
          ,
          <source>Association for Computational Linguistics</source>
          (
          <year>2011</year>
          )
          <fpage>151</fpage>
          –
          <lpage>161</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perelygin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chuang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potts</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.:
          <article-title>Recursive deep models for semantic compositionality over a sentiment treebank</article-title>
          .
          <source>In: Proceedings of the conference on empirical methods in natural language processing (EMNLP)</source>
          . Volume
          <volume>1631</volume>
          . (
          <year>2013</year>
          )
          <fpage>1642</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
          </string-name>
          , E.:
          <article-title>When are tree structures necessary for deep learning of representations?</article-title>
          .
          <source>arXiv preprint arXiv:1503.00185</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine learning 20(3)</source>
          (
          <year>1995</year>
          )
          <fpage>273</fpage>
          –
          <lpage>297</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Transductive inference for text classification using support vector machines</article-title>
          .
          <source>In: ICML</source>
          . Volume
          <volume>99</volume>
          . (
          <year>1999</year>
          )
          <fpage>200</fpage>
          –
          <lpage>209</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Dundar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnapuram</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          :
          <article-title>Learning classifiers when the training data is not iid</article-title>
          . In: IJCAI. (
          <year>2007</year>
          )
          <fpage>756</fpage>
          –
          <lpage>761</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Speech and language processing</article-title>
          .
          <source>International Edition</source>
          <volume>710</volume>
          (
          <year>2000</year>
          )
          <fpage>25</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frasconi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Gradient flow in recurrent nets: the difficulty of learning long-term dependencies (</article-title>
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22. Stanford CS224n: Lecture notes: Part V. http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes5.pdf [Online; accessed 22-August-2017].
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Dozat</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Incorporating nesterov momentum into adam</article-title>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>