             Convolutional Neural Networks for Sentiment
                 Classification on Business Reviews

                                           Andreea Salinca
                 Faculty of Mathematics and Computer Science, University of Bucharest
                                         Bucharest, Romania
                                    andreea.salinca@fmi.unibuc.ro



                                           Abstract

     Recently, Convolutional Neural Network (CNN) models have shown remarkable results for text classification and sentiment analysis. In this paper, we present our approach to the task of classifying business reviews using word embeddings on a large-scale dataset provided by Yelp: the Yelp 2017 challenge dataset. We compare word-based CNNs using several pre-trained word embeddings and end-to-end learned vector representations for text review classification. We conduct several experiments to capture the semantic relationships between business reviews, and the deep learning techniques we use show that the obtained results are competitive with traditional methods.

1    Introduction

In recent years, researchers have investigated the problem of automatic text categorization and sentiment classification - determining the overall opinion towards the subject matter, i.e. whether a user review is positive or negative. Sentiment classification is useful in the areas of recommender systems and business intelligence applications.
   The effectiveness of applying machine learning techniques to sentiment classification of product or movie reviews has been demonstrated with traditional approaches such as representing text reviews with a bag-of-words model and methods such as Naive Bayes, maximum entropy classification and SVMs (Support Vector Machines) [PL+08, PLV02, MDP+11]. Convolutional Neural Networks (CNNs) have achieved remarkable results in the area of sentiment analysis and text classification on large-scale databases [Kim14, ZW15, JZ14].
   In this article, we conduct an empirical study of word-based CNNs for sentiment classification using the Yelp 2017 challenge dataset [yel17], which comprises 4.1M user reviews about local businesses with star ratings from 1 to 5. We choose two models for comparison, both word-based CNNs with one or multiple layers of convolution built on top of word vectors, using either pre-trained or end-to-end learned word representations with different embedding sizes. Previous works report several techniques for sentiment classification of text reviews using the Yelp 2015 challenge dataset [ZZL15, TQL15, Sal15].
   A series of experiments is conducted to explore the effect of architecture components on model performance, along with hyperparameter tuning, including filter region size, number of feature maps, and regularization parameters of the proposed convolutional neural networks. We discuss the design decisions for sentiment classification on the Yelp 2017 dataset, compare the two models and report the obtained accuracy.
   In our work, we aim to identify empirical hyperparameter tuning and practical settings, drawing on the research conducted by [Kim14] on a simple CNN architecture. We also take into consideration advice from the empirical analysis of CNN architectures and hyperparameter settings for sentence classification described in [ZW15]. We obtain an accuracy of 95.6%, via 3-fold cross validation, on the Yelp 2017 challenge dataset using a word-based CNN along with sentiment-specific word embeddings.

Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes.
In: Proceedings of the IJCAI Workshop on Semantic Machine Learning (SML 2017), August 19-25, 2017, Melbourne, Australia.

2    Prior Work

Kim et al. present a series of experiments using a simple one-layer convolutional neural network built on top of pre-trained word2vec models obtained from an unsupervised neural language model, with little parameter tuning, for sentiment analysis and sentence classification [Kim14]. Zhang et al. offer practical advice by performing an extensive study of the effect of CNN architecture components on model performance for sentence classification, with results that outperform baseline methods such as SVM or logistic regression [ZW15].
   In [JZ14] the benefit of word order is demonstrated for topic classification and sentiment classification using CNNs with a bag-of-words model in the convolution layer.
   Other approaches use character-level convolutional networks rather than word-based approaches and achieve state-of-the-art results for text classification and sentiment analysis on large-scale review datasets such as Amazon and the Yelp 2015 challenge dataset. For the Yelp polarity dataset, considering stars 1 and 2 as negative, 3 and 4 as positive and dropping 5-star reviews, the authors use 560 000 training samples, 38 000 test samples and 5 000 epochs in training [ZZL15].
   A comparison is made between several models using traditional techniques with several feature extractors - bag-of-words and TFIDF (term-frequency inverse-document-frequency), bag-of-ngrams and TFIDF, bag-of-means on word embeddings (word2vec) and TFIDF, with a multinomial logistic regression linear classifier - and deep learning techniques: word-based ConvNets (Convolutional Neural Networks), one large with 1024 and one small with 256 feature sizes, 9 layers deep with 6 convolutional layers and 3 fully-connected layers, and a long short-term memory (LSTM) recurrent neural network model. The testing errors are reported on all models for Yelp sentiment analysis: 4.36% is obtained for the n-gram traditional approach; word-based CNNs with pre-trained word2vec obtain 4.60% for the large-featured architecture and 5.56% for the small-featured architecture. Also, word-based CNNs with lookup tables achieve a score of 4.89% for the large-featured architecture and 5.54% for the small-featured architecture. The character-level ConvNets model reports an error of 5.89% for the large-featured architecture and 6.53% for the small-featured architecture [ZZL15].
   In [TQL15] a convolutional-gated recurrent neural network approach is proposed, which encodes relations between sentences and obtains a 67.1% accuracy on the Yelp 2015 dataset (split into training, development and testing sets of 80/10/10). It is compared to a baseline implementation of a convolutional neural network based on Kim's work [Kim14], which obtains an accuracy of 61.5% for sentiment analysis. On the same dataset, an accuracy of 62.4% is achieved using a traditional approach with SVM and bigrams.
   In prior work, traditional approaches are used for sentiment classification on the Yelp 2015 challenge dataset (split into 80% for training and 20% for testing, with 3-fold cross validation). Linear Support Vector Classification and a Stochastic Gradient Descent classifier report an accuracy of 94.4% using unigrams and applying preprocessing techniques to extract a set of feature characteristics [Sal15].

3    Convolutional Neural Network Model

We model Yelp text reviews using two convolutional architecture approaches. The first model is a word-based CNN with an embedding layer in which we tokenize text review sentences into a sentence matrix whose rows are the word vector representations of each token, similar to the approach of Kim et al. [Kim14]. We truncate the reviews to a maximum length of 1000 words and only consider the top 100 000 most commonly occurring words in the business reviews dataset.
   We use pre-trained word embeddings - GloVe [KFF15] with 100-dimensional embeddings of 400k words computed on a 2014 dump of English Wikipedia, word2vec [MCCD13] with 300-dimensional embeddings and fastText [BGJM16] with 300-dimensional embeddings - as well as a vocabulary trained from the reviews dataset using word2vec with 100-dimensional word embeddings. Out-of-vocabulary words are randomly initialized by sampling values uniformly from (-0.25, 0.25) and optimized during training.
   Next, a convolutional layer with filters of one region size is applied. Filter widths are equal to the dimension of the word vectors [ZW15]. Then we apply a max-pooling operation on the feature map to compute a fixed-length feature vector and finally a softmax classifier to predict the outputs. During training, we use the dropout regularization technique, in which network units are randomly dropped [GG16], and we minimize the categorical cross-entropy loss. We use 300 feature maps, a 1D convolution window of length 2, the rectified linear unit (ReLU) activation function, 1-max-pooling of size 2 and a dropout probability (p) of 0.2.
   The second model approach differs from the first by using multiple filters for the same region size to learn complementary features from the same regions. We propose 3 filter region sizes with 128 features per filter region, a 1D convolution window of length 5, a dropout probability (d) of 0.5 and 1-max-pooling of size 35. We compare two different optimizers: Nesterov Adam and RMSprop [SMDH13].
4    Results And Discussion

4.1    Yelp Challenge Dataset

The Yelp 2017 challenge dataset, introduced in the 9th round of the Yelp Challenge, comprises user reviews about local businesses in 11 cities across 4 countries, with star ratings from 1 to 5. The large-scale dataset comprises 4.1M reviews and 947K tips by 1M users for 144K businesses [yel17]. The Yelp 2017 challenge dataset has been updated compared to the datasets in previous rounds, such as the Yelp 2015 and Yelp 2013 challenge datasets.
   We conduct our system evaluation on U.S. cities: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland, totaling 1 942 339 reviews. For the sentiment analysis classification task, we consider the 1 and 2 star ratings as negative sentiments and 4 and 5 as positive sentiments, and we drop the 3 star reviews, as the average Yelp review is 3.7 stars.
   Next, we use two subsets of the Yelp 2017 dataset to conduct our experiments, due to computational power constraints.
   Our first experiments are done on a smaller subset of the Yelp dataset with 8 200 training samples, 2 000 validation samples and 900 testing samples. We will call this the Small Yelp dataset.
   Further, we experiment on 82 000 training samples, 20 000 validation samples and 9 000 testing samples. We will call this the Big Yelp dataset.
   In the last experiment, we split the large-scale Yelp US dataset into 80% for training and 20% for testing. We use 3-fold cross validation for evaluating different hyperparameters for the deep neural methods. We use accuracy as the evaluation metric, which is a standard metric to measure overall sentiment review classification performance [MS+99].

4.2    Word Embeddings

We use several pre-trained models of word embeddings built with an unsupervised learning algorithm for obtaining vector representations of words: GloVe [KFF15], and word2vec with pre-trained vectors trained on part of the Google News dataset (about 100 billion words); the word2vec model contains 300-dimensional vectors for 3 million words and phrases [MCCD13].
   We also use fastText pre-trained word vectors for the English language, which are an extension of word2vec. These vectors of dimension 300 were trained on Wikipedia using the skip-gram model described in [BGJM16] with default parameters.
   Moreover, we use in the embedding layer of both proposed CNNs 100-dimensional word2vec embedding vectors that we have trained using the text reviews in the training dataset.

4.3    Experimental Results

We conduct an empirical exploration of the use of the proposed word-based CNN architectures for sentiment classification on Yelp business reviews.
   In the training phase, we use a batch size of 500 and 3 epochs for the first model approach, and a batch size of 128 and 2 epochs for the second model approach.
   We obtain the same classification accuracy of 77.88% when using 100-dimension and 300-dimension GloVe word embeddings with the first proposed CNN, having 300 feature maps and a convolution window of length 5, on the Small Yelp dataset.
   We study the effect of the filter kernel size of the convolution, when using only one region size, on model accuracy, shown in Fig. 1. We set the number of feature maps for this region size to 300, consider region sizes of 2, 3 and 5, and compute the mean of 3-fold CV for each. We observe that the CNN performs better with a smaller region size, obtaining an accuracy of 79.5% (window of 2 words), than with a larger region size (window size of 5), which obtains 22.1%.

Figure 1: CNN accuracy for different kernel sizes when the number of feature maps is 300.

   The word embeddings used in the embedding layer of our CNNs have successfully captured the semantic relations among entities in the unstructured text reviews. For the Big Yelp dataset, using the first CNN model approach with 300 feature maps, a region size of 2, a dropout probability of 0.2 and the Nesterov Adam optimizer, we obtain a score of 89.59% in the sentiment classification.
   Furthermore, we conduct our study on the second model approach of the word-based CNN, having 3 filter region sizes, 128 features per filter region, a 1D convolution window of length 5, a dropout probability (d) of 0.5 and 1-max-pooling of size 35, along with the Nesterov Adam optimizer.
   In Table 1 we report results achieved using the second model approach along with pre-trained GloVe embeddings with 100 dimensions, word2vec and fastText word embeddings, and a vocabulary trained from the reviews dataset using word2vec with embeddings of size 100. For both the pre-trained word2vec and fastText embeddings we choose 300-dimensional word vectors.
   We find that the choice of the vector input representation has an impact on the sentiment classification performance. On the Small Yelp dataset we report a significant difference of 11.52% between the highest score, obtained using pre-trained word2vec embeddings, and the self-built dictionary using the word2vec model.
   However, on the Big Yelp dataset we report a difference of only 0.81% between the highest score, obtained using pre-trained fastText embeddings, and the pre-trained word2vec vectors. The relative performance achieved using the second CNN model approach yields similar accuracy scores on the Big Yelp dataset, regardless of the input embeddings (Table 1). We can observe that the scale of the dataset has an impact on the overall performance in the sentiment classification task.

Table 1: Accuracy results on the Yelp reviews dataset.

   Dataset               Word embeddings            Embed. dim.   Train acc.   Test acc.
   Small Yelp reviews    Pre-trained GloVe          100           89.65%       87.36%
   Small Yelp reviews    Pre-trained word2vec       300           91.25%       90.41%
   Small Yelp reviews    Pre-trained fastText       300           89.90%       88.77%
   Small Yelp reviews    Word2vec self-dictionary   100           79.45%       78.89%
   Big Yelp reviews      Pre-trained GloVe          100           94.46%       94.54%
   Big Yelp reviews      Pre-trained word2vec       300           93.80%       93.92%
   Big Yelp reviews      Pre-trained fastText       300           94.49%       94.73%
   Big Yelp reviews      Word2vec self-dictionary   100           94.45%       94.60%

   Training is done through stochastic gradient descent over shuffled mini-batches with the Nesterov Adam or RMSprop update rule. Nesterov Adam obtains better results than RMSprop [SMDH13] when using the second model approach with the same number of epochs and a dropout of 0.2. The sentiment accuracy computed on the Big Yelp dataset using the RMSprop method scored 0.16% lower than the accuracy obtained using Nesterov Adam, which scored 95.15%.
   The CNN model in the second approach performed better in text review classification than the first approach, due to the differences in the architecture and the depth of the convolutional network. The filter region size also has a large effect on classifier performance. When we impose a stronger regularization on the model, the performance decreases: for a dropout of 0.5 we obtain 94.54%, compared to 95.15% for a dropout of 0.2. A similar remark about dropout regularization is reported in [ZW15].
   Prior work offers a baseline CNN configuration implementing the architectural decisions and hyperparameters of [Kim14] on the Yelp 2015 challenge dataset for sentiment classification of text reviews [TQL15]. The authors report an accuracy of 61.5% for this baseline and propose a new method that represents documents with a convolutional recurrent neural network, which adaptively encodes the semantics of sentences and their relations and achieves 67.6%. Also, traditional methods such as SVM with bigrams report a score of 62.4%.
   In [ZZL15] the authors propose character-level CNNs that achieve an accuracy of 94.11% for the large-featured architecture and 93.47% for the small-featured architecture, and compare the obtained results to baseline word-based CNNs with pre-trained word2vec that obtain 95.40% accuracy for the large-featured architecture and 94.44% for the small-featured architecture. In their experiments the authors drop 5-star reviews and use 560 000 training samples and 38 000 test samples from the Yelp 2015 challenge dataset, with 5 000 epochs in training. Traditional methods such as an n-gram linear classifier report a score of 95.64% on this subset.
   In comparison against traditional models such as bag-of-words, n-grams and TFIDF variants, the deep learning models - the word-based CNNs with the hyperparameters proposed in this paper - obtain results comparable to the baseline methods [ZZL15, TQL15, Sal15]. On the Big Yelp dataset, we report an accuracy of 94.73% using pre-trained fastText vector embeddings and a CNN with 3 filter region sizes and 128 feature maps.
   Further, we conduct our evaluation on the complete Yelp 2017 challenge dataset. The second CNN model approach proposed in this work yields the best performance on the Yelp 2017 challenge dataset in terms of accuracy. We obtain an accuracy of 95.6% using 3-fold cross validation.

5    Conclusions And Future Work

In the present work, we have described a series of experiments with word-based convolutional neural networks. We introduce two neural network model approaches with different architectural sizes and several word vector representations. We conduct an empirical study of the effect of hyperparameters on the overall performance in the sentiment classification task.
   In the experimental results, we find that the size of the dataset has an important effect on the system performance in training and evaluation; a better accuracy score is obtained using the second CNN model approach on the Big Yelp dataset compared to the results obtained on the Small Yelp dataset. Furthermore, when evaluating the second model approach on the large-scale 2017 Yelp dataset, we achieve an accuracy score of 95.6% using 3-fold cross validation.
   The models proposed in this article show good ability for understanding natural language and predicting users' sentiments.
We see that our results are comparable to, and sometimes surpass, those in the literature for the task of classifying business reviews using the Yelp 2017 challenge dataset [ZZL15, TQL15, Sal15].
   In future work, we can explore Bayesian optimization frameworks for hyperparameter ranges rather than a grid search approach. Also, we can conduct other experiments using a recurrent neural network (RNN) with the Long Short-Term Memory (LSTM) architecture [Gra12] for sentiment categorization of Yelp user text reviews.

References

[BGJM16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

[GG16] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.

[Gra12] Alex Graves. Supervised sequence labelling. In Supervised Sequence Labelling with Recurrent Neural Networks, pages 5–13. Springer, 2012.

[JZ14] Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058, 2014.

[KFF15] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[Kim14] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[MDP+11] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics, 2011.

[MS+99] Christopher D Manning, Hinrich Schütze, et al. Foundations of Statistical Natural Language Processing, volume 999. MIT Press, 1999.

[PL+08] Bo Pang, Lillian Lee, et al. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.

[PLV02] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing-Volume 10, pages 79–86. Association for Computational Linguistics, 2002.

[Sal15] Andreea Salinca. Business reviews classification using sentiment analysis. In Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2015 17th International Symposium on, pages 247–250. IEEE, 2015.

[SMDH13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.

[TQL15] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pages 1422–1432, 2015.

[yel17] Yelp Challenge Dataset, 2017.

[ZW15] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.

[ZZL15] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.