=Paper=
{{Paper
|id=Vol-1986/SML17_paper_7
|storemode=property
|title=Convolutional Neural Networks for Sentiment Classification on Business Reviews
|pdfUrl=https://ceur-ws.org/Vol-1986/SML17_paper_7.pdf
|volume=Vol-1986
|authors=Andreea Salinca
|dblpUrl=https://dblp.org/rec/conf/ijcai/Salinca17
}}
==Convolutional Neural Networks for Sentiment Classification on Business Reviews==
Andreea Salinca
Faculty of Mathematics and Computer Science, University of Bucharest
Bucharest, Romania
andreea.salinca@fmi.unibuc.ro
Abstract

Recently, Convolutional Neural Network (CNN) models have achieved remarkable results for text classification and sentiment analysis. In this paper, we present our approach to the task of classifying business reviews using word embeddings on a large-scale dataset provided by Yelp: the Yelp 2017 challenge dataset. We compare word-based CNNs using several pre-trained word embeddings and end-to-end vector representations for text review classification. We conduct several experiments to capture the semantic relationship between business reviews, and we use deep learning techniques that show that the obtained results are competitive with traditional methods.

1 Introduction

In recent years, researchers have investigated the problem of automatic text categorization and sentiment classification, that is, determining the overall opinion towards the subject matter: whether the user review is positive or negative. Sentiment classification is useful in the area of recommender systems and business intelligence applications.

Machine learning techniques have been applied effectively to sentiment classification of product or movie reviews using traditional approaches, such as representing text reviews with a bag-of-words model and applying methods such as Naive Bayes, maximum entropy classification and SVMs (support vector machines) [PL+ 08, PLV02, MDP+ 11]. Convolutional Neural Networks (CNNs) have achieved remarkable results in the area of sentiment analysis and text classification on large-scale databases [Kim14, ZW15, JZ14].

In this article, we conduct an empirical study of word-based CNNs for sentiment classification using the Yelp 2017 challenge dataset [yel17], which comprises 4.1M user reviews about local businesses with star ratings from 1 to 5. We choose two models for comparison, both of which are word-based CNNs with one or multiple convolution layers built on top of word vectors, using pre-trained or end-to-end learned word representations with different embedding sizes. Previous works report several techniques for sentiment classification of text reviews using the Yelp 2015 challenge dataset [ZZL15, TQL15, Sal15].

A series of experiments is carried out to explore the effect of architecture components on model performance along with hyperparameter tuning, including filter region size, number of feature maps, and regularization parameters of the proposed convolutional neural networks. We discuss the design decisions for sentiment classification on the Yelp 2017 dataset, offer a comparison between these models and report the obtained accuracy.

In our work, we aim to identify empirical hyperparameter tuning and practical settings, and we draw inspiration from the research conducted by [Kim14] on a simple CNN architecture. Furthermore, we also take into consideration advice from the empirical analysis of CNN architectures and hyperparameter settings for sentence classification described by [ZW15]. We obtain an accuracy of 95.6%, via 3-fold cross validation, on the Yelp 2017 challenge dataset using a word-based CNN along with sentiment-specific word embeddings.
Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes.
In: Proceedings of IJCAI Workshop on Semantic Machine Learning (SML 2017), Aug 19-25 2017, Melbourne, Australia.

2 Prior Work

Kim presents a series of experiments using a simple one layer convolutional neural network built on
top of pre-trained word2vec models obtained from an unsupervised neural language model with little parameter tuning for sentiment analysis and sentence classification [Kim14]. Zhang and Wallace offer practical advice by performing an extensive study on the effect of architecture components of CNNs for sentence classification on model performance, with results that outperform baseline methods such as SVM or logistic regression [ZW15].

In [JZ14], the benefit of word order for topic classification and sentiment classification is shown using CNNs and a bag-of-words model in the convolution layer.

Other approaches use character-level convolutional networks rather than word-based approaches and achieve state-of-the-art results for text classification and sentiment analysis on large-scale review datasets such as Amazon and the Yelp 2015 challenge dataset. For the Yelp polarity dataset, by considering stars 1 and 2 negative, 3 and 4 positive and dropping 5 star reviews, the authors use 560 000 train samples, 38 000 test samples and 5 000 epochs in training [ZZL15].

A comparison is made between several models. The traditional techniques use several feature extractors, namely bag-of-words and TFIDF (term-frequency inverse-document-frequency), bag-of-ngrams and TFIDF, and bag-of-means on word embeddings (word2vec) and TFIDF, together with a linear classifier (multinomial logistic regression). The deep learning techniques are word-based ConvNets (Convolutional Neural Networks), one large with 1024 and one small with 256 feature sizes, each 9 layers deep with 6 convolutional layers and 3 fully-connected layers, and a long short-term memory (LSTM) recurrent neural network model. The testing errors are reported on all models for Yelp sentiment analysis: 4.36% is obtained for the n-gram traditional approach, while word-based CNNs with pre-trained word2vec obtain 4.60% for the large-featured architecture and 5.56% for the small-featured architecture. Also, word-based CNNs with lookup tables achieve a score of 4.89% for the large-featured architecture and 5.54% for the small-featured architecture. The character-level ConvNets model reports an error of 5.89% for the large-featured architecture and 6.53% for the small-featured architecture [ZZL15].

In [TQL15], a convolutional-gated recurrent neural network approach is proposed, which encodes relations between sentences and obtains a 67.1% accuracy on the Yelp 2015 dataset (split into training, development and testing sets of 80/10/10). It is compared to a baseline implementation of a convolutional neural network based on Kim's work [Kim14], with an accuracy of 61.5% for sentiment analysis. On the same dataset, an accuracy of 62.4% is achieved using a traditional approach with SVM and bigrams.

In prior work, the authors use traditional approaches for sentiment classification on the Yelp 2015 challenge dataset (split into 80% for training and 20% for testing, with 3-fold cross validation). Linear Support Vector Classification and Stochastic Gradient Descent Classifier report an accuracy of 94.4% using unigrams and applying preprocessing techniques to extract a set of feature characteristics [Sal15].

3 Convolutional Neural Network Model

We model Yelp text reviews using two convolutional architecture approaches. The first model is a word-based CNN with an embedding layer, in which we tokenize text review sentences into a sentence matrix whose rows are the word vector representations of each token, similar to the approach of Kim [Kim14]. We truncate the reviews to a maximum length of 1000 words and only consider the top 100 000 most commonly occurring words in the business reviews dataset.

We use pre-trained word embeddings: GloVe [KFF15] with 100-dimensional embeddings of 400k words computed on a 2014 dump of English Wikipedia, word2vec [MCCD13] with 300-dimensional embeddings and fastText [BGJM16] with 300-dimensional embeddings. We also use a vocabulary trained from the reviews dataset using word2vec with 100-dimensional word embeddings. Out-of-vocabulary words are randomly initialized by sampling values uniformly from (-0.25, 0.25) and optimized during training.

Next, a convolutional layer with filters of one region size is applied. Filter widths are equal to the dimension of the word vectors [ZW15]. Then we apply a max-pooling operation on the feature map to compute a fixed-length feature vector, and finally a softmax classifier to predict the outputs. During training, we use the dropout regularization technique, where network units are randomly dropped [GG16]. Also, we aim to minimize the categorical cross-entropy loss. We use 300 feature maps, a 1D convolution window of length 2, the rectified linear unit (ReLU) activation function, 1-max-pooling of size 2 and a dropout probability (p) of 0.2.

The second model differs from the first by using multiple filters for the same region size to learn complementary features from the same regions. We propose 3 filter region sizes with 128 features per filter region, a 1D convolution window of length 5, a dropout probability (d) of 0.5 and 1-max-pooling of size 35. We compare two different optimizers: Nesterov Adam and RMSprop [SMDH13].
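
As a rough illustration of the first model in Section 3 (sentence matrix, one convolution region size, ReLU, 1-max pooling, softmax), the forward pass can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names, the toy sizes and the random weights are hypothetical, pooling is taken here as global max-over-time in the style of [Kim14], and dropout and training are left out.

```python
import math
import random


def conv1d_relu(sentence, filt, bias):
    """Slide one filter of shape (region_size, emb_dim) over the sentence matrix."""
    region = len(filt)
    feature_map = []
    for start in range(len(sentence) - region + 1):
        total = bias
        for offset in range(region):
            word_vec = sentence[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        feature_map.append(max(0.0, total))  # ReLU activation
    return feature_map


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def cnn_forward(sentence, filters, biases, out_weights, out_bias):
    """Return the pooled feature vector and the class probabilities."""
    # 1-max pooling: one value per filter, so the vector length does not
    # depend on the number of tokens in the review.
    pooled = [max(conv1d_relu(sentence, f, b)) for f, b in zip(filters, biases)]
    scores = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(out_weights, out_bias)]
    return pooled, softmax(scores)


def uniform_matrix(rows, cols, rng, bound=0.25):
    """Random matrix with entries drawn uniformly from (-bound, bound)."""
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions); the text uses 300 feature maps and a region size of 2.
EMB_DIM, REGION, N_FILTERS, N_CLASSES = 4, 2, 3, 2
rng = random.Random(0)
sentence = uniform_matrix(6, EMB_DIM, rng)           # 6 tokens, one row per token
filters = [uniform_matrix(REGION, EMB_DIM, rng) for _ in range(N_FILTERS)]
biases = [0.0] * N_FILTERS
out_weights = uniform_matrix(N_CLASSES, N_FILTERS, rng)
out_bias = [0.0] * N_CLASSES

pooled, probs = cnn_forward(sentence, filters, biases, out_weights, out_bias)
print(len(pooled), len(probs))
```

Each filter spans the full embedding width, so it only slides along the token axis, and the pooled vector has exactly one entry per filter.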
4 Results And Discussion

4.1 Yelp Challenge Dataset

The Yelp 2017 challenge dataset, introduced in the 9th round of the Yelp Challenge, comprises user reviews about local businesses in 11 cities across 4 countries with star ratings from 1 to 5. The large-scale dataset comprises 4.1M reviews and 947K tips by 1M users for 144K businesses [yel17]. The Yelp 2017 challenge dataset has been updated compared to datasets in previous rounds, such as the Yelp 2015 challenge dataset or the Yelp 2013 challenge dataset.

We conduct our system evaluation on U.S. cities: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland, with a total of 1 942 339 reviews. For the sentiment classification task, we consider the 1 and 2 star ratings as negative sentiments and 4 and 5 as positive sentiments, and we drop the 3 star reviews, as the average Yelp review rating is 3.7.

Next, we use two subsets of the Yelp 2017 dataset to conduct our experiments, due to computational power constraints.

Our first experiments are done on a smaller subset of the Yelp dataset with 8200 training samples, 2000 validation samples and 900 testing samples. We call this the Small Yelp dataset.

Further, we experiment on 82 000 training samples, 20 000 validation samples and 9 000 testing samples. We call this the Big Yelp dataset.

In the last experiment, we split the large-scale Yelp US dataset into 80% for training and 20% for testing. We use 3-fold cross validation for evaluating different hyperparameters for the deep neural methods. We use accuracy as the evaluation metric, which is a standard metric to measure the overall sentiment review classification performance [MS+ 99].

4.2 Word Embeddings

We use several pre-trained models of word embeddings built with unsupervised learning algorithms for obtaining vector representations of words: GloVe [KFF15] and word2vec, the latter with pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The word2vec model contains 300-dimensional vectors for 3 million words and phrases [MCCD13].

We also use fastText pre-trained word vectors for the English language, which are an extension of word2vec. These 300-dimensional vectors were trained on Wikipedia using the skip-gram model described in [BGJM16] with default parameters.

Moreover, we use in the embedding layer of both proposed CNNs 100-dimensional word2vec embedding vectors that we have trained using the text reviews in the training dataset.

4.3 Experimental Results

We conduct an empirical exploration of the use of the proposed word-based CNN architectures for sentiment classification on Yelp business reviews.

In the training phase, we use a batch size of 500 and 3 epochs for the first model, and a batch size of 128 and 2 epochs for the second model.

We obtain the same classification accuracy of 77.88% when using 100-dimensional and 300-dimensional GloVe word embeddings with the first proposed CNN, with 300 feature maps and a convolution window of length 5, on the Small Yelp dataset.

We study the effect of the filter kernel size of the convolution on the model accuracy when using only one region size, as shown in Fig. 1. We set the number of feature maps for this region size to 300, consider region sizes of 2, 3 and 5, and compute the mean of 3-fold CV for each. We observe that the CNN performs better with a smaller region size, obtaining an accuracy of 79.5% (window of 2 words), than with a larger region size (window of 5 words), obtaining 22.1%.

Figure 1: CNN accuracy for different kernel sizes when the number of feature maps is 300

The word embeddings used in the embedding layer of our CNNs have successfully captured the semantic relations among entities in the unstructured text reviews. For the Big Yelp dataset, using the first CNN model with 300 feature maps, a region size of 2, a dropout probability of 0.2 and the Nesterov Adam optimizer, we obtain an accuracy of 89.59% in the sentiment classification.

Furthermore, we conduct our study on the second model of the word-based CNN, with 3 filter region sizes, 128 features per filter region, a 1D convolution window of length 5, a dropout probability (d) of 0.5 and 1-max-pooling of size 35, along with the Nesterov Adam optimizer.
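
To make the idea of several filter region sizes in the second model more concrete, the sketch below concatenates 1-max-pooled features from filters of different heights. It is only an illustration under assumptions: the region sizes (3, 4, 5) are not listed in the text and are borrowed from the common [Kim14]-style setup, the helper names are hypothetical, biases are omitted, and 2 filters per region stand in for the 128 used in the experiments.

```python
import random


def max_over_time(sentence, filt):
    """Convolve one filter over the sentence, apply ReLU, keep the maximum."""
    region = len(filt)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(sentence) - region + 1):
        total = 0.0
        for offset in range(region):
            total += sum(w * x for w, x in zip(filt[offset], sentence[start + offset]))
        best = max(best, total)
    return best


def multi_region_features(sentence, filter_banks):
    """Concatenate pooled features from every filter of every region size."""
    features = []
    for region_size in sorted(filter_banks):
        for filt in filter_banks[region_size]:
            features.append(max_over_time(sentence, filt))
    return features


def random_filter(region_size, emb_dim, rng):
    """Filter of shape (region_size, emb_dim) with small random weights."""
    return [[rng.uniform(-0.25, 0.25) for _ in range(emb_dim)]
            for _ in range(region_size)]


rng = random.Random(0)
EMB_DIM = 4                  # toy embedding size (assumption)
REGION_SIZES = (3, 4, 5)     # assumption: not specified in the text
FILTERS_PER_REGION = 2       # the text uses 128 features per filter region

sentence = [[rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)] for _ in range(8)]
filter_banks = {
    h: [random_filter(h, EMB_DIM, rng) for _ in range(FILTERS_PER_REGION)]
    for h in REGION_SIZES
}

features = multi_region_features(sentence, filter_banks)
print(len(features))  # 3 region sizes x 2 filters = 6
```

With 128 filters for each of 3 region sizes, the same concatenation would give a 384-dimensional vector that is then passed through dropout to the softmax classifier.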
In Table 1 we report results achieved using the second model along with pre-trained GloVe embeddings of dimension 100, word2vec and fastText word embeddings, and a vocabulary trained from the reviews dataset using word2vec with word embeddings of size 100. For both pre-trained word2vec and fastText embeddings we choose 300-dimensional word vectors.

We find that the choice of input vector representation has an impact on the performance of the sentiment classification. On the Small Yelp dataset we report a significant difference of 11.52% between the highest score, obtained using pre-trained word2vec embeddings, and the self-built dictionary using the word2vec model.

However, on the Big Yelp dataset we report a difference of only 0.81% between the highest score, obtained using pre-trained fastText embeddings, and the pre-trained word2vec vectors. The second CNN model achieves similar accuracy scores on the Big Yelp dataset, regardless of the input embeddings (Table 1). We can observe that the scale of the dataset has an impact on the overall performance in the sentiment classification task.

{| class="wikitable"
|+ Table 1: Accuracy results on Yelp reviews dataset.
! Dataset !! Embeddings used by the CNN !! Embedding dimension !! Train accuracy !! Test accuracy
|-
| Small Yelp reviews || Pre-trained GloVe || 100 || 89.65% || 87.36%
|-
| Small Yelp reviews || Pre-trained word2vec || 300 || 91.25% || 90.41%
|-
| Small Yelp reviews || Pre-trained fastText || 300 || 89.90% || 88.77%
|-
| Small Yelp reviews || Word2Vec self-dictionary || 100 || 79.45% || 78.89%
|-
| Big Yelp reviews || Pre-trained GloVe || 100 || 94.46% || 94.54%
|-
| Big Yelp reviews || Pre-trained word2vec || 300 || 93.80% || 93.92%
|-
| Big Yelp reviews || Pre-trained fastText || 300 || 94.49% || 94.73%
|-
| Big Yelp reviews || Word2Vec self-dictionary || 100 || 94.45% || 94.60%
|}

Training is done through stochastic gradient descent over shuffled mini-batches with the Nesterov Adam or RMSprop update rule. Nesterov Adam obtains better results than RMSprop [SMDH13] when using the second model with the same number of epochs and a dropout of 0.2. The accuracy computed on the Big Yelp dataset using RMSprop is 0.16 percentage points lower than the accuracy obtained using Nesterov Adam, which is 95.15%.

The CNN model in the second approach performs better in the text review classification than the first approach due to the differences in the model architecture and the depth of the convolutional network. The filter region size has a large effect on the classifier performance.

When we impose a stronger regularization on the model, the performance decreases slightly: for a dropout of 0.5 we obtain 94.54%, compared to 95.15% for a dropout of 0.2. A similar remark about dropout regularization is reported in [ZW15].

Prior work offers a baseline CNN configuration implementing the architectural decisions and hyperparameters of [Kim14] on the Yelp 2015 challenge dataset for sentiment classification of text reviews [TQL15]. The authors report an accuracy of 61.5% and propose a new method that represents documents with a convolutional recurrent neural network, which adaptively encodes the semantics of sentences and their relations and achieves 67.6%. Also, traditional methods such as SVM with bigrams report a score of 62.4%.

In [ZZL15] the authors propose character-level CNNs that achieve an accuracy of 94.11% for the large-featured architecture and 93.47% for the small-featured architecture, and compare the obtained results to baseline word-based CNNs with pre-trained word2vec that obtain 95.40% accuracy for the large-featured architecture and 94.44% for the small-featured architecture. In their experiments the authors drop 5 star reviews, and use 560 000 train samples and 38 000 test samples from the Yelp 2015 challenge dataset, with 5 000 epochs in training. Traditional methods such as an n-gram linear classifier report a score of 95.64% on the subset.

Compared with traditional models such as bag of words, n-grams and TFIDF variants, the deep learning models (word-based CNNs with the hyperparameters proposed in this paper) obtain results comparable to the baseline methods [ZZL15, TQL15, Sal15]. On the Big Yelp dataset, we report an accuracy of 94.73% using pre-trained fastText vector embeddings and a CNN with 3 filter region sizes and 128 feature maps.

Further, we conduct our evaluation on the complete Yelp 2017 challenge dataset. The second CNN model proposed in this work yields the best performance on the Yelp 2017 challenge dataset in terms of accuracy. We obtain an accuracy of 95.6% using 3-fold cross validation.

5 Conclusions And Future Work

In the present work, we have described a series of experiments with word-based convolutional neural networks. We introduce two neural network models with different architectural sizes and several word vector representations. We conduct an empirical study on the effect of hyperparameters on the overall performance in the sentiment classification task.

In the experimental results, we find that the size of the dataset has an important effect on the system performance in training and evaluation: a better accuracy score is obtained using the second CNN model on the Big Yelp dataset than on the Small Yelp dataset. Furthermore, when evaluating the second model on the large-scale Yelp 2017 dataset, we achieve an accuracy score of 95.6% using 3-fold cross validation.

The models proposed in this article show good ability for understanding natural language and predicting
users' sentiments. We see that our results are comparable to, and sometimes exceed, those in the literature for the task of classifying business reviews using the Yelp 2017 challenge dataset [ZZL15, TQL15, Sal15].

In future work, we can explore Bayesian optimization frameworks for hyperparameter ranges rather than a grid search approach. Also, we can conduct other experiments using a Recurrent Neural Network (RNN) with the Long Short-Term Memory (LSTM) architecture [Gra12] for sentiment categorization of Yelp user text reviews.

References

[BGJM16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

[GG16] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.

[Gra12] Alex Graves. Supervised sequence labelling. In Supervised Sequence Labelling with Recurrent Neural Networks, pages 5–13. Springer, 2012.

[JZ14] Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058, 2014.

[KFF15] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[Kim14] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[MDP+ 11] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics, 2011.

[MS+ 99] Christopher D Manning, Hinrich Schütze, et al. Foundations of statistical natural language processing, volume 999. MIT Press, 1999.

[PL+ 08] Bo Pang, Lillian Lee, et al. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.

[PLV02] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 79–86. Association for Computational Linguistics, 2002.

[Sal15] Andreea Salinca. Business reviews classification using sentiment analysis. In Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2015 17th International Symposium on, pages 247–250. IEEE, 2015.

[SMDH13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.

[TQL15] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pages 1422–1432, 2015.

[yel17] Yelp Challenge Dataset, 2017.

[ZW15] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.

[ZZL15] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.