=Paper=
{{Paper
|id=Vol-2006/paper080
|storemode=property
|title=Neural Sentiment Analysis for a Real-World Application
|pdfUrl=https://ceur-ws.org/Vol-2006/paper080.pdf
|volume=Vol-2006
|authors=Daniele Bonadiman,Giuseppe Castellucci,Andrea Favalli,Raniero Romagnoli,Alessandro Moschitti
|dblpUrl=https://dblp.org/rec/conf/clic-it/BonadimanCFRM17
}}
==Neural Sentiment Analysis for a Real-World Application==
Daniele Bonadiman‡, Giuseppe Castellucci†, Andrea Favalli†, Raniero Romagnoli†, Alessandro Moschitti‡

‡ Department of Computer Science and Information Engineering, University of Trento, Italy
Qatar Computing Research Institute, HBKU, Qatar
† Almawave Srl., Italy

d.bonadiman@unitn.it, amoschitti@gmail.com
{g.castellucci,a.favalli,r.romagnoli}@almawave.it
Abstract

English. In this paper, we describe our neural network models for a commercial sentiment analysis application. Different from academic work, which is oriented towards complex networks for achieving a marginal improvement, real scenarios require flexible and efficient neural models. The possibility to use the same models on different domains and languages plays an important role in the selection of the most appropriate architecture. We found that a small modification of the state-of-the-art network according to academic benchmarks led to a flexible neural model that also preserves high accuracy.

Italian (translated). In this work, we describe our neural network models for a commercial sentiment analysis application. Unlike the academic world, where research is oriented towards possibly complex networks that achieve a marginal improvement, real usage scenarios require flexible, efficient and simple neural models. The possibility of using the same models for varied domains and languages plays an important role in the choice of the architecture. We found that a small modification of the state-of-the-art network with respect to academic benchmarks produces a flexible neural model that also preserves high accuracy.

1 Introduction

In recent years, Sentiment Analysis (SA) in Twitter has been widely studied. Its popularity has been fed by the remarkable interest of the industrial world in this topic, as well as by the relatively easy access to data, which, among other things, allowed the academic world to promote evaluation campaigns, e.g., (Nakov et al., 2016), for different languages. Many models have been developed and tested on these benchmarks, e.g., (Li et al., 2010; Kiritchenko et al., 2014; Severyn and Moschitti, 2015; Castellucci et al., 2016). They all appear very appealing from an industrial perspective, as SA is strongly connected to many types of business through specific KPIs¹. However, previous academic work has not provided clear indications on how to select the most appropriate learning architecture for industrial applications.

In this paper, we report on our experience in adapting academic SA models to a commercial application. This is a social media and micro-blogging monitoring platform used to analyze brand reputation, competition, the voice of the customer and customer experience. More in detail, sentiment analysis algorithms register customers' opinions and feedback on services and products, both direct and indirect.

An important aspect is that such clients push for easily adaptable and reliable solutions. Indeed, multi-tenant applications and sentiment analysis requirements cause a high variability of the approaches to the tasks within the same platform. The platform should be capable of managing multi-domain and multi-channel content in different languages, as it provides services for several clients in different market segments. Moreover, scalability and lightweight use of computational resources while preserving accuracy are also important. Finally, dealing with different client domains and data potentially requires constantly training new models with limited time availability.

To meet the above requirements, we started from the state-of-the-art model proposed in (Severyn and Moschitti, 2015), which is a Convolutional Neural Network (CNN) with few layers, mainly devoted to encoding a sentence representation. We modified it by adopting a recurrent pooling layer, which allows the network to learn longer dependencies in the input sentence. An additional benefit is that such a simple architecture makes the network more robust to biases from the dataset, generalizing better on the less represented classes. Our experiments on the SemEval data in English as well as on a commercial dataset in Italian show a constant improvement of our networks over the state of the art.

In the following, Section 2 places the current work in the literature. Section 3 introduces the application scenario. Sections 4 and 5 present, respectively, our proposal for a flexible architecture and the experimental results. Finally, Section 6 reports the conclusions.

¹ Key Performance Indicators are strategic factors enabling the performance measurement of a process or activity.
2 Related Work

Although sentiment analysis has been around for one decade, a clear and exact comparison of models has been achieved thanks to the organization of international evaluation campaigns. The main campaign for SA in Twitter in English is SemEval, which has been organized since 2013. A similar campaign in the Italian language (SENTIPOLC) (Barbieri et al., 2016) has been promoted within Evalita since 2014.

Among other approaches, Neural Networks (NNs), and in particular CNNs, outperformed the previous state-of-the-art techniques (Severyn and Moschitti, 2015; Castellucci et al., 2016; Attardi et al., 2016; Deriu et al., 2016). Those systems share some architectural choices: (i) the use of convolutional sentence encoders (Kim, 2014), (ii) the use of pre-trained word2vec embeddings (Mikolov et al., 2013) and (iii) the use of distant supervision to pre-train the network (Go et al., 2009). Although this network is simple and provides state-of-the-art results, by construction it does not model long-term dependencies in the tweet.

3 Application Scenario

Our commercial application is a social media and micro-blogging monitoring platform, which is used to analyze brand reputation, competitors, the voice of the customer and customer experience. It is capable of managing multi-domain and multi-channel content in different languages and it is provided as a service for several clients in different market segments.

The application uses an SA algorithm to analyze the customers' opinions and feedback on services and products, both direct and indirect. The sentiment metric is used by the application clients to point out customer experience, expectations, and perception. The final aim is to promptly react and identify improvement opportunities and, afterward, measure the impact of the adopted initiatives.

3.1 Focused Problem Description

Industrial applications, used by demanding clients and dealing with real data, tend to prefer easily adaptable and reliable solutions. Major problems are related to multi-tenant applications with several client requirements on the sentiment analysis problem, often requiring variations on task approaches within the same platform. Moreover, high attention is put on scalability and lightweight use of computational resources, while preserving accurate performance. Finally, dealing with different client domains and data potentially requires constantly training new models with limited time availability.

3.2 Data Description

The commercial social media and micro-blogging monitoring platform continuously acquires data coming from several sources; among these, we selected Twitter data as the main source for our purposes.

First, the public Twitter stream was collected for several months without specific domain restriction to build the dataset used for the word embedding training. The total amount of tweets used accounts for 100 million Italian tweets and 50 million English tweets.

Then, a dataset has been constructed from a specific market sector in Italian. The data collection was performed on the public Twitter stream with a specific word restriction in order to filter the tweets of interest in the automotive domain. Afterward, the commercial platform applies different techniques in order to exclude from these collections the tweets that are not relevant for the specific insight analysis.
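The word restriction described above can be pictured as a simple keyword filter over the stream; the sketch below is purely illustrative (the term list is hypothetical, and the platform's actual filtering and relevance techniques are not described here).

```python
# Hypothetical domain term list, for illustration only.
AUTOMOTIVE_TERMS = {"auto", "motore", "suv", "concessionaria"}

def keep_tweet(text, terms=AUTOMOTIVE_TERMS):
    """Keep a tweet only if it mentions at least one domain term."""
    tokens = {tok.strip("#@.,!?").lower() for tok in text.split()}
    return bool(tokens & terms)
```

In the real pipeline, such a step would only pre-select candidate tweets; platform-specific techniques then discard those that are not relevant for the insight analysis.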
The messages were then used to construct the dataset for our experiments. A manual annotation phase has been performed together with the demanding client in order to best suit the insight objective requirements. Even though structured guidelines were agreed upon before creating the dataset and were continuously checked against, this approach tended to generate specific dataset characteristics: in particular, an unbalanced distribution of the examples over the different classes has been measured. This makes necessary a flexible model capable of handling such phenomena without the need for costly tuning phases and/or network re-engineering.

4 Our Neural Network Approach

The task of SA in Twitter aims at classifying a tweet t ∈ T into one of the three sentiment classes c ∈ C, where C = {positive, neutral, negative}. This can be achieved by learning a function f : T → C through a neural network. The architecture proposed here is based on (Severyn and Moschitti, 2015) and is structured in three steps: (i) a tweet is encoded into an embedding matrix, (ii) an encoder maps the tweet matrix into a fixed-size vector and (iii) a single output layer (a logistic regression layer) classifies this vector over the three classes.

In contrast to Severyn and Moschitti (2015), we adopted a recurrent pooling layer that allows the network to learn longer dependencies in the input sentence (i.e., sentiment shifts). This architectural change makes the network less prone to learning biases from the dataset, and it therefore generalizes better on poorly represented classes.
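The three-step structure above can be sketched as follows; the function names and the toy encoder and classifier used in the example are illustrative stand-ins, not the actual implementation.

```python
def predict(tokens, embeddings, encoder, classifier, unk="<unk>"):
    """(i) embed the tweet into a matrix, (ii) encode the matrix into a
    fixed-size vector, (iii) classify that vector into a sentiment class."""
    matrix = [embeddings.get(w, embeddings[unk]) for w in tokens]
    sentence_vector = encoder(matrix)
    return classifier(sentence_vector)
```

Any sentence encoder can be plugged into step (ii), which is exactly the component this section modifies.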
Embedding: a tweet t is represented as a sequence of words {w1, .., wj, .., wN}. Tweets are encoded into a sentence matrix t ∈ R^(d×|t|), obtained by concatenating its word vectors wj, where d is the size of the word embeddings.

Sentence Encoder: it is a function that maps the sentence matrix t into a fixed-size vector x representing the whole sentence. Severyn and Moschitti (2015) used a convolutional layer followed by a global max-pooling layer to encode tweets. The convolution operation applies a sliding window (of size m) over the input sentence matrix. More specifically, it applies a non-linear transformation generating an output matrix x̃ ∈ R^(N×d_conv), where d_conv is the number of convolutional filters and N is the length of the sentence. The max-pooling operation applies an element-wise max operation to the transformed sentence matrix x̃, resulting in a fixed-size vector representing the whole sentence.
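A minimal sketch of this encoder in plain Python (tanh stands in for the actual non-linearity, the filters are given externally, and zero-padding is assumed so the output keeps N positions):

```python
import math

def conv_encoder(sent_matrix, filters, m=5):
    """Slide a window of m word vectors over the sentence; each filter
    maps the flattened window to one value via a non-linearity (tanh here),
    producing an N x d_conv matrix (the x~ of the text)."""
    d = len(sent_matrix[0])                 # embedding size
    N = len(sent_matrix)                    # sentence length
    pad = [[0.0] * d] * (m // 2)            # zero-padding keeps N positions
    padded = pad + sent_matrix + pad
    out = []
    for i in range(N):
        window = [v for row in padded[i:i + m] for v in row]  # m*d values
        out.append([math.tanh(sum(w * v for w, v in zip(f, window)))
                    for f in filters])
    return out

def global_max_pool(x_tilde):
    """Element-wise max over positions: one value per filter."""
    return [max(col) for col in zip(*x_tilde)]
```

The pooled vector has one entry per filter, regardless of the sentence length, which is what makes it usable as a fixed-size sentence representation.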
In this work, we propose to substitute the max-pooling operation with a Bidirectional Gated Recurrent Unit (BiGRU) (Chung et al., 2014; Schuster and Paliwal, 1997). The GRU is a gated recurrent neural network capturing long-term dependencies over the input. A GRU processes the input in one direction (e.g., from left to right), updating a hidden state that keeps the memory of what the network has processed so far. In this way, a whole sentence can be represented by taking the hidden state at the last step. In order to capture dependencies in both directions, i.e., a stronger representation of the sentence, we apply a BiGRU, which performs a GRU operation in both directions: BiGRU(x̃) = [GRU_fwd(x̃); GRU_bwd(x̃)], where GRU_fwd and GRU_bwd process the input left-to-right and right-to-left, respectively.
Classification: the final module of the network is the output layer (a logistic regression) that performs a linear transformation over the sentence vector, mapping it into a d_class-dimensional vector followed by a softmax activation, where d_class is the number of classes.
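This output layer amounts to a matrix-vector product followed by a softmax; a minimal sketch (the weights in the example are illustrative):

```python
import math

def classify(sentence_vector, W, b, classes=("negative", "neutral", "positive")):
    """Linear transformation to d_class scores, then softmax to probabilities."""
    scores = [sum(wi * vi for wi, vi in zip(row, sentence_vector)) + bias
              for row, bias in zip(W, b)]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return dict(zip(classes, (e / total for e in exps)))
```

The predicted class is simply the one with the highest probability.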
5 Experiments

5.1 Setup

Similarly to Severyn and Moschitti (2015), for the CNN we use a convolutional operation of size 5 and d_conv = 128 with a rectified linear unit (ReLU) activation. For the BiGRU, we use 150 hidden units for both GRU_fwd and GRU_bwd, obtaining a fixed-size vector of size 300.

Word embeddings: for all the proposed models, we pre-initialize the word embedding matrices with standard skip-gram embeddings of dimensionality 50, trained on tweets retrieved from the Twitter stream.

Training: the network is trained using SGD with shuffled mini-batches, the Adam update rule (Kingma and Ba, 2014) and an early stopping (Prechelt, 1998) strategy with patience p = 10. Early stopping allows avoiding overfitting and improves the generalization capabilities of the network. We also opted for adding dropout (Srivastava et al., 2014) with a rate of 0.2 to improve generalization and avoid co-adaptation of features.
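The early stopping strategy with patience p can be sketched as follows; `train_step` and `validate` are stand-ins for the actual mini-batch training and validation routines.

```python
def train_with_early_stopping(train_step, validate, patience=10, max_epochs=200):
    """Stop when the validation score has not improved for `patience`
    consecutive epochs, remembering the best epoch seen so far."""
    best_score, best_epoch, waited = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        train_step(epoch)
        score = validate(epoch)
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_score
```

In practice, the model parameters from the best epoch (not the last one) are the ones kept for evaluation.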
Datasets: we trained and evaluated our architecture on two datasets: the English dataset of SemEval 2015 (Rosenthal et al., 2015), described in Table 1 in terms of the size of the data splits and of positive, negative and neutral instances. We used the validation set for parameter tuning and to apply early stopping, whereas the systems are evaluated on the two test sets of 2013 and 2015, respectively.

Table 1: Splits of the SemEval dataset
            pos.   neu.  neg.   total
train       5,895  471   3,131  9,497
valid       648    57    430    1,135
test 2013   2,734  160   1,541  4,435
test 2015   1,899  190   1,008  3,097

The Italian dataset was built in-house for the automotive domain: we collected tweets from the Twitter stream as explained in Section 3.2 and divided them into three different splits for training, validation and testing, respectively. Table 2 shows the size of the splits. Due to the nature of the domain, many tweets in the dataset are neutral or objective; this makes the label distribution much different from the usual benchmarks. For example, the neutral class is the least represented in the English dataset (see Table 1) and the most represented in the Italian data. The imbalance can potentially bias neural networks towards the most represented class. One of the features of our approach is to diminish such an effect.

Table 2: Splits of the Italian dataset
       pos    neu    neg    total
train  4,234  6,434  2,170  12,838
valid  386    580    461    1,427
test   185    232    83     500

Evaluation metrics: we used Macro-F1 (the average of the F1 over the three sentiment categories). Additionally, we report F1_p,n, which is the average F1 of the positive and negative classes. This metric is the official evaluation score of the SemEval competition.
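Both metrics derive from per-class precision and recall; a straightforward sketch:

```python
def f1_scores(gold, pred, classes=("positive", "neutral", "negative")):
    """Per-class F1 from exact counts, then the two paper metrics:
    Macro-F1 (mean over all three classes) and F1_p,n (mean over the
    positive and negative classes only, the official SemEval score)."""
    f1 = {}
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(f1.values()) / len(classes)
    f1_pn = (f1["positive"] + f1["negative"]) / 2
    return macro, f1_pn
```

Because F1_p,n ignores the neutral class, it rewards systems that do well on the polar classes even when those are under-represented in training.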
5.2 Results on English Data

Table 3 presents the results on the English dataset of SemEval 2015. The first row shows the outcome reported by Severyn and Moschitti (2015) (S&M). CNN+Max is a reimplementation of the above system with convolution and max-pooling, but trained just on the official training data without distant supervision. This system is used as a strong baseline in all our experiments. Lastly, we report the results obtained with the BiGRU pooling strategy described in Section 4. The proposed architecture presents a slight improvement over the strong baseline (∼1 point of both F1 and F1_p,n on the test sets).

Table 3: English results on the SemEval dataset
             2013 test        2015 test
             F1     F1_p,n    F1     F1_p,n
S&M (2015)   —      72.79     —      64.59
CNN+Max      72.04  67.71     67.14  62.63
CNN+BiGRU    71.67  68.10     68.03  63.82
5.3 Results on Italian Data

Table 4 presents the results on the Italian dataset. Although on this dataset the proposed CNN+BiGRU model obtains lower F1 scores, it shows improved performance in terms of F1_p,n (5 points on both validation and test sets). This suggests that the proposed model tends to generalize better on the less represented classes, which, in the case of the Italian training dataset, are the positive and negative classes (as pointed out in Table 2).

Table 4: Italian results on the automotive dataset
             Valid            Test
             F1     F1_p,n    F1     F1_p,n
CNN+Max      65.34  62.35     69.35  62.88
CNN+BiGRU    64.85  67.71     68.32  67.55

5.4 Discussion of the Results

We analyzed the classification scores of some words to show that our approach is less affected by the skewed distribution of the dataset. The sentiment trends, as captured by the neural network in terms of scores, are shown in Table 5. For example, the word Mexico classified by CNN+Max produces the scores 0.06, 0.35, 0.57, while CNN+BiGRU outputs 0.18, 0.52, 0.30, for the negative, neutral and positive classes, respectively. This shows that CNN+BiGRU is less biased by the data distribution of the sampled word in the dataset, which is 0, 1, 5, i.e., Mexico appears 5 times more in positive than in neutral messages and never in negative messages.

This skewed distribution biased CNN+Max more, as the positive class gets 0.57 while the negative one gets only 0.06. CNN+BiGRU is able, instead, to recover the correct neutral class. We believe that CNN+Max is more influenced by the distribution bias, as the max-pooling operation seems to capture very local phenomena. In contrast, the BiGRU exploits the entire word sequence and can thus better capture larger informative context.

Table 5: Word classification scores obtained with the two neural architectures on English. The scores refer to the negative, neutral and positive classes, respectively.
         CNN+Max             CNN+BiGRU
Mexico   (.06, .35, .57)     (.18, .51, .30)
Italy    (.06, .54, .38)     (.18, .54, .26)
nice     (.007, .009, .98)   (.05, .07, .87)

A similar analysis in Italian shows the same trends. For example, the word panda is classified as 0.05, 0.28, 0.66 by CNN+Max and as 0.07, 0.56, 0.35 by CNN+BiGRU, for the negative, neutral and positive classes, respectively. Again, the distribution of this word in the Italian training set is very skewed towards the positive class: this confirms that CNN+Max is more influenced by the distribution bias, while our architecture can better deal with it.

6 Conclusions

In this paper, we have studied state-of-the-art neural networks for the sentiment analysis of Twitter text associated with a real application scenario. We modified the network architecture by applying a recurrent pooling layer enabling the learning of longer dependencies between words in tweets. The recurrent pooling layer makes the network more robust to unbalanced data distributions. We have tested our models on the academic benchmark and, most importantly, on our data derived from a real-world commercial application. The results show that our approach works well for both the English and Italian languages. Finally, we observed that our network suffers less from the dataset distribution bias.

References

Giuseppe Attardi, Daniele Sartiano, Chiara Alzetta, and Federica Semplici. 2016. Convolutional neural networks for sentiment analysis on Italian tweets. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016. CEUR-WS.org.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 sentiment polarity classification task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2016. Context-aware convolutional neural networks for Twitter sentiment analysis in Italian. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016. CEUR-WS.org.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Jan Deriu, Maurice Gonzenbach, Fatih Uzdilli, Aurelien Lucchi, Valeria De Luca, and Martin Jaggi. 2016. SwissCheese at SemEval-2016 task 4: Sentiment classification using an ensemble of convolutional neural networks with distant supervision. In SemEval@NAACL-HLT, pages 1124–1128.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, pages 1746–1751. ACL.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50(1):723–762, May.

Shoushan Li, Sophia Yat Mei Lee, Ying Chen, Chu-Ren Huang, and Guodong Zhou. 2010. Sentiment classification and polarity shifting. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 635–643. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In SemEval@NAACL-HLT, pages 1–18.

Lutz Prechelt. 1998. Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 55–69, London, UK. Springer-Verlag.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 task 10: Sentiment analysis in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 451–463, Denver, Colorado, June. Association for Computational Linguistics.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Aliaksei Severyn and Alessandro Moschitti. 2015. Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 959–962. ACM.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.