CEUR Workshop Proceedings Vol-2006, paper 080: Neural Sentiment Analysis for a Real-World Application (https://ceur-ws.org/Vol-2006/paper080.pdf)
Neural Sentiment Analysis for a Real-World Application

Daniele Bonadiman‡, Giuseppe Castellucci†, Andrea Favalli†, Raniero Romagnoli†, Alessandro Moschitti‡

‡ Department of Computer Science and Information Engineering, University of Trento, Italy
Qatar Computing Research Institute, HBKU, Qatar
† Almawave Srl., Italy

d.bonadiman@unitn.it, amoschitti@gmail.com
{g.castellucci,a.favalli,r.romagnoli}@almawave.it

Abstract

English. In this paper, we describe our neural network models for a commercial application of sentiment analysis. Differently from academic work, which is oriented towards complex networks for achieving marginal improvements, real scenarios require flexible and efficient neural models. The possibility of using the same models on different domains and languages plays an important role in the selection of the most appropriate architecture. We found that a small modification of the network that is state of the art on academic benchmarks led to a flexible neural model that also preserves high accuracy.

Italiano. In this work, we describe our neural network models for a commercial application based on sentiment analysis. Unlike the academic world, where research is oriented towards possibly complex networks that achieve marginal improvements, real usage scenarios require flexible, efficient and simple neural models. The possibility of using the same models for varied domains and languages plays an important role in the choice of the architecture. We found that a small modification of the state-of-the-art network with respect to academic benchmarks produces a flexible neural model that also preserves high accuracy.

1   Introduction

In recent years, Sentiment Analysis (SA) in Twitter has been widely studied. Its popularity has been fed by the remarkable interest of the industrial world in this topic, as well as by the relatively easy access to data, which, among other things, allowed the academic world to promote evaluation campaigns, e.g., (Nakov et al., 2016), for different languages. Many models have been developed and tested on these benchmarks, e.g., (Li et al., 2010; Kiritchenko et al., 2014; Severyn and Moschitti, 2015; Castellucci et al., 2016). They all appear very appealing from an industrial perspective, as SA is strongly connected to many types of business through specific KPIs¹.  However, previous academic work has not provided clear indications on how to select the most appropriate learning architecture for industrial applications.

¹ Key Performance Indicators are strategic factors enabling the performance measurement of a process or activity.

In this paper, we report on our experience in adapting academic models of SA to a commercial application. This is a social media and micro-blogging monitoring platform used to analyze brand reputation, competition, the voice of the customer and customer experience. More in detail, sentiment analysis algorithms register customers' opinions and feedback on services and products, both direct and indirect.

An important aspect is that such clients push for easily adaptable and reliable solutions. Indeed, multi-tenant applications and sentiment analysis requirements cause a high variability of the approaches to the tasks within the same platform. The platform should be capable of managing multi-domain and multi-channel content in different languages, as it provides services for several clients in different market segments. Moreover, scalability and a lightweight use of computational resources that preserves accuracy are also important. Finally, dealing with different client domains and data potentially requires constantly training new models with limited time availability.

To meet the above requirements, we started from an existing state-of-the-art model.
Specifically, we adopted the model proposed by Severyn and Moschitti (2015), a Convolutional Neural Network (CNN) with few layers, mainly devoted to encoding a sentence representation. We modified it by adopting a recurrent pooling layer, which allows the network to learn longer dependencies in the input sentence. An additional benefit is that such a simple architecture makes the network more robust to biases in the dataset, generalizing better on the less represented classes. Our experiments on the SemEval data in English, as well as on a commercial dataset in Italian, show a consistent improvement of our networks over the state of the art.

In the following, Section 2 places the current work in the literature. Section 3 introduces the application scenario. Sections 4 and 5 present, respectively, our proposal for a flexible architecture and the experimental results. Finally, Section 6 reports the conclusions.

2   Related Work

Although sentiment analysis has been around for a decade, a clear and exact comparison of models has only been achieved thanks to the organization of international evaluation campaigns. The main campaign for SA in Twitter in English is SemEval, which has been organized since 2013. A similar campaign for the Italian language (SENTIPOLC) (Barbieri et al., 2016) has been promoted within Evalita since 2014.

Among other approaches, Neural Networks (NNs), and in particular CNNs, outperformed the previous state-of-the-art techniques (Severyn and Moschitti, 2015; Castellucci et al., 2016; Attardi et al., 2016; Deriu et al., 2016). Those systems share some architectural choices: (i) the use of convolutional sentence encoders (Kim, 2014), (ii) the use of pre-trained word2vec embeddings (Mikolov et al., 2013) and (iii) the use of distant supervision to pre-train the network (Go et al., 2009). Although this network is simple and provides state-of-the-art results, by construction it does not model long-term dependencies in the tweet.

3   Application Scenario

Our commercial application is a social media and micro-blogging monitoring platform, which is used to analyze brand reputation, competitors, the voice of the customer and customer experience. It is capable of managing multi-domain and multi-channel content in different languages, and it is provided as a service to several clients in different market segments.

The application uses an SA algorithm to analyze the customers' opinions and feedback on services and products, both direct and indirect. The sentiment metric is used by the application clients to assess customer experience, expectations, and perception. The final aim is to promptly react, identify improvement opportunities and, afterward, measure the impact of the adopted initiatives.

3.1   Focused Problem Description

Industrial applications, used by demanding clients and dealing with real data, tend to prefer easily adaptable and reliable solutions. Major problems are related to multi-tenant applications with several client requirements on the sentiment analysis problem, often requiring variations in the task approaches within the same platform. Moreover, high attention is paid to scalability and a lightweight use of computational resources, while preserving accurate performance. Finally, dealing with different client domains and data potentially requires constantly training new models with limited time availability.

3.2   Data Description

The commercial social media and micro-blogging monitoring platform continuously acquires data from several sources; among these, we selected Twitter data as the main source for our purposes.

First, the public Twitter stream was collected for several months without specific domain restrictions, to build the dataset used for the word embedding training. The total amount of tweets accounts for 100 million Italian tweets and 50 million English tweets.

Then, a dataset was constructed for a specific market sector in Italian. The data collection was performed on the public Twitter stream with specific word restrictions, in order to filter the tweets of interest for the automotive domain. Afterward, the commercial platform applies different techniques in order to exclude from these collections the tweets that are not relevant for the specific insight analysis.

The resulting messages were then used to construct the dataset for our experiments, through a manual annotation phase.
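The keyword-restriction step used to build the domain dataset can be sketched as below; the keyword list, the tweet examples and the tokenization are illustrative assumptions, not the platform's actual implementation.

```python
# Hypothetical sketch of keyword-based filtering of a tweet stream.
# The keyword list and example tweets are illustrative, not the real ones.
AUTOMOTIVE_KEYWORDS = {"auto", "motore", "concessionaria", "tagliando"}

def is_relevant(tweet_text, keywords=AUTOMOTIVE_KEYWORDS):
    """Keep a tweet if it mentions at least one domain keyword."""
    tokens = {t.strip(".,!?#@").lower() for t in tweet_text.split()}
    return bool(tokens & keywords)

stream = [
    "Il motore della mia auto fa un rumore strano",
    "Che bella giornata a Roma!",
    "Domani porto la macchina in concessionaria",
]
relevant = [t for t in stream if is_relevant(t)]
```

In the real platform, further relevance filters are applied on top of this keyword match, as described above.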
The annotation was carried out together with the demanding client, in order to best suit the insight objective requirements. Even though structured guidelines were agreed upon before creating the dataset and were continuously checked against, this approach tended to generate particular dataset characteristics: in particular, an unbalanced distribution of the examples over the different classes has been measured. This makes necessary a flexible model, capable of handling such phenomena without the need for costly tuning phases and/or network re-engineering.

4   Our Neural Network Approach

The task of SA in Twitter aims at classifying a tweet t ∈ T into one of the three sentiment classes c ∈ C, where C = {positive, neutral, negative}. This can be achieved by learning a function f : T → C through a neural network. The architecture proposed here is based on (Severyn and Moschitti, 2015) and is structured in three steps: (i) a tweet is encoded into an embedding matrix, (ii) an encoder maps the tweet matrix into a fixed-size vector and (iii) a single output layer (a logistic regression layer) classifies this vector over the three classes.

In contrast to Severyn and Moschitti (2015), we adopted a recurrent pooling layer that allows the network to learn longer dependencies in the input sentence (i.e., sentiment shifts). This architectural change makes the network less prone to learning biases from the dataset, and therefore helps it generalize better on poorly represented classes.

Embedding: a tweet t is represented as a sequence of words {w1, .., wj, .., wN}. Tweets are encoded into a sentence matrix t ∈ R^{d×|t|}, obtained by concatenating the word vectors wj, where d is the size of the word embeddings.

Sentence Encoder: this is a function that maps the sentence matrix t into a fixed-size vector x representing the whole sentence. Severyn and Moschitti (2015) used a convolutional layer followed by a global max-pooling layer to encode tweets. The convolution applies a sliding window operation (with a window of size m) over the input sentence matrix. More specifically, it applies a non-linear transformation generating an output matrix x̃ ∈ R^{N×d_conv}, where d_conv is the number of convolutional filters and N is the length of the sentence. The max-pooling operation applies an element-wise max operation to the transformed sentence matrix x̃, resulting in a fixed-size vector representing the whole sentence.

In this work, we propose to substitute the max-pooling operation with a Bidirectional Gated Recurrent Unit (BiGRU) (Chung et al., 2014; Schuster and Paliwal, 1997). The GRU is a gated recurrent neural network capturing long-term dependencies over the input. A GRU processes the input in one direction (e.g., from left to right), updating a hidden state that keeps the memory of what the network has processed so far. In this way, a whole sentence can be represented by taking the hidden state at the last step. In order to capture dependencies in both directions, i.e., to obtain a stronger representation of the sentence, we apply a BiGRU, which performs the GRU operation in both directions: BiGRU(x̃) = [GRU_fw(x̃); GRU_bw(x̃)], where GRU_fw and GRU_bw process x̃ left-to-right and right-to-left, respectively.

Classification: the final module of the network is the output layer (a logistic regression) that performs a linear transformation over the sentence vector, mapping it into a d_class-dimensional vector followed by a softmax activation, where d_class is the number of classes.

5   Experiments

5.1   Setup

Similarly to Severyn and Moschitti (2015), for the CNN we use a convolutional operation of size 5 and d_conv = 128 with a rectified linear unit (ReLU) activation. For the BiGRU, we use 150 hidden units for both GRU_fw and GRU_bw, obtaining a fixed-size vector of size 300.

Word embeddings: for all the proposed models, we pre-initialize the word embedding matrices with standard skip-gram embeddings of dimensionality 50, trained on tweets retrieved from the Twitter stream.

Training: the network is trained using SGD with shuffled mini-batches, using the Adam update rule (Kingma and Ba, 2014) and an early stopping (Prechelt, 1998) strategy with patience p = 10. Early stopping avoids overfitting and improves the generalization capabilities of the network. In addition, we add dropout (Srivastava et al., 2014) with a rate of 0.2 to improve generalization and avoid co-adaptation of features.

Datasets: we trained and evaluated our architecture on two datasets: the English dataset of SemEval 2015 (Rosenthal et al., 2015) and an in-house Italian dataset.
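To make the architecture of Section 4 concrete with the hyper-parameters above (window 5, d_conv = 128, 150 GRU units per direction), the following pure-Python sketch runs the forward pass on a toy input. It is an illustration only: the weights are random, the same-length padding of the convolution is an assumption, and in practice a deep learning framework would be used.

```python
import math
import random

random.seed(0)

def randmat(r, c):
    return [[random.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

d, d_conv, d_gru, n_classes = 50, 128, 150, 3   # sizes from Section 5.1

# toy "tweet": 7 random word embeddings (stand-ins for skip-gram vectors)
sentence = [[random.gauss(0, 1) for _ in range(d)] for _ in range(7)]

# --- convolutional encoder: window of size 5 with ReLU, padded to length N
W_conv = randmat(d_conv, 5 * d)
padded = [[0.0] * d] * 2 + sentence + [[0.0] * d] * 2
conv_out = []
for i in range(len(sentence)):
    window = [x for w in padded[i:i + 5] for x in w]   # concat 5 word vectors
    conv_out.append([max(0.0, v) for v in matvec(W_conv, window)])  # ReLU

# --- GRU over the convolved sequence (one direction, random weights)
def gru(seq, d_in, d_h):
    Wz, Wr, Wh = (randmat(d_h, d_in + d_h) for _ in range(3))
    h = [0.0] * d_h
    for x in seq:
        xc = x + h
        z = [sigmoid(v) for v in matvec(Wz, xc)]       # update gate
        r = [sigmoid(v) for v in matvec(Wr, xc)]       # reset gate
        xh = x + [ri * hi for ri, hi in zip(r, h)]
        h_new = [math.tanh(v) for v in matvec(Wh, xh)] # candidate state
        h = [(1 - zi) * hi + zi * hni for zi, hi, hni in zip(z, h, h_new)]
    return h   # final hidden state summarizes the whole sequence

# --- BiGRU pooling: run both directions, concatenate the final states
h_fwd = gru(conv_out, d_conv, d_gru)
h_bwd = gru(conv_out[::-1], d_conv, d_gru)
sent_vec = h_fwd + h_bwd                               # fixed-size vector (300)

# --- output layer: linear map to the 3 classes + softmax
logits = matvec(randmat(n_classes, 2 * d_gru), sent_vec)
m = max(logits)
exps = [math.exp(v - m) for v in logits]
probs = [e / sum(exps) for e in exps]
```

The BiGRU replaces the single max-pooling step: instead of keeping the per-filter maximum, the final hidden states of the two GRU passes summarize the whole convolved sequence.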
Table 1: Splits of the SemEval dataset

              pos.     neu.    neg.     total
  train       5,895    471     3,131    9,497
  valid       648      57      430      1,135
  test 2013   2,734    160     1,541    4,435
  test 2015   1,899    190     1,008    3,097

Table 2: Splits of the Italian dataset

              pos      neu     neg      total
  train       4,234    6,434   2,170    12,838
  valid       386      580     461      1,427
  test        185      232     83       500

Table 1 describes the English data in terms of the size of the splits and of the positive, negative and neutral instances. We used the validation set for parameter tuning and to apply early stopping, whereas the systems are evaluated on the two test sets of 2013 and 2015, respectively.

The Italian dataset was built in-house for the automotive domain: we collected tweets from the Twitter stream as explained in Section 3.2 and divided them into three different splits for training, validation and testing, respectively. Table 2 shows the size of the splits. Due to the nature of the domain, many tweets in the dataset are neutral or objective; this makes the label distribution much different from the usual benchmarks. For example, the neutral class is the least represented in the English dataset (see Table 1) and the most represented in the Italian data. This imbalance can potentially bias neural networks towards the most represented class. One of the features of our approach is to diminish such an effect.

Evaluation metrics: we used macro-F1, the average of the F1 over the three sentiment categories. Additionally, we report F1p,n, the average F1 of the positive and negative classes. The latter metric is the official evaluation score of the SemEval competition.

5.2   Results on English Data

Table 3 presents the results on the English dataset of SemEval 2015. The first row shows the outcome reported by Severyn and Moschitti (2015) (S&M). CNN+Max is a reimplementation of the above system with convolution and max-pooling, but trained just on the official training data, without distant supervision. This system is used as a strong baseline in all our experiments. Lastly, we report the results obtained with the BiGRU pooling strategy described in Section 4.

Table 3: English results on the SemEval dataset

              2013 test        2015 test
              F1      F1p,n    F1      F1p,n
  S&M (2015)  —       72.79    —       64.59
  CNN+Max     72.04   67.71    67.14   62.63
  CNN+BiGRU   71.67   68.10    68.03   63.82

The proposed architecture presents a slight improvement over the strong baseline (∼1 point of both F1 and F1p,n on the test sets).

5.3   Results on Italian Data

Table 4 presents the results on the Italian dataset.

Table 4: Italian results on the automotive dataset

              Valid            Test
              F1      F1p,n    F1      F1p,n
  CNN+Max     65.34   62.35    69.35   62.88
  CNN+BiGRU   64.85   67.71    68.32   67.55

Although on this dataset the proposed CNN+BiGRU model obtains slightly lower F1 scores, it shows improved performance in terms of F1p,n (about 5 points on both the validation and test sets). This suggests that the proposed model tends to generalize better on the less represented classes, which, in the case of the Italian training dataset, are the positive and negative classes (as pointed out in Table 2).

5.4   Discussion of the Results

We analyzed the classification scores of some words to show that our approach is less affected by the skewed distribution of the dataset. The sentiment trends, as captured by the neural networks in terms of scores, are shown in Table 5. For example, the word Mexico classified by CNN+Max produces the scores 0.06, 0.35, 0.57, while CNN+BiGRU outputs 0.18, 0.52, 0.30, for the negative, neutral and positive classes, respectively. This shows that CNN+BiGRU is less biased by the distribution of the sampled word in the dataset, which is 0, 1, 5, i.e., Mexico appears 5 times more often in positive than in neutral messages, and never in negative messages.

This skewed distribution biases CNN+Max more, as the positive class gets 0.57 while the negative one gets only 0.06. CNN+BiGRU is able, instead, to recover the correct neutral class. We believe that CNN+Max is more influenced by the distribution bias.
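The two scores used in this evaluation, macro-F1 over the three classes and F1p,n over positive and negative only, can be computed as in the following sketch; the gold and predicted label sequences are illustrative toy data.

```python
def f1_per_class(gold, pred, classes):
    """Per-class F1 from gold and predicted label sequences."""
    scores = {}
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

CLASSES = ("positive", "neutral", "negative")

# illustrative toy labels, not the paper's data
gold = ["positive", "neutral", "negative", "positive", "neutral"]
pred = ["positive", "neutral", "positive", "positive", "negative"]

f1 = f1_per_class(gold, pred, CLASSES)
macro_f1 = sum(f1.values()) / len(CLASSES)       # average over all 3 classes
f1_pn = (f1["positive"] + f1["negative"]) / 2    # official SemEval F1p,n
```

Note that F1p,n simply drops the neutral class from the average, which is why it is more sensitive to performance on the minority polarity classes.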
Table 5: Word classification scores obtained with the two neural architectures on English. The scores refer to the negative, neutral and positive classes, respectively.

            CNN+Max             CNN+BiGRU
  Mexico    (.06, .35, .57)     (.18, .51, .30)
  Italy     (.06, .54, .38)     (.18, .54, .26)
  nice      (.007, .009, .98)   (.05, .07, .87)

Indeed, the max-pooling operation seems to capture very local phenomena. In contrast, the BiGRU exploits the entire word sequence and can thus better capture a larger informative context.

A similar analysis in Italian shows the same trends. For example, the word panda is classified as 0.05, 0.28, 0.66 by CNN+Max and as 0.07, 0.56, 0.35 by CNN+BiGRU, for the negative, neutral and positive classes, respectively. Again, the distribution of this word in the Italian training set is very skewed towards the positive class: this confirms that CNN+Max is more influenced by the distribution bias, while our architecture can better deal with it.

6   Conclusions

In this paper, we have studied state-of-the-art neural networks for the sentiment analysis of Twitter text in a real application scenario. We modified the network architecture by applying a recurrent pooling layer, enabling the learning of longer dependencies between words in tweets. The recurrent pooling layer makes the network more robust to unbalanced data distributions. We have tested our models on the academic benchmark and, most importantly, on our data derived from a real-world commercial application. The results show that our approach works well for both English and Italian. Finally, we observed that our network suffers less from the dataset distribution bias.

References

Giuseppe Attardi, Daniele Sartiano, Chiara Alzetta, and Federica Semplici. 2016. Convolutional neural networks for sentiment analysis on italian tweets. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016. CEUR-WS.org.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the evalita 2016 sentiment polarity classification task. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2016. Context-aware convolutional neural networks for twitter sentiment analysis in italian. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016. CEUR-WS.org.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Jan Deriu, Maurice Gonzenbach, Fatih Uzdilli, Aurelien Lucchi, Valeria De Luca, and Martin Jaggi. 2016. Swisscheese at semeval-2016 task 4: Sentiment classification using an ensemble of convolutional neural networks with distant supervision. In SemEval@NAACL-HLT, pages 1124-1128.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746-1751. ACL.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50(1):723-762, May.

Shoushan Li, Sophia Yat Mei Lee, Ying Chen, Chu-Ren Huang, and Guodong Zhou. 2010. Sentiment classification and polarity shifting. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 635-643. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111-3119.
Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio
  Sebastiani, and Veselin Stoyanov. 2016. Semeval-
  2016 task 4: Sentiment analysis in twitter. In Se-
  mEval@ NAACL-HLT, pages 1–18.
Lutz Prechelt. 1998. Early stopping-but when? In
  Neural Networks: Tricks of the Trade, This Book is
  an Outgrowth of a 1996 NIPS Workshop, pages 55–
  69, London, UK, UK. Springer-Verlag.
Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko,
  Saif Mohammad, Alan Ritter, and Veselin Stoyanov.
  2015. Semeval-2015 task 10: Sentiment analysis
  in twitter. In Proceedings of the 9th International
  Workshop on Semantic Evaluation (SemEval 2015),
  pages 451–463, Denver, Colorado, June. Associa-
  tion for Computational Linguistics.
Mike Schuster and Kuldip K Paliwal. 1997. Bidirec-
  tional recurrent neural networks. IEEE Transactions
  on Signal Processing, 45(11):2673–2681.
Aliaksei Severyn and Alessandro Moschitti. 2015.
  Twitter sentiment analysis with deep convolutional
  neural networks. In Proceedings of the 38th Inter-
  national ACM SIGIR Conference on Research and
  Development in Information Retrieval, pages 959–
  962. ACM.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky,
  Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
  Dropout: a simple way to prevent neural networks
  from overfitting. Journal of Machine Learning Re-
  search, 15(1):1929–1958.