Word Embeddings in Sentiment Analysis

Ruggero Petrolito (Università di Pisa)
ruggero.petrolito@gmail.com

Felice Dell'Orletta (Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC–CNR), ItaliaNLP Lab - www.italianlp.it)
felice.dellorletta@ilc.cnr.it

Abstract

English. In recent years sentiment analysis and its applications have gained growing popularity. In this field of research, machine learning and the word representation learning methods derived from distributional semantics (i.e. word embeddings) have lately proven very successful in performing sentiment analysis tasks. In this paper we describe a set of experiments aimed at evaluating the impact of word embedding-based features in sentiment analysis tasks.

Italiano. Recently, sentiment analysis and its applications have gained ever-increasing popularity. In this research area, machine learning and the word representation methods derived from distributional semantics (specifically, word embeddings) have in recent years proven very effective in carrying out the various tasks connected with sentiment analysis. In this paper we describe a series of experiments conducted with the goal of evaluating the impact of using word embedding-based features in the various sentiment analysis tasks.

1 Introduction

In recent years sentiment analysis has gained great popularity among NLP tasks. As reported by Mäntylä et al. (2016), the number of papers on this subject has increased significantly in the first two decades of the 21st century, as has the range of its applications. A wide variety of technologies has been used to address sentiment analysis tasks during this period. Lately, machine learning techniques have proved to be very effective; in particular, in recent years systems based on deep learning techniques represent the state of the art. In this field, word embeddings have been widely used as a way of representing words in sentiment analysis tasks, and have proved to be very effective.

A relevant mirror of the state of the art in sentiment analysis can be found in the SemEval workshops. In the 2015 edition (Rosenthal et al., 2015), most participants used machine learning techniques; in many of the subtasks, the top ranking systems used deep learning methods and word embeddings, like the system submitted by Severyn and Moschitti (2015), which ranked 1st in subtask A and 2nd in subtask B. In the 2016 edition (Nakov et al., 2016), deep learning techniques, such as convolutional neural networks and recurrent neural networks, were the most popular approach. In the 2017 edition (Rosenthal et al., 2017), machine learning methods were very popular, especially support vector machines and deep neural networks such as convolutional neural networks and long short-term memory networks.

Concerning the Italian language, the EVALITA campaign well represents the state of the art in natural language processing. In the 2016 edition (Barbieri et al., 2016), the top ranking systems used machine learning and deep learning techniques (Castellucci et al. (2016), Attardi et al. (2016), Di Rosa and Durante (2016)).

The purpose of this study is to explore ways of using word embeddings to build meaningful representations of documents in sentiment analysis tasks performed on Italian tweets.

2 Our Contribution

In this paper we aim to evaluate the effect of exploiting word embeddings in sentiment analysis tasks. In particular, we explore the effect of five factors on the performance of a sentiment analysis classification system, to answer five research questions:

1. What is the effect of the size of the corpus used to train the embeddings?

2. Which text domain allows us to train better embeddings (in-domain vs out-of-domain data)?

3. Which type of learning method produces better embeddings (word-based vs character-based word embeddings)?

4. Which method of combining the word vectors produces a better document vector representation?

5. What are the most important words (in terms of part of speech) for producing a better document vector representation?

To answer these questions, we performed several classification experiments, testing our system on the three sentiment analysis tasks proposed in the 2016 EVALITA SENTIPOLC campaign (Barbieri et al., 2016): Subjectivity Classification, Polarity Classification and Irony Detection. In the first of these tasks, the highest accuracy was achieved by the system of Castellucci et al. (2016). Concerning the second task, the most accurate system was the one submitted by Attardi et al. (2016). Regarding the third task, the highest accuracy was reached by the system of Di Rosa and Durante (2016). Among these systems, Castellucci et al. (2016) and Attardi et al. (2016) use deep learning techniques (convolutional neural networks), while Di Rosa and Durante (2016) use an ensemble of many supervised learning classifiers.

3 Datasets

We tested our system on the three sentiment analysis tasks proposed in the 2016 EVALITA SENTIPOLC campaign. These tasks and the related datasets have been described by Barbieri et al. (2016). We conducted our experiments on the training set provided by the organizers of the evaluation campaign, which is composed of 7921 tweets.

We train our word embeddings on two corpora: one in-domain and one out-of-domain. The in-domain dataset is a collection of tweets that we collected for this work, named Tweets. It is composed of almost 80 million tweets, amounting to around 1.2 billion tokens. The out-of-domain dataset is the Paisà corpus, a collection of Italian web texts described by Lyding et al. (2013).

4 Experimental Setup

For our experiments, we used an SVM classifier based on the LIBLINEAR machine learning library (Fan et al., 2008). As features, the classifier uses only information extracted by combining the word embeddings of the words of the analyzed tweet.

In all the experiments described in this paper, our system addresses the classification tasks by performing 5-fold cross-validation on the training set provided for the SENTIPOLC 2016 evaluation campaign; the final score is the average over the folds. We evaluate each fold using the Average F-score described by Barbieri et al. (2016).
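The following is a minimal sketch of this setup, not the authors' implementation: it assumes the tweet vectors and gold labels have already been built, uses scikit-learn's LinearSVC (which is backed by LIBLINEAR), and uses a macro-averaged F1 over the folds as a stand-in for the official SENTIPOLC Average F-score.

```python
# Minimal sketch of the classification setup (not the authors' code):
# a LIBLINEAR-backed linear SVM evaluated with 5-fold cross-validation,
# averaging the per-fold scores. `doc_vectors` is an (n_tweets, dim)
# array of document vectors, `labels` the gold classes, both assumed
# to be already built.
import numpy as np
from sklearn.svm import LinearSVC                 # LinearSVC wraps LIBLINEAR
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def average_f_score_cv(doc_vectors, labels, n_folds=5):
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(doc_vectors, labels):
        clf = LinearSVC().fit(doc_vectors[train_idx], labels[train_idx])
        predicted = clf.predict(doc_vectors[test_idx])
        # Macro F1 stands in for the official SENTIPOLC Average F-score.
        scores.append(f1_score(labels[test_idx], predicted, average="macro"))
    return np.mean(scores)   # the final score is the average over the folds
```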
For what concerns the word embeddings, we trained two types of word embedding representations: i) the first using the word2vec¹ toolkit (Mikolov et al., 2013); this tool learns low-dimensional word embeddings represented by a set of latent (hidden) variables, with each word associated to a multidimensional vector that represents a specific instantiation of these variables; ii) the second using fastText (Bojanowski et al., 2016), a library for the efficient learning of word representations and sentence classification. This library makes it possible to overcome the out-of-vocabulary problem that affects word2vec. Generating embeddings for out-of-vocabulary words is a typical issue for morphologically rich languages with large vocabularies and many rare words. FastText overcomes this limitation by representing each word as a bag of character n-grams: a vector representation is associated to each character n-gram, and the word is represented as the sum of these character n-gram representations.

In both cases, each word is represented by a 100-dimensional vector, computed using the CBOW algorithm – which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window – with a context window of 5 words.

¹http://code.google.com/p/word2vec/
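As an illustration of these settings, the sketch below trains both models via gensim (version 4.0 or later is assumed) rather than the original command-line tools; `tweets` stands for an iterable of tokenized tweets, which is an assumption of this sketch.

```python
# Illustrative sketch of the embedding training settings reported above
# (CBOW, 100 dimensions, context window of 5), here via gensim rather
# than the original tools. `tweets` is assumed to be an iterable of
# tokenized tweets, e.g. [["oggi", "piove"], ...].
from gensim.models import Word2Vec, FastText

def train_embedding_models(tweets):
    # sg=0 selects the CBOW algorithm in both models.
    w2v = Word2Vec(sentences=tweets, vector_size=100, window=5, sg=0)
    # fastText additionally learns vectors for character n-grams, so it can
    # compose a vector for out-of-vocabulary words from their subword units.
    ft = FastText(sentences=tweets, vector_size=100, window=5, sg=0)
    return w2v, ft
```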
5 Experiments and Results

To answer the questions listed in Section 2, we conducted a large number of experiments, testing many ways of representing the tweets by exploiting the word embeddings of their words in different manners.

To evaluate the impact of each studied parameter on classification accuracy, we report, for each value of that parameter, the average accuracy across all the classification experiments conducted by varying all the other parameters (in a 5-fold cross-validation scenario).

In all the experiments, we used only features based on word embeddings.

5.1 Size of the Embeddings Training Corpus

To answer research question 1, we trained several word embedding models on partitions of the Tweets corpus of increasing size, using both word2vec and fastText. Ten smaller partitions were obtained starting with ten million tokens (for the smallest one) and adding another ten million for each new partition, up to 100 million tokens. We created four further, bigger partitions, containing respectively 240, 480, 720 and 960 million tokens; the size of the smallest of these four partitions is comparable to the size of Paisà.
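A hypothetical sketch of how such cumulative partitions could be carved out of the corpus is shown below; `corpus_tweets` is an assumed generator function standing in for the actual tweet stream, and is not part of the paper.

```python
# Hypothetical sketch of building cumulative partitions of increasing
# token counts. `corpus_tweets` stands in for a generator function
# yielding one tokenized tweet at a time, always from the corpus start.
MILLION = 1_000_000
PARTITION_SIZES = [10 * MILLION * i for i in range(1, 11)] \
                + [240 * MILLION, 480 * MILLION, 720 * MILLION, 960 * MILLION]

def take_partition(corpus_tweets, n_tokens):
    """Collect tweets from the start of the corpus up to ~n_tokens tokens."""
    partition, count = [], 0
    for tweet in corpus_tweets():
        partition.append(tweet)
        count += len(tweet)
        if count >= n_tokens:
            break
    return partition

# partitions = {size: take_partition(corpus_tweets, size)
#               for size in PARTITION_SIZES}
```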
Figure 1 reports the results. When we use embeddings trained with word2vec on increasing amounts of data, the average F-score grows for all three subtasks. The amount of this growth is similar for Subjectivity Classification (0.016) and Polarity Classification (0.019), while it is smaller for Irony Detection, which is the most challenging of the three. In all cases the increase is significantly faster over the first 80 to 100 million tokens, particularly for the Irony Detection task: in this case, the average F-score essentially stops growing after around 80 million tokens.

When we use embeddings trained with fastText, the outcome is the opposite: the average F-score decreases as bigger amounts of data are used to train the embeddings. The decrease is faster over the first hundreds of millions of tokens.

[Figure 1: Average F-scores obtained by using embeddings trained on increasing amounts of tokens, using word2vec (circles) and fastText (crosses). Blue marks Subjectivity Classification, red Polarity Classification and green Irony Detection. x-axis: millions of tokens (0 to 1,200); y-axis: average F-score (0.47 to 0.59).]

Lesson learned: these results suggest that, for word-based word embeddings, accuracy rises as the training corpus grows, but stabilizes quickly. On the other hand, increasing the size of the training corpus apparently does not improve accuracy when the embeddings are produced with fastText (it may even lower it).

5.2 Domain of the Embeddings Training Corpus

To answer research question 2, we ran a set of experiments using the four models obtained by running word2vec and fastText on the Paisà and Tweets corpora. Table 1 reports the results. As we can see, the embeddings trained with word2vec on the in-domain dataset (Tweets) provide features that achieve a higher average accuracy than the features extracted from the out-of-domain corpus. By contrast, there is hardly any variation in accuracy when the embeddings are trained with fastText.

        Subj.            Pol.             Iro.
        w2v     ft       w2v     ft       w2v     ft
tw      0.5901  0.5198   0.592   0.5384   0.4837  0.4776
pa      0.572   0.5206   0.5693  0.5312   0.4793  0.4759

Table 1: Average F-scores obtained by using word embeddings trained on the Twitter (tw) and Paisà (pa) corpora.

Lesson learned: in-domain word embeddings are very important in a semantic classification scenario. Apparently, this does not hold when character-based word embeddings are used.

5.3 Type of Embeddings Learning Model

As regards research question 3, the type of embeddings learning model (words vs character n-grams) considerably influences the performance of the classifier. Using embeddings trained with word2vec leads to F-score values that are significantly higher than those obtained using embeddings trained with fastText (see Table 1).

Lesson learned: this outcome suggests that embeddings learned by methods that treat words as atomic entities provide features that are more useful in a semantic task such as sentiment classification than character-based embeddings.

5.4 Methods to Combine Word Embeddings

To answer research question 4, we tested several methods of combining the embeddings of the words of each document into a document-level vector representation. We experimented with five combining methods: Sum, Mean, Maximum-pooling, Minimum-pooling and Product. Each of these methods returns a single vector t = (t_1, ..., t_d), where each component t_j is obtained by combining the j-th components w_{1j}, w_{2j}, ..., w_{nj} of the embeddings of the n words of the tweet. Figure 2 shows a graphical representation of this process.

[Figure 2: Embeddings combination process. The n word embeddings of a tweet form an n × d matrix with rows w_1, ..., w_n; each combination method collapses this matrix column-wise into a single document vector t = (t_1, ..., t_d).]

We tested these methods separately, and all of them jointly as well. When using all methods jointly, the document representation is obtained by concatenating the vectors returned by each method.
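The sketch below gives a compact rendering of the five combiners and of the joint concatenation; `embeddings` is assumed to be the n × d matrix of a tweet's word vectors, and the function names are illustrative rather than taken from the paper.

```python
# Compact sketch of the five combination strategies: each collapses the
# (n_words, d) matrix of a tweet's word embeddings into one d-dimensional
# vector; "all" concatenates the outputs of the five combiners (5 * d).
import numpy as np

COMBINERS = {
    "sum":  lambda m: m.sum(axis=0),
    "mean": lambda m: m.mean(axis=0),
    "max":  lambda m: m.max(axis=0),    # component-wise maximum pooling
    "min":  lambda m: m.min(axis=0),    # component-wise minimum pooling
    "prod": lambda m: m.prod(axis=0),
}

def document_vector(embeddings, method="sum"):
    if method == "all":
        return np.concatenate([f(embeddings) for f in COMBINERS.values()])
    return COMBINERS[method](embeddings)
```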
        Subj.            Pol.             Iro.
        w2v     ft       w2v     ft       w2v     ft
Sum     0.6054  0.534    0.6085  0.5532   0.4887  0.5033
Mean    0.6017  0.5951   0.5954  0.5916   0.4709  0.4811
Max     0.5957  0.5012   0.5964  0.507    0.4736  0.4698
Min     0.593   0.5012   0.5951  0.5011   0.4754  0.4707
Prod    0.4415  0.4759   0.4384  0.5012   0.4693  0.4628
All     0.6236  0.4846   0.6246  0.51     0.5202  0.4715

Table 2: Average F-scores obtained by using different strategies for combining word embeddings. Bold black values are the best F-scores overall; blue bold values are the best F-scores obtained by a single combination method in the word-based word embeddings scenario (w2v); red bold values are the best F-scores in the character-based word embeddings scenario (ft).

As we can see in Table 2, Sum proved to be the best method for all tasks when using embeddings obtained with word2vec; the best results overall are obtained by concatenating the vectors returned by all the methods (row All in the table). When using embeddings trained with fastText, the best results are obtained with Mean for Subjectivity and Polarity Classification, and with Sum for Irony Detection; in this case, the concatenation of all vectors leads to poor results. Meanwhile, the worst approach is the Product combination. Interestingly, while the concatenation of all the combined vectors is clearly the best approach for producing the document-level representation with word-based word embeddings, this is not true for the character-based ones.

Lesson learned: these outcomes suggest that the best combination method is Sum for vectors obtained with word-based word embeddings and Mean for character-based ones.

5.5 Selection of Morpho-syntactic Categories of Combined Word Embeddings

To answer research question 5, we ran a set of experiments using only a subset of the word embeddings of each document to produce the document vector representation. The word selection is guided by the morpho-syntactic categories of the words. We tested four categories: noun, verb, adjective and adverb. The embeddings of the words belonging to each of these categories were combined into a POS-based document vector. In addition, we tested the document representation obtained by concatenating the different POS-based vectors (N, V, Adj, Adv), with and without the all-word document vector (All words), which is the only one taking emoticons and hashtags into account.
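By way of illustration, the sketch below selects and sums the vectors of one morpho-syntactic category and concatenates the four POS-based vectors; the tag names, the `lookup` function and the zero-filling of empty categories are assumptions of this sketch, not details taken from the paper.

```python
# Illustrative sketch of the POS-based selection; the tag set, the
# `lookup` function (word -> embedding, or None if unknown) and the
# zero-filling of empty categories are assumptions, not the paper's code.
import numpy as np

def pos_based_vector(tagged_tweet, lookup, pos):
    """Sum the embeddings of the words tagged with the given POS.
    tagged_tweet: list of (token, pos_tag) pairs from a POS tagger."""
    vectors = [lookup(tok) for tok, tag in tagged_tweet if tag == pos]
    vectors = [v for v in vectors if v is not None]
    return np.sum(vectors, axis=0) if vectors else None

def concatenated_pos_vector(tagged_tweet, lookup, dim=100):
    """Concatenate the four POS-based vectors (N, V, Adj, Adv)."""
    parts = []
    for pos in ("NOUN", "VERB", "ADJ", "ADV"):
        v = pos_based_vector(tagged_tweet, lookup, pos)
        parts.append(v if v is not None else np.zeros(dim))
    return np.concatenate(parts)
```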
Table 3 reports the results of the experiments. In the word-based word embedding scenario, among the single morpho-syntactic categories the noun class shows the highest performance. Overall, the highest score is yielded by the combination of all the selected categories concatenated with the combined vector of all the word embeddings (row All words, N, V, Adj, Adv in the table). As regards the character-based word embeddings, the noun is the individually best performing category only for the Subjectivity Classification task, while the adjective and the verb are the best performing categories for the other two tasks (Polarity Classification and Irony Detection, respectively).

                           Subj.            Pol.             Iro.
                           w2v     ft       w2v     ft       w2v     ft
N                          0.553   0.5171   0.5417  0.5091   0.4725  0.4749
V                          0.4755  0.4778   0.5091  0.5136   0.469   0.4897
Adj                        0.4406  0.4534   0.5184  0.5335   0.4705  0.4826
Adv                        0.4397  0.4504   0.4971  0.5033   0.4702  0.485
N, V, Adj, Adv             0.6266  0.5578   0.6141  0.5667   0.4948  0.5041
All words                  0.6251  0.5363   0.5941  0.515    0.4773  0.4521
All words, N               0.6287  0.5221   0.6032  0.5343   0.4887  0.4646
All words, V               0.6326  0.5276   0.6035  0.5339   0.4841  0.4634
All words, Adj             0.6374  0.5328   0.6185  0.5184   0.4867  0.4693
All words, Adv             0.6337  0.5243   0.6087  0.5187   0.4856  0.4674
All words, N, V, Adj, Adv  0.6521  0.5691   0.6319  0.5546   0.5139  0.4886

Table 3: Average F-scores obtained using the embeddings of words belonging to different morpho-syntactic classes. Bold black values are the best F-scores overall; blue bold values are the best F-scores obtained using a single grammatical class in the word-based word embeddings scenario (w2v); red bold values are the best F-scores obtained using a single grammatical class in the character-based word embeddings scenario (ft).

Lesson learned: these results show that the noun class is the most important grammatical category only in the word-based word embedding scenario; in both scenarios, the concatenation of all the POS-based vectors with the All words vector yields the best accuracy.

6 Conclusions

In this work we studied the impact of word embedding-based features in sentiment analysis tasks. We performed several classification experiments to investigate the effects on classification performance of five dimensions related to the word embeddings. We tested several different ways of selecting and combining the embeddings and studied how the performance of a sentiment classifier changes accordingly.

Despite the lessons learned from this work, several aspects remain to be investigated, such as the tuning of the parameters used to train the embeddings and new vector combination strategies.

References

Giuseppe Attardi, Daniele Sartiano, Chiara Alzetta and Federica Semplici. 2016. Convolutional Neural Networks for Sentiment Analysis on Italian Tweets. CLiC-it/EVALITA.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification Task. Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.

Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606.

Giuseppe Castellucci, Danilo Croce and Roberto Basili. 2016. Context-aware Convolutional Neural Networks for Twitter Sentiment Analysis in Italian. CLiC-it/EVALITA.

Emanuele Di Rosa and Alberto Durante. 2016. Tweet2Check Evaluation at EVALITA SENTIPOLC 2016. CLiC-it/EVALITA.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci and Vito Pirrelli. 2013. PAISÀ Corpus of Italian Web Text. Institute for Applied Linguistics, Eurac Research.

Mika V. Mäntylä, Daniel Graziotin and Miikka Kuutila. 2016. The Evolution of Sentiment Analysis – A Review of Research Topics, Venues, and Top Cited Papers. Computer Science Review, 27:16–32.

Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif Mohammad, Alan Ritter and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment Analysis in Twitter. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 451–463.

Sara Rosenthal, Noura Farra and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment Analysis in Twitter. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 502–518.

Aliaksei Severyn and Alessandro Moschitti. 2015. UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 464–469.