                        Word Embeddings in Sentiment Analysis
                              Ruggero Petrolito• , Felice Dell’Orletta
                                            Università di Pisa
              Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR)
                                   ItaliaNLP Lab - www.italianlp.it

Abstract

English. In recent years sentiment analysis and its applications have become increasingly popular. In this field of research, machine learning and word representation learning derived from distributional semantics (i.e. word embeddings) have recently proven to be very successful in performing sentiment analysis tasks. In this paper we describe a set of experiments aimed at evaluating the impact of word embedding-based features in sentiment analysis tasks.

Italiano. Sentiment analysis and its applications have recently become increasingly popular. In this research area, in recent years machine learning and word representation methods derived from distributional semantics (specifically, word embeddings) have proven very effective in carrying out the various tasks related to sentiment analysis. In this article we describe a series of experiments conducted with the aim of evaluating the impact of using word embedding-based features in the various sentiment analysis tasks.

1   Introduction

In recent years sentiment analysis has become very popular among NLP tasks. As reported by Mäntylä et al. (2016), the number of papers on this subject has increased significantly in the first two decades of the 21st century, as has the extent of its applications. A wide variety of technologies has been used to address sentiment analysis tasks during this period. In recent years, machine learning techniques have proved to be very effective; in particular, systems based on deep learning techniques now represent the state of the art. In this field, word embeddings have been widely used as a way of representing words in sentiment analysis tasks, and have proved to be very effective.

The SemEval workshops give a good picture of the state of the art in sentiment analysis. In the 2015 edition (Rosenthal et al., 2015), most participants used machine learning techniques; in many of the subtasks, the top-ranking systems used deep learning methods and word embeddings, like the system submitted by Severyn and Moschitti (2015), which was ranked 1st in subtask A and 2nd in subtask B. In the 2016 edition (Nakov et al., 2016), deep learning based techniques, such as convolutional neural networks and recurrent neural networks, were the most popular approach. In the 2017 edition (Rosenthal et al., 2017), machine learning methods were very popular, especially support vector machines and deep neural networks such as convolutional neural networks and long short-term memory networks.

For the Italian language, the EVALITA conference gives a good picture of the state of the art in natural language processing. In the 2016 edition (Barbieri et al., 2016), the top-ranking systems used machine learning and deep learning techniques (Castellucci et al. (2016), Attardi et al. (2016), Di Rosa and Durante (2016)).

The purpose of this study is to explore ways of using word embeddings to build meaningful representations of documents in sentiment analysis tasks performed on Italian tweets.

2   Our Contribution

In this paper we aim to evaluate the effect of exploiting word embeddings in sentiment analysis tasks. In particular, we explore the effect of five factors on the performance of a sentiment analysis classification system, to answer five research questions:
    1. What is the effect of the size of the corpus used to train the embeddings?

    2. Which text domain allows us to train better embeddings (in-domain vs out-of-domain data)?

    3. Which type of learning method produces better embeddings (word vs character-based word embeddings)?

    4. Which method of combining the word vectors produces a better document vector representation?

    5. Which words (in terms of part of speech) are the most important for producing a better document vector representation?

To answer these questions, we performed several classification experiments, testing our system on the three sentiment analysis tasks proposed in the 2016 EVALITA SENTIPOLC campaign (Barbieri et al., 2016): Subjectivity Classification, Polarity Classification and Irony Detection. In the first of these tasks, the highest accuracy was achieved by the system of Castellucci et al. (2016). In the second task, the most accurate system was the one submitted by Attardi et al. (2016). In the third task, the highest accuracy was reached by the system of Di Rosa and Durante (2016). Among these systems, Castellucci et al. (2016) and Attardi et al. (2016) use deep learning techniques (convolutional neural networks), while Di Rosa and Durante (2016) use an ensemble of many supervised learning classifiers.

3   Datasets

We tested our system on the three sentiment analysis tasks proposed in the 2016 EVALITA SENTIPOLC campaign. These tasks and the related datasets have been described by Barbieri et al. (2016). We conducted our experiments on the training set provided by the organizers of the evaluation campaign, which is composed of 7921 tweets.

We train our word embeddings on two corpora, one in-domain and one out-of-domain. The in-domain dataset is a collection of tweets that we collected for this work, named Tweets. It is composed of almost 80 million tweets, amounting to around 1.2 billion tokens. The out-of-domain dataset is the Paisà corpus, a collection of Italian web texts described by Lyding et al. (2013).

4   Experimental Setup

For our experiments, we used an SVM-based classifier, with LIBLINEAR (Rong-En et al., 2013) as the machine learning library. As features, the classifier uses only information extracted by combining the word embeddings of the words of the analyzed document.

In all the experiments described in this paper, our system addresses the classification tasks by performing 5-fold cross-validation on the training set provided for the SENTIPOLC 2016 evaluation campaign. We evaluate each fold using the Average F-score described by Barbieri et al. (2016); the final score is the average of the scores obtained on the five folds.

As for the word embeddings, we trained two types of word embedding representations: i) the first using the word2vec¹ toolkit (Mikolov et al., 2013). This tool learns lower-dimensional word embeddings, which are represented by a set of latent (hidden) variables; each word is associated with a multidimensional vector that represents a specific instantiation of these variables; ii) the second using fastText (Bojanowski et al., 2016), a library for efficient learning of word representations and sentence classification. This library makes it possible to overcome the problem of out-of-vocabulary words, which affects the word2vec methodology. Generating embeddings for out-of-vocabulary words is a typical issue for morphologically rich languages with large vocabularies and many rare words. FastText overcomes this limitation by representing each word as a bag of character n-grams. A vector representation is associated with each character n-gram, and the word is represented as the sum of these character n-gram representations.

In both cases, each word is represented by a 100-dimensional vector, computed using the CBOW algorithm, which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window, with a context window of 5 words.

¹ http://code.google.com/p/word2vec/

5   Experiments and Results

To answer the questions listed in Section 2, we conducted a large number of experiments, testing many ways of representing the tweets by exploiting in different ways the word embeddings of
the words extracted from the tweets.

To evaluate the impact (in terms of classification accuracy) of the variations of each parameter studied, we report, for each value of the parameter, the accuracy calculated as the average accuracy across all the classification experiments that we conducted by varying all the other parameters (in a 5-fold cross-validation scenario).

In all the experiments, we used only features based on word embeddings.

5.1   Size of the Embeddings Training Corpus

To answer question 1, we trained several word embedding models on partitions of the Tweets corpus of increasing size, using both word2vec and fastText. Ten smaller partitions were obtained by starting with just ten million tokens (for the smallest one) and adding another ten million for each new partition, up to 100 million. We created four further, bigger partitions, which contain 240, 480, 720 and 960 million tokens respectively; the size of the smallest of these four partitions is comparable to the size of Paisà.

[Figure 1 plot: Average F-scores (y-axis) against millions of tokens (x-axis, 0 to 1,200)]

Figure 1: Average F-scores obtained by using embeddings trained on increasing amounts of tokens, using word2vec (circles) and fastText (crosses). Blue is assigned to Subjectivity Classification, red to Polarity Classification and green to Irony Detection.

Figure 1 reports the results. When we use embeddings trained with word2vec on increasing amounts of data, the average F-score grows for all three subtasks. The amount of this growth is similar for Subjectivity Classification (0.016) and Polarity Classification (0.019), while it is smaller for Irony Detection, which is the most challenging of the three. In all cases the increase is significantly faster in the first 80 to 100 million tokens, particularly for the Irony Detection task: in this case, the average F-score basically stops growing after around 80 million tokens.

When we use embeddings trained with fastText, the outcome is the opposite: the average F-score values decrease as larger amounts of data are used to train the embeddings. The decrease is faster over the first hundreds of millions of tokens.

Lesson learned: these results suggest that, for word-based word embeddings, accuracy rises as the training corpus grows, but quickly becomes stable. On the other hand, increasing the size of the training corpus apparently does not influence the accuracy values when the embeddings have been produced using fastText (or it even lowers the accuracy values).

5.2   Domain of the Embeddings Training Corpus

To answer question 2, we ran a set of experiments using the four models obtained by training word2vec and fastText on the Paisà and Tweets corpora. Table 1 reports the results of the experiments. As we can see, the embeddings trained with word2vec on the in-domain dataset (Tweets) provide features that achieve a higher average accuracy than the features extracted from the out-of-domain corpus. By contrast, there is almost no variation in accuracy when the embeddings are trained with fastText.

Lesson learned: in-domain word embeddings are very important in a semantic classification scenario. Apparently, this is not true when character-based word embeddings are used.

               Subj.               Pol.                Iro.
          w2v      ft        w2v      ft        w2v      ft
    tw    0.5901   0.5198    0.592    0.5384    0.4837   0.4776
    pa    0.572    0.5206    0.5693   0.5312    0.4793   0.4759

Table 1: Average F-scores obtained by using word embeddings trained on the Twitter (tw) and Paisà (pa) corpora.

5.3   Type of Embeddings Learning Model

As regards question 3, the type of embeddings learning model (words vs character n-grams) considerably influences the performance of the classifier. Using embeddings trained with word2vec leads to F-score values that are significantly higher than those obtained using embeddings trained with fastText (see Table 1).

Lesson learned: this outcome suggests that embeddings learned by methods that treat words as atomic entities provide features that are more useful in a semantic task such as sentiment classification than character-based embeddings.

5.4   Methods to Combine Word Embeddings

To answer question 4, we tested many methods of combining the embeddings of the words of each document into a document-level vector representation.

                 Subj.               Pol.                Iro.
            w2v      ft        w2v      ft        w2v      ft
    Sum     0.6054   0.534     0.6085   0.5532    0.4887   0.5033
    Mean    0.6017   0.5951    0.5954   0.5916    0.4709   0.4811
    Max     0.5957   0.5012    0.5964   0.507     0.4736   0.4698
    Min     0.593    0.5012    0.5951   0.5011    0.4754   0.4707
    Prod    0.4415   0.4759    0.4384   0.5012    0.4693   0.4628
    All     0.6236   0.4846    0.6246   0.51      0.5202   0.4715

Table 2: Average F-scores obtained by using different strategies of combination of word embeddings. Bold black values are the best F-scores overall; blue bold values are the best F-scores obtained by using a single combination method in the word-based word embeddings scenario (w2v); red bold values are the best F-scores in the character-based word embeddings scenario (ft).
We experimented with five combining methods: Sum, Mean, Maximum-pooling, Minimum-pooling and Product. Each of these methods returns a single vector t, such that each component t_j is obtained by combining the j-th components w_1j, w_2j, ..., w_nj of the embeddings of the n words of the tweet. Figure 2 shows a graphical representation of this process.

             w_1 = ( w_11   w_12   ...   w_1d )
             w_2 = ( w_21   w_22   ...   w_2d )
    tweet    w_3 = ( w_31   w_32   ...   w_3d )
             ...
             w_n = ( w_n1   w_n2   ...   w_nd )

             t   = ( t_1    t_2    ...   t_d  )

    Figure 2: Embeddings combination process

We tested these methods separately, and all of them jointly as well. When using all methods, the document representation is obtained by concatenating the vectors returned by each method.

As we can see in Table 2, Sum proved to be the best single method for all the tasks when using embeddings obtained with word2vec. The best results overall are obtained using the concatenation of the vectors returned by all the methods (row All in the table). When using embeddings trained with fastText, the best results are obtained with Mean for Subjectivity and Polarity Classification, and with Sum for Irony Detection. In this case, the combination of all the vectors leads to poor results.

Lesson learned: these outcomes suggest that the best combination methods are Sum for word vectors obtained by using word-based word embeddings and Mean for character-based ones. Meanwhile, the worst approach is the Product combination. Interestingly, while the concatenation of all the combined word-based word embeddings is clearly the best approach to producing the document-level vector representation, this is not true for the character-based ones.

5.5   Selection of Morpho-syntactic Categories of Combined Word Embeddings

To answer question 5, we ran a set of experiments using only a subset of the word embeddings of each document to produce the document vector representation. The word selection is guided by the morpho-syntactic categories of the words. We tested four categories: noun, verb, adjective and adverb. The embeddings of the words belonging to each of these categories were combined into a POS-based document vector representation. In addition, we tested the document representation vector obtained through the concatenation of the different POS-based vectors (N, V, Adj, Adv), with and without the all-word document vector (All words), which is the only one that takes emoticons and hashtags into account.

Table 3 reports the results of the experiments. In the word-based word embedding scenario, as regards the contribution of single morpho-syntactic categories, nouns show the highest performance. Overall, the highest score is yielded by the combination of all the selected categories concatenated with the combined vector of all the word embeddings (All words rows in the table). As regards the character-based word embeddings, we can see that the noun is the best-performing individual category only for the Subjectivity Classification task, while the adjective and the verb are the best-performing categories for the other two tasks.
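The construction of document vectors described in Sections 5.4 and 5.5 can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (combine, combine_all, combine_by_pos), the toy 2-dimensional embeddings and the tag labels are hypothetical. The use of a zero vector for a category with no words in the tweet is also an assumption, since the paper does not say how that case is handled.

```python
from math import prod
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]

# The five combination methods of Section 5.4. Each one reduces a column,
# i.e. the j-th component of every word embedding of the tweet, to one value.
COMBINERS: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": sum,
    "mean": lambda column: sum(column) / len(column),
    "max": max,
    "min": min,
    "prod": prod,
}

# Hypothetical tag labels for the four categories tested in Section 5.5.
POS_CATEGORIES = ("N", "V", "Adj", "Adv")


def combine(embeddings: Sequence[Vector], method: str) -> Vector:
    """Collapse n word vectors of size d into one document vector of size d."""
    reduce_column = COMBINERS[method]
    # zip(*embeddings) iterates over the d columns of the n x d matrix.
    return [reduce_column(column) for column in zip(*embeddings)]


def combine_all(embeddings: Sequence[Vector]) -> Vector:
    """Concatenate the vectors returned by all five methods (row 'All')."""
    document_vector: Vector = []
    for method in COMBINERS:
        document_vector.extend(combine(embeddings, method))
    return document_vector


def combine_by_pos(tagged: Sequence[Tuple[str, Vector]],
                   method: str, dim: int) -> Vector:
    """Concatenate one combined vector per POS category (N, V, Adj, Adv)."""
    document_vector: Vector = []
    for category in POS_CATEGORIES:
        selected = [vector for pos, vector in tagged if pos == category]
        # Assumption: a category with no words contributes a zero vector.
        part = combine(selected, method) if selected else [0.0] * dim
        document_vector.extend(part)
    return document_vector


# Toy tweet: three words with made-up 2-dimensional embeddings.
tweet = [[1.0, 2.0], [3.0, 0.5], [2.0, 4.0]]
tagged_tweet = [("N", tweet[0]), ("V", tweet[1]), ("N", tweet[2])]

print(combine(tweet, "sum"))                 # [6.0, 6.5]
print(combine(tweet, "max"))                 # [3.0, 4.0]
print(len(combine_all(tweet)))               # 10 (5 methods x 2 dimensions)
print(combine_by_pos(tagged_tweet, "sum", 2))
```

With the 100-dimensional embeddings used in the paper, a single method would yield a 100-dimensional document vector and the concatenation of all five methods a 500-dimensional one. The sketch stops at feature construction and does not cover the SVM classifier.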
                                 Subj.              Pol.              Iro.
