The Contribution of Embeddings to Sentiment Analysis on YouTube

Moniek Nieuwenhuis                          Malvina Nissim
CLCG, University of Groningen               CLCG, University of Groningen
The Netherlands                             The Netherlands
m.l.nieuwenhuis@student.rug.nl              m.nissim@rug.nl

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We train a variety of embeddings on a large corpus of YouTube comments, and test them on three different tasks on both the English and the Italian portions of the SenTube corpus. We show that in-domain (YouTube) embeddings perform better than previously used generic embeddings, achieving state-of-the-art performance on most of the tasks. We also show that a simple method for creating sentiment-aware embeddings outperforms previous strategies, and that sentiment embeddings are more informative than plain embeddings for the SenTube tasks.

1 Introduction and Background

Sentiment analysis, or opinion mining, on social media is by now a well established task, though surely not a solved one (Liu et al., 2005; Barnes et al., 2017). Part of the difficulty comes from its intrinsically subjective nature, which makes creating reliable resources hard (Kiritchenko and Mohammad, 2017). Another part comes from its heavy interaction with pragmatic phenomena such as irony and world knowledge (Nissim and Patti, 2017; Basile et al., 2018; Cignarella et al., 2018; Van Hee et al., 2018). A further difficulty comes from the fact that, given a piece of text, be it a tweet or a review, it isn't always clear what exactly the expressed sentiment (should there be any) is about. In commercial reviews, for example, the target of a user's evaluation could be a specific aspect or part of a given product. Aspect-based sentiment analysis has developed as a subfield to address this problem (Thet et al., 2010; Pontiki et al., 2014).

The SenTube corpus (Uryupina et al., 2014) has been created along these lines. It contains English and Italian commercial or review videos about some product, together with annotated comments. The annotations specify both the polarity (positive, negative, neutral) and the target (the video itself or the product in the video). In Figure 1 we show two positive comments with different targets.

The SenTube tasks were first addressed by Severyn et al. (2016) with an SVM based on topic and shallow syntactic information, later outperformed by a convolutional N-gram BiLSTM word embedding model (Nguyen and Le Nguyen, 2018). The corpus has also served as a testbed for multiple state-of-the-art sentiment analysis methods (Barnes et al., 2017), with the best results obtained using sentiment-specific word embeddings (Tang et al., 2014). On the English sentiment task of SenTube, though, this method does not outperform corpus-specific approaches (Severyn et al., 2016; Nguyen and Le Nguyen, 2018).

We further explore the potential of (sentiment) embeddings, using the model developed by Nguyen and Le Nguyen (2018). We believe that training in-domain (YouTube) embeddings rather than using generic ones might yield improvements, and that additional gains might come from sentiment-aware embeddings. In this context, we propose a simple new semi-supervised method to train sentiment embeddings and show that it performs better than two other existing ones. We run all experiments on English and Italian data.

Contributions We show that in-domain embeddings outperform generic embeddings on most tasks of the SenTube corpus for both Italian and English. We also show that sentiment embeddings obtained through a simple semi-supervised strategy that we newly introduce in this paper add a boost to performance. We make all developed Italian and English embeddings available at this link: https://github.com/malvinanissim/youtube-embeds.
Figure 1: Two sample comments on a video about a Ferrari car. Top: positive comment about the product. Bottom: positive comment about the video.

2 Data and Task

We use two different datasets of YouTube comments. The first is the existing SenTube corpus (Uryupina et al., 2014). The other dataset is collected from YouTube to create a large semi-supervised corpus for training the embeddings.

2.1 SenTube corpus

The SenTube corpus contains 217 videos in English and 198 in Italian (Uryupina et al., 2014). All videos are reviews of or commercials for a product in the category "automobile" or "tablet".

All comments on the videos are annotated according to their target (whether they are about the video or about the product) and their sentiment polarity (positive, negative, neutral). Some of the comments were discarded because they were spam, because they were written in a language other than the intended one (Italian for the Italian corpus, English for the English one), or because they were simply off topic. Sentiment is type-specific, and the following labels are used: positive-product, negative-product, positive-video and negative-video. If neither positive nor negative is annotated, the comment is assumed to be neutral.

The corpus lends itself to three different tasks, all of which we tackle in this work:

• the sentiment task, namely predicting whether a YouTube comment expresses a positive, negative or neutral sentiment;
• the type task, namely predicting whether the comment is about the product mentioned in the video, about the video itself, or is not an informative comment (spam or off-topic);
• the full task, namely predicting the sentiment and the type of each comment at the same time.

From SenTube we exclude any comment that is annotated both as product-related and video-related, or as both positive and negative. Table 1 shows the label distribution for the three tasks. All comments are further lowercased and tokenised.

Table 1: Label distribution for each task in the SenTube corpus

                           English                               Italian
                           Automobile    %     Tablet      %     Automobile    %     Tablet     %
  Type task
    Product-related           5,834    38.8    11,067    56.2       1,718    40.9     2,976   61.0
    Video-related             5,201    34.5     3,665    18.6       1,317    31.4       845   17.3
    Uninformative             4,020    26.7     4,961    25.2       1,161    27.7     1,055   21.6
  Sentiment task
    Positive sentiment        3,284    21.8     3,637    18.5         946    22.5       770   15.8
    Negative sentiment        1,988    13.2     3,038    15.4         752    17.9       825   16.9
    No sentiment/neutral      9,801    65.0    13,021    66.1       2,499    59.5     3,281   67.3
  Full task
    Product-positive          1,740    11.5     2,280    11.6         479    11.4       544   11.4
    Product-negative          1,360     9.0     2,473    12.5         538    12.8       711   14.6
    Product-neutral           2,744    18.2     6,310    32.0         703    16.8     1,721   35.3
    Video-positive            1,543    10.2     1,357     6.9         467    11.1       226    4.6
    Video-negative              628     4.2       565     2.9         214     5.1       114    2.3
    Video-neutral             3,030    20.1     1,743     8.8         635    15.1       505   10.4
    Uninformative             4,028    26.7     4,968    25.2       1,161    27.7     1,055   21.6

2.2 Semi-supervised YouTube corpus

To train in-domain embeddings we collected more data from YouTube. We searched for relevant videos by querying the YouTube API with a set of keywords ("car", "tablet", "macchina", "automobile", ...). For each retrieved video we checked that it was not already included in the SenTube corpus, and verified that its description was in English/Italian using Python's langdetect module. We then retrieved all comments for each video that had more than one comment.

Next, we used the convolutional N-gram BiLSTM word embedding model by Nguyen and Le Nguyen (2018), which has state-of-the-art performance on SenTube, to label the data on the sentiment task, as we want to exploit the labels to train sentiment embeddings. Table 2 shows an overview of the collected dataset. A manual check on a randomly chosen test set of 100 comments for each language revealed a rough accuracy of just under 60% for English, and just under 65% for Italian.
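As an illustration of the language filter described above, the following is a minimal sketch of the filtering step, not the authors' actual collection script. It assumes that candidate videos have already been retrieved from the YouTube Data API as (video id, description) pairs, and that the identifiers of the SenTube videos are known; all names in the snippet are illustrative.

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def filter_candidate_videos(candidates, sentube_ids, target_lang):
    """Keep videos whose description is in the target language ('en' or 'it')
    and which are not already part of the SenTube corpus."""
    kept = []
    for video_id, description in candidates:
        if video_id in sentube_ids:          # skip videos already in SenTube
            continue
        try:
            if detect(description) == target_lang:
                kept.append(video_id)
        except LangDetectException:          # empty or undetectable description
            continue
    return kept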
Table 2: Overview of extra data collected from YouTube

                          English                                     Italian
                          Automobile      Tablet          Total       Automobile     Tablet          Total
  Videos                       1,592       1,675          3,267            1,622      1,151          2,773
  Comments                 1,028,136     587,506      1,615,642           99,328    118,274        217,602
  Tokens                  18,124,184   9,156,324     27,280,508        1,596,190  1,579,591      3,175,781
  Unique tokens              754,962     416,835      1,030,574          170,956    155,738        277,114
  Positive sentiment         165,725      97,439        263,164 (16.3%)   11,091     13,356     24,447 (11.2%)
  Negative sentiment          49,490      53,557        103,047 (6.4%)     4,898      4,514      9,412 (4.3%)
  Neutral sentiment          812,921     436,510      1,249,431 (77.3%)   83,339    100,404    183,743 (84.4%)

3 Embeddings

We test three different categories of embeddings: some pre-trained models, a variety of models trained on our in-domain dataset, and sentiment-aware embeddings, which we obtain in three different ways. All of the embeddings are tested in the model developed by Nguyen and Le Nguyen (2018) to specifically tackle the SenTube tasks.

3.1 Plain Embeddings

Generic models For English we used the GoogleNews vectors (https://code.google.com/archive/p/word2vec/), which are those used by Nguyen and Le Nguyen (2018), and the 200-dimensional GloVe Twitter embeddings (https://nlp.stanford.edu/projects/glove/). For Italian we used the vectors from Bojanowski et al. (2016), a FastText model trained on the Italian Wikipedia, which is also the model used by Nguyen and Le Nguyen (2018). Furthermore, we tested two models developed at ISTI-CNR (http://hlt.isti.cnr.it/wordembeddings/), which are trained on the Italian Wikipedia with skip-gram Word2Vec and with GloVe.

In-domain trained models We trained three Word2Vec models (Mikolov et al., 2013), all of dimension 300, using Gensim (Řehůřek and Sojka, 2010). Besides a CBOW model with default settings, we trained two different skip-gram models, one with default settings and one with a negative sampling of 10. We also trained a FastText model (Bojanowski et al., 2016), and a 100-dimensional GloVe model (Pennington et al., 2014).
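The in-domain models described above can be reproduced with a few lines of Gensim. The sketch below is ours and assumes Gensim 4.x and a tokenised, lowercased comment corpus; hyperparameters other than dimensionality, skip-gram and negative sampling are library defaults and are not necessarily those used in the paper.

from gensim.models import Word2Vec, FastText

def train_in_domain(comments):
    # `comments` is the in-domain corpus: a list of token lists, one per YouTube comment.
    models = {
        "cbow": Word2Vec(sentences=comments, vector_size=300, sg=0),        # CBOW, default settings
        "skip": Word2Vec(sentences=comments, vector_size=300, sg=1),        # skip-gram, default settings
        "skip_neg10": Word2Vec(sentences=comments, vector_size=300,
                               sg=1, negative=10),                          # skip-gram, negative sampling of 10
        "fasttext": FastText(sentences=comments, vector_size=300, sg=1),    # subword-aware model
    }
    # The 100-dimensional GloVe model is trained separately with the Stanford GloVe tool.
    return models

A word vector is then available as, for example, models["skip"].wv["car"].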
3.2 Sentiment-aware Embeddings

We use three methods for adding sentiment to the embeddings, in all cases applying them to the Word2Vec skip-gram models (Mikolov et al., 2013), with and without negative sampling of 10. The first two methods are existing ones, namely retrofitting (Faruqui et al., 2015) and the refinement method suggested by Yu et al. (2017), while the third method is newly proposed in this work.

Retrofitting Retrofitting embedding models is a method to refine vector space representations using relational information from semantic lexicons, by encouraging linked words to have similar vector representations (Faruqui et al., 2015; https://github.com/mfaruqui/retrofitting). We used two sentiment lexicons to retrofit the skip-gram models: a SentiWordNet-derived lexicon for English (Baccianella et al., 2010), and Sentix for Italian (Basile and Nissim, 2013; http://valeriobasile.github.io/twita/sentix.html).

Sentiment embedding refinement We tested the method proposed by Yu et al. (2017), using the provided code (https://github.com/wangjin0818/word_embedding_refine) to refine our own skip-gram Word2Vec models. In this method, the top-k most similar words of a target word are re-ranked by sentiment, based on the difference in valence scores taken from a sentiment lexicon. For English we used the E-ANEW sentiment lexicon (Warriner et al., 2013) and for Italian we used Sentix (Basile and Nissim, 2013).

Our embedding refinement For each language, we use a sentiment lexicon and our YouTube corpus to train sentiment embeddings. From the sentiment lexicon we create two lists of words: positive words (positive score > 0.6 and negative score < 0.2) and negative words (negative score > 0.6 and positive score < 0.2).

For each word in the positive list, we check whether it occurs in a comment with a positive label; we do the same for the negative list and negatively labelled comments. If it does, we add the affix "_pos" or "_neg" to that occurrence of the word in the positive or negative comment. If a word from the positive list is found in a comment with a negative or neutral label it is left untouched, and likewise for words in the negative list. An example of this approach is shown in Table 3.

Table 3: Example of the word "love" changed in the positive comment and not changed in neutral or negative comments.

  Example                                                                            Label
  "I love_pos this review! It's not the technical review that every YouTube vid     positive
   has bit more of a usable hands on one! makes me really_pos want one even
   more than before! Thank you!"
  "I love being a cheapskate. Please tell me what in the world "gimp" is."          neutral
  "I don't understand why people love apple shit [...]"                             negative

We then trained the embeddings with skip-gram Word2Vec (Mikolov et al., 2013) on the corpus containing the two separate appearances of each such word, i.e. with and without affixes. This of course poses a problem at test time, since two vectors are now available for some of the words (great_pos and great for "great", for example, or brutto_neg and brutto for "brutto" [en: ugly]), but one must eventually choose a single vector to represent the encountered word "great", or "brutto".

Instead of devising a strategy for choosing one of the two vectors, we opted for re-joining the two versions of the word into a single one, testing two different methods (a sketch of both follows the list):

• averaging: average the vectors with each other; the two contexts have equal weight;
• weighting: weigh each vector by the proportion of times the word occurs in either context (in the semi-supervised corpus), and sum them.
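The following is a minimal sketch of this refinement pipeline under the assumptions just described (lexicon thresholds, affixing, re-joining); function and variable names are illustrative and not taken from the authors' code. The affixed corpus is then fed to the Gensim skip-gram training shown earlier.

def build_polarity_lists(lexicon):
    # `lexicon` is assumed to map a word to a (positive_score, negative_score) pair.
    pos_words = {w for w, (p, n) in lexicon.items() if p > 0.6 and n < 0.2}
    neg_words = {w for w, (p, n) in lexicon.items() if n > 0.6 and p < 0.2}
    return pos_words, neg_words

def affix_tokens(tokens, label, pos_words, neg_words):
    # Mark lexicon words only when the comment label agrees with their polarity.
    out = []
    for tok in tokens:
        if label == "positive" and tok in pos_words:
            out.append(tok + "_pos")
        elif label == "negative" and tok in neg_words:
            out.append(tok + "_neg")
        else:
            out.append(tok)
    return out

def rejoin(model, word, affix, count_plain, count_affixed, method="weighting"):
    # Merge the plain and affixed vectors of a word into a single test-time vector.
    v_plain = model.wv[word]
    v_affixed = model.wv[word + affix]
    if method == "averaging":                   # equal weight for both contexts
        return (v_plain + v_affixed) / 2
    total = count_plain + count_affixed         # proportion-weighted sum
    return (count_plain / total) * v_plain + (count_affixed / total) * v_affixed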
4 Experiments

We split the SenTube corpus into 50% train and 50% test. We could not exactly replicate the split by Nguyen and Le Nguyen (2018) due to a lack of sufficient details in their code. We use their model to test all embeddings, including those used in their implementation (GoogleNews for English, and FastText for Italian), for a direct comparison with our embeddings. For completeness, we also include the results reported by Severyn et al. (2016) (with their own split), and a most frequent label baseline for each task. As was done in previous work on this corpus, and for a more direct comparison, we report accuracy across all experiments.
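For reference, the most frequent label baseline amounts to always predicting the majority class of the training split; a minimal sketch of ours (not the evaluation code used in the paper) is:

from collections import Counter

def most_frequent_label_accuracy(train_labels, test_labels):
    # Accuracy obtained by always predicting the most frequent training label.
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)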
Table 4: English embeddings results

  Task        Embeddings                          AUTO     TABLET
  Sentiment   Most frequent label baseline        0.632    0.680
              (Severyn et al., 2016)              0.557    0.705
              (Nguyen and Le Nguyen, 2018)        0.669    0.702
              in-domain CBOW                      0.725    0.755
              in-domain SKIP                      0.740    0.750
              in-domain SKIP neg samp             0.730    0.756
              in-domain GloVe                     0.709    0.754
              in-domain FastText                  0.729    0.754
              generic GoogleNews                  0.715    0.748
              generic GloVe Twitter               0.723    0.742
  Type        Most frequent label baseline        0.384    0.565
              (Severyn et al., 2016)              0.594    0.786
              (Nguyen and Le Nguyen, 2018)        0.684    0.795
              in-domain CBOW                      0.714    0.784
              in-domain SKIP                      0.733    0.800
              in-domain SKIP neg samp             0.723    0.801
              in-domain GloVe                     0.697    0.779
              in-domain FastText                  0.727    0.779
              generic GoogleNews                  0.688    0.773
              generic GloVe Twitter               0.690    0.775
  Full        Most frequent label baseline        0.243    0.342
              (Severyn et al., 2016)              0.415    0.603
              (Nguyen and Le Nguyen, 2018)        0.538    0.613
              in-domain CBOW                      0.536    0.618
              in-domain SKIP                      0.547    0.621
              in-domain SKIP neg samp             0.558    0.629
              in-domain GloVe                     0.504    0.596
              in-domain FastText                  0.540    0.615
              generic GoogleNews                  0.504    0.580
              generic GloVe Twitter               0.487    0.600

Table 5: Italian embeddings results

  Task        Embeddings                          AUTO     TABLET
  Sentiment   Most frequent label baseline        0.601    0.668
              (Severyn et al., 2016)              0.616    0.644
              (Nguyen and Le Nguyen, 2018)        0.614    0.656
              in-domain CBOW                      0.622    0.700
              in-domain SKIP                      0.636    0.687
              in-domain SKIP neg samp             0.652    0.697
              in-domain GloVe                     0.607    0.673
              in-domain FastText                  0.640    0.645
              generic FastText                    0.648    0.682
              generic Wikipedia SKIP              0.629    0.701
              generic Wikipedia GloVe             0.613    0.679
  Type        Most frequent label baseline        0.415    0.568
              (Severyn et al., 2016)              0.707    0.773
              (Nguyen and Le Nguyen, 2018)        0.748    0.796
              in-domain CBOW                      0.742    0.710
              in-domain SKIP                      0.768    0.695
              in-domain SKIP neg samp             0.762    0.722
              in-domain GloVe                     0.744    0.676
              in-domain FastText                  0.703    0.703
              generic FastText                    0.769    0.716
              generic Wikipedia SKIP              0.756    0.682
              generic Wikipedia GloVe             0.725    0.694
  Full        Most frequent label baseline        0.320    0.252
              (Severyn et al., 2016)              0.456    0.524
              (Nguyen and Le Nguyen, 2018)        0.511    0.550
              in-domain CBOW                      0.470    0.484
              in-domain SKIP                      0.489    0.487
              in-domain SKIP neg samp             0.517    0.485
              in-domain GloVe                     0.450    0.490
              in-domain FastText                  0.459    0.484
              generic FastText                    0.491    0.497
              generic Wikipedia SKIP              0.492    0.495
              generic Wikipedia GloVe             0.441    0.449

4.1 Results with plain embeddings

The results obtained with plain embeddings are shown in Tables 4 and 5. Most of the in-domain embeddings on English outperform the GoogleNews vectors used by Nguyen and Le Nguyen (2018); the results are also higher than those reported in previous work with different splits (Severyn et al., 2016; Nguyen and Le Nguyen, 2018). Only on the two full tasks and the tablet type task do a few of the in-domain embeddings fail to outperform previously reported results. For Italian, not all in-domain embeddings outperform previous work in all tasks, but they mostly do when the embeddings used in previous work are tested on the same split. For both languages the skip-gram models perform best among the in-domain embedding models. On Italian, the generic Wikipedia SKIP embeddings and the generic FastText embeddings (Bojanowski et al., 2016) perform slightly better on the sentiment and full tasks for tablets.

4.2 Results with sentiment embeddings

Tables 6 and 7 show the results of the sentiment embeddings. In almost all tasks the sentiment embeddings outperform the plain embeddings. Surprisingly, this is true even for the English type task, while the English sentiment task on automobiles has a slightly lower accuracy. For Italian, only on the automobile type task do sentiment embeddings not outperform the standard ones. Among the sentiment embeddings, our refinement method seems to work best, while retrofitting does not lead to any improvement.

In terms of weighting versus averaging the vectors in our method, for English averaging yields the best score three times, and weighting twice. For Italian, weighting yields the best result twice on the tablet dataset, while for the full task averaging is better. For cars, weighting is better, but does not outperform plain embeddings.

Table 6: English sentiment embeddings results

  Task        Embeddings                                          AUTO     TABLET
  Sentiment   SKIP neg samp retrofitted                           0.701    0.751
              SKIP retrofitted                                    0.710    0.742
              SKIP sentiment embedding refinement                 0.725    0.747
              SKIP neg samp sentiment embedding refinement        0.725    0.753
              SKIP sentiment change average                       0.715    0.760
              SKIP sentiment change weight sum                    0.737    0.767
              SKIP neg samp sentiment change average              0.729    0.758
              SKIP neg samp sentiment change weight sum           0.734    0.749
  Type        SKIP neg samp retrofitted                           0.688    0.774
              SKIP retrofitted                                    0.680    0.781
              SKIP sentiment embedding refinement                 0.732    0.794
              SKIP neg samp sentiment embedding refinement        0.735    0.796
              SKIP sentiment change average                       0.723    0.806
              SKIP sentiment change weight sum                    0.716    0.798
              SKIP neg samp sentiment change average              0.722    0.807
              SKIP neg samp sentiment change weight sum           0.739    0.794
  Full        SKIP neg samp retrofitted                           0.500    0.600
              SKIP retrofitted                                    0.501    0.594
              SKIP sentiment embedding refinement                 0.537    0.594
              SKIP neg samp sentiment embedding refinement        0.522    0.606
              SKIP sentiment change average                       0.560    0.616
              SKIP sentiment change weight sum                    0.544    0.623
              SKIP neg samp sentiment change average              0.549    0.631
              SKIP neg samp sentiment change weight sum           0.547    0.618

Table 7: Italian sentiment embeddings results

  Task        Embeddings                                          AUTO     TABLET
  Sentiment   SKIP neg samp retrofitted                           0.649    0.682
              SKIP retrofitted                                    0.622    0.686
              SKIP sentiment embedding refinement                 0.610    0.682
              SKIP neg samp sentiment embedding refinement        0.632    0.703
              SKIP sentiment change average                       0.628    0.690
              SKIP sentiment change weight sum                    0.623    0.704
              SKIP neg samp sentiment change average              0.640    0.682
              SKIP neg samp sentiment change weight sum           0.631    0.710
  Type        SKIP neg samp retrofitted                           0.730    0.712
              SKIP retrofitted                                    0.744    0.712
              SKIP sentiment embedding refinement                 0.761    0.716
              SKIP neg samp sentiment embedding refinement        0.754    0.712
              SKIP sentiment change average                       0.763    0.701
              SKIP sentiment change weight sum                    0.746    0.729
              SKIP neg samp sentiment change average              0.760    0.732
              SKIP neg samp sentiment change weight sum           0.756    0.739
  Full        SKIP neg samp retrofitted                           0.478    0.447
              SKIP retrofitted                                    0.490    0.469
              SKIP sentiment embedding refinement                 0.504    0.497
              SKIP neg samp sentiment embedding refinement        0.466    0.500
              SKIP sentiment change average                       0.503    0.512
              SKIP sentiment change weight sum                    0.505    0.477
              SKIP neg samp sentiment change average              0.497    0.489
              SKIP neg samp sentiment change weight sum           0.485    0.497
5 Conclusion

We have explored the contribution of in-domain embeddings on the SenTube corpus, on two domains and two languages. In 10 out of the 12 tasks, in-domain embeddings outperform generic ones. This confirms the experiments on the SENTIPOLC 2016 tasks (Barbieri et al., 2016) reported by Petrolito and Dell'Orletta (2018), who recommend the use of in-domain embeddings for sentiment analysis, especially if trained at the word rather than character level. However, similar work in the field of sentiment analysis for software engineering texts, where in-domain (Stackoverflow) embeddings were compared to generic ones (GoogleNews), did not yield such clearcut results (Biswas et al., 2019).

We have also suggested a simple strategy to train sentiment embeddings, and shown that it outperforms other existing methods for this task. More in general, sentiment embeddings perform consistently better than plain embeddings for both languages in the "tablet" domain, but less evidently so in the automobile domain. The reason for this requires further investigation. Further testing is also necessary to assess the influence of vector size in our experiments. Indeed, not all embeddings are trained with the same dimensions, an aspect that might also affect performance differences, though the true impact of size is not yet fully understood (Yin and Shen, 2018).

In terms of different embedding types, it would also be interesting to compare our simple embedding refinement method, which takes specific contextual occurrences into account, with the performance of contextual word embeddings (Peters et al., 2018; Devlin et al., 2019), which work directly at the token rather than the type level. More complex training strategies could also be explored (Dong and De Melo, 2018).

Acknowledgments

We would like to thank the Center for Information Technology of the University of Groningen for providing access to the Peregrine high performance computing cluster which we used to run the experiments reported in this paper. We are also grateful to the reviewers for helpful comments.
References

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 sentiment polarity classification task (SENTIPOLC). In Proceedings of the 5th evaluation campaign of natural language processing and speech tools for Italian (EVALITA 2016).

Jeremy Barnes, Roman Klinger, and Sabine Schulte im Walde. 2017. Assessing state-of-the-art sentiment models on state-of-the-art sentiment datasets. arXiv preprint arXiv:1709.04219.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Valerio Basile, Nicole Novielli, Danilo Croce, Francesco Barbieri, Malvina Nissim, and Viviana Patti. 2018. Sentiment polarity classification at EVALITA: Lessons learned and open challenges. IEEE Transactions on Affective Computing.

Eeshita Biswas, K. Vijay-Shanker, and Lori Pollock. 2019. Exploring word embedding techniques to improve sentiment analysis of software engineering texts. In Proceedings of the 16th International Conference on Mining Software Repositories, pages 68–78. IEEE Press.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. CoRR, abs/1607.04606.

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, Paolo Rosso, et al. 2018. Overview of the EVALITA 2018 task on irony detection in Italian tweets (IronITA). In Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), volume 2263, pages 1–6. CEUR-WS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Xin Dong and Gerard De Melo. 2018. A helping hand: Transfer learning for deep sentiment analysis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2524–2534.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Svetlana Kiritchenko and Saif M. Mohammad. 2017. Capturing reliable fine-grained sentiment associations by crowdsourcing and best-worst scaling. arXiv preprint arXiv:1712.01741.

Bing Liu, Minqing Hu, and Junsheng Cheng. 2005. Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference on World Wide Web, pages 342–351. ACM.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR 2013.

Huy Tien Nguyen and Minh Le Nguyen. 2018. Multilingual opinion mining on YouTube – a convolutional N-gram BiLSTM word embedding. Information Processing & Management, 54(3):451–462.

Malvina Nissim and Viviana Patti. 2017. Semantic aspects in sentiment analysis. In Sentiment Analysis in Social Networks, pages 31–48. Elsevier.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 2227–2237.

Ruggero Petrolito and Felice Dell'Orletta. 2018. Word embeddings in sentiment analysis. In CLiC-it.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Aliaksei Severyn, Alessandro Moschitti, Olga Uryupina, Barbara Plank, and Katja Filippova. 2016. Multi-lingual opinion mining on YouTube. Information Processing & Management, 52(1):46–60.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1555–1565. Association for Computational Linguistics.

Tun Thura Thet, Jin-Cheon Na, and Christopher S.G. Khoo. 2010. Aspect-based sentiment analysis of movie reviews on discussion boards. Journal of Information Science, 36(6):823–848.

Olga Uryupina, Barbara Plank, Aliaksei Severyn, Agata Rotondi, and Alessandro Moschitti. 2014. SenTube: A corpus for sentiment analysis on YouTube social media. In LREC, pages 4244–4249.

Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. SemEval-2018 task 3: Irony detection in English tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 39–50.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.

Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems, pages 887–898.

Liang-Chih Yu, Jin Wang, K. Lai, and Xuejie Zhang. 2017. Refining word embeddings for sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 534–539.