The Contribution of Embeddings to Sentiment Analysis on YouTube

Moniek Nieuwenhuis                          Malvina Nissim
CLCG, University of Groningen               CLCG, University of Groningen
The Netherlands                             The Netherlands
m.l.nieuwenhuis@student.rug.nl              m.nissim@rug.nl

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We train a variety of embeddings on a large corpus of YouTube comments, and test them on three different tasks on both the English and the Italian portions of the SenTube corpus. We show that in-domain (YouTube) embeddings perform better than previously used generic embeddings, achieving state-of-the-art performance on most of the tasks. We also show that a simple method for creating sentiment-aware embeddings outperforms previous strategies, and that sentiment embeddings are more informative than plain embeddings for the SenTube tasks.

1 Introduction and Background

Sentiment analysis, or opinion mining, on social media is by now a well established task, though surely not a solved one (Liu et al., 2005; Barnes et al., 2017). Part of the difficulty comes from its intrinsically subjective nature, which makes creating reliable resources hard (Kiritchenko and Mohammad, 2017). Another part comes from its heavy interaction with pragmatic phenomena such as irony and world knowledge (Nissim and Patti, 2017; Basile et al., 2018; Cignarella et al., 2018; Van Hee et al., 2018). A further difficulty comes from the fact that, given a piece of text, be it a tweet or a review, it isn't always clear what exactly the expressed sentiment (should there be any) is about. In commercial reviews, for example, the target of a user's evaluation could be a specific aspect or part of a given product. Aspect-based sentiment analysis has developed as a subfield to address this problem (Thet et al., 2010; Pontiki et al., 2014).

The SenTube corpus (Uryupina et al., 2014) has been created along these lines. It contains English and Italian commercial or review videos about some product, together with annotated comments. The annotations specify both the polarity (positive, negative, neutral) and the target (the video itself or the product in the video). In Figure 1 we show two positive comments with different targets.

The SenTube tasks were first addressed by Severyn et al. (2016) with an SVM based on topic and shallow syntactic information, later outperformed by a convolutional N-gram BiLSTM word embedding model (Nguyen and Le Nguyen, 2018). The corpus has also served as a testbed for multiple state-of-the-art sentiment analysis methods (Barnes et al., 2017), with the best results obtained using sentiment-specific word embeddings (Tang et al., 2014). On the English sentiment task of SenTube, though, this method does not outperform corpus-specific approaches (Severyn et al., 2016; Nguyen and Le Nguyen, 2018).

We further explore the potential of (sentiment) embeddings, using the model developed by Nguyen and Le Nguyen (2018). We believe that training in-domain (YouTube) embeddings rather than using generic ones might yield improvements, and that additional gains might come from sentiment-aware embeddings. In this context, we propose a simple new semi-supervised method to train sentiment embeddings and show that it performs better than two other existing ones. We run all experiments on English and Italian data.

Contributions We show that in-domain embeddings outperform generic embeddings on most tasks of the SenTube corpus for both Italian and English. We also show that sentiment embeddings obtained through a simple semi-supervised strategy that we newly introduce in this paper add a boost to performance. We make all developed Italian and English embeddings available at this link: https://github.com/malvinanissim/youtube-embeds.
Figure 1: Two sample comments on a video about a Ferrari car. Top: positive comment about the product. Bottom: positive comment about the video.

2 Data and Task

We use two different datasets of YouTube comments. The first is the existing SenTube corpus (Uryupina et al., 2014). The other dataset is collected from YouTube to create a large semi-supervised corpus for training the embeddings.

2.1 SenTube corpus

The SenTube corpus contains 217 videos in English and 198 in Italian (Uryupina et al., 2014). All videos are reviews of or commercials for a product in the category "automobile" or "tablet".

All comments on the videos are annotated according to their target (whether they are about the video or about the product) and their sentiment polarity (positive, negative, neutral). Some of the comments were discarded because they were spam, because they were written in a language other than the intended one (Italian for the Italian corpus, English for the English one), or because they were simply off topic. Sentiment is type-specific, and the following labels are used: positive-product, negative-product, positive-video and negative-video. If neither positive nor negative is annotated, the comment is assumed to be neutral.

The corpus lends itself to three different tasks, all of which we tackle in this work:

• the sentiment task, namely predicting whether a YouTube comment expresses a positive, negative or neutral sentiment;
• the type task, namely predicting whether the comment is about the product mentioned in the video, about the video itself, or is not an informative comment (spam or off-topic);
• the full task, namely predicting the sentiment and the type of each comment at the same time.

From SenTube we exclude any comment that is annotated both as product-related and video-related, or as both positive and negative. Table 1 shows the label distribution for the three tasks. All comments are further lowercased and tokenised.

Table 1: Label distribution for each task in the SenTube corpus

                           English                               Italian
                           Automobile    %     Tablet      %     Automobile    %     Tablet     %
  Type task
    Product-related           5,834    38.8    11,067    56.2       1,718    40.9     2,976   61.0
    Video-related             5,201    34.5     3,665    18.6       1,317    31.4       845   17.3
    Uninformative             4,020    26.7     4,961    25.2       1,161    27.7     1,055   21.6
  Sentiment task
    Positive sentiment        3,284    21.8     3,637    18.5         946    22.5       770   15.8
    Negative sentiment        1,988    13.2     3,038    15.4         752    17.9       825   16.9
    No sentiment/neutral      9,801    65.0    13,021    66.1       2,499    59.5     3,281   67.3
  Full task
    Product-positive          1,740    11.5     2,280    11.6         479    11.4       544   11.4
    Product-negative          1,360     9.0     2,473    12.5         538    12.8       711   14.6
    Product-neutral           2,744    18.2     6,310    32.0         703    16.8     1,721   35.3
    Video-positive            1,543    10.2     1,357     6.9         467    11.1       226    4.6
    Video-negative              628     4.2       565     2.9         214     5.1       114    2.3
    Video-neutral             3,030    20.1     1,743     8.8         635    15.1       505   10.4
    Uninformative             4,028    26.7     4,968    25.2       1,161    27.7     1,055   21.6

2.2 Semi-supervised YouTube corpus

To train in-domain embeddings we collected more data from YouTube. We searched for relevant videos by querying the YouTube API with a set of keywords ("car", "tablet", "macchina", "automobile", ...). For each retrieved video we checked that it was not already included in the SenTube corpus, and verified that its description was in English/Italian using Python's langdetect module. We then retrieved all comments for each video that had more than one comment.

Next, we used the convolutional N-gram BiLSTM word embedding model by Nguyen and Le Nguyen (2018), which has state-of-the-art performance on SenTube, to label the data on the sentiment task, as we want to exploit the labels to train sentiment embeddings. Table 2 shows an overview of the collected dataset. A manual check on a randomly chosen test set of 100 comments for each language revealed a rough accuracy of just under 60% for English, and just under 65% for Italian.
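As an illustration of the language filter described above, the following is a minimal sketch of the filtering step, not the authors' actual collection script. It assumes that candidate videos have already been retrieved from the YouTube Data API as (video id, description) pairs, and that the identifiers of the SenTube videos are known; all names in the snippet are illustrative.

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def filter_candidate_videos(candidates, sentube_ids, target_lang):
    """Keep videos whose description is in the target language ('en' or 'it')
    and which are not already part of the SenTube corpus."""
    kept = []
    for video_id, description in candidates:
        if video_id in sentube_ids:          # skip videos already in SenTube
            continue
        try:
            if detect(description) == target_lang:
                kept.append(video_id)
        except LangDetectException:          # empty or undetectable description
            continue
    return kept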
Table 2: Overview of extra data collected from YouTube

                          English                                     Italian
                          Automobile      Tablet          Total       Automobile     Tablet          Total
  Videos                       1,592       1,675          3,267            1,622      1,151          2,773
  Comments                 1,028,136     587,506      1,615,642           99,328    118,274        217,602
  Tokens                  18,124,184   9,156,324     27,280,508        1,596,190  1,579,591      3,175,781
  Unique tokens              754,962     416,835      1,030,574          170,956    155,738        277,114
  Positive sentiment         165,725      97,439        263,164 (16.3%)   11,091     13,356     24,447 (11.2%)
  Negative sentiment          49,490      53,557        103,047 (6.4%)     4,898      4,514      9,412 (4.3%)
  Neutral sentiment          812,921     436,510      1,249,431 (77.3%)   83,339    100,404    183,743 (84.4%)

3 Embeddings

We test three different categories of embeddings: some pre-trained models, a variety of models trained on our in-domain dataset, and sentiment-aware embeddings, which we obtain in three different ways. All of the embeddings are tested in the model developed by Nguyen and Le Nguyen (2018) to specifically tackle the SenTube tasks.

3.1 Plain Embeddings

Generic models For English we used the GoogleNews vectors (https://code.google.com/archive/p/word2vec/), which are those used by Nguyen and Le Nguyen (2018), and the 200-dimensional GloVe Twitter embeddings (https://nlp.stanford.edu/projects/glove/). For Italian we used the vectors from Bojanowski et al. (2016), a FastText model trained on the Italian Wikipedia, which is also the model used by Nguyen and Le Nguyen (2018). Furthermore, we tested two models developed at ISTI-CNR (http://hlt.isti.cnr.it/wordembeddings/), which are trained on the Italian Wikipedia with skip-gram Word2Vec and with GloVe.

In-domain trained models We trained three Word2Vec models (Mikolov et al., 2013), all of dimension 300, using Gensim (Řehůřek and Sojka, 2010). Besides a CBOW model with default settings, we trained two different skip-gram models, one with default settings and one with a negative sampling of 10. We also trained a FastText model (Bojanowski et al., 2016), and a 100-dimensional GloVe model (Pennington et al., 2014).
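The in-domain models described above can be reproduced with a few lines of Gensim. The sketch below is ours and assumes Gensim 4.x and a tokenised, lowercased comment corpus; hyperparameters other than dimensionality, skip-gram and negative sampling are library defaults and are not necessarily those used in the paper.

from gensim.models import Word2Vec, FastText

def train_in_domain(comments):
    # `comments` is the in-domain corpus: a list of token lists, one per YouTube comment.
    models = {
        "cbow": Word2Vec(sentences=comments, vector_size=300, sg=0),        # CBOW, default settings
        "skip": Word2Vec(sentences=comments, vector_size=300, sg=1),        # skip-gram, default settings
        "skip_neg10": Word2Vec(sentences=comments, vector_size=300,
                               sg=1, negative=10),                          # skip-gram, negative sampling of 10
        "fasttext": FastText(sentences=comments, vector_size=300, sg=1),    # subword-aware model
    }
    # The 100-dimensional GloVe model is trained separately with the Stanford GloVe tool.
    return models

A word vector is then available as, for example, models["skip"].wv["car"].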
3.2 Sentiment-aware Embeddings

We use three methods for adding sentiment to the embeddings, in all cases applying them to the Word2Vec skip-gram models (Mikolov et al., 2013), with and without negative sampling of 10. The first two methods are existing ones, namely retrofitting (Faruqui et al., 2015) and the refinement method suggested by Yu et al. (2017), while the third method is newly proposed in this work.

Retrofitting Retrofitting embedding models is a method to refine vector space representations using relational information from semantic lexicons, by encouraging linked words to have similar vector representations (Faruqui et al., 2015; https://github.com/mfaruqui/retrofitting). We used two sentiment lexicons to retrofit the skip-gram models: a SentiWordNet-derived lexicon for English (Baccianella et al., 2010), and Sentix for Italian (Basile and Nissim, 2013; http://valeriobasile.github.io/twita/sentix.html).

Sentiment embedding refinement We tested the method proposed by Yu et al. (2017), using the provided code (https://github.com/wangjin0818/word_embedding_refine) to refine our own skip-gram Word2Vec models. In this method, the top-k most similar words of a target word are re-ranked by sentiment, based on the difference in valence scores taken from a sentiment lexicon. For English we used the E-ANEW sentiment lexicon (Warriner et al., 2013) and for Italian we used Sentix (Basile and Nissim, 2013).

Our embedding refinement For each language, we use a sentiment lexicon and our YouTube corpus to train sentiment embeddings. From the sentiment lexicon we create two lists of words: positive words (positive score > 0.6 and negative score < 0.2) and negative words (negative score > 0.6 and positive score < 0.2).

For each word in the positive list, we check whether it occurs in a comment with a positive label; we do the same for the negative list and negatively labelled comments. If it does, we add the affix "_pos" or "_neg" to that occurrence of the word in the positive or negative comment. If a word from the positive list is found in a comment with a negative or neutral label it is left untouched, and likewise for words in the negative list. An example of this approach is shown in Table 3.

Table 3: Example of the word "love" changed in the positive comment and not changed in neutral or negative comments.

  Example                                                                            Label
  "I love_pos this review! It's not the technical review that every YouTube vid     positive
   has bit more of a usable hands on one! makes me really_pos want one even
   more than before! Thank you!"
  "I love being a cheapskate. Please tell me what in the world "gimp" is."          neutral
  "I don't understand why people love apple shit [...]"                             negative

We then trained the embeddings with skip-gram Word2Vec (Mikolov et al., 2013) on the corpus containing the two separate appearances of each such word, i.e. with and without affixes. This of course poses a problem at test time, since two vectors are now available for some of the words (great_pos and great for "great", for example, or brutto_neg and brutto for "brutto" [en: ugly]), but one must eventually choose a single vector to represent the encountered word "great", or "brutto".

Instead of devising a strategy for choosing one of the two vectors, we opted for re-joining the two versions of the word into a single one, testing two different methods (a sketch of both follows the list):

• averaging: average the vectors with each other; the two contexts have equal weight;
• weighting: weigh each vector by the proportion of times the word occurs in either context (in the semi-supervised corpus), and sum them.
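The following is a minimal sketch of this refinement pipeline under the assumptions just described (lexicon thresholds, affixing, re-joining); function and variable names are illustrative and not taken from the authors' code. The affixed corpus is then fed to the Gensim skip-gram training shown earlier.

def build_polarity_lists(lexicon):
    # `lexicon` is assumed to map a word to a (positive_score, negative_score) pair.
    pos_words = {w for w, (p, n) in lexicon.items() if p > 0.6 and n < 0.2}
    neg_words = {w for w, (p, n) in lexicon.items() if n > 0.6 and p < 0.2}
    return pos_words, neg_words

def affix_tokens(tokens, label, pos_words, neg_words):
    # Mark lexicon words only when the comment label agrees with their polarity.
    out = []
    for tok in tokens:
        if label == "positive" and tok in pos_words:
            out.append(tok + "_pos")
        elif label == "negative" and tok in neg_words:
            out.append(tok + "_neg")
        else:
            out.append(tok)
    return out

def rejoin(model, word, affix, count_plain, count_affixed, method="weighting"):
    # Merge the plain and affixed vectors of a word into a single test-time vector.
    v_plain = model.wv[word]
    v_affixed = model.wv[word + affix]
    if method == "averaging":                   # equal weight for both contexts
        return (v_plain + v_affixed) / 2
    total = count_plain + count_affixed         # proportion-weighted sum
    return (count_plain / total) * v_plain + (count_affixed / total) * v_affixed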
4 Experiments

We split the SenTube corpus into 50% train and 50% test. We could not exactly replicate the split by Nguyen and Le Nguyen (2018) due to a lack of sufficient details in their code. We use their model to test all embeddings, including those used in their implementation (GoogleNews for English, and FastText for Italian), for a direct comparison with our embeddings. For completeness, we also include the results reported by Severyn et al. (2016) (with their own split), and a most frequent label baseline for each task. As was done in previous work on this corpus, and for a more direct comparison, we report accuracy across all experiments.
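For reference, the most frequent label baseline amounts to always predicting the majority class of the training split; a minimal sketch of ours (not the evaluation code used in the paper) is:

from collections import Counter

def most_frequent_label_accuracy(train_labels, test_labels):
    # Accuracy obtained by always predicting the most frequent training label.
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)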
Table 4: English embeddings results

  Task        Embeddings                          AUTO     TABLET
  Sentiment   Most frequent label baseline        0.632    0.680
              (Severyn et al., 2016)              0.557    0.705
              (Nguyen and Le Nguyen, 2018)        0.669    0.702
              in-domain CBOW                      0.725    0.755
              in-domain SKIP                      0.740    0.750
              in-domain SKIP neg samp             0.730    0.756
              in-domain GloVe                     0.709    0.754
              in-domain FastText                  0.729    0.754
              generic GoogleNews                  0.715    0.748
              generic GloVe Twitter               0.723    0.742
  Type        Most frequent label baseline        0.384    0.565
              (Severyn et al., 2016)              0.594    0.786
              (Nguyen and Le Nguyen, 2018)        0.684    0.795
              in-domain CBOW                      0.714    0.784
              in-domain SKIP                      0.733    0.800
              in-domain SKIP neg samp             0.723    0.801
              in-domain GloVe                     0.697    0.779
              in-domain FastText                  0.727    0.779
              generic GoogleNews                  0.688    0.773
              generic GloVe Twitter               0.690    0.775
  Full        Most frequent label baseline        0.243    0.342
              (Severyn et al., 2016)              0.415    0.603
              (Nguyen and Le Nguyen, 2018)        0.538    0.613
              in-domain CBOW                      0.536    0.618
              in-domain SKIP                      0.547    0.621
              in-domain SKIP neg samp             0.558    0.629
              in-domain GloVe                     0.504    0.596
              in-domain FastText                  0.540    0.615
              generic GoogleNews                  0.504    0.580
              generic GloVe Twitter               0.487    0.600

Table 5: Italian embeddings results

  Task        Embeddings                          AUTO     TABLET
  Sentiment   Most frequent label baseline        0.601    0.668
              (Severyn et al., 2016)              0.616    0.644
              (Nguyen and Le Nguyen, 2018)        0.614    0.656
              in-domain CBOW                      0.622    0.700
              in-domain SKIP                      0.636    0.687
              in-domain SKIP neg samp             0.652    0.697
              in-domain GloVe                     0.607    0.673
              in-domain FastText                  0.640    0.645
              generic FastText                    0.648    0.682
              generic Wikipedia SKIP              0.629    0.701
              generic Wikipedia GloVe             0.613    0.679
  Type        Most frequent label baseline        0.415    0.568
              (Severyn et al., 2016)              0.707    0.773
              (Nguyen and Le Nguyen, 2018)        0.748    0.796
              in-domain CBOW                      0.742    0.710
              in-domain SKIP                      0.768    0.695
              in-domain SKIP neg samp             0.762    0.722
              in-domain GloVe                     0.744    0.676
              in-domain FastText                  0.703    0.703
              generic FastText                    0.769    0.716
              generic Wikipedia SKIP              0.756    0.682
              generic Wikipedia GloVe             0.725    0.694
  Full        Most frequent label baseline        0.320    0.252
              (Severyn et al., 2016)              0.456    0.524
              (Nguyen and Le Nguyen, 2018)        0.511    0.550
              in-domain CBOW                      0.470    0.484
              in-domain SKIP                      0.489    0.487
              in-domain SKIP neg samp             0.517    0.485
              in-domain GloVe                     0.450    0.490
              in-domain FastText                  0.459    0.484
              generic FastText                    0.491    0.497
              generic Wikipedia SKIP              0.492    0.495
              generic Wikipedia GloVe             0.441    0.449

4.1 Results with plain embeddings

The results obtained with plain embeddings are shown in Tables 4 and 5. Most of the in-domain embeddings on English outperform the GoogleNews vectors used by Nguyen and Le Nguyen (2018); the results are also higher than those reported in previous work with different splits (Severyn et al., 2016; Nguyen and Le Nguyen, 2018). Only on the two full tasks and the tablet type task do a few of the in-domain embeddings fail to outperform previously reported results. For Italian, not all in-domain embeddings outperform previous work in all tasks, but they mostly do when the embeddings used in previous work are tested on the same split. For both languages the skip-gram models perform best among the in-domain embedding models. On Italian, the generic Wikipedia SKIP embeddings and the generic FastText embeddings (Bojanowski et al., 2016) perform slightly better on the sentiment and full tasks for tablets.

4.2 Results with sentiment embeddings

Tables 6 and 7 show the results of the sentiment embeddings. In almost all tasks the sentiment embeddings outperform the plain embeddings. Surprisingly, this is true even for the English type task, while the English sentiment task on automobiles has a slightly lower accuracy. For Italian, only on the automobile type task do sentiment embeddings not outperform the standard ones. Among the sentiment embeddings, our refinement method seems to work best, while retrofitting does not lead to any improvement.

In terms of weighting versus averaging the vectors in our method, for English averaging yields the best score three times, and weighting twice. For Italian, weighting yields the best result twice on the tablet dataset, while for the full task averaging is better. For cars, weighting is better, but does not outperform plain embeddings.

Table 6: English sentiment embeddings results

  Task        Embeddings                                          AUTO     TABLET
  Sentiment   SKIP neg samp retrofitted                           0.701    0.751
              SKIP retrofitted                                    0.710    0.742
              SKIP sentiment embedding refinement                 0.725    0.747
              SKIP neg samp sentiment embedding refinement        0.725    0.753
              SKIP sentiment change average                       0.715    0.760
              SKIP sentiment change weight sum                    0.737    0.767
              SKIP neg samp sentiment change average              0.729    0.758
              SKIP neg samp sentiment change weight sum           0.734    0.749
  Type        SKIP neg samp retrofitted                           0.688    0.774
              SKIP retrofitted                                    0.680    0.781
              SKIP sentiment embedding refinement                 0.732    0.794
              SKIP neg samp sentiment embedding refinement        0.735    0.796
              SKIP sentiment change average                       0.723    0.806
              SKIP sentiment change weight sum                    0.716    0.798
              SKIP neg samp sentiment change average              0.722    0.807
              SKIP neg samp sentiment change weight sum           0.739    0.794
  Full        SKIP neg samp retrofitted                           0.500    0.600
              SKIP retrofitted                                    0.501    0.594
              SKIP sentiment embedding refinement                 0.537    0.594
              SKIP neg samp sentiment embedding refinement        0.522    0.606
              SKIP sentiment change average                       0.560    0.616
              SKIP sentiment change weight sum                    0.544    0.623
              SKIP neg samp sentiment change average              0.549    0.631
              SKIP neg samp sentiment change weight sum           0.547    0.618

Table 7: Italian sentiment embeddings results

  Task        Embeddings                                          AUTO     TABLET
  Sentiment   SKIP neg samp retrofitted                           0.649    0.682
              SKIP retrofitted                                    0.622    0.686
              SKIP sentiment embedding refinement                 0.610    0.682
              SKIP neg samp sentiment embedding refinement        0.632    0.703
              SKIP sentiment change average                       0.628    0.690
              SKIP sentiment change weight sum                    0.623    0.704
              SKIP neg samp sentiment change average              0.640    0.682
              SKIP neg samp sentiment change weight sum           0.631    0.710
  Type        SKIP neg samp retrofitted                           0.730    0.712
              SKIP retrofitted                                    0.744    0.712
              SKIP sentiment embedding refinement                 0.761    0.716
              SKIP neg samp sentiment embedding refinement        0.754    0.712
              SKIP sentiment change average                       0.763    0.701
              SKIP sentiment change weight sum                    0.746    0.729
              SKIP neg samp sentiment change average              0.760    0.732
              SKIP neg samp sentiment change weight sum           0.756    0.739
  Full        SKIP neg samp retrofitted                           0.478    0.447
              SKIP retrofitted                                    0.490    0.469
              SKIP sentiment embedding refinement                 0.504    0.497
              SKIP neg samp sentiment embedding refinement        0.466    0.500
              SKIP sentiment change average                       0.503    0.512
              SKIP sentiment change weight sum                    0.505    0.477
              SKIP neg samp sentiment change average              0.497    0.489
              SKIP neg samp sentiment change weight sum           0.485    0.497
5 Conclusion

We have explored the contribution of in-domain embeddings on the SenTube corpus, on two domains and two languages. In 10 out of the 12 tasks, in-domain embeddings outperform generic ones. This confirms the experiments on the SENTIPOLC 2016 tasks (Barbieri et al., 2016) reported by Petrolito and Dell'Orletta (2018), who recommend the use of in-domain embeddings for sentiment analysis, especially if trained at the word rather than character level. However, similar work in the field of sentiment analysis for software engineering texts, where in-domain (Stackoverflow) embeddings were compared to generic ones (GoogleNews), did not yield such clearcut results (Biswas et al., 2019).

We have also suggested a simple strategy to train sentiment embeddings, and shown that it outperforms other existing methods for this task. More in general, sentiment embeddings perform consistently better than plain embeddings for both languages in the "tablet" domain, but less evidently so in the automobile domain. The reason for this requires further investigation. Further testing is also necessary to assess the influence of vector size in our experiments. Indeed, not all embeddings are trained with the same dimensions, an aspect that might also affect performance differences, though the true impact of size is not yet fully understood (Yin and Shen, 2018).

In terms of different embedding types, it would also be interesting to compare our simple embedding refinement method, which takes specific contextual occurrences into account, with the performance of contextual word embeddings (Peters et al., 2018; Devlin et al., 2019), which work directly at the token rather than the type level. More complex training strategies could also be explored (Dong and De Melo, 2018).

Acknowledgments

We would like to thank the Center for Information Technology of the University of Groningen for providing access to the Peregrine high performance computing cluster which we used to run the experiments reported in this paper. We are also grateful to the reviewers for helpful comments.
References

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 sentiment polarity classification task (SENTIPOLC). In Proceedings of the 5th evaluation campaign of natural language processing and speech tools for Italian (EVALITA 2016).

Jeremy Barnes, Roman Klinger, and Sabine Schulte im Walde. 2017. Assessing state-of-the-art sentiment models on state-of-the-art sentiment datasets. arXiv preprint arXiv:1709.04219.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107.

Valerio Basile, Nicole Novielli, Danilo Croce, Francesco Barbieri, Malvina Nissim, and Viviana Patti. 2018. Sentiment polarity classification at EVALITA: Lessons learned and open challenges. IEEE Transactions on Affective Computing.

Eeshita Biswas, K. Vijay-Shanker, and Lori Pollock. 2019. Exploring word embedding techniques to improve sentiment analysis of software engineering texts. In Proceedings of the 16th International Conference on Mining Software Repositories, pages 68–78. IEEE Press.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. CoRR, abs/1607.04606.

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, Paolo Rosso, et al. 2018. Overview of the EVALITA 2018 task on irony detection in Italian tweets (IronITA). In Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), volume 2263, pages 1–6. CEUR-WS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Xin Dong and Gerard De Melo. 2018. A helping hand: Transfer learning for deep sentiment analysis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2524–2534.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Svetlana Kiritchenko and Saif M. Mohammad. 2017. Capturing reliable fine-grained sentiment associations by crowdsourcing and best-worst scaling. arXiv preprint arXiv:1712.01741.

Bing Liu, Minqing Hu, and Junsheng Cheng. 2005. Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference on World Wide Web, pages 342–351. ACM.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR 2013.

Huy Tien Nguyen and Minh Le Nguyen. 2018. Multilingual opinion mining on YouTube – a convolutional N-gram BiLSTM word embedding. Information Processing & Management, 54(3):451–462.

Malvina Nissim and Viviana Patti. 2017. Semantic aspects in sentiment analysis. In Sentiment Analysis in Social Networks, pages 31–48. Elsevier.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 2227–2237.

Ruggero Petrolito and Felice Dell'Orletta. 2018. Word embeddings in sentiment analysis. In CLiC-it.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Aliaksei Severyn, Alessandro Moschitti, Olga Uryupina, Barbara Plank, and Katja Filippova. 2016. Multi-lingual opinion mining on YouTube. Information Processing & Management, 52(1):46–60.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1555–1565. Association for Computational Linguistics.

Tun Thura Thet, Jin-Cheon Na, and Christopher S.G. Khoo. 2010. Aspect-based sentiment analysis of movie reviews on discussion boards. Journal of Information Science, 36(6):823–848.

Olga Uryupina, Barbara Plank, Aliaksei Severyn, Agata Rotondi, and Alessandro Moschitti. 2014. SenTube: A corpus for sentiment analysis on YouTube social media. In LREC, pages 4244–4249.

Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. SemEval-2018 task 3: Irony detection in English tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 39–50.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.

Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems, pages 887–898.

Liang-Chih Yu, Jin Wang, K. Lai, and Xuejie Zhang. 2017. Refining word embeddings for sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 534–539.