Distant Supervision for Emotion Classification Task using emoji2emotion

Aisulu Rakhmetullina, Dietrich Trautmann, Georg Groh
Informatics Dept., Technical University of Munich, Garching, 85748
aisulu.rakhmetullina@tum.de, dietrich.trautmann@cs.tum.edu, grohg@in.tum.de

Abstract

The growing body of research on distant supervision for emotion detection requires a reliable mapping between noisy labels and emotion classes. We propose a method for the experimental creation of such a reliable mapping, based on manually annotated data and quantitative relations between labels and classes, demonstrated on the emoji-emotion pair in the form of an emoji2emotion mapping.

1 Introduction

The Japanese word emoji means "picture + character" and, contrary to what one might assume, has no semantic connection to the English word emotion. Nevertheless, emojis very often carry the emotional state of the writer. It is therefore no surprise that, as part of digital text, emojis have been exploited in various NLP studies on sentiment analysis and emotion classification.

In later work based on machine learning, emojis are most often used as noisy labels for distant supervision. However, the matching between an emoji and a sentiment or emotion class is usually done manually [WR16]. That approach is subjective and can lead to mismatches.

The goal of this work is to propose a more reliable, experimentally grounded method for matching emojis to classes. To evaluate our method, we apply it to emoji-to-emotion mapping. Since, to our knowledge, no such experimentally created mapping exists, we introduce a name for it: emoji2emotion.

There are different emotion classification models, both discrete and dimensional. In this work we chose Plutchik's wheel of emotions [Plu91], which combines characteristics of both model types. We use its eight main emotions, known as Plutchik's Eight (anger, anticipation, joy, trust, fear, surprise, sadness and disgust), shown in Figure 1.

Copyright © 2018 held by the author(s). Copying permitted for private and academic purposes. In: S. Wijeratne, E. Kiciman, H. Saggion, A. Sheth (eds.): Proceedings of the 1st International Workshop on Emoji Understanding and Applications in Social Media (Emoji2018), Stanford, CA, USA, 25-JUN-2018, published at http://ceur-ws.org

Figure 1: Plutchik's Wheel of Emotions with Plutchik's Eight highlighted [Plu91]

2 Related Work

One of the first attempts to characterize emojis by their sentiment load was the Emoji Sentiment Ranking, the first emoji sentiment lexicon (Figure 2). It was created by [NSSM15] and provides a mapping between the 751 most frequently used emojis and sentiment. Two of its insights are valuable for us: the majority of emojis are positive, especially the most popular ones, and among tweets with emojis, inter-annotator agreement tends to be higher.

In [ERA+16] the authors release emoji2vec, a set of pre-trained embeddings for all Unicode emojis, learned from the emoji descriptions in the Unicode emoji standard. This is one example of mapping emojis into another form that can be incorporated into machine learning tasks. More generally, representation learning and the use of pre-trained word embeddings are popular in natural language processing applications focused on social media.

In several works [BFMP13], [HBF+15], [JLL+14], [KZM14], emoticons were used to create a lexicon for later use in a knowledge-based approach to sentiment analysis or emotion detection. These works have in common the use of a large number of emoticon types, usually hundreds. Later machine-learning-based works, in contrast, use emoticons and emojis as noisy labels for distant supervision tasks; examples are [Rea05], [GBH09], [DTR10] and [ZDWX12].

The recent paper [FMS+17] presents a project called DeepMoji and shows that diversifying the set of noisy labels for distant supervision allows models to learn richer representations. The authors obtained state-of-the-art performance on 8 benchmark datasets for sentiment, emotion and sarcasm detection, which demonstrates the effectiveness of the noisy-label approach. Furthermore, their analyses confirm the assumption that a diversity of emotional labels improves performance compared to previous distant supervision methods.

3 Data Acquisition and Annotation

In this section, the creation of the manually annotated corpus is described in detail. First, the acquisition of data for annotation is explained in three steps: emoji list creation, tweet crawling and tweet preprocessing. Second, the annotation process is presented in another three steps: tweet filtering, annotation and averaging of label vectors, and analysis of the resulting corpus.

3.1 Data Acquisition

The first step in creating a corpus of emoji-containing tweets is to choose the list of emojis. To select the most popular emojis on Twitter and in online text in general, we consulted the Emojitracker project [etr13] as well as the Emoji Sentiment Ranking table [NSSM15]. By applying a threshold to each ranking (>100,000,000 occurrences for Emojitracker and >100 for the Emoji Sentiment Ranking), 31 emojis were picked from the first list and 50 from the second. We selected the emojis in the intersection of both lists and additionally handpicked some emojis that appeared in the top lists but not in the intersection. This produced a set of 43 emojis. We then calculated the distribution percentages for each source and averaged them; the average percentages were used to reproduce the same natural balance in our corpus.

The second step of corpus creation is data collection based on the results of the previous step. In this paper, we use easily accessible Twitter data, crawled with the help of the tweepy library. In total, 84,777 tweets containing emojis were crawled. It turned out that the vast majority of them (92.3%) contain only one emoji type, usually occurring exactly once (the average emoji count per tweet is 1.2). We therefore decided to focus on tweets with a single emoji type; after filtering out tweets with multiple emoji types or with emoji types outside our emoji list, 74,670 tweets remained for training purposes.

The last step in creating the corpus for labelling is tweet preprocessing, in which the raw tweets downloaded in the previous step are turned into ready tweets. To do so, the number of emoji types in a tweet is counted, as well as the number of occurrences of each emoji type present, and user tags, hashtags and URLs are replaced by placeholders.

3.2 Data Annotation

To start the annotation process, we picked 500 tweets subject to additional requirements intended to enhance the quality of the tweets to be annotated. The requirements were:

• The tweet contains no URLs or user tags. This is common practice in NLP and excludes meaningless parts of the text.
• The tweet contains no hashtags. Even though [DTR10] found hashtags useful for automated sentiment analysis, we decided to eliminate them to increase readability for annotators.
• The tweet contains from 5 to 15 words, so that tweets are neither too short nor too long.
• The tweet contains no more than 2 uppercase words, also for readability reasons.
• The tweet contains no unlemmatizable words (using spaCy's lemmatizer). This serves data purity as well as the understandability of the text for annotators.
• The tweet contains none of a set of keywords (a list compiled manually after a review of the corpus), in order to eliminate spam tweets.

After choosing these 500 tweets, 3 annotators were asked to evaluate them using a web interface we created. For each tweet they could choose an arbitrary number of emotions (including none) out of Plutchik's Eight and set an intensity value from 1 to 3 for each. The resulting labels were averaged under the rule that more than half of the annotators must agree on a label. The resulting corpus consists of 500 labelled tweets, where each label is a vector of size 8 containing the intensities of the 8 emotions.

In the annotated set nearly half of the tweets carry only one emotion type, and the other half a combination of them (up to 4 of the 8 at once), resulting in 1.1 emotions per tweet on average. The most prevalent emotion was joy, which appeared in 57% of tweets to some extent. The other emotions were far less frequent, each appearing in a quarter or less of the tweets.

Table 1 presents the statistics of the emotion and emotion-combination distributions over the dataset. For clarity, emotions and emotion combinations are grouped into positive, negative and neutral groups, under the assumption that joy and trust are positive; sadness, anger, disgust and fear are negative; and no emotion (neutral), anticipation and surprise are neutral. Combinations were assigned by their prevailing sentiment; in case of a tie between positive and negative emotions, the combination was placed in the neutral category.

The macro distribution shows that tweets with positive emotions dominate at about 60%, with negative and neutral tweets making up the rest. It was expected that positive tweets would be more frequent (as stated in [NSSM15]); nevertheless, the class distribution is quite imbalanced.
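The selection requirements above can be sketched as a simple filter. This is a minimal illustration under our own assumptions, not the authors' code: the regular expressions, the SPAM_KEYWORDS list and the function name are ours, and the spaCy lemmatization check is omitted to keep the sketch self-contained.

```python
import re

# Hypothetical spam keywords; the paper's actual list was built by hand
# after a review of the corpus and is not published.
SPAM_KEYWORDS = {"giveaway", "follow back", "free iphone"}

def passes_filters(text: str) -> bool:
    """Approximate the paper's tweet-selection requirements."""
    # No URLs or user tags (@mentions).
    if re.search(r"https?://\S+|@\w+", text):
        return False
    # No hashtags.
    if re.search(r"#\w+", text):
        return False
    words = text.split()
    # From 5 to 15 words.
    if not 5 <= len(words) <= 15:
        return False
    # At most 2 fully uppercase words.
    if sum(1 for w in words if w.isalpha() and w.isupper()) > 2:
        return False
    # No spam keywords.
    lowered = text.lower()
    if any(k in lowered for k in SPAM_KEYWORDS):
        return False
    return True
```

Note that the "no unlemmatizable words" requirement would additionally need a spaCy pipeline pass per tweet, which is why it is left out of this sketch.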
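The majority-agreement averaging described above can be sketched as follows. The paper states only that more than half of the annotators must agree on a label; encoding each annotator's choices as a vector of 0 (not selected) or intensity 1-3, and averaging the non-zero intensity votes of the agreeing annotators, are our assumptions.

```python
def aggregate_labels(annotations, n_emotions=8):
    """Combine per-annotator label vectors into one gold label vector.

    `annotations` is a list of per-annotator vectors of length
    `n_emotions`; entry 0 means "emotion not selected", 1-3 is the
    chosen intensity.  An emotion is kept only when a strict majority
    of annotators selected it; its intensity is then the mean of the
    non-zero votes (the intensity-merging rule is our assumption).
    """
    n = len(annotations)
    result = []
    for e in range(n_emotions):
        votes = [a[e] for a in annotations if a[e] > 0]
        if len(votes) * 2 > n:  # strict majority agrees on this emotion
            result.append(sum(votes) / len(votes))
        else:
            result.append(0.0)
    return result
```

With 3 annotators this keeps exactly the emotions that at least 2 of them selected, matching the "more than half" rule.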
4 Mapping emoji2emotion

Using the annotated dataset from the previous step, we computed the percentage of emoji occurrences per emotion and vice versa. To create a mapping, we checked every possible emoji-emotion pair against the following two conditions. First, the percentage with which the emoji appears in the subset of tweets carrying a certain emotion must be at least equal to the median of all such percentages. Second, the emotion must appear in the tweets containing that emoji at least half of the time. The resulting mapping is shown in Table 2.

Table 1: Distribution of positive, neutral and negative emotions across the resulting corpus

Table 2: Results of the emoji2emotion mapping

To evaluate the quality of the mapping, we use it to produce noisy labels for the emotion annotation subtask of SemEval 2007 task 14, Affective Text [sem07]. That task explores the connection between emotions and lexical semantics. Since the task is carried out in an unsupervised setting, only test data is provided: 1,000 short texts (news headlines) annotated with intensities for 6 emotions (Anger, Disgust, Fear, Happiness, Sadness, Surprise), i.e. Ekman's Six. Because 6 of Plutchik's Eight correspond to Ekman's Six, this data is compatible with ours. We therefore reduced the number of classes from 8 to 6 and labelled the 74,670 tweets from the Data Acquisition step with the emoji2emotion mapping to obtain training data. We used the coarse version of SemEval's test set and labelled our training set with binary vectors.

Table 3: Results of applying emoji2emotion to task 14 of SemEval 2007 [sem07]

To train our models, we turned the news headlines in the test set as well as the tweet texts in the training set into word embeddings using the word2vec methodology and the open source code of emoji2vec. We then fed these word embeddings together with the noisy labels to 4 classifiers (SGD, Naive Bayes, Random Forest and k-NN) from the scikit-learn library. Using the trained models we predicted an emotion category for each of the 1,000 test headlines. The resulting precision, recall and F1 scores are presented in Table 3; bold values mark the maxima, while green values outperform SemEval's best scores.

It is evident that the training data is imbalanced towards certain emotion categories, which we attribute to the number of emojis picked per emotion; the training results reflect that bias. Avoiding it requires a more balanced training set and, in turn, a more balanced mapping. To achieve that, more training data will be needed in the next run of the experiment, which we leave for future development of this work.

5 Findings and Contribution

We propose a method for experimentally mapping emojis to sentiment or emotion classes based on a special processing of manually annotated data. The processing consists of finding the quantitative relation between emoji and emotion in the form of a co-occurrence percentage and then thresholding it. To implement the method, we annotated a corpus of 500 emoji-containing tweets with the help of 3 human judges, and constructed the mapping from the averaged annotation labels as described above. Due to the significant imbalance of the emotion distribution across the dataset, the mapping was created for only 4 emotion categories, and it was evaluated by using it as a source of noisy labels for an emotion detection task on those 4 emotions. The results on the emotion detection task show that it is feasible to continue in this direction by increasing the size of the annotated corpus and further tuning the training parameters.

The resulting corpus of manually labelled emoji-containing tweets is shared online (https://github.com/Aisulu/emoji2emotion) for the benefit of the scientific community.
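The co-occurrence-and-thresholding procedure recapped above can be sketched as follows. This is a minimal reconstruction under our own assumptions: the data layout (one (emoji, emotion set) pair per annotated tweet) and the function name are ours, not the authors'.

```python
from statistics import median

def build_mapping(pairs):
    """Derive an emoji -> emotions mapping from annotated tweets.

    `pairs` holds one (emoji, emotion_set) entry per annotated tweet,
    where emotion_set contains the emotions assigned to that tweet.
    """
    # Count totals and emoji/emotion co-occurrences.
    emoji_total, emotion_total, co = {}, {}, {}
    for emoji, emotions in pairs:
        emoji_total[emoji] = emoji_total.get(emoji, 0) + 1
        for emo in emotions:
            emotion_total[emo] = emotion_total.get(emo, 0) + 1
            co[(emoji, emo)] = co.get((emoji, emo), 0) + 1

    # Condition 1: the emoji's share within an emotion's tweets must
    # reach the median of all such shares.
    shares = {pair: c / emotion_total[pair[1]] for pair, c in co.items()}
    med = median(shares.values())

    mapping = {}
    for (emoji, emo), share in shares.items():
        # Condition 2: the emotion must appear in at least half of the
        # tweets containing this emoji.
        if share >= med and co[(emoji, emo)] / emoji_total[emoji] >= 0.5:
            mapping.setdefault(emoji, set()).add(emo)
    return mapping
```

Only pairs passing both thresholds enter the mapping, which is why a skewed emotion distribution in the annotated corpus directly limits how many emotion categories can be mapped.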
6 Challenges and Limitations

After the annotation process it was evident to us that labelling for 8 classes with 3 intensity levels each places a high cognitive load on the annotators, taking 18 seconds per tweet on average. Even though we knew that an increase in the number of classes slows down labelling [BKT+13], the effect was stronger than expected and reduced the final corpus size. As a result, not all emotions were represented in the dataset in sufficient numbers, which effectively collapses the label space onto fewer classes.

7 Future Work

We aim to find a less time-consuming form of the annotation process for users, in order to increase the size of the manually annotated corpus. After that we plan to repeat the experimental procedure.

References

[BFMP13] Marina Boia, Boi Faltings, Claudiu-Cristian Musat, and Pearl Pu. A :) is worth a thousand words: How people attach sentiment to emoticons and words in tweets. In Proceedings of the 2013 International Conference on Social Computing, SOCIALCOM '13, pages 345–350, Washington, DC, USA, 2013. IEEE Computer Society.

[BKT+13] Michael Brooks, Katie Kuksenok, Megan K. Torkildson, Daniel Perry, John J. Robinson, Taylor J. Scott, Ona Anicello, Ariana Zukowski, Paul Harris, and Cecilia R. Aragon. Statistical affect detection in collaborative chat. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, CSCW '13, pages 317–328, New York, NY, USA, 2013. ACM.

[DTR10] Dmitry Davidov, Oren Tsur, and Ari Rappoport. Enhanced sentiment learning using twitter hashtags and smileys. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 241–249, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[ERA+16] Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bosnjak, and Sebastian Riedel. emoji2vec: Learning emoji representations from their description. CoRR, abs/1609.08359, 2016.

[etr13] Emojitracker, 2013.

[FMS+17] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.

[GBH09] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. 2009.

[HBF+15] Alexander Hogenboom, Danella Bal, Flavius Frasincar, Malissa Bal, Franciska De Jong, and Uzay Kaymak. Exploiting emoticons in polarity classification of text. J. Web Eng., 14(1-2):22–40, March 2015.

[JLL+14] Fei Jiang, Yiqun Liu, Huanbo Luan, Min Zhang, and Shaoping Ma. Microblog Sentiment Analysis with Emoticon Space Model, pages 76–87. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.

[KZM14] Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. Sentiment analysis of short informal texts. J. Artif. Int. Res., 50(1):723–762, May 2014.

[NSSM15] Petra Kralj Novak, Jasmina Smailovic, Borut Sluban, and Igor Mozetic. Sentiment of emojis. 2015.

[Plu91] R. Plutchik. The Emotions. University Press of America, 1991.

[Rea05] Jonathon Read. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop, ACLstudent '05, pages 43–48, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

[sem07] Affective text. SemEval task 14, 2007.

[WR16] I. D. Wood and S. Ruder. Emoji as emotion tags for tweets. In Proceedings of the Emotion and Sentiment Analysis Workshop, LREC 2016, Portorož, Slovenia, pages 76–79, 2016.

[ZDWX12] Jichang Zhao, Li Dong, Junjie Wu, and Ke Xu. Moodlens: An emoticon-based sentiment analysis system for chinese tweets. 2012.