
Learning Emoji Embeddings using Emoji Co-occurrence Network Graph

Anurag Illendula
Department of Mathematics, IIT Kharagpur
aianurag09@iitkgp.ac.in

Manish Reddy Yedulla
Department of Engineering Science, IIT Hyderabad
es15btech11012@iith.ac.in

2018

23 26

The use of emoji on social media platforms has increased rapidly over the last few years. The majority of social media posts are laden with emoji, and users often use more than one emoji in a single post to express their emotions and to emphasize certain words in a message. Emoji co-occurrence can therefore help us understand how emoji are used in social media posts and what they mean in that context. In this paper, we investigate whether emoji co-occurrences can be used as a feature to learn emoji embeddings, which can be used in many downstream applications such as sentiment analysis and emotion identification in social media text. We utilize 147 million tweets that contain emojis and build an emoji co-occurrence network. Then, we train a network embedding model to embed emojis into a low dimensional vector space. We evaluate our embeddings using sentiment analysis and emoji similarity experiments, and experimental results show that our embeddings outperform the current state-of-the-art results for sentiment analysis tasks.

1 Introduction

Emojis are the 21st century's successor to the emoticon. They arose from the need to communicate body language and facial expressions during text conversations. They are two-dimensional visual embodiments of everyday aspects of life which were standardized by the Unicode Consortium in 2010 as part of Unicode 6.0. Emoji have proliferated throughout the globe and have particularly become a part of popular culture in the West. They have been adopted by almost all social media platforms and messaging services. Emojis serve many purposes during online communication, among which conveying emotion is one of the primary uses. According to the latest statistics released by Emojipedia in June 2017, the number of emojis has increased to 2,666, posing challenges to applications that list them on small hand-held devices such as mobile phones. To overcome this challenge, emoji keyboards on most smartphones group emoji into the categories listed in Table 1.

Many recent Natural Language Processing (NLP) systems rely on word representations in a finite-dimensional vector space. These NLP systems mainly use pre-trained word embeddings obtained from word2vec [MSC+13], GloVe [PSM14], or fastText [BGJM16]. Earlier, most NLP systems were trained with GloVe embeddings, but fastText embeddings achieve higher accuracies on social media data because the fastText model learns sub-word information. Emoji embeddings have been of fundamental importance in improving the accuracies of many emoji understanding tasks. Recent research has shown that emoji embeddings can enhance the performance of emoji prediction [FMS+17, BBS17], emoji similarity [WBSD17b], and emoji sense disambiguation tasks [WBSD17a, SPWT17]. These emoji representations have also been efficient in capturing the behavior of emojis in different contexts. The need to learn emoji representations for improving the performance of social NLP systems has been recognized by Eisner et al. [ERA+16] and Barbieri et al. [BRS16], among others, who used traditional approaches, including the skip-gram and CBOW models, to learn emoji embeddings.

Information networks such as publication networks and the World Wide Web are characterized by the interplay between various content and a sophisticated underlying knowledge structure. Graph embedding models help distill information from large-scale information networks and embed it into a finite-dimensional vector space, and these embeddings have shown great success in tasks such as node classification [BCM11], link prediction [LNK07], and classification [YRS+14]. These graph embedding models have also enhanced the performance of word similarity and word analogical reasoning tasks using language networks [TQW+15]. The analysis of emoji co-occurrence network graphs can help us understand emojis from different perspectives. We hypothesize that emojis which co-occur in a tweet carry the same sentiment as the overall sentiment of the tweet. Consider the tweet "I got betrayed by , I want to kill you ": here both the emojis , carry negative sentiment, and the overall sentiment of the tweet is also negative. Hence we investigate whether emoji co-occurrence could be a better feature for learning emoji representations that improve the accuracy of classification tasks. In this paper, we introduce an approach to learn emoji representations using an emoji co-occurrence network graph and a large-scale information network embedding model, and we evaluate our embeddings using the gold-standard dataset for the sentiment analysis task.

This paper is organized as follows. Section 2 discusses the related work done by other researchers in the field of emoji understanding and learning network representations. Section 3 discusses the process of creating an emoji co-occurrence network using our Twitter corpus. Section 4 explains our model architecture for learning emoji representations from the emoji co-occurrence network graph. Section 5 reports the accuracies obtained by our emoji embeddings on the gold-standard dataset for the sentiment analysis task and on emoji similarity tasks. We discuss the reason behind the high accuracies obtained for the sentiment analysis task in Section 6, followed by plans for future work in Section 7.

2 Related Work

One exciting line of work by Wijeratne et al. [WBSD16, WBSD17a] in the field of emoji understanding is EmojiNet (http://emojinet.knoesis.org/home.php), the largest machine-readable emoji sense inventory, which helps computers understand emojis. In this work, Wijeratne et al. connected emojis and their senses to the corresponding words in BabelNet ([NP12]) using their respective BabelNet IDs. EmojiNet opened the door to many emoji understanding tasks such as emoji similarity, emoji prediction, and emoji sense disambiguation.

Another interesting work by Wijeratne et al. [WBSD17b] addressed the challenge of measuring emoji similarity using the semantics of emoji. They defined two types of semantic embeddings using the textual senses and the textual descriptions of emojis. Prior work by Barbieri et al. ([BRS16]) and Eisner et al. ([ERA+16]) used traditional approaches to learn emoji embeddings. The semantic embeddings achieved accuracies that outperformed the previous state-of-the-art results in the sentiment analysis task; this high accuracy is attributed to the fact that semantic embeddings can capture syntactic, semantic, and sentiment features of emojis.

Seyednezhad et al. ([SM17]) created a network using the emoji co-occurrences in the same tweet; they claim that each edge weight can help us understand the context in which a user uses multiple emojis. This emoji network also enabled them to explain the use of co-occurring emojis from different perspectives and to understand emoji usage through the possible relations between these special characters in common text. Fede et al. ([FHSM17]) studied different characteristics of this emoji co-occurrence network graph, including how users employ sequences of emojis in different contexts.

Information networks have been of primary use for storing large amounts of information. Many researchers have proposed different graph embedding models in the machine learning literature which allow us to embed the nodes of large information networks into a low dimensional vector space ([PARS14] [GL16] [CLX15]). These embeddings have helped address many tasks such as node classification, visualization, and link prediction.

3 Data and Network

The emoji network is constructed using a Twitter corpus of 147 million tweets crawled over a period of 2 months (from 6th August 2016 to 8th September 2016) by Wijeratne et al. [WBSD17a]. We filter the tweets and only consider those which have multiple emojis embedded in them. This reduces the number of distinct tweets in the dataset to 14.3 million. Figure 1 shows the distribution of the number of tweets for the most frequently occurring emojis. Each tweet generates a polygon of n sides, where n is the number of emojis embedded in the tweet. The construction of the emoji network is straightforward, and Figure 2 explains the construction of emoji polygons with the help of different examples.

The weight of an edge signifies the number of co-occurrences, over the complete Twitter corpus, of the two emojis sharing the edge. For example, in the tweets shown in Figure 2 the emoji pair ( , ) appears twice, hence the weight of the edge corresponding to these two emojis is 2. The weights of all other edges in the emoji network are calculated in the same way.
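The edge-weight computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each tweet is assumed to be already reduced to its list of emojis, every pair of distinct emojis in a tweet is assumed to form an edge, and repeated emojis in one tweet are counted once. The function names and the toy corpus are hypothetical and are not taken from the authors' code.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(tweets):
    """Count emoji co-occurrences over a corpus.

    `tweets` is an iterable of emoji lists, one list per tweet.
    Returns a Counter mapping a sorted emoji pair to its edge weight.
    """
    edge_weights = Counter()
    for emojis in tweets:
        # Assumption: repeated emojis inside one tweet count once.
        distinct = sorted(set(emojis))
        # Tweets with fewer than two distinct emojis add no edge.
        for pair in combinations(distinct, 2):
            edge_weights[pair] += 1
    return edge_weights


def edge_weight(edge_weights, a, b):
    """Look up the weight of the undirected edge between emojis a and b."""
    return edge_weights[tuple(sorted((a, b)))]


# Hypothetical toy corpus (not taken from the paper's data).
tweets = [["😂", "😭"], ["😭", "😂", "🔥"], ["😍"]]
network = build_cooccurrence_network(tweets)
print(edge_weight(network, "😭", "😂"))  # 2
print(edge_weight(network, "🔥", "😍"))  # 0
```

Storing each pair in sorted order keeps the graph undirected, so the lookup does not depend on the order in which the two emojis appear in a tweet.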

The emoji co-occurrence network created using the tweets in Figure 2 is represented in Figure 3. We input the emoji co-occurrence network graph to our graph embedding model to learn 300-dimensional emoji embeddings, and we evaluate our embeddings using the gold-standard dataset for sentiment analysis. We use the gold-standard dataset ([NSSM15]) to evaluate our embeddings because the current state-of-the-art results [ERA+16] for sentiment analysis were obtained on this dataset. Here we discuss two different measures of the proximity between two nodes of the co-occurrence network graph, and the model developed by Tang et al. [TQW+15] to learn the node representations of a network graph.

First Order Proximity: The first order proximity is defined as the local pairwise proximity between two vertices, given by the weight of the edge joining them. The first order proximity of an edge $(u, v)$ is the weight $w_{uv}$ of that edge. It follows from the definition that the first order proximity between any two non-connected vertices is zero.

[Figure: example tweets (Tweet 1 to Tweet 4), labelled as negative sentiment and positive sentiment tweets]

Second Order Proximity: The second order proximity is defined as the similarity between neighbourhood network structures. For example, let $u$ be an emoji node and let $p_u = (w_{u,1}, w_{u,2}, \ldots, w_{u,|V|})$ denote the first order proximity of $u$ with all the vertices; then the second order proximity between $u$ and $v$ is defined as the similarity between $p_u$ and $p_v$. If no vertex is linked to both $u$ and $v$, the second order proximity between them is zero.

4.1.1 Network embedding using first order proximity

Let $\vec{u}_i$ and $\vec{u}_j$ represent the network embeddings, in a $d$-dimensional vector space, of vertices $v_i$ and $v_j$, where $(i, j)$ is an undirected edge in the network graph. The joint probability which signifies the proximity between vertices $v_i$ and $v_j$ is defined as

$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^{T} \cdot \vec{u}_j)} \qquad (1)$$

where $\vec{u}_i \in R^d$ is the low dimensional representation, also called the embedding, of emoji node $v_i$. The distribution $p_1(\cdot, \cdot)$ is defined over the space $V \times V$, and the corresponding empirical probability $\hat{p}_1$ is defined as

$$\hat{p}_1(i, j) = \frac{w_{ij}}{W}, \qquad W = \sum_{(i,j) \in E} w_{ij} \qquad (2)$$

where $w_{ij}$ represents the weight of the edge between the nodes $v_i$ and $v_j$.
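To make the two quantities concrete, the following minimal Python sketch evaluates the model probability of Equation (1) and the empirical probability of Equation (2). The function names, the toy embeddings and the edge weights are assumptions made for illustration only; they do not come from the paper's implementation.

```python
import math


def joint_probability(u_i, u_j):
    """Equation (1): sigmoid of the dot product of two vertex embeddings."""
    dot = sum(a * b for a, b in zip(u_i, u_j))
    return 1.0 / (1.0 + math.exp(-dot))


def empirical_probabilities(edge_weights):
    """Equation (2): each edge weight divided by the total edge weight W."""
    total = sum(edge_weights.values())
    return {edge: weight / total for edge, weight in edge_weights.items()}


# Hypothetical two-dimensional embeddings and edge weights.
print(joint_probability([0.0, 0.0], [1.0, 2.0]))  # 0.5
weights = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1}
print(empirical_probabilities(weights)[("a", "b")])  # 0.5
```

Training then amounts to moving the embeddings so that the model probabilities agree with the empirical ones on the observed edges.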

To maintain the first order proximity between the vertices of the network graph, the objective function ($O_1$), which is the distance between the empirical probability function and the proximity function, is to be minimized.

$$O_1(i, j) = d(\hat{p}_1(i, j), p_1(i, j)) \qquad (3)$$

where $d(\hat{p}_1(i, j), p_1(i, j))$ is defined as the distance between the two probability distributions. Summing over all edges gives

$$O_1 = \sum_{(i,j) \in E} O_1(i, j) \qquad (4)$$

Replacing $d(\cdot, \cdot)$ by the KL-divergence and omitting constant terms, the objective function reduces to

$$O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j) \qquad (5)$$

4.1.2 Network embedding using second order proximity

The second order proximity of two nodes $(v_i, v_j)$ measures the similarity of the neighbourhood network structures of the two nodes. This measure is applicable to both directed and undirected graphs. Our objective in this case is to look at a vertex together with its "context", which can be related to the distribution of the neighbours of the given vertex. Hence for each edge $(v_i, v_j)$ the probability of the "context" is defined by

$$p_2(v_j \mid v_i) = \frac{\exp(\vec{u}'^{T}_j \cdot \vec{u}_i)}{\sum_{k=1}^{|V|} \exp(\vec{u}'^{T}_k \cdot \vec{u}_i)} \qquad (6)$$

where $|V|$ is the number of vertices and $\vec{u}'_j$ is the representation of $v_j$ when it is treated as a context. As mentioned before, the second order proximity assumes that vertices with similar distributions over the contexts are similar to each other. To maintain the second order proximity, the distance between the context distribution $p_2(\cdot \mid v_i)$ represented in the low dimensional vector space and the empirical distribution $\hat{p}_2(\cdot \mid v_i)$ must be minimized. Hence our objective function ($O_2$) in this case is

$$O_2 = \sum_{v_i \in V} \lambda_i \, d(\hat{p}_2(\cdot \mid v_i), p_2(\cdot \mid v_i)) \qquad (7)$$

where $d(\cdot, \cdot)$ is the distance between two probability distributions and $\lambda_i$ represents the importance of the vertex $v_i$ during the process of optimization. As in the previous case, the empirical distribution is defined as

$$\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{d_i}, \qquad d_i = \sum_{k \in N(i)} w_{ik} \qquad (8)$$

where $w_{ij}$ is the weight of the edge $(v_i, v_j)$, $d_i$ is the out-degree of vertex $v_i$, and $N(i)$ is the set of neighbours of $v_i$. Setting $\lambda_i = d_i$ for simplicity, replacing $d(\cdot, \cdot)$ with the KL-divergence, and omitting constant terms gives

$$O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i) \qquad (9)$$

The negative sampling approach proposed by Mikolov et al. [MSC+13] is used to optimize the objective function, which lets us represent every vertex of the network graph in the low dimensional vector space. For each edge $(i, j)$, the objective simplifies to:

$$\log \sigma(\vec{u}'^{T}_j \cdot \vec{u}_i) + \sum_{n=1}^{K} E_{v_n \sim P_n(v)} \left[ \log \sigma(-\vec{u}'^{T}_n \cdot \vec{u}_i) \right] \qquad (10)$$

where $\sigma(x) = 1/(1 + \exp(-x))$ is the sigmoid function, $K$ is the number of negative samples, and $P_n(v)$ is the noise distribution from which the negative vertices $v_n$ are drawn. We use the stochastic gradient descent algorithm [RRWN11] for optimizing the objective function, and we update the model parameters on a batch of edges. After completion of the training process, we thus obtain the embeddings corresponding to each vertex. The gradient with respect to an embedding $\vec{u}_i$ of vertex $v_i$ will be: (11) (12)
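As an illustration of the negative-sampling objective in Equation (10), the sketch below evaluates it for a single edge with K already-sampled negative vertices and takes one gradient step on the vertex embedding. It is a plain-Python sketch under assumptions, not the authors' TensorFlow code: the function names are hypothetical, the expectation is replaced by the sampled negatives, and the gradient is derived here from Equation (10) by the chain rule. Because Equation (10) is a log-likelihood term, the step ascends it, which is equivalent to descending its negative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def edge_objective(u_i, ctx_j, negative_ctxs):
    """Sampled estimate of Equation (10) for one edge (i, j).

    u_i: embedding of v_i; ctx_j: context embedding u'_j of v_j;
    negative_ctxs: context embeddings u'_n of K sampled noise vertices.
    """
    value = math.log(sigmoid(dot(ctx_j, u_i)))
    for ctx_n in negative_ctxs:
        value += math.log(sigmoid(-dot(ctx_n, u_i)))
    return value


def gradient_wrt_u_i(u_i, ctx_j, negative_ctxs):
    """Gradient of edge_objective with respect to u_i (chain rule)."""
    coeff = 1.0 - sigmoid(dot(ctx_j, u_i))
    grad = [coeff * c for c in ctx_j]
    for ctx_n in negative_ctxs:
        coeff_n = sigmoid(dot(ctx_n, u_i))
        grad = [g - coeff_n * c for g, c in zip(grad, ctx_n)]
    return grad


def ascent_step(u_i, ctx_j, negative_ctxs, learning_rate=0.025):
    """One gradient step on u_i that increases the edge objective."""
    grad = gradient_wrt_u_i(u_i, ctx_j, negative_ctxs)
    return [u + learning_rate * g for u, g in zip(u_i, grad)]


# Hypothetical toy vectors with K = 2 negative samples.
u = [0.1, -0.2]
ctx_pos = [0.3, 0.5]
ctx_negs = [[-0.4, 0.2], [0.6, 0.1]]
updated = ascent_step(u, ctx_pos, ctx_negs)
print(edge_objective(updated, ctx_pos, ctx_negs) > edge_objective(u, ctx_pos, ctx_negs))
```

A full trainer would also update the context embeddings and would draw the negatives from a noise distribution; both are omitted here to keep the sketch short.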

We learn the node embeddings ($\vec{u}_i$) by optimizing the objective function in both cases and call the resulting embeddings first order embeddings and second order embeddings, respectively. The model is trained using the TensorFlow ([ABC+16]) library on a CUDA GPU. It is trained using the RMSProp gradient descent algorithm with a learning rate of 0.025, a batch size of 128, 300,000 batches, and 300-dimensional embeddings. The code is available on GitHub¹; the 300-dimensional emoji embeddings learned using the emoji co-occurrence network can also be accessed at this link.

5 Experiments

5.1 Sentiment Analysis

In this section, we report the accuracies obtained for the sentiment analysis task on the gold-standard dataset developed by Novak et al. [NSSM15]. Our experiments achieve accuracies which outperform the current state-of-the-art results for sentiment analysis on this dataset. The gold-standard dataset² consists of 64,599 manually labelled tweets classified as positive, negative, or neutral. The dataset is divided into a training set of 51,679 tweets, 9,405 of which contain emoji, and a testing set of 12,920 tweets, 2,295 of which contain emoji. In both the training and testing sets, 29% of the tweets are labelled as positive, 25% as negative, and 46% as neutral. We use the pre-trained FastText word embeddings³ [MGB+18] to embed words into a low dimensional vector space. We calculate the bag-of-words vector for each tweet and then use this vector as a feature to train a support vector machine and a random forest model on the training set, and we evaluate the classification accuracy on the whole testing set of 12,920 tweets. The accuracies obtained using the first order embeddings surpass the current state-of-the-art [WBSD17b] results.

¹ https://bit.ly/2I5hYNd
² https://bit.ly/2pLaKVZ

Emoji similarity⁴ is one of the important challenges to address in the development of emoji keyboards, since the current emoji keyboard consists of 2,666 emojis and the complete list cannot be accommodated on a small screen. The emoji embeddings learned using the emoji co-occurrence network graph could be used to calculate the similarity between emojis, with cosine similarity as the similarity measure, and to group emojis which have high similarity values. This grouping can decrease the number of distinct entries and help accommodate the grouped emojis on a small screen.

In this section, we report the emoji similarity values found using the first order embeddings and second order embeddings.
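For the sentiment analysis experiment described earlier in this section, the tweet-level feature can be sketched as follows. The paper only says that a bag-of-words vector is computed per tweet, so averaging the embeddings of the known tokens is an assumption, as are the tokenization, the function name, and the toy vectors; the resulting vector would then be passed to a classifier such as a support vector machine.

```python
def tweet_vector(tokens, word_vectors, emoji_vectors, dim):
    """Average the embeddings of all known words and emojis in a tweet.

    Assumption: word and emoji embeddings share the same dimension `dim`.
    Tokens without an embedding are skipped.
    """
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vector = emoji_vectors.get(token)
        if vector is None:
            vector = word_vectors.get(token)
        if vector is None:
            continue
        total = [t + v for t, v in zip(total, vector)]
        count += 1
    if count == 0:
        return total  # all-zero vector when nothing is known
    return [t / count for t in total]


# Hypothetical two-dimensional embeddings for illustration.
word_vectors = {"good": [1.0, 0.0]}
emoji_vectors = {"😂": [0.0, 1.0]}
print(tweet_vector(["good", "😂", "unknown"], word_vectors, emoji_vectors, 2))  # [0.5, 0.5]
```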

We use the cosine similarity as the similarity measure between two embeddings. Let $\vec{a}$ and $\vec{b}$ be the vectors which represent the embeddings of emojis $e_1$ and $e_2$ respectively; the similarity between these two emojis is calculated as

$$\mathrm{similarity}(e_1, e_2) = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert} \qquad (13)$$

³ https://bit.ly/2FMTB4N
⁴ Our main objective is not to address the emoji similarity task. Our main objective is to demonstrate the usefulness of our emoji embeddings for the sentiment analysis task.

The analogical reasoning task introduced by Mikolov et al. [MSC+13] defines syntactic and semantic analogies. For example, consider the semantic analogy USA : Washington = India : ?, where we fill the gap (represented by "?") by finding a word from the vocabulary whose embedding (represented by vec(x)) is closest to vec(Washington) - vec(USA) + vec(India). Here cosine similarity is used as the similarity measure between the two vectors. We extend the semantic analogy task introduced by Mikolov et al. [MSC+13] to emojis by replacing words with emojis. Consider an emoji analogy, ( : ) = ( : ?); we fill the gap (represented by "?") by finding an emoji from the complete list of emojis whose embedding (represented by vec(x)) is closest to vec( ) - vec( ) + vec( ). Table 8 reports some of the interesting analogies found using first order and second order embeddings.

The high accuracy on the classification task using the first order embedding model is due to the fact that all co-occurring emojis in a tweet possess the same sentiment feature; hence these embeddings increase the accuracy of the classification model. Consider the tweet "Who uses this emoji , I miss the one that had this mouth and these eyes ! ... where did he go?! Why did he leave?!". In this tweet we observe the overall sentiment to be positive, and we also observe that all the emojis embedded in the tweet possess the same sentiment. Hence co-occurring emojis are a better attribute for learning emoji embeddings, which can increase the accuracy of sentiment analysis and other related classification tasks.
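The similarity measure and the analogy procedure described in this section can be sketched as below. The function names and the toy three-dimensional vectors are hypothetical, and excluding the three query items from the candidates is an assumption (a common convention for analogy tasks) rather than something the paper states.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def solve_analogy(a, b, c, embeddings):
    """Solve a : b = c : ? by ranking candidates against vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Assumption: the three query items are not valid answers.
    candidates = [key for key in embeddings if key not in (a, b, c)]
    return max(candidates, key=lambda key: cosine_similarity(embeddings[key], target))


# Hypothetical toy embeddings; the keys stand in for emojis or words.
embeddings = {
    "a": [1.0, 0.0, 0.0],
    "b": [1.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
    "d": [0.0, 1.0, 1.0],
    "e": [1.0, 0.0, 1.0],
}
print(solve_analogy("a", "b", "c", embeddings))  # d
```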

We use Spearman's rank correlation coefficient to compare the emoji similarity ranks obtained using the first order and second order embeddings learned from the emoji co-occurrence network with the emoji similarity ranks of the gold-standard dataset⁵. Table 7 reports the Spearman's correlation coefficient obtained by our emoji embeddings. According to the correlation coefficients, the first order emoji embeddings show a strong correlation ($0.6 < \rho < 0.79$).
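For readers who want to reproduce this kind of comparison, the sketch below computes Spearman's rank correlation between two lists of similarity scores, for example embedding-based similarities and gold-standard ratings for the same emoji pairs. The function names and the input values are hypothetical; ties are handled with average ranks, which is one common convention and is assumed here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rank_x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in rank_y))
    return cov / (std_x * std_y)


# Hypothetical similarity scores for five emoji pairs.
model_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
gold_scores = [2.0, 1.0, 4.0, 3.0, 5.0]
print(round(spearman_rho(model_scores, gold_scores), 3))  # 0.8
```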

The top 6 most similar emoji pairs observed considering the first order embeddings are reported in Table

7 Future Work

The use of external knowledge has improved the accuracies of various natural language processing tasks and outperformed many state-of-the-art results. Bian et al. [BGL14] leveraged external knowledge in learning word embeddings, which gave better accuracies in word similarity and word analogy tasks. The first set of examples in the EmoSim508⁷ dataset looks more convincing than the results in Table 5 and Table 6, because semantic knowledge helps us compare the similarity between different emojis efficiently. Using the work of Bian et al. as a reference, we could incorporate external knowledge from EmojiNet into our network embedding model, which might further improve the accuracies of sentiment analysis and emoji similarity tasks.

Acknowledgement

We are grateful to Sanjaya Wijeratne and Amit Sheth for thought-provoking discussions on the topic. We acknowledge support from the Indian Institute of Technology Kharagpur. Any opinions, findings, and conclusions/recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Indian Institute of Technology Kharagpur.

⁶ The semantic similarity is the similarity measure obtained using semantic embeddings developed by Wijeratne et al.
⁷ https://bit.ly/2GztSR2

References

[ABC+16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265-283, 2016.

[BBS17] Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. Are emojis predictable? arXiv preprint arXiv:1702.07285, 2017.

[BCM11] Smriti Bhagat, Graham Cormode, and S Muthukrishnan. Node classification in social networks. In Social network data analytics, pages 115-148. Springer, 2011.

[BGJM16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

[BGL14] Jiang Bian, Bin Gao, and Tie-Yan Liu. Knowledge-powered deep learning for word embedding. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 132-148. Springer, 2014.

[BRS16] Francesco Barbieri, Francesco Ronzano, and Horacio Saggion. What does this emoji mean? A vector space skip-gram model for Twitter emojis. In LREC, 2016.

[CLX15] Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 891-900. ACM, 2015.

[ERA+16] Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. emoji2vec: Learning emoji representations from their description. arXiv preprint arXiv:1609.08359, 2016.

[FHSM17] Halley Fede, Isaiah Herrera, SM Mahdi Seyednezhad, and Ronaldo Menezes. Representing emoji usage using directed networks: A Twitter case study. In International Workshop on Complex Networks and their Applications, pages 829-842. Springer, 2017.

[FMS+17]

[PARS14] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701-710. ACM, 2014.

[PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532-1543, 2014.

[WBSD17b] Sanjaya Wijeratne, Lakshika Balasuriya, Amit P. Sheth, and Derek Doran. A

[YRS+14] Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 283-292. ACM, 2014.