Predicting Emoji Exploiting Multimodal Data: FBK Participation in the ITAmoji Task

Andrei Catalin Coman, Yaroslav Nechaev, Giacomo Zara
Fondazione Bruno Kessler
coman@fbk.eu, nechaev@fbk.eu, gzara@fbk.eu

Abstract

In this paper, we present the approach that won the ITAmoji task of the 2018 edition of the EVALITA evaluation campaign[1]. ITAmoji is a classification task: predict the most probable emoji (out of 25 classes) to go along with a target tweet written in Italian by a given person. We demonstrate that using only textual features is insufficient to achieve reasonable performance levels on this task, and we propose a system that benefits from the multimodal information contained in the training set, enabling significant F1 gains and earning us the first place in the final ranking.

1 Introduction

Particularly over the last few years, with the increasing presence of social networks and instant messaging services in our lives, we have been witnessing how common it has become for average users to enrich natural language with emojis. An emoji is essentially a symbol placed directly into the text, meant to convey a simple concept or, more specifically, as the name says, an emotion.

The emoji phenomenon has attracted considerable research interest. In particular, recent works have studied the connection between natural language and the emojis used in a specific piece of text. The 2018 edition of the EVALITA ITAmoji competition (Ronzano et al., 2018) is a prime example of such interest. In this competition, participants were asked to predict which of 25 emojis is to be used in a given Italian tweet, based on its text, its date and the user that has written it. Differently from the similar SemEval challenge (Barbieri et al., 2018), the addition of the user information significantly expanded the scope of potential solutions that could be devised.

In this paper, we describe our neural network-based system, which exhibited the best performance among the approaches submitted to this task. Our approach successfully exploits user information, such as the prior emoji usage history of a user, in conjunction with the textual features that are customary for this task. In our experiments, we found that using just the textual information from the tweet provides limited results: none of our text-based models were able to outperform a simple rule-based baseline built on the prior emoji history of the target user. However, by considering all the modalities of the input data that were made available to us, we were able to improve our results significantly. Specifically, we combine into a single efficient neural network the typical Bi-LSTM-based recurrent architecture, which has previously shown excellent performance on this task, with a multilayer perceptron applied to user-based features.

[1] EVALITA: http://evalita.it/2018

Figure 1: A diagram of the approach.

2 Description of the System

The ITAmoji task is a classification task: predict which of 25 emojis goes along with a given tweet.
The training set provided by the organizers of the competition consists of 250,000 Italian tweets, including for each tweet the text (without the target emoji), the user ID and the timestamp. Participants were explicitly forbidden to expand the training set. Figure 1 provides an overview of our approach. In this section, we describe in detail the methods we employed to solve the proposed task.

2.1 Textual features

In order to embed the textual content of the tweet, we apply a vectorization based on fastText (Bojanowski et al., 2017), a recent approach for learning unsupervised low-dimensional word representations. fastText sees a word as a collection of character n-grams, learning a representation for each n-gram. fastText follows the well-known distributional semantics hypothesis utilized in other approaches, such as LSA, word2vec and GloVe. In this work, we exploit the Italian embeddings trained on text from Wikipedia and Common Crawl[2] and made available by the fastText authors[3]. These embeddings include 300-dimensional vectors for each of the 2M words in the vocabulary. Additionally, we trained our own[4] embeddings using a corpus of 48M Italian tweets acquired from the Twitter Streaming API, which yielded 1.1M 100-dimensional word vectors. Finally, we also conducted experiments with the word vectors suggested by the task organizers (Barbieri et al., 2016).

2.2 User-based features

Rather than relying solely on the text of the target tweet, we exploit additional user-based features to improve performance. The task features many variations of the smiling face and three different heart emojis, making it impossible even for a human to determine the most suitable one based on the tweet alone. One of the features we considered was the prior emoji distribution of the target author. The hypothesis was that the choice of a particular emoji is driven mainly by personal user preferences, exemplified by previous emoji choices.

To this end, we collected two different types of emoji history for each user. Firstly, we used the labels in the training set to compute an emoji distribution for each user, yielding vectors of size 25. Users from the test set that were not present in the training set were initialized with zeroes. Secondly, we gathered the last 200 tweets of each user using the Twitter API[5], and then extracted and counted all emojis present in those tweets, yielding a sparse vector of size 1284. At this step we took extra care to prevent data leaks: if a tweet from the test set ended up among the collected 200 tweets, it was not counted in the user history. The runs that used the former, training set-based approach carry a "_tr" suffix in their name; the ones that used the full user history-based approach carry a "_ud" suffix.

In addition to the prior emoji distribution, we ran preliminary experiments with the user's social graph. The social graph, i.e., the graph of connections between users, has been shown to be an important feature for many tasks on social media, for example user profiling. We followed a recently proposed approach (Nechaev et al., 2018a) to acquire 300-dimensional dense user representations based on the social graph. This feature, however, did not improve the performance of our approach and was excluded.

[2] http://commoncrawl.org/
[3] https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
[4] https://doi.org/10.5281/zenodo.1467220
[5] https://developer.twitter.com
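As a concrete illustration, the training set-based emoji distribution (the "_tr" variant) can be sketched in a few lines of Python. The function and variable names below are ours, for illustration only, and are not taken from the released code:

```python
from collections import Counter, defaultdict

NUM_CLASSES = 25  # the 25 emoji labels of the task, indexed 0..24


def build_user_histories(rows):
    """Build a normalized 25-dim emoji distribution per user from
    training labels. `rows` is an iterable of (user_id, label) pairs."""
    counts = defaultdict(Counter)
    for user_id, label in rows:
        counts[user_id][label] += 1
    histories = {}
    for user_id, counter in counts.items():
        total = sum(counter.values())
        histories[user_id] = [counter[i] / total for i in range(NUM_CLASSES)]
    return histories


def history_for(histories, user_id):
    """Users that appear only in the test set fall back to all zeroes."""
    return histories.get(user_id, [0.0] * NUM_CLASSES)
```

The "_ud" variant is analogous, except that emojis are counted over the user's 200 most recent tweets, yielding the sparse vector of size 1284 mentioned above.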
Layer          Parameter           Value
Textual Input  seq. length         48
Embedding      input               256
               output              100
               trainable           true
               l2 regularization   10^-6
Dropout        probability         0.4
Bi-LSTM        output              512

(a) Bi-LSTM model hyperparameters.

Layer          Parameter           Value
History Input  input               1284
Dense          output              256
               l2 regularization   10^-5
               activation          tanh
Dense          output              256
               l2 regularization   10^-5
               activation          tanh

(b) User history model hyperparameters.

Layer          Parameter           Value
Concatenate    output              768
Dense          output              25
               l2 regularization   –
               activation          softmax
Optimizer      method              Adam
               learning rate       0.001
               β1                  0.9
               β2                  0.999
               decay               0.001

(c) Joint model and optimizer parameters.

Table 1: Model hyperparameters.
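To make the hyperparameters in Table 1 concrete, the layer stack can be assembled in Keras roughly as follows. This is a sketch rather than the authors' released code: the vocabulary size is an assumption (it is not listed in Table 1), and the Adam decay of 0.001 from Table 1c is noted but not wired in.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

SEQ_LEN = 48        # padded token sequence length (Table 1a)
VOCAB_SIZE = 20000  # assumption: vocabulary size is not listed in Table 1
EMB_DIM = 100       # matches the custom-100d embeddings

# Textual branch (Table 1a): embedding -> dropout -> Bi-LSTM
tokens = layers.Input(shape=(SEQ_LEN,), name="tokens")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM, trainable=True,
                     embeddings_regularizer=regularizers.l2(1e-6))(tokens)
x = layers.Dropout(0.4)(x)
x = layers.Bidirectional(layers.LSTM(256))(x)  # 2 x 256 = 512-dim output

# User history branch (Table 1b): two tanh dense layers
history = layers.Input(shape=(1284,), name="emoji_history")
h = layers.Dense(256, activation="tanh",
                 kernel_regularizer=regularizers.l2(1e-5))(history)
h = layers.Dense(256, activation="tanh",
                 kernel_regularizer=regularizers.l2(1e-5))(h)

# Joint model (Table 1c): 512 + 256 = 768, then softmax over the 25 emojis
merged = layers.Concatenate()([x, h])
out = layers.Dense(25, activation="softmax")(merged)

model = tf.keras.Model(inputs=[tokens, history], outputs=out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy")
```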
2.3 RNN exploiting textual features

Recurrent Neural Networks (RNNs) have turned out to be a powerful architecture when it comes to analyzing and making predictions on sequential data. In particular, over the last few years, different variations of the RNN have been the top-performing approaches for a wide variety of tasks, including tasks in Natural Language Processing (NLP). An RNN consumes the input sequence one element at a time, modifying its internal state along the way to capture relevant information from the sequence. When used for NLP tasks, an RNN is able to consider the entirety of the target sentence, capturing even the longest dependencies within the text. In our system, we use the bi-directional long short-term memory (Bi-LSTM) variation of the RNN. This variation uses two separate RNNs to traverse the input sequence in both directions (hence bi-directional) and employs LSTM cells.

The input text provided by the organizers is split into tokens using a modified version of the Keras tokenizer (available in our repository). The input tokens are then turned into word vectors of fixed dimensionality using the embedding matrix of one of the approaches listed in Section 2.1. The resulting sequence is padded with zeroes to a constant length, in our case 48, and fed into the neural network.

2.4 Overall implementation

In order to accommodate both textual and user-based features, we devise a joint architecture that takes both types of features as input and produces a probability distribution over the 25 target classes. The general logic of our approach is shown in Figure 1. The core consists of two main components:

• Bi-LSTM. The recurrent unit consumes the input sequence one vector at a time, modifying the hidden state (i.e., memory). After the whole sequence is consumed in both directions, the internal states of the two RNNs are concatenated and used as a tweet embedding. Additionally, we apply l2-regularization to the input embedding matrix and use dropout to prevent overfitting and fine-tune the performance. Table 1a details the hyperparameters we used for the textual part of our approach.

• User-based features. The emoji distribution we collected (as described in Section 2.2) is fed to a multilayer perceptron: two densely-connected layers with tanh activations and l2-regularization to prevent overfitting. Table 1b showcases the chosen hyperparameters for this component when using the full user history as input.

The outputs of the two components are then concatenated, and a final layer with softmax activation is applied to obtain the probability distribution over the 25 emoji labels. The network is optimized jointly with cross-entropy as the objective function, using the Adam optimizer. Table 1c lists all relevant hyperparameters for this step.

Since the runs are evaluated based on macro-F1, in order to optimize our approach for this metric we introduced class weights into the objective function. Each class i is associated with the weight

    w_i = (max_j(N_j) / N_i)^α    (1)

where N_i is the number of samples in class i and α = 1.1 is a hyperparameter we tuned for this task. This way the optimizer assigns a greater penalty to mistakes in rare classes, thus optimizing for the target metric.
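Equation (1) translates directly into code; a minimal sketch (the names are ours, for illustration):

```python
def class_weights(class_counts, alpha=1.1):
    """Weight for class i per Eq. (1): w_i = (max_j N_j / N_i) ** alpha.
    `class_counts` maps a class index to its number of training samples."""
    n_max = max(class_counts.values())
    return {i: (n_max / n) ** alpha for i, n in class_counts.items()}


# With alpha > 1, the rarest class receives the largest weight,
# pushing the optimizer to pay more attention to it.
weights = class_weights({0: 5069, 1: 4966, 2: 279})
```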
During the training of our approach, we employ an early stopping criterion to halt the training once the performance on the validation set stops improving. In order to properly evaluate our system, we employ 10-fold cross-validation, additionally extracting a small validation set from the training set of each fold to perform the early stopping. For the final submission we use a simple ensemble mechanism, where predictions are acquired independently from each fold and then averaged to produce the final submission. Additionally, one of the runs was submitted using predictions from a single, randomly chosen fold. Runs exploiting the ensemble approach have the "_10f" suffix, while runs using just one fold have the "_1f" suffix.

The code used to preprocess data, train and evaluate our approach is available on GitHub[6].

3 Evaluation setting

In this section, we provide details on some of the approaches we tested during the development of our system, as well as on the models we submitted for the official evaluation. In this paper, we report results for the following models:

• MF_HISTORY. A rule-based baseline that always outputs the most frequent emoji from the user history computed on the training set.

• BASE_CNN. A basic Convolutional Neural Network (CNN) taking word embeddings as input, without any user-based features.

• BASE_LSTM. The Bi-LSTM model described in Section 2.3, using textual features only.

• BASE_LSTM_TR. The complete approach including both feature families, with the emoji distribution coming from the training set.

• BASE_LSTM_UD. The complete approach, with the emoji distribution coming from the most recent 200 tweets of each user.

For the other models tested during our local evaluation and the complete experimental results, please refer to our GitHub repository.

Additionally, for the BASE_LSTM approach we report performance variations due to the choice of word embeddings. In particular, provided refers to the embeddings suggested by the organizers, custom-100d indicates our own fastText-based embeddings, and common-300d refers to the ones available on the fastText website. Table 2 details the performance of the mentioned models.

Finally, we submitted three of our best models for the official evaluation. All of the submitted runs use the Bi-LSTM approach with our custom-100d word embeddings, along with some variation of the user emoji distribution as detailed in Section 2.2. Two of the runs use the ensembling trick over all available cross-validation folds, while the remaining one ("_1f") uses predictions from just one fold.

4 Results

Here we report the performance of the models benchmarked both during our local evaluation (Table 2) and in the official evaluation (Table 3). We started with text-only models, testing different architectures and embedding combinations. Among those, the Bi-LSTM architecture was a clear choice, providing a 1-2% F1 improvement over the CNN, which led us to abandon the CNN-based models. Among the three word embedding models we evaluated, our custom-100d embeddings exhibited the best performance with the Bi-LSTM, while common-300d performed best with the CNN architecture.

After acquiring the user emoji distributions, we devised a simple baseline (MF_HISTORY), which, to our surprise, outperformed all the text-based models we had tested so far: a 3% F1 improvement compared to the best Bi-LSTM model. When we introduced the user emoji histories into our approach, we obtained a significant performance gain: 4% when using the scarce training set data and 12% when using the complete user history of 1284 emojis from recent tweets. During the final days of the competition, we tried to exploit other user-based features to further bolster our results, for example the social graph of a user. Unfortunately, such experiments did not yield performance gains before the deadline.

During the official evaluation, the complete user history-based runs exhibited top performance, with the ensembling trick actually decreasing the final F1.
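The fold-averaging ensemble behind the "_10f" runs (Section 2.4) amounts to a mean of the per-fold probability distributions followed by an argmax; a sketch with illustrative names:

```python
import numpy as np


def ensemble_predict(fold_probs):
    """Average the (n_tweets, n_classes) probability matrices produced
    independently by the cross-validation folds, then pick the top
    emoji per tweet."""
    mean_probs = np.mean(np.stack(fold_probs), axis=0)
    return mean_probs.argmax(axis=1)
```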
As we expected from our experiments, the training set-based emoji distribution was much less performant, but it still offered a significant improvement over the runner-up team (gw2017_p.list), as shown in Table 3. Additionally, we detail the performance of our best submission (BASE_UD_1F) for each individual emoji in Table 4 and Figure 2.

[6] GitHub repository: https://github.com/Remper/emojinet

Approach        Embedding     Accuracy  Precision  Recall  F1 macro
MF_HISTORY      –             0.4396    0.4076     0.2774  0.3133
BASE_CNN        common-300d   0.4351    0.3489     0.2464  0.2673
BASE_LSTM       common-300d   0.4053    0.3167     0.2534  0.2707
BASE_LSTM       provided      0.4415    0.3836     0.2408  0.2622
BASE_LSTM       custom-100d   0.4443    0.3666     0.2586  0.2809
BASE_LSTM_TR    custom-100d   0.4874    0.4343     0.3218  0.3565
BASE_LSTM_UD    custom-100d   0.5498    0.4872     0.4097  0.4397

Table 2: Performance of the approaches as tested locally by us.

Run            Accuracy@5  Accuracy@10  Accuracy@15  Accuracy@20  F1 macro
BASE_UD_1F     0.8167      0.9214       0.9685       0.9909       0.3653
BASE_UD_10F    0.8152      0.9194       0.9681       0.9917       0.3563
BASE_TR_10F    0.7453      0.8750       0.9434       0.9800       0.2920
gw2017_p.list  0.6718      0.8148       0.8941       0.9299       0.2329

Table 3: Official evaluation results for our three submitted runs and the runner-up model.

Precision  Recall  F1      Support
0.7991     0.6490  0.7163  5069
0.4765     0.7116  0.5708  4966
0.6402     0.4337  0.5171  279
0.5493     0.4315  0.4834  387
0.4937     0.4453  0.4683  265
0.7254     0.3229  0.4469  319
0.3576     0.5370  0.4293  2363
0.4236     0.4089  0.4161  834
0.4090     0.3775  0.3926  506
0.4034     0.3354  0.3663  1282
0.4250     0.3299  0.3715  885
0.3743     0.3184  0.3441  1338
0.3684     0.3239  0.3447  1028
0.3854     0.2782  0.3231  266
0.3844     0.2711  0.3179  546
0.3899     0.2648  0.3154  642
0.3536     0.2743  0.3089  700
0.3835     0.2566  0.3075  417
0.3525     0.1922  0.2488  541
0.2866     0.2639  0.2748  341
0.2280     0.2922  0.2562  373
0.2751     0.2133  0.2403  347
0.2845     0.1741  0.2160  379
0.3154     0.1822  0.2310  483
0.2956     0.1824  0.2256  444

Table 4: Precision, Recall, F1 of our best submission and the number of samples in the test set for each emoji.

Figure 2: Confusion matrix for our best submission, normalized by support size: each value in a row is divided by the row marginal. Diagonal values give the recall of each individual class (see Table 4).

5 Discussion and Conclusions

Our findings suggest that emojis are currently used mostly according to user preferences: the more prior user history we added, the larger the performance boost we observed. Therefore, the emojis in a text cannot be considered independently from the person that has used them, and textual features alone cannot yield a sufficiently performant approach for predicting emojis. Additionally, we have shown that the task is sensitive to the choice of neural architecture as well as to the choice of the word embeddings used to represent the text.

An analogous task was proposed to the participants of the SemEval 2018 competition. The winners of that edition applied an SVM-based approach for the classification (Çöltekin and Rama, 2018). Instead, we opted for a neural network-based architecture that allowed us greater flexibility to experiment with features coming from different modalities: the text of the tweet, represented using word embeddings, and the sparse user-based history. During our experiments with the SemEval 2018 task as part of the NL4AI workshop (Coman et al., 2018), we found a CNN-based architecture to perform better, while here the RNN was a clear winner. Such a discrepancy might suggest that, even within the emoji prediction task, the effectiveness of different approaches may vary significantly based either on the language of the tweets or on the way the dataset was constructed.

In the future, we would like to investigate this topic further by studying differences in emoji usage between languages and communities.
Additionally, we aim to further improve our approach by identifying more user-based features, for example by taking into account the feature families suggested by Nechaev et al. (2018b).

References

Francesco Barbieri, German Kruszewski, Francesco Ronzano, and Horacio Saggion. 2016. How cosmopolitan are emojis? Exploring emojis usage and meaning over different languages with distributional semantics. In Proceedings of the 2016 ACM on Multimedia Conference, pages 531–535. ACM.

Francesco Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa-Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio Saggion. 2018. SemEval-2018 Task 2: Multilingual Emoji Prediction. In Proc. of the 12th Int. Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, United States. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Çağrı Çöltekin and Taraka Rama. 2018. Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in emoji prediction. In Proc. of the 12th Int. Workshop on Semantic Evaluation, SemEval@NAACL-HLT, New Orleans, Louisiana, pages 34–38.

Andrei Catalin Coman, Giacomo Zara, Yaroslav Nechaev, Gianni Barlacchi, and Alessandro Moschitti. 2018. Exploiting deep neural networks for tweet-based emoji prediction. In Proc. of the 2nd Workshop on Natural Language for Artificial Intelligence, co-located with the 17th Int. Conf. of the Italian Association for Artificial Intelligence (AI*IA 2018), Trento, Italy.

Yaroslav Nechaev, Francesco Corcoglioniti, and Claudio Giuliano. 2018a. SocialLink: Exploiting graph embeddings to link DBpedia entities to Twitter profiles. Progress in AI, 7(4):251–272.

Yaroslav Nechaev, Francesco Corcoglioniti, and Claudio Giuliano. 2018b. Type prediction combining linked open data and social media. In Proc. of the 27th ACM Int. Conf. on Information and Knowledge Management, CIKM 2018, Torino, Italy, pages 1033–1042.

Francesco Ronzano, Francesco Barbieri, Endang Wahyu Pamungkas, Viviana Patti, and Francesca Chiusaroli. 2018. Overview of the EVALITA 2018 Italian Emoji Prediction (ITAmoji) task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.