The validity of word vectors over time for the EVALITA 2018 emoji prediction task (ITAmoji)

Mauro Bennici, You Are My GUide, mauro@youaremyguide.com
Xileny Seijas Portocarrero, You Are My GUide, xileny@youaremyguide.com

Abstract

This document describes the results of our system in the evaluation campaign on the prediction of emojis in Italian, organized in the context of EVALITA 2018 (Ronzano et al., 2018). Given the text of a tweet in Italian, the task is to predict the emoji most likely associated with that tweet among the 25 emojis selected by the organizers. In this report, we describe the three systems proposed for evaluation. The approach described starts from the possibility of creating two different models, one for categorization and the other for polarity, and of combining the two models to obtain a better understanding of the dataset.

1 Introduction

In the field of communication, addressing your audience with a common language, in which customers can recognize themselves and identify with each other, is fundamental. In social interactions, an increasing amount of communication occurs in a non-verbal way, as with emojis. Being able to predict the best emoji to use in a message can improve how the message is perceived and give strength to the message itself.

In the context of the Italian emoji prediction task, called ITAmoji (https://sites.google.com/view/itamoji/), we tried to predict which one of 25 possible emojis accompanies a given tweet.

Although an SVM system could be the best solution to the problem, as the earlier SemEval 2018 task showed (Rama and Çöltekin, 2018), a different approach was chosen in order to focus on the effectiveness of a neural-network-based model.

2 Description of the system

We first cleaned the given data of all noise. All punctuation marks were removed from the text of the tweets, and we focused on cleaning the text and removing ambiguities such as shortened words and abbreviations. We substituted all hyperlinks with the more generic word "LINK", and we did the same with usernames preceded by '@' (user tags), after seeing that they were not relevant to the prediction of the most likely emoji for a tweet.

We tried removing the stop words from the tweets' text, to leave only the words with relevant meaning, but the results were poor.

We then converted every word of the tweet's text into its lemma. While doing the lemmatization, we saw that the username was sometimes misleading in the text, so we chose to remove it and substitute it with the more generic word "USERNAME".
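The paper does not include code for this cleaning step. As a minimal sketch of how it could look, assuming a simple regex pipeline (the function name and patterns are our illustration, not the authors' implementation):

```python
import re

def clean_tweet(text: str) -> str:
    """Normalize a tweet roughly as described above (illustrative only)."""
    # Replace hyperlinks with a generic placeholder token.
    text = re.sub(r"https?://\S+", "LINK", text)
    # Replace @-mentions (user tags) with a generic placeholder token.
    text = re.sub(r"@\w+", "USERNAME", text)
    # Remove punctuation, keeping word characters and whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the whitespace left over from the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Grazie @mario! Guarda qui: https://example.com :)"))
# -> "Grazie USERNAME Guarda qui LINK"
```

Lemmatization would follow this step, e.g. with an Italian model of an NLP library; the paper does not say which lemmatizer was used.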
We used two different fastText (https://fasttext.cc) vectors, one created in 2016 and the other in 2017, both built from Italian tweets containing at least one emoji. The idea was to analyze whether fastText vectors created from tweets published in different periods could capture the use of emojis and its evolution over time.

The system created is an ensemble of two different models, replicating the result obtained in emotion classification (Akhtar et al., 2018).

The first model is a bi-directional Long Short-Term Memory (BI-LSTM) implemented in Keras (https://keras.io):

Layer (type)        Output Shape      Param #
=============================================
e (Embedding)       (None, 25, 200)   34978200
b (Bidirectional)   (None, 512)       935936
d (Dense)           (None, 25)        12825
=============================================

It uses a dropout and a recurrent_dropout of 0.9. The optimizer is RMSProp, and the embedding layer is trainable.

The second model is a LightGBM (https://github.com/Microsoft/LightGBM) classifier, for which the following properties are extracted from the tweet text:

• the length of the tweet
• the percentage of special characters
• the number of exclamation points
• the number of question marks
• the number of words
• the number of characters
• the number of spaces
• the number of stop words
• the ratio between words and stop words
• the ratio between words and spaces
• the ratio between words and hashtags

These properties are joined to the vector created from the bigrams and trigrams of the tweet itself, at both word and character level. The number of leaves is 250, the learner is set to 'Feature', and the learning rate to 0.04.

The ensemble is a weighted average in which the BI-LSTM decides 60% of the vote and the LightGBM the remaining 40%, as sketched below.

We also tried to add a linear classifier, but the attempt did not provide any advantage. A cross-validation search for a better weighting was likewise ineffectual, as its contribution was insignificant.
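The paper specifies only the 60/40 vote; a minimal sketch of how such a weighted average of the two models' class probabilities could be implemented (the function name, array shapes, and the random stand-in data are our assumptions):

```python
import numpy as np

def ensemble_predict(p_bilstm: np.ndarray, p_lgbm: np.ndarray,
                     w_bilstm: float = 0.6, w_lgbm: float = 0.4) -> np.ndarray:
    """Blend per-class probabilities from the two models.

    p_bilstm, p_lgbm: arrays of shape (n_tweets, 25), one probability
    distribution over the 25 candidate emojis per tweet.
    Returns the index of the predicted emoji for each tweet.
    """
    blended = w_bilstm * p_bilstm + w_lgbm * p_lgbm
    return blended.argmax(axis=1)

# Demo with random stand-in probabilities for 3 tweets:
rng = np.random.default_rng(0)
p_lstm = rng.dirichlet(np.ones(25), size=3)  # stand-in for BI-LSTM output
p_gbm = rng.dirichlet(np.ones(25), size=3)   # stand-in for LightGBM output
print(ensemble_predict(p_lstm, p_gbm))
```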
3 Results

The results of the BI-LSTM were:

BI-LSTM with 2016 fastText

precision   recall   F1 score
0.3595      0.2519   0.2715

Table 1: precision, recall, and F1 score with the 2016 fastText vector.

BI-LSTM with 2017 fastText

precision   recall   F1 score
0.3520      0.2577   0.2772

Table 2: precision, recall, and F1 score with the 2017 fastText vector.

The model trained with the data published during 2017 behaves quite similarly to the model trained with the data published during 2016.

The results of the LightGBM were:

LightGBM, text only

precision   recall   F1 score
0.2399      0.3094   0.2460

Table 3: precision, recall, and F1 score.

The LightGBM model was also tested by adding to the already mentioned properties additional user information, such as the user ID, and information extracted from the tweet date, such as the day, the month, the day of the week, and the time.

LightGBM with user and date

precision   recall   F1 score
0.5044      0.2331   0.2702

Table 4: precision, recall, and F1 score.

The results indicate that there is a correspondence between the use of emojis, the user, the time, and the day: for example, the Christmas tree in December or the heart emoji in the evening hours. The level of precision obtained in this way was very high, even if the F1 score is still lower than that of the BI-LSTM model.

To counter the imbalance of the emojis present in the training dataset, various undersampling and oversampling operations were performed, without any appreciable results.

Turning to the result of the ensemble of the two models, we had a marked increase in the F1 score, thanks to the substantial growth of the recall in both cases. Tables 5 and 6 report the minimum and the maximum F1 score obtained during the ensembling process.

BI-LSTM with 2016 fastText + LightGBM, text only

precision   recall   F1 score
0.4121      0.2715   0.2955

Table 5: precision, recall, and F1 score.

BI-LSTM with 2017 fastText + LightGBM with user and date

precision   recall   F1 score
0.3650      0.2917   0.3048

Table 6: precision, recall, and F1 score.

The result of the official evaluation was, however, very far from that obtained during the training phase. It will be necessary to evaluate whether, as in the research "Exploring Emoji Usage and Prediction Through a Temporal Variation Lens" (Barbieri et al., 2018), the cause is the temporal distance between the publication dates of the analyzed tweets and those of the training tweets; whether the analyzed tweets were too different from those of the training dataset; whether the users in the test dataset have different behaviors; or whether the system suffered from some kind of overfitting (visible in the third submission, gw2017_pe).

                 gw2017_e   gw2017_p   gw2017_pe
Macro F1         0.222082   0.232940   0.037520
Micro F1         0.421920   0.400920   0.119480
Weighted F1      0.368996   0.378105   0.109664
Coverage error   4.601440   5.661600   13.489400
Accuracy at 5    0.713000   0.671840   0.279280
Accuracy at 10   0.859040   0.814880   0.430360
Accuracy at 15   0.943080   0.894160   0.560000
Accuracy at 20   0.982520   0.929920   0.662720

Table 7: macro F1, micro F1, weighted F1, coverage error, and accuracy at 5, 10, 15, and 20 for the three runs submitted.

In Table 8 we can observe the results of the three submissions split by emoji.

gw2017_e                    gw2017_p                    gw2017_pe
prec.   recall  f1-score    prec.   recall  f1-score    prec.   recall  f1-score    quantity
0.2150  0.0224  0.0405      0.1395  0.0642  0.0879      0.0242  0.0107  0.0148      1028
0.4429  0.1917  0.2676      0.3608  0.2075  0.2635      0.0215  0.0178  0.0195      506
0.3142  0.3417  0.3274      0.2726  0.3681  0.3133      0.0343  0.0468  0.0396      834
0.3624  0.3540  0.3582      0.3204  0.3850  0.3498      0.0107  0.0155  0.0127      387
0.3137  0.0360  0.0646      0.1608  0.0518  0.0784      0.0077  0.0023  0.0035      444
0.3533  0.8357  0.4967      0.4185  0.6104  0.4965      0.2024  0.2648  0.2294      4966
0.3902  0.1535  0.2203      0.3257  0.2038  0.2507      0.0263  0.0264  0.0263      417
0.2917  0.0554  0.0931      0.2190  0.0678  0.1035      0.0328  0.0102  0.0155      885
0.0800  0.0053  0.0099      0.0581  0.0132  0.0215      0.0380  0.0079  0.0131      379
0.5143  0.2581  0.3437      0.4464  0.2688  0.3356      0.0044  0.0036  0.0039      279
0.3144  0.1635  0.2152      0.1895  0.2520  0.2163      0.0135  0.0134  0.0135      373
0.7567  0.7497  0.7531      0.7803  0.7358  0.7574      0.2101  0.2016  0.2058      5069
0.1714  0.0110  0.0207      0.1053  0.0183  0.0312      0.0137  0.0018  0.0032      546
0.3769  0.1849  0.2481      0.3439  0.2038  0.2559      0.0142  0.0113  0.0126      265
0.3137  0.4109  0.3558      0.2952  0.4824  0.3663      0.0904  0.1583  0.1151      2363
0.2384  0.1607  0.1920      0.2068  0.1747  0.1894      0.0526  0.0546  0.0536      1282
0.3174  0.1043  0.1570      0.2432  0.1157  0.1568      0.0317  0.0243  0.0275      700
0.4667  0.1579  0.2360      0.3239  0.1729  0.2255      0.0096  0.0075  0.0084      266
0.6735  0.3103  0.4249      0.6221  0.3354  0.4358      0.0106  0.0063  0.0079      319
0.3204  0.1220  0.1767      0.2101  0.2680  0.2356      0.0193  0.0185  0.0189      541
0.4278  0.1199  0.1873      0.3043  0.1526  0.2033      0.0249  0.0171  0.0203      642
0.3220  0.0548  0.0936      0.2368  0.0778  0.1171      0.0187  0.0086  0.0118      347
0.3590  0.0411  0.0737      0.2537  0.0499  0.0833      0.0161  0.0059  0.0086      341
0.2082  0.1181  0.1507      0.1584  0.2451  0.1924      0.0369  0.0419  0.0392      1338
0.2609  0.0248  0.0454      0.1860  0.0331  0.0562      0.0336  0.0083  0.0133      483

avg / total
0.4071  0.4219  0.3690      0.3870  0.4009  0.3781      0.1051  0.1195  0.1097      25000

Table 8: precision, recall, and F1 score for the three runs, and quantity in the test set, for the 25 most frequent emojis (one row per emoji).
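The measures in Table 7 are standard classification and ranking metrics; as a sketch, they could be recomputed from per-tweet gold labels and predicted scores with scikit-learn as follows (variable names and the random demo data are ours, and the official scorer may differ in details):

```python
import numpy as np
from sklearn.metrics import f1_score, coverage_error
from sklearn.preprocessing import label_binarize

def evaluate(y_true, proba, n_classes=25):
    """Recompute Table 7 style metrics from predictions (illustrative)."""
    y_pred = proba.argmax(axis=1)
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    metrics = {
        "macro F1": f1_score(y_true, y_pred, average="macro"),
        "micro F1": f1_score(y_true, y_pred, average="micro"),
        "weighted F1": f1_score(y_true, y_pred, average="weighted"),
        # Average rank depth needed to reach the gold label.
        "coverage error": coverage_error(y_bin, proba),
    }
    ranked = np.argsort(-proba, axis=1)  # emojis sorted by score, descending
    for k in (5, 10, 15, 20):
        hits = [y in ranked[i, :k] for i, y in enumerate(y_true)]
        metrics[f"accuracy at {k}"] = float(np.mean(hits))
    return metrics

# Demo with random stand-in data for 100 tweets:
rng = np.random.default_rng(0)
proba = rng.dirichlet(np.ones(25), size=100)
y_true = rng.integers(0, 25, size=100)
print(evaluate(y_true, proba))
```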
It is important to note that, despite its significant presence in the dataset, one of the emojis has a meager final F1 score, while another has a high F1 score even though it is present in only 319 items.

4 Discussion

In the study of the dataset, three critical issues emerged.

• The first is that the choice among similar emojis seems to be dictated mostly by the personal preference of the user. There is not much evidence of why the use of one emoji is preferred over another, in particular within groups of near-synonymous emojis.

• The second is that, especially in cases where a tweet begins by indicating a USERNAME, or in a mention or a direct reply, the use of emoji takes on a sub-language value. That is, the use of a specific word or emoji has a meaning that only the tweet's recipients know; it could be irony, or simply a reference to previous experiences shared in common.

• The third is that the strong imbalance of the training dataset is not the only reason for the unbalanced prediction of some emojis.

5 Conclusion

The result of the ensemble was good and demonstrates the validity of this kind of approach. The use of emoji is personal and also depends on the context and on the people in the discussion. A system in which emojis with the same meaning are merged could be more proficient and closer to production readiness.

In the near future, we will evaluate the speed and effectiveness of a CNN model in which the operation of the BI-LSTM and the feature extraction used in the LightGBM model can be merged in the same training session.

We will also focus on the creation of fastText vectors of different sizes, containing tweets from specific contexts and published in different periods, to identify the periodicity and variation in the use of particular emojis. The intent is to discover other hidden patterns, beyond the obvious ones that emerged for the holiday periods.

References

Md Shad Akhtar, Deepanway Ghosal, Asif Ekbal, Pushpak Bhattacharyya, and Sadao Kurohashi. 2018. A Multi-task Ensemble Framework for Emotion, Sentiment and Intensity Prediction. https://arxiv.org/abs/1808.01216

Francesco Barbieri, José Camacho-Collados, Francesco Ronzano, Luis Espinosa Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio Saggion. 2018. SemEval 2018 Task 2: Multilingual Emoji Prediction. In SemEval@NAACL-HLT 2018, pages 24-33. ACL.

Francesco Barbieri, Luis Marujo, Pradeep Karuturi, William Brendel, and Horacio Saggion. 2018. Exploring Emoji Usage and Prediction Through a Temporal Variation Lens. https://arxiv.org/abs/1805.00731

Taraka Rama and Çağrı Çöltekin. 2018. Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in Emoji Prediction. https://aclanthology.coli.uni-saarland.de/papers/S18-1004/s18-1004

Francesco Ronzano, Francesco Barbieri, Endang Wahyu Pamungkas, Viviana Patti, and Francesca Chiusaroli. 2018. ITAmoji: Overview of the Italian Emoji Prediction Task @ EVALITA 2018. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org, Turin, Italy.