Overview of the EVALITA 2018 Italian Emoji Prediction (ITAmoji) Task

Francesco Ronzano
Universitat Pompeu Fabra; Hospital del Mar Medical Research Center, Barcelona, Spain
francesco.ronzano@upf.edu

Francesco Barbieri
Universitat Pompeu Fabra, Barcelona, Spain
francesco.barbieri@upf.edu

Endang Wahyu Pamungkas, Viviana Patti
Department of Computer Science, University of Turin, Italy
{pamungka,patti}@di.unito.it

Francesca Chiusaroli
Department of Humanities, Università di Macerata, Italy
f.chiusaroli@unimc.it

Abstract

The Italian Emoji Prediction task (ITAmoji) is proposed at the EVALITA 2018 evaluation campaign for the first time, after the success of the twin Multilingual Emoji Prediction Task, organized in the context of SemEval-2018 in order to challenge the research community to automatically model the semantics of emojis in Twitter. Participants were invited to submit systems designed to predict, given an Italian tweet, its most likely associated emoji, selected in a wide and heterogeneous emoji space. Twelve runs were submitted at ITAmoji by five teams. We present the data sets, the evaluation methodology (including different metrics) and the approaches of the participating systems. We also present a comparison between the performance of automatic systems and humans solving the same task. Data and further information about this task can be found at: https://sites.google.com/view/itamoji/.

1 Introduction

During the last decade the use of emoji has increasingly pervaded social media platforms by providing users with a rich set of pictograms useful to visually complement and enrich the expressiveness of short text messages. Nowadays this novel, visual way of communication represents a de facto standard in a wide range of social media platforms, including fully-fledged portals for user-generated content like Twitter, Facebook and Instagram as well as instant-messaging services like WhatsApp. As a consequence, the possibility to effectively interpret and model the semantics of emojis has become an essential task to deal with when we analyze social media content.

Even if over the last few years the study of this new form of language has been receiving growing attention, at present the body of investigations that deal with emojis is still scarce, especially when we consider their characterization from a Natural Language Processing (NLP) perspective. While there are notable exceptions which study the semantics of emojis and their usage (Barbieri et al., 2016a; Barbieri et al., 2018b; Aoki and Uchida, 2011; Eisner et al., 2016; Ljubešić and Fišer, 2016), reflecting also on their informative behaviour (Donato and Paggio, 2017; Donato and Paggio, 2018) or their sentiment (Novak et al., 2015), the interplay between text-based messages and emojis has so far been explored only by a small number of studies. Among these investigations there is the analysis of emoji predictability by Barbieri et al. (2017), which proposed a neural model to predict the most likely emoji to appear in a text message (tweet). The task turned out to be hard, as emojis encode multiple meanings (Barbieri et al., 2016b). Related to this, in the context of the International Workshop on Semantic Evaluation (SemEval 2018), the Multilingual Emoji Prediction Task (Barbieri et al., 2018a) was organized in order to challenge the research community to automatically model the semantics of emojis occurring in English and Spanish Twitter messages. The task was very successful, with 49 teams participating in the English subtask and 22 in the Spanish subtask. This motivated us to propose the shared task also for the Italian language in the context of the EVALITA 2018 evaluation campaign (Caselli et al., 2018), with the twofold aim to widen the setting for cross-language comparisons for emoji prediction in Twitter and to experiment with novel metrics to better assess the quality of the automatic predictions.

In general, exciting and highly relevant avenues for research are still to be explored with respect to emoji understanding, since emojis are often an essential component of social media texts: ignoring or misinterpreting them may lead to misunderstanding the intended meaning of a message (Miller et al., 2016). The ambiguity of emojis also raises interesting questions in application domains. Consider for instance a human-computer interaction setting: how can we teach an artificial agent to correctly interpret and recognize the use of emojis in spontaneous conversation? The main motivation behind this question is that an AI system able to predict emojis could contribute notably to better natural language understanding (Novak et al., 2015) and thus to other Natural Language Processing tasks such as generating emoji-enriched social media content, enhancing emotion/sentiment analysis systems, improving retrieval of social network material, and ultimately improving user profiling.

In the following, we describe the main elements of the shared task (Section 3), after a brief summary of previous projects reflecting on the semantics of emojis in Italian (Section 2). Then, we cover the data collection, curation and release process (Section 4). In Section 5 we detail the evaluation metrics, we describe the participants' results and we propose a first comparison with the performance of humans solving the same task. We conclude the paper with some reflections on the outcomes of the proposed task.

2 Emojis and Italian

We can observe a growing interest in the semantics of emojis in relation to Italian. In particular, some interesting projects have been carried out in recent years which address the issue in a translation framework, investigating the possibility to translate Italian literary texts into the universal visual language of emoji (Chiusaroli, 2015; Monti et al., 2016). Notably, the Emojitaliano project was launched on Twitter as a translation of the Italian novel Pinocchio into emoji (Chiusaroli, 2017). An original approach based on crowdsourcing was adopted, by involving in the translation task the Twitter community known as Scritture Brevi.

The Twitter community #scritturebrevi. The community (#scritturebrevi, @FChiusaroli, 10,151 followers in November 2018) had previously been involved in experiments of creative writing, also in emojis: with the hashtag #inemoticon, on Twitter, experiments of mixed translation (words and emojis) have been carried out, experiencing the semantic versatility of emojis and their values in rebus writings. Translating the whole Pinocchio book was a more complex and engaging task, especially for its focus on developing a common code base, in terms of glossary and grammar, which is entirely new with respect to previous projects. The translation of Pinocchio started in February 2016. Every day, for 28 weeks, sentences taken from Pinocchio were tweeted, and the followers were invited to suggest their translations into emoji; at the end of each day, the official version of the translation was validated and published. An online tool, Emojitalianobot, has been developed in order to help the community memorize the semantic values assigned to each emoji during the collective translation process. Since its very beginning on Twitter, the project was an instant success, becoming a viral web phenomenon thanks to the Scritture Brevi community. Therefore, it was a natural choice to involve the same Twitter community to reflect on the semantics of emoji from a different perspective, i.e. the one we propose in the context of the ITAmoji shared task, thus helping us to understand how good humans are at predicting emojis (see Section 5.5.2).

3 Task Description

We invited participants to submit systems designed to predict, given a tweet in Italian, its most likely associated emoji, based only on the text of the tweet. As for the experimental setting, for the sake of simplicity, we considered tweets including only one emoji (possibly repeated). After removing the emoji from the tweet, we asked systems to predict it.

    Innamorato sempre di più [URL] [emoji]
    (English: "More and more in love [URL] [emoji]")

Figure 1: Example of tweet with an emoji at the end, considered in the emoji prediction task.

We challenged systems to predict emojis among a wide and heterogeneous emoji space. In particular, we selected the tweets that included one of the twenty-five emojis that occur most frequently in the Twitter data we collected (see Table 1). Therefore, the task can be seen as a multi-class classification task where systems should predict one of 25 possible emojis from the text of a given tweet. Each participant was allowed to submit up to three system runs. Participants were allowed to use additional data to train the systems, such as lexicons and pre-trained word embeddings. In order to make a finer-grained evaluation of results possible, we encouraged participants to submit, for each tweet, not only the most likely emoji predicted but also the complete ranking, from the most likely to the least likely emoji to be associated to the text of the tweet.

4 Task Data

The data for this task were retrieved from Twitter by experimenting with two different approaches: (i) gathering the Twitter stream of (geolocalized) Italian tweets from October 2015 to February 2018; and (ii) retrieving tweets from the followers of the accounts of the most popular Italian newspapers. We randomly selected 275,000 tweets from these collections by choosing tweets that contained one and only one emoji among the 25 most frequent emojis listed in Table 1. We split our data into two sets consisting of 250,000 training samples and 25,000 test samples.

Rank   % of tweets in train and test set
 1     20.27
 2     19.86
 3      9.45
 4      5.35
 5      5.13
 6      4.11
 7      3.54
 8      3.33
 9      2.80
10      2.57
11      2.18
12      2.16
13      2.03
14      1.94
15      1.78
16      1.67
17      1.55
18      1.52
19      1.49
20      1.39
21      1.37
22      1.28
23      1.12
24      1.07
25      1.06

Table 1: The distribution (percentage) for each emoji in the train and test set, with emojis listed by decreasing frequency.

5 Evaluation

In this section we present the evaluation setting for the ITAmoji shared task.

5.1 Metrics

The evaluation of the emoji prediction systems has been based on the classic precision and recall metrics over each emoji. The final ranking of the participating teams of ITAmoji 2018 relies on the Macro F1 score computed with respect to the most likely emoji predicted, given the text of each tweet of the test set, in line with the proposal in the twin task at SemEval 2018 for English and Spanish (Barbieri et al., 2018a).
By ranking on Macro F1 we intend to encourage systems to perform well overall, which would inherently mean a better sensitivity to the use of emojis in general, rather than for instance overfitting a model to do well on the three or four most common emojis of the test data.

In general, the identification of a coherent and effective approach to compare the performance of distinct emoji prediction systems is not an easy task. We often have the clear impression that the semantics of some sets of emojis can be similar; it would therefore be interesting to have a way to compare and evaluate at a finer-grained level the emoji prediction quality of two distinct systems when they both fail to predict the right emoji to associate to a tweet. In such cases it can be important to distinguish between the system that places the right prediction among the most likely emojis to be associated to that tweet and the one that characterizes the right prediction as an emoji that is unlikely to be associated to that tweet. In order to capture this aspect, we gave ITAmoji participants the possibility to submit, as emoji predictions, the ordered ranking of the 25 emojis considered in ITAmoji. Systems providing the ranked list of emoji predictions were also compared by considering the following additional emoji-rank-based metrics: Accuracy@5/10/15/20 and Coverage Error. All the submissions we received provided the ranked list of 25 emojis as predictions: as a consequence, it was possible to compute the emoji-rank-based metrics for all of them.

A detailed description of all the evaluation metrics we considered to compare the quality of emoji prediction approaches is given below. The following three standard metrics are computed by considering only the emoji predicted as the most likely one to be associated to the text of a tweet:

• Macro F1: compute the F1 score for each label (emoji), and find their unweighted mean (exploited to determine the final ranking of the participating teams);

• Micro F1: compute the F1 score globally by counting the total true positives, false negatives and false positives across all labels (emojis);

• Weighted F1: compute the F1 score for each label (emoji), and find their average, weighted by support (the number of true instances for each label).

Regarding the emoji-rank-based metrics, we considered:

• Coverage error: compute how far we need to go through the ranked scores of labels (emojis) to cover all true labels;

• Accuracy@n: the accuracy value computed by considering as right predictions the ones in which the right label (emoji) is among the top n most likely ones.

5.2 Baseline

In order to compare the performance of the ITAmoji participating systems with baseline approaches, we considered three different baselines:

- Majority baseline: for the text of each tweet we predict the ordered list of the 25 most likely emojis sorted by their frequency in the training set, that is, we always predict as first choice the red heart, and as last choice the rose emoji.

- Weighted random baseline: for the text of each tweet we predict the ordered list of the 25 most likely emojis where the first prediction is randomly selected taking into consideration the label frequency in the training set (in order to keep the same label distribution) and the rest of the predictions (from the second to the last one) are generated by considering the remaining emojis sorted by label frequency.

- FastText baseline: for the text of each tweet we predict the ordered list of the 25 most likely emojis by relying on fasttext (https://fasttext.cc/) with basic parameters and pretrained embeddings with 300 dimensions (Barbieri et al., 2016a).

5.3 Participating Systems and Results

We received 12 submissions in total from 5 different teams. The main approaches and features of participating teams are described below.

FBK_FLEXED_BICEPS (Andrei et al., 2018). This system exploits a Bidirectional Long Short Term Memory (Bi-LSTM) recurrent neural network architecture, together with user-based features. The output of the Bi-LSTM network, which takes the word sequence as input, is concatenated with the distribution of emojis in the user's history. Finally, a softmax activation is used to get the probability distribution over the 25 emoji labels.

GW2017 (Mauro and Xileny, 2018). This system is based on an ensemble of two models, Bi-LSTM and LightGBM (https://github.com/Microsoft/LightGBM). The first model uses two different word2vec models based on the time of creation, while the second model exploits several surface features extracted from the tweet text (e.g., number of words, number of characters).

CIML-UNIPI (Daniele et al., 2018). This system is based on an ensemble composed of 13 models (12 based on TreeESNs and one on LSTM over characters). Models based on TreeESN are built by varying the number of reservoir units, activation function, readout and parser.

sentim (Jacob, 2018). This system relies on a convolutional neural network (CNN) architecture which uses character embeddings as input. Nine layers of residual dilated convolutions with skip connections are applied, followed by a ReLU activation to increase nonlinearity.

UNIBA (Lucia and Daniela, 2018). This system is built by using an ensemble classifier based on WEKA (https://www.cs.waikato.ac.nz/ml/weka/) and scikit-learn (http://scikit-learn.org/stable/). Several kinds of features are exploited: micro-blogging based features, sentiment based features, and semantic based features.

Table 2 shows the official results of the ITAmoji 2018 task, ordered by decreasing Macro F1. The best performing system was proposed by the FBK_FLEXED_BICEPS team, which achieves a Macro F1 of 36.53 (0.365312). Overall, we can see that systems which exploit neural network architectures obtained good performance in this task, especially when relying on a Bi-LSTM model. Table 3 shows the performance of ITAmoji systems with respect to emoji-rank-based metrics.

5.4 Analysis

From Table 2 we can notice that the ranking order of the 5 system runs that obtained the best Macro F1 is substantially preserved when we consider Micro F1 or Weighted F1. However, when we consider Micro F1 rather than Macro F1, the differences among the scores obtained by the top-performing systems tend to be substantially smaller: for instance the Macro F1 of the best system is greater by a factor of 1.64 with respect to the fifth system, while the Micro F1 of the best system is greater by a factor of 1.18 with respect to the fifth system (ranked by Micro F1). This fact can be explained by the tendency of Micro F1 to favour systems that overfit their prediction model to do well on the most common emojis of the test data over systems with good performance across all emojis: this confirms our choice of Macro F1 as the official metric to rank ITAmoji 2018 participating systems.

From Table 3 we can see how the order of the top-5 best performing systems in terms of Macro F1 is substantially preserved when we consider the emoji-rank-based metrics Coverage Error and Accuracy@5 (except for the switch between the fourth and fifth best performing approach).

If we consider the performance of our three baseline systems (described in Section 5.2) we can notice from Table 2 that, as expected, FastText is the best performing baseline approach: a FastText embedding based prediction system would have ranked eighth by Macro F1 in ITAmoji 2018.

Table 6 shows the highest F1 score for each emoji / label across all ITAmoji 2018 team submissions. We can notice that for some emojis characterized by a small percentage of training samples (about 1%), prediction systems still manage to obtain high F1 scores. In contrast, for some other emojis with more training samples available (more than 2%), we observe that the prediction systems do not manage to get high F1 scores. This fact can be explained by the variability of the context of use that characterizes the latter set of emojis, which makes them difficult for a system to learn to predict.

To conclude our analysis, we have to notice that the three runs that obtained the highest Macro F1 scores exploited, besides the text of a tweet, the way the author of that tweet used emojis in previous tweets. This fact highlights that the choice of an emoji strongly depends on the preferences and writing style of each individual, both representing relevant inputs to model in order to improve emoji prediction quality.

5.5 Emoji prediction by humans

In this section we present a preliminary discussion of the results of two experiments designed to evaluate how humans perform when they are requested to identify the most likely emoji(s) to associate to the text of an Italian tweet.
The final purpose here is to explore whether humans are better than automated systems in the emoji prediction task from text, or vice versa. In an attempt to consider a uniform set of emojis in our experimental settings, in both human emoji prediction experiments described in the rest of this section we decided to focus only on the 15 emojis shown in Table 4. This group of emojis includes all the yellow-face emojis considered in the ITAmoji task (Table 1).

Rank  Team               Run Name            Macro F1  Micro F1  Weighted F1
1     FBK_FLEXED_BICEPS  base_ud_1f          36.53     47.67     46.98
2     FBK_FLEXED_BICEPS  base_ud_10f         35.63     47.62     46.58
3     FBK_FLEXED_BICEPS  base_tr_10f         29.21     42.35     39.57
4     GW2017             gw2017_p            23.29     40.09     37.81
5     GW2017             gw2017_e            22.21     42.19     36.90
6     CIML-UNIPI         run1                19.24     29.12     31.48
7     CIML-UNIPI         run2                18.80     37.63     34.10
-     FastText baseline                      11.96     28.72     27.02
8     sentim             Sentim_Test_Run_3   10.62     29.43     23.24
9     sentim             Sentim_Test_Run_2   10.23     31.27     23.11
-     Weighted random baseline                3.94     10.36     10.36
10    GW2017             gw2017_pe            3.75     11.95     10.97
11    UNIBA              itamoji_uniba_run1   3.19     27.38     15.61
12    sentim             Sentim_Test_Run_1    1.95      6.48      3.99
-     Majority baseline                       1.35     20.28      6.84

Table 2: Official Results of ITAmoji Shared Task: evaluation metrics computed by considering only the emoji predicted as the most likely one to be associated to the text of a tweet. Team runs are ranked by Macro F1. The table also shows the performance of the three baselines considered in ITAmoji, ranked with respect to their Macro F1.

Rank  Team               Run Name            Coverage Error  Accuracy@5 / 10 / 15 / 20
1     FBK_FLEXED_BICEPS  base_ud_1f           3.47           81.67 / 92.14 / 96.86 / 99.10
2     FBK_FLEXED_BICEPS  base_ud_10f          3.49           81.53 / 91.94 / 96.82 / 99.17
3     FBK_FLEXED_BICEPS  base_tr_10f          4.35           74.54 / 87.50 / 94.34 / 98.00
4     GW2017             gw2017_p             5.66           67.18 / 81.49 / 89.42 / 92.99
5     GW2017             gw2017_e             4.60           71.30 / 85.90 / 94.30 / 98.25
6     CIML-UNIPI         run1                 5.43           64.60 / 83.02 / 93.00 / 98.01
7     CIML-UNIPI         run2                 5.11           68.46 / 83.86 / 92.38 / 97.28
-     FastText baseline                       7.23           59.07 / 74.22 / 82.58 / 88.89
8     sentim             Sentim_Test_Run_3    6.41           58.53 / 76.93 / 88.52 / 95.74
9     sentim             Sentim_Test_Run_2    6.33           57.60 / 77.17 / 89.70 / 96.41
-     Weighted random baseline                6.92           59.06 / 76.11 / 86.42 / 94.10
10    GW2017             gw2017_pe           13.49           27.93 / 43.04 / 56.00 / 66.27
11    UNIBA              itamoji_uniba_run1   6.70           58.78 / 75.97 / 86.36 / 93.53
12    sentim             Sentim_Test_Run_1   12.45           29.20 / 48.78 / 64.38 / 74.04
-     Majority baseline                       6.63           60.07 / 76.43 / 86.51 / 94.12

Table 3: Official Results of ITAmoji Shared Task: emoji-rank-based metrics (Coverage Error and Accuracy@n). Team runs are ranked by Macro F1. The table also shows the performance of the three baselines considered in ITAmoji, ranked with respect to their Macro F1.

Table 4: The set of 15 face emojis considered in the human annotation experiments.

5.5.1 Figure Eight human annotation

We selected 1,005 tweets with one face emoji from the ITAmoji test set and set up a collaborative annotation task in Figure Eight (F8, https://www.figure-eight.com/) by asking annotators to choose the first, second and third most likely face emoji they would associate to the text of each tweet (instructions provided to annotators, in Italian, are available at http://bit.ly/itaMoji). The set of 1,005 tweets to annotate was perfectly balanced across the 15 face emojis considered. A total of 64 annotators from the F8 platform provided 6,150 evaluations by spotting the 3 most likely face emojis to associate to the text of a tweet.

The Macro F1 of F8 annotators is 24.74. On the same set of 1,005 tweets, the emoji prediction performance of human annotators was better than 9 out of 12 systems submitted to ITAmoji. However, the best performing system submitted to ITAmoji obtained a Macro F1 of 40.48 on those tweets, suggesting that computational models can perform better than humans in this task.

5.5.2 Twitter human annotation

Thanks to the support and collaboration of the #scritturebrevi Twitter community, we replicated the human annotation experiment carried out in F8 in a "crowdsourcing in the wild" setting. From the end of July to the beginning of September 2018, we posted 485 tweets on the Scritture Brevi Twitter account (@FChiusaroli), most of them selected from the same portion of the ITAmoji test set considered in our F8 experiment (see Section 5.5.1). Members of the Scritture Brevi Twitter community were called to participate in a sort of Twitter crowdsourcing game with slogan "#ITAmoji che passione" and hashtag #ITAmoji. Every day a set of tweets without emoji was posted on the Scritture Brevi Twitter account, and ITAmojiers had to post as a reply the most likely face emoji they would associate to the text of the posted tweet. The announcement of the "#ITAmoji che passione" game was published on the Scritture Brevi blog and linked to every posted tweet: https://www.scritturebrevi.it/2018/07/16/itamoji-che-passione/. The game became viral. We managed to involve more than one hundred users, with an average number of valid predictions/replies per tweet equal to 5.4. When the "#ITAmoji che passione" game ended, we were able to identify for each tweet posted on #scritturebrevi (485 tweets in total) the most likely face emoji that the Twitter community would associate. In general, the emoji prediction performance of people from the Scritture Brevi Twitter community was better than 8 out of 12 systems submitted to ITAmoji (always on the same set of 485 tweets annotated by that community).

5.5.3 Comparing human and automated emoji predictions

In the two experiments just described, we asked humans to identify the face emoji(s) they would associate to the text of a tweet by exploiting different approaches to collect data: a controlled collaborative annotation environment in the case of F8 (Section 5.5.1) and a "crowdsourcing in the wild" setting in the case of the Scritture Brevi Twitter community (Section 5.5.2). In Table 5 we compare the emoji prediction performance of human annotators (from both F8 and the Scritture Brevi Twitter community) with the performance of the emoji prediction systems submitted to ITAmoji. To perform this comparison we consider the set of 428 tweets of the ITAmoji test set annotated by both F8 and the Scritture Brevi Twitter community.

We can notice that human predictions, both from F8 and Scritture Brevi, outperform most of the automated systems. Moreover, F8 predictions obtain a Macro F1 (24.46) higher than the Scritture Brevi Twitter community (22.94). This trend may be related to the fact that F8, in contrast to the #scritturebrevi Twitter community, represents a controlled annotation environment.

Team               Run Name                    Macro F1  Micro F1  Weighted F1
FBK_FLEXED_BICEPS  base_ud_1f                  35.70     34.81     35.94
FBK_FLEXED_BICEPS  base_tr_10f                 35.03     34.81     35.36
FBK_FLEXED_BICEPS  base_ud_10f                 34.73     34.11     34.83
Figure Eight predictions                       24.46     26.40     24.57
CIML-UNIPI         run1                        24.03     25.00     23.65
Scritture Brevi predictions                    22.94     24.06     22.99
GW2017             gw2017_p                    20.40     23.13     19.97
GW2017             gw2017_e                    20.33     22.66     19.83
CIML-UNIPI         run2                        19.45     21.26     18.80
sentim             Sentim_Test_Run_2           12.17     15.19     11.59
sentim             Sentim_Test_Run_3           11.07     14.49     10.82
GW2017             gw2017_pe                    5.01      7.48      5.02
UNIBA              itamoji_uniba_run1           2.95      7.47      2.84
sentim             Sentim_Test_Run_1            2.74      4.90      2.83

Table 5: Performance of human (Scritture Brevi and Figure Eight) and automated emoji prediction approaches, compared by considering the set of 428 tweets with face emoji that are part of the ITAmoji test set and have been annotated by both the Figure Eight platform and the #scritturebrevi community. Emoji prediction approaches are ranked by decreasing Macro F1.

Label                            Best F1  Num. Samples  % Samples
red heart                        75.74    5069          20.28
face with tears of joy           57.08    4966          19.86
kiss mark                        51.71     279           1.12
face savoring food               48.34     387           1.55
rose                             46.83     265           1.06
sun                              44.69     319           1.28
smiling face with heart eyes     42.93    2363           9.45
face blowing a kiss              41.61     834           3.34
blue heart                       39.26     506           2.02
smiling face with smiling eyes   38.92    1282           5.13
grinning face                    37.74     885           3.54
winking face                     34.98    1338           5.35
beaming face with smiling eyes   34.47    1028           4.11
sparkles                         32.31     266           1.06
rolling on the floor laughing    31.79     546           2.18
thumbs up                        31.55     642           2.57
smiling face with sunglasses     30.89     700           2.80
flexed biceps                    30.75     417           1.67
thinking face                    29.06     541           2.16
two hearts                       27.48     341           1.36
loudly crying face               25.62     373           1.49
top arrow                        24.03     347           1.39
grinning face with sweat         23.94     379           1.52
winking face with tongue         23.66     483           1.93
face screaming in fear           22.56     444           1.78

Table 6: Best F1 score for each emoji / label across all ITAmoji 2018 teams. The last two columns respectively show, for each emoji, the number and percentage of samples present in the test dataset.

6 Conclusion

Considering the widespread diffusion of emojis as visual devices useful to provide an additional layer of meaning to social media messages, on the one hand, and the unquestionable role of Twitter as one of the most important social media platforms, on the other, we proposed this year at EVALITA 2018 ITAmoji, the Italian Emoji Prediction task. Results of automated systems are in line with the ones obtained in the twin shared task proposed for English and Spanish at SemEval 2018 (Barbieri et al., 2018a). The introduction of new experimental emoji-rank-based metrics in ITAmoji allowed us to perform a finer-grained evaluation of the systems' emoji prediction quality. Moreover, comparing the performance of humans and systems in the emoji prediction task confirms, also in an Italian setting, the outcomes of a similar experiment proposed for English (Barbieri et al., 2017), suggesting that computational models are able to better capture the underlying semantics of emojis.

References

Catalin Coman Andrei, Nechaev Yaroslav, and Zara Giacomo. 2018. Predicting emoji exploiting multimodal data: Fbk participation in itamoji task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Sho Aoki and Osamu Uchida. 2011. A method for automatically generating the emotional vectors of emoticons using weblog articles. In Proc. 10th WSEAS Int. Conf. on Applied Computer and Applied Computational Science, Stevens Point, Wisconsin, USA, pages 132–136.

Francesco Barbieri, German Kruszewski, Francesco Ronzano, and Horacio Saggion. 2016a. How cosmopolitan are emojis?: Exploring emojis usage and meaning over different languages with distributional semantics. In Proceedings of the 2016 ACM on Multimedia Conference, pages 531–535. ACM.

Francesco Barbieri, Francesco Ronzano, and Horacio Saggion. 2016b. What does this emoji mean? A vector space skip-gram model for Twitter emojis. In Proc. of LREC 2016.

Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. 2017. Are emojis predictable? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 105–111, Valencia, Spain, April. ACL.

Francesco Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio Saggion. 2018a. Semeval 2018 task 2: Multilingual emoji prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 24–33. ACL.

Francesco Barbieri, Luis Marujo, William Brendel, Pradeep Karuturim, and Horacio Saggion. 2018b. Exploring Emoji Usage and Prediction Through a Temporal Variation Lens. In 1st International Workshop on Emoji Understanding and Applications in Social Media (at ICWSM 2018).

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. Evalita 2018: Overview of the 6th evaluation campaign of natural language processing and speech tools for italian. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Francesca Chiusaroli. 2015. La scrittura in emoji tra dizionario e traduzione. In Proceedings of 2nd Italian Conference on Computational Linguistics (CLiC-it 2015), Trento, Italy, December 3-4, 2015. Accademia University Press.

F. Chiusaroli. 2017. Pinocchio in emojitaliano. Apice Libri.

Di Sarli Daniele, Gallicchio Claudio, and Micheli Alessio. 2018. Itamoji 2018: Emoji prediction via tree echo state networks. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Giulia Donato and Patrizia Paggio. 2017. Investigating redundancy in emoji use: Study on a Twitter based corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 118–126.

Giulia Donato and Patrizia Paggio. 2018. Classifying the Informative Behaviour of Emoji in Microblogs. In Proc. of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018. ELRA.

Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. 2016. emoji2vec: Learning emoji representations from their description. arXiv preprint arXiv:1609.08359.

Anderson Jacob. 2018. Fully convolutional networks for text classification. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Nikola Ljubešić and Darja Fišer. 2016. A global analysis of emoji usage. In Proceedings of the 10th Web as Corpus Workshop, pages 82–89. Association for Computational Linguistics.

Siciliani Lucia and Girardi Daniela. 2018. The uniba system at the evalita 2018 italian emoji prediction task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Bennici Mauro and Seijas Portocarrero Xileny. 2018. The validity of word vectors over the time for the evalita 2018 emoji prediction task (itamoji). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Hannah Miller, Jacob Thebault-Spieker, Shuo Chang, Isaac Johnson, Loren Terveen, and Brent Hecht. 2016. "Blissfully happy" or "ready to fight": Varying interpretations of emoji. Proc. of ICWSM'16.

Johanna Monti, Federico Sangati, Francesca Chiusaroli, Martin Benjamin, and Sina Mansour. 2016. Emojitalianobot and emojiworldbot - new online tools and digital environments for translation into emoji. In Proc. CLiC-it 2016, Napoli, Italy, December 5-7, 2016, volume 1749 of CEUR Workshop Proceedings.

Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. 2015. Sentiment of emojis. PloS one, 10(12):e0144296.