Overview of the EVALITA 2018 Italian Emoji Prediction (ITAMoji) Task
Francesco Ronzano
Universitat Pompeu Fabra, Spain
Hospital del Mar Medical Research Center, Barcelona, Spain
francesco.ronzano@upf.edu

Francesco Barbieri
Universitat Pompeu Fabra, Barcelona, Spain
francesco.barbieri@upf.edu

Endang Wahyu Pamungkas, Viviana Patti
Department of Computer Science, University of Turin, Italy
{pamungka,patti}@di.unito.it

Francesca Chiusaroli
Department of Humanities, Università di Macerata, Italy
f.chiusaroli@unimc.it

Abstract

English. The Italian Emoji Prediction task (ITAmoji) is proposed at the EVALITA 2018 evaluation campaign for the first time, after the success of the twin Multilingual Emoji Prediction Task, organized in the context of SemEval-2018 in order to challenge the research community to automatically model the semantics of emojis in Twitter. Participants were invited to submit systems designed to predict, given an Italian tweet, its most likely associated emoji, selected in a wide and heterogeneous emoji space. Twelve runs were submitted at ITAmoji by five teams. We present the data sets, the evaluation methodology including different metrics and the approaches of the participating systems. We also present a comparison between the performance of automatic systems and humans solving the same task. Data and further information about this task can be found at: https://sites.google.com/view/itamoji/.

Italiano. Il task italiano per la predizione degli emoji in Twitter (ITAmoji) viene proposto nell’ambito della campagna di valutazione di Evalita 2018 per la prima volta, dopo il successo del task gemello, il Multilingual Emoji Prediction Task, proposto a Semeval-2018 per stimolare la comunità di ricerca a costruire modelli computazionali della semantica delle emoji in Twitter. I partecipanti sono stati invitati a costruire sistemi disegnati per predire l’emoji più probabile dato un tweet in italiano, selezionandola in uno spazio ampio e eterogeneo di emoji. In ITAmoji sono stati valutati i risultati di dodici sistemi di predizione di emoji messi a punto da cinque gruppi di lavoro. Presentiamo qui i dataset, la metodologia di valutazione (che include diverse metriche) e gli approcci dei sistemi che hanno partecipato. Presentiamo inoltre una riflessione sui risultati ottenuti in tale task da sistemi automatici e umani.

1   Introduction

During the last decade the use of emoji has increasingly pervaded social media platforms by providing users with a rich set of pictograms useful to visually complement and enrich the expressiveness of short text messages. Nowadays this novel, visual way of communication represents a de facto standard in a wide range of social media platforms, including fully-fledged portals for user-generated content like Twitter, Facebook and Instagram as well as instant-messaging services like WhatsApp. As a consequence, the ability to effectively interpret and model the semantics of emojis has become essential when we analyze social media content.

Even if over the last few years the study of this new form of language has been receiving growing attention, at present the body of investigations that deal with emojis is still scarce, especially when we consider their characterization from a Natural Language Processing (NLP) perspective. While there are notable exceptions which study the semantics of emojis and their usage (Barbieri et al., 2016a; Barbieri et al., 2018b; Aoki and Uchida, 2011; Eisner et al., 2016; Ljubešić and Fišer, 2016), reflecting also on their informative behaviour (Donato and Paggio, 2017; Donato and Paggio, 2018), or their sentiment (Novak et al., 2015), the interplay between text-based messages and emojis is still explored by only a small number of studies. Among these investigations there is the analysis of emoji predictability by Barbieri et al. (2017), who proposed a neural model to predict the most likely emoji to appear in a text message (tweet). The task turned out to be hard, as emojis encode multiple meanings (Barbieri et al., 2016b). Related to this, in the context of the International Workshop on Semantic Evaluation (SemEval 2018), the Multilingual Emoji Prediction Task (Barbieri et al., 2018a) was organized in order to challenge the research community to automatically model the semantics of emojis occurring in English and Spanish Twitter messages. The task was very successful, with 49 teams participating in the English subtask and 22 in the Spanish subtask. This motivated us to propose the shared task also for the Italian language in the context of the Evalita 2018 evaluation campaign (Caselli et al., 2018), with the twofold aim to widen the setting for cross-language comparisons for emoji prediction in Twitter and to experiment with novel metrics to better assess the quality of the automatic predictions.

In general, exciting and highly relevant avenues for research are still to be explored with respect to emoji understanding, since emojis are often an essential component of social media texts: ignoring or misinterpreting them may lead to misunderstanding the intended meaning of a message (Miller et al., 2016). The ambiguity of emojis also raises interesting questions in application domains. Consider for instance a human-computer interaction setting: how can we teach an artificial agent to correctly interpret and recognize the use of emojis in spontaneous conversation? The main motivation behind this question is that an AI system able to predict emojis could contribute notably to better natural language understanding (Novak et al., 2015) and thus to other Natural Language Processing tasks such as generating emoji-enriched social media content, enhancing emotion/sentiment analysis systems, improving retrieval of social network material, and ultimately improving user profiling.

In the following, we describe the main elements of the shared task (Section 3), after a brief summary of previous projects reflecting on the semantics of emojis in Italian (Section 2). Then, we cover the data collection, curation and release process (Section 4). In Section 5 we detail the evaluation metrics, we describe the participants' results and we propose a first comparison with the performance of humans solving the same task. We conclude the paper with some reflections on the outcomes of the proposed task.

2   Emojis and Italian

We can observe a growing interest in the semantics of emojis in relation to Italian. In particular, some interesting projects have been carried out in recent years which address the issue in a translation framework, investigating the possibility of translating Italian literary texts into the universal visual language of emoji (Chiusaroli, 2015; Monti et al., 2016). Among them, the Emojitaliano project was launched on Twitter as a translation project of the Italian novel Pinocchio into emoji (Chiusaroli, 2017). An original approach based on crowdsourcing was adopted, involving in the translation task the Twitter community named Scritture Brevi.

The Twitter community #scritturebrevi. The community (#scritturebrevi, @FChiusaroli, 10,151 followers in November 2018) had previously been involved in experiments of creative writing, also in emojis: with the hashtag #inemoticon, experiments of mixed translation (words and emojis) have been carried out on Twitter, exploring the semantic versatility of emojis and their value in rebus writing. Translating the whole Pinocchio book was a more complex and engaging task, especially for its focus on developing a common code base, in terms of glossary and grammar, which is absolutely new with respect to previous projects. The translation of Pinocchio started in February 2016. Every day, for 28 weeks, sentences taken from Pinocchio were tweeted, and the followers were invited to suggest their translations into emoji; at the end of each day, the official version of the translation was validated and published. An online tool, Emojiitalianobot, has been developed in order to help the community memorize the semantic values assigned to each emoji during the collective translation process. From its very beginning on Twitter, the project was an instant success, becoming a viral web phenomenon thanks to the Scritture Brevi community. Therefore, it was a natural choice to involve the same Twitter community to reflect on the semantics of emoji from a different perspective, i.e. the one we
propose in the context of the ITAmoji shared task, thus helping us to understand how good humans are at predicting emojis (see Section 5.5.2).

3   Task Description

We invited participants to submit systems designed to predict, given a tweet in Italian, its most likely associated emoji, based only on the text of the tweet. As for the experimental setting, for simplicity we considered tweets including only one emoji (possibly repeated). After removing the emoji from the tweet, we asked users to predict it. We challenged systems to predict emojis among a wide and heterogeneous emoji space. In particular, we selected the tweets that included one of the twenty-five emojis that occur most frequently in the Twitter data we collected (see Table 1). Therefore, the task can be seen as a multi-class classification task where systems should predict one of 25 possible emojis from the text of a given tweet. Each participant was allowed to submit up to three system runs. Participants were allowed to use additional data to train the systems, such as lexicons and pre-trained word embeddings. In order to allow a finer-grained evaluation of results, we encouraged participants to submit, for each tweet, not only the most likely emoji predicted but also the complete ranking from the most likely to the least likely emoji to be associated with the text of the tweet.

    Innamorato sempre di più [URL]    ("More and more in love")

Figure 1: Example of tweet with an emoji at the end, considered in the emoji prediction task.

4   Task Data

The data for this task were retrieved from Twitter by experimenting with two different approaches: (i) gathering the Twitter stream of (geolocalized) Italian tweets from October 2015 to February 2018; and (ii) retrieving tweets from the followers of the accounts of the most popular Italian newspapers. We randomly selected 275,000 tweets from these collections by choosing tweets that contained one and only one emoji among the 25 most frequent emojis listed in Table 1. We split our data into two sets consisting of 250,000 training samples and 25,000 test samples.

    Emoji (rank)    % Tweets in Train and Test set
     1              20.27
     2              19.86
     3               9.45
     4               5.35
     5               5.13
     6               4.11
     7               3.54
     8               3.33
     9               2.80
    10               2.57
    11               2.18
    12               2.16
    13               2.03
    14               1.94
    15               1.78
    16               1.67
    17               1.55
    18               1.52
    19               1.49
    20               1.39
    21               1.37
    22               1.28
    23               1.12
    24               1.07
    25               1.06

Table 1: The distribution (percentage) for each emoji in the train and test set.

5   Evaluation

In this section we present the evaluation setting for the ITAmoji shared task.

5.1   Metrics

The evaluation of the emoji prediction systems has been based on the classic precision and recall metrics over each emoji. The final ranking of the participating teams of ITAmoji 2018 relies on the Macro F1 score computed with respect to the most likely emoji predicted, given the text of each tweet of the test set, in line with the proposal in the twin task at SemEval 2018 for English and Spanish (Barbieri et al., 2018a). In this way we intend to
encourage systems to perform well overall, which would inherently mean a better sensitivity to the use of emojis in general, rather than, for instance, overfitting a model to do well on the three or four most common emojis of the test data.

In general, identifying a coherent and effective approach to compare the performance of distinct emoji prediction systems is not an easy task. We often have the clear impression that the semantics of some sets of emojis can be similar; it would therefore be interesting to have a way to compare and evaluate at a finer-grained level the emoji prediction quality of two distinct systems when they both fail to predict the right emoji to associate with a tweet. In such cases, it can be important to distinguish between the system that places the right prediction among the most likely emojis to be associated with that tweet and the one that characterizes the right prediction as an emoji that is unlikely to be associated with that tweet. In order to capture this aspect, we gave ITAmoji participants the possibility to submit, as emoji predictions, the ordered ranking of the 25 emojis considered in ITAmoji. Systems providing the ranked list of emoji predictions were also compared by considering the following additional emoji-rank-based metrics: Accuracy@5/10/15/20 and Coverage Error. All the submissions we received provided the ranked list of 25 emojis as predictions: as a consequence it was possible to compute the emoji-rank-based metrics for all of them.

A detailed description of all the evaluation metrics we considered to compare the quality of emoji prediction approaches is given below. The following three standard metrics are computed by considering only the emoji predicted as the most likely one to be associated with the text of a tweet:

  • Macro F1: compute the F1 score for each label (emoji), and find their unweighted mean (used to determine the final ranking of the participating teams);

  • Micro F1: compute the F1 score globally by counting the total true positives, false negatives and false positives across all labels (emojis);

  • Weighted F1: compute the F1 score for each label (emoji), and find their average, weighted by support (the number of true instances for each label).

Regarding the emoji-rank-based metrics, we considered:

  • Coverage error: compute how far we need to go through the ranked scores of labels (emojis) to cover all true labels;

  • Accuracy@n: the accuracy value computed by considering as right predictions the ones in which the right label (emoji) is among the top n most likely ones.

5.2   Baseline

In order to compare the performance of the ITAmoji participating systems with baseline approaches, we considered three different baselines:
- Majority baseline: for each text of a tweet we predict the ordered list of the 25 most likely emojis sorted by their frequency in the training set, that is, we always predict as first choice the red heart, and as last choice the rose emoji.
- Weighted random baseline: for each text of a tweet we predict the ordered list of the 25 most likely emojis, where the first prediction is randomly selected taking into consideration the label frequency in the training set (in order to keep the same label distribution) and the rest of the predictions (from the second to the last one) are generated by considering the remaining emojis sorted by label frequency.
- FastText baseline: for each text of a tweet we predict the ordered list of the 25 most likely emojis by relying on fastText (https://fasttext.cc/) with basic parameters and pretrained embeddings with 300 dimensions (Barbieri et al., 2016a).

5.3   Participating Systems and Results

We received 12 submissions in total from 5 different teams. The main approaches and features of participating teams are described below.

FBK_FLEXED_BICEPS (Andrei et al., 2018) This system exploits a recurrent neural network architecture, Bidirectional Long Short Term Memory (Bi-LSTM), together with user-based features. The output of the Bi-LSTM network, which takes the word sequence as input, is concatenated with the distribution of the emojis used by the author in the past (user history). Finally, a softmax activation is used to get the probability distribution over the 25 emoji labels.
GW2017 (Mauro and Xileny, 2018) This system is based on an ensemble of two models, Bi-LSTM and LightGBM (https://github.com/Microsoft/LightGBM). The first model uses two different word2vec models based on the time of creation, while the second model exploits several surface features extracted from the tweet text (e.g., number of words, number of characters).

CIML-UNIPI (Daniele et al., 2018) This system is based on an ensemble composed of 13 models (12 based on TreeESNs and one on an LSTM over characters). Models based on TreeESN are built by varying the number of reservoir units, activation function, readout and parser.

sentim (Jacob, 2018) This system relies on a convolutional neural network (CNN) architecture which uses character embeddings as input. Nine layers of residual dilated convolutions with skip connections are applied, followed by a ReLU activation to increase nonlinearity.

UNIBA (Lucia and Daniela, 2018) This system is built by using an ensemble classifier based on WEKA (https://www.cs.waikato.ac.nz/ml/weka/) and scikit-learn (http://scikit-learn.org/stable/). Several kinds of features are exploited: micro-blogging based features, sentiment based features, and semantic based features.

Table 2 shows the official results of the ITAmoji 2018 task, ordered by decreasing Macro F1. The best performing system was proposed by the FBK_FLEXED_BICEPS team, which achieves 0.365312 in Macro F1. Overall, we can see that systems which exploit neural network architectures obtained good performance in this task, especially when relying on a Bi-LSTM model. Table 3 shows the performance of ITAmoji systems with respect to emoji-rank-based metrics.

5.4   Analysis

From Table 2 we can notice that the ranking order of the 5 system runs that obtained the best Macro F1 is substantially preserved when we consider Micro F1 or Weighted F1. However, compared to Macro F1, when we consider Micro F1 the differences among the scores obtained by the top-performing systems tend to be substantially smaller: for instance, the Macro F1 of the best system is greater by a factor of 1.64 with respect to the fifth system, while the Micro F1 of the best system is greater by a factor of 1.18 with respect to the fifth system (ranked by Micro F1). This fact can be explained by the tendency of Micro F1 to favour systems that overfit their prediction model to do well on the most common emojis of the test data over systems with good performance across all emojis: this confirms our choice of Macro F1 as the official metric to rank ITAmoji 2018 participating systems.

From Table 3 we can see how the order of the top-5 best performing systems in terms of Macro F1 is substantially preserved when we consider the emoji-rank-based metrics Coverage Error and Accuracy@5 (except for the switch between the fourth and fifth best performing approaches).

If we consider the performance of our three baseline systems (described in Section 5.2) we can notice from Table 2 that, as expected, FastText is the best performing baseline approach: a FastText embedding based prediction system would have ranked eighth by Macro F1 in ITAmoji 2018.

Table 6 shows the highest F1 score for each emoji / label across all ITAmoji 2018 team submissions. We can notice that even if some specific emojis are characterized by a small percentage of training samples (about 1%), prediction systems manage to obtain high F1 scores on them. In contrast, for some other emojis, even if there are more training samples available with respect to the previous set (more than 2%), we observe that the prediction systems do not manage to get high F1 scores. This fact can be explained by the variability of the context of use that characterizes the latter set of emojis, which makes them difficult for systems to learn to predict.

To conclude our analysis, we note that the three runs that obtained the highest Macro F1 scores exploited, besides the text of a tweet, the way the author of that tweet used emojis in previous tweets. This fact highlights that the choice of an emoji strongly depends on the preferences and writing style of each individual, both representing relevant inputs to model in order to improve emoji prediction quality.

5.5   Emoji prediction by humans

In this section we present a preliminary discussion of the results of two experiments designed to evaluate how humans perform when they are requested to identify the most likely emoji(s) to associate with the text of an Italian tweet. The
            Rank   Team                     Run Name             Macro F1      Micro F1     Weighted F1
            1      FBK_FLEXED_BICEPS base_ud_1f                   36.53         47.67          46.98
            2      FBK_FLEXED_BICEPS base_ud_10f                  35.63         47.62          46.58
            3      FBK_FLEXED_BICEPS base_tr_10f                  29.21         42.35          39.57
            4      GW2017                   gw2017_p              23.29         40.09          37.81
            5      GW2017                   gw2017_e              22.21         42.19          36.90
            6      CIML-UNIPI               run1                  19.24         29.12          31.48
            7      CIML-UNIPI               run2                  18.80         37.63         34.101
            -      FastText baseline                              11.96         28.72          27.02
            8      sentim                   Sentim_Test_Run_3     10.62         29.43          23.24
            9      sentim                   Sentim_Test_Run_2     10.23         31.27          23.11
            -      Weighted random baseline                        3.94         10.36          10.36
            10     GW2017                   gw2017_pe              3.75         11.95          10.97
            11     UNIBA                    itamoji_uniba_run1     3.19         27.38          15.61
            12     sentim                   Sentim_Test_Run_1      1.95          6.48           3.99
            -      Majority baseline                               1.35         20.28           6.84


Table 2: Official Results of ITAmoji Shared Task: evaluation metrics computed by considering only the
emoji predicted as the most likely one to be associated with the text of a tweet. Team runs are ranked by
Macro F1. The table also shows the performance of the three baselines considered in ITAmoji, ranked
with respect to their Macro F1.
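
The three first-choice metrics of Table 2 (Macro, Micro and Weighted F1, as defined in Section 5.1) can be illustrated with a minimal sketch. It is not the official ITAmoji scorer: the function names are hypothetical, the emoji glyphs are replaced by integer ids 0..24 in Table 1 order, and the gold labels are a synthetic set of 10,002 tweets built from the Table 1 percentages rather than the real test set. The sketch scores the Majority baseline, which always predicts the most frequent emoji as first choice.

```python
from collections import Counter

# Table 1 percentages, from the most to the least frequent emoji.
# Integer ids 0..24 stand in for the emoji glyphs (assumption of this sketch).
TABLE1_PERCENT = [20.27, 19.86, 9.45, 5.35, 5.13, 4.11, 3.54, 3.33, 2.80,
                  2.57, 2.18, 2.16, 2.03, 1.94, 1.78, 1.67, 1.55, 1.52,
                  1.49, 1.39, 1.37, 1.28, 1.12, 1.07, 1.06]


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when the label never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(gold, pred, labels):
    """Return (macro, micro, weighted) F1 for first-choice predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label p was wrong
            fn[g] += 1  # the gold label g was missed
    support = Counter(gold)
    per_label = {lab: f1(tp[lab], fp[lab], fn[lab]) for lab in labels}
    macro = sum(per_label.values()) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    weighted = sum(per_label[lab] * support[lab] for lab in labels) / len(gold)
    return macro, micro, weighted


# Synthetic gold labels following Table 1 (100 tweets per percentage point).
gold = [lab for lab, pct in enumerate(TABLE1_PERCENT)
        for _ in range(round(pct * 100))]
# Majority baseline: always predict the most frequent emoji (id 0).
pred = [0] * len(gold)

macro, micro, weighted = f1_scores(gold, pred, labels=range(25))
print(f"Macro {macro:.2%}  Micro {micro:.2%}  Weighted {weighted:.2%}")
```

Under these assumptions the three values come out close to the Majority baseline row of Table 2 (1.35 / 20.28 / 6.84). Small differences are expected, since Table 1 reports the combined train and test distribution. The example also shows why Macro F1 was preferred for the official ranking: a system that only ever predicts the most frequent emoji reaches a Micro F1 of about 20% but a Macro F1 of little more than 1%.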
         Rank   Team                     Run Name             Coverage Error     Accuracy@5 / 10 / 15 / 20
         1      FBK_FLEXED_BICEPS base_ud_1f                       3.47          81.67 / 92.14 / 96.86 / 99.10
         2      FBK_FLEXED_BICEPS base_ud_10f                      3.49          81.53 / 91.94 / 96.82 / 99.17
         3      FBK_FLEXED_BICEPS base_tr_10f                      4.35          74.54 / 87.50 / 94.34 / 98.00
         4      GW2017                   gw2017_p                  5.66          67.18 / 81.49 / 89.42 / 92.99
         5      GW2017                   gw2017_e                  4.60          71.30 / 85.90 / 94.30 / 98.25
         6      CIML-UNIPI               run1                      5.43          64.60 / 83.02 / 93.00 / 98.01
         7      CIML-UNIPI               run2                      5.11          68.46 / 83.86 / 92.38 / 97.28
         -      FastText baseline                                  7.23          59.07 / 74.22 / 82.58 / 88.89
         8      sentim                   Sentim_Test_Run_3         6.41          58.53 / 76.93 / 88.52 / 95.74
         9      sentim                   Sentim_Test_Run_2         6.33          57.60 / 77.17 / 89.70 / 96.41
         -      Weighted random baseline                           6.92          59.06 / 76.11 / 86.42 / 94.10
         10     GW2017                   gw2017_pe                13.49          27.93 / 43.04 / 56.00 / 66.27
         11     UNIBA                    itamoji_uniba_run1        6.70          58.78 / 75.97 / 86.36 / 93.53
         12     sentim                   Sentim_Test_Run_1        12.45          29.20 / 48.78 / 64.38 / 74.04
         -      Majority baseline                                  6.63          60.07 / 76.43 / 86.51 / 94.12


Table 3: Official Results of ITAmoji Shared Task: emoji-rank-based metrics (Coverage error and
Accuracy@n). Team runs are ranked by Macro F1. The table also shows the performance of the three baselines
considered in ITAmoji, ranked with respect to their Macro F1.
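
The emoji-rank-based metrics of Table 3 can be sketched in the same spirit. Since every tweet has exactly one gold emoji, the coverage error reduces to the average position of the gold emoji in the submitted ranking, and Accuracy@n is the fraction of tweets whose gold emoji appears among the first n ranked emojis. The code below is an illustrative sketch, not the official scorer: function names are hypothetical, integer ids 0..24 replace the emoji glyphs, and the gold labels are a synthetic set derived from the Table 1 percentages. It evaluates the Majority baseline, whose ranking is the same frequency-sorted list for every tweet.

```python
# Table 1 percentages, from the most to the least frequent emoji.
# Integer ids 0..24 stand in for the emoji glyphs (assumption of this sketch).
TABLE1_PERCENT = [20.27, 19.86, 9.45, 5.35, 5.13, 4.11, 3.54, 3.33, 2.80,
                  2.57, 2.18, 2.16, 2.03, 1.94, 1.78, 1.67, 1.55, 1.52,
                  1.49, 1.39, 1.37, 1.28, 1.12, 1.07, 1.06]


def gold_rank(ranking, gold):
    """1-based position of the gold emoji in a ranked prediction list."""
    return ranking.index(gold) + 1


def accuracy_at_n(rankings, golds, n):
    """Fraction of tweets whose gold emoji is among the top n predictions."""
    hits = sum(gold_rank(r, g) <= n for r, g in zip(rankings, golds))
    return hits / len(golds)


def coverage_error(rankings, golds):
    """Average rank of the gold emoji (single-label case)."""
    total = sum(gold_rank(r, g) for r, g in zip(rankings, golds))
    return total / len(golds)


# Synthetic gold labels following Table 1 (100 tweets per percentage point).
golds = [lab for lab, pct in enumerate(TABLE1_PERCENT)
         for _ in range(round(pct * 100))]
# Majority baseline: the same frequency-sorted ranking for every tweet.
majority_ranking = list(range(25))
rankings = [majority_ranking] * len(golds)

for n in (5, 10, 15, 20):
    print(f"Accuracy@{n}: {accuracy_at_n(rankings, golds, n):.2%}")
print(f"Coverage error: {coverage_error(rankings, golds):.2f}")
```

Under these assumptions the results land close to the Majority baseline row of Table 3 (coverage error 6.63, Accuracy@5/10/15/20 of 60.07 / 76.43 / 86.51 / 94.12), with small deviations because Table 1 describes the combined train and test data. A lower coverage error is better, as it means the gold emoji is found earlier in the ranking.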


final purpose here is to explore whether humans are better than automated systems in the emoji prediction task from text, or vice versa. In an attempt to consider a uniform set of emojis in our experimental settings, in both human emoji prediction experiments described in the rest of this section we decided to focus only on the 15 emojis shown in Table 4. This group of emojis includes all the yellow-face emojis considered in the ITAmoji task (Table 1).

Table 4: The set of 15 face emoji considered in the human annotation experiments.

5.5.1   Figure Eight human annotation

We selected 1,005 tweets with one face emoji from the ITAmoji test set and set up a collaborative annotation task in Figure Eight (F8, https://www.figure-eight.com/) by asking annotators to choose the first, second and third most likely face emoji they would associate with the text of each tweet (instructions provided to annotators, in Italian: http://bit.ly/itaMoji). The set of 1,005 tweets to annotate was perfectly balanced across the 15 face emojis considered. A total of 64 annotators from the F8
platform provided 6,150 evaluations by spotting             ent approaches to collect data: a controlled collab-
the 3 most likely face emojis to associate to the           orative annotation environment in the case of F8
text of a tweet.                                            (Section 5.5.1) and a “crowdcourcing in the wild”
   The Macro F1 of F8 annotators is 24.74. On               setting in the case of the Scritture Brevi Twitter
the same set of 1,005 tweets, the emoji prediction          community (Section 5.5.2). In Table 5 we com-
performance of human annotators was better than             pare the emoji prediction performance of human
9 out of 12 systems submitted to ITAmoji. How-              annotators (from both F8 and Scritture Brevi Twit-
ever, the the best performing system submitted to           ter community) with the performance of the emoji
ITAmoji obtained a Macro F1 of 40.48 on those               prediction systems submitted to ITAmoji. To per-
tweets, suggesting that computational models can            form this comparison we consider the set of 428
perform better than humans in this task.                    tweets of the ITAmoji test set annotated by F8 and
                                                            the Scritture Brevi Twitter community.
5.5.2   Twitter human annotation                               We can notice that human predictions, both
Thanks to the support and collaboration of the              from F8 and Scritture Brevi, outperforms most
#scritturebrevi Twitter community, we replicated            of the automated systems. Moreover, F8 predic-
the human annotation experiment carried out in F8           tions obtain a Macro F1 (24.46) higher than Scrit-
in a “crowdcourcing in the wild” setting. From the          ture Brevi Twitter community (22.94). This trend
end of July to the beginning of September 2018,             may be related to the fact that F8, in contrast to
we posted 485 tweets on the Scritture Brevi Twit-           the #scritturebrevi Twitter community, represents
ter account (@FChiusaroli), most of them selected           a controlled annotation environment.
from the same portion of the ITAmoji test set con-
sidered in our F8 experiment (see Section 5.5.1).           6   Conclusion
Members of the Scritture Brevi Twitter commu-               Considered the widespread diffusion of emojis
nity were called to participate to a sort of Twitter        as visual devices useful to provide an additional
crowdsourcing game with slogan #ITAmoji che                 layer of meaning to social media messages, on one
passione and hashtag #ITAmoji. Every day a set              hand, and the unquestionable role of Twitter as one
of tweets without emoji was posted on the Scrit-            of the most important social media platforms, on
ture Brevi Twitter account, and ITAmojiers had              the other, we proposed this year at Evalita 2018
to post as a reply the most likely face emoj they           ITAmoji, the Italian Emoji Prediction task.
would associate to the text of the posted tweet7 .          Results of automated systems are in line with ones
The game became viral. We managed to involve                obtained in the twin shared task proposed for En-
more than one hundred users with an average num-            glish and Spanish at Semeval 2018 (Barbieri et
ber of valid predictions/replies per tweet equal            al., 2018a). The introduction of new experimental
to 5.4. When the #ITAmoji che passione game                 emoji-rank based metrics in ITAmoji allowed us
ended, we were able to identify for each tweet              to perform a finer-grained evaluation of the sys-
posted on #scritturebrevi (485 tweets in total) the         tems’ emoji prediction quality. Moreover, com-
most-likely face-emoji that the Twitter community           paring performances of humans and systems in the
would associate. In general, the emoji prediction           emoji prediction task confirms also in an Italian
performance of people from Scritture Brevi Twit-            setting the outcomes of a similar experiment pro-
ter community was better than 8 out of 12 systems           posed for English (Barbieri et al., 2017), suggest-
submitted to ITAmoji (always on the same set of             ing that computational models are able to better
485 tweets annotated by that community).                    capture the underlying semantics of emojis.
5.5.3   Comparing human and automated
        emoji predictions                                   References
In the two experiments just described, we asked             Catalin Coman Andrei, Nechaev Yaroslav, and Zara
humans to identify the face emoji(s) they would               Giacomo. 2018. Predicting emoji exploiting mul-
associate to the text of a tweet by exploiting differ-        timodal data: Fbk participation in itamoji task. In
                                                              Tommaso Caselli, Nicole Novielli, Viviana Patti,
   7
     The announce of the “#ITAmoji che passione” game         and Paolo Rosso, editors, Proceedings of 6th Eval-
was published on the Scritture Brevi’s blog and linked to     uation Campaign of Natural Language Process-
every posted tweet: https://www.scritturebrevi.               ing and Speech Tools for Italian. Final Workshop
it/2018/07/16/itamoji-che-passione/                           (EVALITA 2018), Turin, Italy. CEUR.org.
             Team                        Run Name             Macro F1   Micro F1   Weighted F1
             FBK_FLEXED_BICEPS base_ud_1f                     35.70       34.81        35.94
             FBK_FLEXED_BICEPS base_tr_10f                    35.03       34.81        35.36
             FBK_FLEXED_BICEPS base_ud_10f                    34.73       34.11        34.83
             Figure Eight predictions                         24.46       26.40        24.57
             CIML-UNIPI                  run1                 24.03       25.00        23.65
             Scritture Brevi predictions                      22.94       24.06        22.99
             GW2017                      gw2017_p             20.40       23.13        19.97
             GW2017                      gw2017_e             20.33       22.66        19.83
             CIML-UNIPI                  run2                 19.45       21.26        18.80
             sentim                      Sentim_Test_Run_2    12.17       15.19        11.59
             sentim                      Sentim_Test_Run_3    11.07       14.49        10.82
             GW2017                      gw2017_pe            5.01         7.48         5.02
             UNIBA                       itamoji_uniba_run1   2.95         7.47         2.84
             sentim                      Sentim_Test_Run_1    2.74         4.90         2.83


Table 5: Performance of human (Scritture Brevi and Figure Eight) and automated emoji prediction
approaches on the 428 face-emoji tweets of the ITAmoji test set that were annotated by both the
Figure Eight platform and the #scritturebrevi community. Emoji prediction approaches are ranked
by decreasing Macro F1.
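
For reference, the three F1 variants reported in Table 5 can be computed along the lines of the following minimal sketch. It assumes one gold and one predicted emoji per tweet; the function names `f1_scores` and `majority_emoji`, the toy labels, and the use of a simple majority vote to aggregate community replies are illustrative assumptions and not the evaluation script used in ITAmoji.

```python
from collections import Counter

def majority_emoji(replies):
    """Most frequent emoji among the replies to one tweet (assumed aggregation)."""
    return Counter(replies).most_common(1)[0][0]

def f1_scores(gold, pred):
    """Return (macro, micro, weighted) F1 for single-label predictions."""
    labels = sorted(set(gold) | set(pred))
    support = Counter(gold)          # gold occurrences per label
    predicted = Counter(pred)        # predicted occurrences per label
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    per_label = {}
    for lab in labels:
        precision = correct[lab] / predicted[lab] if predicted[lab] else 0.0
        recall = correct[lab] / support[lab] if support[lab] else 0.0
        denom = precision + recall
        per_label[lab] = 2 * precision * recall / denom if denom else 0.0
    # Macro: unweighted mean over labels.
    macro = sum(per_label.values()) / len(labels)
    # Micro: with one label per tweet this equals plain accuracy.
    micro = sum(correct.values()) / len(gold)
    # Weighted: per-label F1 weighted by the number of gold samples.
    weighted = sum(per_label[lab] * support[lab] for lab in labels) / len(gold)
    return macro, micro, weighted

# Toy data with hypothetical label names.
gold = ["joy", "joy", "heart", "wink"]
pred = ["joy", "heart", "heart", "joy"]
macro, micro, weighted = f1_scores(gold, pred)
crowd_choice = majority_emoji(["wink", "joy", "wink"])
```

Macro F1 gives every emoji the same weight regardless of its frequency, which is why it is the ranking metric that penalises systems relying on the most frequent labels.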


  Emoji     Label                                    Macro F1            Num. Samples     % Samples
            red heart                                75.74               5069             20.28
            face with tears of joy                   57.08               4966             19.86
            kiss mark                                51.71               279              1.12
            face savoring food                       48.34               387              1.55
            rose                                     46.83               265              1.06
            sun                                      44.69               319              1.28
            smiling face with heart eyes             42.93               2363             9.45
            face blowing a kiss                      41.61               834              3.34
            blue heart                               39.26               506              2.02
            smiling face with smiling eyes           38.92               1282             5.13
            grinning face                            37.74               885              3.54
            winking face                             34.98               1338             5.35
            beaming face with smiling eyes           34.47               1028             4.11
            sparkles                                 32.31               266              1.06
            rolling on the floor laughing            31.79               546              2.18
            thumbs up                                31.55               642              2.57
            smiling face with sunglasses             30.89               700              2.80
            flexed biceps                            30.75               417              1.67
            thinking face                            29.06               541              2.16
            two hearts                               27.48               341              1.36
            loudly crying face                       25.62               373              1.49
            top arrow                                24.03               347              1.39
            grinning face with sweat                 23.94               379              1.52
            winking face with tongue                 23.66               483              1.93
            face screaming in fear                   22.56               444              1.78

Table 6: Best F1 score for each emoji / label across all ITAmoji 2018 teams. The fourth and fifth columns
respectively show, for each emoji, the number and percentage of test samples present in the test dataset.
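
As a quick consistency check on Table 6, the percentage column can be reproduced from the sample counts. The sketch below copies the Num. Samples values from the table; the variable names are illustrative, and rounding to two decimals is assumed to match the table's formatting.

```python
# Num. Samples column of Table 6, in table order.
counts = [
    ("red heart", 5069), ("face with tears of joy", 4966), ("kiss mark", 279),
    ("face savoring food", 387), ("rose", 265), ("sun", 319),
    ("smiling face with heart eyes", 2363), ("face blowing a kiss", 834),
    ("blue heart", 506), ("smiling face with smiling eyes", 1282),
    ("grinning face", 885), ("winking face", 1338),
    ("beaming face with smiling eyes", 1028), ("sparkles", 266),
    ("rolling on the floor laughing", 546), ("thumbs up", 642),
    ("smiling face with sunglasses", 700), ("flexed biceps", 417),
    ("thinking face", 541), ("two hearts", 341), ("loudly crying face", 373),
    ("top arrow", 347), ("grinning face with sweat", 379),
    ("winking face with tongue", 483), ("face screaming in fear", 444),
]

# Total number of test samples covered by the table.
total = sum(n for _, n in counts)

# Percentage of test samples per label, rounded to two decimals.
pct = {label: round(100 * n / total, 2) for label, n in counts}

print(total)
print(pct["red heart"], pct["kiss mark"])
```

The counts add up to 25,000 samples, and the two most frequent emojis alone account for about 40% of them, which helps explain the gap between their F1 scores and those of the rarer labels.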

Sho Aoki and Osamu Uchida. 2011. A method for automatically generating the
  emotional vectors of emoticons using weblog articles. In Proc. 10th WSEAS
  Int. Conf. on Applied Computer and Applied Computational Science, Stevens
  Point, Wisconsin, USA, pages 132–136.

Francesco Barbieri, German Kruszewski, Francesco Ronzano, and Horacio Saggion.
  2016a. How cosmopolitan are emojis?: Exploring emojis usage and meaning over
  different languages with distributional semantics. In Proceedings of the
  2016 ACM on Multimedia Conference, pages 531–535. ACM.

Francesco Barbieri, Francesco Ronzano, and Horacio Saggion. 2016b. What does
  this emoji mean? a vector space skip-gram model for Twitter emojis. In Proc.
  of LREC 2016.

Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. 2017. Are emojis
  predictable? In Proceedings of the 15th Conference of the European Chapter
  of the Association for Computational Linguistics: Volume 2, Short Papers,
  pages 105–111, Valencia, Spain, April. ACL.

Francesco Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa
  Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio
  Saggion. 2018a. Semeval 2018 task 2: Multilingual emoji prediction. In
  Proceedings of The 12th International Workshop on Semantic Evaluation, pages
  24–33. ACL.

Francesco Barbieri, Luis Marujo, William Brendel, Pradeep Karuturim, and
  Horacio Saggion. 2018b. Exploring Emoji Usage and Prediction Through a
  Temporal Variation Lens. In 1st International Workshop on Emoji
  Understanding and Applications in Social Media (at ICWSM 2018).

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018.
  Evalita 2018: Overview of the 6th evaluation campaign of natural language
  processing and speech tools for italian. In Tommaso Caselli, Nicole
  Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of Sixth
  Evaluation Campaign of Natural Language Processing and Speech Tools for
  Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Francesca Chiusaroli. 2015. La scrittura in emoji tra dizionario e traduzione.
  In Proceedings of 2nd Italian Conference on Computational Linguistics
  (CLiC-it 2015), Trento, Italy, December 3-4, 2015. Accademia University
  Press.

F. Chiusaroli. 2017. Pinocchio in emojitaliano. Apice Libri.

Di Sarli Daniele, Gallicchio Claudio, and Micheli Alessio. 2018. Itamoji 2018:
  Emoji prediction via tree echo state networks. In Tommaso Caselli, Nicole
  Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of 6th
  Evaluation Campaign of Natural Language Processing and Speech Tools for
  Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Giulia Donato and Patrizia Paggio. 2017. Investigating redundancy in emoji
  use: Study on a Twitter based corpus. In Proceedings of the 8th Workshop on
  Computational Approaches to Subjectivity, Sentiment and Social Media
  Analysis, pages 118–126.

Giulia Donato and Patrizia Paggio. 2018. Classifying the Informative Behaviour
  of Emoji in Microblogs. In Proc. of the 11th International Conference on
  Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12,
  2018. ELRA.

Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian
  Riedel. 2016. emoji2vec: Learning emoji representations from their
  description. arXiv preprint arXiv:1609.08359.

Anderson Jacob. 2018. Fully convolutional networks for text classification. In
  Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
  Proceedings of 6th Evaluation Campaign of Natural Language Processing and
  Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy.
  CEUR.org.

Nikola Ljubešić and Darja Fišer. 2016. A global analysis of emoji usage. In
  Proceedings of the 10th Web as Corpus Workshop, pages 82–89. Association for
  Computational Linguistics.

Siciliani Lucia and Girardi Daniela. 2018. The uniba system at the evalita
  2018 italian emoji prediction task. In Tommaso Caselli, Nicole Novielli,
  Viviana Patti, and Paolo Rosso, editors, Proceedings of 6th Evaluation
  Campaign of Natural Language Processing and Speech Tools for Italian. Final
  Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Bennici Mauro and Seijas Portocarrero Xileny. 2018. The validity of word
  vectors over the time for the evalita 2018 emoji prediction task (itamoji).
  In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso,
  editors, Proceedings of 6th Evaluation Campaign of Natural Language
  Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018),
  Turin, Italy. CEUR.org.

Hannah Miller, Jacob Thebault-Spieker, Shuo Chang, Isaac Johnson, Loren
  Terveen, and Brent Hecht. 2016. “Blissfully happy” or “ready to fight”:
  Varying interpretations of emoji. Proc. of ICWSM’16.

Johanna Monti, Federico Sangati, Francesca Chiusaroli, Martin Benjamin, and
  Sina Mansour. 2016. Emojitalianobot and emojiworldbot - new online tools and
  digital environments for translation into emoji. In Proc. CLiC-it 2016,
  Napoli, Italy, December 5-7, 2016, volume 1749 of CEUR Workshop Proceedings.

Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. 2015.
  Sentiment of emojis. PloS one, 10(12):e0144296.