Predicting Emoji Exploiting Multimodal Data: FBK Participation in the ITAmoji Task

Andrei Catalin Coman, Yaroslav Nechaev, Giacomo Zara
Fondazione Bruno Kessler
coman@fbk.eu, nechaev@fbk.eu, gzara@fbk.eu

Abstract

In this paper, we present the approach that won the ITAmoji task of the 2018 edition of the EVALITA evaluation campaign[1]. ITAmoji is a classification task: predict the most probable emoji (out of 25 classes) to go along with a target tweet written in Italian by a given person. We demonstrate that using only textual features is insufficient to achieve reasonable performance levels on this task, and we propose a system that benefits from the multimodal information contained in the training set, enabling significant F1 gains and earning us the first place in the final ranking.

1 Introduction

Particularly over the last few years, with the increasing presence of social networks and instant messaging services in our lives, we have been witnessing how common it has become for average users to enrich natural language with emojis. An emoji is essentially a symbol placed directly into the text, meant to convey a simple concept or, more specifically, as the name says, an emotion.

The emoji phenomenon has attracted considerable research interest. In particular, recent works have studied the connection between natural language and the emojis used in a specific piece of text. The 2018 edition of the EVALITA ITAmoji competition (Ronzano et al., 2018) is a prime example of such interest. In this competition, participants were asked to predict which of 25 emojis is to be used in a given Italian tweet, based on its text, its date and the user that has written it. Differently from the similar SemEval challenge (Barbieri et al., 2018), the addition of the user information significantly expanded the scope of potential solutions that could be devised.

In this paper, we describe our neural network-based system, which exhibited the best performance among the approaches submitted to this task. Our approach successfully exploits user information, such as the prior emoji usage history of a user, in conjunction with the textual features that are customary for this task. In our experiments, we found that using just the textual information from the tweet provides limited results: none of our text-based models were able to outperform a simple rule-based baseline built on the prior emoji history of the target user. However, by considering all the modalities of the input data that were made available to us, we were able to improve our results significantly. Specifically, we combine into a single efficient neural network the typical Bi-LSTM-based recurrent architecture, which has previously shown excellent performance on this task, with a multilayer perceptron applied to user-based features.

[1] EVALITA: http://evalita.it/2018

Figure 1: A diagram of the approach.

2 Description of the System

The ITAmoji task is a classification task: predict which of 25 emojis goes along with a given tweet.
The training set provided by the organizers of the competition consists of 250,000 Italian tweets, including for each tweet the text (without the target emoji), the user ID and the timestamp. Participants were explicitly forbidden to expand the training set. Figure 1 provides an overview of our approach. In this section, we describe in detail the methods we employed to solve the proposed task.

2.1 Textual features

In order to embed the textual content of the tweet, we apply a vectorization based on fastText (Bojanowski et al., 2017), a recent approach for learning unsupervised low-dimensional word representations. fastText sees a word as a collection of character n-grams, learning a representation for each n-gram. fastText follows the well-known distributional semantics hypothesis utilized in other approaches, such as LSA, word2vec and GloVe. In this work, we exploit the Italian embeddings trained on text from Wikipedia and Common Crawl[2] and made available by the fastText authors[3]. These embeddings include 300-dimensional vectors for each of the 2M words in the vocabulary. Additionally, we trained our own[4] embeddings using a corpus of 48M Italian tweets acquired from the Twitter Streaming API, which yielded 1.1M 100-dimensional word vectors. Finally, we also conducted experiments with the word vectors suggested by the task organizers (Barbieri et al., 2016).

2.2 User-based features

Rather than relying solely on the text of the target tweet, we exploit additional user-based features to improve performance. The task features many variations of the smiling face and three different heart emojis, making it impossible even for a human to determine the most suitable one based on the tweet alone. One of the features we considered was the prior emoji distribution of the target author. The hypothesis was that the choice of a particular emoji is driven mainly by personal user preferences, exemplified by previous emoji choices.

To this end, we collected two different types of emoji history for each user. Firstly, we used the labels in the training set to compute an emoji distribution for each user, yielding vectors of size 25. Users from the test set that were not present in the training set were initialized with zeroes. Secondly, we gathered the last 200 tweets of each user using the Twitter API[5], and then extracted and counted all emojis present in those tweets, yielding a sparse vector of size 1284. At this step we took extra care to prevent data leaks: if a tweet from the test set ended up among the collected 200 tweets, it was not counted in the user history. The runs that used the former, training set-based approach carry a "_tr" suffix in their name; the ones that used the full user history-based approach carry a "_ud" suffix.

In addition to the prior emoji distribution, we ran preliminary experiments with the user's social graph. The social graph, i.e., the graph of connections between users, has been shown to be an important feature for many tasks on social media, for example user profiling. We followed a recently proposed approach (Nechaev et al., 2018a) to acquire 300-dimensional dense user representations based on the social graph. This feature, however, did not improve the performance of our approach and was excluded.

[2] http://commoncrawl.org/
[3] https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
[4] https://doi.org/10.5281/zenodo.1467220
[5] https://developer.twitter.com
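As a concrete illustration, the training set-based emoji distribution (the "_tr" variant) can be sketched in a few lines of Python. The function and variable names below are ours, for illustration only, and are not taken from the released code:

```python
from collections import Counter, defaultdict

NUM_CLASSES = 25  # the 25 emoji labels of the task, indexed 0..24


def build_user_histories(rows):
    """Build a normalized 25-dim emoji distribution per user from
    training labels. `rows` is an iterable of (user_id, label) pairs."""
    counts = defaultdict(Counter)
    for user_id, label in rows:
        counts[user_id][label] += 1
    histories = {}
    for user_id, counter in counts.items():
        total = sum(counter.values())
        histories[user_id] = [counter[i] / total for i in range(NUM_CLASSES)]
    return histories


def history_for(histories, user_id):
    """Users that appear only in the test set fall back to all zeroes."""
    return histories.get(user_id, [0.0] * NUM_CLASSES)
```

The "_ud" variant is analogous, except that emojis are counted over the user's 200 most recent tweets, yielding the sparse vector of size 1284 mentioned above.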
Layer          Parameter           Value
Textual Input  seq. length         48
Embedding      input               256
               output              100
               trainable           true
               l2 regularization   10^-6
Dropout        probability         0.4
Bi-LSTM        output              512

(a) Bi-LSTM model hyperparameters.

Layer          Parameter           Value
History Input  input               1284
Dense          output              256
               l2 regularization   10^-5
               activation          tanh
Dense          output              256
               l2 regularization   10^-5
               activation          tanh

(b) User history model hyperparameters.

Layer          Parameter           Value
Concatenate    output              768
Dense          output              25
               l2 regularization   –
               activation          softmax
Optimizer      method              Adam
               learning rate       0.001
               β1                  0.9
               β2                  0.999
               decay               0.001

(c) Joint model and optimizer parameters.

Table 1: Model hyperparameters.
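To make the hyperparameters in Table 1 concrete, the layer stack can be assembled in Keras roughly as follows. This is a sketch rather than the authors' released code: the vocabulary size is an assumption (it is not listed in Table 1), and the Adam decay of 0.001 from Table 1c is noted but not wired in.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

SEQ_LEN = 48        # padded token sequence length (Table 1a)
VOCAB_SIZE = 20000  # assumption: vocabulary size is not listed in Table 1
EMB_DIM = 100       # matches the custom-100d embeddings

# Textual branch (Table 1a): embedding -> dropout -> Bi-LSTM
tokens = layers.Input(shape=(SEQ_LEN,), name="tokens")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM, trainable=True,
                     embeddings_regularizer=regularizers.l2(1e-6))(tokens)
x = layers.Dropout(0.4)(x)
x = layers.Bidirectional(layers.LSTM(256))(x)  # 2 x 256 = 512-dim output

# User history branch (Table 1b): two tanh dense layers
history = layers.Input(shape=(1284,), name="emoji_history")
h = layers.Dense(256, activation="tanh",
                 kernel_regularizer=regularizers.l2(1e-5))(history)
h = layers.Dense(256, activation="tanh",
                 kernel_regularizer=regularizers.l2(1e-5))(h)

# Joint model (Table 1c): 512 + 256 = 768, then softmax over the 25 emojis
merged = layers.Concatenate()([x, h])
out = layers.Dense(25, activation="softmax")(merged)

model = tf.keras.Model(inputs=[tokens, history], outputs=out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy")
```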
2.3 RNN exploiting textual features

Recurrent Neural Networks (RNNs) have turned out to be a powerful architecture when it comes to analyzing and making predictions on sequential data. In particular, over the last few years, different variations of the RNN have been the top-performing approaches for a wide variety of tasks, including tasks in Natural Language Processing (NLP). An RNN consumes the input sequence one element at a time, modifying its internal state along the way to capture relevant information from the sequence. When used for NLP tasks, an RNN is able to consider the entirety of the target sentence, capturing even the longest dependencies within the text. In our system, we use the bi-directional long short-term memory (Bi-LSTM) variation of the RNN. This variation uses two separate RNNs to traverse the input sequence in both directions (hence bi-directional) and employs LSTM cells.

The input text provided by the organizers is split into tokens using a modified version of the Keras tokenizer (available in our repository). The input tokens are then turned into word vectors of fixed dimensionality using the embedding matrix of one of the approaches listed in Section 2.1. The resulting sequence is padded with zeroes to a constant length, in our case 48, and fed into the neural network.

2.4 Overall implementation

In order to accommodate both textual and user-based features, we devise a joint architecture that takes both types of features as input and produces a probability distribution over the 25 target classes. The general logic of our approach is shown in Figure 1. The core consists of two main components:

• Bi-LSTM. The recurrent unit consumes the input sequence one vector at a time, modifying the hidden state (i.e., memory). After the whole sequence is consumed in both directions, the internal states of the two RNNs are concatenated and used as a tweet embedding. Additionally, we apply l2-regularization to the input embedding matrix and use dropout to prevent overfitting and fine-tune the performance. Table 1a details the hyperparameters we used for the textual part of our approach.

• User-based features. The emoji distribution we collected (as described in Section 2.2) is fed to a multilayer perceptron: two densely-connected layers with tanh activations and l2-regularization to prevent overfitting. Table 1b showcases the chosen hyperparameters for this component when using the full user history as input.

The outputs of the two components are then concatenated, and a final layer with softmax activation is applied to obtain the probability distribution over the 25 emoji labels. The network is optimized jointly with cross-entropy as the objective function, using the Adam optimizer. Table 1c lists all relevant hyperparameters for this step.

Since the runs are evaluated based on macro-F1, in order to optimize our approach for this metric we introduced class weights into the objective function. Each class i is associated with the weight

    w_i = (max_j(N_j) / N_i)^α    (1)

where N_i is the number of samples in class i and α = 1.1 is a hyperparameter we tuned for this task. This way the optimizer assigns a greater penalty to mistakes in rare classes, thus optimizing for the target metric.
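Equation (1) translates directly into code; a minimal sketch (the names are ours, for illustration):

```python
def class_weights(class_counts, alpha=1.1):
    """Weight for class i per Eq. (1): w_i = (max_j N_j / N_i) ** alpha.
    `class_counts` maps a class index to its number of training samples."""
    n_max = max(class_counts.values())
    return {i: (n_max / n) ** alpha for i, n in class_counts.items()}


# With alpha > 1, the rarest class receives the largest weight,
# pushing the optimizer to pay more attention to it.
weights = class_weights({0: 5069, 1: 4966, 2: 279})
```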
During the training of our approach, we employ an early stopping criterion to halt the training once the performance on the validation set stops improving. In order to properly evaluate our system, we employ 10-fold cross-validation, additionally extracting a small validation set from the training set of each fold to perform the early stopping. For the final submission we use a simple ensemble mechanism, where predictions are acquired independently from each fold and then averaged to produce the final submission. Additionally, one of the runs was submitted using predictions from a single, randomly chosen fold. Runs exploiting the ensemble approach have the "_10f" suffix, while runs using just one fold have the "_1f" suffix.

The code used to preprocess data, train and evaluate our approach is available on GitHub[6].

3 Evaluation setting

In this section, we provide details on some of the approaches we tested during the development of our system, as well as on the models we submitted for the official evaluation. In this paper, we report results for the following models:

• MF_HISTORY. A rule-based baseline that always outputs the most frequent emoji from the user history computed on the training set.

• BASE_CNN. A basic Convolutional Neural Network (CNN) taking word embeddings as input, without any user-based features.

• BASE_LSTM. The Bi-LSTM model described in Section 2.3, using textual features only.

• BASE_LSTM_TR. The complete approach including both feature families, with the emoji distribution coming from the training set.

• BASE_LSTM_UD. The complete approach, with the emoji distribution coming from the most recent 200 tweets of each user.

For the other models tested during our local evaluation and the complete experimental results, please refer to our GitHub repository.

Additionally, for the BASE_LSTM approach we report performance variations due to the choice of word embeddings. In particular, provided refers to the embeddings suggested by the organizers, custom-100d indicates our own fastText-based embeddings, and common-300d refers to the ones available on the fastText website. Table 2 details the performance of the mentioned models.

Finally, we submitted three of our best models for the official evaluation. All of the submitted runs use the Bi-LSTM approach with our custom-100d word embeddings, along with some variation of the user emoji distribution as detailed in Section 2.2. Two of the runs use the ensembling trick over all available cross-validation folds, while the remaining one ("_1f") uses predictions from just one fold.

4 Results

Here we report the performance of the models benchmarked both during our local evaluation (Table 2) and in the official evaluation (Table 3). We started with text-only models, testing different architectures and embedding combinations. Among those, the Bi-LSTM architecture was a clear choice, providing a 1-2% F1 improvement over the CNN, which led us to abandon the CNN-based models. Among the three word embedding models we evaluated, our custom-100d embeddings exhibited the best performance with the Bi-LSTM, while common-300d performed best with the CNN architecture.

After acquiring the user emoji distributions, we devised a simple baseline (MF_HISTORY), which, to our surprise, outperformed all the text-based models we had tested so far: a 3% F1 improvement compared to the best Bi-LSTM model. When we introduced the user emoji histories into our approach, we obtained a significant performance gain: 4% when using the scarce training set data and 12% when using the complete user history of 1284 emojis from recent tweets. During the final days of the competition, we tried to exploit other user-based features to further bolster our results, for example the social graph of a user. Unfortunately, such experiments did not yield performance gains before the deadline.

During the official evaluation, the complete user history-based runs exhibited top performance, with the ensembling trick actually decreasing the final F1.
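The fold-averaging ensemble behind the "_10f" runs (Section 2.4) amounts to a mean of the per-fold probability distributions followed by an argmax; a sketch with illustrative names:

```python
import numpy as np


def ensemble_predict(fold_probs):
    """Average the (n_tweets, n_classes) probability matrices produced
    independently by the cross-validation folds, then pick the top
    emoji per tweet."""
    mean_probs = np.mean(np.stack(fold_probs), axis=0)
    return mean_probs.argmax(axis=1)
```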
As we expected from our experiments, the training set-based emoji distribution was much less performant, but it still offered a significant improvement over the runner-up team (gw2017_p.list), as shown in Table 3. Additionally, we detail the performance of our best submission (BASE_UD_1F) for each individual emoji in Table 4 and Figure 2.

[6] GitHub repository: https://github.com/Remper/emojinet

Approach        Embedding     Accuracy  Precision  Recall  F1 macro
MF_HISTORY      –             0.4396    0.4076     0.2774  0.3133
BASE_CNN        common-300d   0.4351    0.3489     0.2464  0.2673
BASE_LSTM       common-300d   0.4053    0.3167     0.2534  0.2707
BASE_LSTM       provided      0.4415    0.3836     0.2408  0.2622
BASE_LSTM       custom-100d   0.4443    0.3666     0.2586  0.2809
BASE_LSTM_TR    custom-100d   0.4874    0.4343     0.3218  0.3565
BASE_LSTM_UD    custom-100d   0.5498    0.4872     0.4097  0.4397

Table 2: Performance of the approaches as tested locally by us.

Run            Accuracy@5  Accuracy@10  Accuracy@15  Accuracy@20  F1 macro
BASE_UD_1F     0.8167      0.9214       0.9685       0.9909       0.3653
BASE_UD_10F    0.8152      0.9194       0.9681       0.9917       0.3563
BASE_TR_10F    0.7453      0.8750       0.9434       0.9800       0.2920
gw2017_p.list  0.6718      0.8148       0.8941       0.9299       0.2329

Table 3: Official evaluation results for our three submitted runs and the runner-up model.

Precision  Recall  F1      Support
0.7991     0.6490  0.7163  5069
0.4765     0.7116  0.5708  4966
0.6402     0.4337  0.5171  279
0.5493     0.4315  0.4834  387
0.4937     0.4453  0.4683  265
0.7254     0.3229  0.4469  319
0.3576     0.5370  0.4293  2363
0.4236     0.4089  0.4161  834
0.4090     0.3775  0.3926  506
0.4034     0.3354  0.3663  1282
0.4250     0.3299  0.3715  885
0.3743     0.3184  0.3441  1338
0.3684     0.3239  0.3447  1028
0.3854     0.2782  0.3231  266
0.3844     0.2711  0.3179  546
0.3899     0.2648  0.3154  642
0.3536     0.2743  0.3089  700
0.3835     0.2566  0.3075  417
0.3525     0.1922  0.2488  541
0.2866     0.2639  0.2748  341
0.2280     0.2922  0.2562  373
0.2751     0.2133  0.2403  347
0.2845     0.1741  0.2160  379
0.3154     0.1822  0.2310  483
0.2956     0.1824  0.2256  444

Table 4: Precision, Recall, F1 of our best submission and the number of samples in the test set for each emoji.

Figure 2: Confusion matrix for our best submission, normalized by support size: each value in a row is divided by the row marginal. Diagonal values give the recall of each individual class (see Table 4).

5 Discussion and Conclusions

Our findings suggest that emojis are currently used mostly according to user preferences: the more prior user history we added, the larger the performance boost we observed. Therefore, the emojis in a text cannot be considered independently from the person that has used them, and textual features alone cannot yield a sufficiently performant approach for predicting emojis. Additionally, we have shown that the task is sensitive to the choice of neural architecture as well as to the choice of the word embeddings used to represent the text.

An analogous task was proposed to the participants of the SemEval 2018 competition. The winners of that edition applied an SVM-based approach for the classification (Çöltekin and Rama, 2018). Instead, we opted for a neural network-based architecture that allowed us greater flexibility to experiment with features coming from different modalities: the text of the tweet, represented using word embeddings, and the sparse user-based history. During our experiments with the SemEval 2018 task as part of the NL4AI workshop (Coman et al., 2018), we found a CNN-based architecture to perform better, while here the RNN was a clear winner. Such a discrepancy might suggest that, even within the emoji prediction task, the effectiveness of different approaches may vary significantly based either on the language of the tweets or on the way the dataset was constructed.

In the future, we would like to investigate this topic further by studying differences in emoji usage between languages and communities.
Additionally, we aim to further improve our approach by identifying more user-based features, for example by taking into account the feature families suggested by Nechaev et al. (2018b).

References

Francesco Barbieri, German Kruszewski, Francesco Ronzano, and Horacio Saggion. 2016. How cosmopolitan are emojis? Exploring emojis usage and meaning over different languages with distributional semantics. In Proceedings of the 2016 ACM on Multimedia Conference, pages 531–535. ACM.

Francesco Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa-Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio Saggion. 2018. SemEval-2018 Task 2: Multilingual Emoji Prediction. In Proc. of the 12th Int. Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, United States. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Çağrı Çöltekin and Taraka Rama. 2018. Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in emoji prediction. In Proc. of the 12th Int. Workshop on Semantic Evaluation, SemEval@NAACL-HLT, New Orleans, Louisiana, pages 34–38.

Andrei Catalin Coman, Giacomo Zara, Yaroslav Nechaev, Gianni Barlacchi, and Alessandro Moschitti. 2018. Exploiting deep neural networks for tweet-based emoji prediction. In Proc. of the 2nd Workshop on Natural Language for Artificial Intelligence, co-located with the 17th Int. Conf. of the Italian Association for Artificial Intelligence (AI*IA 2018), Trento, Italy.

Yaroslav Nechaev, Francesco Corcoglioniti, and Claudio Giuliano. 2018a. SocialLink: Exploiting graph embeddings to link DBpedia entities to Twitter profiles. Progress in AI, 7(4):251–272.

Yaroslav Nechaev, Francesco Corcoglioniti, and Claudio Giuliano. 2018b. Type prediction combining linked open data and social media. In Proc. of the 27th ACM Int. Conf. on Information and Knowledge Management, CIKM 2018, Torino, Italy, pages 1033–1042.

Francesco Ronzano, Francesco Barbieri, Endang Wahyu Pamungkas, Viviana Patti, and Francesca Chiusaroli. 2018. Overview of the EVALITA 2018 Italian Emoji Prediction (ITAmoji) task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.