=Paper=
{{Paper
|id=Vol-2130/paper2
|storemode=property
|title=Multi-task Emoji Learning
|pdfUrl=https://ceur-ws.org/Vol-2130/paper2.pdf
|volume=Vol-2130
|authors=Francesco Barbieri,Luís Marujo,Pradeep Karuturi,William Brendel
}}
==Multi-task Emoji Learning==
Francesco Barbieri♣, Luís Marujo♥, Pradeep Karuturi♥, William Brendel♥
♣ Large Scale Text Understanding Systems Lab, TALN, UPF, Barcelona, Spain
♥ Snap Inc. Research, Venice, California, USA
♣ {name.surname}@upf.edu, ♥ {name.surname}@snap.com
Abstract

Emojis are very common in social media and understanding their underlying semantics is of great interest from a Natural Language Processing point of view. In this work, we investigate emoji prediction in short text messages using a multi-task pipeline that simultaneously predicts emojis, their categories and sub-categories. The categories are either manually predefined in the Unicode standard or automatically obtained by clustering over word embeddings. We show that using this categorical information adds meaningful information, thus improving the performance of the emoji prediction task. We systematically analyze the performance of the emoji prediction task by varying the number of training samples, and we also perform a qualitative analysis using the attention weights of the prediction model.

1 Introduction

Emojis are a popular set of ideograms created in the late 1990s to enrich written communication by adding nonverbal expressive power to digital communication. These symbols can be used by human readers to convey emotions and information in a condensed form. As Snapchat, Twitter and other social media platforms have become popular, so has the usage of emojis. Despite their popularity, there is very little research work on predicting emojis.

Over the past few years, interest in emoji research has increased and several studies have been published in the areas of distributional semantics [BRS16, ERA+16, WBSD17b, WBSD17a, BCC18], sentiment analysis [NSSM15, HGS+17, KK17, RPG+18] and multimodal systems [CMS15, CSG+18, BBRS18]. In the past year researchers have also focused on the possibility of predicting emojis in a text message [BBS17, FMS+17]. The emoji prediction task consists in predicting the original emoji present in a tweet (or snap caption) given only the non-emoji textual content. Prior explorations of emoji prediction tended to focus on less than 2% of the 2,653 emojis of the Unicode 6 standard [1]. Another limitation of those papers is that emoji prediction can be ambiguous: when the model predicts a given emoji as the correct label, several semantically close emojis may also be valid predictions.

In this work, we extend the emoji prediction task to 300 emojis in order to study a larger number of emojis along with their Unicode standard categories, sub-categories, and the new semantic clusters that we created. We are not aware of any previous research work focused on either predicting a large number of emojis (300) or using a multi-task approach to predict emojis or emoji categories. We also carry out a systematic analysis of how the number of training samples affects the performance of the emoji prediction task. To mitigate the problem of emoji ambiguity, we concentrate on broad emoji category prediction in addition to individual emoji prediction. We grouped emojis in two different ways. The first grouping is defined by the Unicode consortium [2], which groups emojis into seven categories (e.g., "Smileys & People", "Nature") and 74 sub-categories (e.g., "face-positive", "face-negative"). The main categories are commonly found on mobile phone keyboards, as shown in Figure 1. Alternatively, we also created semantic clusters using embeddings.

Copyright (c) 2018 held by the author(s). Copying permitted for private and academic purposes.
In: S. Wijeratne, E. Kiciman, H. Saggion, A. Sheth (eds.): Proceedings of the 1st International Workshop on Emoji Understanding and Applications in Social Media (Emoji2018), Stanford, CA, USA, 25-JUN-2018, published at http://ceur-ws.org

[1] www.unicode.org/emoji/charts/full-emoji-list.html
[2] www.unicode.org/emoji/charts/emoji-ordering.html
Figure 1: Screenshot of Apple's emoji keyboard, including a subset of the Smileys & People category emojis, and buttons to access the remaining categories.

We use a multi-task approach to combine the tasks of emoji and category prediction. Multi-task approaches [Car, Car97, Ben12, CW08] improve generalization by transferring information across different tasks and improving each task individually. In particular, multi-task learning with simultaneous training on multiple tasks has demonstrated promising results [CW08, FMS+17, Ben12].

Our work performs multi-task learning by training a single model with multiple outputs (the dataset is annotated with multiple labels), and we evaluate using our gold standard created from Twitter and Snap public posts, as described in the Datasets section. The subjectivity of emoji interpretation makes emoji prediction a very challenging task. Nevertheless, our work shows that simultaneously predicting emojis, their categories, and sub-categories in a multi-task framework improves the overall results. It not only improves emoji prediction, but it also helps with the identification of emoji categories, which can be particularly relevant when the emoji prediction model is less precise.

The remainder of this work is organized in the following way: the next section describes the datasets used in our experiments. We then present the deep learning models explored to solve our research problem. Finally, we discuss the experiments and results, and conclude with future research directions.

2 Datasets

In this study we explore emoji prediction for two different datasets: Twitter and Snapchat captions. We select documents (tweets and snaps) that contain a single emoji and at least three tokens apart from the emoji. We restrict to documents containing a single emoji so as to minimize the interference of other emojis in the emoji prediction task. We also consider only the documents that include the most frequent 300 emojis in each dataset. We restrict to the top 300 emojis only due to the lack of a meaningful number of examples beyond that. A subset of the most frequent emojis for each dataset is reported in Table 1. Note that we remove skin color modifiers from the emojis [3] to avoid generating very similar labels.

Table 2 includes statistics on the two datasets. We can see that Snap captions are shorter than tweets, while average word length is similar. Another important difference between the two datasets is the set of most frequent emojis used. Table 1 shows the 60 most frequent emojis in each dataset (Twitter on top, Snap on the bottom), along with the number of documents that include each emoji. In both datasets the frequency is very unbalanced: 16% of tweets and 25% of snaps include one of the three most frequent emojis. Therefore we use a balanced dataset in our experiments, in order to give the same importance to each emoji, independent of its frequency of usage. We sub-sample the most frequent emojis in order to match the number of examples of the least represented emoji (1,500 examples for Twitter and 3,000 for Snap data). We show that using fewer than 1,500 examples per emoji leads to a drastic decrease in accuracy of emoji detection (see Figure 3). We focus our experiments on 300 emojis because we do not have more than 1,500 tweets per emoji beyond the top 300 emojis in our Twitter dataset. For our experiments we randomly chose 80% of the documents for training, 10% for validation and 10% for testing.

2.1 Twitter Dataset

The Twitter dataset contains 50 million tweets retrieved using the Twitter API. Tweets were posted between January 2016 and April 2017 and were geo-localized in the United States. We removed hyperlinks from each tweet and lowercased all textual content in order to reduce noise and sparsity. Since Twitter data includes a large percentage of bot data, we filter noise as much as possible, removing repeated tweets (or very similar ones) and selecting a maximum of five tweets per user. From this dataset, we selected tweets including any one of the 300 most frequently occurring emojis and at least three tokens (without the emoji), resulting in a final dataset composed of 2,862,505 tweets.

2.2 SnapCaption

SnapCaption is an in-house Snapchat dataset containing only Snapchat captions. A caption is the textual overlay component of a snap. These captions were collected exclusively from snaps submitted to public and crowd-sourced stories (also known as Live Stories or Our Stories). Examples of such public crowd-sourced stories are "New York Story" or "Thanksgiving Story". All captions were posted within a one-year period and do not contain any image or other associated information. This dataset contains 30,004,519 captions.

[3] E.g., the skin-tone variants of an emoji are mapped to one single label.
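As a rough illustration of the balancing step described above, the snippet below caps every emoji class at a fixed number of documents and then draws the 80/10/10 split. This is our own sketch, not the authors' preprocessing code; the function name and the `docs` format are assumptions.

```python
import random
from collections import defaultdict

def balance_and_split(docs, cap, seed=13):
    """docs: list of (text, emoji_label) pairs, one emoji per document.
    cap: e.g. 1500 for the Twitter data, 3000 for the Snap data."""
    random.seed(seed)
    by_label = defaultdict(list)
    for text, label in docs:
        by_label[label].append((text, label))
    balanced = []
    for items in by_label.values():
        random.shuffle(items)
        balanced.extend(items[:cap])      # sub-sample the over-represented emojis
    random.shuffle(balanced)
    n = len(balanced)
    # 80% train, 10% validation, 10% test
    return (balanced[:int(0.8 * n)],
            balanced[int(0.8 * n):int(0.9 * n)],
            balanced[int(0.9 * n):])
```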
Table 1: 60 most frequent emojis for the Twitter (top) and Snap (bottom) datasets, with the number of documents that include each emoji.
Twitter: 235 112 111 63 52 50 45 41 38 36 35 33 33 31 31 31 29 29 28 28 27 27 27 27 26 24 24 23 23 23
23 22 20 19 19 18 18 18 17 17 16 16 16 16 16 15 15 15 15 15 15 14 14 14 14 13 13 13 13 12
Snap: 3343 2425 1734 645 617 578 507 433 418 395 391 380 375 364 356 321 315 313 278 273 267 266 234 226 226 225 222 219 212 197
194 194 191 188 187 186 186 181 174 170 169 167 161 157 154 153 153 150 148 147 147 145 139 136 136 134 134 133 124 123

Table 2: Average, standard deviation and median length, in words and characters, of the two datasets.
Dataset   Words: Avg.  Std.  Median   Chars: Avg.  Std.  Median
Twitter   12.73        4.23  12       91.14        24.67 92
Snap      5.01         2.29  4        25.62        11.75 23

2.3 Categories and Clusters of Emojis

We also consider broader classes of emojis, such as Unicode categories and semantic clusters. The Unicode consortium defines a set of 7 categories and 74 sub-categories.

The problem with Unicode categories and sub-categories is that they fail to accurately capture semantically related emojis. Some emojis are both in the sub-category "neutral faces" even though they clearly indicate different emotions. Other emojis are semantically similar but appear in different categories ("Smiling Faces" and "Emotions"), even though they convey nearly identical meanings. To overcome this limitation, we propose a second approach that automatically organizes emojis by clustering them using pre-trained word embeddings, similar to emoji2vec [ERA+16]. These clusters have the advantage of better capturing the semantic information of emojis: semantically similar emojis end up in the same cluster. These clusters are an important aspect to consider because they are based on how emojis co-occur in short text messages from tweets and captions of public snaps. We pretrained two different sets of skip-gram embeddings [MLS13] for Twitter and Snap. The first skip-gram model was trained on a dataset of about 70 million tweets, and the second skip-gram model was trained on about 100 million Snap captions.

Using the embeddings of the 300 most frequent emojis of each dataset, we created two sets of 30 clusters using the k-means algorithm. The number of clusters was defined based on qualitative analysis (clusters that seemed to better organize emojis by semantics). In addition, the number of clusters was selected such that each cluster has a number of emojis similar to what is usually displayed on a mobile keyboard. As a result, we would be able to provide an icon to directly access each cluster, in a similar way as Figure 1 shows for the top categories. The resulting clusters group semantically similar emojis (as in [BKRS16], where 11 clusters are created for 100 emojis), grouping love, sad faces, hands/gestures, animals, food, drinks, parties, Christmas, and so on.
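To make the cluster construction concrete, the following is a minimal sketch (ours, not the authors' code) of running k-means with 30 clusters over pre-trained emoji embeddings; `emoji_vectors`, a mapping from an emoji to its skip-gram vector, is an assumed placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_emojis(emoji_vectors, n_clusters=30, seed=0):
    """emoji_vectors: dict mapping each of the 300 most frequent emojis
    to its skip-gram embedding (numpy array)."""
    emojis = sorted(emoji_vectors)
    matrix = np.vstack([emoji_vectors[e] for e in emojis])
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(matrix)
    clusters = {}
    for emoji, cluster_id in zip(emojis, labels):
        clusters.setdefault(int(cluster_id), []).append(emoji)
    return clusters  # cluster id -> list of semantically related emojis
```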
3 Models

Our main architecture, illustrated in Figure 2, starts with our character and word embedding modules, whose outputs are fused by our feature attention unit and the word attention unit. Finally, the fully connected layers and the softmax play the role of the final multi-task classifier.

Previous approaches [BBS17, FMS+17] have successfully learned LSTM models for emoji prediction tasks. We experimented with different plain LSTMs, stacked LSTMs [FMS+17], and different word representations before settling on our final model architecture (Figure 2). In addition, we explored single-task models and multi-task models. In the case of the multi-task models, the entire network is shared and the specialization only occurs at the final stage to predict the specific labels of each task. This specialization is accomplished through task-specific linear transformations. Finally, we used a cross-entropy loss function for all classification tasks. In the case of multi-task learning, the final loss is the sum of each single loss [4]. In the following subsections, we detail each stage of our main architecture.

[4] We also experimented with a weighted sum, with various weights, but the best results are obtained with a simple sum of the losses.
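As an illustration of this shared-encoder, task-specific-head setup and of the summed cross-entropy loss, here is a minimal PyTorch sketch. The paper does not publish code, so class names, layer sizes, and the head set are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Task-specific linear transformations on top of a shared sentence encoder."""
    def __init__(self, sentence_dim, n_emojis=300, n_categories=7,
                 n_subcategories=74, n_clusters=30):
        super().__init__()
        self.heads = nn.ModuleDict({
            "emoji": nn.Linear(sentence_dim, n_emojis),
            "category": nn.Linear(sentence_dim, n_categories),
            "subcategory": nn.Linear(sentence_dim, n_subcategories),
            "cluster": nn.Linear(sentence_dim, n_clusters),
        })
        self.loss = nn.CrossEntropyLoss()

    def forward(self, sentence_vec, gold):
        # sentence_vec: (batch, sentence_dim) from the shared encoder
        # gold: dict mapping a task name to its (batch,) gold label tensor;
        # only the tasks present in `gold` contribute to the loss.
        logits = {task: head(sentence_vec) for task, head in self.heads.items()}
        total_loss = sum(self.loss(logits[t], gold[t]) for t in gold)  # simple sum of losses
        return logits, total_loss
```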
3.1 Word Representation

The word embeddings are learned together with the updates to the model. For out-of-vocabulary words (OOVWs), we use a fixed representation that is handled as a separate word. In order to train the fixed representation for OOVWs, we stochastically replace (with p = 0.5) each word that occurs only once in the training data. When we use pre-trained word embeddings, they are concatenated with the learned vector.

Figure 2: Final architecture of our model.

3.2 Char Representation

In addition, we use a character-based embedding [LLM+15, ?] stacked with a B-LSTM [GS05], producing a character-based word embedding that focuses on word spelling variants. Indeed, the character-based word embedding learned similar representations for words that are orthographically similar, and is thus expected to handle the different alternatives of the same word types that normally occur in social media.

3.3 Bi-directional LSTMs

Our bi-directional LSTM modules [GS05], named B-LSTM in Figure 2, consist of a forward LSTM that processes an input message from left to right, while the backward LSTM processes it in the reverse direction. As a result, the message representation s is based on both the forward and backward LSTM encodings:

s = max{0, W [h_fw ; h_bw] + d}

where W is a learned parameter matrix, h_fw is the forward LSTM encoding of the message, h_bw is the backward LSTM encoding of the message, and d is a bias term; we use a component-wise ReLU as the non-linear unit. We use B-LSTM modules for both word and sentence representations, namely the Char B-LSTM and the Words B-LSTMs in our architecture (Figure 2). The Char B-LSTM takes a sequence of characters and outputs a word embedding vector. This output is mixed with another word representation via our feature attention module. Then, the stacked Words B-LSTMs receive sequences of word representations from the attention module, and output sentence embedding vectors.
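The sketch below shows one way to realize the Char B-LSTM and the Words B-LSTM of Sections 3.2-3.3, including the message representation s = max{0, W[h_fw; h_bw] + d}. The class name and all dimensions are our illustrative choices, not the paper's exact hyper-parameters; the char- and word-level vectors are fused by the feature attention of Section 3.4, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    def __init__(self, n_chars, n_words, char_dim=50, word_dim=100, hidden=150, out_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, word_dim // 2, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.word_lstm = nn.LSTM(word_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)    # parameter matrix W and bias d

    def char_word_vector(self, char_ids):
        # char_ids: (n_words, max_chars) -> one spelling-aware vector per word
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        return torch.cat([h[0], h[1]], dim=-1)        # forward and backward final states

    def forward(self, word_ids):
        # word_ids: (batch, n_words); only the word branch is shown here
        _, (h, _) = self.word_lstm(self.word_emb(word_ids))
        s = torch.relu(self.proj(torch.cat([h[0], h[1]], dim=-1)))
        return s                                      # message representation s
```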
3.4 Feature Attention

The feature attention module aims to linearly fuse multiple input signals instead of simply concatenating them. In our architecture, this module learns a unified word representation space, i.e., it produces a single vector representation with aggregated knowledge from our multiple input word representations, based on their weighted importance. We can motivate this module with the following observations.

Prior work [BBS17] combines the word representation x^(w) and the character-level representation x^(c) by simply concatenating the word and character embeddings at each LSTM decoding step: h_t = LSTM([x_t^(w); x_t^(c)]). However, this naive concatenation results in inaccurate decoding, specifically for unknown word token embeddings, e.g., an all-zero vector x_t^(w) = 0 or a random vector x_t^(w) ~ U(−σ, +σ), or even for out-of-vocabulary words. While this concatenation approach does not cause significant errors for well-formatted text, we observe that it induces performance degradation for our social media post datasets, which contain a significant number of slang words, i.e., misspelled or out-of-vocabulary words. As a result, we use a feature attention module that adaptively emphasizes each feature representation in a global manner at each decoding step t. This process produces a soft-attended context vector x_t as an input token for the next stacked B-LSTM, which takes care of the sentence embedding. [RCP16] introduced a similar approach, where the character embeddings are weighted with an attention module. We use the following method:

[a_t^(w), a_t^(c)] = σ( W_m · [x_t^(w) ; x_t^(c)] + b_m )
α_t^(m) = exp(a_t^(m)) / Σ_{m' ∈ {w,c}} exp(a_t^(m'))   for all m ∈ {w, c}     (1)

where α_t = [α_t^(w); α_t^(c)] ∈ R^2 is an attention vector at each decoding step t, and x_t is the final context vector at t that maximizes the information gain for x_t. Note that this feature attention requires each feature representation to have the same dimension (e.g., x_t^(w), x_t^(c) ∈ R^p), and that the transformation via W_m essentially enforces each feature representation to be mapped into the same unified subspace, with the output of the transformation encoding weighted discriminative features for the classification of emojis.
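A minimal PyTorch sketch of Eq. (1), assuming both representations share the same dimension p and that the soft-attended context vector is the attention-weighted sum of the two representations (the weighted-sum fusion is our reading of the text, and the names are ours):

```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    def __init__(self, p):
        super().__init__()
        self.score = nn.Linear(2 * p, 2)   # W_m and b_m, one score per representation

    def forward(self, x_w, x_c):
        # x_w, x_c: (batch, p) word-level and char-level vectors for the same token
        a = torch.sigmoid(self.score(torch.cat([x_w, x_c], dim=-1)))  # (batch, 2)
        alpha = torch.softmax(a, dim=-1)                              # attention weights
        x = alpha[:, 0:1] * x_w + alpha[:, 1:2] * x_c                 # fused context vector
        return x, alpha
```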
3.5 Word Attention

Not all the words have the same importance in the representation of a document. We use the attention mechanism introduced in [YYD+16]:

u_it = tanh(W_w h_it + b_w)
α_it = exp(u_it^T u_w) / Σ_t exp(u_it^T u_w),    d_i = Σ_t α_it h_it     (2)

where the final document representation d_i is a weighted average of the hidden representations h_it of the LSTM. The weights α_it are learned by the use of a Multi-Layer Perceptron (linear transformation W_w and bias b_w) with tanh as the non-linear operation, and a softmax to compute the probability of each word.
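A short PyTorch sketch of Eq. (2), following the hierarchical-attention formulation of [YYD+16]: each hidden state is passed through a one-layer MLP with tanh, scored against a learned context vector u_w, normalized with a softmax over the words, and the document vector is the weighted average of the hidden states. Names and initialization are our illustrative choices.

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.mlp = nn.Linear(hidden_dim, hidden_dim)            # W_w and b_w
        self.context = nn.Parameter(torch.randn(hidden_dim))    # u_w

    def forward(self, h):
        # h: (batch, n_words, hidden_dim) hidden states of the word B-LSTM
        u = torch.tanh(self.mlp(h))                  # (batch, n_words, hidden_dim)
        scores = u.matmul(self.context)              # (batch, n_words)
        alpha = torch.softmax(scores, dim=1)         # attention weight per word
        d = (alpha.unsqueeze(-1) * h).sum(dim=1)     # (batch, hidden_dim) document vector
        return d, alpha
```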
4 Experiments and Results

We use two main variations for the experiments: Single-Task Prediction of emojis, Unicode categories, and emoji clusters, and Multi-Task Prediction, where we combine the single tasks in one single model. We also evaluate the impact of our different modules, including the combination of word/char LSTMs and the word attention unit. Finally, we investigate the influence of the number of layers of the LSTMs.

4.1 Single-Task Prediction

We explore three different tasks: (i) the emoji prediction task proposed by [BBS17]; (ii) prediction of Unicode emoji categories (whether the emoji in the text belongs to faces, animals, objects, etc.) and sub-categories (positive faces, animal-mammal, etc.); and (iii) prediction of the automatic clusters that we previously generated using pre-trained word embeddings.

4.1.1 Predicting Emojis

Given a set of documents, each document containing only one emoji class, the task consists of predicting the emoji from the text. For this task, we tested the influence of the number of emoji classes and the number of examples per class. More precisely, for each experiment we extract a balanced dataset of N_class emoji classes and N_data examples per class, with N_class = {20, 50, 100, 200, 300} and N_data = {100, 500, 1000, 1500, 2000, 2500, 3000}. N_class and N_data are tested independently: when we vary N_class, we fix N_data to 3000, and when we vary N_data we fix N_class to 300. Figure 3 shows our experiments with the Snapchat dataset. It is clear that using more examples per class improves our model, by around 1 absolute point from 1,500 to 3,000 examples. For more than 2,000 examples the system converges to its optimum.

Figure 3: Accuracy at top 5 of the same algorithms with a variable number of training instances per class (from 100 to 3,000 examples for each emoji) on SnapCaptions; the compared models are LSTM, LSTM + Att., 2 LSTM + Att., Char + LSTM, Char + LSTM + Att., and Char + 2 LSTM + Att. The test and validation sets are fixed for each experiment.

From Figure 4, we observe that Twitter data is easier to model than Snap data. In the 300 emoji prediction task the best accuracy at top 5 (a@5) on Twitter data is 40.05%, while on Snap data it is 34.25% (see Table 3). There are several reasons that could explain this difference in results. One reason is the length of the text messages: Twitter messages contain on average twelve words, while Snap messages contain only five (see Table 2). Another reason could be the missing visual context of Snap posts [5], while only a small percentage of tweets is complemented with visual content. For this reason, tweets typically contain less semantic ambiguity.

Table 3 highlights the best performing systems on the emoji prediction task. For both datasets, state-of-the-art systems are outperformed by the combination of additional components. For example, adding a word attention module improves the baseline of [BBS17]. Finally, there is an important difference when predicting 20 and 300 emojis. We plot on the left axis of Figure 4 the accuracy of the same model architecture (Char + 2 LSTM + word attention) on the emoji prediction task for different numbers of emojis (20 to 300). The best accuracy at top 5 (a@5) drops from 20 to 100 classes, and then remains constant. We observe the same drop using F1 (which only considers whether an emoji is predicted as the first option); however, having more than 100 classes results in an improvement. This is probably due to the type of the rarer emoji classes added after the most frequent ones, which are more specific and hence easier to predict.

[5] Snap text messages are captions of images and videos posted by Snapchat users; see the Datasets section.
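The a@5 and macro F1 numbers referenced above (and in Tables 3 and 4) can be computed as follows. This is a generic sketch of the two metrics, not the authors' evaluation script; `scores` is assumed to be the matrix of softmax outputs for the test documents.

```python
import numpy as np
from sklearn.metrics import f1_score

def accuracy_at_5(scores, gold):
    """scores: (n_examples, n_labels) model outputs; gold: array of gold label ids."""
    top5 = np.argsort(-scores, axis=1)[:, :5]          # five highest-scoring labels per example
    return float(np.mean([g in row for g, row in zip(gold, top5)]))

def macro_f1(scores, gold):
    """Macro F1 over the top-1 predictions, as used in the tables."""
    predicted = scores.argmax(axis=1)
    return f1_score(gold, predicted, average="macro")
```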
Table 3: Emoji prediction results for different numbers of emojis (20 to 300) and different models. We use the original implementation of [FMS+17], while we implemented [BBS17] ourselves.
Number of emojis:               20          50          100         200         300
Dataset Models                  F1   a@5    F1   a@5    F1   a@5    F1   a@5    F1   a@5
Twitter LSTM 25.32 62.17 19.51 46.69 18.03 40.49 19.44 39.11 19.86 38.44
LSTM + Att. 25.51 62.82 19.49 47.21 18.3 40.56 19.61 39.10 19.38 38.41
2 LSTM + Att. [FMS+ 17] 24.47 62.38 19.64 47.01 18.36 40.60 19.60 39.09 20.09 38.70
Char + LSTM [BBS17] 26.81 63.37 20.21 47.83 18.87 41.41 20.49 40.14 21.27 40.06
Char + LSTM + Att. 27.37 64.33 20.91 48.23 18.88 41.8 21.19 40.65 21.59 40.06
Char + 2 LSTM + Att. 26.85 64.12 20.36 48.51 18.82 41.92 20.66 40.51 20.53 39.22
Snap LSTM 25.46 53.51 18.62 43.96 15.06 34.72 16.08 32.96 17.57 32.44
LSTM + Att. 25.58 53.67 18.86 44.44 15.40 35.20 16.47 32.95 17.46 32.63
2 LSTM + Att. [FMS+ 17] 24.30 53.01 18.64 43.59 15.40 35.03 16.73 33.26 18.07 32.94
Char + LSTM [BBS17] 24.84 53.37 19.26 44.75 15.50 35.38 17.39 33.89 18.80 33.98
Char + LSTM + Att. 26.01 54.34 19.39 45.1 15.58 35.61 17.44 34.26 18.64 33.86
Char + 2 LSTM + Att. 25.72 53.81 18.95 45.05 16.18 36.03 17.51 33.97 18.87 34.25
Figure 4: F1 and accuracy at top 5 for the model "Char + 2 LSTM + Word Att." on Twitter and Snap data, as a function of the number of labels (20 to 300).

4.1.2 Predicting Unicode Emoji Categories and Sub-categories

We predict Unicode emoji categories and sub-categories using the text message that includes only one emoji, as we did in the emoji prediction task. Table 4 shows the prediction results using the macro-F1 and a@5 evaluation metrics. In the first two blocks (main and sub lines), we predict the main category and sub-category respectively. The third block details the clusters' evaluation results, and the last block presents the emoji prediction results. The first line of each block reports the single-task results, and the remaining lines report the ones using a multi-task framework.

Table 4: Results for single and multi-task prediction of emojis, including main Unicode categories, sub-categories, and clusters.
Pred. Task         Loss                     Twitter: F1  A@5    Snap: F1  A@5
Main Category      Main                     46.56  84.70        45.23  87.90
                   Main + Sub               48.34  85.87        45.07  87.79
                   Main + Emoji             48.17  85.52        44.54  87.83
                   Main + Sub + Emoji       48.52  85.90        44.64  95.52
Sub Category       Sub                      31.62  51.02        32.15  53.81
                   Sub + Main               31.84  51.86        31.72  53.43
                   Sub + Emoji              32.00  52.23        31.99  53.72
                   Sub + Main + Emoji       32.24  52.40        31.88  65.19
Semantic Clusters  Clusters                 34.10  53.56        34.77  53.22
                   Clusters + Emoji         35.42  55.90        34.90  53.64
Emoji              Emoji                    21.59  40.06        18.64  33.86
                   Emoji + Main             21.62  37.80        19.05  34.24
                   Emoji + Sub              21.58  37.91        18.75  34.27
                   Emoji + Main + Sub       21.44  37.81        18.78  34.05
                   Emoji + Clusters         21.30  37.90        19.05  29.78

4.1.3 Predicting Clusters

Given a text message containing an emoji e, we predict the cluster that emoji e belongs to. Cluster creation is described in the Datasets section. Cluster results are reported in Table 4, in the lines corresponding to "Semantic Clusters". The results are better on Snap than on Twitter for broader classes, and our clusters capture semantics better than the categories and sub-categories of the Unicode Standard.

4.2 Multi-Task Predictions

In Table 4 we show the multi-task prediction results. We considered multiple multi-task combinations. Learning more than one objective task simultaneously helps in the main category prediction, as macro F1 improves from 46.56% to 48.52% (a 4.2% relative improvement) when also adding the sub-category and emoji losses. Sub-category prediction also improves when it is learned together with main categories and emojis.

On Snap data, the category and sub-category prediction tasks do not improve with a multi-task approach in terms of macro F1, but we obtain relative improvements of 8.67% and 21.14% using a@5.

The cluster prediction task also benefits from multi-task learning when combined with emoji prediction. However, emoji prediction does not seem to improve much in a multi-task setting for Twitter. Emoji prediction on Snap improves from 33.86% to 34.27%, a 1.21% relative improvement in terms of a@5, when it is learned together with Unicode sub-categories.
4.3 Qualitative Analysis

We analyzed in detail our emoji prediction approach (char + 2 LSTM + attention), based on the best performing system described in the previous section. This analysis enumerates the emojis that are easier and harder to predict. We also include some visualization examples showing where the attention module obtains more information. These examples provide us with a better understanding of the importance of the character and word features in our results.

4.3.1 What emoji is difficult to predict?

Table 5 shows a list of the top and bottom 10 emojis based on prediction accuracy. We investigated which emojis are difficult to predict, and we found interesting patterns. As expected, the emojis that are easier to predict describe specific objects without multiple meanings, or specific topics. These emojis, as suggested in [BRS16], can easily be replaced by a single word (such as "key"), or are used when specific words occur in a text message (such as "Christmas"). In both datasets, subjective emojis obtained the lowest accuracy values. These subjective emojis describe emotional information, and they can be interpreted differently among different users and based on the surrounding context. Hence, these emojis do not seem to have a specific meaning and become difficult to model.

Table 5: Top and bottom 10 emojis with best accuracy on Twitter (top) and on Snap (bottom).
Twitter, top 10:    87.5 83.87 83.53 83.33 81.33 80.88 79.86 79.86 79.56 78.95
Twitter, bottom 10: 4.71 4.57 4.43 3.97 1.94 1.94 1.52 1.24 0.63 0.58
Snap, top 10:       89.82 78.27 77.19 76.15 75.72 74.1 73.48 73.03 71.5 71.46
Snap, bottom 10:    3.46 3.37 3.13 2.92 2.82 2.41 1.86 1.01 0.29 0

4.3.2 Feature and Word Attention

We previously described the two types of attention explored. The Feature Attention approach gives more importance to either the character or the word representation of a word. The Word Attention approach increases the importance of the more discriminative words, for example the word "key" to predict the corresponding emoji. Figure 5 visualizes the weights of each of these two attention modules using three example messages. For each of them, we list the gold label ("G") and the predicted labels ("P"), along with their prediction probabilities, i.e., the output of the softmax layer. The internal weights of the two attention modules are visualized using text highlights: a darker color indicates more attention over a word (α_it from Formula 2). In the second line of each message the red highlight shows the weights of the feature attention (α of Formula 1). Bold text formatting indicates out-of-vocabulary words.

G: P: 0.26, 0.15, 0.13, 0.10, 0.04
we are having a cat party in room 135 #austinblueroos #bethedifference
G: P: 0.75, 0.02, 0.02, 0.01, 0.01
Feeling slightly pregnant but it was worth it
G: P: 0.97, 0.01, 0.002, 0.001, 0.001
It's official, I have found my #unicorn! Look at this geometric tribal print!

Figure 5: Word and feature attention visualization. For each message, the first line highlights the word attention in blue, while the second line shows the feature attention. Uncolored words mean almost zero attention over them.

Based on the three examples, and some additional ones that we manually evaluated, we verified how these two attention approaches work. The Word Attention module (blue highlight) gives us insights on the recognition of emojis. In the first example the most important word is "cat", and the predictions are indeed about cats, apart from the fifth predicted emoji, which is (probably) triggered by the presence of the token "135", on which the word attention module also focuses. In the second example, the attention goes to the word "pregnant", but in this case this word misleads the network, which incorrectly predicts baby emojis; the correct emoji is nonetheless predicted as the fourth option. In the last example, the network correctly classifies the emoji based on the hashtag "#unicorn".

Regarding the Feature Attention over the word or character representation of each token in a message, we observed that the character representation seems to gain importance on long and less frequent tokens, namely numbers, hashtags and, as expected, out-of-vocabulary words ("135" and "#austinblueroos").
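The kind of inspection shown in Figure 5 only needs the per-word attention weights produced by the model. The helper below is our own illustrative sketch of such a visualization (the weights in the usage example are made-up numbers, chosen to mimic the first message, where "cat" and "135" receive most of the attention).

```python
def show_word_attention(tokens, alphas):
    """tokens: list of words; alphas: word-attention weights summing to 1."""
    for token, alpha in sorted(zip(tokens, alphas), key=lambda x: -x[1]):
        bar = "#" * int(round(alpha * 20))          # crude text 'highlight'
        print(f"{token:>20s}  {alpha:5.2f}  {bar}")

# Usage example with illustrative (not measured) weights:
show_word_attention(
    ["we", "are", "having", "a", "cat", "party", "in", "room", "135"],
    [0.02, 0.02, 0.05, 0.01, 0.45, 0.15, 0.02, 0.08, 0.20],
)
```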
5 Conclusion

In this paper, we explored emoji prediction in two social media platforms, Twitter and Snapchat. We extended the emoji prediction task to a large number of emojis and showed that the prediction performance drops drastically between 50 and 100 emojis, while the addition of more emojis keeps the accuracy of the model roughly constant (even if it has to predict more emojis). We attribute these results to the specificity of the less-used emojis. We also proposed a novel task that predicts broader classes of emojis, grouping emojis into automatic clusters or predefined categories, as defined by the Unicode consortium. These new tasks allow us to better evaluate the predictions of the model, since plain emoji prediction may be ambiguous. We also carried out an extensive qualitative analysis in order to understand the importance of the character encoding of words in noisy social media text, the number of training examples, and the difficulties in modeling specific emojis.

Finally, we proposed a multi-task approach to predict emojis and emoji group affiliation at the same time. We showed that the model obtains significant improvements on the Twitter dataset, while more investigation is needed for the Snapchat dataset.

Acknowledgments

This work was done while Francesco B. was an intern at Snap Inc. Francesco B. also acknowledges support from the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE) and the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).
References

[BBRS18] Francesco Barbieri, Miguel Ballesteros, Francesco Ronzano, and Horacio Saggion. Multimodal emoji prediction. In Proceedings of NAACL: Short Papers, New Orleans, US, 2018. Association for Computational Linguistics.

[BBS17] Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. Are emojis predictable? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 105-111, Valencia, Spain, April 2017. Association for Computational Linguistics.

[BCC18] Francesco Barbieri and Jose Camacho-Collados. How Gender and Skin Tone Modifiers Affect Emoji Semantics in Twitter. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM 2018), New Orleans, LA, United States, 2018.

[Ben12] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In ICML Workshop, 2012.

[BKRS16] Francesco Barbieri, German Kruszewski, Francesco Ronzano, and Horacio Saggion. How Cosmopolitan Are Emojis? Exploring Emojis Usage and Meaning over Different Languages with Distributional Semantics. In Proceedings of the 2016 ACM on Multimedia Conference, pages 531-535, Amsterdam, Netherlands, October 2016. ACM.

[BRS16] Francesco Barbieri, Francesco Ronzano, and Horacio Saggion. What does this emoji mean? A vector space skip-gram model for Twitter emojis. In LREC, 2016.

[Car] R. Caruana. Multitask learning: A knowledge-based source of inductive bias. In ICML'93.

[Car97] Rich Caruana. Multitask learning. Machine Learning, 28(1):41-75, July 1997.

[CMS15] Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek. Image2emoji: Zero-shot emoji prediction for visual media. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1311-1314. ACM, 2015.

[CSG+18] Spencer Cappallo, Stacey Svetlichnaya, Pierre Garrigues, Thomas Mensink, and Cees G. M. Snoek. The new modality: Emoji challenges in prediction, anticipation, and retrieval. arXiv preprint arXiv:1801.10253, 2018.

[CW08] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML. ACM, 2008.

[ERA+16] Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bosnjak, and Sebastian Riedel. emoji2vec: Learning emoji representations from their description. In Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, pages 48-54, Austin, TX, USA, November 2016. Association for Computational Linguistics.

[FMS+17] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In EMNLP, 2017.

[GS05] Alex Graves and Juergen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18, 2005.

[HGS+17] Tianran Hu, Han Guo, Hao Sun, Thuy-vy Thi Nguyen, and Jiebo Luo. Spice up Your Chat: The Intentions and Sentiment Effects of Using Emoji. Proceedings of ICWSM 2017, 2017.

[KK17] Mayu Kimura and Marie Katsurai. Automatic construction of an emoji sentiment lexicon. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 1033-1036. ACM, 2017.

[LLM+15] W. Ling, T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso. Finding function in form: Compositional character models for open vocabulary word representation. EMNLP, 2015.

[MLS13] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.

[NSSM15] Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. Sentiment of emojis. PLoS ONE, 10(12):e0144296, 2015.

[RCP16] Marek Rei, Gamal K. O. Crichton, and Sampo Pyysalo. Attending to characters in neural sequence labeling models. In COLING, 2016.

[RPG+18] David Rodrigues, Marília Prada, Rui Gaspar, Margarida V. Garrido, and Diniz Lopes. Lisbon Emoji and Emoticon Database (LEED): Norms for emoji and emoticons in seven evaluative dimensions. Behavior Research Methods, pages 392-405, 2018.

[WBSD17a] Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, and Derek Doran. EmojiNet: An open service and API for emoji sense discovery. International AAAI Conference on Web and Social Media (ICWSM 2017), Montreal, Canada, 2017.

[WBSD17b] Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, and Derek Doran. A semantics-based measure of emoji similarity. Web Intelligence, 2017.

[YYD+16] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489, 2016.