=Paper=
{{Paper
|id=Vol-2263/paper021
|storemode=property
|title=Predicting Emoji Exploiting Multimodal Data: FBK Participation in ITAmoji Task
|pdfUrl=https://ceur-ws.org/Vol-2263/paper021.pdf
|volume=Vol-2263
|authors=Andrei Catalin Coman,Yaroslav Nechaev,Giacomo Zara
|dblpUrl=https://dblp.org/rec/conf/evalita/ComanNZ18
}}
==Predicting Emoji Exploiting Multimodal Data: FBK Participation in ITAmoji Task==
Andrei Catalin Coman, Yaroslav Nechaev, Giacomo Zara
Fondazione Bruno Kessler
coman@fbk.eu, nechaev@fbk.eu, gzara@fbk.eu
Abstract

In this paper, we present our approach that won the ITAmoji task of the 2018 edition of the EVALITA evaluation campaign¹. ITAmoji is a classification task for predicting the most probable emoji (out of 25 classes) to go along with a target tweet written by a given person in Italian. We demonstrate that using only textual features is insufficient to achieve reasonable performance levels on this task, and propose a system that is able to benefit from the multimodal information contained in the training set, enabling significant F1 gains and earning us the first place in the final ranking.

1 Introduction

Particularly over the last few years, with the increasing presence of social networks and instant messaging services in our lives, we have been witnessing how common it has become for average users to enrich natural language by means of emojis. An emoji is essentially a symbol placed directly into the text, which is meant to convey a simple concept or, more specifically, as the name says, an emotion.

The emoji phenomenon has attracted considerable research interest. In particular, recent works have studied the connection between natural language and the emojis used in a specific piece of text. The 2018 edition of the EVALITA ITAmoji competition (Ronzano et al., 2018) is a prime example of such interest. In this competition, participants were asked to predict one of 25 emojis to be used in a given Italian tweet, based on its text, the date, and the user that has written it. Differently from the similar SemEval challenge (Barbieri et al., 2018), the addition of the user information significantly expanded the scope of potential solutions that could be devised.

In this paper, we describe our neural network-based system, which exhibited the best performance among the approaches submitted to this task. Our approach is able to successfully exploit user information, such as the prior emoji usage history of a user, in conjunction with the textual features that are customary for this task. In our experiments, we have found that using just the textual information from the tweet provides limited results: none of our text-based models were able to outperform a simple rule-based baseline built on the prior emoji history of a target user. However, by considering all the modalities of the input data that were made available to us, we were able to improve our results significantly. Specifically, we combine into a single efficient neural network the typical Bi-LSTM-based recurrent architecture, which has previously shown excellent performance on this task, with a multilayer perceptron applied to user-based features.

¹ EVALITA: http://evalita.it/2018
Figure 1: A diagram of the approach.
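The two-branch design summarized in Figure 1 (a Bi-LSTM over the padded token sequence and a tanh MLP over the user emoji-history vector, concatenated into a softmax over 25 classes, as detailed in Sections 2.3 and 2.4) can be sketched in Keras. This is an illustrative reconstruction under stated assumptions, not the authors' code from the emojinet repository: `VOCAB_SIZE` and all identifiers are assumptions, while layer sizes follow Table 1.

```python
# Illustrative Keras sketch of the joint architecture in Figure 1.
# Sizes follow Table 1; names and VOCAB_SIZE are assumptions, not the authors' code.
import numpy as np
from tensorflow.keras import layers, models, regularizers, optimizers

VOCAB_SIZE = 50000   # assumption: actual size depends on the tokenizer
SEQ_LEN = 48         # padded tweet length (Table 1a)
EMB_DIM = 100        # custom-100d fastText embeddings
HIST_DIM = 1284      # sparse user emoji-history vector (Section 2.2)
N_CLASSES = 25

# Textual branch: regularized embedding + dropout + Bi-LSTM.
text_in = layers.Input(shape=(SEQ_LEN,), name="tokens")
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM,
                       embeddings_regularizer=regularizers.l2(1e-6))(text_in)
emb = layers.Dropout(0.4)(emb)
tweet_vec = layers.Bidirectional(layers.LSTM(256))(emb)  # 2 x 256 = 512

# User branch: two dense tanh layers over the emoji-history vector.
hist_in = layers.Input(shape=(HIST_DIM,), name="emoji_history")
h = layers.Dense(256, activation="tanh",
                 kernel_regularizer=regularizers.l2(1e-5))(hist_in)
h = layers.Dense(256, activation="tanh",
                 kernel_regularizer=regularizers.l2(1e-5))(h)

# Concatenate (512 + 256 = 768) and classify over the 25 emojis.
joint = layers.Concatenate()([tweet_vec, h])
out = layers.Dense(N_CLASSES, activation="softmax")(joint)

model = models.Model([text_in, hist_in], out)
# Table 1c also lists decay=0.001; omitted here for compatibility
# across Keras versions.
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy")

def class_weights(class_counts, alpha=1.1):
    """Eq. (1): macro-F1-oriented weights, w_i = (max_j N_j / N_i) ** alpha."""
    counts = np.asarray(class_counts, dtype=float)
    return (counts.max() / counts) ** alpha
```

The per-class weights would then be passed to training via something like `model.fit(..., class_weight=dict(enumerate(class_weights(train_counts))))`, penalizing mistakes on rare classes more heavily.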
2 Description of the System

The ITAmoji task is a classification task of predicting one of 25 emojis to go along with a tweet. The training set provided by the organizers of the competition consists of 250,000 Italian tweets, including for each tweet the text (without the target emoji), the user ID and the timestamp as features. Participants were explicitly forbidden to expand the training set. Figure 1 provides an overview of our approach. In this section, we provide detailed descriptions of the methods we employed to solve the proposed task.

2.1 Textual features

In order to embed the textual content of the tweet, we have decided to apply a vectorization based on fastText (Bojanowski et al., 2017), a recent approach for learning unsupervised low-dimensional word representations. fastText sees words as collections of character n-grams, learning a representation for each n-gram. fastText follows the well-known distributional semantics hypothesis utilized in other approaches, such as LSA, word2vec and GloVe. In this work, we exploit the Italian embeddings trained on text from Wikipedia and Common Crawl² and made available by the fastText authors³. These embeddings include 300-dimensional vectors for each of the 2M words in the vocabulary. Additionally, we have trained our own⁴ embeddings using a corpus of 48M Italian tweets acquired from the Twitter Streaming API. This yielded 1.1M 100-dimensional word vectors. Finally, we have also conducted experiments with the word vectors suggested by the task organizers (Barbieri et al., 2016).

2.2 User-based features

Rather than relying solely on the text of the target tweet, we exploit additional user-based features to improve performance. The task features many variations of the smiling face and three different heart emojis, making it impossible even for a human to determine the most suitable one based on the tweet alone. One of the features we considered was the prior emoji distribution for a target author. The hypothesis was that the choice of a particular emoji is driven mainly by personal user preferences, exemplified by previous emoji choices.

To this end, we have collected two different types of emoji history for each user. Firstly, we use the labels in the training set to compute emoji distributions for each user, yielding vectors of size 25. Users from the test set that were not present in the training set were initialized with zeroes. Secondly, we have gathered the last 200 tweets for each user using the Twitter API⁵, and then extracted and counted all emojis that were present in those tweets. This yielded a sparse vector of size 1284. At this step we took extra care to prevent data leaks: if a tweet from the test set ended up among the collected 200 tweets, it was not considered in the user history. The runs that used the former, training set-based approach have a "_tr" suffix in their name. The ones that used the full user history-based approach have a "_ud" suffix.

In addition to the prior emoji distribution, we did preliminary experiments with the user's social graph. The social graph, i.e., the graph of connections between users, has been shown to be an important feature for many tasks on social media, for example, user profiling. We followed a recently proposed approach (Nechaev et al., 2018a) to acquire 300-dimensional dense user representations based on the social graph. This feature, however, did not improve the performance of our approach and was excluded.

² http://commoncrawl.org/
³ https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
⁴ https://doi.org/10.5281/zenodo.1467220
⁵ https://developer.twitter.com
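The two history features described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the helper names, the toy three-emoji vocabulary and the (tweet_id, label) record format are illustrative, not taken from the authors' pipeline, and the real vectors have sizes 25 and 1284 respectively.

```python
# Sketch of the two user emoji-history features from Section 2.2.
# Names, record formats and the toy vocabulary are illustrative only.
from collections import Counter

EMOJI_VOCAB = ["red_heart", "face_with_tears_of_joy", "smiling_face"]  # 25 in the task

def training_set_distribution(user_tweets, emoji_vocab=EMOJI_VOCAB):
    """Normalized emoji distribution from a user's training-set tweets
    (vector of size 25 in the task); unseen users keep the all-zero vector."""
    counts = Counter(label for _, label in user_tweets)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(emoji_vocab)
    return [counts[e] / total for e in emoji_vocab]

def full_history_counts(recent_tweets, test_ids, emoji_vocab=EMOJI_VOCAB):
    """Emoji counts over a user's last 200 tweets (sparse vector of size 1284
    in the task), skipping any tweet that also appears in the test set."""
    counts = Counter()
    for tweet_id, emojis in recent_tweets:
        if tweet_id in test_ids:
            continue  # leak prevention: test tweets never enter the history
        counts.update(emojis)
    return [counts[e] for e in emoji_vocab]
```

The leak check mirrors the paper's precaution: any of the 200 recent tweets that also occurs in the test set is excluded before counting.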
Layer          Parameter          Value
Textual Input  seq. length        48
Embedding      input              256
               output             100
               trainable          true
               l2 regularization  10^-6
Dropout        probability        0.4
Bi-LSTM        output             512

(a) Bi-LSTM model hyperparameters.

Layer          Parameter          Value
History Input  input              1284
Dense          output             256
               l2 regularization  10^-5
               activation         tanh
Dense          output             256
               l2 regularization  10^-5
               activation         tanh

(b) User history model hyperparameters.

Layer          Parameter          Value
Concatenate    output             768
Dense          output             25
               l2 regularization  –
               activation         softmax
Optimizer      method             Adam
               learning rate      0.001
               β1                 0.9
               β2                 0.999
               decay              0.001

(c) Joint model and optimizer parameters.

Table 1: Model hyperparameters.

2.3 RNN exploiting textual features

Recurrent Neural Networks (RNNs) have turned out to be a powerful architecture when it comes to analyzing and performing prediction on sequential data. In particular, over the last few years, different variations of the RNN have been shown to be the top-performing approaches for a wide variety of tasks, including tasks in Natural Language Processing (NLP). An RNN consumes the input sequence one element at a time, modifying its internal state along the way to capture relevant information from the sequence. When used for NLP tasks, an RNN is able to consider the entirety of the target sentence, capturing even the longest dependencies within the text. In our system, we use the bi-directional long short-term memory (Bi-LSTM) variation of the RNN. This variation uses two separate RNNs to traverse the input sequence in both directions (hence bi-directional) and employs LSTM cells.

The input text provided by the organizers is split into tokens using a modified version of the Keras tokenizer (which can be found in our repository). Then, the input tokens are turned into word vectors of fixed dimensionality using the embedding matrix of one of the approaches listed in Section 2.1. The resulting sequence is padded with zeroes to a constant length, in our case 48, and fed into the neural network.

2.4 Overall implementation

In order to accommodate both textual and user-based features, we devise a joint architecture that takes both types of features as input and produces a probability distribution over the target 25 classes. The general logic of our approach is shown in Figure 1. The core consists of two main components:

• Bi-LSTM. The recurrent unit consumes the input sequence one vector at a time, modifying the hidden state (i.e., memory). After the whole sequence is consumed in both directions, the internal states of the two RNNs are concatenated and used as a tweet embedding. Additionally, we perform l2-regularization of the input embedding matrix and apply dropout to prevent overfitting and fine-tune the performance. Table 1a details the hyperparameters we used for the textual part of our approach.

• User-based features. The emoji distribution we collected (as described in Section 2.2) was fed as input to a multilayer perceptron: two densely-connected layers with tanh as activation and l2-regularization to prevent overfitting. Table 1b showcases the chosen hyperparameters for this component using the full user history as input.

The outputs of the two components are then concatenated, and a final layer with the softmax activation is applied to acquire the probability distribution over the 25 emoji labels. The network is then optimized jointly with cross entropy as the objective function using the Adam optimizer. Table 1c includes all relevant hyperparameters we used for this step.

Since the runs are evaluated based on macro-F1, in order to optimize our approach for this metric we have introduced class weights into the objective function. Each class i is associated with the weight:

    w_i = (max_j N_j / N_i)^α    (1)

where N_i is the number of samples in class i and α = 1.1 is a hyperparameter we tuned
for this task. This way the optimizer assigns a greater penalty to mistakes in rare classes, thus optimising for the target metric.

During the training of our approach, we employ an early stopping criterion to halt the training once the performance on the validation set stops improving. In order to properly evaluate our system, we employ 10-fold cross-validation, additionally extracting a small validation set from the training set of each fold to perform the early stopping. For the final submission we use a simple ensemble mechanism, where predictions are acquired independently from each fold and then averaged to produce the final submission. Additionally, one of the runs was submitted using predictions from a single random fold. Runs exploiting the ensemble approach have the "_10f" suffix, while runs using just one fold have the "_1f" suffix.

The code used to preprocess data, train and evaluate our approach is available on GitHub⁶.

⁶ GitHub repository: https://github.com/Remper/emojinet

3 Evaluation setting

In this section, we provide details on some of the approaches we have tested during the development of our system, as well as the models we submitted for the official evaluation. In this paper, we report results for the following models:

• MF HISTORY. A rule-based baseline that always outputs the most frequent emoji from the user history, based on the training set.

• BASE CNN. A basic Convolutional Neural Network (CNN) taking word embeddings as input, without any user-based features.

• BASE LSTM. The Bi-LSTM model described in Section 2.3, used with textual features only.

• BASE LSTM TR. The complete approach including both feature families, with the emoji distribution coming from the training set.

• BASE LSTM UD. The complete approach, with the emoji distribution coming from the most recent 200 tweets of each user.

For the other models tested during our local evaluation and the complete experimental results, please refer to our GitHub repository.

Additionally, for the BASE LSTM approach we report performance variations due to the choice of a particular word embedding approach. In particular, provided refers to the embeddings suggested by the organisers, custom-100d indicates our fastText-based embeddings, and common-300d refers to the ones available on the fastText website. Table 2 details the performances of the mentioned models.

Finally, we submitted three of our best models for the official evaluation. All of the submitted runs use the Bi-LSTM approach with our custom-100d word embeddings, along with some variation of the user emoji distribution as detailed in Section 2.2. Two of the runs use the ensembling trick with all available cross-validation folds, while the remaining one ("_1f") uses predictions from just one fold.

4 Results

Here we report the performances of the models benchmarked both during our local evaluation (Table 2) and in the official results (Table 3). We started experiments with just the textual models, testing different architectures and embedding combinations. Among those, the Bi-LSTM architecture was a clear choice, providing a 1-2% F1 gain over CNN, which led us to abandon the CNN-based models. Among the three word embedding models we evaluated, our custom-100d embedding exhibited the best performance with the Bi-LSTM, while common-300d showed the best performance with the CNN architecture.

After we had acquired the user emoji distributions, we devised a simple baseline (MF HISTORY), which, to our surprise, outperformed all the text-based models we had tested so far: a 3% F1 improvement compared to the best Bi-LSTM model. When we introduced the user emoji histories into our approach, we obtained a significant performance gain: 4% when using the scarce training set data and 12% when using the complete user history of 1284 emojis from recent tweets. During the final days of the competition, we tried to exploit other user-based features to further bolster our results, for example, the social graph of a user. Unfortunately, such experiments did not yield performance gains before the deadline.

During the official evaluation, the complete user history-based runs exhibited top performance, with the ensembling trick actually decreasing the final F1. As we expected from our experiments, the training set-based emoji distribution was much less performant, but still offered a significant improvement over the runner-up team (gw2017 p.list), as shown in Table 3. Additionally, we detail the performance of our best submission (BASE UD 1F) for each individual emoji in Table 4 and Figure 2.

Approach      Embedding    Accuracy  Precision  Recall  F1 macro
MF HISTORY    –            0.4396    0.4076     0.2774  0.3133
BASE CNN      common-300d  0.4351    0.3489     0.2464  0.2673
BASE LSTM     common-300d  0.4053    0.3167     0.2534  0.2707
BASE LSTM     provided     0.4415    0.3836     0.2408  0.2622
BASE LSTM     custom-100d  0.4443    0.3666     0.2586  0.2809
BASE LSTM TR  custom-100d  0.4874    0.4343     0.3218  0.3565
BASE LSTM UD  custom-100d  0.5498    0.4872     0.4097  0.4397

Table 2: Performance of the approaches as tested locally by us.

Run            Accuracy@5  Accuracy@10  Accuracy@15  Accuracy@20  F1 macro
BASE UD 1F     0.8167      0.9214       0.9685       0.9909       0.3653
BASE UD 10F    0.8152      0.9194       0.9681       0.9917       0.3563
BASE TR 10F    0.7453      0.8750       0.9434       0.9800       0.2920
gw2017 p.list  0.6718      0.8148       0.8941       0.9299       0.2329

Table 3: Official evaluation results for our three submitted runs and the runner-up model.

Precision  Recall  F1      Support
0.7991     0.6490  0.7163  5069
0.4765     0.7116  0.5708  4966
0.6402     0.4337  0.5171  279
0.5493     0.4315  0.4834  387
0.4937     0.4453  0.4683  265
0.7254     0.3229  0.4469  319
0.3576     0.5370  0.4293  2363
0.4236     0.4089  0.4161  834
0.4090     0.3775  0.3926  506
0.4034     0.3354  0.3663  1282
0.4250     0.3299  0.3715  885
0.3743     0.3184  0.3441  1338
0.3684     0.3239  0.3447  1028
0.3854     0.2782  0.3231  266
0.3844     0.2711  0.3179  546
0.3899     0.2648  0.3154  642
0.3536     0.2743  0.3089  700
0.3835     0.2566  0.3075  417
0.3525     0.1922  0.2488  541
0.2866     0.2639  0.2748  341
0.2280     0.2922  0.2562  373
0.2751     0.2133  0.2403  347
0.2845     0.1741  0.2160  379
0.3154     0.1822  0.2310  483
0.2956     0.1824  0.2256  444

Table 4: Precision, Recall, F1 of our best submission and the number of samples in the test set for each emoji.

5 Discussion and Conclusions

Our findings suggest that emojis are currently used mostly according to user preferences: the more prior user history we added, the more significant the performance boost we observed. Therefore, the emojis in a text cannot be considered independently of the person that has used them, and textual features alone cannot yield a sufficiently performant approach for predicting emojis. Additionally, we have shown that the task was sensitive to the choice of a particular neural architecture, as well as to the choice of the word embeddings used to represent the text.

An analogous task was proposed to the participants of the SemEval 2018 competition. The winners of that edition applied an SVM-based approach for the classification (Çöltekin and Rama, 2018). Instead, we have opted for a neural network-based architecture that allowed us greater flexibility to experiment with various features coming from different modalities: the text of the tweet, represented using word embeddings, and the sparse user-based history. During our experiments with the SemEval 2018 task as part of the NL4AI workshop (Coman et al., 2018), we found the CNN-based architecture to perform better, while here the RNN was a clear winner. Such a discrepancy might suggest that, even within the emoji prediction task, the effectiveness of different approaches may vary significantly based either on the language of the tweets or on the way the dataset was constructed.

In the future, we would like to investigate this topic further by studying differences in emoji usage between languages and communities. Additionally, we aim to further improve our approach by identifying more user-based features, for example, by taking into account the feature families suggested by Nechaev et al. (2018b).

Figure 2: Confusion matrix for our best submission, normalized by support size: each value in a row is divided by the row marginal. Diagonal values give the recall for each individual class (see Table 4).

References

Francesco Barbieri, German Kruszewski, Francesco Ronzano, and Horacio Saggion. 2016. How cosmopolitan are emojis?: Exploring emojis usage and meaning over different languages with distributional semantics. In Proceedings of the 2016 ACM on Multimedia Conference, pages 531–535. ACM.

Francesco Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa-Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio Saggion. 2018. SemEval-2018 Task 2: Multilingual Emoji Prediction. In Proc. of the 12th Int. Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, United States. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Çağrı Çöltekin and Taraka Rama. 2018. Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in emoji prediction. In Proc. of the 12th Int. Workshop on Semantic Evaluation, SemEval@NAACL-HLT, New Orleans, Louisiana, pages 34–38.

Andrei Catalin Coman, Giacomo Zara, Yaroslav Nechaev, Gianni Barlacchi, and Alessandro Moschitti. 2018. Exploiting deep neural networks for tweet-based emoji prediction. In Proc. of the 2nd Workshop on Natural Language for Artificial Intelligence co-located with the 17th Int. Conf. of the Italian Association for Artificial Intelligence (AI*IA 2018), Trento, Italy.

Yaroslav Nechaev, Francesco Corcoglioniti, and Claudio Giuliano. 2018a. SocialLink: Exploiting graph embeddings to link DBpedia entities to Twitter profiles. Progress in AI, 7(4):251–272.

Yaroslav Nechaev, Francesco Corcoglioniti, and Claudio Giuliano. 2018b. Type prediction combining linked open data and social media. In Proc. of the 27th ACM Int. Conf. on Information and Knowledge Management, CIKM 2018, Torino, Italy, pages 1033–1042.

Francesco Ronzano, Francesco Barbieri, Endang Wahyu Pamungkas, Viviana Patti, and Francesca Chiusaroli. 2018. Overview of the EVALITA 2018 Italian Emoji Prediction (ITAmoji) task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.