<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The UNIBA System at the EVALITA 2018 Italian Emoji Prediction Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniela Girardi</string-name>
          <email>daniela.girardig@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <addr-line>Via E. Orabona, 4 - 70125 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our participation in the ITAmoji task at EVALITA 2018 (Ronzano et al., 2018). Our approach is based on three sets of features: micro-blog and keyword features, sentiment lexicon features, and semantic features. We exploit these features to train and combine several classifiers using different libraries. The results show that the selected features are not appropriate for training a linear classifier to properly address the emoji prediction task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Nowadays, emojis are widely used to express sentiments and emotions in written communication, which is becoming more and more popular due to the increasing use of social media. Emojis can help users express and encode many different messages and, being very intuitive, they can be easily interpreted by a wide audience. However, their meaning is sometimes misleading, resulting in the misunderstanding of the entire message. Emoji prediction has captured the interest of researchers, since it could help improve sentiment analysis and user profiling tasks, as well as the retrieval of social network material.</p>
      <p>
        In particular, in the context of the International Workshop on Semantic Evaluation (SemEval 2018), the Multilingual Emoji Prediction Task
        <xref ref-type="bibr" rid="ref2 ref8">(Barbieri et al., 2018)</xref>
        was proposed, challenging the research community to automatically model the semantics of emojis occurring in English and Spanish Twitter messages. During this challenge,
        <xref ref-type="bibr" rid="ref1">Barbieri et al. (2017)</xref>
        created a model that outperforms humans in predicting the most probable emoji associated with a given tweet.
      </p>
      <p>Twitter supports more than 1,000 emojis, belonging to different categories (e.g. smileys and people, animals, fruits), and this number keeps growing.</p>
      <p>
        In this paper, we use a set of features which showed promising results in predicting the sentiment polarity of tweets
        <xref ref-type="bibr" rid="ref3">(Basile and Novielli, 2014)</xref>
        in order to understand whether they can also be used to predict emojis. The paper is organized as follows: Section 2 describes the system and the exploited features, while in Section 3 we report the results obtained using different classifiers and their ensemble. Finally, in Section 4 we discuss our findings and Section 5 reports the conclusions.
      </p>
      <p>2 System Description
In this section, we describe the approach used to solve the ITAmoji challenge. The task is structured as a multi-class classification problem, since each tweet must be assigned exactly one of 25 mutually exclusive emojis.</p>
      <p>The feature extraction was performed entirely in Java. First of all, each tweet was tokenized and stop-words were removed by exploiting the “Twitter NLP and Part-of-Speech Tagging” API (http://www.cs.cmu.edu/~ark/TweetNLP/) developed at Carnegie Mellon University. No other NLP steps, such as stemming or PoS-tagging, were applied, since those features were considered not relevant for this particular kind of task.</p>
      <p>Then we moved to the extraction of the features from the training data. These features can be categorized into three sets: one addressing keyword and micro-blog features, a second one exploiting the polarity of each word in a sentiment lexicon, and a third one using word representations obtained through a distributional semantic model.</p>
      <p>A description of the different sets of features will
be provided in Section 2.1.</p>
      <p>
        After the feature extraction, we obtained a total set of 342 features to be used to train a linear classifier. For classification, we decided to exploit the Weka API (http://www.cs.waikato.ac.nz/ml/weka/) and to use an ensemble of three different classifiers in order to obtain better predictive results. The three classifiers are: L2-regularized L2-loss support vector classification, L2-regularized logistic regression, and a random forest classifier. The first two algorithms are based on the WEKA wrapper class for the Liblinear classifier
        <xref ref-type="bibr" rid="ref6">(Fan et al., 2008)</xref>
        and were trained on the whole set of features, while the random forest was trained only on the keyword and micro-blog features. The classifiers were combined using the soft-voting technique, which averages the probability scores returned by the individual classifiers.
      </p>
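      <p>As an illustration, soft voting simply averages the per-class probability distributions returned by the individual classifiers and predicts the class with the highest mean probability. A minimal sketch with made-up probabilities over three classes:</p>
      <preformat>
```python
import numpy as np

# Made-up probability outputs of three classifiers over three classes.
p_svm = np.array([0.6, 0.3, 0.1])
p_logreg = np.array([0.5, 0.4, 0.1])
p_forest = np.array([0.2, 0.5, 0.3])

# Soft voting: average the distributions, then pick the arg-max class.
avg = (p_svm + p_logreg + p_forest) / 3
prediction = int(np.argmax(avg))

print(avg)         # mean probability for each class
print(prediction)  # index of the predicted class
```
      </preformat>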
      <p>
        In light of the results provided by the task organizers, we conducted an in-depth analysis of our solution and discovered that, due to a problem in the Liblinear WEKA wrapper, not all the classifiers returned a set of probability scores for multi-class classification, thus compromising the results of the whole ensemble. Therefore, although outside the time frame of the challenge, we decided to use scikit-learn
        <xref ref-type="bibr" rid="ref4">(Buitinck et al., 2013)</xref>
        to rebuild our classifiers and evaluate the impact of the selected features.
      </p>
      <p>All the results will be summarized and
discussed in Section 3 and Section 4.</p>
      <p>
        2.1 Features
As in the previous work of
        <xref ref-type="bibr" rid="ref3">(Basile and Novielli, 2014)</xref>
        , we defined three groups of features, based on (i) keyword and micro-blogging characteristics, (ii) a sentiment lexicon, and (iii) a Distributional Semantic Model (DSM). Keyword-based features exploit the tokens occurring in the tweets, considering only unigrams. During the tokenization phase, user mentions, URLs and hashtags are replaced with three meta-tokens, “USER”, “URL” and “TAG”, in order to count them and include their number as features. Other features connected to the micro-blogging environment are: the presence of exclamation and interrogative marks; adversative, disjunctive, conclusive and explicative words; the use of uppercase; and informal expressions of laughter, such as “ah ah”. The list of micro-blogging features is reported in Table 1.
      </p>
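      <p>A minimal sketch of this kind of feature extraction (illustrative only: the regular expressions and feature names below are our own simplification, not the exact implementation of the system):</p>
      <preformat>
```python
import re

def microblog_features(tweet: str) -> dict:
    """Count micro-blog cues in a tweet (simplified illustration)."""
    return {
        # Counts of user mentions, URLs and hashtags (the meta-token features).
        "user": len(re.findall(r"@\w+", tweet)),
        "url": len(re.findall(r"https?://\S+", tweet)),
        "tag": len(re.findall(r"#\w+", tweet)),
        # Punctuation cues.
        "exclamation": tweet.count("!"),
        "interrogative": tweet.count("?"),
        # Orthographic cues.
        "uppercase_ch": sum(1 for c in tweet if c.isupper()),
        # Informal laughter such as "ahah", "ahahah", ...
        "ahah_repetition": len(re.findall(r"(?:ah){2,}", tweet.lower())),
    }

print(microblog_features("Che bello!! ahahah @amico guarda http://t.co/x #estate"))
```
      </preformat>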
      <p>
        The second block of features consists of sentiment lexicon features. As Italian lexical database, we used MultiWordNet
        <xref ref-type="bibr" rid="ref7">(Pianta et al., 2002)</xref>
        , where each lemma is assigned a positive, a negative and a neutral score. In particular, we include features based on the prior polarity of the words in the tweets. To deal with mixed-polarity cases, we defined two sentiment variation features so as to capture the simultaneous expression of positive and negative sentiment. We decided to include features related to the polarity of the tweets since emojis can be intuitively categorized into positive and negative ones and are usually used to reinforce the sentiment expressed. The list of sentiment lexicon features is reported in Table 2. The last group of features is the semantic one, which exploits a Distributional Semantic Model. We used the vector embeddings of each word and the superposition operator
        <xref ref-type="bibr" rid="ref9">(Smolensky, 1990)</xref>
        to compute an overall vector representation of the tweet. Analogously, we computed a prototype vector for each polarity class (positive, negative, subjective and objective) as the sum of the vector representations of all the tweets belonging to that class. Finally, we computed the element-wise minimum and maximum of the vector representations of the words in each tweet, and the resulting vectors were concatenated and used as features. This approach has proved to work well, and to be easy to compute, for small texts like tweets and other micro-blog posts
        <xref ref-type="bibr" rid="ref5">(De Boom et al., 2016)</xref>
        . The list of semantic features is reported in Table 3.
      </p>
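      <p>The tweet-level representations described above can be sketched with NumPy as follows (toy 4-dimensional embeddings and a toy prototype; the real system uses pre-trained word vectors):</p>
      <preformat>
```python
import numpy as np

# Toy word embeddings for the words of one tweet (4-dimensional for brevity).
words = np.array([
    [0.1, 0.4, -0.2, 0.0],
    [0.3, -0.1, 0.5, 0.2],
    [-0.2, 0.2, 0.1, 0.4],
])

# Superposition operator: the tweet vector is the sum of its word vectors.
tweet_vec = words.sum(axis=0)

# Element-wise minimum and maximum over the word vectors, then concatenation.
pooled = np.concatenate([words.min(axis=0), words.max(axis=0)])

# Cosine similarity between the tweet vector and a (toy) class prototype.
prototype = np.array([0.2, 0.5, 0.4, 0.6])
cos = tweet_vec @ prototype / (np.linalg.norm(tweet_vec) * np.linalg.norm(prototype))

print(tweet_vec)     # the superposed tweet representation
print(pooled.shape)  # min and max pooling doubles the dimensionality
```
      </preformat>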
      <p>Table 1: Micro-blogging features
Microblog        Description
tag              total occurrences of hashtags
url              total occurrences of URLs
user             total occurrences of user mentions
neg count        total occurrences of the negation word “non”
exclamation      total occurrences of exclamation marks
interrogative    total occurrences of interrogative marks
adversative      total occurrences of adversative words
disjunctive      total occurrences of disjunctive words
conclusive       total occurrences of conclusive words
explicative      total occurrences of explicative words
uppercase ch     number of upper-case characters
repeat ch        number of consecutive repetitions of a character in a word
ahah repetition  total occurrences of the “ahah” laughter expression</p>
      <p>3 Results
The goal of the ITAmoji challenge is to evaluate the capability of each system to predict the right emoji associated with a tweet, regardless of its position in the text.</p>
      <p>The organizers selected a subset of 25 emojis and provided 250,000 tweets for training; each tweet contains exactly one emoji, which is extracted from the text and given as the target label. The training set is very unbalanced, since three emojis (i.e. red heart, face with tears of joy, and smiling face with heart eyes) represent almost 50% of the whole dataset.</p>
      <p>For the evaluation, the organizers created a test set made up of 25,000 tweets, keeping the ratio of the different classes over the whole set unchanged. The prediction for each tweet consists of the list of all 25 emojis, ordered by their probability of being associated with the tweet: in this way, it is possible to evaluate the systems according to their accuracy up to a certain position in the rank. Nevertheless, only the first emoji was mandatory for the submission.</p>
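      <p>Accuracy@k over such ranked predictions can be computed as in the following sketch (the rankings and gold labels are invented for illustration):</p>
      <preformat>
```python
def accuracy_at_k(rankings, gold, k):
    """Fraction of tweets whose gold emoji appears in the top-k of the ranking."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

# Invented ranked predictions (best-first) for three tweets, with gold labels.
rankings = [
    ["red heart", "face with tears of joy", "sparkles"],
    ["face with tears of joy", "red heart", "sun"],
    ["sun", "sparkles", "red heart"],
]
gold = ["red heart", "sun", "red heart"]

print(accuracy_at_k(rankings, gold, 1))  # only the top prediction counts
print(accuracy_at_k(rankings, gold, 3))  # gold emoji anywhere in the top 3
```
      </preformat>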
      <p>Systems were ranked according to the macro F-measure, but other metrics were also calculated, i.e. the micro F-measure, the weighted F-measure, the coverage error and the accuracy (measured @5, @10, @15 and @20). The final results of the challenge are reported in Table 5. We can see that, while there is quite a difference between the macro-F1 scores of the various runs, the same does not happen with the micro F1 scores.</p>
      <p>The same happens with the accuracy scores where, setting aside two runs, all the others obtain a result between 0.5 and 0.8. In other words, even if the macro-F1 measure appears to be the most discriminating factor among the runs, such a result is driven by the presence of a few classes with a very large number of instances, which causes the classifiers to overfit on them.</p>
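      <p>The gap between macro and micro F1 on unbalanced data can be reproduced with scikit-learn: a degenerate classifier that always predicts the majority class obtains a high micro F1 but a low macro F1, since macro averaging weights every class equally (toy labels for illustration):</p>
      <preformat>
```python
from sklearn.metrics import f1_score

# Toy unbalanced gold labels: class 0 dominates, classes 1 and 2 are rare.
y_true = [0] * 8 + [1] + [2]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(micro)  # high: dominated by the majority class
print(macro)  # low: the ignored rare classes each contribute an F1 of 0
```
      </preformat>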
      <p>Table 6 summarizes the results obtained using both WEKA (the submitted run, highlighted in italic) and scikit-learn. We used the scikit-learn library to perform a classification using logistic regression and then added, through a soft-voting technique, a Naive Bayes classifier and a Random Forest (rows 4 and 5, respectively).</p>
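      <p>A minimal sketch of such a scikit-learn soft-voting ensemble, on synthetic data (the estimators and hyper-parameters shown are illustrative, not our exact configuration):</p>
      <preformat>
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the tweet feature matrix (342 features in our system).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

# Soft voting averages the per-class probabilities of the three estimators.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("forest", RandomForestClassifier(random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X, y)

# predict_proba gives the ranked scores needed for the per-tweet emoji ranking.
proba = ensemble.predict_proba(X[:1])
print(proba.shape)
```
      </preformat>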
      <p>
        From these results we can see that, independently of the classifier used, the final results in terms of the evaluation metrics over the test dataset remain quite similar. Specifically, these results depend on the fact that our system predicts only two labels in first position, namely “red heart” and “face with tears of joy”, and is therefore unable to correctly classify the other classes, as shown in Table 4. This outcome is probably due to the set of features that we used, which does not manage to appropriately model the data in this task, even though it proved to be successful in another sentiment analysis context
        <xref ref-type="bibr" rid="ref3">(Basile and Novielli, 2014)</xref>
        . In the last column of Table 6, we report the average macro-F1 obtained by performing 5-fold cross-validation. The value for the first evaluation was not calculated because of the library fault described in Section 2.
      </p>
      <p>4 Discussion
The overall results of the challenge show that this task is non-trivial and difficult to solve with high precision, and the reason is intrinsic to the task itself. First of all, there are several emojis which often differ only slightly from each other; furthermore, their meaning deeply depends on the single user and on the context. In fact, a single emoji could be used to convey both joy and fun or, on the contrary, it could be used ironically with a negative meaning. To this extent, an interesting update of the task could be to leave the text of the tweet as it is, so that the position of the emoji could also be exploited to detect irony and other variations.</p>
      <p>From the analysis of the overall results of the task, it emerged that there is a large gap between the macro-F1 scores which is not reflected in the micro-F1 scores. For this particular task, where both the training and the test datasets are heavily unbalanced, we think that the micro-F1 score is better suited to capture the performance of the submitted systems, since it takes into account the support of each class.</p>
      <p>There is one particularly interesting result, namely the value of the 5-fold cross-validation using only the logistic regression as a classifier, which is particularly high (0.358) and is in contrast with the final score. This aspect surely needs further investigation.</p>
      <p>Table 3: Semantic features
the similarity between the tweet vector and the negative prototype vector
the similarity between the tweet vector and the positive prototype vector
the similarity between the tweet vector and the subjective prototype vector
the similarity between the tweet vector and the objective prototype vector
the element-wise minimum of the vector representations of the words in the tweet
the element-wise maximum of the vector representations of the words in the tweet</p>
      <p>5 Conclusions
In this paper, we presented our contribution to the ITAmoji task of the EVALITA 2018 campaign.</p>
      <p>We tried to model the data by extracting features based on keyword and micro-blogging characteristics, on a sentiment lexicon and, finally, on word embeddings. Apart from the characteristics of the different libraries available for machine learning purposes, the results show that, independently of the classifier, those features do not fit this problem. As future work, this analysis could be extended with an ablation study, which would allow us to understand whether there are noisy features.</p>
      <p>Table 4: Per-class precision, recall, F1 score and support on the test set
label                            precision  recall  f1-score  support
beaming face with smiling eyes       0.000   0.000     0.000     1028
blue heart                           0.500   0.002     0.004      444
face blowing a kiss                  0.500   0.002     0.005      506
face savoring food                   0.000   0.000     0.000      834
face screaming in fear               0.000   0.000     0.000      387
face with tears of joy               0.313   0.448     0.369     4966
flexed biceps                        0.000   0.000     0.000      417
grinning face                        0.000   0.000     0.000      885
grinning face with sweat             0.000   0.000     0.000      379
kiss mark                            0.000   0.000     0.000      279
loudly crying face                   0.000   0.000     0.000      373
red heart                            0.259   0.909     0.403     5069
rolling on the floor laughing        0.000   0.000     0.000      546
rose                                 0.125   0.004     0.007      265
smiling face with heart eyes         0.135   0.004     0.008     2363
smiling face with smiling eyes       0.167   0.000     0.002     1282
smiling face with sunglasses         0.000   0.000     0.000      700
sparkles                             0.000   0.000     0.000      266
sun                                  0.000   0.000     0.000      319
thinking face                        0.000   0.000     0.000      541
thumbs up                            0.000   0.000     0.000      642
top arrow                            0.000   0.000     0.000      347
two hearts                           0.000   0.000     0.000      341
winking face                         0.000   0.000     0.000     1338
winking face with tongue             0.000   0.000     0.000      483
avg / total                          0.164   0.274     0.156    25000</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Barbieri</surname>
          </string-name>
          , Miguel Ballesteros, and
          <string-name>
            <given-names>Horacio</given-names>
            <surname>Saggion</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Are emojis predictable?</article-title>
          .
          <source>arXiv preprint arXiv:1702.07285</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Barbieri</surname>
          </string-name>
          , Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and
          <string-name>
            <given-names>Horacio</given-names>
            <surname>Saggion</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SemEval 2018 Task 2: Multilingual emoji prediction</article-title>
          .
          <source>In Proceedings of The 12th International Workshop on Semantic Evaluation</source>
          , pages
          <fpage>24</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Pierpaolo</given-names>
            <surname>Basile</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nicole</given-names>
            <surname>Novielli</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>UNIBA at EVALITA 2014-SENTIPOLC task: Predicting tweet sentiment polarity combining micro-blogging, lexicon and semantic features</article-title>
          .
          <source>Proceedings of EVALITA</source>
          , pages
          <fpage>58</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Lars</given-names>
            <surname>Buitinck</surname>
          </string-name>
          , Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Alexandre Gramfort, Jaques Grobler, Robert Layton,
          <string-name>
            <given-names>Jake</given-names>
            <surname>VanderPlas</surname>
          </string-name>
          , Arnaud Joly, Brian Holt, and Gaël Varoquaux.
          <year>2013</year>
          .
          <article-title>API design for machine learning software: experiences from the scikit-learn project</article-title>
          .
          <source>In ECML PKDD Workshop: Languages for Data Mining and Machine Learning</source>
          , pages
          <fpage>108</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Cedric</given-names>
            <surname>De Boom</surname>
          </string-name>
          , Steven Van Canneyt, Thomas Demeester, and
          <string-name>
            <given-names>Bart</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Representation learning for very short texts using weighted word embedding aggregation</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>80</volume>
          :
          <fpage>150</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Rong-En</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kai-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cho-Jui</given-names>
            <surname>Hsieh</surname>
          </string-name>
          , Xiang-Rui Wang, and
          <string-name>
            <given-names>Chih-Jen</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          (Aug):
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Emanuele</given-names>
            <surname>Pianta</surname>
          </string-name>
          , Luisa Bentivogli, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Girardi</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Multiwordnet: developing an aligned multilingual database</article-title>
          .
          <source>First International Conference on Global WordNet (GWC), India</source>
          , January.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Ronzano</surname>
          </string-name>
          , Francesco Barbieri, Endang Wahyu Pamungkas, Viviana Patti, and
          <string-name>
            <given-names>Francesca</given-names>
            <surname>Chiusaroli</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 Italian Emoji Prediction (ITAMoji) Task</article-title>
          .
          <source>In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          ), Turin, Italy. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Paul</given-names>
            <surname>Smolensky</surname>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>Tensor product variable binding and the representation of symbolic structures in connectionist systems</article-title>
          .
          <source>Artificial intelligence</source>
          ,
          <volume>46</volume>
          (
          <issue>1- 2</issue>
          ):
          <fpage>159</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>