UNIBA - Integrating distributional semantics features in a supervised approach for detecting irony in Italian tweets

Pierpaolo Basile and Giovanni Semeraro
Department of Computer Science, University of Bari Aldo Moro
Via E. Orabona, 4 - 70125 Bari (Italy)
{pierpaolo.basile,giovanni.semeraro}@uniba.it

Abstract

English. This paper describes the UNIBA team participation in the IronITA 2018 task at EVALITA 2018. We propose a supervised approach based on LIBLINEAR that relies on keyword, polarity and micro-blogging features, and on the representation of tweets in a distributional semantic model. Our system ranked 3rd and 4th in the irony detection subtask. We participated only in the constrained run, exploiting the training data provided by the task organizers.

Italiano. This paper describes the participation of the UNIBA team in the IronITA 2018 task organized at EVALITA 2018. We propose a supervised approach based on LIBLINEAR that exploits keywords, polarity, attributes typical of micro-blogs, and the representation of tweets in a distributional semantic space. Our system ranked third and fourth in the irony detection subtask. We participated only in the constrained run, using the training data provided by the task organizers.

1 Introduction

Irony is defined as "the use of words that say the opposite of what you really mean, often as a joke and with a tone of voice that shows this" (Oxford Learner's Dictionary). This suggests that, when we analyze written text to detect irony, we should focus our attention on those words that are used in an unconventional context.
For example, given the tweet "S&P ha declassato Mario Monti da Premier a Badante #declassaggi" (in English: "S&P has downgraded Mario Monti from Premier to Caregiver"), we can observe that the word "badante" (caregiver) is used in an unconventional context, since "caregiver" usually does not co-occur with words such as "Premier" or "Mario Monti".

Following this idea, in our work we introduce a feature able to detect words used out of their usual context. Moreover, we integrate further features based on keywords, bigrams, trigrams, polarity and micro-blogging attributes, as reported in (Basile and Novielli, 2014). Our idea is supported by the best systems participating in SemEval-2018 Task 3 - Irony detection in English tweets (Van Hee et al., 2018), where the best systems not based on deep learning exploit features based on polarity contrast information and context incongruity.

We evaluate our approach in the context of the IronITA task at EVALITA 2018 (Cignarella et al., 2018). The goal of the task is to predict irony in Italian tweets. The task is organized in two subtasks: 1) irony detection and 2) different types of irony. In the second subtask, participants must identify whether the irony is sarcasm or not. In this paper, we propose an approach that detects the presence of irony without taking into account the different types of irony. We evaluate the approach in a constrained setting, using only the data provided by the task organizers. The only external resources exploited in our approach are a polarity lexicon and a collection of about 40M tweets randomly extracted from TWITA (Basile and Nissim, 2013), a collection of about 800M Italian tweets.

The paper is structured as follows: Section 2 describes our system, while evaluation and results are reported in Section 3. Final remarks are provided in Section 4.

2 System Description

Our approach adopts a supervised classifier based on LIBLINEAR (Fan et al., 2008); in particular, we use the L2-regularized L2-loss linear SVM. Each tweet is represented using several sets of features:

keyword-based: keyword-based features exploit the tokens occurring in the tweets. Unigrams, bigrams and trigrams are considered. During tokenization we replace user mentions and URLs with two metatokens: "USER" and "URL";

microblogging: microblogging features take into account some attributes of the tweets that are peculiar to the context of microblogging. We exploit the following features: the presence of emoticons, character repetitions (which usually play the same role as intensifiers in informal writing), informal expressions of laughter (i.e., sequences of "ah"), and the presence of exclamation and question marks. All microblogging features are binary;

polarity: this block contains features extracted from the SentiWordNet lexicon (Esuli and Sebastiani, 2006). We translate SentiWordNet into Italian through MultiWordNet (Pianta et al., 2002). It is important to underline that SentiWordNet is a synset-based lexicon, while our Italian translation is a word-based lexicon. In order to automatically derive our Italian sentiment lexicon from SentiWordNet, we perform three steps. First, we map the synset offsets in SentiWordNet from version 3.0 to 1.6 (since MultiWordNet is based on WordNet 1.6) using an automatically generated mapping file. Then, we transfer the prior polarities of SentiWordNet to the Italian lemmata. Finally, we expand the lexicon using Morph-it! (Zanchetta and Baroni, 2005), a lexicon of inflected forms with their lemma and morphological features: we extend the polarity scores of each lemma to its inflected forms. Details about the creation of the sentiment lexicon are reported in (Basile and Novielli, 2014). The obtained Italian translation of SentiWordNet is used to compute three features based on the prior polarity of the words in the tweet: 1) the maximum positive polarity; 2) the maximum negative polarity; 3) the polarity variation: each token occurring in the tweet is assigned a tag, according to the highest polarity score of the token in the Italian lexicon, with values in the set {OBJ, POS, NEG}; the polarity variation counts how many switches from POS to NEG, or vice versa, occur in the tweet (a sketch of this feature is given after this list);

distributional semantics: we compute two kinds of distributional semantics features: 1) given a set of unlabelled downloaded tweets, we build a geometric space in which each word is represented as a mathematical point, and the similarity between words is computed as their closeness in the space; to represent a tweet in the geometric space, we adopt the superposition operator (Smolensky, 1990), that is, the vector sum of all the vectors of the words occurring in the tweet, and we use the tweet vector t as a semantic feature in training our classifiers; 2) we extract three features that take into account the usage of words in an unconventional context: for each word w_i we compute a score ac_i that measures how far the word is from its conventional context, and we use the average, the maximum and the minimum of all the ac_i scores as features. More details about the computation of the ac_i score are reported in Subsection 2.1.
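The polarity-variation feature can be made concrete with a short sketch. The following Python fragment only illustrates the counting described above, not the authors' implementation: the lexicon format, the tie-breaking rule, and the decision to skip OBJ tokens when counting switches are our assumptions.

```python
# Hedged sketch of the polarity-variation feature. `lexicon` is assumed to
# map a token to a (positive, negative) prior-polarity pair; tokens missing
# from the lexicon are treated as objective (OBJ). Both assumptions go
# beyond what the paper specifies.

def polarity_tag(token, lexicon):
    """Assign OBJ/POS/NEG according to the highest polarity score."""
    pos, neg = lexicon.get(token, (0.0, 0.0))
    if pos == 0.0 and neg == 0.0:
        return "OBJ"
    return "POS" if pos >= neg else "NEG"  # tie-breaking is an assumption

def polarity_variation(tokens, lexicon):
    """Count switches from POS to NEG, or vice versa, within the tweet."""
    tags = [polarity_tag(t, lexicon) for t in tokens]
    polar = [t for t in tags if t != "OBJ"]  # OBJ handling is an assumption
    return sum(1 for a, b in zip(polar, polar[1:]) if a != b)

# Toy example: one POS -> NEG switch.
toy_lexicon = {"buono": (0.8, 0.0), "terribile": (0.0, 0.9)}
print(polarity_variation(["che", "buono", "ma", "terribile"], toy_lexicon))  # 1
```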
2.1 Distributional Semantics Features

The distributional semantics model is built on a collection of tweets. We randomly extract 40M tweets from TWITA and build a semantic space based on the Random Indexing (RI) technique (Sahlgren, 2005), using a context window equal to 2. Moreover, we consider only words occurring more than ten times; we call this set of words the vocabulary. The context window is dynamic: it does not take into account words that are not in the vocabulary. Our vocabulary contains 105,543 terms.

The mathematical insight behind RI is the projection of a high-dimensional space onto a lower-dimensional one using a random matrix; this kind of projection does not compromise distance metrics (Dasgupta and Gupta, 1999). Formally, given an n × m matrix A and an m × k matrix R, which contains random vectors, we define a new n × k matrix B as:

    A_{n,m} · R_{m,k} = B_{n,k},  with k << m    (1)

The new matrix B has the property of preserving the distances between points: if the distance between any two points in A is d, then the distance d_r between the corresponding points in B satisfies d_r ≈ c × d. A proof is given by the Johnson-Lindenstrauss lemma (Dasgupta and Gupta, 1999).

Specifically, RI creates the WordSpace in two steps:

1. A context vector is assigned to each word. This vector is sparse, high-dimensional and ternary, which means that its elements can take values in {-1, 0, 1}. A context vector contains a small number of randomly distributed non-zero elements, and the structure of this vector follows the hypothesis behind the concept of Random Projection;

2. Context vectors are accumulated by analyzing co-occurring words. In particular, the semantic vector of any word is computed as the sum of the context vectors of the words that co-occur with it.

Formally, given a corpus C of n documents and a vocabulary V of m words extracted from C, we perform two steps: 1) assign a context vector c_i to each word in V; 2) compute a semantic vector sv_i for each word w_i as the sum of all the context vectors assigned to the words co-occurring with w_i. The context is the set of words that precede and follow w_i within the window.

For example, consider the following tweet: "siete il buono della scuola fatelo capire". In the first step, we assign a random context vector to each term, as follows:

c_siete  = (−1, 0, 0, −1, 0, 0, 0, 0, 0, 0)
c_buono  = (0, 0, 0, −1, 0, 0, 0, 1, 0, 0)
c_scuola = (0, 0, 0, 0, −1, 0, 0, 0, 1, 0)
c_fatelo = (0, 1, 0, 0, 0, −1, 0, 0, 0, 0)
c_capire = (−1, 0, 0, 0, 0, 0, 0, 0, 0, 1)

In the second step, we build a semantic vector for each term by accumulating the random vectors of its co-occurring words. For example, fixing the window size to 2, the semantic vector of the word scuola is the sum of the random vectors of siete, buono, fatelo and capire (the words il and della are skipped, since they are not in the vocabulary). Summing these vectors, the semantic vector for scuola results in (−2, 1, 0, −2, 0, −1, 0, 1, 0, 1). This operation is repeated for all the sentences in the corpus and for all the words in V. In this example we used very small vectors, but in a real scenario the vector dimension ranges from hundreds to thousands of dimensions. In particular, in our experiments we use a vector dimension equal to 200 with 10 non-zero elements.

In order to compute the ac_i score of a word w_i in a tweet, we build a context vector c_{w_i} as the sum of the random vectors assigned to the words that co-occur with w_i in the tweet. Then we compute the cosine similarity between c_{w_i} and the semantic vector sv_i assigned to w_i. The idea is to measure how dissimilar the semantic vector is from the context vector: if the word w_i has never appeared in the context under analysis, its semantic vector does not contain the random vectors of the words in that context, which results in a low cosine similarity. Finally, the divergence from the context is computed as:

    ac_i = 1 − cosSim(c_{w_i}, sv_i)
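The construction above translates almost directly into code. The following Python sketch illustrates, under simplified assumptions, how the WordSpace is built and how the ac_i divergence scores of a tweet are computed; it is not the authors' implementation. In particular, treating all other in-vocabulary tokens of the tweet as co-occurring with w_i when building c_{w_i} is our assumption, since the paper leaves the exact window for this step implicit.

```python
import numpy as np

# Illustrative sketch of Random Indexing and of the ac_i divergence score.
# The paper uses 200-dimensional vectors with 10 non-zero elements and a
# dynamic context window of size 2; other details are our assumptions.
DIM, NONZERO, WINDOW = 200, 10, 2
rng = np.random.default_rng(42)

def random_context_vector():
    """Sparse ternary vector: NONZERO randomly placed values in {-1, +1}."""
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=NONZERO, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def build_wordspace(corpus, vocabulary):
    """Step 1: assign context vectors; step 2: accumulate semantic vectors."""
    context = {w: random_context_vector() for w in vocabulary}
    semantic = {w: np.zeros(DIM) for w in vocabulary}
    for sentence in corpus:
        # Dynamic window: out-of-vocabulary tokens are skipped entirely.
        tokens = [t for t in sentence if t in vocabulary]
        for i, w in enumerate(tokens):
            neighbours = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
            for n in neighbours:
                semantic[w] += context[n]
    return context, semantic

def cos_sim(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

def divergence_scores(tweet, context, semantic):
    """ac_i = 1 - cosSim(c_wi, sv_i) for each in-vocabulary word of a tweet."""
    tokens = [t for t in tweet if t in semantic]
    scores = []
    for i, w in enumerate(tokens):
        # Assumption: every other in-vocabulary token of the tweet co-occurs
        # with w when building its in-tweet context vector c_wi.
        c_wi = sum((context[n] for j, n in enumerate(tokens) if j != i),
                   np.zeros(DIM))
        scores.append(1.0 - cos_sim(c_wi, semantic[w]))
    return scores  # the classifier uses the average, maximum and minimum
```

The tweet vector used as a feature in UNIBA1 is obtained analogously, by superposition: the sum of the semantic vectors of the words occurring in the tweet.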
3 Evaluation

We perform the evaluation using the data provided by the task organizers. The training set contains 3,977 tweets, while the test set consists of 872 tweets. The only parameter to set in LIBLINEAR is C (the cost): after a 5-fold cross-validation on the training set, we set C=1 (a sketch of this model-selection step is shown below).
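As an illustration of the model-selection step, the following is a minimal sketch that tunes C by 5-fold cross-validation. It uses scikit-learn's LinearSVC, which wraps LIBLINEAR and defaults to the L2-regularized, L2-loss (squared hinge) linear SVM, as a stand-in for the authors' setup; the feature matrix, the labels, and the grid of candidate C values are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Hedged sketch: select the cost parameter C via 5-fold cross-validation.
# X (tweets x features) and y (1 = ironic, 0 = non-ironic) stand in for the
# feature sets of Section 2; here they are random placeholders.
X = np.random.rand(3977, 1000)
y = np.random.randint(0, 2, size=3977)

svm = LinearSVC(penalty="l2", loss="squared_hinge")  # LIBLINEAR solver
search = GridSearchCV(svm, {"C": [0.01, 0.1, 1, 10, 100]},
                      scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_)  # the paper reports C=1 as the selected value
```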
We submit two runs: UNIBA1 includes the semantic vector representing the tweet as a feature, while UNIBA2 does not include this vector. Nevertheless, the divergence features are included in both runs.

Official results are reported in Table 1. Our runs rank third and fourth in the final ranking; our team is classified as second, since the first two runs in the ranking belong to team1. We can notice that the runs are very close in the ranking. The last run is ranked below the random baseline, while no system is ranked below the baseline-mfc baseline, which assigns the most frequent class (non-ironic).

team          prec.        rec.         F1           prec.     rec.      F1        avg
              (non-ironic) (non-ironic) (non-ironic) (ironic)  (ironic)  (ironic)  F1
team1         0.785        0.643        0.707        0.696     0.823     0.754     0.731
team1         0.751        0.643        0.693        0.687     0.786     0.733     0.713
UNIBA1        0.748        0.638        0.689        0.683     0.784     0.730     0.710
UNIBA2        0.748        0.638        0.689        0.683     0.784     0.730     0.710
team3         0.700        0.716        0.708        0.708     0.692     0.700     0.704
team6         0.600        0.714        0.652        0.645     0.522     0.577     0.614
random        0.506        0.501        0.503        0.503     0.508     0.506     0.505
team7         0.505        0.892        0.645        0.525     0.120     0.195     0.420
baseline-mfc  0.501        1.000        0.668        0.000     0.000     0.000     0.334

Table 1: Task results.

Results show that our system is not able to improve its performance by exploiting the distributional representation of the tweets, since the two runs report the same average F1-score. We performed further experiments in order to understand the contribution of each feature; some relevant outcomes are reported in Table 2.

run   features                               no-iro-F  iro-F   avg-F
run1  all                                    0.6888    0.7301  0.7095
run2  no DSM                                 0.6888    0.7301  0.7095
1     keyword                                0.6738    0.6969  0.6853
2     keyword, bigrams                       0.6916    0.7219  0.7067
3     keyword, bigrams, trigrams             0.6992    0.7343  0.7168
4     keyword, bigrams, trigrams, blog       0.7000    0.7337  0.7168
5     keyword, bigrams, trigrams, polarity   0.6906    0.7329  0.7117
6     keyword, bigrams, trigrams, context    0.6937    0.7325  0.7131
7     only DSM                               0.6166    0.6830  0.6406
8     only context                           0.4993    0.5587  0.5290

Table 2: Task results obtained by combining different types of features.

In particular:

• keyword-based features achieve the best performance; in particular, bigrams and trigrams contribute to improving the performance (runs 1-3);

• DSM features introduce some kind of noise when combined with the other features; in fact, runs 4, 5 and 6 achieve good performance without DSM;

• DSM alone, without any other kind of feature, achieves remarkable results; it is important to notice that in this run (run 7) only the tweet vector is used as a feature;

• blog, polarity and context features do not contribute to the overall system performance; however, we can observe that using only the context features (only three features for each tweet, run 8) we overcome both the baselines.

Analyzing the results, we can conclude that a more effective way of combining distributional with non-distributional features is needed. As future work, we plan to investigate the combination of two different kernels for the distributional and the keyword-based features.

4 Conclusions

We propose a supervised system for detecting irony in Italian tweets. The proposed system exploits different kinds of features: keyword-based features, microblogging features, polarity features, distributional semantics features, and a score that measures how far a word is from its conventional context. The divergence of a word from its conventional context is computed by exploiting the distributional semantics model built through Random Indexing.

Results prove that our system is able to achieve good performance, ranking third in the official ranking. However, a deeper study of different combinations of features shows that keyword-based features alone are able to achieve the best result, while distributional features introduce noise during training. This outcome suggests the need for a different strategy for combining distributional and non-distributional features.

References

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proc. of WASSA 2013, pages 100-107.

Pierpaolo Basile and Nicole Novielli. 2014. UNIBA at EVALITA 2014 SENTIPOLC task: Predicting tweet sentiment polarity combining micro-blogging, lexicon and semantic features. In Proc. of EVALITA 2014, pages 58-63, Pisa, Italy.

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2018. Overview of the EVALITA 2018 task on irony detection in Italian tweets (IronITA). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Sanjoy Dasgupta and Anupam Gupta. 1999. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report TR-99-006, International Computer Science Institute, Berkeley, California, USA.

Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In Proc. of LREC, pages 417-422.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871-1874.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: Developing an aligned multilingual database. In Proc. of the 1st Intl. Conf. on Global WordNet, pages 293-302.

Magnus Sahlgren. 2005. An introduction to Random Indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE), volume 5.

Paul Smolensky. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159-216, November.

Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. SemEval-2018 Task 3: Irony detection in English tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 39-50.
Eros Zanchetta and Marco Baroni. 2005. Morph-it!: A free corpus-based morphological resource for the Italian language. In Proc. of the Corpus Linguistics Conf. 2005.