Lacam&Int@UNIBA at the EVALITA 2016-SENTIPOLC Task

Vito Vincenzo Covella, Stefano Ferilli, Berardina De Carolis, Domenico Redavid
Department of Computer Science, University of Bari "Aldo Moro", Italy
covinc93@gmail.com, stefano.ferilli@uniba.it, berardina.decarolis@uniba.it, domenico.redavid@uniba.it

Abstract

English. This paper describes our first experience of participation in the EVALITA challenge. We took part only in the SENTIPOLC Sentiment Polarity subtask and, to this purpose, we tested two systems, both developed for a generic Text Categorization task, in the context of sentiment analysis: SentimentWS and SentiPy. Both were developed according to the same pipeline, but using different feature sets and classification algorithms. The first system does not use any resource specifically developed for the sentiment analysis task. The second one, which had a slightly better performance in the polarity detection subtask, was enriched with an emoticon classifier in order to better fit the purpose of the challenge.

Italian. This paper describes our first experience of participation in EVALITA. Our team took part only in the Sentiment Polarity recognition subtask; in this context we tested two systems developed generically for Text Categorization, applying them to this specific task: SentimentWS and SentiPy. Both systems use the same pipeline but with different feature sets and classification algorithms. The first system does not use any resource specific to sentiment analysis, while the second, which ranked better while preserving its generality in text classification, was enriched with an emoticon classifier to try to make it better suited to the purpose of the challenge.

1 Introduction

We tested two systems to analyze Sentiment Polarity for Italian. They were designed and created as generic Text Categorization (TC) systems, without any specific feature or resource to support Sentiment Analysis. We have used them in various domains (movie reviews, opinions about public administration services, mood detection, Facebook posts, polarity expressed in the linguistic content of speech interaction, etc.).

Both systems were applied to the EVALITA 2016 SENTIPOLC Sentiment Polarity detection subtask (Barbieri et al., 2016) in order to understand whether, notwithstanding their "general-purpose" and context-independent setting, they were flexible enough to reach a good accuracy. If so, this would mean that the Sentiment Analysis task can be approached without creating special resources for this purpose, which is known to be a costly and critical activity, or that, if available, such resources may further improve performance.

We present here only the results of the constrained runs, in which only the provided training data were used to develop the systems.

The first system was entirely developed by the LACAM research group (all the classes used in the pipeline). After studying the effect of different combinations of features and algorithms on the automatic learning of sentiment polarity classifiers for Italian based on the EVALITA SENTIPOLC 2014 dataset, we applied the best one to the EVALITA 2016 training set in order to participate in the challenge.
The second system was developed using the scikit-learn (Pedregosa et al., 2011) and NLTK (Bird et al., 2009) libraries for building the pipeline; in order to optimize performance on the provided training set, classification algorithms and feature sets different from those used in SentimentWS were tested.

Even though both were initially conceived as generic TC systems, with the aim of tuning them for the SENTIPOLC task they also consider the emoticons present in the tweets. In the first system this was done by including them in the feature set, while in the second one emoticons were handled by building a dedicated classifier whose output influences the sentiment polarity prediction. The results obtained by the two systems are comparable, even if the second one shows a better overall accuracy and ranked higher than the first one in the challenge.

2 Systems Description

2.1 SentimentWS

In a previous work (Ferilli et al., 2015) we developed a system for Sentiment Analysis/Opinion Mining in Italian. It was called SentimentWS, since it was initially developed to run as a web service analyzing opinions coming from web-based applications. SentimentWS casts the Sentiment Classification problem as a TC task, where the categories represent the polarities. To be general and context-independent, it relies on supervised Machine Learning approaches. To learn a classifier, one must first choose which features to use to describe the documents and which learning method to exploit. An analysis of the state of the art suggested that no single approach can be considered the absolute winner, and that different approaches, based on different perspectives, may reach interesting results on different features. As regards the features, for the sake of flexibility, the system allows one to select different combinations of features to be used for learning the predictive models. As regards the approaches, our proposal is to select a set of approaches that are sufficiently complementary to mutually provide strengths and compensate for weaknesses.

As regards the internal representation of text, most NLP approaches and applications focus on the lexical/grammatical level as a good tradeoff between expressiveness and complexity, effectiveness and efficiency. Accordingly, we decided to take into account the following kinds of descriptors:

- single normalized words (ignoring dates, numbers and the like), which we believe convey most of the informational content of the text;
- abbreviations, acronyms, and colloquial expressions, especially those often found in informal texts such as blog posts and SMS messages;
- n-grams (groups of n consecutive terms) whose frequency of occurrence in the corpus is above a pre-defined threshold, which may sometimes be particularly meaningful;
- PoS tags, which are intuitively discriminant for subjectivity;
- expressive punctuation (dots, exclamation and question marks), which may be indicative of subjectivity and emotional involvement.

In order to test the system in the context of Sentiment Analysis, we added emoticons to the set of features to be considered, due to their direct and explicit relationship to emotions and moods.

As regards NLP pre-processing, we used TreeTagger (Schmid, 1994) for PoS-tagging and the Snowball suite (Porter, 2001) for stemming. All the selected features are collectively represented in a single vector space based on the real-valued Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme (Robertson, 2004). To map values into [0, 1] we use cosine normalization. To reduce the dimensionality of the vector space, Document Frequency selection (i.e., removing terms that do not pass a predefined frequency threshold) was used as a good tradeoff between simplicity and effectiveness.
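To make this representation concrete, the following is a minimal scikit-learn sketch of the same weighting scheme: TF-IDF weights with cosine (L2) normalization and a Document Frequency threshold. It is illustrative only: SentimentWS is an in-house implementation, and the toy documents and the min_df value here are ours, not the system's.

```python
# Minimal sketch of a SentimentWS-style document representation:
# TF-IDF weighting, cosine (L2) normalization into [0, 1], and a
# Document Frequency threshold for dimensionality reduction.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "che bel film , davvero emozionante !",   # toy pre-processed documents
    "film noioso , che delusione ...",
]

vectorizer = TfidfVectorizer(
    min_df=1,    # Document Frequency threshold (illustrative value)
    norm="l2",   # cosine normalization -> each document vector has unit length
)
X = vectorizer.fit_transform(docs)            # sparse document-term matrix
print(X.shape)
print(vectorizer.get_feature_names_out())
```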
To build the classification model we focused on two complementary approaches that have proved effective in the literature: a similarity-based one (Rocchio) and a probabilistic one (Naive Bayes). SentimentWS combines the two approaches in a committee, where each classifier (i = 1, 2) plays the role of a different domain expert that assigns a score s_ik to category c_k for each document to be classified. The final prediction is obtained as the class c = argmax_k S_k, where S_k = f(s_1k, s_2k). There is a wide range of options for the combination function f (Tulyakov et al., 2008). In our case we use a weighted sum, which requires that the values returned by the single approaches are comparable, i.e. that they refer to the same scale. This holds here: the Naive Bayes approach returns probability values and Rocchio's classifier returns similarity values, both in [0, 1].
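A small sketch of this combination rule follows, under the stated assumptions: two experts returning per-category scores in [0, 1], combined by a weighted sum. The equal weights mirror those used in the movie-review experiments of Section 3.1; the function and variable names are ours.

```python
# Sketch of the SentimentWS committee rule: c = argmax_k S_k with
# S_k = f(s_1k, s_2k), where f is a weighted sum of the Naive Bayes
# probability and the Rocchio similarity for category k.
import numpy as np

def combine(scores_nb, scores_rocchio, w_nb=0.5, w_rocchio=0.5):
    """Each argument is an array of per-category scores s_ik in [0, 1]."""
    s = w_nb * np.asarray(scores_nb) + w_rocchio * np.asarray(scores_rocchio)
    return int(np.argmax(s))   # index k of the predicted class c

# Toy example over the categories [negative, positive]:
print(combine([0.35, 0.65], [0.48, 0.52]))   # -> 1 (positive wins)
```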
2.2 SentiPy

SentiPy was developed using the scikit-learn and NLTK libraries for building the pipeline; in order to optimize performance on the provided training set, classification algorithms and feature sets different from those used in SentimentWS were tested. It uses a committee of two classifiers, one for the text component of the message and the other for the emoticons. For the first classifier we use a very simple set of features (any string made of at least two characters) and linear SVC as the classification algorithm. Even if this might seem too simple, we ran experiments testing other feature configurations based on i) lemmatization, ii) lemmatization followed by POS-tagging, iii) stemming, iv) stemming followed by POS-tagging. All of them were tested with and without removing Italian stopwords (taken from nltk.corpus.stopwords.words("italian")). We also tested other classification algorithms (Passive Aggressive Classifier, SGDClassifier, Multinomial Naive Bayes), but their performance was less accurate than that of linear SVC, which we therefore selected.

Before fitting the classifier, text preprocessing was performed according to the following steps, sketched in code after the list:

- Twitter "mentions" (identified by the character '@' followed by the username) and http links are removed;
- the retweet markers ("RT" and "rt") are removed;
- hashtags are "purged" by removing the character '#' preceding the string, which is then left unmodified;
- non-BMP UTF-8 characters (characters outside the Basic Multilingual Plane), usually used to encode special emoticons and emojis in tweets, are handled by replacing them with their hexadecimal encoding; this is done to avoid errors while reading the files.
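A possible rendering of these steps in Python is shown below. The regular expressions are plausible reconstructions, not the system's published code, and the preprocess name is ours.

```python
# Sketch of the SentiPy text preprocessing steps listed above.
import re

def preprocess(tweet: str) -> str:
    t = re.sub(r"@\w+", " ", tweet)             # remove Twitter mentions
    t = re.sub(r"http\S+", " ", t)              # remove http links
    t = re.sub(r"\brt\b", " ", t, flags=re.I)   # remove retweet markers RT/rt
    t = t.replace("#", "")                      # purge '#', keep the hashtag text
    # replace non-BMP characters (e.g. many emojis) with a hexadecimal encoding
    t = "".join(c if ord(c) <= 0xFFFF else f"\\U{ord(c):08x}" for c in t)
    return re.sub(r"\s+", " ", t).strip()

print(preprocess("RT @user Grande film! 😀 #cinema http://t.co/abc"))
# -> Grande film! \U0001f600 cinema
```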
After running the aforementioned experiments using the training and test sets provided by Sentipolc 2014, which were also used to fine-tune the parameters of the LinearSVC algorithm, we compared the two most successful approaches: tokenization done using nltk.tokenize.TweetTokenizer followed by stemming, and feature extraction simply done using the default tokenizer provided by scikit-learn (it tokenizes the string by extracting words of at least 2 letters). The best configurations are those shown in Table 1 and Table 2.

| Parameter | Value |
|---|---|
| Tokenization | Scikit-Learn default tokenizer |
| Maximum document frequency (CountVectorizer parameter) | 0.5 |
| Maximum number of terms for the vocabulary | unlimited |
| n-grams | unigrams and bigrams |
| Term weights | tf-idf |
| Vector normalization | l2 |
| fit_intercept classifier parameter | False |
| dual classifier parameter | True |
| Number of iterations over training data | 1000 |
| Class balancing | automatic |

Table 1: SentiPy - positive vs all best configuration based on Sentipolc 2014.

| Parameter | Value |
|---|---|
| Tokenization | Scikit-Learn default tokenizer |
| Maximum document frequency (CountVectorizer parameter) | 0.5 |
| Maximum number of terms for the vocabulary | unlimited |
| n-grams | unigrams and bigrams |
| Term weights | tf-idf |
| Vector normalization | l2 |
| fit_intercept classifier parameter | True |
| dual classifier parameter | True |
| Number of iterations over training data | 1000 |
| Class balancing | automatic |

Table 2: SentiPy - negative vs all best configuration based on Sentipolc 2014.

As far as emoticons and emojis are concerned, in this system, unlike the solution adopted in SentimentWS, we decided to exclude them from the feature set and to train a separate classifier according to the valence with which the tweet was labeled. This approach may be useful to detect irony or to recognize valence in particular domains in which emoticons are used with a different meaning. Emoticons and emojis were retrieved using a dictionary of strings and some regular expressions. The retrieved emoticons and emojis are replaced with identifiers, removing all other terms not related to emoticons, thus obtaining an emoticons-classes matrix. The underlying classifier takes this matrix as input and creates the model that will be used in the classification phase. The algorithm used in the experiments is Multinomial Naive Bayes.

The committee of classifiers was built using the VotingClassifier class provided by the Scikit-Learn framework. The chosen voting technique is so-called "hard voting", based on the majority voting rule; in case of a tie, the classifier selects the class based on ascending sort order (e.g., if classifier 1 votes class 2 and classifier 2 votes class 1, class 1 is selected).
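The committee can be reproduced in outline as follows. The LinearSVC settings come from Tables 1 and 2; the emoticon extraction (a simple token pattern here) is a stand-in for the system's unpublished dictionary and regular expressions, and the toy data are ours.

```python
# Sketch of the SentiPy committee: a text pipeline (TF-IDF + LinearSVC)
# and an emoticon pipeline (emoticon tokens + Multinomial Naive Bayes),
# combined with hard (majority) voting.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

text_clf = make_pipeline(
    TfidfVectorizer(max_df=0.5,          # maximum document frequency (Tables 1-2)
                    ngram_range=(1, 2),  # unigrams and bigrams
                    norm="l2"),          # l2 vector normalization
    LinearSVC(fit_intercept=False, dual=True, max_iter=1000,
              class_weight="balanced"),  # automatic class balancing
)

emoticon_clf = make_pipeline(
    # keep only emoticon-like tokens; the real system uses a dictionary of
    # strings and regular expressions that also cover emojis
    CountVectorizer(token_pattern=r"[:;=8][-o*']?[)(\[\]dDpP/\\]"),
    MultinomialNB(),
)

committee = VotingClassifier(
    estimators=[("text", text_clf), ("emoticons", emoticon_clf)],
    voting="hard",                       # majority voting rule
)

tweets = ["bel film :)", "che noia :(", "fantastico :)", "pessimo :("]
labels = [1, 0, 1, 0]                    # toy polarity labels
committee.fit(tweets, labels)
print(committee.predict(["film fantastico :)"]))   # expected: [1]
```

Note that with two voters, scikit-learn's hard voting resolves ties in favor of the lower class index, which matches the tie-breaking rule described above.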
These two configurations, which share the same fine-tuned LinearSVC parameters, were compared by observing the evaluation data obtained by testing them on the Sentipolc 2016 training set with a standard technique, 10-fold cross-validation, whose results are shown in Table 3. The obtained results were comparable; we therefore selected the configuration shown in the first two rows of Table 3, combined with the emoticon classifier, since the latter was not present in the SentimentWS system.

| Configuration | F1-score macro averaging | Accuracy |
|---|---|---|
| VotingClassifier, default tokenization - positive vs all | 0.70388 | 0.77383 |
| VotingClassifier, default tokenization - negative vs all | 0.70162 | 0.70648 |
| VotingClassifier, stemming - positive vs all | 0.70654 | 0.75424 |
| VotingClassifier, stemming - negative vs all | 0.6952 | 0.70351 |

Table 3: 10-fold cross-validation on the Sentipolc 2016 training set.

3 Experiments

Both systems were tested on other domains before applying them to the SENTIPOLC subtask.

In the results tables, for each class (positive and negative), 0 represents the value "False" used in the dataset annotations for the specific tweet and class, and 1 represents "True", following the Sentipolc 2016 task guidelines. Thus the cell identified by the row "positive" and the column "prec. 0" shows the precision for the tweets whose positive polarity annotation is set to False. The meaning of the other cells can be derived analogously.

3.1 SentimentWS Results

SentimentWS was initially tested on a dataset of 2000 reviews in Italian, concerning 558 movies, taken from http://filmup.leonardo.it/. In this case, classification performance was evaluated on 17 different feature settings using a 5-fold cross-validation procedure. Equal weight was assigned to all classifiers in the SentimentWS committee. The overall accuracy reported in (Ferilli et al., 2015) was always above 81%. When Rocchio outperformed Naive Bayes, the accuracy of the committee was greater than that of its components; in the other cases, corresponding to settings that used n-grams, Naive Bayes alone was the winner.

Before tackling the EVALITA 2016 SENTIPOLC task, in order to tune the system in a (hopefully) similar environment, we tested it on the EVALITA 2014 dataset and thereby determined the combination of features with the best accuracy on that dataset. We tested the system using a subset of ~900 tweets (taken from the dataset provided in Sentipolc 2014) in order to find the best configuration of parameters, which turned out to be the following:

- term normalization: lemmatization;
- minimum number of occurrences for a term to be considered: 3;
- POS tags used: NOUN, WH, CLI, ADV, NEG, CON, CHE, DET, NPR, PRE, ART, INT, ADJ, VER, PRO, AUX;
- n-grams: unigrams.

With the configuration described above, SentimentWS classified the whole Sentipolc 2014 test set (1935 tweets), obtaining a combined F-score of 0.6285. The same best configuration was also used in one of the two runs submitted to Sentipolc 2016, obtaining a combined F-score of 0.6037, as shown in Table 4.

| class | prec. 0 | rec. 0 | F-sc. 0 | prec. 1 | rec. 1 | F-sc. 1 | F-sc |
|---|---|---|---|---|---|---|---|
| positive | 0.8642 | 0.7646 | 0.8113 | 0.2841 | 0.4375 | 0.3445 | 0.5779 |
| negative | 0.7087 | 0.7455 | 0.7266 | 0.5567 | 0.5104 | 0.5325 | 0.6296 |

Table 4: SentimentWS - Sentipolc 2016 test set - combined F-score = 0.6037.

3.2 SentiPy Results

With the configuration discussed above, SentiPy obtained a combined F-score of 0.6281, as shown in Table 5.

| class | prec. 0 | rec. 0 | F-sc. 0 | prec. 1 | rec. 1 | F-sc. 1 | F-sc |
|---|---|---|---|---|---|---|---|
| positive | 0.8792 | 0.7992 | 0.8373 | 0.3406 | 0.4858 | 0.4005 | 0.6189 |
| negative | 0.7001 | 0.8577 | 0.7709 | 0.6450 | 0.4130 | 0.5036 | 0.6372 |

Table 5: SentiPy@Sentipolc2016 results (LinearSVC fine-tuned + EmojiCustomClassifier) - combined F-score = 0.6281.

We ran further experiments on the Sentipolc 2016 test set after the EVALITA deadline. Their results, even if unofficial, show significant improvements: we obtained a combined F-score of 0.6403. We achieved it by making specific changes to the positive vs all classifier: we used lemmatization (without stopword removal), unigrams only (no other n-grams), and set the fit_intercept parameter of the LinearSVC algorithm to True. The other parameters remained unchanged, and no changes were made to the negative vs all classifier.

4 Conclusions

Looking at the results of the Sentiment Polarity detection subtask, we were surprised by the overall performance of the systems presented in this paper, since they are simply Text Categorization systems. The only additions to the original systems, made to tune their performance on the sentiment polarity detection task, concerned emoticons: in SentimentWS these were included in the feature set, while SentiPy was enriched with a classifier created specifically for handling emoticons.

Besides the experiments executed on the SENTIPOLC dataset, we tested both systems on a dataset of Facebook posts in Italian collected and annotated by a group of researchers in our laboratories. This experiment was important to understand whether their performance was comparable to that obtained in the SENTIPOLC challenge. The results were encouraging, since both systems achieved a combined F-score higher than 0.8.

We are currently working on improving the performance of the systems by tuning them to the Sentiment Analysis context. To this aim we are developing a specific module to handle negation in Italian. In future work we plan to integrate the two systems by creating a single committee including all the classifiers; moreover, we plan to include an approach based on a combination of probabilistic and lexicon-based methods (De Carolis et al., 2015).

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

B. De Carolis, D. Redavid, and A. Bruno. 2015. A Sentiment Polarity Analyser based on a Lexical-Probabilistic Approach. In Proceedings of IT@LIA2015, 1st AI*IA Workshop on Intelligent Techniques At LIbraries and Archives, co-located with the XIV Conference of the Italian Association for Artificial Intelligence (AI*IA 2015).

Stefano Ferilli, Berardina De Carolis, Floriana Esposito, and Domenico Redavid. 2015. Sentiment analysis as a text categorization task: A study on feature and algorithm selection for Italian language. In Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA).

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.

M. F. Porter. 2001. Snowball: A language for stemming algorithms. [Online]. http://snowball.tartarus.org/texts/introduction.html

Stephen Robertson. 2004. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503-520.

H. Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pp. 44-49.

S. Tulyakov, S. Jaeger, V. Govindaraju, and D. Doermann. 2008. Review of classifier combination methods. In Studies in Computational Intelligence (SCI), vol. 90, pp. 361-386. Springer.