
1613-0073

Classifier Ensembles That Push the State-of-the-Art in Sentiment Analysis of Spanish Tweets

Jhon Adrián Cerón-Guzmán
Santiago de Cali, Valle del Cauca, Colombia
jadrian.ceron@gmail.com

59 64

This paper describes the JACERONG system submitted to TASS-2017 Task 1. For this benchmark evaluation, two ensemble methods that are widely used because of their proven ability to increase prediction accuracy were implemented, namely averaging and stacking. First, (relatively) highly accurate classifiers use supervised learning algorithms to predict a class label or probability estimates. Then, the predictions from these classifiers are optimally combined in order to obtain a better final prediction. Finally, how to choose which classifiers constitute an ensemble was also an important issue addressed in this work. Experimental results show that the proposed system is top-ranked on the test set of the InterTASS corpus according to the accuracy metric. In addition, the predictive performance on the whole test set of the General Corpus of TASS outperforms the best result achieved in the four-label evaluation of the previous edition of TASS in terms of the Macro-F1 metric.

Nowadays, 'tweeting' has become the activity par excellence for saying what one thinks or feels. Thus, the large amount of user-generated content on Twitter, in the form of short texts limited to 140 characters that are known as tweets, has led to the development of new methods to explore human subjectivity at large scale. Sentiment analysis, as one of these methods is known, has been widely used to gauge public opinion regarding important issues of people's everyday life, society, and the world in general, e.g. a political election (Cerón-Guzmán and León-Guzmán, 2016b); it also benefits from the exponential growth of the computational capacity to process such a large volume of information.

TASS is a workshop aimed at fostering research on sentiment analysis of Spanish tweets, which provides a benchmark evaluation [...] field (Martínez-Cámara et al., 2017). One of the proposed tasks is to determine the opinion orientation expressed at the tweet level. Task 1 consists of assigning one of four labels (P, NEU, N, NONE) to a given tweet. Here, P, N, and NEU stand for positive, negative, and neutral, respectively; NONE, instead, means no sentiment.

This paper describes the JACERONG system submitted to TASS-2017 Task 1. For this sixth edition, classifier ensembles based on stacking were developed in addition to the averaging-based ones presented in the previous edition (Cerón-Guzmán, 2016), which were carried over with several improvements.

Regarding the ensembles, they are constituted by (relatively) highly accurate classifiers that use Logistic Regression and Support Vector Machines as the supervised learning algorithms to predict a class label or probability estimates. Then, the predictions from these classifiers are optimally combined in order to obtain a better final prediction. Finally, how to choose which classifiers constitute an ensemble was also an important issue addressed in this work.

The remainder of this paper is organized as follows. Section 2 explains the system architecture. Next, the submitted runs and the obtained results are discussed in Section 3. Lastly, Section 4 concludes the paper.

2 The System Architecture

The system architecture can be viewed as a pipeline consisting of several pre-processing modules, a vectorizer that transforms a text into a feature vector, machine learning classifiers, and an ensemble combiner that takes level-one predictions and then optimally combines them to obtain a better final prediction. Figure 1 illustrates the system architecture. In addition, the code of the system is publicly available to enable reproducibility.¹

2.1 Pre-processing

2.1.1 Text Normalizer

This rule-based normalizer applies the following rules:

URLs and emails are removed.

HTML entities are mapped to their textual representation (e.g., "&lt;" → "<").

Specific Twitter terms such as mentions (@user) and hashtags (#topic) are replaced by placeholders.

Unknown characters are mapped to their closest ASCII variant, using the Python Unidecode module for the mapping.

Consecutive repetitions of the same character are reduced to one occurrence.

Emoticons are recognized and then classified as positive or negative, according to the sentiment they convey (e.g., ":)" → "EMO_POS", ":(" → "EMO_NEG").

Punctuation marks are unified (Vilares, Alonso, and Gómez-Rodríguez, 2015).
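The rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the system's actual code: the regular expressions, the placeholder strings "MENTION" and "HASHTAG", and the order of the rules are assumptions, and the standard-library `unicodedata` module stands in for Unidecode so that the example has no external dependencies. Punctuation unification is omitted.

```python
import html
import re
import unicodedata

# All patterns below are illustrative assumptions, not the system's own rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
POS_EMO_RE = re.compile(r"[:;=]-?[)D]")
NEG_EMO_RE = re.compile(r":'?-?\(")


def to_ascii(text):
    """Drop diacritics; a stdlib stand-in for the Unidecode module."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize(text):
    text = URL_RE.sub(" ", text)            # remove URLs
    text = EMAIL_RE.sub(" ", text)          # remove emails
    text = html.unescape(text)              # "&lt;" -> "<"
    text = to_ascii(text)                   # closest ASCII variant
    text = re.sub(r"(.)\1+", r"\1", text)   # collapse repeated characters
    text = POS_EMO_RE.sub(" EMO_POS ", text)
    text = NEG_EMO_RE.sub(" EMO_NEG ", text)
    text = MENTION_RE.sub("MENTION", text)  # hypothetical placeholder
    text = HASHTAG_RE.sub("HASHTAG", text)  # hypothetical placeholder
    return " ".join(text.split())


print(normalize("@ana holaaa :) http://t.co/x #feliz"))
```

Note that applying the repetition rule literally also collapses legitimate double letters and digits; in the described pipeline, the spell checker of the next step is the component that restores standard word forms.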

2.1.2 Spell Checker

An open-source spell checker for Spanish texts, Normalesp, is used to normalize non-standard word forms, i.e. out-of-vocabulary (OOV) words, to their standard lexical form (Cerón-Guzmán and León-Guzmán, 2016a).² Normalesp suggests normalization candidates that are identical or similar to the graphemes or phonemes that make up an OOV word and, using contextual information, selects the best normalization candidate.

¹ https://github.com/jacerong/TASS-2017
² https://github.com/jacerong/normalesp

2.1.3 Negation Detector

Inspired by the approach of Pang, Lee, and Vaithyanathan (2002), a negated context is defined as a segment of the text that starts with a negation word and ends with a punctuation mark (i.e., "!", ",", ":", "?", ".", ";"). Within a negated context, only the first $n \in [0, 3]$ tokens, or all tokens, labeled with any or a specific POS tag (i.e., verb, adjective, adverb, and common noun) are affected, by appending the "_NEG" suffix to them; note that when $n = 0$, no token is affected. The negation detector uses FreeLing (Padró and Stanilovsky, 2012) to tokenize the text and assign Part-of-Speech (POS) tags to the resulting tokens.
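A minimal sketch of this marking scheme follows. It assumes the text has already been tokenized and POS-tagged (the system uses FreeLing for that; here the input is a plain list of pairs), and the negation word list, the simplified POS labels, and the choice to count only tokens that pass the POS filter toward $n$ are assumptions made for illustration.

```python
# Hypothetical negation word list; the paper does not enumerate the words used.
NEGATION_WORDS = {"no", "nunca", "jamas", "tampoco", "ni", "sin"}
CONTEXT_END = {"!", ",", ":", "?", ".", ";"}


def mark_negation(tagged_tokens, n=None, allowed_pos=None):
    """Append "_NEG" to tokens inside a negated context.

    tagged_tokens: list of (token, pos) pairs.
    n: maximum number of tokens affected per context (None means all).
    allowed_pos: set of POS labels that may be affected (None means any).
    """
    result = []
    in_context = False
    affected = 0
    for token, pos in tagged_tokens:
        if token in CONTEXT_END:
            in_context = False              # punctuation closes the context
            result.append(token)
        elif token.lower() in NEGATION_WORDS:
            in_context = True               # negation word opens a context
            affected = 0
            result.append(token)
        elif (in_context
              and (n is None or affected < n)
              and (allowed_pos is None or pos in allowed_pos)):
            result.append(token + "_NEG")
            affected += 1
        else:
            result.append(token)
    return result


tagged = [("no", "adv"), ("me", "pron"), ("gusta", "verb"), ("nada", "pron"),
          (",", "punct"), ("pero", "conj"), ("bien", "adv")]
print(mark_negation(tagged, n=2))
```

With `n=0` the function returns the tokens unchanged, which matches the remark that no token is affected in that setting.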

2.2 Feature Extraction

Once the text has been normalized as described above, it is transformed into a feature vector that feeds a first-level classifier. The feature vector is formed by concatenating basic and n-gram features.

2.2.1 Basic Features

Some of the following features are computed before the text normalization is performed.

The number of words completely in uppercase.

The number of words with more than two consecutive repetitions of a same character.

The number of consecutive repetitions of exclamation marks, question marks, and both punctuation marks (e.g., \!!", \??", \?!"), and whether the text ends with an exclamation or question mark.

The number of occurrences of each class of emoticons, i.e. positive and negative, and whether the last token of the text is an emoticon.

The number of positive and negative words, relative to the ElhPolar lexicon (Saralegi and Vicente, 2013), the AFINN lexicon (Nielsen, 2011), the iSOL lexicon (Molina-González et al., 2013), the EmoLex lexicon (Mohammad and Turney, 2013), the StrengthLex lexicon (Pérez-Rosas, Banea, and Mihalcea, 2012), or a union of two, three, four, or all lexicons. In a negated context, the polarity of a word is inverted, i.e. positive words become negative words, and vice versa. Additionally, a third feature labels the tweet with the class whose number of polarity words in the text is the highest.

The number of negated contexts.

The number of occurrences of each Part-of-Speech tag.
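Several of the features listed above can be computed with simple string operations. The sketch below is illustrative only: the function name, the toy lexicons passed as arguments, the reading of "more than two consecutive repetitions" as three or more identical characters, and the "TIE" value for equal counts are all assumptions. Emoticon, negation, and POS counts, as well as polarity inversion inside negated contexts, are left out for brevity.

```python
import re

ELONGATED_RE = re.compile(r"(\w)\1{2,}")   # a character repeated 3+ times
PUNCT_RUN_RE = re.compile(r"[!?]{2,}")     # runs such as "!!", "??", "?!"


def basic_features(text, positive_words, negative_words):
    """Compute a subset of the basic features on the raw (unnormalized) text."""
    words = text.split()
    lowered = [w.strip("!?.,;:").lower() for w in words]
    n_pos = sum(w in positive_words for w in lowered)
    n_neg = sum(w in negative_words for w in lowered)
    if n_pos > n_neg:
        lexicon_label = "P"
    elif n_neg > n_pos:
        lexicon_label = "N"
    else:
        lexicon_label = "TIE"              # tie handling is an assumption
    return {
        "uppercase_words": sum(w.isupper() and len(w) > 1 for w in words),
        "elongated_words": sum(bool(ELONGATED_RE.search(w)) for w in words),
        "punct_runs": len(PUNCT_RUN_RE.findall(text)),
        "ends_with_mark": text.rstrip().endswith(("!", "?")),
        "positive_words": n_pos,
        "negative_words": n_neg,
        "lexicon_label": lexicon_label,
    }


features = basic_features("GENIAL!!! me encantaaa este dia?!", {"genial"}, {"odio"})
print(features)
```

Computing these counts before normalization matters because the normalizer removes exactly the signals (character elongation, casing, repeated punctuation) that the features measure.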

2.2.2 N-gram Features

The fixed-length set of basic features is always extracted from a text. However, texts vary in length, number of tokens, and vocabulary. For that reason, a process that transforms textual data into numerical feature vectors of fixed length is required. This process, known as vectorization, is performed by applying the TF-IDF weighting scheme (Manning, Raghavan, and Schütze, 2008). Thus, each document (i.e., a tweet text) is represented as a vector $d = \{t_1, \ldots, t_n\} \in \mathbb{R}^V$, where $V$ is the size of the vocabulary that was built by considering word n-grams with $n \in [1, 4]$, or character n-grams with $n \in [2, 5]$, in the collection (i.e., the training set). The vector is, hence, formed by word n-grams, character n-grams, or a concatenation of word and character n-grams.
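To make the vectorization step concrete, the following dependency-free sketch extracts word and character n-grams and weights them with TF-IDF. The function names are hypothetical, and the smoothed IDF formula and L2 normalization are assumptions (they mirror the defaults of scikit-learn's TF-IDF transformer); the paper does not state which TF-IDF variant or vectorizer settings were used.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=4):
    """Word n-grams of orders n_min..n_max, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min=2, n_max=5):
    """Character n-grams of orders n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def fit_idf(docs_terms):
    """Build the vocabulary and smoothed IDF weights from the training set."""
    n_docs = len(docs_terms)
    df = Counter(term for terms in docs_terms for term in set(terms))
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    return vocab, idf


def tfidf_vector(terms, vocab, idf):
    """Map a document's terms to an L2-normalized vector of length len(vocab)."""
    tf = Counter(terms)
    vec = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


docs = [["buen", "dia"], ["mal", "dia"]]
terms = [word_ngrams(d, 1, 2) for d in docs]
vocab, idf = fit_idf(terms)
print(vocab)
print(tfidf_vector(terms[0], vocab, idf))
```

Every document maps to a vector of the same length as the vocabulary, regardless of how many tokens it contains, which is the fixed-length property the paragraph above requires.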

2.3 Machine Learning Classifier

At this stage, a machine learning classifier, or first-level classifier, receives the feature vector and predicts a class label or probability estimates, i.e. the probability that the tweet belongs to each class. Either kind of output is called a level-one prediction. Logistic Regression and Support Vector Machines (SVM) with a linear kernel are the algorithms used to develop a supervised learning classification approach; Scikit-learn (Pedregosa et al., 2011) is the machine learning library used.

2.4 Ensemble Combiner

Two ensemble methods were implemented to take level-one predictions and then optimally combine them in order to obtain a better final prediction, namely averaging and stacking (Li et al., 2014). The former chooses the class with the highest unweighted average probability from the probability estimates predicted by the first-level classifiers. In spite of its simplicity, it has proved to be a competitive method that achieves top results (Cerón-Guzmán, 2016). The latter stacks the class labels predicted by the first-level classifiers and then provides them as input to a second-level classifier that generates the ensemble prediction, i.e. the final prediction. SVM with a radial basis function (RBF) kernel is the algorithm used to generate the final predictions.
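Both combiners can be illustrated with a short sketch. The averaging function follows the description directly; for stacking, only the construction of the second-level input is shown, and the one-hot encoding of the predicted labels is an assumption, since the paper does not specify how the stacked labels are encoded. The second-level RBF-kernel SVM itself is not reproduced here, and the function names are hypothetical.

```python
LABELS = ["P", "NEU", "N", "NONE"]


def average_ensemble(probas):
    """Pick the class with the highest unweighted average probability.

    probas: one probability list per first-level classifier, ordered as LABELS.
    """
    n = len(probas)
    mean = [sum(p[i] for p in probas) / n for i in range(len(LABELS))]
    best = max(range(len(LABELS)), key=mean.__getitem__)
    return LABELS[best]


def stack_features(label_predictions):
    """One-hot encode each first-level label into one row for a second-level classifier."""
    row = []
    for label in label_predictions:
        row.extend(1 if label == candidate else 0 for candidate in LABELS)
    return row


probas = [[0.6, 0.1, 0.2, 0.1],
          [0.2, 0.1, 0.6, 0.1],
          [0.3, 0.1, 0.5, 0.1]]
print(average_ensemble(probas))
print(stack_features(["P", "N"]))
```

In the example, the first classifier alone would predict P, but the averaged probabilities favor N, which shows how averaging lets several moderately confident classifiers outvote one confident classifier.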

3 Experiments

First, the training data were used to fit 8,774 first-level classifiers (4,387 learned from the training set of the InterTASS corpus and the remaining 4,387 from the training set of the General Corpus of TASS) via 5-fold cross validation in order to find the best parameter settings, namely: scope of the negated context, polarity lexicon, order of word and character n-grams, and other parameters related to the vectorizer (e.g., frequency thresholds). Second, these classifiers were ranked according to their predictive performance on cross validation, i.e. the (out-of-fold) prediction accuracy averaged over the k iterations; out-of-fold predictions in the k-th iteration are the predictions obtained by applying a first-level classifier, which was trained on k - 1 folds, to the remaining fold. Thus, the best 100 first-level classifiers for each training set were retained. Third, how to choose which first-level classifiers constitute an ensemble was an important issue tackled in this work. Empirical findings indicate that the less-correlated combination of classifiers achieves top results (Cerón-Guzmán, 2016). Finally, second-level classifiers were trained using out-of-fold predictions on cross validation. In this regard, only ensembles based on stacking were trained via 5-fold cross validation.
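The out-of-fold procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: folds are contiguous and neither shuffled nor stratified, the number of samples is at least k, the `fit` and `predict` callables are hypothetical stand-ins for any first-level classifier, and a toy majority-class model is used only so that the example runs on its own.

```python
from collections import Counter


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(X, y, fit, predict, k=5):
    """Return out-of-fold predictions and the accuracy averaged over the k folds."""
    oof = [None] * len(X)
    fold_accuracies = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Train on k - 1 folds, then predict the remaining fold.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = 0
        for i in test_idx:
            oof[i] = predict(model, X[i])
            correct += oof[i] == y[i]
        fold_accuracies.append(correct / len(test_idx))
    return oof, sum(fold_accuracies) / k


def fit_majority(X_train, y_train):
    """Toy classifier: remember the most frequent training label."""
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


X = list(range(10))
y = ["P"] * 7 + ["N"] * 3
oof, accuracy = out_of_fold_predictions(X, y, fit_majority, predict_majority)
print(oof, accuracy)
```

Because every training example receives a prediction from a model that never saw it, the resulting `oof` list can serve both to rank first-level classifiers and as training input for a second-level classifier.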

In order to evaluate the predictive performance of the system, the test set of the InterTASS corpus and the two test sets of the General Corpus of TASS (the whole set and the stratified sample of 1,000 tweets) were used. Specifically, given a tweet from any of the test sets, its polarity must be predicted; the polarity, or class label, can be P, N, NEU, or NONE. Macro-F1 and Accuracy are the official metrics used to rank the participating systems. For the provided corpora, and the way they are split into training and test sets, the reader is referred to Martínez-Cámara et al. (2017), where they are thoroughly described.

legacy-run-3: the same run submitted to TASS-2016 Task 1 that achieved the best results (Cerón-Guzmán, 2016), namely the less-correlated combination of 25 first-level classifiers learned from the training set of the General Corpus of TASS, which constitute an ensemble based on averaging.

In the same way, the runs submitted to evaluate the system on the test set of the InterTASS corpus are described below:

InterTASS-run-1: the less-correlated combination of 3 first-level classifiers learned from the training set of the InterTASS corpus, which constitute an ensemble based on averaging.

InterTASS-run-2: the less-correlated combination of 19 first-level classifiers learned from the training set of the InterTASS corpus, which constitute an ensemble based on stacking.

InterTASS-run-3: the less-correlated combination of 14 first-level classifiers learned from the training set of the General Corpus of TASS, which constitute an ensemble based on averaging.

In summary, it is worth stating that ensembles based on averaging are significantly better than the ones based on stacking. This significance corresponds not only to the slightly better results achieved by the former, but also to their ability to increase prediction accuracy given their simplicity and computational efficiency. The proposed system outperforms all the participating systems in predictive performance on the test set of the InterTASS corpus in terms of the accuracy metric; likewise, its predictive performance on the whole test set of the General Corpus of TASS turns out to be slightly better than the best result achieved in the four-label evaluation of the previous edition (García-Cumbreras et al., 2016) in terms of the Macro-F1 metric. Additionally, the results of the third run submitted to evaluate the system on the InterTASS corpus ("InterTASS-run-3") should be highlighted, given that the domain from which the first-level classifiers that constitute the ensemble were learned differs from the evaluation domain; specifically, these results are above average (0.5642 in terms of the accuracy metric, taking only the best result from each participating system).

As a final point, class imbalance is a major problem that has not been properly tackled yet. Specifically, the overall performance of the system was significantly affected by the low discriminative power for the NEU class, on both the test set of the InterTASS corpus and the two test sets of the General Corpus of TASS. With this in mind, future research efforts should focus on dealing with the low representativeness of the NEU class.

4 Conclusion

This paper has described the JACERONG system submitted to TASS-2017 Task 1. For this benchmark evaluation, two ensemble methods were implemented, namely averaging and stacking. Findings indicate that ensembles based on averaging are significantly better than the ones based on stacking. This significance corresponds to the former's ability to increase prediction accuracy given their simplicity and computational efficiency, in addition to the slightly better results they achieve. Moreover, findings show that the less-correlated combination of classifiers achieves top results. The experimental evaluation on the test set of the InterTASS corpus showed that the proposed system is top-ranked. In addition, results indicated that the predictive performance on the whole test set of the General Corpus of TASS outperforms the best result achieved in the four-label evaluation of the previous edition of TASS.

References

Cerón-Guzmán, J. A. 2016. Jacerong at TASS 2016: An ensemble classifier for sentiment analysis of Spanish tweets at global level. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), pages 35-39.

Cerón-Guzmán, J. A. and E. León-Guzmán. 2016a. Lexical normalization of Spanish tweets. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW'16 Companion, pages 605-610.

Cerón-Guzmán, J. A. and E. León-Guzmán. 2016b. A sentiment analysis system of Spanish tweets and its application in Colombia 2014 presidential election. In Proceedings of the 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pages 250-257.

García-Cumbreras, M. A., J. Villena-Román, E. Martínez-Cámara, M. C. Díaz-Galiano, M. T. Martín-Valdivia, and L. A. Ureña-López. 2016. Overview of TASS 2016. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), pages 13-21.

Li, Y., J. Gao, Q. Li, and W. Fan. 2014. Ensemble learning. In Data Classification: Algorithms and Applications. Chapman & Hall/CRC.

Manning, C. D., P. Raghavan, and H. Schütze. 2008. Scoring, term weighting and the vector space model. In An Introduction to Information Retrieval. Cambridge University Press.

Martínez-Cámara, E., M. C. Díaz-Galiano, M. A. García-Cumbreras, M. García-Vega, and J. Villena-Román. 2017. Overview of TASS 2017. In Proceedings of TASS 2017: Workshop on Semantic Analysis at SEPLN (TASS 2017), volume 1896 of CEUR Workshop Proceedings, Murcia, Spain, September. CEUR-WS.

Mohammad, S. M. and P. D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436-465.

Molina-González, M. D., E. Martínez-Cámara, M. T. Martín-Valdivia, and J. M. Perea-Ortega. 2013. Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications, 40(18):7250-7257.

Nielsen, F. Å. 2011. A new ANEW: evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages, pages 93-98.

Padró, L. and E. Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Pang, B., L. Lee, and S. Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 79-86. Association for Computational Linguistics.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Pérez-Rosas, V., C. Banea, and R. Mihalcea. 2012. Learning sentiment lexicons in Spanish. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC2012), pages 3077-3081.

Saralegi, X. and I. S. Vicente. 2013. Elhuyar at TASS 2013. In Proceedings of the Sentiment Analysis Workshop at SEPLN (TASS2013), pages 143-150.

Vilares, D., M. A. Alonso, and C. Gómez-Rodríguez. 2015. On the usefulness of lexical and syntactic processing in polarity classification of Twitter messages. Journal of the Association for Information Science and Technology, 66(9):1799-1816.