<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>JACERONG at TASS 2016: An Ensemble Classifier for Sentiment Analysis of Spanish Tweets at Global Level</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jhon Adrian Ceron-Guzman</string-name>
          <email>jadrian.ceron@gmail.com</email>
        </contrib>
        <aff>Santiago de Cali, Valle del Cauca, Colombia</aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>35</fpage>
      <lpage>39</lpage>
      <abstract>
        <p>This paper describes an ensemble-based approach developed to participate in TASS-2016 Task 1 on sentiment analysis of Spanish tweets at global level. Ensembles are built on the combination of systems with the lowest absolute correlation with each other. The systems are able to deal with non-standard lexical forms in tweets, in order to improve the quality of natural language analysis. To support the polarity classification, the approach uses basic features that have proved their discriminative power, as well as word and character n-gram features. Then, outputs from Logistic Regression classifiers, which may be either class labels or probabilities for each class, are used to build ensembles. Experimental results show that the less-correlated combination of 25 systems, which chooses the class with the highest unweighted average probability, is the setting that best suits the task, achieving an overall accuracy of 62.0% in the six-labels evaluation, and of 70.5% in the four-labels evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        What people say on social media about issues of their everyday life, society, and the world in general has turned into a rich source of information for understanding social behavior. Twitter content, in particular, has caught the attention of researchers who have investigated its potential for conducting studies on human subjectivity at a large scale, which was not feasible using traditional methods. Around election time, sentiment analysis of political tweets has been widely used to capture trends in public opinion regarding important issues such as voting intention
        <xref ref-type="bibr" rid="ref3">(Gayo-Avello, 2013)</xref>
        . However, analyzing this content also presents several challenges, including the development of text analysis approaches based on Natural Language Processing techniques that properly adapt to the informal genre and the free writing style of Twitter
        <xref ref-type="bibr" rid="ref1 ref4">(Han and Baldwin, 2011; Ceron-Guzman and Leon-Guzman, 2016)</xref>
        .
      </p>
      <p>
        TASS is a workshop aimed at fostering research on sentiment analysis of Spanish Twitter data, which provides a benchmark evaluation to compare the latest advances in the field
        <xref ref-type="bibr" rid="ref2">(García-Cumbreras et al., 2016)</xref>
        . One of the proposed tasks is to determine the opinion orientation expressed in tweets at a global level. Task 1 consists of assigning one of six labels (P+, P, NEU, N, N+, NONE) to a tweet in the six-labels evaluation, or one of four labels (P, NEU, N, NONE) in the four-labels evaluation. Here, P, N, and NEU stand for positive, negative, and neutral, respectively; NONE, instead, means no sentiment. The "+" symbol is used as an intensifier.
      </p>
      <p>This paper presents an ensemble-based approach to polarity classification of Spanish tweets, developed to participate in Task 1 proposed by the organizing committee of the TASS workshop. The ensemble members are (relatively) highly accurate classifiers with the lowest absolute correlation with each other. The output from each classifier, which may be either a class label or probabilities for each class, is used to assign the polarity of a tweet based on a majority rule or on the highest unweighted average probability. Moreover, the classifiers are adapted to deal with non-standard lexical forms in tweets, in order to improve the quality of natural language analysis.</p>
      <p>The remainder of this paper is organized as follows. Section 2 describes the common architecture of the ensemble members (i.e., classifiers). Next, the submitted experiments, as well as the obtained results, are discussed in Section 3. Finally, Section 4 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>The System Architecture</title>
      <p>The tweet text is passed through the pipeline of each system in order to assign it a class label or a probability of belonging to a certain class. The pipeline, which goes from text preprocessing to machine learning classification, is described below. Note that the term "system" is preferred over "classifier", because a machine learning classifier receives a feature vector and produces a class label or probabilities for each class; the term "system", instead, encompasses the whole process, from preprocessing to machine learning classification.</p>
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>The process of text cleaning and
normalization is performed in two phases: basic
preprocessing and advanced preprocessing.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1 Basic Preprocessing</title>
          <p>The following simple rules are implemented as regular expressions:</p>
          <p>Removing URLs and emails.</p>
          <p>HTML entities are mapped to their textual representations (e.g., "&amp;lt;" → "&lt;"). Specific Twitter terms such as mentions (@user) and hashtags (#topic) are replaced by placeholders.</p>
          <p>Unknown characters are mapped to their closest ASCII variant, using the Python Unidecode module for the mapping. Consecutive repetitions of the same character are reduced to one occurrence. Emoticons are recognized and then classified as positive or negative, according to the sentiment they convey (e.g., ":)" → "EMO_POS", ":(" → "EMO_NEG").</p>
          <p>
            Unification of punctuation marks
            <xref ref-type="bibr" rid="ref11">(Vilares, Alonso, and Gómez-Rodríguez, 2014)</xref>
            .
          </p>
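          <p>As an illustration, the rules above can be sketched as a chain of regular-expression substitutions. This is a hypothetical sketch in Python; the placeholder tokens (USER, TOPIC, EMO_POS, EMO_NEG) and the exact patterns are assumptions, not the authors' implementation.</p>
          <preformat>
```python
import re

def basic_preprocess(text):
    # Hypothetical sketch of the basic preprocessing rules; placeholder
    # tokens and patterns are assumptions, not the authors' exact choices.
    text = re.sub(r"https?://\S+", "", text)          # remove URLs
    text = re.sub(r"\S+@\S+\.\S+", "", text)          # remove emails
    text = re.sub(r"@\w+", "USER", text)              # replace mentions
    text = re.sub(r"#\w+", "TOPIC", text)             # replace hashtags
    text = re.sub(r":-?\)+|:D", " EMO_POS ", text)    # positive emoticons
    text = re.sub(r":-?\(+", " EMO_NEG ", text)       # negative emoticons
    text = re.sub(r"(.)\1+", r"\1", text)             # collapse repeated chars
    return re.sub(r"\s+", " ", text).strip()
```
          </preformat>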
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2 Advanced Preprocessing</title>
          <p>
            Once the set of simple rules has been applied, the tweet text is tokenized and morphologically analyzed by FreeLing
            <xref ref-type="bibr" rid="ref7">(Padró and Stanilovsky, 2012)</xref>
            . In this way, each resulting token is assigned its lemma and Part-of-Speech (POS) tag. Taking these data as input, the following advanced preprocessing is applied:
          </p>
          <p>
            Lexical normalization. Each token is passed through a set of basic modules of FreeLing (e.g., dictionary lookup, suffix checking, detection of numbers and dates, and named entity recognition) for identifying standard word forms and other valid constructions. If a token is not recognized by any of the modules, it is marked as an out-of-vocabulary (OOV) word. Then, a confusion set is formed by normalization candidates which are identical or similar to the graphemes or phonemes that make up the OOV word. These candidates are elements of the union of a dictionary of Spanish standard word forms and a gazetteer of proper nouns. The best normalization candidate for the OOV word is the one that best fits a statistical language model, which was estimated from the Spanish Wikipedia corpus. Lastly, the selected candidate is capitalized according to the capitalization rules of the Spanish language. Extensive research on lexical normalization of Spanish tweets can be found in
            <xref ref-type="bibr" rid="ref1">(Ceron-Guzman and Leon-Guzman, 2016)</xref>
            .
          </p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Negation Handling</title>
          <p>
            Inspired by the approach proposed by Pang et al.
            <xref ref-type="bibr" rid="ref8">(Pang, Lee, and Vaithyanathan, 2002)</xref>
            , this research defined a negated context as a segment of the tweet that starts with a (Spanish) negation word and ends with a punctuation mark (i.e., "!", ",", ":", "?", ".", ";"), but only the first n ∈ [0, 3] tokens, or all tokens labeled with any or a specific POS tag (i.e., verb, adjective, adverb, and common noun), are affected by appending the "_NEG" suffix to them. Note that when n = 0, no token is affected.
          </p>
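          <p>A minimal sketch of this rule, assuming n limits how many tokens of the negated context are tagged (the POS-tag filter is omitted for brevity, and the negation-word list is illustrative, not the authors' exact resource):</p>
          <preformat>
```python
# Hypothetical sketch of the negation-handling rule: tokens following a
# Spanish negation word receive the "_NEG" suffix until a punctuation
# mark closes the context, affecting at most the first n tokens.
NEGATION_WORDS = {"no", "ni", "nunca", "jamas", "tampoco"}
PUNCTUATION = {"!", ",", ":", "?", ".", ";"}

def mark_negation(tokens, n=3):
    out, remaining = [], 0
    for tok in tokens:
        if tok.lower() in NEGATION_WORDS:
            out.append(tok)
            remaining = n            # open a negated context
        elif tok in PUNCTUATION:
            out.append(tok)
            remaining = 0            # punctuation closes the context
        elif remaining > 0:
            out.append(tok + "_NEG")
            remaining -= 1
        else:
            out.append(tok)
    return out
```
          </preformat>
          <p>With n = 2, the tokens "no me gusta nada . bien" become "no me_NEG gusta_NEG nada . bien".</p>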
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Feature Extraction</title>
        <p>In this stage, the normalized tweet text is transformed into a feature vector that feeds the machine learning classifier. The features are grouped into basic features and n-gram features.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1 Basic Features</title>
          <p>Some of these features are computed before the process of text cleaning and normalization is performed.</p>
          <p>The number of words completely in uppercase.</p>
          <p>The number of words with more than two consecutive repetitions of the same character.</p>
          <p>The number of consecutive repetitions of exclamation marks, question marks, and combinations of both (e.g., "!!", "??", "?!"), and whether the text ends with an exclamation or question mark.</p>
          <p>The number of occurrences of each class
of emoticons (i.e., positive and negative)
and whether the last token of the tweet
is an emoticon.</p>
          <p>
            The number of positive and negative words, relative to the ElhPolar lexicon
            <xref ref-type="bibr" rid="ref10">(Saralegi and Vicente, 2013)</xref>
            , the AFINN lexicon
            <xref ref-type="bibr" rid="ref6">(Nielsen, 2011)</xref>
            , or a union of both lexicons. In a negated context, the label of a polarity word is inverted (i.e., positive words become negative words, and vice versa). Additionally, a third feature labels the tweet with the class whose number of polarity words in the text is the highest.
          </p>
          <p>The number of negated contexts.</p>
          <p>The number of occurrences of each Part-of-Speech tag.</p>
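          <p>The lexicon-based counts, including the polarity inversion in negated contexts, can be sketched as follows. The tiny lexicon below is a stand-in for ElhPolar/AFINN, and the "_NEG" suffix convention follows the negation-handling step; both are illustrative assumptions.</p>
          <preformat>
```python
# Hypothetical sketch of the lexicon features: counts of positive and
# negative words, flipping the polarity of words inside negated
# contexts (tokens carrying the "_NEG" suffix).
POSITIVE = {"bueno", "feliz", "genial"}
NEGATIVE = {"malo", "triste", "feo"}

def polarity_counts(tokens):
    pos = neg = 0
    for tok in tokens:
        negated = tok.endswith("_NEG")
        word = tok[:-4] if negated else tok   # strip the "_NEG" suffix
        if word in POSITIVE:
            pos, neg = (pos, neg + 1) if negated else (pos + 1, neg)
        elif word in NEGATIVE:
            pos, neg = (pos + 1, neg) if negated else (pos, neg + 1)
    return pos, neg
```
          </preformat>
          <p>For example, "bueno_NEG" counts as one negative word, so polarity_counts(["bueno_NEG", "feo", "genial"]) yields one positive and two negative words.</p>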
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2 N-gram Features</title>
          <p>
            The fixed-length set of basic features is always extracted from tweets. However, tweet texts vary from one another in terms of length, number of tokens, and vocabulary used. For that reason, a process that transforms textual data into numerical feature vectors of fixed length is required. This process, known as vectorization, is performed by applying the tf-idf weighting scheme
            <xref ref-type="bibr" rid="ref5">(Manning, Raghavan, and Schütze, 2008)</xref>
            . Thus, each document (i.e., a tweet text) is represented as a vector d = {t1, ..., tn} ∈ R^V, where V is the size of the vocabulary that was built by considering word n-grams with n ∈ [1, 4], or character n-grams with n ∈ [3, 5], in the collection (i.e., the training set). The vector is, hence, formed by word n-grams, character n-grams, or a concatenation of word and character n-grams.
          </p>
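          <p>A sketch of this vectorization step using Scikit-learn (the corpus here is illustrative; parameters other than the n-gram ranges are left at their defaults):</p>
          <preformat>
```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: tf-idf over word 1-4-grams and character 3-5-grams,
# concatenated into a single fixed-length feature matrix.
corpus = ["no me gusta nada", "me encanta este dia", "que dia tan feo"]

word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 4))
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))

X = hstack([word_vec.fit_transform(corpus),
            char_vec.fit_transform(corpus)])  # one row per tweet
```
          </preformat>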
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Machine Learning Classification</title>
        <p>
          At the last stage, the sentiment analysis system classifies a given tweet as either P+, P, NEU, N, N+, or NONE, or assigns probabilities for each class. After receiving the feature vector as input, an L2-regularized Logistic Regression classifier assigns a class label to the tweet or a probability of belonging to a certain class. The classifier was trained on the training set, using the Scikit-learn
          <xref ref-type="bibr" rid="ref9">(Pedregosa et al., 2011)</xref>
          implementation of the Logistic Regression algorithm.
        </p>
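        <p>A minimal sketch of this classification stage, using toy feature vectors (Scikit-learn's LogisticRegression applies L2 regularization by default):</p>
        <preformat>
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: an L2-regularized Logistic Regression that can output either
# a class label (predict) or per-class probabilities (predict_proba),
# the two kinds of outputs later combined by the ensembles.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
y = np.array(["P", "P", "N", "N"])

clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
label = clf.predict([[0.8, 0.2]])[0]        # a class label
proba = clf.predict_proba([[0.8, 0.2]])[0]  # probabilities per class
```
          </preformat>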
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>1,720 different sentiment analysis systems were trained on the training set via 5-fold cross-validation, in order to find the best parameter settings, namely: negation handling, polarity lexicon, order of word and character n-grams, and other parameters related to the vectorization process (e.g., lowercasing, frequency thresholds, etc.). The systems were sorted by their mean cross-validation score, and the top 50 ranked were selected to build the ensembles. The training set is a collection of 7,219 tweets, each of which is tagged with one of six labels (i.e., P+, P, NEU, N, N+, and NONE). Note that the systems were trained for the six-labels evaluation, and therefore the P+ and P labels were merged into P, as well as the N+ and N labels into N, to produce an output in accordance with the four-labels evaluation. A further description of the provided corpus, as well as of the training and test sets, can be found in
        <xref ref-type="bibr" rid="ref2">(García-Cumbreras et al., 2016)</xref>
        .
      </p>
      <sec id="sec-3-1">
        <title>Results</title>
        <p>[Table 1 and Table 2 report the accuracy of run-1, run-2, and run-3 in the six-labels and four-labels evaluations, respectively; Table 3 reports the performance of run-3 for the P, NEU, N, and NONE classes.]</p>
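        <p>The merge from the six-labels scheme to the four-labels scheme can be sketched as a simple mapping (an illustration of the rule stated above):</p>
        <preformat>
```python
# P+ folds into P, and N+ into N; the remaining labels are unchanged.
MERGE = {"P+": "P", "N+": "N"}

def to_four_labels(label):
    return MERGE.get(label, label)

six = ["P+", "P", "NEU", "N+", "N", "NONE"]
four = [to_four_labels(l) for l in six]  # ["P", "P", "NEU", "N", "N", "NONE"]
```
          </preformat>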
        <p>Next, the top 50 systems assigned a class label to each tweet in a collection of 1,000 tweets, which was drawn from the untagged test set with a class distribution similar to that of the training set. In this stage, the objective was to find the systems with the lowest absolute correlation with each other; therefore, performance was not evaluated. Then, the less-correlated combinations of 5, 10, and 25 systems were used to build the ensembles, whose outputs correspond to the submitted experiments. These experiments are described below:</p>
        <p>run-1: the less-correlated combination of 5 systems, which chooses the class label that represents the majority in the predictions made by the ensemble members.</p>
        <p>run-2: the less-correlated combination of 10 systems, which chooses the class with the highest unweighted average probability.</p>
        <p>run-3: the less-correlated combination of 25 systems, which chooses the class with the highest unweighted average probability.</p>
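        <p>The two combination rules behind the runs can be sketched as follows (the label order and the probability values are illustrative):</p>
        <preformat>
```python
import numpy as np
from collections import Counter

LABELS = ["P", "NEU", "N", "NONE"]

def majority_vote(predictions):
    # run-1: the class label predicted by most ensemble members
    return Counter(predictions).most_common(1)[0][0]

def average_probability(prob_rows):
    # run-2/run-3: the class with the highest unweighted average
    # probability across the ensemble members
    avg = np.mean(prob_rows, axis=0)
    return LABELS[int(np.argmax(avg))]

vote = majority_vote(["P", "N", "P", "NEU", "P"])
avg = average_probability([[0.6, 0.1, 0.2, 0.1],
                           [0.3, 0.1, 0.5, 0.1]])
```
          </preformat>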
        <p>Tables 1 and 2 show the performance evaluation on the test set (i.e., a collection of 60,798 tweets) for six and four labels, respectively. Accuracy has been defined as the official metric for ranking the systems. In summary, the main gain occurs between the "run-1" and "run-2" experiments, with an increment of 0.5% in accuracy in the six-labels evaluation, and of 0.2% in the four-labels evaluation; by contrast, a negligible gain occurs between the "run-2" and "run-3" experiments, especially considering the computational cost of running the latter.</p>
        <p>As a final point, Table 3 shows how the overall performance is affected by the low discriminative power of the ensembles (in this case, the one corresponding to "run-3") for the NEU class. With this in mind, it is proposed as future work to deal with the low representativeness of the NEU class in the training data (i.e., 9.28% of tweets), in order to properly characterize this kind of tweets.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This paper has described an ensemble-based approach for sentiment analysis of Spanish Twitter data at a global level, developed in order to participate in Task 1 proposed by the organization of the TASS workshop. Three ensembles were built on combinations of sentiment analysis systems with the lowest absolute correlation with each other. The systems were adapted to the informal genre and the free writing style that characterize Twitter, in order to improve the quality of natural language analysis. The predicted class label for a particular tweet was based on a majority rule or on the highest unweighted average probability. Experimental results showed that the less-correlated combination of 25 systems, which chose the class with the highest unweighted average probability, was the setting that best suited the task. However, there is considerable room for improvement in learning a proper characterization of neutral tweets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ceron-Guzman</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Leon-Guzman</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Lexical normalization of Spanish tweets</article-title>
          .
          <source>In Proceedings of the 25th International Conference Companion on World Wide Web, WWW'16 Companion</source>
          , pages
          <fpage>605</fpage>
          –
          <lpage>610</lpage>
          . International World Wide Web Conferences Steering Committee.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>García-Cumbreras</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Villena-Román</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Martínez-Cámara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Díaz-Galiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Martín-Valdivia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of TASS 2016</article-title>
          .
          <source>In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016)</source>
          , Salamanca, Spain, September.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Gayo-Avello</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>A meta-analysis of state-of-the-art electoral prediction from Twitter data</article-title>
          .
          <source>Social Science Computer Review</source>
          ,
          <volume>31</volume>
          (
          <issue>6</issue>
          ):
          <fpage>649</fpage>
          –
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>B</given-names>
          </string-name>
          . and
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Lexical normalisation of short text messages: Makn sens a #twitter</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</source>
          , HLT '11, pages
          <fpage>368</fpage>
          –
          <lpage>378</lpage>
          , Stroudsburg, PA, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and H. Schutze.
          <year>2008</year>
          .
          <article-title>Scoring, term weighting and the vector space model</article-title>
          .
          <source>In An Introduction to Information Retrieval</source>
          . Cambridge University Press, New York, NY, USA.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F. A.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>A new anew: evaluation of a word list for sentiment analysis in microblogs</article-title>
          .
          <source>In Proceedings of the ESWC2011 Workshop on `Making Sense of Microposts': Big things come in small packages</source>
          , pages
          <fpage>93</fpage>
          –
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Padró</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Stanilovsky</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>FreeLing 3.0: Towards wider multilinguality</article-title>
          .
          <source>In Proceedings of the Language Resources and Evaluation Conference (LREC</source>
          <year>2012</year>
          ), Istanbul, Turkey, May. ELRA.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vaithyanathan</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Thumbs up?: Sentiment classification using machine learning techniques</article-title>
          .
          <source>In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10</source>
          , EMNLP '02, pages
          <fpage>79</fpage>
          –
          <lpage>86</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          –
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Saralegi</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Vicente</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Elhuyar at TASS 2013</article-title>
          .
          <source>In Proceedings of the Sentiment Analysis Workshop at SEPLN (TASS2013)</source>
          ,
          <year>September</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Vilares</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Alonso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Gómez-Rodríguez</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>On the usefulness of lexical and syntactic processing in polarity classification of Twitter messages</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>