=Paper=
{{Paper
|id=Vol-1386/ensemble_classifier
|storemode=property
|title=Ensemble classifier for Twitter sentiment analysis
|pdfUrl=https://ceur-ws.org/Vol-1386/ensemble_classifier.pdf
|volume=Vol-1386
|dblpUrl=https://dblp.org/rec/conf/nldb/Martinez-Camara15a
}}
==Ensemble classifier for Twitter sentiment analysis==
Eugenio Martínez-Cámara (1), Yoan Gutiérrez-Vázquez (2), Javi Fernández (2), Arturo Montejo-Ráez (1) and Rafael Muñoz-Guillena (2)

(1) Computer Science Department, University of Jaén, Campus Las Lagunillas, 23071, Jaén, Spain
(2) University of Alicante, Carretera San Vicente del Raspeig, 03690, Alicante, Spain

emcamara@ujaen.es, ygutierrez@dlsi.ua.es, javifm@ua.es, amontejo@ujaen.es, rafael@dlsi.ua.es

Abstract. In this paper we present a combination of different types of sentiment analysis approaches in order to improve their individual performance. These approaches consist of (I) ranking algorithms that score sentiment features, such as bi-grams and skip-grams, extracted from annotated corpora; (II) a polarity classifier based on a deep learning algorithm; and (III) a semi-supervised system founded on the combination of sentiment resources. By assembling the outputs of these approaches, we carried out a new evaluation in order to determine how far they complement each other. The evaluations were based on the General Corpus of the TASS competition. The good results obtained encourage us to continue studying the application of ensemble methods to sentiment analysis problems.

Keywords: Sentiment Analysis, Deep Learning, Ensemble methods

This work has been partially supported by a grant from the Fondo Europeo de Desarrollo Regional (FEDER), ATTOS project (TIN2012-38536-C03-0) from the Spanish Government.

1 Introduction

Textual information has become one of the most important sources of data from which to extract useful and heterogeneous knowledge. Texts can provide factual information, such as descriptions or even instructions, and opinion-based information, which includes reviews, emotions and feelings. Subjective information can be expressed through different textual genres, such as blogs, forums and reviews, but also through social networks and microblogs. Social networks like Twitter and Facebook have gained much popularity in recent years. These sites hold a large amount of subjective information, because millions of users share opinions on different aspects of their everyday life. Extracting this subjective information has great value for both general and expert users: users can find opinions about a product they are interested in, and companies and public figures can monitor their on-line reputation. Sentiment Analysis (SA) can deal with this task; however, it is difficult to apply it directly to tweets, mainly because of their short length, their informality and the lack of context. SA systems must therefore be adapted to face the challenges of this new textual genre.

In this paper we present an ensemble classifier that makes use of different types of SA approaches in order to improve the performance of the base classifiers in the context of polarity classification of tweets written in Spanish. The base systems are polarity classifiers that participated in Task 1 of the TASS competition (http://www.daedalus.es/TASS2013/) [20]. The task is focused on the development of polarity classifiers at tweet level, which must identify six levels of polarity intensity: strong positive (P+), positive (P), neutral (NEU), negative (N), strong negative (N+) and no sentiment (NONE).

The paper is structured as follows. The next section reviews related work and exposes the main insights of each approach. The base systems of the final ensemble classifier are described in Section 3.
Section 4 describes in detail the ensemble classifier we propose. Finally, the conclusions and future work are presented in Section 5.

2 Related Work

Several studies have been conducted in order to build sentiment resources. One of the first is the relevant work by Hu and Liu [8], who applied lexicon expansion techniques based on the synonymy and antonymy relations provided by WordNet [11]. A similar approach was used to build WordNet-Affect [19], which expands six basic categories of emotion, thereby increasing the lexicon paths in WordNet. Another well documented lexicon can be found in [14], where 2,496 Spanish words are annotated in two different lexicons: a Full Strength Lexicon and a Medium Strength Lexicon. Nowadays, a large number of sentiment and opinion messages are published on social media, where new manners of expression characterise the communication stream. This is very important for us, because it allows us to retrieve the information available in these media to build new types of sentiment resources.

Deep neural networks have already been used to construct polarity classifiers. The work in [16] proposes a sophisticated approach that concatenates word-level vectors with character-level vectors; these vectors feed a second convolutional network that produces the final polarity. In the approach presented in Section 3.2, a linear averaging of word-level vectors is performed instead, reducing the amount of computation and simplifying the process. Another approach to computing the final vector for a sentence or text is to consider the parse tree and calculate the vector of a node according to the vectors of its child nodes. This approach has been successfully applied in [18], although it requires syntactic information to be available in order to train the system, so it may not be the preferred option for short texts such as tweets. Besides, it is more complex to port the system to languages other than English.

3 Polarity classifiers

Our proposal is based on the combination of three different polarity classifiers, so we first describe the base systems of the final combined polarity classifier.

3.1 Approach I: Ranking Algorithm and Skip-grams

The first approach consists of a modified version of the ranking algorithm RA-SR [7] using bi-grams, combined with a skip-gram scorer [4]. Both approaches share the same strategy:

– From a corpus of polarity-annotated sentences, a sentiment lexicon is created. This lexicon assigns a different score to each term and polarity. A term can consist of a single word or several context-related words, obtained by implementing n-gram (a sequence of n consecutive words found in text) and skip-gram (a generalisation of n-grams in which words may be skipped and do not need to be consecutive) strategies.
– A machine learning model is generated using the corpus and the sentiment resource created. Each text in the corpus is employed as a training instance, considering the polarities as the training categories. The features are obtained by combining the scores that the sentiment lexicon assigns to the terms in each text.

In both cases the learning algorithm used is Support Vector Machines, owing to its good performance in text categorisation tasks [17].
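To make this shared strategy concrete, the following minimal sketch builds a toy relative-frequency lexicon, aggregates one score-based feature per polarity and trains an SVM; the feature design, the scoring scheme and the use of scikit-learn's LinearSVC are our own simplifying assumptions and do not reproduce the actual RA-SR or skip-gram scoring.

```python
# Minimal sketch of the shared strategy of Approach I: build a term/polarity
# score lexicon from annotated texts, turn each text into score-based features,
# and train an SVM. Lexicon scoring and features are simplified placeholders.
from collections import defaultdict
from sklearn.svm import LinearSVC

LABELS = ["P+", "P", "NEU", "N", "N+", "NONE"]

def build_lexicon(texts, labels):
    """Score each unigram per polarity by its relative frequency in that polarity."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, label in zip(texts, labels):
        for token in text.lower().split():
            counts[token][label] += 1
    lexicon = {}
    for token, per_label in counts.items():
        total = sum(per_label.values())
        lexicon[token] = {lab: per_label[lab] / total for lab in LABELS}
    return lexicon

def features(text, lexicon):
    """One feature per polarity: the sum of the lexicon scores of the tokens."""
    vec = [0.0] * len(LABELS)
    for token in text.lower().split():
        scores = lexicon.get(token, {})
        for i, lab in enumerate(LABELS):
            vec[i] += scores.get(lab, 0.0)
    return vec

train_texts = ["me encanta este movil", "odio esperar", "hoy llueve"]
train_labels = ["P", "N", "NONE"]

lexicon = build_lexicon(train_texts, train_labels)
X = [features(t, lexicon) for t in train_texts]
clf = LinearSVC().fit(X, train_labels)
print(clf.predict([features("me encanta esperar", lexicon)]))
```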
The differences between these approaches lie essentially in the term generation, the term scoring and the features employed for the machine learning model. The following paragraphs explain these differences.

Ranking Algorithm RA-SR. In this approach we use a method named RA-SR (using Ranking Algorithms to build Sentiment Resources) [7], which produces sentiment inventories based on senti-semantic evidence obtained after exploring text annotated with sentiment polarity information. A wider description can be found in [7]. To generate the sentiment lexicon, each sentence in the corpus is tokenised and lemmatised into words and word bi-grams. These terms are used as nodes of an RA-SR contextual graph, where a link between two terms represents the appearance of both terms in the same text. Finally, each term is assigned a value of positivity, negativity and objectivity, obtained by applying the PageRank algorithm over the lexical graphs that represent each polarity. This method generates several features for a single text, such as the number of positive terms, the sum of the positive scores of the terms in the text, or the number of positive emoticons (and likewise for the negative and objective polarities).

Skip-gram Scorer. In this approach, terms are not only uni-grams and bi-grams but also skip-grams. Skip-grams are a technique widely used in the field of speech processing, whereby n-grams are formed but, in addition to allowing adjacent sequences of words, tokens may also be skipped [6]. It should be noted that in this approach the terms are not lemmatised. To create the sentiment lexicon, the scores for each skip-gram and polarity depend on their occurrences in the corpus. The score is calculated taking into account the number of skipped terms in each text, the number of occurrences, and the proportion of occurrences in a specific polarity. In this method one feature is created for each polarity. The value of each polarity feature for a text is calculated by adding up all the scores for that polarity of the skip-grams it contains. For example, the feature called positive for a specific text is calculated by adding up all the positive scores of the skip-grams in that text. A deeper explanation of this approach can be found in [5].
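As a rough illustration of the skip-gram side, the sketch below enumerates skip-bi-grams (allowing at most k skipped tokens) and sums per-polarity lexicon scores into one feature per polarity; the lexicon entries and weights are placeholders, not the scores defined in [4, 5].

```python
# Sketch: enumerate skip-grams (here: bi-grams with up to k skipped tokens)
# and sum per-polarity lexicon scores to obtain one feature per polarity.
from itertools import combinations

def skip_bigrams(tokens, k=2):
    """Yield ordered token pairs separated by at most k skipped tokens."""
    for i, j in combinations(range(len(tokens)), 2):
        if j - i - 1 <= k:
            yield (tokens[i], tokens[j])

# Placeholder lexicon: skip-gram -> {polarity: score}
LEXICON = {
    ("muy", "bueno"): {"P": 0.9, "N": 0.05},
    ("no", "bueno"): {"P": 0.1, "N": 0.8},
}

def polarity_features(tokens, lexicon, polarities=("P", "N")):
    feats = {p: 0.0 for p in polarities}
    for sg in skip_bigrams(tokens):
        for p, score in lexicon.get(sg, {}).items():
            feats[p] += score
    return feats

print(list(skip_bigrams(["no", "es", "muy", "bueno"], k=2)))
print(polarity_features(["no", "es", "muy", "bueno"], LEXICON))
```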
3.2 Approach II: Word2Vec

This approach projects every tweet onto a space of fixed dimensionality in which every word is semantically modelled; the tweet is then represented as the centroid of all its words in this semantic space. To represent each word as a vector we applied the deep learning algorithm known as Word2Vec. Once the tweet is represented as a vector, traditional supervised learning is applied.

Word2Vec is an implementation of an architecture that represents words as vectors in a continuous space, based on Mikolov's bag-of-words and n-gram models [10]. This method has been applied to capture word semantics in problems such as word-level analogy or word clustering. The main idea is as follows: every word is associated with an n-dimensional vector whose weights are calculated by a neural network trained with a recurrent algorithm. There are two possible topologies of the neural network. The first is the Skip-gram Model, which computes the weights of a hidden layer using the target term w(t) as input and the surrounding terms as expected output. The other is the Continuous Bag-of-Words Model (CBOW), in which the prediction of a term w(t) is based on a window of two to five terms around the term t.

With a sufficiently large volume of data, these topologies make it possible to represent the semantics of each word. We used the deeplearning4j software (http://deeplearning4j.org/word2vec.html) to calculate the Word2Vec model. In order to obtain suitable vectors for each word, a model has to be generated from a large volume of text, so we downloaded XML versions of the Spanish Wikipedia and extracted the text included in each XML document, obtaining about 2.2 GB of Spanish text.

This classifier [13] therefore requires two learning steps. First, we generate the word vector model from Wikipedia by means of the unsupervised Word2Vec algorithm; tweets can then be vectorised in this vector space model, with the vectoriser module computing the centroid of all the vectors of the words in the tweet. The second step is the supervised learning phase.

Messages are processed in the following steps. First, stop words are removed. The resulting texts are passed to the vectorisation module, which generates the word vectors using the Word2Vec model of Wikipedia in the corresponding language. Supervised learning is then performed considering only the provided training samples. Word2Vec was parameterised as shown in the following table, which yielded models with a vocabulary of 404,916 terms, since Wikipedia was used as the basic resource.

Parameter                         Value
Window size                       5 terms
Network model                     CBOW
Number of dimensions              200
Hierarchical clustering           enabled
Min. occurrences for every word   10 times
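A minimal sketch of this two-step scheme is shown below, assuming gensim's Word2Vec rather than the deeplearning4j implementation mentioned above, a whitespace tokeniser, a toy corpus instead of Wikipedia, and LinearSVC as a stand-in for the supervised learner; the CBOW, window and dimensionality settings mirror the table, while min_count is lowered only so the toy corpus keeps its vocabulary.

```python
# Sketch: train a CBOW Word2Vec model, represent each tweet as the centroid of
# its word vectors, and train a supervised classifier on those centroids.
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC

corpus = [t.split() for t in [
    "el servicio fue excelente", "una experiencia horrible", "hoy llueve en jaen"]]

# sg=0 selects CBOW; vector_size and window follow the parameter table above.
w2v = Word2Vec(corpus, sg=0, vector_size=200, window=5, min_count=1, hs=1)

def centroid(tokens, model):
    """Average the vectors of the in-vocabulary tokens of a tweet."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.array([centroid(t, w2v) for t in corpus])
y = ["P", "N", "NONE"]
clf = LinearSVC().fit(X, y)
print(clf.predict([centroid("fue horrible".split(), w2v)]))
```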
3.3 Approach III: Combination of linguistic resources

The third approach that we combine is based on the joint use of several sentiment lexical resources and on the exploitation of the lexical particularities of the Twitter jargon, such as emoticons, exclamation marks and onomatopoeias of laughter. The sentiment resources used for the development of the system are:

– SentiWordNet: a knowledge base built on the structure of WordNet [11] that links each WordNet synset to three values corresponding to its likelihood of being positive, negative or neutral. A wider description of SentiWordNet can be found in [3].
– Q-WordNet: also based on WordNet. The authors of Q-WordNet [1] consider the polarity of a word to be a quality of its different senses, and therefore associate a polarity label with each WordNet sense, using two labels: Positive and Negative. Q-WordNet covers only 15,510 synsets.
– iSOL: a list of opinion-bearing words composed of 2,509 positive words and 5,626 negative words. The evaluation of the list in [12] shows its quality.

The system is built as a pipeline of processing modules that begins with a tokenisation module and finishes with the classification of the polarity of the tweet. To clarify the function of each module, we describe each of them:

– Tokenisation. We developed a tokeniser for tweets written in Spanish based on the tokeniser of Christopher Potts (http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py).
– Normalisation. Twitter users frequently employ abbreviations because of the length limitation of tweets. The subsequent disambiguation process needs well-formed input text, so abbreviations are expanded in order to ease the work of the disambiguation module. Misspellings are also common in tweets, so, for the same reason, a spelling checker based on GNU Aspell is used to correct them.
– Disambiguation. We used the graph-based word sense disambiguation algorithm UKB [2].
– Polarity classification. The polarity classification module combines the sentiment information of the three linguistic resources with the emoticons, the exclamation marks and the onomatopoeias of laughter. It assigns a sentiment score to each token of the tweet and then adds up all the sentiment scores to reach the final polarity label. Firstly, a sentiment lexicon of emoticons was compiled and labelled with four levels of sentiment, each level being assigned a sentiment score. Secondly, since an onomatopoeia of laughter is usually a signal of positive sentiment, the system assigns each laughter token a sentiment score of 0.75. The remaining tokens, which are neither emoticons nor onomatopoeias of laughter, may be sentiment-bearing words, so they are looked up in iSOL and their corresponding synsets are searched for in SentiWordNet and Q-WordNet. iSOL and Q-WordNet return a single polarity score that can be 1 (positive) or -1 (negative), whereas SentiWordNet returns two continuous polarity values, one corresponding to the likelihood of being positive and the other to the likelihood of being negative; the polarity score taken from SentiWordNet is the difference between the positive and the negative value. The positive, negative and neutral scores returned by each sentiment resource are added up, so each token has three polarity scores: one positive, one negative and one neutral. If the word is accompanied by an exclamation mark, its polarity scores are increased by 0.1 points. The final polarity score of the token is the greater of its positive and negative values. The final polarity score of the tweet (p_t) is obtained by adding up the polarity scores of all its tokens, and the sentiment label returned by the classifier is determined by Equation 1.

\[
\text{polarity} =
\begin{cases}
P+ & \text{if } p_t > 0.6 \\
P & \text{if } 0.12 < p_t \le 0.6 \\
NEU & \text{if } -0.209 \le p_t \le -0.05 \ \text{or} \ 0.02 \le p_t \le 0.12 \\
NONE & \text{if } -0.05 < p_t < 0.02 \\
N & \text{if } -0.45 < p_t < -0.209 \\
N+ & \text{if } p_t \le -0.45
\end{cases}
\tag{1}
\]
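The label mapping of Equation 1 can be written directly as a small function. This sketch encodes only the thresholds as reconstructed above and assumes the token-level scoring has already produced p_t.

```python
# Sketch: map the aggregated tweet score p_t to one of the six TASS labels
# following the thresholds of Equation 1 (as reconstructed above).
def polarity_label(p_t: float) -> str:
    if p_t > 0.6:
        return "P+"
    if 0.12 < p_t <= 0.6:
        return "P"
    if -0.209 <= p_t <= -0.05 or 0.02 <= p_t <= 0.12:
        return "NEU"
    if -0.05 < p_t < 0.02:
        return "NONE"
    if -0.45 < p_t < -0.209:
        return "N"
    return "N+"  # p_t <= -0.45

for score in (0.75, 0.3, 0.08, 0.0, -0.3, -0.9):
    print(score, polarity_label(score))
```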
4 Ensemble method and Evaluation

The main idea of the ensemble methodology is to combine a set of classifiers in order to obtain more accurate estimations than can be achieved with a single classifier [15]. Broadly speaking, the ensemble methodology attempts to learn from the errors of the base classifiers with the aim of building a more accurate final classifier.

In order to assess the effectiveness of the ensemble classifier, we performed a series of experiments over the provided datasets. The measures used are precision (Pr) and recall (R); we do not use accuracy because it is not a good measure for text categorisation over an imbalanced corpus [21]. We also use the F-score (F1), because it represents a balance between precision and recall. Since the classifier has to identify six classes, we used the macro-averaged versions of precision, recall and F1.

The results reached in the TASS workshop by the three polarity classification systems described above are shown in Table 1. It should be noted that in the 2014 edition the organisers acknowledged that the recall had been calculated incorrectly, because they treated SA as an information retrieval task; nevertheless, these are the official results, which we take as reference.

System      Precision  Recall   F1
System I    61.60%     61.60%   61.60%
System II   51.35%     51.35%   51.35%
System III  31.40%     31.40%   31.40%

Table 1. Results reached by the base classifiers in the TASS workshop

The three systems described above (see Sections 3.1, 3.2 and 3.3) can be considered base classifiers of an ensemble classifier. The results (see Table 1) and the descriptions of the systems show that the three classifiers are very different, which suggests that a first way to study the degree of complementarity among them is to combine their outputs. A straightforward methodology for combining the outputs of several classifiers is the voting scheme, and more specifically the one we describe here, the plurality rule. The plurality vote is loosely called "the majority vote", and it is the most frequently used rule of the majority vote family. According to Kuncheva [9], the plurality rule can be expressed mathematically as follows:

\[
\sum_{i=1}^{L} d_{i,k} = \max_{j=1}^{c} \sum_{i=1}^{L} d_{i,j}
\tag{2}
\]

where it is assumed that the label outputs of the classifiers are given as c-dimensional binary vectors [d_{i,1}, ..., d_{i,c}] \in \{0,1\}^c, i = 1, ..., L, with d_{i,j} = 1 if the base classifier D_i assigns label w_j to the document and 0 otherwise. In our case, the set D of Equation 2 is composed of the three base classifiers described above (see Sections 3.1, 3.2 and 3.3), and the set w is composed of the six classes (P+, P, NEU, N, N+, NONE).

In a voting system, if the number of classes or labels is greater than or equal to the number of classifiers, as in our case, the result of the vote may be a tie. There is no consensus in the literature on how to resolve ties in voting systems, so a rule has to be defined to break them. Our voting system has three base classifiers and has to identify six classes, so it is very likely that ties will occur during the classification process. Thus, two tie-breaking strategies have been defined:

1. NEU Default: we consider that an instance on which the base classifiers completely disagree should have a neutral semantic orientation.
2. NONE Default: we assume that the cause of the disagreement among the base classifiers is that the instance carries no polarity, so the system returns the label NONE.

Before showing the results reached by the voting system, we would like to remark that the performance of an ensemble method depends on the nature of its base classifiers. If the base classifiers perform well and their classification results are not homogeneous, the resulting ensemble method has a high likelihood of improving on the performance of the base classifiers. However, if there are large differences between the performance of the base classifiers, as in our case, it is highly likely that the performance of the best classifier will not be improved.
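The plurality vote with both tie-breaking strategies can be sketched as follows; representing each base classifier's output as a plain label string is our own simplification.

```python
# Sketch: plurality (majority) vote over the labels emitted by the three base
# classifiers, with a configurable default label used to break ties.
from collections import Counter

def plurality_vote(predictions, tie_default="NEU"):
    """Return the most voted label; on a tie, return tie_default."""
    counts = Counter(predictions)
    top = counts.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return tie_default  # NEU Default or NONE Default strategy
    return top[0][0]

# Three base classifiers disagree completely -> the tie-break rule decides.
outputs = ["P", "N", "NONE"]
print(plurality_vote(outputs, tie_default="NEU"))   # NEU Default strategy
print(plurality_vote(outputs, tie_default="NONE"))  # NONE Default strategy
print(plurality_vote(["P", "P", "NEU"]))            # clear plurality: P
```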
Table 2 shows the results reached by our two voting systems.

Voting system  Macro-P.  Macro-R.  Macro-F1
NEU Default    64.09%    61.91%    62.98%
NONE Default   58.49%    65.50%    61.79%

Table 2. Results reached by the voting systems

According to the results shown in Table 2, the most adequate strategy for resolving the ties of our voting system is to label as neutral those instances that provoke disagreement among the base classifiers. The reason for this behaviour has a simple explanation: in the set of possible classes (P+, P, NEU, N, N+, NONE) there are more sentiment classes (P+, P, NEU, N, N+) than no-sentiment classes (NONE), so a disagreement is more likely to be triggered because the semantic orientation of the input text is not well defined or has a low intensity of polarity; such instances are therefore good candidates to be labelled as NEU. This assertion is corroborated by the performance of the system on each class (see Table 3).

NEU Default
           P+      P       NEU     N       N+      NONE
Precision  81.06%  84.78%  19.40%  69.97%  53.64%  75.68%
Recall     58.43%  90.01%  70.15%  67.07%  42.22%  45.54%
F1         67.91%  87.32%  30.39%  68.49%  45.97%  56.86%

NONE Default
           P+      P       NEU     N       N+      NONE
Precision  81.06%  90.14%  2.00%   69.97%  53.64%  54.11%
Recall     58.43%  93.67%  58.83%  67.07%  40.22%  74.76%
F1         67.91%  91.87%  3.87%   68.49%  45.97%  62.78%

Table 3. Results reached per class by the voting systems

Are the base classifiers improved by the ensemble method proposed here? To answer this question, Table 1 and Table 2 should be compared. The two voting systems (NEU Default and NONE Default) improve on the base classifiers, and the improvement is higher when the NEU class is used as the default. Our base classifiers are very different from one another; for instance, System I reached the first position in the TASS assessment, while System III ranked 42nd out of 47 systems. It is therefore very relevant that the combination of three such dissimilar systems improves on all the base classifiers. We previously said that, when there is a big difference in the performance of the base classifiers, it is highly likely that the performance of the best one will not be improved; however, the results shown in Table 2 indicate that the combination improves on the results of the best base classifier, which is System I.

5 Conclusion and further works

The results show that the combination of several polarity classifiers improves on the base classifiers. These results encourage us to continue studying the most adequate way to combine the classification power of different methodologies. Our future work will focus on the analysis of tie resolution in the voting system, because when a tie is caused by the outputs NEU, P and P+ the system currently returns NEU as the class. We will keep analysing what the best way to combine the classifiers might be.

References

1. Agerri, R., García-Serrano, A.: Q-WordNet: Extracting polarity from WordNet senses. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA), Valletta, Malta (May 2010)
2. Agirre, E., Soroa, A.: Personalizing PageRank for word sense disambiguation. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 33–41 (2009)
3. Esuli, A., Sebastiani, F.: SentiWordNet: A publicly available lexical resource for opinion mining. In: Proceedings of LREC, vol. 6, pp. 417–422 (2006)
4. Fernández, J., Gómez, J.M., Martínez-Barco, P.: A supervised approach for sentiment analysis using skipgrams (2014)
5. Fernández, J., Gutiérrez, Y., Gómez, J.M., Martínez-Barco, P., Montoyo, A., Muñoz, R.: Sentiment analysis of Spanish tweets using a ranking algorithm and skipgrams. In: XXIX Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural (SEPLN 2013), pp. 133–142 (2013)
6. Guthrie, D., Allison, B., Liu, W., Guthrie, L., Wilks, Y.: A closer look at skip-gram modelling. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), pp. 1–4 (2006)
7. Gutiérrez, Y., González, A., Orquín, A.F., Montoyo, A., Muñoz, R.: RA-SR: Using a ranking algorithm to automatically build resources for subjectivity analysis over annotated corpora. In: WASSA 2013, p. 94 (2013)
8. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177. ACM (2004)
9. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2nd edn. (2014)
10. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)
11. Miller, G.A.: WordNet: An on-line lexical database. International Journal of Lexicography 3(4), 235–312 (1990), http://ijl.oxfordjournals.org/content/3/4.toc
12. Molina-González, M.D., Martínez-Cámara, E., Martín-Valdivia, M.T., Perea-Ortega, J.M.: Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications 40(18), 7250–7257 (2013), http://www.sciencedirect.com/science/article/pii/S0957417413004752
13. Montejo-Ráez, A., García-Cumbreras, M.A., Díaz-Galiano, M.C.: Participación de SINAI Word2Vec en TASS 2014. In: Proceedings of the TASS Workshop at SEPLN 2014 (2014)
14. Pérez-Rosas, V., Banea, C., Mihalcea, R.: Learning sentiment lexicons in Spanish. In: LREC, pp. 3077–3081 (2012)
15. Rokach, L.: Ensemble methods for classifiers. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 957–980. Springer US (2005), http://dx.doi.org/10.1007/0-387-25465-X_45
16. dos Santos, C.N., Gatti, M.: Deep convolutional neural networks for sentiment analysis of short texts. In: Proceedings of the 25th International Conference on Computational Linguistics (COLING), Dublin, Ireland (2014)
17. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys (CSUR) 34(1), 1–47 (2002)
18. Socher, R., Bauer, J., Manning, C.D., Ng, A.Y.: Parsing with compositional vector grammars. In: Proceedings of the ACL Conference (2013)
19. Strapparava, C., Valitutti, A.: WordNet-Affect: An affective extension of WordNet. In: LREC, vol. 4, pp. 1083–1086 (2004)
20. Villena-Román, J., Lana-Serrano, S., Martínez-Cámara, E., González-Cristóbal, J.C.: TASS - Workshop on Sentiment Analysis at SEPLN (2013)
21. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49. ACM (1999)