=Paper=
{{Paper
|id=Vol-1749/paper_035
|storemode=property
|title=On the performance of B4MSA on SENTIPOLC'16
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_035.pdf
|volume=Vol-1749
|authors=Daniela Moctezuma,Eric S. Tellez,Mario Graff,Sabino Miranda-Jiménez
|dblpUrl=https://dblp.org/rec/conf/clic-it/MoctezumaTGM16
}}
==On the performance of B4MSA on SENTIPOLC'16==
Daniela Moctezuma (CONACyT-CentroGEO, Circuito Tecnopolo Norte No. 117, Col. Tecnopolo Pocitos II, C.P. 20313, Ags, México) dmoctezuma@centrogeo.edu.mx

Eric S. Tellez, Mario Graff, Sabino Miranda-Jiménez (CONACyT-INFOTEC, Circuito Tecnopolo Sur No. 112, Fracc. Tecnopolo Pocitos II, C.P. 20313, Ags, México) eric.tellez@infotec.mx, mario.graff@infotec.mx, sabino.miranda@infotec.mx

===Abstract===

This document describes the participation of the INGEOTEC team in the SENTIPOLC 2016 contest. Two approaches are presented, B4MSA and B4MSA + EvoDAG, tested on Task 1 (subjectivity classification) and Task 2 (polarity classification). For polarity classification, one constrained and one unconstrained run were conducted; for subjectivity classification, only a constrained run was submitted. In our methodology we explored a set of techniques such as lemmatization, stemming, entity removal, character-based q-grams, and word-based n-grams, among others, to prepare different text representations, in this case applied to the Italian language. The results are reported both with the official competition measures and with other well-known performance measures such as macro and micro F1 scores.

===1 Introduction===

Nowadays, sentiment analysis has become a problem of interest for governments, companies, and institutions due to the possibility of massively sensing the mood of people on social networks in order to support decision-making processes. This new way of knowing what people think about something poses challenges to the natural language processing and machine learning areas; the first of these is that people on social networks largely ignore formal writing.
For example, a typical Twitter user does not follow formal writing rules and introduces new lexical variations indiscriminately; the use of emoticons and the mixing of languages are also part of the common lingo. These characteristics produce high-dimensional representations, where the curse of dimensionality makes it hard to learn from examples.

There exist a number of strategies to cope with sentiment analysis of Twitter messages. Some of them rely on the fact that the core problem is fixed: we are looking for evidence of some sentiment in the text. Under this scheme, a number of dictionaries have been described by psychologists, and other resources, such as SentiWordNet, have been created by adapting well-known linguistic resources with machine learning. There is a lot of work around this approach; however, all this knowledge is language dependent and requires a deep understanding of the language being analyzed. Our approach is mostly independent of this kind of external resource and instead focuses on tackling misspellings and other common errors in the text.
In this manuscript we detail our approach to sentiment analysis from a language-agnostic perspective; for instance, no one on our team knows the Italian language. We use neither external knowledge nor specialized parsers. Our aim is to create a solid baseline from a multilingual perspective that can be used as a real baseline for challenges like SENTIPOLC'16 and as a basic initial approximation for sentiment analysis systems.

The rest of the paper is organized as follows. Section 2 describes our approach, Section 3 describes our experimental results, and finally Section 4 concludes.

===2 Our participation===

This participation is based on two approaches. The first is the B4MSA method, a simple approach that starts by applying text transformations to the tweets; the transformed tweets are then represented in a vector space model, and finally a Support Vector Machine with a linear kernel is used as the classifier. The second is B4MSA + EvoDAG, a combination of this simple approach with a genetic programming scheme.

====2.1 Text modeling with B4MSA====

B4MSA is a system for multilingual polarity classification that, due to its simplicity, can serve both as a baseline and as a framework to build sophisticated sentiment analysis systems. The source code of B4MSA can be downloaded freely from https://github.com/INGEOTEC/b4msa.

We used our previous work, B4MSA, to tackle the SENTIPOLC challenge. Our approach learns from training examples, avoiding any pre-digested knowledge such as dictionaries or ontologies. This scheme allows us to address the problem without caring about the particular language being tackled. The dataset is converted to a vector space using a standard procedure: the text is normalized, tokenized, and weighted. The weighting is fixed to TF-IDF (Baeza-Yates and Ribeiro-Neto, 2011). After that process, a linear SVM (Support Vector Machine) is trained using 10-fold cross-validation (Burges, 1998). At the end, this classifier is applied to the test set to obtain the final prediction.

At a glance, our goal is to find the best performing normalization and tokenization pipelines. We state the modeling as a combinatorial optimization problem; then, given a performance measure, we try to find the best performing configuration among a large parameter space. The transformations and tokenizers considered are listed below. All of them are either simple to implement or available in an open-source library (e.g., Bird et al., 2009; Řehůřek and Sojka, 2010).

====2.2 Set of Features====

In order to find the best performing configuration, we used two sorts of features, which we treat as parameters: cross-language and language-dependent features.

Cross-language features can be applied to most similar languages and similar surface features: removing or keeping punctuation (question marks, periods, etc.) and diacritics from the original source, and applying or not applying case normalization (text into lowercase) and symbol reduction (collapsing repeated symbols into one occurrence). The word-based n-grams (n-words) feature produces word sequences according to a defined window size: the text is tokenized and the tokens are combined, so 1-words (unigrams) are each word alone, 2-words (bigrams) are the sequences of two consecutive words, and so on (Jurafsky and Martin, 2009). Character-based q-grams (q-grams) are sequences of characters; for example, 1-grams are the symbols alone and 3-grams are sequences of three symbols. In general, given a text of m characters, we obtain a set with at most m − q + 1 elements (Navarro and Raffinot, 2002). Finally, the emoticon (emo) feature consists in keeping, removing, or grouping the emoticons that appear in the text; popular emoticons, including text emoticons and the set of Unicode emoticons (Unicode, 2016), were hand-classified as positive, negative, or neutral.
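As a concrete illustration of the two tokenizers above, word-based n-grams and character-based q-grams can be computed as follows. This is a minimal self-contained sketch, not B4MSA's actual code; the function names and the whitespace-based word splitting are our own simplifications.

```python
def word_ngrams(text, n):
    """Word-based n-grams: every window of n consecutive tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_qgrams(text, q):
    """Character-based q-grams: a text of m characters yields at most
    m - q + 1 substrings of length q."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]
```

For instance, `word_ngrams("la pizza era buona", 2)` returns `["la pizza", "pizza era", "era buona"]`, and `char_qgrams("ciao", 3)` returns `["cia", "iao"]`.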
Language-dependent features. We considered three language-dependent features: stopwords, stemming, and negation. These processes are either applied or not applied to the text. The stopword and stemming processes use, respectively, the stopword data and the Snowball stemmer for Italian from the NLTK Python package (Bird et al., 2009). Negation markers can change the polarity of a message; we used a set of language-dependent rules for common negation structures to attach the negation clue to the nearest word, similar to the approach used in (Sidorov et al., 2013).

====2.3 Model Selection====

Model selection, sometimes called hyper-parameter optimization, is the key of our approach. The default search space of B4MSA contains more than 331 thousand configurations when limited to multilingual and language-independent parameters, and reaches close to 4 million configurations when we add our three language-dependent parameters. Depending on the size of the training set, each configuration needs several minutes on a commodity server to be evaluated; thus, an exhaustive exploration of the parameter space would be expensive enough to make the approach useless.

To reduce the selection time, we perform a stochastic search with two algorithms: random search and hill climbing. First, we apply random search (Bergstra and Bengio, 2012), which consists in randomly sampling the parameter space and selecting the best configuration among the sample. The second algorithm is hill climbing (Burke et al., 2005; Battiti et al., 2008), implemented with memory to avoid testing a configuration twice. The main idea behind hill climbing is to take a pivot configuration (in our case, the best one found by random search), explore the configuration's neighborhood, and greedily move to the best neighbor; the process is repeated until no improvement is possible. The neighborhood of a configuration is defined as the set of configurations that differ from it in just one parameter's value.

Finally, the performance of the final configuration is obtained by applying the above procedure with cross-validation over the training data.

====2.4 B4MSA + EvoDAG====

In the polarity task, besides submitting B4MSA, which is a constrained approach, we decided to generate an unconstrained submission with the following approach. The idea is to build an additional dataset that is automatically labeled with positive and negative polarity using Distant Supervision (Snow et al., 2005; Morgan et al., 2004).

We started by collecting tweets written in Italian through the Twitter stream; in total, we collected more than 10,000,000 tweets. From these tweets, we kept only those whose emoticons were consistent in polarity, e.g., a tweet containing only emoticons of positive polarity. The polarity of the whole tweet was then set to the polarity of its emoticons, and we used only positive and negative polarities. Furthermore, we decided to balance the set, which required removing a large number of positive tweets. At the end, this external dataset contains 4,550,000 tweets, half of them positive and the other half negative.

Once this external dataset was created, we split it into batches of 50,000 tweets, half positive and half negative. This decision was taken to optimize the time needed to train an SVM, and also because around this number of tweets the macro F1 metric is close to its maximum value; that is, this number gives a good trade-off between training time and classifier performance. In total there are 91 batches.

For each batch we trained an SVM, so at the end of this process we have 91 predictions per tweet (using the decision function). Besides these 91 predictions, each tweet is also predicted, again using the decision function, with B4MSA. That is, at the end of this process we have 94 values for each tweet: a matrix with 7,410 rows and 94 columns for the training set and one with 3,000 rows and 94 columns for the test set. Moreover, for the training-set matrix we also know the class of each row. It is important to note that all the values of these matrices are predicted; in the B4MSA case, for example, we used 10-fold cross-validation on the training set in order to obtain predicted values.

Clearly, at this point the problem is how to make a final prediction; however, we have built a classification problem from the decision functions and the classes provided by the competition. Thus, it is straightforward to tackle this classification problem using EvoDAG (Evolving Directed Acyclic Graph, https://github.com/mgraffg/EvoDAG) (Graff et al., 2017), a genetic programming classifier that uses semantic crossover operators based on orthogonal projections in the phenotype space. In a nutshell, EvoDAG was used to ensemble the outputs of the 91 SVMs trained on the automatically labeled dataset together with B4MSA's decision functions.
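The stochastic model selection of Section 2.3 (random search followed by hill climbing with memory) can be sketched as follows. This is a minimal illustration, not B4MSA's actual implementation; `space` and `score` are placeholders for B4MSA's real configuration space and its cross-validated performance measure.

```python
import random

def model_selection(space, score, n_samples=32, seed=0):
    """Random search over `space` (dict: parameter -> list of values),
    then hill climbing from the best sampled configuration."""
    rng = random.Random(seed)
    seen = {}

    def evaluate(cfg):
        key = tuple(sorted(cfg.items()))
        if key not in seen:  # memory: never evaluate a configuration twice
            seen[key] = score(cfg)
        return seen[key]

    # 1) Random search: sample the space and keep the best configuration.
    best = max(
        [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_samples)],
        key=evaluate,
    )

    # 2) Hill climbing: a neighbour differs from the pivot in exactly one
    #    parameter's value; greedily move to the best neighbour until no
    #    improvement is possible.
    improved = True
    while improved:
        improved = False
        neighbours = [
            dict(best, **{param: value})
            for param, values in space.items()
            for value in values
            if value != best[param]
        ]
        candidate = max(neighbours, key=evaluate)
        if evaluate(candidate) > evaluate(best):
            best, improved = candidate, True
    return best
```

For instance, with a toy space `{"weighting": ["tf", "tfidf"], "qgram": [1, 3, 5]}` and a score that rewards TF-IDF weighting with 3-grams, the search returns `{"weighting": "tfidf", "qgram": 3}`.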
===3 Results and Discussion===

This section presents the results of the INGEOTEC team. In this participation we submitted two kinds of runs: constrained runs with the B4MSA system and an unconstrained run with B4MSA + EvoDAG. The constrained runs were conducted only with the dataset provided by the SENTIPOLC'16 competition; for more technical details on the dataset and the competition in general, see (Barbieri et al., 2016).

The unconstrained run used an additional dataset of 4,550,000 tweets labeled with the Distant Supervision approach. Distant Supervision is an extension of the paradigm used in (Snow et al., 2005), closest to the use of weakly labeled data in (Morgan et al., 2004). In this case, we used emoticons as the key for automatic labeling: a tweet with a clear presence of positive emoticons is labeled as the positive class, and a tweet with a clear presence of negative emoticons is labeled as the negative class. This gives us a much larger number of training samples.

With the constrained runs we participated in two tasks, subjectivity and polarity classification; with the unconstrained run we participated only in the polarity classification task.

Table 1 shows the results of the subjectivity classification task (B4MSA method). Here, Prec0, Rec0, and FSc0 are the precision, recall, and F-score of class 0; Prec1, Rec1, and FSc1 are the same measures for class 1; and FScavg is the average of the F-scores. The evaluation measures are explained in (Barbieri et al., 2016).

{|
|+ Table 1: Results on subjectivity classification
! Prec0 !! Rec0 !! FSc0 !! Prec1 !! Rec1 !! FSc1 !! FScavg
|-
| 0.56 || 0.80 || 0.66 || 0.86 || 0.67 || 0.75 || 0.70
|}

Table 2 shows the results on the polarity classification task. In this task our B4MSA method achieves an average F-score of 0.6054, and our combination B4MSA + EvoDAG reaches an average F-score of 0.6075. These results place us at positions 18 (unconstrained run) and 19 (constrained run) out of 26 entries.

{|
|+ Table 2: Results on polarity classification
! Run !! FScorepos !! FScoreneg !! Combined FScore
|-
| Constrained run (B4MSA) || 0.6414 || 0.5694 || 0.6054
|-
| Unconstrained run (B4MSA + EvoDAG) || 0.5944 || 0.6205 || 0.6075
|}

It is important to mention that the difference between our two approaches is very small; since B4MSA + EvoDAG is computationally more expensive, we had expected a considerable improvement in performance. These results should clearly be investigated further; our first impression is that our Distant Supervision approach should be finely tuned, i.e., the polarity of the emoticons and the complexity of the tweets need to be verified.

Finally, Table 3 presents the measures used in our internal evaluation, macro F1 and micro F1 (for more details see (Sebastiani, 2002)), for the polarity unconstrained run (B4MSA + EvoDAG), the polarity constrained run (B4MSA), the subjectivity constrained run (B4MSA), and irony classification (B4MSA). We did not participate in the irony classification task, but we want to show the result obtained by our B4MSA approach on it.

{|
|+ Table 3: Macro F1 and micro F1 results of our approaches
! Run !! Macro F1 !! Micro F1
|-
| Polarity Unconstrained || 0.5078 || 0.5395
|-
| Polarity Constrained || 0.5075 || 0.5760
|-
| Subjectivity Constrained || 0.7137 || 0.721
|-
| Irony Constrained || 0.4687 || 0.8825
|}

===4 Conclusions===

In this work we described the INGEOTEC team's participation in the SENTIPOLC'16 contest. Two approaches were used: first, the B4MSA method, which combines several text transformations of the tweets; second, B4MSA + EvoDAG, which combines the B4MSA method with a genetic programming approach. In the subjectivity classification task, the obtained results place us seventh out of a total of 21 places. In the polarity classification task, our results place us at positions 18 and 19 out of a total of 26. Since our approach is simple and easy to implement, we consider these results important, given that we use no affective lexicons or other complex linguistic resources. Moreover, our B4MSA approach was tested internally on the irony classification task, with results of 0.4687 macro F1 and 0.8825 micro F1.

===References===

* Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. 2011. Modern Information Retrieval. Addison-Wesley, 2nd edition.
* Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).
* Roberto Battiti, Mauro Brunato, and Franco Mascia. 2008. Reactive Search and Intelligent Optimization, volume 45. Springer Science & Business Media.
* James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
* Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.
* Christopher J.C. Burges. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167.
* Edmund K. Burke, Graham Kendall, et al. 2005. Search Methodologies. Springer.
* Mario Graff, Eric S. Tellez, Hugo Jair Escalante, and Sabino Miranda-Jiménez. 2017. Semantic Genetic Programming for Sentiment Analysis. In Oliver Schütze, Leonardo Trujillo, Pierrick Legrand, and Yazmin Maldonado, editors, NEO 2015, number 663 in Studies in Computational Intelligence, pages 43–65. Springer International Publishing. DOI: 10.1007/978-3-319-44003-3_2.
* Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing (2nd edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
* Alexander A. Morgan, Lynette Hirschman, Marc Colosimo, Alexander S. Yeh, and Jeff B. Colombe. 2004. Gene name identification and normalization using a model organism database. Journal of Biomedical Informatics, 37(6):396–410.
* G. Navarro and M. Raffinot. 2002. Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. Cambridge University Press.
* Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.
* Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.
* Grigori Sidorov, Sabino Miranda-Jiménez, Francisco Viveros-Jiménez, Alexander Gelbukh, Noé Castro-Sánchez, Francisco Velásquez, Ismael Díaz-Rangel, Sergio Suárez-Guerra, Alejandro Treviño, and Juan Gordon. 2013. Empirical study of machine learning based approach for opinion mining in tweets. In Proceedings of the 11th Mexican International Conference on Advances in Artificial Intelligence, MICAI'12, Part I, pages 1–14, Berlin, Heidelberg. Springer-Verlag.
* Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1297–1304. MIT Press.
* Unicode. 2016. Unicode emoji chart. http://unicode.org/emoji/charts/full-emoji-list.html. Accessed 20-May-2016.