Opinion Analysis Applied to Politics: A case study based on Twitter

Gilberto Nunes
Federal Institute of Education, Science and Technology of Piauí - IFPI / Picos, Piauí, Brazil
gilberto.nunes@ifpi.edu.br

Denivaldo Lopes and Zair Abdelouahab
Federal University of Maranhão - UFMA / São Luís, Maranhão, Brazil
denivaldo.lopes@ufma.br, zair@dee.ufma.br

Abstract

Nowadays, social networks such as Facebook and Twitter are openly available to everyone around the world over the Internet. These websites provide functionality at no cost, such as the creation and editing of communities and social networks, support for a large variety of multimedia content (e.g. audio and video), and support for interactive communication (e.g. chats and posts). Twitter users post comments about a range of subjects, such as products, famous people and politics. The dissemination of information in these social networks should be considered because of their global coverage. An important functionality of Twitter is its support for georeferenced posts, which makes it possible to locate where posts were made. In this paper, we propose an approach to Sentiment Analysis, or Opinion Mining. Our approach is based on Web Mining, Opinion Mining, Geographic Information Systems (GIS) and Machine Learning in order to recover relevant information from tweets. The information recovered by our approach is essential to support the verification of population trends, e.g. in the political domain. We propose a prototype that analyzes population trends, in particular in Brazil's political context and the ongoing impeachment process.

Keywords: Web Mining; Opinion Mining; Machine Learning; Opinion Analysis; Twitter; Geographic Information System.

1 Introduction

Prasetyo and Hauff (2015), Jungherr (2013) and Lampos (2012) propose approaches to estimate voting intention based on information recovered from Twitter.

In this paper, we propose another approach, based on opinion analysis applied to politics, in order to collect information from Twitter and determine public opinion about the current impeachment process in Brazil, to which the president elected in October 2014 is being subjected.

During an impeachment process, as during an electoral process, opinion surveys are applied, such as those presented by Rothschild¹. He says that this kind of opinion survey is generally based on data obtained from printed forms filled in by the population. Our approach uses opinion analysis on messages obtained from Twitter to determine the Brazilian population's opinion about the impeachment of Brazil's president.

¹ Forecasting Elections: Voter Intentions versus Expectations - Brookings Institution - Link for ebook: http://www.brookings.edu/research/papers/2012/11/01-voter-expectations-wolfers.

According to Currie (1998), impeachment is a process that can result in the removal of a person from public office after this person has violated the Constitution of their country.

In this paper, we show an approach based on knowledge discovery from textual sources, data from social networks (e.g. Twitter), Web Mining, Opinion Mining, Geographic Information Systems and Machine Learning. By applying our proposed approach, opinion trends about impeachment can be identified in the Brazilian population.

This paper is organized as follows. Section 2 discusses related works. Section 3 presents some fundamental concepts for this research work. Section 4 presents our approach for performing opinion analysis on data obtained from Twitter, with Web Mining support, about the development of the impeachment process in Brazil. Section 5 presents some results about the development of the impeachment process in Brazil. Section 6 shows a case study on the impeachment process. Section 7 presents some conclusions and future directions.

2 Related Works

Most of the related works use sentiment analysis and opinion mining to evaluate voting intentions, taking into consideration only the content of the posts. For illustration purposes, we use the following approaches: Prasetyo and Hauff (Dwi Prasetyo and Hauff, 2015), Jungherr (Jungherr, 2013) and Lampos (Lampos, 2012), described below.

Prasetyo and Hauff (Dwi Prasetyo and Hauff, 2015) propose the use of Twitter-based election forecasting, sentiment analysis and machine learning techniques to determine voting intention for Indonesia's 2014 presidential election.

Jungherr (Jungherr, 2013) presents a work that uses four metrics to determine voting intention: the total number of hashtags mentioning a given political party; the dynamics between positive and negative mentions of a given political party; the total number of hashtags mentioning one of the candidates; and the total number of users who used hashtags mentioning a given party or candidate. The study covers Germany's 2009 federal election.

Lampos (Lampos, 2012) presents a study of techniques and patterns for extracting positive or negative sentiment from tweets, which build on each other, through a supervised approach for turning sentiment into voting intention percentages. The study covers the United Kingdom's 2010 General Election.

Unlike the approaches mentioned above, our work uses georeferenced data in addition to the textual content. Thus we can easily perform a spatial analysis, as shown in the proposed case study (see Section 6). In the next section, we describe the technologies used in our case study.

3 Overview

In this section, we present the subjects Web Mining, Opinion Mining, Geographic Information System (GIS), Machine Learning and Twitter.

3.1 Web Mining

Web Mining is the process of extracting data or information from web sources, as described by Zhang (2011). According to the author (Zhang, 2011), Web Mining aims to find useful knowledge on the Web, building on data mining, text mining and multimedia to combine traditional data mining techniques with the Web. This type of mining can be subdivided into Web Content Mining, Web Usage Mining and Web Structure Mining. Web Content Mining refers to the extraction of Web page content; the text contained in those pages is a good example of content to be extracted. Web Usage Mining is the automated recognition of user utilization patterns on a Web site. Web Structure Mining is based on the interconnections between data or information in documents or sites on the Web. Figure 1 illustrates the subdivision.

Figure 1: Basic Taxonomy for Web Mining. Source of the image: (Zhang, 2011).

3.2 Opinion Mining

Opinion mining can be defined as a computational technique that deals with opinion in textual sources (Pang and Lee, 2008). It aims to extract information based on sentiment analysis (e.g. positive, negative and neutral) expressed by one or more writers in their texts (Pang and Lee, 2008). Opinion mining analyzes a large volume of textual documents that cover a range of subjects, such as entertainment, politics, education and marketing. Social networks like Twitter have enabled their users to express and share opinions and points of view. Thus, social networks can be seen as a large volume of documents in textual and digital form.

3.3 Geographic Information System (GIS)

According to Nuhcan (2014), a Geographic Information System (GIS) can be understood as a computational information system like any other, but its distinguishing feature is the database that stores georeferenced data, i.e. the database includes latitude/longitude information linked to the data. Initially, GIS applications were restricted to desktop computers, but nowadays they are present on the Web (map servers) and on smartphones (map applications).
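To make the notion of georeferenced data more concrete, the following is a minimal sketch in Python of a record that carries latitude/longitude together with its text, and of a lookup that maps such a record to the nearest reference point. The names GeoPost, CAPITALS, haversine_km and nearest_capital, the sample text, and the coordinates (rounded approximations for five state capitals, one per Brazilian region) are illustrative assumptions and are not taken from the prototype described in this paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoPost:
    """A piece of data with latitude/longitude linked to it (hypothetical type)."""
    text: str
    lat: float
    lon: float


# Approximate, rounded coordinates (assumption): name -> (lat, lon, region).
CAPITALS = {
    "Manaus": (-3.12, -60.02, "North"),
    "Salvador": (-12.97, -38.50, "Northeast"),
    "Brasilia": (-15.79, -47.88, "Midwest"),
    "Sao Paulo": (-23.55, -46.63, "Southeast"),
    "Porto Alegre": (-30.03, -51.23, "South"),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_capital(post):
    """Return (capital, region) for the reference point closest to the post."""
    name = min(
        CAPITALS,
        key=lambda c: haversine_km(post.lat, post.lon,
                                   CAPITALS[c][0], CAPITALS[c][1]),
    )
    return name, CAPITALS[name][2]


if __name__ == "__main__":
    sample = GeoPost("example text", -23.50, -46.60)
    print(nearest_capital(sample))
```

A real GIS would store such records in a spatial database and use region polygons rather than a nearest-point rule; the nearest-point lookup is used here only to keep the sketch self-contained.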
3.4 Machine Learning

Machine Learning is a subarea of Artificial Intelligence whose focus is to develop computational methods that provide intelligent behavior to computers (Arel et al., 2010). Examples of Machine Learning algorithms are Support Vector Machine (SVM) (Joachims, 1998), Random Forest (Breiman, 2001) and Naive Bayes (Frank and Bouckaert, 2006).

3.5 Twitter

Twitter is a social networking service that enables users to send and receive messages called tweets, which have a maximum size of 140 characters per post. Twitter hosts a large amount of content, such as profiles, general information, tweets, emotions, hashtags and others (Tiara et al., 2015). This social network basically provides two APIs² to support the recovery of data: the Search API and the Streaming API. In our approach, we apply the Twitter API in order to recover the tweets and the georeferenced location where they were posted.

² Documentation Twitter Developers - Link for documentation: https://dev.twitter.com/overview/documentation.

3.6 Metrics Evaluation

Once the Twitter data have been collected and processed, a mechanism is needed to determine the validity of the classification applied (Sokolova and Lapalme, 2009). Table 1 introduces the confusion matrix that is used to assist the calculation of the evaluation.

Table 1: Confusion Matrix.

                  Predicted positive     Predicted negative
Real positive     TP (True Positive)     FN (False Negative)
Real negative     FP (False Positive)    TN (True Negative)

Source of data: (Sokolova and Lapalme, 2009).

The metrics used in this article are derived from the Confusion Matrix: Accuracy, Sensitivity or Recall, Specificity, F1-Score and Precision (Sokolova and Lapalme, 2009).

4 Proposed Approach for Opinion Mining Applied to Politics

Our proposed approach for opinion mining applied to politics is based on Knowledge Discovery in Databases (KDD) (Fayyad et al., 1996). To reach the objectives of this article, we developed an approach that consists of five steps. This approach has been implemented by a Software Prototype³ that assists its execution. The prototype is composed of two modules, one for the recovery of data (which works with the Search API) and another for the analysis of data (which works with the WEKA API). In the first step, the data (tweets) are acquired. In the second step, the data are preprocessed to remove noisy structures. In the third step, features are extracted from the data using TF-IDF (Robertson, 2004). In the fourth step, Text Mining is performed by applying the Machine Learning algorithms presented previously. The fifth and last step is the evaluation of the results obtained, through the analysis of the Confusion Matrix (Sokolova and Lapalme, 2009). Finally, we obtain the positive and negative opinions. Figure 2 introduces the proposed approach, which is based on KDD (Fayyad et al., 1996).

Figure 2: Proposed Approach for Opinion Mining (Based on KDD (Fayyad et al., 1996)).

It is worth mentioning the importance of the WEKA⁴ tool and its API during the execution of the steps in this approach, with the exception of the data acquisition step.

³ Software Prototype - It is the result of applying a software process, as defined in (Sommerville, 2006).
⁴ Machine Learning Group at the University of Waikato - Version 3.7.12 and documentation, link: http://www.cs.waikato.ac.nz/ml/weka/documentation.html.

4.1 Acquisition of Data

The data (tweets) were recovered using the Twitter Search API. The recovery of tweets requires Web Mining, specifically Web Content Mining, in which the texts contained in the posts of Twitter users are recovered. A total of 1,218 georeferenced tweets were collected, based on posts related to the impeachment of President Dilma during March 2016 which contained the following hashtags:
#FicaDilma, #SouMaisDilma, #NaoVaiTerGolpe, #FicaPT, #FicaLula, #NaoAoGolpe, #ForaDilma, #ForaPT, #ForaPTralhas, #ForaLula, #ForaDilmaLulaPT,
#DilmaNao, #VaiTerImpeachment, #NaiVaiTerGolpeVaiTerImpeachment.

The georeferencing of tweets corresponds to the 26 Brazilian state capitals and the Brazilian Federal District. This step is performed by the recovery module of the prototype during the search.

4.2 Preprocessing

Before feature extraction, it is important to remove unwanted structures from the tweets, such as hyperlinks, irrelevant words, special characters and other references. After removing those, stemming and normalization are applied to the tweets. It is important to emphasize that preprocessing occurs on copies of the collected tweets (corpus (Khairnar and Kinikar, 2013)). This keeps the original tweets intact and avoids inconsistencies. Like the previous step, this step is performed by the recovery module of the prototype.

4.3 Feature Extraction

After the preprocessing stage, tweets were submitted to the feature extraction process through the TF-IDF (Robertson, 2004) method. Once the TF-IDF (Robertson, 2004) method has been applied, the tweets are represented by a matrix of numeric values (bag-of-words model), following the mathematical definition of the TF-IDF (Robertson, 2004) model. Text Mining often uses a representation model known as the "bag-of-words model" as its feature set; the model used in this paper was created with the WEKA⁴ tool and its StringToWordVector method. In this model, documents are represented as word vectors. Thus, all documents are represented as one large document/term matrix. In this paper, TF-IDF (Robertson, 2004) was used as the cell value to dampen the importance of terms that appear in many documents. This step is performed by the analysis module of the prototype.

4.4 Text Mining

Once the matrix of numeric values has been generated, these values are used as inputs to the classification algorithms presented previously. These algorithms look for interpretable patterns in the matrix of values in order to classify the tweets as positive or negative regarding Dilma's impeachment. This step is also performed by the analysis module.

4.5 Evaluation

Lastly, the classification of the data is evaluated through the confusion matrix and its metrics. This yields information, which in turn provides the acquisition of knowledge at the end of the KDD process (Fayyad et al., 1996).

5 Results

The Results section of this research is divided into three subsections. Subsection 5.1 describes the database that contains the samples used for training and testing. Subsection 5.2 covers the training and test models. Subsection 5.3 shows the results for the classification of tweets.

5.1 Database

In this research, the database has 500 positive and 500 negative samples of tweets from posts related to the impeachment of President Dilma, totaling 1,000 samples. It is worth emphasizing that the samples were divided only into positives and negatives, because the neutral samples were not representative, as seen during the experiments. Samples were collected automatically through the Search API, but the labeling process was performed manually. During manual labeling, the aim was to select samples that were representative for the classification process, that is, as varied as possible. Note that the tweets used in this subsection are different from those used in the Case Study (Section 6). They are georeferenced to the state capitals and the Federal District of Brazil and come from a period of posts other than March 2016. Thus, we seek to avoid potential problems in the classification of tweets.

5.2 Training and Test Models

The training and test models were generated with the help of the WEKA⁴ tool, using its implementations of the classification algorithms (SVM, Naive Bayes and Random Forest) needed to create the models. The following scenarios were generated for the training and test split:

• 80% of the samples for training and 20% for testing;

• 60% of the samples for training and 40% for testing;

• 40% of the samples for training and 60% for testing;

• 20% of the samples for training and 80% for testing.

The algorithm that showed the best model was used in the case study of this paper. The results for the proposed scenarios and the best configurations for each algorithm can be seen in Subsection 5.3.

5.3 Training and Cross-Validation results

Table 2: Results obtained with the application of metrics for each of the proportions using the classification algorithms.

Algorithm               Training/Test   AC      SE      ES
SVM¹ (Linear Kernel)    80%-20%         96.7%   98.1%   95.7%
                        60%-40%         96.9%   97.1%   96.7%
                        40%-60%         96.5%   98.0%   95.5%
                        20%-80%         95.5%   97.4%   94.2%
Random Forest²          80%-20%         95.1%   94.5%   95.5%
                        60%-40%         94.5%   93.0%   95.6%
                        40%-60%         94.9%   91.6%   97.6%
                        20%-80%         95.8%   96.4%   97.7%
Naive Bayes³            80%-20%         77.5%   76.2%   78.4%
                        60%-40%         84.8%   84.4%   85.1%
                        40%-60%         85.0%   83.8%   85.8%
                        20%-80%         89.3%   94.0%   86.7%

Legend: AC: Accuracy; SE: Sensitivity; ES: Specificity.
¹ WEKA parameters - kernel type: linear; SVM type: C-SVC (classification); gamma: 0.5; other parameters with default values.
² WEKA parameters - number of trees: 10; other parameters with default values.
³ WEKA parameters - all parameters with default values.

According to Table 2, the highest accuracy was found for the 60%-40% proportion using the SVM, with a hit rate of 96.9%. The lowest accuracy in Table 2 was recorded by Naive Bayes, with a hit rate of 77.5% for the 80%-20% proportion; its best accuracy, 89.3%, was obtained for the 20%-80% proportion.

According to the Sensitivity results in Table 2, we can conclude that the SVM has the highest rate of true positives, with a Sensitivity of 98.1% for the 80%-20% proportion.

Analyzing the Specificity data in Table 2, one can infer that Random Forest presents the best result for true negatives, with a Specificity of 97.7%, which Table 2 lists for the 20%-80% proportion.

Table 3: Results obtained with the application of metrics for each of the proportions using the classification algorithms for Cross-validation.

Algorithm               Number of folds   PR      RE      F1
SVM¹ (Linear Kernel)    10                98.5%   97.8%   98.4%
Random Forest²          10                98.2%   97.5%   98.1%
Naive Bayes³            10                76.5%   82.9%   76.4%

Legend: PR: Precision; RE: Recall; F1: F1-Score.
Notes 1 to 3: same WEKA parameters as in Table 2.

According to the results for Precision, Recall and F1-Score in Table 3, we can conclude that the SVM has the highest rates for the metrics used in cross-validation with 10 folds, with a Precision of 98.5%, a Recall of 97.8% and an F1-Score of 98.4%.

6 Case study

In this case study, a total of 1,218 georeferenced tweets were analyzed; tweets that were not georeferenced were discarded. These posts refer to March 2016 and are linked to the impeachment process of the country's president. This period was selected because of two large demonstrations scheduled for that month. The first demonstration, in favor⁵ of impeachment, took place on the 13th, and the second, against⁶ it, on the 31st.

⁵ Check the location and time of the demonstrations of March 13 - Congress in focus - Link for news: http://congressoemfoco.uol.com.br/noticias/confira-o-horario-e-o-local-das-manifestacoes-de-13-de-marco/.
⁶ Manifestations against the coup are scheduled for this Thursday (31/03) - Link for news: http://www.pragmatismopolitico.com.br/2016/03/manifestacoes-contra-o-golpe-estao-agendadas-para-esta-quinta-feira-3103.html.

Figure 3 presents the results for the tweets collected and analyzed with the proposed approach, in the form of a map of the regions of Brazil. The map was plotted using GeoServer (Web Map), as shown in (Huang and Xu, 2011), based on shapefiles⁷ of the five regions of Brazil. The occurrences are posts containing the hashtags cited previously in Subsection 4.1.

⁷ Shapefiles - It is a well-known format for storing geospatial resources in files, site: http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf

Figure 3: Approach to implementing proposed policy opinion analysis: map with trends impeachment for regions of Brazil.

Analyzing Figure 3, one can see that in the Midwest, Southeast and South most records are in favor of impeachment. In the Northeast region, slightly more than half of the records are opposed to impeachment. The North is the only one of the five regions that shows the same result for both opinions.

It is important to note that other studies related to the impeachment process have been carried out in Brazil since 2015, when the first signs of the process appeared. One of those studies, presented in Veja⁸ magazine, is very similar to the one in this study. In it, the magazine reports the results of a survey of social networks by the company Torabit⁹, in which 49.3% of posts on social networks are favorable to impeachment and only 31.7% are contrary. Considering the results of that report and of the proposed approach, it can be seen that the present work presents valid trends in relation to the impeachment process. Notably, the proposed work reports trends by region, which the work done by Torabit⁹ does not.

⁸ 49% of mentions on social networks are pro-impeachment, study shows - Radar Online - VEJA.com - Link for news: http://veja.abril.com.br/blog/radar-on-line/sem-categoria/49-das-mencoes-em-redes-sociais-sao-pro-impeachment-mostra-estudo/.
⁹ Page Home - Torabit - Link for site: http://www.torabit.com.br/.

To standardize the presentation of the data in the map plotted by the proposed approach (see Figure 3), we used a chart to make it easier to understand, as shown in Figure 4. Figure 4 shows, in simplified form, the percentages by region for each of the opinions, whether favorable or contrary to the impeachment.

Figure 4: Graphic trends for the impeachment for regions of Brazil, based on the proposed approach.

The proposed work presents information by region, which can be checked against traditional opinion surveys. Such surveys use past data, while the proposed work can use past or current data. Monthly data, from March 2016, were used in the proposed case study. Collecting and analyzing daily data can identify possible trends and allows strategic actions in general to be targeted, whether carried out by movements favorable or contrary to impeachment.

It is important to note that the case study could be carried out for other periods of Twitter posts. In such a new study, the training and testing phases can be dispensed with, since the models were obtained in the previous study and could be reused for other periods. Applying this new study should reveal possible trends for the impeachment process in the selected period.

7 Conclusion

We conclude that the proposed approach achieved the goal of providing a solution based on opinion mining to identify political trends according to public opinion. The result obtained with the proposed work, which collects and processes data from Twitter, is valid and resembles that of other work.

The results of this study suggest that data from social networks such as Twitter (available through its API) can be used for public opinion research purposes that go beyond a simple mechanism for broadcasting content. These networks provide a range of opportunities to detect where and when a topic of interest is being discussed. Monitoring a particular topic and location allows researchers to compare it with other data collected by different means, as shown in the proposed Case Study.

The results can possibly be improved through the use of other feature extraction methods, or combinations of them, such as Latent Semantic Indexing, Principal Component Analysis and others. These improvements can come with implementations of these methods in future work.

References

I. Arel, D.C. Rose, and T.P. Karnowski. 2010. Deep machine learning - a new frontier in artificial intelligence research [research frontier]. Computational Intelligence Magazine, IEEE, 5(4):13–18, Nov.

Leo Breiman. 2001. Random forests. Mach. Learn., 45(1):5–32, October.

David P. Currie. 1998. The first impeachment: The constitution's framers and the case of senator william blount. American Journal of Legal History, 42(4):427–429.

Nugroho Dwi Prasetyo and Claudia Hauff. 2015. Twitter-based election prediction in the developing world. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT '15, pages 149–158, New York, NY, USA. ACM.

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. 1996. From data mining to knowledge discovery in databases. AI Magazine, 17:37–54.

Eibe Frank and Remco R. Bouckaert. 2006. Naive bayes for text classification with unbalanced classes. In Proceedings of the 10th European Conference on Principle and Practice of Knowledge Discovery in Databases, PKDD'06, pages 503–510, Berlin, Heidelberg. Springer-Verlag.

Z. Huang and Z. Xu. 2011. A method of using geoserver to publish economy geographical information. In Control, Automation and Systems Engineering (CASE), 2011 International Conference on, pages 1–4, July.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, ECML '98, pages 137–142, London, UK. Springer-Verlag.

Andreas Jungherr. 2013. Tweets and votes, a special relationship: The 2009 federal election in germany. In Proceedings of the 2nd Workshop on Politics, Elections and Data, PLEAD '13, pages 5–14, New York, NY, USA. ACM.

Jayashri Khairnar and Mayura Kinikar. 2013. Machine learning algorithms for opinion mining and sentiment classification. International Journal of Scientific and Research Publications, 3(6):1–6.

Vasileios Lampos. 2012. On voting intentions inference from Twitter content: a case study on UK 2010 General Election. arXiv preprint arXiv:1204.0423.

Nuhcan Akçit, Emrah Tomur, and Mahmut Onur Karslıoğlu. 2014. Geographical information systems participating into the pervasive computing. In GEOProcessing 2014, The Sixth International Conference on Advanced Geographic Information Systems, Applications, and Services, pages 129–137. ThinkMind, March.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2):1–135, January.

Stephen Robertson. 2004. Understanding inverse document frequency: On theoretical arguments for idf. Journal of Documentation, 60(5):503–520, July.

Marina Sokolova and Guy Lapalme. 2009. A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4):427–437.

Ian Sommerville. 2006. Software Engineering: (Update) (8th Edition) (International Computer Science). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

Tiara, M.K. Sabariah, and V. Effendy. 2015. Sentiment analysis on twitter using the combination of lexicon-based and support vector machine for assessing the performance of a television program. pages 386–390, May.

H. Zhang. 2011. The research of web mining in e-commerce. In Management and Service Science (MASS), 2011 International Conference on, pages 1–4, Aug.