=Paper=
{{Paper
|id=Vol-1178/CLEF2012wn-RepLab-BalahurEt2012
|storemode=property
|title=Detecting Entity-Related Events and Sentiments from Tweets Using Multilingual Resources
|pdfUrl=https://ceur-ws.org/Vol-1178/CLEF2012wn-RepLab-BalahurEt2012.pdf
|volume=Vol-1178
}}
==Detecting Entity-Related Events and Sentiments from Tweets Using Multilingual Resources==
Alexandra Balahur and Hristo Tanev
alexandra.balahur@jrc.ec.europa.eu, hristo.tanev@ext.jrc.ec.europa.eu
European Commission Joint Research Centre, IPSC, GlobeSec, OPTIMA, Via E. Fermi 2749, Ispra, Italy

Abstract. This article presents the details of the participation of the OPTAH team in the CLEF 2012 RepLab profiling (polarity classification) and monitoring tasks. Specifically, we present the manner in which the OPAL system was modified to deal with opinions in tweets, and how rules that model language use in social media can help to achieve good results in polarity classification, even for a language for which we have only a small polarity lexicon. Additionally, we show how the values computed for sentiment intensity (especially the negative ones) can be employed to classify the importance of event-related clusters of tweets. Our methods, although quite simple, obtained promising results in the RepLab evaluations.

1 Introduction

In the new Social Web era, the influence that user-generated content has on all spheres of life has reached an unprecedented level. People's comments on news and events and their personal opinions on persons and companies worldwide have made the Internet a rich source of information, highly relevant for the people or companies in question and their stakeholders.

Online Reputation Management deals with the issue of automatically detecting and employing "positive" and "negative" clues expressed in online content about such people and companies. As stated by Balahur [4], this task is highly complex, as it touches on important issues in opinion mining, sentiment analysis, bias detection, Named Entity discrimination, online trust and reputation management, topic modeling, good versus bad news classification and other aspects which are in themselves non-trivial in Natural Language Processing.

This article presents the details of the participation of the OPTAH team in the CLEF 2012 RepLab profiling (polarity classification) and monitoring tasks. The main objectives of our experiments were:

1. For the polarity task:
– to test whether the methods we have developed for sentiment analysis on other text types can be adapted to tweets (short texts), and what changes this adaptation requires;
– to test and evaluate, in comparison, a semi-supervised versus an unsupervised method for sentiment analysis on this type of text; and
– to measure whether resources that are typical of social media (e.g. collections of smileys, colloquial expressions, slang, repetitions of punctuation signs) and an algorithm to normalize words can help to more accurately detect opinions in tweets.

2. For the monitoring task:
– to test how well a clustering method initially employed on news can be customized to deal with news reported in tweets; and
– to test to what extent the intensity of the sentiment detected in the tweets within clusters can be used to sort the clusters by priority.
For the first task, although the adaptation to the Twitter domain was not very extensive in terms of sentiment-bearing words, our results showed that rules taking into account the typical phenomena of short informal texts can achieve good results: in terms of F-score for polarity, our two submissions ranked 8th and 9th overall and 5th for Spanish, and they obtained the second-best polarity accuracy for Spanish. For the second task, we could see that the negativity expressed in the comments was important to the priority of the clusters that contained those comments. Nevertheless, additional methods to score the "negativity" of news have to be employed, as well as "good" versus "bad" news terms, which were disregarded by the OPAL system.

2 Profiling Task - Polarity Classification

For the polarity classification task, we employed two approaches. The first one was semi-supervised, using a variant of the OPAL system [5], whose extension is presented in the following section. The second one was unsupervised, using only lexicons of words that relate to polarity, as well as a set of rules for modifiers and negation. The two approaches are described in the following subsections.

In order to prepare the tweets for analysis, the texts were tokenized and the tokens were subsequently preprocessed as follows (a sketch of this preprocessing is given after the list):

1. Word normalization. The words in the tweets were compared against the words in Roget's Thesaurus. Words that were not found in the dictionary were processed by eliminating repeated letters until they matched a word in the dictionary. The words were also matched against the affect lexicons we employed in our method: the General Inquirer [8] list of sentiment words, the Linguistic Inquiry and Word Count (LIWC) resource [9], Micro-WNOp [7], as well as the dictionary obtained by Steinberger et al. [6] for Spanish. This is important because, in the second method, which is based on the polarity and intensity values of concepts, a word that is "stressed" by writing it with repeated letters receives an increment in polarity (i.e. for positive words, 1 is added to the total polarity value, and for negative words, 1 is subtracted from it).

2. Emoticon replacement. We employed an emoticon dictionary and replaced the emoticons found in the tweets with the word they signify (e.g. ":)" is replaced with "happy").

3. Repeated punctuation sign normalization. In the tweets, we reduced multiple punctuation signs to a single one and, for the second approach, added or subtracted 1 from the total polarity value.
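The sketch below is a minimal illustration of these three steps. The word list and emoticon dictionary are toy stand-ins for Roget's Thesaurus and the emoticon dictionary named above, and the greedy order in which repeated letters are removed is our assumption, since the paper does not specify it.

```python
import re

# Toy stand-ins (assumptions): the paper matched words against Roget's
# Thesaurus and used its own emoticon dictionary.
DICTIONARY = {"so", "cool", "good", "happy", "sad"}
EMOTICONS = {":)": "happy", ":(": "sad"}

def normalize_word(word, dictionary=DICTIONARY):
    """Step 1: drop repeated letters one at a time until the word matches a
    dictionary entry. Returns the word and whether it was "stressed"; the
    stressed flag later adds/subtracts 1 from the total polarity value."""
    if word in dictionary:
        return word, False
    candidate = word
    match = re.search(r"(.)\1", candidate)
    while match:
        # remove one letter from the first repeated run and re-check
        candidate = candidate[:match.start()] + candidate[match.start() + 1:]
        if candidate in dictionary:
            return candidate, True
        match = re.search(r"(.)\1", candidate)
    return word, False

def preprocess(tweet):
    tokens = []
    for token in tweet.split():
        if token in EMOTICONS:
            tokens.append(EMOTICONS[token])   # step 2: emoticon replacement
        else:
            tokens.append(normalize_word(token.lower())[0])
    # step 3: reduce repeated punctuation signs to a single one
    return re.sub(r"([!?.])\1+", r"\1", " ".join(tokens))

print(preprocess("this is sooooo cooool :) !!!"))  # -> "this is so cool happy !"
```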
2.1 OPAL - a System for Opinion Detection from Text

This run was submitted under the acronym OPTAH 1. In order to determine the polarity of the sentences, we passed each sentence through an opinion mining system employing SVM machine learning over the NTCIR 8 MOAT corpus (for English, together with the Spanish translation obtained by Balahur and Turchi [11]), the MPQA corpus for English, EmotiBlog [10] for English and Spanish, and the tweets given in the training set by the organizers of the RepLab 2012 competition. As opposed to the system employed in the NTCIR 8 MOAT task [5], we only used tokenization and did not perform any parsing, as tweets are often not fully-formed sentences.

Each of the positive, negative, negation and modifier (intensifier, diminisher) words found in these corpora was matched against the General Inquirer, OpinionFinder, Micro-WNOp and LIWC resources and replaced by the "POSITIVE", "NEGATIVE", "NEGATOR", "INTENSIFIER" and "DIMINISHER" labels. Subsequently, we represented the sentences in the training set as vectors containing the presence (1) or absence (0) of all the unigrams and bigrams in the corpora used for training. With the vectors thus obtained, we employed the Support Vector Machines implementation in Weka (the SMO version) and created a learning model. The tweets in the test set were represented as vectors whose features corresponded to the presence or absence of the unigrams and bigrams in the training sets.
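The following is a minimal sketch of this representation and learning step, assuming scikit-learn's LinearSVC as a stand-in for the Weka SMO implementation actually used; the tiny lexicon and training sentences are hypothetical.

```python
# Minimal sketch of the OPTAH 1 pipeline, with scikit-learn's LinearSVC
# standing in for Weka's SMO; the lexicon and data are toy assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical fragment of the merged lexicons, mapping words to labels.
LEXICON = {"great": "POSITIVE", "awful": "NEGATIVE", "not": "NEGATOR",
           "very": "INTENSIFIER", "slightly": "DIMINISHER"}

def generalize(sentence):
    """Replace lexicon words with their category labels, as described above."""
    return " ".join(LEXICON.get(w, w) for w in sentence.lower().split())

train_texts = ["the new phone is great", "very awful customer service",
               "not a great bank", "a great experience"]
train_labels = ["positive", "negative", "negative", "positive"]

# Presence (1) / absence (0) of all unigrams and bigrams seen in training.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform([generalize(t) for t in train_texts])

model = LinearSVC().fit(X, train_labels)
test = vectorizer.transform([generalize("awful bank")])
print(model.predict(test))   # e.g. ['negative']
```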
2.2 Opinion Detection from Text Using Opinion Lexica and Rules

This second run was submitted under the acronym OPTAH 2. In this second approach, we employed a simpler method to compute the polarity and intensity scores. Each of the sentiment lexicons employed was mapped to four values of polarity: high positive (with a value of 4), high negative (-4), positive (1) and negative (-1). Additionally, we added slang words for both languages (e.g. "LOL" with a value of 4, "joder" with a value of -4). We further employed a set of rules to take into consideration negation, modifiers, repeated punctuation signs and emoticons, as follows (a sketch of the resulting scorer is given after the list):

– Negation treatment. When a negation was found, the polarity of the subsequent sentiment-bearing words found in the tweet was inverted. We excluded the known cases of "false negations", such as "not only" and "no solamente".
– Modifier treatment. When an intensifier was found, the polarity of the following sentiment-bearing word in the tweet was multiplied by 1.5. In the case of diminishers, the polarity of the sentiment-bearing word that followed was multiplied by 0.5.
– Emoticon treatment. When an emoticon is found, it is given the score of the word that it represents (e.g. ":(" has the value -1, that of "sad").
– Repeated letters treatment. When a word has repeated letters and is found in the polarity lexicon, its polarity value is multiplied by 1.5.
– Repeated punctuation signs. In the case of exclamation signs, the value of the entire sentence preceding them is multiplied by 1.5. In the case of full stops, the value of the preceding sentence is multiplied by 0.5.
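Below is a minimal sketch of the resulting scorer, under assumed toy lexicons; for simplicity it applies the punctuation rules to the whole tweet rather than to the preceding sentence.

```python
# Minimal sketch of the lexicon-and-rule scorer (OPTAH 2); the lexicons are
# toy assumptions, and the punctuation rules are simplified to whole tweets.
POLARITY = {"excellent": 4, "good": 1, "bad": -1, "lol": 4, "joder": -4}
NEGATORS = {"not", "no"}
FALSE_NEGATIONS = {("not", "only"), ("no", "solamente")}
INTENSIFIERS = {"very"}
DIMINISHERS = {"slightly"}

def score(tweet):
    tokens = tweet.lower().split()
    total, negated, modifier = 0.0, False, 1.0
    for i, w in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if w in NEGATORS and (w, nxt) not in FALSE_NEGATIONS:
            negated = True                     # invert subsequent polar words
        elif w in INTENSIFIERS:
            modifier = 1.5                     # boost the next polar word
        elif w in DIMINISHERS:
            modifier = 0.5                     # dampen the next polar word
        elif w in POLARITY:
            value = POLARITY[w] * modifier
            total += -value if negated else value
            modifier = 1.0                     # modifiers affect one word only
    if "!" in tweet:
        total *= 1.5                           # exclamation signs boost the score
    elif tweet.rstrip().endswith("."):
        total *= 0.5                           # a full stop dampens it
    return total

print(score("this bank is not very good !"))   # -(1 * 1.5) * 1.5 = -2.25
```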
2.3 Results and Discussion

For the two runs we submitted, we obtained the following results, in terms of polarity accuracy, R polarity, S polarity and F-score of R and S polarity, respectively: OPTAH 1 (0.3644, 0.3256, 0.3102, 0.3048) and OPTAH 2 (0.3705, 0.4048, 0.2689, 0.3042), scoring 8th and 9th of 34 runs in terms of F(R,S). Per language, the results for English, in the same order, were: OPTAH 1 (0.3207, 0.3050, 0.2920, 0.2810) and OPTAH 2 (0.3293, 0.4061, 0.2523, 0.2922). For Spanish, the results were: OPTAH 1 (0.4430, 0.3041, 0.2901, 0.2837) and OPTAH 2 (0.4435, 0.3695, 0.2567, 0.2844). We can see that for English, using more resources deteriorated the performance, and the semi-supervised learning method actually produced worse results than the simple lexicon- and rule-based system. In the case of Spanish, our systems ranked among the first three in terms of accuracy and F-measure, showing that a smaller but more precise lexicon (also containing slang), combined with a set of rules that capture the manner in which expressions of sentiment are stressed in social media, can better help to classify tweets.

3 Monitoring Task

We participated in the RepLab monitoring task. In this task we used multilingual lists of keywords extracted by Europe Media Monitor [1] as features for clustering. We then used the second sentiment detection system described above (OPTAH 2) to define the priority of the clusters. Our assumption was that clusters which convey negative opinions should be considered more relevant for reputation management, since they may report on major issues related to the mentioned organization. Our algorithm has two stages of processing: clustering and priority definition. We now explain each of these steps in more detail.

Clustering is performed in three steps (a sketch is given after this description). First, for each tweet, we build a vector from the Europe Media Monitor keywords which appear in the tweet, ignoring very frequent keywords. Each dimension of the vector corresponds to one word which appears in the tweet. The values of the vector components are defined using the log-likelihood ratio, considering the probability of appearance of the word in a large news corpus of 100,000,000 words. The fact that we used a news corpus, rather than one derived from tweets, influences the accuracy of our approach; however, we did not have Twitter-specific keywords.

Then, we count whether the Spanish or the English keywords are more represented in the tweet and consider the tweet as English or Spanish according to the language from which the majority of the keywords come. We did not consider tweets with fewer than 3 keywords, since these were most probably not informative.

Finally, we cluster the tweet vectors using agglomerative clustering with a threshold previously optimized on the training set of clusters. Our criterion for optimization was that the average reliability and sensitivity of the clustering for the training set entities are balanced. In our experiments we used the CluTo clustering tool [2].
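The sketch below illustrates the clustering stage. The keyword list and corpus probabilities are toy assumptions, the log-likelihood-ratio weighting is approximated by the word's surprisal in the news corpus, and SciPy's agglomerative clustering stands in for the CluTo tool actually used.

```python
import math
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy stand-ins (assumptions): EMM keywords and their probabilities of
# appearance in a large news corpus (100,000,000 words in the paper).
KEYWORDS = ["bank", "blackberry", "branch", "broken", "service"]
CORPUS_PROB = {"bank": 1e-4, "blackberry": 1e-6, "branch": 8e-5,
               "broken": 5e-5, "service": 2e-4}

def tweet_vector(tweet):
    """One dimension per keyword; words that are rarer in the corpus get
    higher weights (a simplification of the log-likelihood-ratio weighting)."""
    words = set(tweet.lower().split())
    return [-math.log(CORPUS_PROB[k]) if k in words else 0.0
            for k in KEYWORDS]

tweets = ["my blackberry is broken again",
          "blackberry broken since this morning",
          "bank closing one branch after another"]
vectors = np.array([tweet_vector(t) for t in tweets])

# Agglomerative clustering with a distance threshold; in the paper the
# threshold was optimized on the training set (5.0 here is arbitrary).
labels = fcluster(linkage(vectors, method="average"), t=5.0,
                  criterion="distance")
print(labels)   # tweets sharing a label form one cluster, e.g. [1 1 2]
```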
We defined the priority of the clusters by using sentiment detection. We assumed that negative tweets convey information about issues and problems related to the organization of interest and its products or services. Negative opinions are important for reputation management, since a negative perception of an organization can be exploited against it by its competitors. Also, by analysing the negative opinions, the organization can find its weak points as seen by people and improve its image. For example, one of the negative tweets about Blackberry was:

"So my blackberry broke again this morning, and is not working again"

Similar opinions should be important for Blackberry, since they show problems with its products. Or let us consider the following example:

"Bank of America bugs the shit out of me in general. 20 years! I think I am a financial masochist"

Again, here we have a negative opinion about Bank of America and its services. Tweets similar to this one should be considered important for Bank of America, since they show potential weaknesses in its services. Another example for the same organization:

"Bank of America refusing to do business with certain companies... WOW for a bank that nearly went bankrupt and closing branches all over"

This tweet directly states a negative fact about Bank of America and, at the same time, clearly shows a negative attitude towards the bank.

In order to detect negative tweets, we ran our multilingual system OPTAH 2 and detected the clusters which contain negative tweets. These clusters were considered important and their priority was set to the alert level, while the clusters not containing negative tweets were considered unimportant and their priority level was set to average.

One of the weaknesses of our clustering approach was that it was based on purely lexical features. We could have considered, for example, synonyms and similar words. Also, using dimensionality reduction, clustering can be done on a reduced feature space [3], which could potentially result in better clustering. Another possibility was to calculate a table of distributional similarity between the frequent keywords; in this way, we could overcome the restrictions of lexical similarity.

Another possibility to improve the results lies in improving the calculation of cluster priority. Currently, we used only sentiment detection. One could also use the size of the cluster, its lexical content, the fact that some tweets are retweeted or replied to, the number of Web links provided in the tweets, etc., in order to better calculate the priority. This can be formulated as a supervised machine learning task, where certain tweets or clusters are manually marked with their level of priority and the features are the previously mentioned characteristics.

We submitted one run for the monitoring task. It was ranked in the middle of the ranked list of runs, with the following scores: 0.7 for R CLUSTERING (BCubed precision), 0.34 for S CLUSTERING (BCubed recall), 0.38 for F(R,S) CLUSTERING, 0.19 for R PRIORITY, 0.16 for S PRIORITY, 0.16 for F PRIORITY, 0.37 R, 0.19 S and an overall F(R,S) of 0.22. Considering the simplicity of our approach, we consider the results satisfactory, though there is still room for improvement. One of the main problems was that negative sentiment alone was not enough to detect important tweet clusters.

4 Conclusions and Future Work

From our experiments with the training data and from the results obtained in the competition, we could see that reputation management is indeed a difficult task. The main challenges relate to the language used in social media, the shortness of the texts, the assumed knowledge of the context (i.e. people use hashtags to refer to specific events which are presented in traditional media) and the difficulty of assessing "good" and "bad" news from the perspective of different domains. As future work, we plan to use the EMM event categories as additional clues to the positivity and negativity of events, and to develop a method to detect topic-specific types of events, which we can then classify in terms of their positive or negative impact on the entity, depending on the opinion expressed in social media. Further on, we will extend the method developed by (Tanev et al., 2012) to link tweets to news and thus be able to explore a larger quantity of text for the reputation management task.

References

1. Steinberger, R., Pouliquen, B., Van der Goot, E. 2009. An Introduction to the Europe Media Monitor Family of Applications. In: Gey, F., Kando, N., Karlgren, J. (eds.): Information Access in a Multilingual World - Proceedings of the SIGIR 2009 Workshop (SIGIR-CLIR'2009), pp. 1-8. Boston, USA, 23 July 2009.
2. Karypis, G. CLUTO. Available online: http://glaros.dtc.umn.edu/gkhome/views/cluto/.
3. Song, W., Park, S.C. 2007. A Novel Document Clustering Model Based on Latent Semantic Analysis. In: Proceedings of the Third International Conference on Semantics, Knowledge and Grid, pp. 539-542.
4. Balahur, A. 2012. The Challenge of Processing Opinions in Online Contents in the Social Web Era. In: Proceedings of the Workshop on Language Engineering for Online Reputation Management at LREC 2012.
5. Balahur, A., Boldrini, E., Montoyo, A., Martinez-Barco, P. 2010. The OpAL System at NTCIR 8 MOAT. In: Proceedings of NTCIR 8.
6. Steinberger, J., Lenkova, P., Ebrahim, M., Ehrman, M., Hurriyetoglu, A., Kabadjov, M., Steinberger, R., Tanev, H., Zavarella, V., Vazquez, S. 2011. Creating Sentiment Dictionaries via Triangulation. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2011).
7. Cerini, S., Compagnoni, V., Demontis, A., Formentelli, M., Gandini, G. 2007. Micro-WNOp: A Gold Standard for the Evaluation of Automatically Compiled Lexical Resources for Opinion Mining. In: Language Resources and Linguistic Theory: Typology, Second Language Acquisition, English Linguistics. Franco Angeli Editore, Milano, IT.
8. Stone, P.J., Dunphy, D.C., Smith, M.S., Ogilvie, D.M. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
9. Pennebaker, J.W., Francis, M.E., Booth, R.J. 2001. Linguistic Inquiry and Word Count: LIWC2001. Mahwah, NJ: Erlbaum Publishers.
10. Boldrini, E., Balahur, A., Martinez-Barco, P., Montoyo, A. 2010. EmotiBlog: A Finer-Grained and More Precise Learning of Subjectivity Expression Models. In: Proceedings of the 4th Linguistic Annotation Workshop (LAW IV).
11. Balahur, A., Turchi, M. 2012. Multilingual Sentiment Analysis Using Machine Translation? In: Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2012).