Dictionary-Based Sentiment Analysis applied to specific domain using a Web Mining approach

Laura Cruz, Universidad Nacional de San Agustín, Perú, lcruzq@unsa.edu.pe
José Ochoa, Universidad Católica San Pablo, Perú, jeochoa@ucsp.edu.pe
Mathieu Roche, TETIS, Cirad, Cnrs, AgroParisTech, Irstea, France, mathieu.roche@cirad.fr
Pascal Poncelet, LIRMM, Cnrs, Université Montpellier, France, pascal.poncelet@lirmm.fr

Abstract

In recent years, the Web and social media have grown exponentially, providing documents that express opinions on many topics. These documents constitute a rich source for Natural Language Processing tasks, in particular Sentiment Analysis. In this work, we aim at constructing a sentiment dictionary based on words obtained from web pages related to a specific domain. To do so, we correlate candidate opinion words, seed words and domain using the AcroDefMI3 and TrueSkill methods. This dictionary-based approach is compared to the SentiWordNet lexical resource. Experimental results show the suitability of our approach for multiple domains and infrequent opinion words.

1 Introduction

In recent years, the Web and social media have grown exponentially, and they constitute a rich source for Sentiment Analysis tasks. Companies are increasingly using the content in these media to make better decisions (Marrese-Taylor et al., 2013). Users rely on social networking sites to express thoughts and opinions about products (Amine et al., 2014). In this context, Sentiment Analysis involves the process of identifying the polarity of opinionated texts. These opinionated texts are highly unstructured in nature and thus require the application of Natural Language Processing techniques (Varghese and Jayasree, 2013). As a rule, documents contain opinionated texts about several topics. Words used to express opinions about some topics can be specific and highly correlated to a particular domain (Duthil et al., 2011). For instance, while we may find the sentence "The chair is black", such an adjective would be unusual in a movie domain. To tackle these issues, both machine learning and dictionary-based approaches have been proposed in the literature. A machine learning method that applies text-categorization techniques has been proposed by (Pang and Lee, 2004). In this method, graphs, a minimum cut formulation, context and domain are considered in order to extract the subjective portions of documents.

On the other hand, dictionary-based approaches are unsupervised in nature. In general, these methods assume that positive (negative) adjectives appear more frequently near a positive (negative) seed word (Harb et al., 2008). An unsupervised learning algorithm for classifying reviews (thumbs up or thumbs down) has been adopted by (Turney, 2002; Wang and Araki, 2007). The classification of a review is given by the average semantic orientation of its phrases that contain either adjectives or adverbs. The semantic orientation of a phrase is computed as the mutual information between the given phrase and the word excellent minus the mutual information between the given phrase and the word poor. Therefore, a phrase has a positive semantic orientation when it has good associations and a negative semantic orientation when it has bad associations, as shown by equation 1.

\[
\mathrm{SO}(\mathit{phrase}) = \log \frac{\mathrm{hits}(\mathit{phrase}\ \mathrm{NEAR}\ \mathit{excellent}) \cdot \mathrm{hits}(\mathit{poor})}{\mathrm{hits}(\mathit{phrase}\ \mathrm{NEAR}\ \mathit{poor}) \cdot \mathrm{hits}(\mathit{excellent})} \tag{1}
\]

In this work, the words used to express opinions are learned. To do so, positive and negative seed words (e.g. good, excellent, bad) are used to extract adjectives that occur near them. To correlate candidate words, seed words and domain, the AcroDefMI3 and TrueSkill methods are proposed. Experimental results show the suitability of our proposal. Several domains (e.g. movies, agricultural) were used to compare our approach to SentiWordNet.
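The semantic orientation of equation 1 can be illustrated with a minimal Python sketch. It assumes the hit counts have already been obtained from a search engine and are passed in as plain numbers. The function names, the small smoothing constant and the choice of base 2 for the logarithm are assumptions made for illustration; the equation above does not state a base.

```python
import math


def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """Semantic orientation of a phrase from raw hit counts (equation 1).

    `smoothing` is an assumed constant that avoids division by zero when
    a phrase never co-occurs with one of the two reference words.
    `hits_excellent` and `hits_poor` are assumed to be strictly positive.
    """
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


def classify_review(phrase_orientations):
    """Label a review from the average orientation of its phrases.

    Assumes a non-empty list of per-phrase orientation scores.
    """
    average = sum(phrase_orientations) / len(phrase_orientations)
    return "thumbs up" if average > 0 else "thumbs down"
```

A phrase that co-occurs more often with excellent than with poor gets a positive score, and a review is labelled from the sign of the average score over its phrases.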
The paper is organized as follows. The methodology is presented in Section 2. The experimental setup is described in Section 3. In Section 4, we present and discuss the obtained results. Concluding remarks are presented in Section 5.

2 Methodology

The proposed process is depicted in Figure 1 and can be summarized in the following steps:

1. A corpus for a specific domain, containing positive and negative opinions, is acquired from the Web.
2. Each document is pre-processed to get its text and to remove HTML tags and scripts.
3. Opinion adjectives and nouns are extracted using POS-Tagging and the Window Size algorithm.
4. The correlation score of a given word with a seed word and domain is computed using AcroDefMI3 and TrueSkill. Lexicons are inferred from these correlation scores, which identify the semantic orientation of each extracted word. Words with a high correlation score are selected.

[Figure 1: Lexicons are inferred from Web pages correlated to seed words and domains, via extraction and process of candidate words. Stages shown: Corpus Acquisition (domain, seed word, Web pages), Pre-processing (text), Word Extraction (POS-Tag, Window Size(N), nouns, adjectives), Word Selection (score: MI3, TrueSkill, SentiWordNet), Dictionary+ and Dictionary−.]

We perform experiments over two domains: an Agricultural domain (opinions extracted from Twitter) and a Movie domain¹ (data set introduced in (Pang et al., 2002)). Further details are given in the next sections.

¹ http://www.cs.cornell.edu/People/pabo/movie-review-data/

2.1 Corpus Acquisition

Some words can express a neutral, positive or negative opinion in a specific domain, such as:

Neutral → I attend scientific conferences.
Positive → The list shows the scientific discoveries.
Positive → He made a good scientific discovery.

These examples show that a given word, for instance scientific, can be highly correlated to a particular domain (Harb et al., 2008). The first example is considered a neutral opinion. Conversely, the second example is considered a positive opinion. The third example is also a positive opinion because of the word good. Thus, some words are useful for learning opinion words related to a given domain. We can define a seed word, such as good, that helps us to find other opinion words.

Lexicons are built using words selected from a web page corpus. Web pages are retrieved using the Bing search engine. The queries used to retrieve these web pages combine seed words and domain keywords. The positive and negative seed words are, respectively, P = {good, nice, excellent, positive, fortunate, correct, superior} and Q = {bad, nasty, poor, negative, unfortunate, wrong, inferior}.

A positive (negative) seed word ensures a positive (negative) web page about a query domain, because all opposite seed words are excluded from that query. For example, the following query can be used for retrieving positive pages:

query+ = +opinion +review +gmo +good -bad -nasty -poor -negative -unfortunate -wrong -inferior

Thus, we have positive and negative web pages, denoted by corpus+ and corpus− respectively. Each corpus is related to a seed word and a given domain. In the next section we extract the words near seed words for each web page corpus using POS-Tagging and the Window Size algorithm.

2.2 Word extraction

Opinion words near a seed word can have the same polarity (Roche and Prince, 2007; Harb et al., 2008). The same approach has been used here to extract candidate opinion words. To identify opinion words (nouns and adjectives) in a web page corpus, TreeTagger² has been used. Beforehand, HTML tags, scripts, blank spaces and stop words³ were removed from the web pages. In order to get the words near each seed word, a Window Size algorithm has been used (Algorithm 1). The Window Size algorithm looks for opinion words on both the left and right sides of a seed word, given a distance K. This distance is the number of opinion words taken to the left (right) of a seed word in a web page corpus.

² http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
³ http://www.ranks.nl/stopwords

Algorithm 1 The Window Size Algorithm
Require: seed words, corpus, K
Ensure: opinion words
  words ← apply TreeTagger to each corpus
  words ← filter adjectives and nouns
  for index = 0 until total of words do
    if words[index] in seed words then
      for k = 1 until K do
        left word ← words[index − k]
        right word ← words[index + k]
        add left word and right word to opinion words

Scientific/JJ studies/NNS have/VHP frequently/RB found/VVN that/IN GMO's/NNS are/VBP safe/JJ to/TO eat/VV and/CC even/RB good/JJ
Figure 2: Window size sample for good seed word (the figure marks window size = 1 and window size = 2).

In Figure 2, adjectives (JJ) and nouns (NNS) are retrieved using TreeTagger. The word good is a positive seed word and its nearest adjective is safe, given distance k = 1. Likewise, the words scientific and studies are retrieved with distance k = 2. In addition, safe is a positive opinion word candidate because it occurs near a positive seed word (good). In this way, we obtain a set of opinion words (positive and negative) that are candidates for inclusion in the resulting lexicon. To get the correlation score of each extracted word given a seed word, two measures are employed: AcroDefMI3 and TrueSkill, which are described in the next section.

2.3 Word Selection

As seen in our previous example (Figure 2), the word scientific was retrieved using window size distance = 2. However, specific words, such as gmo, can also be used to express a domain opinion. Hence, we need to measure the correlation of a given extracted word with the domain and the seed word in order to build a lexicon. To get candidate opinion words we propose to use the statistical measure AcroDefMI3 (equation 2) (Roche and Prince, 2007). Moreover, we also propose a novel probabilistic measure based on the TrueSkill algorithm (Herbrich et al., 2007) (Algorithm 3).

The AcroDefMI3 measure takes each word extracted using the Window Size algorithm and computes equation 2, which is based on web mining. The total numbers of web page results, for queries that combine candidate words, seed words and domain keywords, are used in the AcroDefMI3 measure to get the correlation score of each extracted word.

\[
\mathrm{AcroDef}_{MI3} = \log \left( \frac{\bigl(nb(\mathit{sw}\ \mathit{word}\ \mathrm{AND}\ \mathit{domain}) + nb(\mathit{word}\ \mathit{sw}\ \mathrm{AND}\ \mathit{domain})\bigr)^{3}}{nb(\mathit{sw}\ \mathrm{AND}\ \mathit{domain}) \cdot nb(\mathit{word}\ \mathrm{AND}\ \mathit{domain})} \right) \tag{2}
\]

where sw is a seed word, nb(x) is the total number of result pages, x is the query used to retrieve pages from the search engine, and word is the word extracted using the Window Size algorithm. This process is detailed in Algorithm 2.

Algorithm 2 Word selection algorithm using AcroDefMI3
Require: corpus, seed words = P, keywords of domain
Ensure: correlation score values for each word
  for each corpus do
    words+ = window size(corpus+, P)
    for word in words+ do
      given each seed word and keywords of domain compute correlation score:
      score ← max(AcroDefMI3)

In the TrueSkill approach, unlike AcroDefMI3, words are extracted using the Window Size algorithm and the measure function is then applied to them. Furthermore, words are extracted for each positive (negative) page against k random negative (positive) pages, and their scores are then computed. Thus, TrueSkill sets up a match between the words of positive pages and the words of negative pages. The process is detailed in Figure 3, where S+ = {s_{1,1}, s_{1,2}, ..., s_{1,n}} and S− = {s_{2,1}, s_{2,2}, ..., s_{2,n}}; s are the learned values for each word in a positive and a negative web page respectively, p is the learning performance of each word, and t is the sum of the performances of the words in a corpus.

[Figure 3: TrueSkill Model, learning score for each word selected given the positive and negative corpus. Factor graph with skill priors N(s_{i,j}; μ_{i,j}, σ²_{i,j}), performances N(p_{i,j}; s_{i,j}, β²), team sums t_1 = p_{1,1} + p_{1,2} and t_2 = p_{2,1} + p_{2,2}, difference d_1 = t_1 − t_2, and outcome factor I(d_1 > ε).]

As TrueSkill learns s according to its match outcome, we give a higher rank to corpus+ and a lower rank to corpus−. Therefore, we have d = t1 − t2. Because the difference d is what matters, we set t1 = 1 for a positive corpus and t2 = 2 for a negative corpus, where 1 denotes first place. This process is detailed in Algorithm 3.

Algorithm 3 Word selection algorithm using TrueSkill
Require: corpus, seed words (P, Q)
Ensure: correlation score values for each word
  k = 10 (number of matches for each corpus)
  for each corpus do
    words+ = window size(corpus+, P)
    for k random corpus− do
      words− = window size(corpus−, Q)
      given each word compute correlation score:
      score ← TrueSkill(words+, words−, t = [1, 2])

The following example shows how TrueSkill measures two collected web pages:

corpus+ = By the way a New York Times ··· excellent job ··· bioengineered food ··· .
corpus− = Roundup Ready cotton ··· wrong solution ··· at any economic advantage.

Team    Words           S^i       S^{i+1}
word+   bioengineered   22.738    22.809
word−   economic        −0.001    −0.022

where S^i denotes the current correlation score of each word and S^{i+1} the updated value after matching pages (a positive page against a negative page); bioengineered is a word near excellent, a seed word ∈ P, and economic is near wrong, a seed word ∈ Q, when the Window Size algorithm has distance k = 1. Then, when the same corpus+ has a match with another corpus−:

corpus− = Various studies ··· poor agricultural income ··· .

Team    Words           S^i       S^{i+1}
word+   bioengineered   22.738    28.023
word−   agricultural    −0.108    −4.764

It is worth noting that agricultural becomes a more negative word than economic because its value decreases more after the match against the same positive word, bioengineered. On the one hand, if a word is often found in a corpus−, its value tends to decrease. On the other hand, if it is found in a corpus+, its value will increase. If the word is found in both corpora, its value tends to remain constant. In the next section, experimental results are shown.

3 Experiments

In order to validate our approach, experiments over two data sets were conducted. The polarity of each opinion from the domains (Agricultural tweets and Movie reviews) is predicted using the lexicons inferred with the AcroDefMI3 and TrueSkill measures. Precision, recall and f-score were measured in order to compare with the SentiWordNet approach. The data sets used are described in the next section.

3.1 Datasets

The domain keywords used in queries were: Agriculture domain = {gmo, agricultural biotechnology, biotechnology for agriculture}, and Movie domain = {cinema, film, movie}. In order to test the agricultural domain, tweets using these keywords were collected and manually classified. There were 50 positive and 61 negative tweets. The Movie domain⁴ is based on (Pang and Lee, 2004). It contains 1000 positive and 1000 negative reviews.

A simple classification procedure was used. The number of positive and negative words in each tweet or review is computed using the inferred lexicons. If the difference is greater than zero then the text is classified as positive, otherwise it is classified as negative. The following kinds of lexicons were used for sentiment classification:

• MI3: seed words + WS with AcroDefMI3.
• TS: seed words + WS with TrueSkill.
• SWN: SentiWordNet.

where WS denotes the words extracted with window size. Finally, the number of web pages retrieved during the corpus acquisition for each seed word was k = 20.

In the following, we show the word distributions for each type of lexicon.

3.2 Seed words

Table 1 shows the number of occurrences of each seed word in the web pages.

Seed Word      Agricultural   Movie
superior       42             10
good           406            178
positive       54             17
fortunate      23             4
excellent      47             20
correct        24             7
nice           40             23
poor           58             14
negative       65             25
wrong          64             43
bad            98             39
unfortunate    22             27
nasty          23             15
inferior       23             11

Table 1: Seed words (SW) frequency for Agricultural Domain

3.3 Window size

Using k = 20 web pages, a high number of low-frequency adjectives is retrieved, as shown in Figure 5a. To get a word near a seed word with window size = 1, the maximum distance allowed is 10 words per window size.

3.4 Measure function (AcroDefMI3, TrueSkill)

Figures 4 and 5 show the word scores obtained using the proposed measures. It can be observed that these scores discriminate between words better than the frequencies of the Window Size algorithm shown in Figure 5a. Tables 2 and 3 show the top 5 words of the inferred lexicons.

3.5 SentiWordNet

SentiWordNet⁵ is a lexical resource for opinion mining. It assigns to each synset of WordNet three sentiment scores: positive, negative and neutral. We compute the difference between the positive and negative scores. If the result is greater than zero then the polarity of the word is positive, otherwise it is negative. SentiWordNet assigns a different score to each word according to its context. As context is not considered here, the higher positive and negative word scores are obtained. Finally, SentiWordNet comprises 21479 adjectives and 117798 nouns.
⁴ http://www.cs.cornell.edu/People/pabo/movie-review-data/
⁵ http://sentiwordnet.isti.cnr.it/

[Figure 4: Adjective words for Agricultural domain. Panels: (a) Word frequency using Window Size (WS), (b) MI3, (c) TS.]

[Figure 5: Adjective words for Movie domain. Panels: (a) Word frequency using Window Size (WS), (b) MI3, (c) TS.]

           Adjective Words                        Noun Words
           WS         MI3         TS              WS            MI3          TS
Positive   dark       cheap       qualified       flavor        luck         note
           daily      fat         inconclusive    fit           night        commitment
           active     coconut     ideal           movie         morning      judgment
           favorite   false       fresh           opportunity   source       continent
           full       probiotic   active          job           vodka        jihad
Negative   stunning   rural       devastating     farmer        regulation   farmer
           german     chemical    irreversible    debate        bread        regulation
           hungry     standard    sick            cost          guy          group
           wealthy    brutish     general         intensity     gmos         problem
           medical    hungry      chemical        gmos          soil         tomato

Table 2: Top 5 adjectives for Agricultural domain (left columns). Table 3: Top 5 nouns for Agricultural domain (right columns).

3.6 Classification

The inferred lexicons are used to classify opinions. We have positive and negative lexicons (dictionaries) for each data set (Agricultural, Movie), as shown in Table 7. In the Agricultural domain, 32 new words that do not appear in SentiWordNet have been learned. Likewise, in the Movie domain, 20 new words that do not appear in SentiWordNet have been learned. Table 6 shows the top 10 new words ordered by their correlation score value. In order to validate the algorithms we calculate recall, precision and f-score. Figures 6 and 7 show the recall, precision and f-score for each word type (nouns, adjectives), and the results using MI3, SentiWordNet and TrueSkill.

[Figure 6: Tweet classification, left with adjectives, right with nouns.]

[Figure 7: Classification using Movie Reviews, left with adjectives, right with nouns.]
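The counting classifier and the evaluation measures reported in Figures 6 and 7 can be sketched as follows. This is a minimal illustration that assumes each opinion is already tokenized and lowercased, and that the f-score is the balanced F1 (the text does not state the weighting). The function names and the toy lexicons in the usage note are hypothetical.

```python
def classify(tokens, positive_lexicon, negative_lexicon):
    """Positive if the positive-word count exceeds the negative-word count."""
    positives = sum(1 for token in tokens if token in positive_lexicon)
    negatives = sum(1 for token in tokens if token in negative_lexicon)
    return "positive" if positives - negatives > 0 else "negative"


def precision_recall_f1(gold, predicted, label):
    """Precision, recall and F1 of `label` over paired gold/predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With a toy positive lexicon {good, safe} and negative lexicon {bad}, the tokens good, safe, food are classified as positive, while a text with no lexicon words falls back to negative, since the rule requires a difference strictly greater than zero.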
           Adjective Words                        Noun Words
           WS           MI3       TS              WS          MI3      TS
Positive   comfy        big       late            info        place    info
           expensive    real      clear           wife        day      place
           late         natured   common          people      food     staff
           french       sound     french          service     feel     credo
           infectious   easy      commercial      party       luck     city
Negative   makeshift    video     emotional       blood       thing    blood
           video        pretty    russian         rate        word     character
           lost         english   cartoonish      character   person   idea
           fast         acting    treacly         interest    blood    progression
           attentive    full      dull            time        video    activity

Table 4: Top 5 adjectives for Movie domain (left columns). Table 5: Top 5 nouns for Movie domain (right columns).

4 Discussion of the results

When the inferred lexicon for the Movie domain is considered, TrueSkill performs better (Recall, Precision and F-Score) than SentiWordNet and AcroDefMI3 for positive reviews using adjectives and nouns. When negative reviews are considered, TrueSkill performs better using nouns than adjectives.

On the other hand, in the Agricultural domain, SentiWordNet performs better than AcroDefMI3 and TrueSkill. This is because the agricultural data set was collected from Twitter. Tweets are short texts that usually contain more seed words and common words, as shown in Table 1. The agricultural domain has frequent seed words.

Agricultural     Movie
chocolaty        configurable
glyphosate       updated
phosphonic       readymade
carfentrazone    nature
sporogene        directorial
kalu             spendidly
protato          cartoonish
adeed            mic
phthalates       showreel
genotoxicity     coverup

Table 6: Top 10 words of inferred lexicons using AcroDefMI3 and TrueSkill methods, which are not in SentiWordNet

Word           Positive   Negative
Agricultural
  Adjective    200        119
  Noun         314        189
Movie
  Adjective    153        141
  Noun         171        183

Table 7: Total of inferred lexicon words by domain.

5 Conclusion

Most dictionary-based algorithms for sentiment analysis consider word frequency in documents. However, this research has shown that words with low frequencies in the collected corpus can be useful for setting polarities. Thus, we propose a dictionary-based algorithm for sentiment analysis that uses the AcroDefMI3 and TrueSkill methods to compute word correlation scores that allow us to differentiate between positive and negative polarities. This is particularly useful for low-frequency words obtained from the corpus. In addition, by using the Window Size algorithm, it is possible to obtain new adjective entries in both the agricultural and movie domains when compared to SentiWordNet.

Acknowledgments

This work has been supported and financed by FONDECYT.

References

Abdelmalek Amine, Reda Mohamed Hamou, and Michel Simonet. 2014. Detecting opinions in tweets. volume abs/1402.5123.

Benjamin Duthil, François Trousset, Mathieu Roche, Gérard Dray, Michel Plantié, Jacky Montmain, and Pascal Poncelet. 2011. Towards an Automatic Characterization of Criteria, pages 457–465. Springer Berlin Heidelberg, Berlin, Heidelberg.

Ali Harb, Michel Plantie, Gerard Dray, Mathieu Roche, Francois Trousset, and Pascal Poncelet. 2008. Web opinion mining: How to extract opinions from blogs? In Proceedings of the 5th International Conference on Soft Computing As Transdisciplinary Science and Technology, CSTST 08, pages 211–217, New York, NY, USA. ACM.

Ralf Herbrich, Tom Minka, and Thore Graepel. 2007. Trueskill(tm): A bayesian skill rating system. In Advances in Neural Information Processing Systems 20, pages 569–576. MIT Press, January.

Edison Marrese-Taylor, Juan D. Velásquez, Felipe Bravo-Marquez, and Yutaka Matsuo. 2013. Identifying customer preferences about tourism products using an aspect-based opinion mining approach. Procedia Computer Science, 22(0):182–191. 17th International Conference in Knowledge Based and Intelligent Information and Engineering Systems - KES2013.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the ACL.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 79–86, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mathieu Roche and Violaine Prince. 2007. Modeling and Using Context: 6th International and Interdisciplinary Conference, CONTEXT 2007, Roskilde, Denmark, August 20-24, 2007. Proceedings, chapter AcroDef: A Quality Measure for Discriminating Expansions of Ambiguous Acronyms, pages 411–424. Springer Berlin Heidelberg, Berlin, Heidelberg.

Peter D. Turney. 2002. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 417–424, Stroudsburg, PA, USA. Association for Computational Linguistics.

R. Varghese and M. Jayasree. 2013. Aspect based sentiment analysis using support vector machine classifier. In Advances in Computing, Communications and Informatics (ICACCI), 2013 International Conference on, pages 1581–1586, Aug.

Guangwei Wang and Kenji Araki. 2007. Modifying so-pmi for japanese weblog opinion mining by using a balancing factor and detecting neutral expressions. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, NAACL-Short '07, pages 189–192, Stroudsburg, PA, USA. Association for Computational Linguistics.