<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>FBM-Yahoo! at RepLab 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jose M. Chenlo</string-name>
          <email>josemanuel.gonzalez@usc.es</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordi Atserias</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Rodriguez</string-name>
          <email>carlos.rodriguezg@barcelonamedia.org</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roi Blanco</string-name>
          <email>roi@yahoo-inc.com</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper describes FBM-Yahoo!'s participation in the profiling task of RepLab 2012, which aims at determining whether a given tweet is related to a specific company and, if this is the case, whether it contains a positive or negative statement related to the company's reputation. We addressed both problems (ambiguity and polarity for reputation) using Support Vector Machine (SVM) classifiers and lexicon-based techniques, automatically building company profiles and bootstrapping background data. Concretely, for the ambiguity task we employed a linear SVM classifier with a token-based representation of relevant and irrelevant information extracted from the tweets and Freebase resources. With respect to polarity classification, we combined SVM and lexicon-based approaches with bootstrapping in order to determine the final polarity label of a tweet.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        RepLab [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] addresses the problem of reputation analysis, i.e. mining and
understanding opinions about companies and individuals, a harder and still not well
understood problem. FBM-Yahoo! participates in the RepLab profiling task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
where systems are asked to annotate two kinds of information on tweets: whether
the tweet is related to the company (ambiguity) and whether it has positive or
negative implications for the company's reputation (polarity).
      </p>
    </sec>
    <sec id="sec-2">
      <title>Ambiguity task</title>
      <p>
        2.1 Company Representation
Twitter messages are short (up to 140 characters); hence, measures that account
for the textual overlap between tweets and company names are in general not
enough to classify a given tweet as relevant or irrelevant [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], mostly due to data
sparsity and lack of context [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In order to alleviate this problem, we turned
to using the Freebase (http://www.freebase.com) graph and Wikipedia
(http://www.wikipedia.org) as reliable sources of information
for building expanded term-based representations of the different companies.
      </p>
      <p>From the Freebase/Wikipedia pages of the companies we automatically extracted
two sets of entities, namely related concepts and non-related concepts:
- Related Concepts (RC): the set of entities that are connected with
the company in Freebase through the incoming (outgoing) links of
the company's Freebase page. For example, in the case of Apple Inc., the
related concepts set includes iPhoto, iChat, iBook and iTunes Store.
- Non-Related Concepts (NRC): the set of common entities with
which the current company could cause spurious term matches. This set is
comprised of all Freebase entities with a name similar to that of the
company. It is built automatically by querying Freebase with the query
that identifies the company in the training data. From this set we remove
the target company (if it was found), all the entities that are already
included in RC, and all entities that share at least one non-common
category with the target company (a minimal sketch follows this list). As an example of this process, in the case
of Apple Inc. some of the non-related entities selected were "big apple" or
"pine apple".</p>
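      <p>A minimal Python sketch of this construction is shown below. It is not the authors' code: the Freebase lookups are replaced by a toy in-memory graph for the Apple Inc. example, and the "non-common category" rule is simplified to any shared category.</p>
      <preformat><![CDATA[
# Illustrative sketch of building the RC and NRC concept sets for a company.
# The three lookup functions stand in for calls to the (now retired) Freebase API.
FREEBASE = {
    'Apple Inc.':   {'links': {'iPhoto', 'iTunes Store'}, 'categories': {'company', 'technology'}},
    'iPhoto':       {'links': {'Apple Inc.'}, 'categories': {'software'}},
    'iTunes Store': {'links': {'Apple Inc.'}, 'categories': {'service'}},
    'big apple':    {'links': set(), 'categories': {'nickname'}},
    'pine apple':   {'links': set(), 'categories': {'fruit'}},
}

def get_linked_entities(entity):
    return FREEBASE[entity]['links']

def get_categories(entity):
    return FREEBASE[entity]['categories']

def search_freebase(query):
    # Entities whose name is similar to the company query (toy: substring match).
    return [e for e in FREEBASE if query.lower() in e.lower()]

def build_concept_sets(company_query, company):
    rc = set(get_linked_entities(company))            # Related Concepts
    nrc = set()
    company_categories = get_categories(company)
    for entity in search_freebase(company_query):
        if entity == company or entity in rc:
            continue                                  # drop company and RC members
        # drop entities sharing a category with the company
        # (the "non-common category" refinement is simplified away here)
        if get_categories(entity) & company_categories:
            continue
        nrc.add(entity)
    return rc, nrc

print(build_concept_sets('apple', 'Apple Inc.'))      # RC vs. NRC for the example
]]></preformat>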
      <p>Then, for each entity obtained following the previous method, we crawled
its Wikipedia page (in the data set tweets are written in either English or
Spanish, so both language versions were downloaded and stored when possible)
and used the Lucene (http://lucene.apache.org) software to compute the
following lists of keywords for each set of entities (RC, NRC):
- entity names: names of the entities related (non-related) to the company.
- named entities in text: all named entities extracted by the Stanford Named
Entity Recognizer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
- ngrams: unigrams and bigrams (applying stemming and removing
stopwords).
      </p>
      <p>A weight w is associated with each of the obtained keywords (entity names,
named entities in text, ngrams). In the case of the entity names, the weight is always
1. For named entities in text and ngrams, the weight is the ratio of documents
that contain the given keyword.</p>
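      <p>Under our reading of this scheme, entity names receive weight 1 and the remaining keywords are weighted by the fraction of crawled pages containing them. The following function is an illustrative sketch only.</p>
      <preformat><![CDATA[
from collections import Counter

def keyword_weights(entity_names, docs_keywords):
    """docs_keywords: one set of extracted keywords per crawled Wikipedia page."""
    weights = {name: 1.0 for name in entity_names}     # entity names: weight 1
    df = Counter()
    for keywords in docs_keywords:
        df.update(set(keywords))                       # document frequency
    n_docs = len(docs_keywords)
    for kw, count in df.items():
        # named entities in text and ngrams: ratio of documents containing them
        weights.setdefault(kw, count / n_docs)
    return weights

# e.g. keyword_weights(['iphoto'], [{'apple', 'photo'}, {'apple', 'music'}])
# -> {'iphoto': 1.0, 'apple': 1.0, 'photo': 0.5, 'music': 0.5}
]]></preformat>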
      <p>These lists of keywords represent our profile for a given company as a bag-of-words
model. We note that tweets could be written in English or Spanish, and
accordingly we computed two different profiles for each company: one with
the English version of Wikipedia and the other with the Spanish version.</p>
      <p>
        2.2 Training Process
In recent years, machine learning techniques have been widely applied to
Twitter data with relative success in many classification problems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ].
Concretely, the best system in the WePS-3 Evaluation Campaign [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where the main
task consisted in identifying whether a tweet that contains a company name is
related to the company or not, employed a linear SVM classifier. Following this
approach, we trained a linear SVM classifier using the LibLinear package
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Table 1 lists the features used to represent the data, which
are broken down into matches of terms in the tweet against the company's
profile (profile), features related to the company name in the tweet (company) and
company-independent (tweet-only) features.
      </p>
      <p>[Table 1: features used to represent tweets for the ambiguity classifier, with columns Scope and Description.]</p>
      <p>Note that the last six features compare a given tweet with the profile
computed for the company. The first six features are tweet-dependent, and they only
need the text of the tweet and the query that represents the company. Using this
representation we were able to learn a classifier over the trial set (six companies)
that can be directly applied to the test data.</p>
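      <p>A sketch of such a classifier is shown below. It uses scikit-learn's LinearSVC, which wraps the LIBLINEAR library used in the paper; the features are a simplified, hypothetical stand-in for those of Table 1.</p>
      <preformat><![CDATA[
from sklearn.svm import LinearSVC

def featurize(tweet, company_query, profile_weights):
    """A few hypothetical features in the spirit of Table 1: tweet-only,
    company and profile-match scopes."""
    tokens = tweet.lower().split()
    return [
        len(tokens),                                              # tweet length
        tweet.count('#'),                                         # hashtags
        1.0 if company_query.lower() in tweet.lower() else 0.0,   # company name match
        sum(profile_weights.get(t, 0.0) for t in tokens),         # profile overlap
    ]

# Toy profile and trial data standing in for the real company profiles.
profile = {'iphoto': 1.0, 'itunes': 1.0, 'store': 0.4}
trial = [('New iTunes store update from Apple', 1),
         ('Eating an apple pie #yum', 0)]

X = [featurize(t, 'apple', profile) for t, _ in trial]
y = [label for _, label in trial]

clf = LinearSVC(C=1.0).fit(X, y)                                  # linear SVM
print(clf.predict([featurize('Apple iPhoto crashed again', 'apple', profile)]))
]]></preformat>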
    </sec>
    <sec id="sec-3">
      <title>Polarity for Reputation task</title>
      <p>The following sections explain three different approaches (lexicon-based, and
distant supervision using hashtags and lexicons) that we explored in order to determine
whether a tweet has positive or negative implications for the company's
reputation.</p>
      <p>3.1 Lexicon-Based Approaches
The most straightforward approaches employ an ensemble of several lexicons
created with different methodologies in order to broaden coverage, especially
across domains, since some sentiment cues are used differently depending on the
subject being commented on.</p>
      <p>In order to aggregate the lexicon scores into a final polarity measure, several
formulas can be used, for instance:

polScore(t, lan, q_t) = \sum_{l_i \in lan} polLex(t, l_i, q_t)    (1)

where t is a tweet, lan is the language of the tweet, q_t is a query, l_i is one of the
lexicons associated to lan and polLex(t, l_i, q_t) is a matching function between the
lexicon l_i and the tweet t. We have developed two different matching functions,
polLex_raw and polLex_smooth. polLex_raw is a simple aggregation measure that
takes into account just the matches between tweets and lexicons to compute
the final polarity:

polLex_raw(t, l, q_t) = \sum_{w_l \in l} tf_{w_l,t} \cdot priorPol(w_l)    (2)

where t represents a single tweet, l is one of the lexicons associated to the
language of the tweet, w_l is an opinionated word from lexicon l, tf_{w_l,t} is the
frequency of w_l in tweet t, and priorPol(w_l) is the polarity score of word w_l in
lexicon l (this score can be positive or negative depending on the orientation of w_l).</p>
      <p>On the other hand, polLex_smooth is an aggregation measure that takes into
account the matches between tweets and lexicons and the distance of these
matches to the company name in order to smooth the polarity score of each word:

polLex_smooth(t, l, q_t) = \frac{1}{|q_t|} \sum_{q_i \in q_t} \sum_{w_l \in l \cap t} \frac{1}{d_{w_l,q_i}} \cdot priorPol(w_l)    (3)

where d_{w_l,q_i} is the distance of the tweet term w_l to the query term q_i.</p>
      <p>Finally, we decide the final classification of each tweet using the following
simple thresholding:

pol(t) = positive if polScore(t, lan, q_t) &gt; 0; neutral if polScore(t, lan, q_t) = 0; negative if polScore(t, lan, q_t) &lt; 0    (4)</p>
      <p>Note that it is possible to compute two different values for polScore(t, lan, q_t)
by applying either Equation 2 or Equation 3 as the matching function in Equation 1. Full
details about which methods have been used in the submitted runs can be found
in Section 4.</p>
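      <p>The following Python sketch is one possible reading of Equations 1-4. It assumes each lexicon is a dictionary mapping an opinionated word to its prior polarity (a positive or negative number) and that distances are measured in token positions; neither assumption is stated explicitly above.</p>
      <preformat><![CDATA[
def pol_lex_raw(tokens, lexicon):
    # Equation 2: term frequency times prior polarity, summed over matches.
    return sum(tokens.count(w) * prior for w, prior in lexicon.items())

def pol_lex_smooth(tokens, lexicon, query_terms):
    # Equation 3: polarity smoothed by the distance to each query term.
    score = 0.0
    for qi in query_terms:
        if qi not in tokens:
            continue
        q_pos = tokens.index(qi)
        for pos, w in enumerate(tokens):
            if w in lexicon and pos != q_pos:
                score += lexicon[w] / abs(pos - q_pos)
    return score / max(len(query_terms), 1)

def pol_score(tokens, lexicons, query_terms, smooth=False):
    # Equation 1: aggregate over all lexicons of the tweet's language.
    if smooth:
        return sum(pol_lex_smooth(tokens, lex, query_terms) for lex in lexicons)
    return sum(pol_lex_raw(tokens, lex) for lex in lexicons)

def polarity(tokens, lexicons, query_terms, smooth=False):
    # Equation 4: simple thresholding at zero.
    s = pol_score(tokens, lexicons, query_terms, smooth)
    return 'positive' if s > 0 else ('negative' if s < 0 else 'neutral')
]]></preformat>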
      <p>
        3.2 Distant Supervision
Traditional opinion mining methods proposed in the literature are often based on
machine learning techniques, using as primary features a vocabulary of unigrams
and bigrams collected from training data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Following this approach, we used a linear SVM to classify tweets
as positive, neutral or negative. Table 2 lists the features employed to represent
the data, which are broken down into tweet-based features, part-of-speech-based
features and lexicon-based features.</p>
      <p>Table 2. Features used to represent tweets for polarity classification (Scope / Description); a sketch of the tweet-scope features follows the table:
- Voc.: vocabulary features: unigrams and bigrams from training examples.
- Tweet: size of the tweet; number of links; number of hashtags; whether the tweet could be spam (a single word appears more than three times); number of exclamations and interrogations; number of uppercase letters; number of lengthening phenomena.
- POS: number of verbs; number of proper names; number of adjectives; number of pronouns.
- Lexicon: number of positive emoticons.</p>
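      <p>The sketch below illustrates how the tweet-scope features of Table 2 could be extracted; the exact definitions (e.g. of lengthening phenomena) are our assumptions, and the POS and lexicon features are omitted since they require a tagger and the lexicons.</p>
      <preformat><![CDATA[
import re

def tweet_scope_features(text):
    tokens = text.split()
    return {
        'size': len(tokens),                                       # size of tweet
        'links': sum(1 for t in tokens if t.startswith('http')),   # number of links
        'hashtags': sum(1 for t in tokens if t.startswith('#')),   # number of hashtags
        'maybe_spam': int(any(tokens.count(t) > 3 for t in set(tokens))),
        'exclam_interrog': text.count('!') + text.count('?'),
        'uppercase': sum(1 for c in text if c.isupper()),
        'lengthening': len(re.findall(r'(\w)\1{2,}', text)),       # e.g. "goooal"
    }
]]></preformat>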
      <p>The lexicon-based approaches previously described do not require training
and can be directly applied to test data. However, the proposed data
representation requires some amount of training data to compute the vocabulary
features for each tweet, which was not available at training time. Moreover, because
the companies in the test set belong to different domains (e.g.
banks vs. technology), the terms (and even their senses) used to express opinions
may change from one company to another.</p>
      <p>
        For that reason, we learnt a different model for each company, in which we
automatically generated a set of labelled examples from its background model.
Other recent work in this area has focused on distantly supervised methods
which learn the polarity classifiers from data with noisy labels such as emoticons
and hashtags [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        3.3 Distant Supervision using Hashtags
Similarly to [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], for each polarity class (i.e. positive, negative and neutral) we
performed the following process to automatically generate positive, neutral
and negative labelled examples (a minimal sketch follows this list):
1. Selecting all hashtags that were used in more than 5 tweets in the background
model of the company.
2. Removing the noisy content (spam, repeated tweets, retweets, etc.) for each
hashtag.
3. Using Equation 1 in conjunction with Equation 2 as the matching function to
select the top 5 positive/negative/neutral hashtags, according to the ratio
of tweets of each hashtag that were classified as positive/negative/neutral.
4. Selecting the top 20 tweets of each polarity from the top hashtags.
      </p>
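      <p>A minimal sketch of this bootstrapping, under the assumption that background tweets are available as (text, hashtags) pairs and that classify_with_lexicons implements the Equation 1 + Equation 2 scoring; the spam/retweet filtering of step 2 is omitted.</p>
      <preformat><![CDATA[
from collections import Counter, defaultdict

def bootstrap_from_hashtags(background_tweets, classify_with_lexicons,
                            min_tweets=5, top_hashtags=5, top_tweets=20):
    # Step 1: hashtags used in more than `min_tweets` background tweets.
    counts = Counter(h for _, tags in background_tweets for h in set(tags))
    frequent = {h for h, c in counts.items() if c > min_tweets}

    # Label every background tweet with the lexicon-based scorer (Eq. 1 + 2).
    per_tag = defaultdict(Counter)
    tweets_by_tag = defaultdict(list)
    for text, tags in background_tweets:
        label = classify_with_lexicons(text)   # 'positive'/'negative'/'neutral'
        for h in set(tags) & frequent:
            per_tag[h][label] += 1
            tweets_by_tag[h].append((text, label))

    examples = {'positive': [], 'negative': [], 'neutral': []}
    for label in examples:
        # Step 3: top hashtags by the ratio of their tweets in this class.
        ranked = sorted(frequent, reverse=True,
                        key=lambda h: per_tag[h][label] / sum(per_tag[h].values()))
        # Step 4: top tweets of this polarity from each of the top hashtags.
        for h in ranked[:top_hashtags]:
            examples[label] += [t for t, l in tweets_by_tag[h]
                                if l == label][:top_tweets]
    return examples
]]></preformat>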
      <p>This bootstrapping process makes it possible to obtain up to 100 positive, negative
and neutral labelled examples (i.e. up to 300 examples in total) to train different
classifiers.</p>
      <p>Once we had generated our labelled examples, we trained a positive
classifier (positive examples against negative plus neutral examples) and a
negative classifier (negative examples against positive plus neutral examples) for each
company in the test set. We also trained the best thresholds that separate
the positive and the negative examples for each classifier. Finally, we combined
the two classifiers and the learned thresholds to decide whether a given tweet had to
be tagged as positive, neutral or negative.</p>
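      <p>The exact combination rule is not detailed above; the following sketch shows one plausible reading, in which each classifier's decision score is compared against its learned threshold.</p>
      <preformat><![CDATA[
def combined_polarity(features, pos_clf, neg_clf, pos_threshold, neg_threshold):
    # Scores of the two one-vs-rest SVMs for a single tweet representation.
    pos_score = pos_clf.decision_function([features])[0]
    neg_score = neg_clf.decision_function([features])[0]
    if pos_score > pos_threshold and pos_score >= neg_score:
        return 'positive'
    if neg_score > neg_threshold:
        return 'negative'
    return 'neutral'
]]></preformat>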
      <p>Learning the Best Threshold. In the previously described approach, we
selected the class decision threshold for a classifier using data which could
potentially contain noisy labels, and consequently could harm the performance of our
system. To alleviate this problem, we manually assessed 50 randomly sampled examples from the
background data of each company and selected the positive/negative
thresholds for each classifier according to the class distribution found in the data.
Full details about which submitted runs were built with this kind of training can
be found in Section 4.</p>
      <p>3.4 Distant Supervision using Lexicons
This distant supervision method is similar to the one explained in Section 3.3,
with the difference that it makes use of the polarity lexicons instead of the tweet
hashtags.</p>
      <p>The following process is undertaken for each polarity class (i.e. positive,
negative and neutral), in order to automatically generate positive, neutral and
negative labelled examples for each company (a sketch follows this list):
1. Select as positive examples tweets that only have positive matches, sorted by
the number of matches in the lexicon.
2. Select as neutral examples tweets that have no matches, ordered by tweet
length.
3. Select as negative examples tweets that only have negative matches, sorted
by the number of matches in the lexicon.</p>
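      <p>A sketch of this selection, assuming the lexicons are plain sets of positive and negative words; the direction of the length ordering for neutral examples is our assumption.</p>
      <preformat><![CDATA[
def bootstrap_from_lexicons(background_tweets, positive_words, negative_words,
                            per_class=100):
    pos, neg, neu = [], [], []
    for text in background_tweets:
        tokens = set(text.lower().split())
        n_pos = len(tokens & positive_words)
        n_neg = len(tokens & negative_words)
        if n_pos and not n_neg:
            pos.append((n_pos, text))        # only positive matches
        elif n_neg and not n_pos:
            neg.append((n_neg, text))        # only negative matches
        elif not n_pos and not n_neg:
            neu.append((len(text), text))    # no matches, ranked by length
    pos.sort(reverse=True)                   # most matches first
    neg.sort(reverse=True)
    neu.sort()                               # shortest first (our assumption)
    return {'positive': [t for _, t in pos[:per_class]],
            'negative': [t for _, t in neg[:per_class]],
            'neutral':  [t for _, t in neu[:per_class]]}
]]></preformat>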
      <p>Similarly to the distant supervision method using hashtags, this
bootstrapping process selects up to 100 positive, negative and neutral labelled
examples (i.e. up to 300 examples in total) in order to train different
classifiers for each company. These examples are selected in order of their number of
matches.</p>
      <p>The final classifier is built using the thresholded ensemble described in
Section 3.3.</p>
    </sec>
    <sec id="sec-4">
      <title>Submitted Runs</title>
      <p>FBM-Yahoo! participated in the profiling task of the RepLab 2012 competition with
five different runs. The particular details on how the five FBM-Yahoo! runs were
built can be found in Table 3. All runs use the method explained in Section 2.1 to
classify a tweet as relevant or irrelevant, but they differ in the polarity method
used to compute the final label of a tweet (i.e. positive, negative or neutral).</p>
      <p>
        Regarding the polarity lexicon-based method described in Section 3.1, we
employed a total of six different polarity lexicons for English (including OpinionFinder [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
AFINN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Q-WordNet [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], dictionaries from the Linguistic Inquiry and Word
Count (LIWC) text analysis system [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], mapping positive and negative sentiments to numeric polarities and expanding the lexicon to possible morphological variants) and five polarity lexicons for Spanish.
Following [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] we also combine these lexicons with a lexicon based on emoticons.
      </p>
      <p>
        Since the resources available for Spanish are scarce, we translated some of
the resources available for English, for instance, some baseline lexicons like the
one used by OpinionFinder (the MPQA Subjectivity Lexicon), or AFINN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
In order to resolve ambiguities in this bilingual dictionary and to adapt it to
micro-blogging usage, we selected the translation alternative that occurred most
frequently in a separate large Spanish Twitter corpus (100,000 tweets, different
from the one provided by RepLab).
      </p>
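      <p>A hypothetical sketch of this disambiguation step: for each English entry, keep the Spanish translation alternative that is most frequent in the auxiliary Twitter corpus.</p>
      <preformat><![CDATA[
from collections import Counter

def pick_translations(bilingual_dict, spanish_corpus_tokens):
    """bilingual_dict: English word -> list of candidate Spanish translations."""
    freq = Counter(spanish_corpus_tokens)
    return {en: max(alternatives, key=lambda w: freq[w])
            for en, alternatives in bilingual_dict.items()}

# Example (toy data): 'bueno' wins over 'bondadoso' if it is more frequent.
# pick_translations({'good': ['bueno', 'bondadoso']}, corpus_tokens)
]]></preformat>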
      <p>
        As an additional approach, we used author-assessed datasets to create polarity
lexicons from customer reviews, in this case from 100,000 good vs. bad comments
sent to Hotels.com and similar sites, as well as movie comments from volunteer
reviewers and professionals. A Naive Bayes classifier was trained, from which a
list of class-discriminative unigrams and bigrams was extracted. Only adjectives
and adverbs from that list were kept to create a data-driven polarity lexicon,
similar to the method of Banea and Mihalcea [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] that employs an automatically
translated corpus. Finally, starting from a small, manually crafted dictionary, we
expanded its polar entries via WordNet synsets. Another run (UNED 5) was submitted in
collaboration with UNED; it combines all FBM-Yahoo! and UNED runs, and the
details of the combination are described in Section 3 of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Polarity for reputation task. According to the official measures (R and
S), the runs that take into account just the overlap between tweets and
lexicons (i.e. BMedia2 and BMedia3) performed best for polarity classification.
Nonetheless, the bootstrapping approaches were very competitive in terms
of accuracy. In fact, the performance they achieved is very close to that of the
lexicon-based approaches, and therefore the first conclusion we can draw from
this evaluation is that the distantly supervised approaches take limited advantage of
the training data in this benchmark. This could be due to the fact that the lexicons
contribute most of the model signal and might make it difficult to learn anything
from other sources of features. Moreover, the noise introduced by misclassified
data in the training process could harm the learning
process more than improve it.</p>
      <p>Profiling task. In this task, all methods behave similarly in terms of
performance, with BMedia4 being the best run. This method combines the hashtag
bootstrapping approach with the selection of a threshold for each classifier, learnt
from hand-classified tweets from the background models. It is worth remarking that
we selected the best threshold for each classifier using data which contains
noisy labels and consequently could harm the overall performance of the system.
In order to overcome this problem, we set a different threshold for each classifier
using background data. The results indicate that setting this threshold alleviates the
score noise coming from the lexicon-bootstrapped examples.</p>
      <p>
        Finally, as future work, we would like to explore how sentiment in Twitter
streams is affected by real-world events, which severely affect Twitter topic
trends. For example, if a football team loses a match, the next day the
overall opinion about this team will probably tend to be negative. We would also like to study
how to detect polarity changes over time and how to adapt our
classification models to these new scenarios. More concretely, we would like to apply
propensity scoring techniques [
        <xref ref-type="bibr" rid="ref19 ref20">19,20</xref>
        ] to deal with the fact that training instances
are governed by a distribution that di ers greatly from the test distribution.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work is partially funded by the Holopedia Project (TIN2010-21128-C02-02),
Ministerio de Ciencia e Innovación.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corujo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meij</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Rijke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Overview of RepLab 2012:
          <article-title>Evaluating online reputation management systems</article-title>
          .
          <source>In: CLEF 2012 Labs and Workshop</source>
          Notebook Papers. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Yerva</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miklos</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aberer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>It was easy, when apples and blackberries were only fruits</article-title>
          .
          <source>In: CLEF (Notebook Papers)</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Blanco</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
          </string-name>
          , H.:
          <article-title>Finding support sentences for entities</article-title>
          . In Crestani, F.,
          <string-name>
            <surname>Marchand-Maillet</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efthimiadis</surname>
            ,
            <given-names>E.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savoy</surname>
          </string-name>
          , J., eds.: SIGIR,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2010</year>
          )
          <volume>339</volume>
          -
          <fpage>346</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Feature-rich part-of-speech tagging with a cyclic dependency network</article-title>
          .
          <source>In: Proceedings of HLT-NAACL 2003</source>
          . (
          <year>2003</year>
          )
          <volume>252</volume>
          -
          <fpage>259</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bermingham</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeaton</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          :
          <article-title>Classifying sentiment in microblogs: is brevity an advantage? In: CIKM</article-title>
          . (
          <year>2010</year>
          )
          <year>1833</year>
          -
          <fpage>1836</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Go</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhayani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Twitter sentiment classification using distant supervision</article-title>
          .
          <source>Processing</source>
          (
          <year>2009</year>
          ) 1-
          <fpage>6</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vovsha</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rambow</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passonneau</surname>
          </string-name>
          , R.:
          <article-title>Sentiment analysis of twitter data</article-title>
          .
          <source>In: Proceedings of the Workshop on Language in Social Media (LSM</source>
          <year>2011</year>
          ), Portland, Oregon, Association for Computational Linguistics (
          <year>June 2011</year>
          )
          <volume>30</volume>
          -
          <fpage>38</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borthwick</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amigo</surname>
          </string-name>
          , E.:
          <article-title>Weps-3 evaluation campaign: Overview of the web people search clustering and attribute extraction tasks</article-title>
          . In: CLEF (Notebook Papers/LABs/Workshops). (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>In: Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          )
          <year>1871</year>
          -
          <fpage>1874</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vaithyanathan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Thumbs up? sentiment classification using machine learning techniques</article-title>
          .
          <source>In: Proceedings of EMNLP</source>
          . (
          <year>2002</year>
          )
          <volume>79</volume>
          -
          <fpage>86</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kouloumpis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , Wilson,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          :
          <article-title>Twitter sentiment analysis: The good the bad and the omg</article-title>
          ! In: ICWSM. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Carrillo de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chugur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Using an emotion-based model and sentiment analysis techniques to classify polarity for reputation</article-title>
          .
          <source>In: CLEF 2012 Labs and Workshop</source>
          Notebook Papers. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Wilson,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Wiebe</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          , Hoffmann, P.:
          <article-title>Recognizing contextual polarity in phraselevel sentiment analysis</article-title>
          .
          <source>In: HLT/EMNLP</source>
          . (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F.A.:</given-names>
          </string-name>
          <article-title>A new ANEW: Evaluation of a word list for sentiment analysis in microblogs</article-title>
          .
          <source>CoRR</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Agerri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>García-Serrano, A.: Q-WordNet: Extracting polarity from WordNet senses</article-title>
          . In Chair),
          <string-name>
            <given-names>N.C.C.</given-names>
            ,
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Odijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Piperidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Rosner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Tapias</surname>
          </string-name>
          , D., eds.
          <source>: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)</source>
          , Valletta, Malta, European Language Resources Association (ELRA) (may
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Francis</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Booth</surname>
          </string-name>
          , R.J.:
          <article-title>Linguistic inquiry and word count: Liwc 2001</article-title>
          . Mahway: Lawrence Erlbaum Associates (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Kun-Lin</surname>
            <given-names>Liu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu-Jun Li</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          :
          <article-title>Emoticon smoothed language models for twitter sentiment analysis</article-title>
          .
          <source>In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI)</source>
          .
          <article-title>(</article-title>
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banea</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Learning multilingual subjective language via crosslingual projections</article-title>
          .
          <source>In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Bickel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Bruckner,
          <string-name>
            <surname>M.</surname>
          </string-name>
          , Scheffer, T.:
          <article-title>Discriminative Learning Under Covariate Shift</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>10</volume>
          (
          <year>September 2009</year>
          )
          <volume>2137</volume>
          -
          <fpage>2155</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>Linear-time estimators for propensity scores</article-title>
          .
          <source>Journal of Machine Learning Research - Proceedings Track</source>
          <volume>15</volume>
          (
          <year>2011</year>
          )
          <volume>93</volume>
          -
          <fpage>100</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>