<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Evaluating a Sentiment Analysis Approach from a Business Point of View</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Javi Fernandez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yoan Gutierrez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Tomas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose M. Gomez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patricio Martínez-Barco</string-name>
          <email>patriciog@dlsi.ua.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software and Computing Systems, University of Alicante</institution>
        </aff>
      </contrib-group>
      <fpage>93</fpage>
      <lpage>98</lpage>
      <abstract>
        <p>In this paper, we describe our contribution to Task 1 (Sentiment Analysis at global level) of the TASS 2015 competition. This work presents our approach and the results obtained, focusing the evaluation and discussion on the context of business enterprises.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years, with the explosion of Web 2.0, textual information has become one of the most important sources of knowledge from which to extract useful data. Texts can provide factual information, but also opinion-based information, such as reviews, emotions, and feelings. Blogs, forums and social networks, as well as second screen scenarios, offer a place for people to share information in real time. Second screen refers to the use of devices (commonly mobile devices) to provide interactive features on streaming content (such as television programmes) within a software application, or real-time video on social networking applications. These facts have motivated recent research on the identification and extraction of opinions and sentiments in user comments (UC), providing invaluable information, especially for companies willing to understand customers' perceptions of their products or services in order to take appropriate business decisions. In addition, users can find opinions about a product they are interested in, and companies and personalities can monitor their online reputation.</p>
      <p>We would like to express our gratitude for the financial support given by the Department of Software and Computer Systems at the University of Alicante, the Spanish Ministry of Economy and Competitiveness (Spanish Government) through the project grants ATTOS (TIN2012-38536-C03-03) and LEGOLANG (TIN2012-31224), the European Commission through the project grant SAM (FP7-611312), and the University of Alicante through the project "Explotacion y tratamiento de la informacion disponible en Internet para la anotacion y generacion de textos adaptados al usuario" (GRE13-15).</p>
      <p>However, processing this kind of information brings different technological challenges. The large amount of available data, its unstructured nature, and the need to avoid the loss of relevant information make its manual processing almost impossible. Nevertheless, Natural Language Processing (NLP) technologies can help to analyse these large amounts of UC automatically. Nowadays, Sentiment Analysis (SA), as an NLP task, has become a popular discipline due to its close relation to social media behaviour studies. SA is commonly used to analyse the comments that people post on social networks. It also makes it possible to identify the preferences and criteria of users about situations, events, products, brands, etc.</p>
      <p>In this work we apply SA to the social context, specifically to address Task 1 (Sentiment Analysis at global level) of the TASS 2015 challenge (www.daedalus.es/TASS2015).</p>
    </sec>
    <sec id="sec-2">
      <title>The TASS 2015 Task</title>
      <p>Published in http://ceur-ws.org/Vol-1397/. CEUR-WS.org is a serial publication with a recognised ISSN.</p>
      <p>
        This task consists of determining the global polarity of each message over the provided general-purpose test sets. A detailed description of the workshop and the mentioned task can be found in
        <xref ref-type="bibr" rid="ref3">(Villena-Roman et al., 2015)</xref>
        . The context of the workshop is also part of the second screen phenomenon, in which users generate feedback on their experiences by posting it on social media. Our approach goes in that direction, being part of the SAM (Socialising Around Media) platform (www.socialisingaroundmedia.com), where "[...] users are interacting with media: from passive and one-way to proactive and interactive. Users now comment on or recommend a TV programme and search for related information with both friends and the wider social community."
      </p>
      <p>In this paper we present our SA system. This approach builds its own sentiment resource based on annotated samples and, based on the information collected, generates a machine learning classifier to deal with the SA challenges. The paper is structured as follows: the next section reviews related work, outlining the main insights of each approach. The classification system is described in Section 3. Subsequently, Section 4 details the evaluation, focusing not just on the guidelines of the TASS competition, but also on those aspects of interest for companies. Finally, the conclusions and future work are presented in Section 5.</p>
      <sec id="sec-2-1">
        <title>Related Work</title>
        <p>
          Different techniques have been used for both product reviews and social content analysis to obtain lexicons of subjective words with their associated polarity. We can start by mentioning the strategy defined by Hu and Liu (2004), which starts with a set of seed adjectives ("good" and "bad") and reinforces the semantic knowledge by applying and expanding the lexicon with the synonymy and antonymy relations provided by WordNet (wordnet.princeton.edu) (Miller, 1993). As a result, an opinion lexicon composed of a list of positive and negative opinion words for English (around 6,800 words) was obtained. A similar approach was used for building WordNet-Affect
          <xref ref-type="bibr" rid="ref2">(Strapparava and Valitutti, 2004)</xref>
          , in which six basic categories of emotions (joy, sadness, fear, surprise, anger and disgust) were expanded using WordNet. Another widely used resource in SA is SentiWordNet (Esuli and Sebastiani, 2006). It was built using a set of seed words whose polarity was previously known, and expanded using similarities between glosses. The main assumption behind this approach was that "terms with similar glosses in WordNet tend to have similar polarity". The main problem of using these kinds of resources is that they do not consider the context in which the words appear. Some methods tried to overcome this issue by building sentiment lexicons using the local context of words.
        </p>
        <p>Balahur and Montoyo (2008b) built a recommender system which computed the polarity of new words using "polarity anchors" (words whose polarity is known beforehand) and Normalised Google Distance scores. The authors used as training examples opinion words extracted from "pros and cons reviews" from the same domain, using the clue that opinion words appearing in the "pros" section are positive and those appearing in the "cons" section are negative. Research carried out by these authors employed the lexical resource Emotion Triggers (Balahur and Montoyo, 2008a). Another interesting work, presented by Popescu and Etzioni (2007), extracts the polarity from the local context to compute word polarity. To this end, it uses a weighting function of the words around the context to be classified.</p>
        <p>In our approach, the context of the words is kept using skipgrams. Skipgrams are a technique whereby n-grams are formed, but in addition to allowing adjacent sequences of words, some tokens can be "skipped". The next section describes our approach in detail.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Methodology</title>
        <p>Our approach is based on the one described in (Fernandez et al., 2013). In this approach, the knowledge is extracted from a training dataset, where each document/sentence/tweet is labelled with respect to its overall polarity. A sentiment lexicon is created using the words, word n-grams and word skipgrams (Guthrie et al., 2006) extracted from the dataset (Section 3.1). In this lexicon, terms are statistically scored according to their appearance within each polarity (Section 3.2). Finally, a machine learning model is generated using the mentioned sentiment resource (Section 3.3). In the following sections this process is explained in detail.</p>
        <sec id="sec-2-2-1">
          <title>Term Extraction</title>
          <p>Each text in the dataset is processed by removing accents and converting it to lower case. Then, each text is tokenised into words, Twitter mentions (starting with @) and Twitter hashtags (starting with #). We also include combinations of punctuation symbols as terms, in order to discover some polarity-specific emoticons.</p>
          <p>To improve the recall of our system, we perform a basic normalisation of the extracted words by removing all character repetitions. In addition, we use the stems of the extracted words, obtained with the Snowball stemmer implementation.</p>
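          <p>As a rough illustration, the normalisation step described above (lower-casing, accent removal and collapsing character repetitions) can be sketched as follows; the function name and exact rules are our own, and stemming would still be applied afterwards:</p>
          <preformat>
```python
import re
import unicodedata

def normalise(token):
    """Lower-case, strip accents, and collapse character repetitions."""
    token = token.lower()
    # drop combining marks left after canonical decomposition (accents)
    token = "".join(c for c in unicodedata.normalize("NFD", token)
                    if unicodedata.category(c) != "Mn")
    # collapse any run of a repeated character to a single one
    return re.sub(r"(.)\1+", r"\1", token)

print(normalise("Graciaaaas"))  # gracias
print(normalise(":)))"))       # :)
```
          </preformat>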
          <p>Afterwards, we obtain all the possible word skipgrams from those terms by making combinations of adjacent terms and skipping some of them. Specifically, we extract k-skip-n-grams, where the maximum number of terms in the skipgram is defined by the variable n and the maximum number of skipped terms is determined by the variable k. Note that words and word n-grams are subsets of the skipgrams extracted. Figure 1 shows an example of this process.</p>
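          <p>A compact sketch of k-skip-n-gram extraction as just described (the helper below is our own; the paper's actual implementation may differ). For the stems of Figure 1 it yields exactly the fifteen 2-skip-2-grams listed there, plus the seven unigrams:</p>
          <preformat>
```python
from itertools import combinations

def skipgrams(tokens, n=2, k=2):
    """All m-grams (m = 1..n) that keep token order and skip
    at most k intermediate tokens in total."""
    grams = []
    for size in range(1, n + 1):
        for idx in combinations(range(len(tokens)), size):
            skipped = (idx[-1] - idx[0] + 1) - size  # tokens jumped over
            if skipped > k:
                continue
            grams.append(" ".join(tokens[i] for i in idx))
    return grams

stems = ["graci", "por", "tu", "apoy", "@usuario", "!", ":)"]
bigrams = [g for g in skipgrams(stems) if " " in g]
print(len(bigrams))  # 15, matching Figure 1
```
          </preformat>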
          <p>We must clarify the difference between two concepts: skipgram and skipgram occurrence. For example, the sentences "I hit the tennis ball" and "I hit the ball" both contain the skipgram "hit the ball", but there are two occurrences of that skipgram: the first one in the first example with 1 skipped term, and the second one in the second example with no skipped terms. In other words, we will consider a skipgram as a group of terms that appear near each other in the same order, allowing some other terms between them, and a skipgram occurrence as the actual appearance of that skipgram in a text.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Term Scoring</title>
          <p>In this step, we calculate a global score for each skipgram. This score is calculated using the formula in Equation 1, where T represents the set of texts in the dataset, t is a text from the dataset T, o_{s,t} represents an occurrence of skipgram s in text t, and k is a function that returns the number of skipped terms of the input skipgram occurrence.</p>
          <p>score(s) = Σ_{t ∈ T} Σ_{o_{s,t} ∈ t} 1 / (k(o_{s,t}) + 1)    (1)</p>
          <p>Figure 1: Term extraction example. Original text: "Graciaaaas por tu apoyo @usuario!! :)))". Tokenisation: Graciaaaas, por, tu, apoyo, @usuario, !!, :))). Normalisation: gracias, por, tu, apoyo, @usuario, !, :). Stemming: graci, por, tu, apoy, @usuario, !, :). Skipgrams (2-skip-2-grams): graci por, graci tu, graci apoy, por tu, por apoy, por @usuario, tu apoy, tu @usuario, tu !, apoy @usuario, apoy !, apoy :), @usuario !, @usuario :), ! :)</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Term Scoring (continued)</title>
      <p>We also calculate a polarity score for each skipgram and polarity. It is similar to the previous score, but it only takes into account the texts with a specific polarity. The formula is presented in Equation 2, very similar to Equation 1, but where p represents a specific polarity and T_p is the set of texts in the training corpus annotated with polarity p.</p>
      <p>score(s, p) = Σ_{t ∈ T_p} Σ_{o_{s,t} ∈ t} 1 / (k(o_{s,t}) + 1)    (2)</p>
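      <p>Equations 1 and 2 share the same shape: every occurrence of a skipgram contributes 1/(k+1), where k is its number of skipped terms. A small sketch (our own toy encoding of occurrences as their skipped-term counts) makes the weighting concrete, using the "hit the ball" example from the previous section:</p>
      <preformat>
```python
def occurrence_score(skipped_counts):
    """Equations 1 and 2: each occurrence of a skipgram contributes
    1 / (k + 1), where k is its number of skipped terms."""
    return sum(1.0 / (k + 1) for k in skipped_counts)

# "hit the ball": one occurrence with 1 skipped term
# ("I hit the tennis ball") and one with none ("I hit the ball").
print(occurrence_score([1, 0]))  # 0.5 + 1.0 = 1.5
```
      </preformat>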
      <p>At the end of this process we have a list of skipgrams, each with a global score and a polarity score, which forms our sentiment resource.</p>
      <sec id="sec-6-1">
        <title>Learning</title>
        <p>Once we have created our statistical sentiment resource, we generate a machine learning model. We consider each polarity as a category and each text as a training instance to build our model. For each text, we define one feature per polarity. For example, if we are categorising into positive, negative or neutral (3 categories), there will be 3 features for each document, called positive, negative, and neutral respectively.</p>
        <p>The values for these features are calculated using the sentiment resource, combining the previously calculated scores of all the skipgram occurrences in the text, to finally have one value for each feature. The formula used can be seen in Equation 3, where p represents a specific polarity, t is a text from the dataset, o_{s,t} represents an occurrence of skipgram s in text t, and k is a function that returns the number of skipped terms of the input skipgram occurrence. This formula gives more importance to occurrences with a low number of skipped terms, with a high number of occurrences in the dataset in general, and with a high number of occurrences within a specific polarity.</p>
        <p>value(p, t) = Σ_{o_{s,t} ∈ t} ( 1 / (k(o_{s,t}) + 1) · score(s, p) / (score(s, p) + 1) · score(s, p) / score(s) )    (3)</p>
        <p>
          Finally, a model is generated using the features specified and their values obtained as explained above. The machine learning algorithm selected is Support Vector Machines (SVM), due to its good performance in text categorisation tasks
          <xref ref-type="bibr" rid="ref1">(Sebastiani, 2002)</xref>
          and in previous works (Fernandez et al., 2013).
        </p>
        <sec id="sec-6-1-1">
          <title>Evaluation</title>
          <p>We performed additional experiments using different category configurations. These are the configurations chosen:</p>
          <p>Default. In this configuration, we used the categories specified in the workshop: NONE, NEU, P+, P, N+ and N.</p>
          <p>Subjectivity. In this configuration, we used only two categories: SUBJECTIVE and OBJECTIVE. The SUBJECTIVE category includes the texts that express opinions (positive, neutral and negative), and the OBJECTIVE category represents non-opinionated texts. The goal of this configuration is to discover users' messages that involve opinions.</p>
          <p>Polarity. In this experiment, we used only two categories: POSITIVE and NEGATIVE, independently of their intensity. The rest of the texts were discarded. This kind of categorisation makes it possible to simplify an analysis report into only two main points of view.</p>
          <p>Polarity+Neutral. In these experiments, only the opinionated categories were used: POSITIVE, NEUTRAL and NEGATIVE. In this case, the NEUTRAL category includes both non-opinionated and neutral texts. Business companies in some cases need to consider neutral feedback, since neutral mentions can also be considered positive for their reputation.</p>
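          <p>The four configurations amount to relabelling the six workshop categories. A sketch of that mapping (the function is ours, and the handling of NEU and NONE follows our reading of the descriptions above):</p>
          <preformat>
```python
def map_label(label, configuration):
    """Relabel a TASS category (NONE, NEU, P+, P, N+, N) for one of
    the four evaluation configurations; None means 'discarded'."""
    if configuration == "Default":
        return label
    if configuration == "Subjectivity":
        return "OBJECTIVE" if label == "NONE" else "SUBJECTIVE"
    if configuration == "Polarity":
        if label in ("P+", "P"):
            return "POSITIVE"
        if label in ("N+", "N"):
            return "NEGATIVE"
        return None  # NEU and NONE discarded
    if configuration == "Polarity+Neutral":
        if label in ("P+", "P"):
            return "POSITIVE"
        if label in ("N+", "N"):
            return "NEGATIVE"
        return "NEUTRAL"  # both NONE and NEU
    raise ValueError(configuration)

print(map_label("NEU", "Polarity+Neutral"))  # NEUTRAL
```
          </preformat>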
          <p>For the experiments, we also employed additional datasets, so we can extrapolate our conclusions to other domains. Their distribution can be seen in Table 2. These are the datasets chosen:</p>
          <p>TASS-Train and TASS-Test. These are the official train and test datasets of the TASS 2015 Workshop, respectively.</p>
          <p>Sanders. This is the Sanders Dataset (www.sananalytics.com/lab/twitter-sentiment). It consists of hand-classified tweets labelled as positive, negative or neutral.</p>
          <p>
            MR-P. This is the well-known Movie Reviews Polarity Dataset 2.0
            <xref ref-type="bibr" rid="ref2">(Pang and Lee, 2004)</xref>
            . It contains reviews of movies labelled with respect to their overall sentiment polarity (positive and negative).
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>MR-PS</title>
      <p>The Movie Reviews Sentence Polarity Dataset 1.0 (Pang and Lee, 2005). It has sentences from movie reviews labelled with respect to their polarity (positive and negative).</p>
    </sec>
    <sec id="sec-9">
      <title>MR-SS</title>
      <p>
        The Movie Reviews Subjectivity Dataset 1.0
        <xref ref-type="bibr" rid="ref2">(Pang and Lee, 2004)</xref>
        . It has sentences from movie reviews labelled with respect to their subjectivity status (subjective or objective).
      </p>
      <p>These experiments were performed combining the datasets and the configurations, using 10-fold cross-validation, as these corpora do not have a default division into train and test sets. Note that not all the datasets can be used in all configurations. For example, the Sanders dataset can be used to evaluate Polarity and Polarity+Neutral, but not Subjectivity, as its texts are not explicitly divided into not opinionated (NONE) and neutral (NEU). Table 3 shows the results obtained.</p>
      <p>First of all, it should be noted that our model does not use information outside the training dataset. Thus, it will work very well with datasets in a specific domain with similar topics; however, in small and heterogeneous datasets the results will be lower. We consider MR-SS, MR-P and MR-PS homogeneous datasets (only within the movies domain) and TASS-Train, TASS-Test and Sanders heterogeneous datasets.</p>
      <p>
        As we can see in Table 3, the best results were obtained in subjectivity detection in closed domains (MR-SS), with an F-score of 0.92. In open domains the results are noticeably worse. In our opinion, the results obtained are good enough for business purposes, as studies like
        <xref ref-type="bibr" rid="ref4">Wilson et al. (2005)</xref>
        report a human agreement of 0.82 when working with the Polarity+Neutral configuration.
      </p>
    </sec>
    <sec id="sec-10">
      <title>Discussion</title>
      <p>In addition, when evaluating subjectivity the results are significantly better when the corpus belongs to a closed domain (movies in this case), and worse in open domains. However, polarity evaluation does not seem to be as domain-dependent as subjectivity evaluation: results evaluating polarity are very similar independently of the type of dataset employed (the Movie Reviews datasets are available at www.cs.cornell.edu/people/pabo/movie-review-data).</p>
      <sec id="sec-10-1">
        <title>Conclusions</title>
        <p>In this paper, we presented our contribution to Task 1 (Sentiment Analysis at global level) of the TASS 2015 competition. The approach presented is a hybrid approach, which builds its own sentiment resource based on annotated samples and generates a machine learning model based on the information collected.</p>
        <p>Different category configurations and different datasets were evaluated to assess the performance of our approach, considering the interests of business enterprises regarding the analysis of user feedback. The results obtained are promising and encourage us to continue with this line of research.</p>
        <p>As future work we plan to train our system with different datasets, in terms of size and domain, and to combine our sentiment lexicon with existing ones (such as SentiWordNet or WordNet-Affect) to improve the recall of our approach.</p>
      </sec>
      <sec id="sec-10-2">
        <title>References</title>
        <p>Balahur, Alexandra and Andres Montoyo. 2008a. Applying a culture dependent emotion triggers database for text valence and emotion classification. Procesamiento del Lenguaje Natural, 40:107-114.</p>
        <p>Balahur, Alexandra and Andres Montoyo. 2008b. Building a Recommender System using Community Level Social Filtering. In NLPCS, pages 32-41.</p>
        <p>Esuli, Andrea and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of LREC, volume 6, pages 417-422.</p>
        <p>Fernandez, Javi, Yoan Gutierrez, Jose M. Gomez, Patricio Martínez-Barco, Andres Montoyo, and Rafael Muñoz. 2013. Sentiment Analysis of Spanish Tweets Using a Ranking Algorithm and Skipgrams. In XXIX Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural (SEPLN 2013), pages 133-142.</p>
        <p>Guthrie, David, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. A closer look at skip-gram modelling. In Proceedings of LREC-2006, pages 1-4.</p>
        <p>Hu, Minqing and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD, pages 168-177. ACM.</p>
        <p>Miller, George A. 1993. Five papers on WordNet. Technical Report CLS-Rep-43, Cognitive Science Laboratory, Princeton University.</p>
        <p>Pang, Bo and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 271.</p>
        <p>Pang, Bo and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 115-124.</p>
        <p>Popescu, Ana-Maria and Oren Etzioni. 2007. Extracting product features and opinions from reviews. In Natural Language Processing and Text Mining, pages 9-28. Springer.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Tables 2 and 3</title>
      <p>[Tables 2 and 3: dataset distribution and results. Recoverable fragments: datasets TASS-Train, TASS-Test, Sanders, MR-P, MR-PS, MR-SS; configurations Subjectivity, Polarity+Neutral, Polarity; NONE counts 1,483, 21,416, 2,223, 5,331.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Sebastiani</surname>
          </string-name>
          , Fabrizio.
          <year>2002</year>
          .
          <article-title>Machine learning in automated text categorization</article-title>
          .
          <source>ACM computing surveys (CSUR)</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ):1-
          <fpage>47</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Strapparava</surname>
          </string-name>
          , Carlo y Alessandro Valitutti.
          <year>2004</year>
          .
          <article-title>WordNet Affect: an Affective Extension of WordNet</article-title>
          .
          <source>In LREC, volume 4, pages 1083-1086.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Villena-Román</surname>, <given-names>Julio</given-names></string-name>
          , Janine García-Morera, Miguel Á. García-Cumbreras, Eugenio Martínez-Cámara, M. Teresa Martín-Valdivia, and L. Alfonso Ureña-López.
          <year>2015</year>
          .
          <article-title>Overview of TASS 2015</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , P. Hoffmann, S. Somasundaran,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kessler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          , E. Riloff, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Patwardhan</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>OpinionFinder: A system for subjectivity analysis</article-title>
          .
          <source>En Proceedings of HLT/EMNLP on Interactive Demonstrations.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>