<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Opinion polarity detection in Twitter data combining sequence mining and topic modeling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asma Ouertatani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ghada Gasmi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiraz Latiri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIPAH, ENSI, University of Manouba</institution>
          ,
          <addr-line>Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIPAH, FST, University of Tunis El Manar</institution>
          ,
          <addr-line>Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LISI, INSAT, University of Carthage</institution>
          ,
          <addr-line>Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We propose a pipeline process to analyze opinions about festivals and cultural events by automatically detecting polarity in Twitter data. Previous studies have focused on the polarity classification of individual tweets. However, to understand the polarity of opinion in a domain, it is important to find the themes or topics that occur in the corpus. The first phase finds the optimal number of topics and identifies the major topics via the latent Dirichlet allocation (LDA) topic model. The second stage detects polarity in tweets using a sequence mining approach founded on sequences extracted from tweets with the LCM-seq algorithm [9]. The results showed that the polarity detection accuracy of the sequence mining approach was 84.78%, indicating that the proposed method is valid in most cases.</p>
      </abstract>
      <kwd-group>
        <kwd>topic modeling</kwd>
        <kwd>LDA</kwd>
        <kwd>opinion analysis</kwd>
        <kwd>sequence mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>With the advent of Web 2.0 and the evolution of social network services, users generate massive amounts of information in unstructured online reviews that cannot be directly processed by computers. Various researchers have analyzed the exchange of opinions that occurs on social network platforms.</p>
      <p>Twitter is an online social network where users post and interact with messages,
"tweets", restricted to 140 characters.</p>
<p>However, discovering sentiments and opinions through manual analysis of a large volume of textual data is extremely difficult. For that reason, specific preprocessing methods and algorithms are needed in order to mine useful patterns. Hence, in recent years, there has been much interest in the natural language processing community in developing novel text mining techniques capable of accurately extracting users' opinions from large volumes of information such as Twitter data.</p>
      <p>
        Among the various opinion mining tasks, one is polarity analysis, i.e. determining whether
the semantic orientation of a text is positive or negative. It focuses on
classifying the polarity of individual texts (e.g., web reviews or tweets) by selecting
important features through methods such as n-grams [10, 11], word subsequences
[12], information gain [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and recursive feature elimination [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. When applying
machine learning to opinion classification, most existing approaches rely on
supervised learning models trained on labeled corpora, where each document has
been labeled as positive or negative prior to training. A tweet is then classified
via algorithms such as naive Bayes, maximum entropy [11], or support vector
machines (SVM). However, sentiment classification models trained on
one domain might not work at all when moving to another domain. Furthermore,
in a more fine-grained opinion classification problem (e.g. finding users' opinions
on a particular film festival), topic detection and opinion classification are often
performed in a two-stage pipeline, by first detecting a topic and later
assigning a polarity label to that particular topic.
      </p>
<p>We propose a pipeline process to analyze opinions about festivals and cultural events by automatically detecting polarity in Twitter data. Previous studies have focused on the polarity classification of individual tweets.</p>
<p>However, to understand the polarity of opinion in a domain, it is important to find the themes or topics that occur in the corpus. Our goal in the first stage is to find the optimal number of topics and to identify the major topics via the latent Dirichlet allocation (LDA) topic model. The second stage detects polarity in tweets using a sequence mining approach founded on sequences extracted from tweets with the LCM-seq algorithm.</p>
<p>The remainder of this paper is organized as follows. Section 2 details the proposed method, which includes a data-preprocessing step; Section 3 presents the analysis results; and Section 4 presents the conclusions of this study and discusses directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed method</title>
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>
          The MC2@CLEF2017 lab has released a collection of 70 000 000 microblogs over
18 months dealing with cultural events [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The microblogs are written in many languages. We
used just 5 000 000 tweets from the collection.
        </p>
<p>Simple and intuitive preprocessing techniques were applied, such as the removal of links, Twitter identifiers, punctuation, and stop words.</p>
        <p>
          Such preprocessing clearly cannot be performed without knowing the underlying language.
Therefore, modern text processing tools rely heavily on highly effective
language identification algorithms. We employed the Cavnar and Trenkle [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] approach to
text categorization, based on character n-gram frequencies, which has been
particularly successful.
        </p>
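The Cavnar-Trenkle approach ranks character n-grams by frequency and compares rank profiles with an "out-of-place" distance. As a minimal sketch (not the textcat implementation; the profile size and reference texts below are toy assumptions):

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Rank character n-grams (lengths 1..n_max) by frequency, most frequent first."""
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(doc_profile, ref_profile):
    """Sum of rank differences; n-grams absent from the reference get a maximum penalty."""
    penalty = len(ref_profile)
    return sum(abs(rank - ref_profile.get(g, penalty))
               for g, rank in doc_profile.items())

def detect_language(text, references):
    """Pick the reference language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(references, key=lambda lang: out_of_place(doc, references[lang]))
```

In practice the reference profiles are trained on large monolingual corpora; a short document is then assigned to the language with the smallest out-of-place distance.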
<p>We used the implementation in the R extension package textcat, which aims at both flexibility and convenience. After the preprocessing phase, we chose the first 320 000 English tweets as our dataset. Figure 1 presents a word cloud built from our dataset. The word cloud is based on a text analysis method that highlights the most frequently used keywords (like music, film, etc.).</p>
        <p>Topic modeling is a type of statistical model in natural language processing that aims to find topics in a corpus, group topics together by looking for similarity and co-occurrence, and categorize the documents in the corpus based on the topic probabilities assigned to them.</p>
        <p>
          We specifically use a statistical method called latent Dirichlet allocation (LDA), one
of the most popular topic models [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In LDA, a topic is composed of terms with generation
probabilities. For each term position in a document, LDA identifies a topic, and
each topic is composed of the terms it includes, measured
probabilistically. Given a set of documents, LDA provides an algorithm that learns the
topics and the terms associated with each topic. LDA requires one input
parameter: the number of topics to extract.
        </p>
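The paper treats LDA as a black box. To make the description above concrete, here is a minimal collapsed Gibbs sampler for LDA, an illustrative sketch and not the tool actually used by the authors:

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of token-id lists.  Returns (phi, theta) where
    phi[k] is the term distribution of topic k and
    theta[d] is the topic distribution of document d."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(n_topics, size=len(d)) for d in docs]  # topic of each token
    ndk = np.zeros((len(docs), n_topics))    # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))   # topic-word counts
    nk = np.zeros(n_topics)                  # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove the token's current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional probability of each topic for this token
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())   # resample the topic
                z[d][i] = k; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    phi = (nkw + beta) / (nk[:, None] + vocab_size * beta)
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + n_topics * alpha)
    return phi, theta
```

On a corpus of hundreds of thousands of tweets one would of course use an optimized library implementation rather than this didactic loop.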
<p>The question then arises: what is the best way to determine k (the number of topics) in topic modeling?</p>
      </sec>
      <sec id="sec-2-2">
<title>Optimal number of topics for the LDA model</title>
<p>Before generating the topic model and analysing its output, we need to decide on the number of topics that the model should use. We used three metrics to estimate the best-fitting number of topics:
- Method based on the harmonic mean:</p>
<p>This method was first applied by Griffiths and Steyvers [8].
We calculated the harmonic mean of the sets of values p(w|z, k); varying k, we retain the model with the highest value.
z : per-word topic assignment.
w : word.</p>
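A numerically stable way to compute the harmonic mean of likelihood values p(w|z, k), sketched under the assumption that the sampled values are available in log space:

```python
import math

def log_harmonic_mean(log_values):
    """Harmonic mean of values supplied in log space:
    log HM = log n - logsumexp(-x), computed stably."""
    neg = [-v for v in log_values]
    m = max(neg)
    logsumexp = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(log_values)) - logsumexp
```

One would evaluate `log_harmonic_mean` on the sampled log-likelihoods for each candidate k and keep the k with the highest value.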
        <p>
          k : number of topics.
- Density-based method [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
        </p>
<p>The principle is to calculate the similarity (or distance) between all pairs of themes for different models obtained by varying the number of themes.</p>
        <p>
          Themes are more independent if the similarity between them is small.
- Method based on the Kullback-Leibler divergence (KL) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
        </p>
<p>The divergence measure quantifies how the topic distribution for document m and the word distribution for a topic diverge from a second topic's expected probability distribution.</p>
<p>The optimal k is the one with the lowest divergence. All three methods require training multiple LDA models and selecting the one with the best performance, so the most efficient way is to calculate all three metrics at once. Figure 2 shows the results computed on the whole dataset. The three methods agree that somewhere between 75 and 100 topics is optimal for this dataset. To find the best value of the number-of-topics hyperparameter k, we used the perplexity measure, which assesses the applicability of a topic model to new data, with 5-fold cross-validation over the range k = [75..100]. Perplexity measures how well a probability model predicts a sample. We opted to fit a model with 85 topics; Figure 3 plots the results.</p>
        <p>Terms are assigned to topics with probabilities, so every term in the corpus is given a probability per topic. However, we can use the top terms to get a sense of what each topic covers. Figure 4 shows the topic names. For the second stage of our approach, we used the films topic.</p>
        <p>Before starting the polarity analysis phase, one must go through a subjectivity analysis stage to remove the objective tweets from our collection. To do this, we used the subjectivity lexicon 1 and N-grams as features and the</p>
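The perplexity criterion used above for selecting k is computed from per-token predictive probabilities. A minimal sketch, with hypothetical probability values:

```python
import math

def perplexity(token_probs):
    """Perplexity of held-out tokens: the exponential of the
    negative average log-likelihood.  Lower is better."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

As a sanity check, a model that assigns every token the uniform probability 1/V over a vocabulary of size V has perplexity exactly V; a model that predicts every held-out token with certainty has perplexity 1.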
        <sec id="sec-2-2-1">
          <title>1 http://mpqa.cs.pitt.edu/lexicons/subj-lexicon</title>
          <p>nave bayes as classi er.</p>
          <p>For the polarity detection, we used the WordNet sentiment lexicon, TF-IDF, and the
LCM-seq algorithm 2 to extract all frequent sequences, which are used as features.
LCM-seq is an efficient algorithm for enumerating frequent sequence patterns
from a sequential database. In addition to its high speed, LCM-seq can be
applied in a variety of ways, as it can assign a positive or negative weight to each
sequence and extract only the frequent sequence patterns that appear within a specified
window width [9].</p>
          <p>For a vocabulary V, the set of finite sequences over V is denoted V*. A
sequence pattern is an arbitrary sequence s = a₁…aₙ ∈ V*, and P = V*
denotes the set of all sequence patterns over V. A sequence database over V is
a sequence set S = {s₁, …, sₘ}; we denote the size of S by |S|. For a sequence
pattern p ∈ P, a sequence of the database including p is called an occurrence of p.
The denotation of p, written Φ(p), is the set of the occurrences of p, and |Φ(p)|
is called the frequency of p, denoted Freq(p). For a given constant σ ∈ ℕ,
called the minimum support, a sequence pattern p is frequent if Freq(p) ≥ σ. In our
approach, we used a minimum support value equal to 100.</p>
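LCM-seq itself is an external C implementation (footnote 2). As a behavioral sketch only, the following naive level-wise miner enumerates frequent (possibly non-contiguous) subsequence patterns under the Freq and minimum-support definitions above; the toy database and min_sup = 2 are illustrative, whereas the paper uses min_sup = 100 on the real tweet collection:

```python
def contains(seq, pattern):
    """True if pattern occurs in seq as a (possibly non-contiguous) subsequence."""
    i = 0
    for tok in seq:
        if i < len(pattern) and tok == pattern[i]:
            i += 1
    return i == len(pattern)

def freq(db, pattern):
    """Freq(p): number of database sequences containing the pattern."""
    return sum(contains(s, pattern) for s in db)

def mine_frequent_sequences(db, min_sup, max_len=3):
    """Naive level-wise enumeration: extend each frequent pattern by one item
    at a time.  (LCM-seq achieves the same result far more efficiently via
    prefix-preserving extensions.)"""
    items = sorted({t for s in db for t in s})
    level = [(t,) for t in items if freq(db, (t,)) >= min_sup]
    frequent = list(level)
    while level and len(level[0]) < max_len:
        level = [p + (t,) for p in level for t in items
                 if freq(db, p + (t,)) >= min_sup]
        frequent.extend(level)
    return frequent
```

Each frequent pattern returned (e.g. a word pair that co-occurs in order across many tweets) then serves as one binary feature for the downstream classifier.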
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>Experimental validation</title>
<p>For the subjectivity analysis phase, we used as training data the corpus introduced by Pang and Lee (ACL 2004) 3, with the subjectivity lexicon and N-grams as features.</p>
<p>For the polarity detection, we used the sentiment140 data as training data 4, and we used the frequent sequences as features for the naive Bayes classifier.</p>
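As a sketch of this training setup, here is a self-contained multinomial naive Bayes classifier over plain token features. Note the simplifications: the paper's actual features are frequent sequences and TF-IDF weights, and the tiny labeled tweets below are hypothetical, not drawn from sentiment140:

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Multinomial naive Bayes with Laplace smoothing.
    docs: list of token lists; labels: parallel list of class labels."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
        vocab.update(doc)
    v = len(vocab)
    loglik = {c: {w: math.log((counts[c][w] + 1) / (sum(counts[c].values()) + v))
                  for w in vocab} for c in classes}
    # smoothed log-probability for words never seen with a class
    unseen = {c: math.log(1 / (sum(counts[c].values()) + v)) for c in classes}
    return prior, loglik, unseen

def classify(model, doc):
    """Return the class maximizing log prior + sum of word log-likelihoods."""
    prior, loglik, unseen = model
    scores = {c: prior[c] + sum(loglik[c].get(w, unseen[c]) for w in doc)
              for c in prior}
    return max(scores, key=scores.get)
```

Swapping the token features for frequent-sequence indicators leaves the classifier unchanged: each mined sequence simply becomes one "word" in the feature vocabulary.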
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation protocol</title>
<p>As evaluation metric, we used the classifier accuracy.</p>
        <p>The accuracy can be defined as the percentage of correctly classified instances:</p>
        <p>Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)</p>
        <p>where TP, FN, FP and TN represent the number of true positives, false negatives, false positives and true negatives, respectively.</p>
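Equation (1) in code, with hypothetical confusion-matrix counts:

```python
def accuracy(tp, tn, fp, fn):
    """Equation (1): share of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, 50 true positives, 40 true negatives and 5 errors of each kind yield an accuracy of 0.9.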
<p>The following table illustrates the results for the naive Bayes classifier:</p>
        <sec id="sec-3-2-1">
<title>2 http://research.nii.ac.jp/uno/code/LCM-seq.html 3 http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html 4 http://help.sentiment140.com/for-students</title>
          <p>The polarity detection aims to automatically classify customer opinions and to
provide a comprehensive understanding of customer feedback from raw data on the
Web. Of all the social network platforms, Twitter has been one of the most
popular sources for marketing research and sentiment classification.
The work described in this paper is a step towards efficient classification of tweets
using topic modeling.
Dublin, Ireland, 11/09/2017-14/09/2017, volume 10456 of Lecture Notes in
Computer Science. Springer, 2017.
8. T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the</p>
          <p>National Academy of Sciences, 101(suppl 1):5228-5235, 2004.
9. T. Nakahara, T. Uno, and K. Yada. Extracting promising sequential patterns from
RFID data using the LCM sequence. In Knowledge-Based and Intelligent Information
and Engineering Systems. Springer, 2010.
10. A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion
mining. In LREC, volume 10, 2010.
11. B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using
machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical
Methods in Natural Language Processing, pages 79-86. Association for
Computational Linguistics, 2002.
12. R. Xia, C. Zong, and S. Li. Ensemble of feature sets and classification algorithms
for sentiment classification. Information Sciences, 181(6):1138-1152, 2011.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thoms</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Fu</surname>
          </string-name>
          .
<article-title>Affect analysis of web forums and blogs using correlation ensembles</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>20</volume>
          (
          <issue>9</issue>
          ):
          <volume>1168</volume>
-
          <fpage>1180</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Arun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Suresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Veni Madhavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. Narasimha</given-names>
            <surname>Murthy</surname>
          </string-name>
          .
<article-title>On finding the natural number of topics with latent dirichlet allocation: Some observations</article-title>
          .
          <source>Advances in Knowledge Discovery and Data Mining</source>
          , pages
          <volume>391</volume>
-
          <fpage>402</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of machine Learning research</source>
          ,
          <volume>3</volume>
          (Jan):
          <volume>993</volume>
-
          <fpage>1022</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>A density-based method for adaptive lda model selection</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>72</volume>
          (
          <issue>7</issue>
          ):
          <volume>1775</volume>
-
          <fpage>1781</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Cavnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Trenkle</surname>
          </string-name>
          , et al.
          <article-title>N-gram-based text categorization</article-title>
          . Ann Arbor MI,
          <volume>48113</volume>
          (
          <issue>2</issue>
          ):
          <volume>161</volume>
-
          <fpage>175</fpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Beaupre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Pouliot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wakim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leclerc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          , et al.
<article-title>Bulk heterojunction solar cells using thieno [3, 4-c] pyrrole-4, 6-dione and dithieno [3, 2-b: 2, 3-d] silole copolymer with a power conversion efficiency of 7.3%</article-title>
          .
          <source>Journal of the American Chemical Society</source>
          ,
          <volume>133</volume>
          (
          <issue>12</issue>
          ):
          <volume>4250</volume>
-
          <fpage>4253</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Sanjuan</surname>
          </string-name>
          .
          <article-title>CLEF 2017 Microblog Cultural Contextualization Lab Overview (regular paper)</article-title>
          .
          In
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction, CLEF</source>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>