
Opinion Analysis of Bi-Lingual Event Data from Social Networks

Iqra Javed

iqra217@gmail.com

Hammad Afzal

hammad.afzal@mcs.edu.pk

Department of Computer Software Engineering, National University of Sciences and Technology, Islamabad, Pakistan

Social networks have recently emerged as one of the fastest and most effective media for sharing news updates, trends and personal views. Several studies have performed detailed sentiment analysis on such data for most well-resourced languages. However, no such study existed for Urdu, despite it being spoken by around 30 Million people around the globe and used in regions with the fastest growth of broadband users. This research is a first step in this direction: it proposes a language resource comprising the sentiment strengths of Roman Urdu words and demonstrates its utility through a case study, a spatial analysis of bi-lingual (Urdu and English) tweets in the context of a national event, namely the general elections of 2013. The results are encouraging and show the effective utility of the bi-lingual sentiment strength database.

Keywords: Sentiment Analysis, Twitter Data, Language Resources

Over the last few years, the public has increasingly turned to social networks for news updates, upcoming trends, community updates and the expression of personal views on various events. These events range from smaller ones of interest only to a particular region or community, such as local seminars or concerts, to larger ones that can interest an entire country (epidemics, weather or political events). The popularity of social networks as a place for the public to share opinions has led to their use as a tool for reviewing opinion and predicting outcomes for events that concern people with common issues and problems. Several case studies have considered the geographical and temporal analysis of such events [2-10].

Twitter (footnote 1) is considered one of the most popular micro-blogging social networking websites, with more than 554 million active users as of 2013 (footnote 2). Twitter users' posts, known as “tweets”, are generally used as an information broadcasting tool for local events, and they can be mined for the effects before and after those events. In addition, they can be used for opinion analysis within a specific region and specific time bounds.

1 https://twitter.com/
2 http://www.statisticbrain.com/twitter-statistics/

This research presents an approach to the analysis of bi-lingual tweets describing the public's opinions about a national event. We focus on a case study of Pakistan's general elections of 2013. Pakistan is considered one of the fastest growing countries in terms of IT users and broadband usage. With youth making up the major portion of the population (footnote 3), such frameworks can be used very effectively for trend prediction. Although English is commonly used in higher education, the general public is not well versed in it; this does not hold people back, however, and they tend to express their opinions in Urdu written in English script (termed Roman Urdu hereafter in this paper). We performed spatial and temporal analysis covering five major cities in Pakistan (having populations around 50 Million each) over a period of 5 months. The results obtained by our analysis mostly agree with the election results (announced in May 2013) and with the observations made by other survey organizations (using means other than social network data).

Background

Manually prepared lexicons and machine learning techniques have mostly been used in sentiment analysis to analyze mood, classify emotion and extract opinion from the text of tweets. The technique proposed in [2] classifies tweets by their content and groups them into hot topics according to how frequently tweets appear on related topics and the geo-location information associated with the tweet text. However, due to semantic fluctuations, this classification technique does not work particularly well, as tweets can use multiple words to refer to the same event.

Ishikawa, Arakawa, Tagashira and Fukuda [3] discuss a system that detects hot topics in a local area within a specified time period, and propose a classification method that reduces the variation of posted words related to the same topic in tweets. The hot topics can be predictable events (matches, elections, festivals) or non-predictable events (natural disasters). Such event analysis is helpful for business strategy, disease information and social relationships.

Wong and Chang conducted quantitative and qualitative analysis of informative and affective tweets based on word frequencies and word co-occurrence [5]. They used event-related, context-specific vocabulary to train their classifier. Open source resources have also been used for lexicon building and sentiment classification, but the classifier performed poorly on untrained domains [7]. Polarity classification was performed in [8] using a lexicon-based approach with manual annotation; tweets that contained both positive and negative emotions were ruled out. A lexicon-based approach is also applied in SentiStrength [10] for sentiment analysis of text. However, these lexicons provide limited support and need manually marked entries. Furthermore, no support is available for Roman Urdu or for political text analysis.

3 http://southasiainvestor.blogspot.com/2011/10/pakistan-ranks-among-fastest-growing.html

Methodology

The aim of the proposed research is to provide a framework for analysing bi-lingual data from Twitter within spatial and temporal bounds. Pakistan's general elections of 2013 are taken as the case study. The text retrieved from Twitter comprises tweets written in two languages, English and Roman Urdu. Sentiment analysis is performed on this bi-lingual text using existing (customized) and newly created sentiment lexicons. The steps performed in our approach are illustrated in Fig. 1 and elaborated below.

Our approach starts with the collection of a tweet dataset. The Twitter search API is used to retrieve tweets based on keywords. We collect tweets related to four main political parties, Pakistan Tehreek-e-Insaaf (PTI), Pakistan Muslim League Nawaz PML(N), Pakistan Peoples Party (PPP) and Mutahidda Quomi Movement (MQM), from five major cities of Pakistan (Islamabad, Lahore, Karachi, Peshawar and Quetta), considering a radius of 20 miles around each city. The dataset is collected on a weekly basis, over a time span from Dec 2012 until polling day (11th May, 2013).
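The collection step can be sketched as the construction of one keyword query per party and city. The sketch below is only illustrative: it assumes a `geocode` value of the form latitude,longitude,radius (the form accepted by Twitter's keyword search API), and the city-centre coordinates and party keywords are approximate, hypothetical values that are not taken from this paper. It builds the query parameters only and does not contact any service.

```python
# Approximate city-centre coordinates (assumed for illustration only).
CITY_COORDS = {
    "Islamabad": (33.68, 73.05),
    "Lahore": (31.52, 74.36),
    "Karachi": (24.86, 67.00),
    "Peshawar": (34.02, 71.52),
    "Quetta": (30.18, 66.98),
}

# Hypothetical search keywords, one per party.
PARTY_KEYWORDS = ["pti", "pmln", "ppp", "mqm"]

RADIUS_MILES = 20  # radius around each city, as stated in the text


def build_queries(coords=CITY_COORDS, keywords=PARTY_KEYWORDS,
                  radius_miles=RADIUS_MILES):
    """Return one parameter dict per (city, keyword) pair."""
    queries = []
    for city, (lat, lon) in coords.items():
        for keyword in keywords:
            queries.append({
                "city": city,
                "q": keyword,
                # "latitude,longitude,radius" with the radius in miles
                "geocode": f"{lat},{lon},{radius_miles}mi",
            })
    return queries


queries = build_queries()
print(len(queries), queries[0])
```

Running such a query set once per week would correspond to the weekly collection schedule described above.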

Classification of Tweets

Two iterations of classification are performed over the dataset retrieved from Twitter, both on a keyword basis. The first iteration discriminates between tweets with political and non-political content. This step was required because most of the spammers, particularly those belonging to real estate businesses, exploited the popularity of keywords related to political parties. Some keywords that were used to identify noisy (non-political) tweets are summarized in Table 1.

The second iteration of classification was performed to discriminate between English and Roman Urdu. This was also based on the presence of keywords, taken from a set of commonly used English words as presented in Table 2.

S.No | Party | City | Language | Text
1 | Pti | Peshawar | Roman Urdu | peshawar: jamaat-e-islami aur pti ke dermian khyber pakhtunkhwa mey seat adjustment per ittefaak na husaka.
2 | Mqm | Karachi | Roman Urdu | karachi: mqm nay aam intikhabat main mulk bhar say party ticket kay liye darkhastain talab kar lein dr. farooq sattar.b.n
3 | Pml | Lahore | Roman Urdu | :lahore: \nsabiq governor state bank dr. ishrat hussain ko nigran wazir e azam banai janne ka imkaan zarai.\n#ppp #pmln #pti
4 | Pti | Islamabad | English | :#pti & #ji flirting in rawalpindi :d >>>> http:\/\/t.co\/0rqippguod

Table 3. Sample of Tweets Collected and Saved in Database.
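The two keyword-based classification passes described in this section can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the noise keywords and the English marker words are hypothetical stand-ins for the contents of Table 1 and Table 2, and the tokenizer is a simple assumption.

```python
import re

# Hypothetical stand-ins for the keyword lists of Table 1 and Table 2.
NOISE_KEYWORDS = {"plot", "property", "estate", "rent"}
ENGLISH_MARKERS = {"the", "is", "in", "and", "of", "to", "for", "with"}


def tokenize(text):
    """Lower-case the tweet and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_political(text):
    """First pass: a tweet is noise if it contains any noise keyword."""
    return not (set(tokenize(text)) & NOISE_KEYWORDS)


def detect_language(text):
    """Second pass: English if any common English word is present."""
    if set(tokenize(text)) & ENGLISH_MARKERS:
        return "English"
    return "Roman Urdu"


def classify(text):
    """Apply both passes; returns 'noise', 'English' or 'Roman Urdu'."""
    if not is_political(text):
        return "noise"
    return detect_language(text)


print(classify("#pti & #ji flirting in rawalpindi"))
print(classify("mqm nay aam intikhabat main mulk bhar say party ticket"))
print(classify("plot for sale near pti office"))
```

A presence test of this kind is cheap but coarse: short words such as "to" or "is" also occur in Roman Urdu, so a Roman Urdu tweet containing one of them would be labelled English under this sketch.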

Creation of Bi-Lingual Sentiment Repository

In order to perform text analysis of bi-lingual tweets, we need a database capable of providing the sentiment strength of words used in bi-lingual tweet messages. For English, SentiStrength (footnote 4) is used to extract the sentiment strength of English lexical items. The original SentiStrength contains 2546 English words along with their sentiment scores, ranging from -4 to +4. However, no such attempt had been made for Urdu (Roman Urdu). We therefore created our own lexicon, which assigns sentiment strength scores to Roman Urdu words with a structure similar to that of SentiStrength. Two resources, SentiStrength and an English to Roman-Urdu dictionary (footnote 5), are used to create a unified sentiment strength database. English words from SentiStrength were looked up for their Roman-Urdu translations, and the English words together with their Roman-Urdu translations and SentiStrength scores are combined to create the Bi-Lingual Sentiment Repository (BLSR), as shown in Table 4.

The Bi-Lingual Sentiment Repository (BLSR) thus created provides the sentiment strength of 1673 English and 3900 Roman-Urdu words. Sentiment strength ranges from -4 to -1 for negative strength (-4 most negative, -1 least negative) and from 1 to 4 for positive strength (1 least positive, 4 most positive); 0 represents no sentiment strength and is treated as neutral.
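The construction of the repository can be sketched as a join between an English sentiment lexicon and an English to Roman-Urdu dictionary, with each translation inheriting the score of its English source word. The entries and scores below are toy values chosen for illustration and are not actual SentiStrength data. Dropping English words that have no translation is an assumption, suggested by the repository holding fewer English words (1673) than the original lexicon (2546).

```python
# Toy English lexicon: word -> sentiment strength (illustrative values only).
ENGLISH_SCORES = {"good": 2, "bad": -2, "love": 3, "awful": -3}

# Toy English -> Roman-Urdu dictionary (illustrative entries only).
EN_TO_ROMAN_URDU = {
    "good": ["acha", "behtar"],
    "bad": ["bura", "kharab"],
    "love": ["mohabbat", "pyar"],
}


def build_blsr(english_scores, en_to_roman_urdu):
    """Return a token -> strength map covering both languages.

    Assumption: an English word is kept only if the dictionary offers at
    least one Roman-Urdu translation; each translation inherits the score
    of its English source word.
    """
    blsr = {}
    for word, score in english_scores.items():
        translations = en_to_roman_urdu.get(word, [])
        if not translations:
            continue  # no translation found, word is left out
        blsr[word] = score
        for translation in translations:
            # Keep the first score seen if a translation occurs twice.
            blsr.setdefault(translation, score)
    return blsr


blsr = build_blsr(ENGLISH_SCORES, EN_TO_ROMAN_URDU)
print(blsr)
```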

Tweets belonging to each political party are tokenized. After tokenization, each token is assigned a strength from SentiStrength and BLSR. The strength of every single tweet is then computed as follows:

Sentiment-Tweet: $ST = F_1 S_1 + F_2 S_2 + F_3 S_3 + \dots + F_n S_n$ (1)

4 http://sentistrength.wlv.ac.uk/
5 http://www.scribd.com/doc/14203656/English-to-Urdu-and-Roman-Urdu-Dictionary

where $F_1, F_2, \dots, F_n$ are the frequencies of the tokens appearing in a tweet, $S_1, S_2, \dots, S_n$ are the sentiment strengths of the corresponding tokens, and $n$ is the number of tokens in the given tweet.

Using the database, the strength of each political party can then be computed as:

Sentiment-Party: $SP = \sum_{i=1}^{m} ST_{p_i}$ (2)

where $ST_{p_i}$ is the strength of the $i$-th tweet belonging to a particular party $p$, and $m$ is the number of tweets belonging to party $p$.
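The two scores can be sketched in a few lines. The sketch makes several assumptions: the lexicon is a toy example, tokens missing from the lexicon contribute a strength of 0, each distinct token is counted once with its frequency, and the party score is taken to be a plain sum of the scores of that party's tweets.

```python
import re
from collections import Counter

# Toy lexicon: token -> sentiment strength (illustrative values only).
LEXICON = {"good": 2, "acha": 2, "bad": -2, "bura": -2}


def tweet_strength(text, lexicon):
    """Sum of frequency * strength over the distinct tokens of one tweet."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    # Tokens that are not in the lexicon are treated as neutral (0).
    return sum(freq * lexicon.get(token, 0) for token, freq in counts.items())


def party_strength(tweets, lexicon):
    """Sum of the tweet strengths over all tweets of one party."""
    return sum(tweet_strength(tweet, lexicon) for tweet in tweets)


print(tweet_strength("acha acha leader, bura system", LEXICON))
print(party_strength(["good good", "bad"], LEXICON))
```

In the first call, "acha" occurs twice with strength 2 and "bura" once with strength -2, so the tweet scores 2 * 2 + 1 * (-2) = 2.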

Handling the Missing Tokens in BLSR

Many important terms could not be found in BLSR because of typographical errors, transliteration errors and individual shorthand spellings of English and Roman-Urdu words. To handle such typographical errors in Roman-Urdu tokens, several string approximation measures (bigram-based cosine similarity, Dice coefficient and Jaccard similarity) were applied. We found that bigram-based cosine similarity outperformed the other metrics.
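The three measures can be sketched over character bigrams as shown below. This is an illustrative sketch only: the matching threshold of 0.6, the helper name `closest` and the sample spellings are assumptions that do not come from this study.

```python
from collections import Counter
from math import sqrt


def bigrams(word):
    """Character bigrams of a word, e.g. 'acha' -> ['ac', 'ch', 'ha']."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def cosine(a, b):
    """Cosine similarity between the bigram count vectors of a and b."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values()) *
                sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def dice(a, b):
    """Dice coefficient over the bigram sets of a and b."""
    sa, sb = set(bigrams(a)), set(bigrams(b))
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def jaccard(a, b):
    """Jaccard similarity over the bigram sets of a and b."""
    sa, sb = set(bigrams(a)), set(bigrams(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def closest(token, vocabulary, threshold=0.6):
    """Best lexicon entry by bigram cosine, or None below the threshold."""
    best = max(vocabulary, key=lambda w: cosine(token, w), default=None)
    if best is not None and cosine(token, best) >= threshold:
        return best
    return None


print(closest("achaa", ["acha", "bura"]))
print(cosine("achaa", "acha"), dice("achaa", "acha"), jaccard("achaa", "acha"))
```

A token that is missing from the lexicon would then borrow the strength of the entry returned by `closest`, and remain unscored when no entry passes the threshold.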

To increase the recall of English words, WordNet is used to obtain synonyms for English tokens that do not exist in SentiStrength. Sentiment strength is then assigned to such tokens on the basis of their synonyms.
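The synonym fallback can be sketched with a pluggable synonym source. In the sketch below a small hand-written table stands in for WordNet so that the example is self-contained; the table, the lexicon values and the rule of taking the first synonym found in the lexicon are all assumptions made for illustration.

```python
# Toy synonym table standing in for WordNet (illustrative entries only).
TOY_SYNONYMS = {
    "terrific": ["great", "wonderful"],
    "dreadful": ["awful", "terrible"],
}

# Toy lexicon: token -> sentiment strength (illustrative values only).
LEXICON = {"great": 3, "awful": -3}


def toy_synonyms_of(token):
    """Look up synonyms in the toy table; unknown tokens have none."""
    return TOY_SYNONYMS.get(token, [])


def strength_with_synonyms(token, lexicon, synonyms_of):
    """Strength of a token, falling back to its first scored synonym."""
    if token in lexicon:
        return lexicon[token]
    for synonym in synonyms_of(token):
        if synonym in lexicon:
            return lexicon[synonym]
    return 0  # no direct or synonym match, treated as neutral


print(strength_with_synonyms("terrific", LEXICON, toy_synonyms_of))
print(strength_with_synonyms("dreadful", LEXICON, toy_synonyms_of))
print(strength_with_synonyms("table", LEXICON, toy_synonyms_of))
```

With a WordNet interface such as the one in NLTK, `synonyms_of` could instead collect the lemma names of the synsets returned for the token.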

Results and Discussion

The dataset contains 91,804 tweet messages collected for four political parties in five major cities, of which 21,821 tweets are noisy (non-political) data. Detailed statistics on the number of tweets collected from the various cities and about the different parties are presented in Table 5.

In language classification, the remaining 69,983 political tweets were split into 62,797 classified as English and 7,186 classified as Roman Urdu, as depicted in Table 6.

We have proposed a method for sentiment analysis of bi-lingual (English and Roman Urdu) data from social networks, focusing particularly on Twitter data. We considered the case study of the 2013 general elections in Pakistan. Tweets related to the major political parties of Pakistan were collected from five major cities. A bi-lingual lexicon was constructed that provides sentiment strength for English as well as Roman Urdu words used in tweets. To increase the coverage of this bi-lingual lexicon, WordNet is used for English tokens; similarly, for Roman Urdu tweets, bigram-based cosine similarity is used for string approximation, which reduces the effect of typographical errors. Using these resources, we have examined the dominance of political parties in Pakistan before the 2013 elections. The difference between the results for English and Urdu tweets shows two separate clusters of the population and their political affiliations. The imbalance between the number of English and Urdu tweets arises from the simple classification method used to detect language, which resulted in many Roman Urdu tweets being marked as English. This could be improved by incorporating more sophisticated methods. Furthermore, the size of the lexicon can be increased by using lexical and contextual similarity based techniques [11] to collect similar terms from a corpus (in this case, the WWW can be used). The constructed bi-lingual lexicon is not domain specific and can therefore be used for any other domain as well.

References

1. B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury, “Twitter power: Tweets as electronic word of mouth,” Journal of the American Society for Information Science and Technology, vol. 60, no. 11, pp. 2169-2188, 2009.

2. Chung-Hong Lee, Hsin-Chang Yang, Tzan-Feng Chien, and Wei-Shiang Wen, “A Novel Approach for Event Detection by Mining Spatio-temporal Information on Microblogs,” in International Conference on Advances in Social Networks Analysis and Mining, 2011.

3. Shota Ishikawa, Yutaka Arakawa, Shigeaki Tagashira, and Akira Fukuda, “Hot Topic Detection in Local Areas Using Twitter and Wikipedia,” in ARCS Workshops (ARCS), 28-29 Feb. 2012.

4. Alexander Pak and Patrick Paroubek, “Twitter for Sentiment Analysis: When Language Resources Are Not Available,” 22nd International Workshop on Database and Expert Systems Applications, 2011.

5. Yi, Jackson Wong, Yimeng Deng, and Klarissa Chang, “An Exploration of Social Media in Public Opinion Convergence: Elaboration Likelihood and Semantic Networks on Political Events,” Ninth IEEE International Conference on Dependable, Autonomic and Secure Computing, 2011.

6. Asli Celikyilmaz, Dilek Hakkani-Tur, and Junlan Feng, “Probabilistic Model-Based Sentiment Analysis of Twitter Messages,” Spoken Language Technology Workshop (SLT), 12-15 Dec. 2010, pp. 79-84.

7. Vinh Ngoc Khuc, Chaitanya Shivade, Rajiv Ramnath, and Jay Ramanathan, “Towards Building Large-Scale Distributed Systems for Twitter Sentiment Analysis,” SAC'12, Riva del Garda, Italy, March 25-29, 2012.

8. Georgios Paltoglou and Mike Thelwall, “Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media,” ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 4, September 2012.

9. Akshaya Iyengar, Tim Finin, and Anupam Joshi, “Content-based prediction of temporal boundaries for events in Twitter,” IEEE International Conference on Privacy, Security, Risk, Trust, and IEEE International Conference on Social Computing, 2011.

10. M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas, “Sentiment strength detection in short informal text,” Journal of the American Society for Information Science and Technology, vol. 61, no. 12, pp. 2544-2558, 2010.

11. Hammad Afzal, Robert Stevens, and Goran Nenadic, “Towards Semantic Annotation of Bioinformatics Services: Building a Controlled Vocabulary,” Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), pp. 5-12.