<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Opinion Analysis of Bi-Lingual Event Data from Social Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iqra Javed</string-name>
          <email>iqra217@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hammad Afzal</string-name>
          <email>hammad.afzal@mcs.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Software Engineering, National University of Sciences and Technology</institution>
          ,
          <addr-line>Islamabad</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social networks have recently emerged as one of the fastest and most effective media for sharing news updates, trends and personal views. Several studies have performed detailed sentiment analysis on such data for most well-resourced languages. However, no such study exists for Urdu, despite it being spoken by around 30 million people around the globe and used in regions with the fastest growth of broadband users. This research is a first step in this direction: it proposes a language resource comprising the sentiment strengths of Roman Urdu words and demonstrates its utility through a case study, the spatial analysis of bi-lingual (Urdu and English) tweets in the context of a national event, namely the 2013 general elections. The results are encouraging and show the utility of the bi-lingual sentiment strength database.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Twitter Data</kwd>
        <kwd>Language Resources</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Over the last few years, the public has increasingly turned to social
networks for news updates, upcoming trends, community updates and the expression of
personal views on various events. These events range from smaller ones, of interest
only to a particular region or community (such as local seminars or concerts), to
larger ones that can be of interest to an entire country (epidemics, weather or political
events). The popularity of social networks as a place for the public to share opinions has led
to their use as a tool for reviewing opinion and predicting outcomes of events that concern
people with common issues and problems. Several case studies
consider the geographical and temporal analysis of such events [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7 ref8 ref9">2-10</xref>
        ].
      </p>
      <p>Twitter<sup>1</sup> is one of the most popular micro-blogging social
networking websites, with more than 554 million active users as of 2013<sup>2</sup>. Users' posts,
known as “tweets”, are commonly used to broadcast information about local
events and can be mined to study the effects before and after such events. They can
also be used for opinion analysis of a specific region within specific time bounds.</p>
    </sec>
    <sec id="sec-2">
      <p><sup>1</sup> https://twitter.com/ <sup>2</sup> http://www.statisticbrain.com/twitter-statistics/</p>
      <p>This research presents an approach to the analysis of bi-lingual tweets that describe the
public’s opinions about a national event. We focus on a case study
of Pakistan’s 2013 general elections. Pakistan is considered one of the
fastest growing countries in terms of IT users and broadband usage. With youth forming the
major portion of the population<sup>3</sup>, such frameworks can be used very effectively for trend
prediction. Although English is commonly used in higher education, the general public
is not well versed in it; this does not restrict them, however, as they
tend to express their opinions in Urdu written in English script (termed Roman
Urdu hereafter). We performed spatial and temporal analysis
covering five major cities in Pakistan (having populations around 50 Million each)
over a period of 5 months. The results obtained by our analysis mostly conform to
the election results (announced in May 2013) and to the observations made by
other survey organizations (using means other than social network data).</p>
      <sec id="sec-2-1">
        <title>Background</title>
        <p>
          Manually prepared lexicons and machine learning techniques have mostly been
used in sentiment analysis for mood analysis, emotion classification and opinion
extraction from text such as tweets. The technique proposed in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] classifies tweets by
content and groups them into hot topics according to the
frequency of tweets on related topics and the geo-location information
associated with the tweet text. However, due to semantic fluctuations, this
classification technique does not work well, as tweets can use multiple
words to refer to the same event.
        </p>
        <p>
          Ishikawa, Arakawa, Tagashira and Fukuda discuss a system that detects hot topics in
a local area within a specified time period, and propose a classification method that
reduces the variation of posted words related to the same topic in tweets. Hot topics
can be predictable events (matches, elections, festivals) or non-predictable events (natural
disasters). Such event analysis is helpful for business strategy, disease
information and social relationships [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          Wong and Chang conducted quantitative and qualitative analysis on informative
and affective tweets based on word frequencies and word co-occurrence [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. They
used event-related, context-specific vocabulary to train their classifier. Open-source
resources have also been utilized for lexicon building and sentiment classification, but
the classifier performed poorly on domains it was not trained on [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Polarity classification
was performed in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] using a lexicon-based approach with manual annotation.
Tweets that contained both positive and negative
emotions were ruled out. A lexicon-based approach is also applied in SentiStrength [10] for sentiment
analysis of text. However, these lexicons provide limited support and require manually
annotated entries. Furthermore, no support is available for Roman-Urdu or for political text analysis.
        </p>
        <p><sup>3</sup> http://southasiainvestor.blogspot.com/2011/10/pakistan-ranks-among-fastest-growing.html
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Methodology</title>
        <p>The aim of the proposed research is to provide a framework for analysing
bi-lingual data from Twitter within spatial and temporal bounds. Pakistan’s 2013 general
election is taken as the case study. The text retrieved from Twitter comprises tweets
written in two languages, English and Roman-Urdu. Sentiment analysis is performed
on this bi-lingual text using existing (customized) and newly created sentiment
lexicons. The steps performed in our approach are illustrated in Fig 1 and
elaborated below.</p>
        <p>Our approach starts with the collection of the tweet dataset. The Twitter search API is used
to retrieve tweets based on keywords. We collected tweets related to four main political parties,
Pakistan Tehreek-e-Insaaf (PTI), Pakistan Muslim League Nawaz (PML(N)), Pakistan
Peoples Party (PPP) and Mutahidda Quomi Movement (MQM), from five major cities
of Pakistan (Islamabad, Lahore, Karachi, Peshawar and Quetta), considering a radius
of 20 miles around each city. The dataset was collected on a weekly
basis, from Dec 2012 until polling day (11th
May, 2013).</p>
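        <p>The collection step can be sketched as a set of keyword and location queries, one per party and city. This is only an illustrative sketch: the search keywords and the city-centre coordinates are assumptions (the text gives neither), and the <monospace>q</monospace> and <monospace>geocode</monospace> parameter names follow the historical Twitter Search API (v1.1), which may not be the exact interface the study used. The code only builds the query parameters and performs no network calls.</p>
        <preformat><![CDATA[
```python
from itertools import product

# Approximate city-centre coordinates (latitude, longitude).
# These values are illustrative assumptions, not taken from the paper.
CITIES = {
    "Islamabad": (33.6844, 73.0479),
    "Lahore": (31.5204, 74.3587),
    "Karachi": (24.8607, 67.0011),
    "Peshawar": (34.0151, 71.5249),
    "Quetta": (30.1798, 66.9750),
}

# Hypothetical search keywords per party; the paper does not list the exact terms.
PARTY_KEYWORDS = {
    "PTI": ["pti"],
    "PML(N)": ["pmln"],
    "PPP": ["ppp"],
    "MQM": ["mqm"],
}

RADIUS_MILES = 20  # radius around each city, as stated in the text


def build_queries():
    """Return one search-parameter dict per (party, city) pair."""
    queries = []
    for (party, keywords), (city, (lat, lon)) in product(
        PARTY_KEYWORDS.items(), CITIES.items()
    ):
        queries.append(
            {
                "party": party,
                "city": city,
                "q": " OR ".join(keywords),
                # "latitude,longitude,radius" with the radius in miles
                "geocode": f"{lat},{lon},{RADIUS_MILES}mi",
            }
        )
    return queries
```
]]></preformat>
        <p>Running such a query set once a week over the collection window would yield the four-party, five-city grid (20 queries per run) described above.</p>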
        <sec id="sec-2-2-1">
          <title>Classification of Tweets</title>
          <p>Two iterations of classification are performed over the dataset retrieved from Twitter.
Both classifications are keyword based. The first iteration discriminates
between tweets with political and non-political content. This step was
required because many spammers, particularly from real estate businesses, exploited
the popularity of the keywords related to political parties. Some keywords that were
used to identify noisy (non-political) tweets are summarized in Table 1.</p>
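          <p>A minimal sketch of this first, keyword-based filtering step is given here. The noise keywords are hypothetical stand-ins for the entries of Table 1, and treating a single matching keyword as sufficient to mark a tweet as noise is an assumption.</p>
          <preformat><![CDATA[
```python
import re

# Hypothetical noise keywords standing in for Table 1 (real-estate spam terms).
NOISE_KEYWORDS = {"plot", "plots", "property", "rent", "sale", "housing"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_political(tweet, noise_keywords=NOISE_KEYWORDS):
    """Return False as soon as any noise keyword occurs in the tweet."""
    return not any(token in noise_keywords for token in tokenize(tweet))
```
]]></preformat>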
          <p>The second iteration of classification discriminates between English
and Roman-Urdu tweets. It is also based on the presence of keywords, taken from a set
of commonly used English words presented in Table 2.</p>
          <table-wrap>
            <label>Table 3</label>
            <caption>
              <p>Sample of Tweets Collected and Saved in Database.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th>S.No</th>
                  <th>Party</th>
                  <th>City</th>
                  <th>Language</th>
                  <th>Text</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>1</td>
                  <td>Pti</td>
                  <td>Peshawar</td>
                  <td>Roman Urdu</td>
                  <td>peshawar: jamaat-e-islami aur pti ke dermian khyber pakhtunkhwa mey seat adjustment per ittefaak na husaka.</td>
                </tr>
                <tr>
                  <td>2</td>
                  <td>Mqm</td>
                  <td>Karachi</td>
                  <td>Roman Urdu</td>
                  <td>karachi: mqm nay aam intikhabat main mulk bhar say party ticket kay liye darkhastain talab kar lein dr. farooq sattar.b.n</td>
                </tr>
                <tr>
                  <td>3</td>
                  <td>Pml</td>
                  <td>Lahore</td>
                  <td>Roman Urdu</td>
                  <td>:lahore: \nsabiq governor state bank dr. ishrat hussain ko nigran wazir e azam banai janne ka imkaan zarai.\n#ppp #pmln #pti</td>
                </tr>
                <tr>
                  <td>4</td>
                  <td>Pti</td>
                  <td>Islamabad</td>
                  <td>English</td>
                  <td>:#pti &amp; #ji flirting in rawalpindi :d &gt;&gt;&gt;&gt; http:\/\/t.co\/0rqippguod</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
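          <p>The keyword-based language split described in this section can be sketched as follows. The English keyword set is a hypothetical stand-in for Table 2, and the threshold of one matching keyword is an assumption. A rule this simple labels any Roman-Urdu tweet that happens to contain a listed word as English.</p>
          <preformat><![CDATA[
```python
import re

# Hypothetical set of common English words standing in for Table 2.
ENGLISH_KEYWORDS = {"the", "is", "are", "in", "of", "and", "to", "for", "with", "this"}


def classify_language(tweet, english_keywords=ENGLISH_KEYWORDS, min_hits=1):
    """Label a tweet 'English' if it holds at least `min_hits` English keywords."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    hits = sum(1 for token in tokens if token in english_keywords)
    return "English" if hits >= min_hits else "Roman-Urdu"
```
]]></preformat>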
        </sec>
        <sec id="sec-2-2-2">
          <title>Creation of Bi-Lingual Sentiment Repository</title>
          <p>To perform text analysis of bi-lingual tweets, we need a
database that provides a sentiment strength for the words used in bi-lingual
tweets. For English, SentiStrength<sup>4</sup> is used to obtain the
sentiment strength of English words. The original SentiStrength contains 2546 English
words along with their sentiment scores, ranging from -4 to +4. However, no such
resource exists for Urdu (Roman Urdu). We therefore created
our own lexicon, which assigns sentiment strength scores to Roman Urdu words
in a structure similar to that of SentiStrength. Two resources, SentiStrength and an English to
Roman-Urdu dictionary<sup>5</sup>, are combined to create a unified sentiment strength
database. English words from SentiStrength were looked up to find their
Roman-Urdu translations. The English words, their Roman-Urdu translations and
their SentiStrength scores together form the Bi-Lingual Sentiment Repository (BLSR), as shown in
Table 4.</p>
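          <p>The construction of the repository amounts to joining the sentiment lexicon with the dictionary on the English word. The sketch below uses toy data; the words, translations and scores are invented for illustration and are not entries of the actual BLSR. Keeping only English words that have a translation, and letting every translation inherit the score of its English word, are assumptions about the procedure.</p>
          <preformat><![CDATA[
```python
# Toy excerpts (hypothetical values); the real inputs would be the SentiStrength
# word list and the English to Roman-Urdu dictionary.
SENTISTRENGTH_SAMPLE = {"good": 3, "bad": -3, "happy": 3, "corrupt": -4}
DICTIONARY_SAMPLE = {
    "good": ["acha", "achha"],
    "bad": ["bura"],
    "happy": ["khush"],
}


def build_blsr(sentiment_lexicon, dictionary):
    """Join English sentiment words with their Roman-Urdu translations."""
    blsr = {}
    for english_word, score in sentiment_lexicon.items():
        translations = dictionary.get(english_word)
        if not translations:
            continue  # assumption: words without a translation are left out
        blsr[english_word] = {"roman_urdu": list(translations), "score": score}
    return blsr


def roman_urdu_scores(blsr):
    """Flatten the repository into a Roman-Urdu word -> score lookup."""
    scores = {}
    for entry in blsr.values():
        for word in entry["roman_urdu"]:
            # keep the first score seen if a word translates several entries
            scores.setdefault(word, entry["score"])
    return scores
```
]]></preformat>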
          <p>The Bi-Lingual Sentiment Repository (BLSR) thus created provides the sentiment
strength of 1673 English and 3900 Roman-Urdu words. Sentiment strengths
from -4 to -1 indicate negative strength (-4 being the most negative and -1 the least
negative), and strengths from 1 to 4 indicate positive strength (1 being the least positive and 4 the most
positive); 0 represents no sentiment strength and is treated as neutral.</p>
          <p>Tweets belonging to each political party are tokenized. After tokenization, each
token is assigned a strength from SentiStrength and BLSR. The strength of each
tweet is then computed as follows:</p>
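          <p><disp-formula><label>(1)</label><tex-math><![CDATA[\text{Sentiment-Tweet } (ST) = F_1 S_1 + F_2 S_2 + F_3 S_3 + \cdots + F_n S_n]]></tex-math></disp-formula></p>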
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <p><sup>4</sup> http://sentistrength.wlv.ac.uk/ <sup>5</sup> http://www.scribd.com/doc/14203656/English-to-Urdu-and-Roman-Urdu-Dictionary</p>
      <p>where
F<sub>1</sub>, F<sub>2</sub>, …, F<sub>n</sub> are the frequencies of the tokens appearing in a tweet,
S<sub>1</sub>, S<sub>2</sub>, …, S<sub>n</sub> are the sentiment strengths of the corresponding tokens, and
n is the number of tokens in a given tweet.</p>
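      <p>As a small worked sketch, the tweet strength can be computed as a frequency-weighted sum of token strengths, following the definitions of F and S given here. The tokenizer, the toy lexicon entries, and the choice to score tokens missing from the lexicon as 0 (neutral) are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter


def tweet_sentiment(tweet, lexicon):
    """Sum F_i * S_i over the distinct tokens of a tweet."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    frequencies = Counter(tokens)  # F_i: how often each token occurs
    # S_i comes from the lexicon; unknown tokens count as neutral (0).
    return sum(freq * lexicon.get(token, 0) for token, freq in frequencies.items())


# Hypothetical lexicon entries, not taken from the actual BLSR.
toy_lexicon = {"acha": 3, "bura": -3, "good": 3}
score = tweet_sentiment("acha acha bura good", toy_lexicon)  # 2*3 + (-3) + 3
```
]]></preformat>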
      <p>Using the database, the strength of each political party can then be computed as:</p>
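      <p><disp-formula><label>(2)</label><tex-math><![CDATA[\text{Sentiment-Party } (SP) = \sum_{i=1}^{m} ST_{pi}]]></tex-math></disp-formula>
where
ST<sub>pi</sub> is the strength of the i-th tweet belonging to a particular party p, and
m is the number of tweets belonging to party p.</p>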
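      <p>The party-level aggregation can be sketched as below. Reading Eq. (2) as a plain sum over the m tweets of a party is an assumption; a per-tweet mean is included as an alternative, since parties with more tweets would otherwise dominate. The party names and scores in the example are invented.</p>
      <preformat><![CDATA[
```python
def party_sentiment(tweet_scores_by_party):
    """Sum the tweet strengths ST_pi for every party p."""
    return {party: sum(scores) for party, scores in tweet_scores_by_party.items()}


def party_sentiment_mean(tweet_scores_by_party):
    """Alternative (assumption): average strength per tweet, robust to volume."""
    return {
        party: (sum(scores) / len(scores) if scores else 0.0)
        for party, scores in tweet_scores_by_party.items()
    }


# Invented tweet strengths for two parties.
example_scores = {"PTI": [6, -3, 2], "PPP": [-1, -4]}
totals = party_sentiment(example_scores)
means = party_sentiment_mean(example_scores)
```
]]></preformat>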
      <sec id="sec-3-1">
        <title>Handling the Missing Tokens in BLSR</title>
        <p>Many important terms could not be found in BLSR because of
typographical errors, transliteration errors and individual short-hand spellings of
English and Roman-Urdu words. To handle such errors in Roman-Urdu
tokens, a number of string approximation measures (bigram-based cosine similarity, Dice coefficient
and Jaccard similarity) were applied. We found that
bigram cosine similarity outperformed the other metrics.</p>
        <p>To increase the recall for English words, WordNet is used to obtain synonyms
for English tokens that do not exist in SentiStrength. Such tokens are
assigned a sentiment strength on the basis of their synonyms.</p>
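        <p>The synonym fallback can be sketched with a pluggable synonym lookup. A small hand-made mapping stands in for WordNet here; in practice the lookup could be backed by a WordNet interface such as the one shipped with NLTK. Taking the score of the first synonym found in the lexicon is an assumption, as the text does not say how multiple synonyms are reconciled.</p>
        <preformat><![CDATA[
```python
# Hypothetical synonym table standing in for WordNet lookups.
TOY_SYNONYMS = {"joyful": ["glad", "happy"], "dishonest": ["corrupt"]}


def toy_synonyms_of(token):
    """Return the synonyms of a token from the toy table (empty if unknown)."""
    return TOY_SYNONYMS.get(token, [])


def score_with_synonyms(token, lexicon, synonyms_of=toy_synonyms_of):
    """Score a token directly, else through the first synonym in the lexicon."""
    if token in lexicon:
        return lexicon[token]
    for synonym in synonyms_of(token):
        if synonym in lexicon:
            return lexicon[synonym]
    return 0  # no direct or synonym match: treat as neutral


# Invented lexicon entries for illustration.
toy_english_lexicon = {"happy": 3, "corrupt": -4}
```
]]></preformat>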
        <sec id="sec-3-1-1">
          <title>Results and Discussion</title>
          <p>The dataset contains 91,804 tweet messages collected for four political parties in
five major cities along with noisy data (non-political) of 21,821 tweets. The detailed
statistics regarding the number of tweets collected from various cities and about
different parties is presented in Table 5.</p>
          <p>In language classification, 62,797 tweets were classified as English and 7,186 as
Roman-Urdu, as depicted in Table 6. Together these account for the 69,983 tweets that
remain once the 21,821 noisy tweets are removed from the 91,804 collected.</p>
          <p>We have proposed a method for sentiment analysis of bi-lingual (English and
Roman-Urdu) data from social networks, focusing particularly on Twitter data. We
considered the case study of the 2013 general elections in Pakistan. Tweets related
to the major political parties of Pakistan were collected from five major cities. A bi-lingual
lexicon was constructed that provides sentiment strengths for English as well
as Roman-Urdu words used in tweets. To increase the coverage of this
bi-lingual lexicon, WordNet is used for English tokens.
Similarly, for Roman-Urdu tweets, bigram-based cosine similarity is used for string
approximation, which reduces the effect of typographical errors and increases
the coverage. Using these resources, we have assessed the dominance of political
parties in Pakistan before the 2013 elections. The difference between the results for English and
Urdu tweets indicates two separate clusters of the population with different political
affiliations. Furthermore, the imbalance between the number of English and Urdu tweets
stems from the simple classification method used to detect language, which resulted in many
Roman-Urdu tweets being marked as English. This could be improved by incorporating
more sophisticated methods. The size of the lexicon can also be increased by using
lexical and contextual similarity based techniques [11] to collect similar terms from a
corpus (in this case, the WWW can be used). The constructed bi-lingual lexicon is not
domain specific and can therefore be used for other domains as well.</p>
          <p>10. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., &amp; Kappas, A. (2010). Sentiment strength
detection in short informal text. Journal of the American Society for Information Science
and Technology, 61(12), 2544–2558.</p>
          <p>11. Afzal, H., Stevens, R., &amp; Nenadic, G. (2008). Towards Semantic Annotation of
Bioinformatics Services: Building a Controlled Vocabulary. Proceedings of the Third
International Symposium on Semantic Mining in Biomedicine (SMBM 2008), pp. 5-12.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sobel</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdury</surname>
          </string-name>
          , “
          <article-title>Twitter power: Tweets as electronic word of mouth</article-title>
          ,”
          <source>Journal of the American Society for Information Science and Technology</source>
          , vol.
          <volume>60</volume>
          , no.
          <issue>11</issue>
          , pp.
          <fpage>2169</fpage>
          -
          <lpage>2188</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Chung-Hong</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hsin-Chang</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tzan-Feng</given-names>
            <surname>Chien</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wei-Shiang</given-names>
            <surname>Wen</surname>
          </string-name>
          , “
          <article-title>A Novel Approach for Event Detection by Mining Spatio-temporal Information on Microblogs</article-title>
          ,” in
          <source>International Conference on Advances in Social Networks Analysis and Mining</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Shota</given-names>
            <surname>Ishikawa</surname>
          </string-name>
          , Yutaka Arakawa, Shigeaki Tagashira, Akira Fukuda, “
          <article-title>Hot Topic Detection in Local Areas Using Twitter and Wikipedia</article-title>
          ,” in
          <source>ARCS Workshops (ARCS)</source>
          , 28-29 Feb.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Pak</surname>
          </string-name>
          and Patrick Paroubek, “
          <article-title>Twitter for Sentiment Analysis: When Language Resources Are Not Available</article-title>
          ,” 22nd
          <source>International Workshop on Database and Expert Systems Applications</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Yi</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jackson</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yimeng</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Klarissa</given-names>
            <surname>Chang</surname>
          </string-name>
          , “
          <article-title>An Exploration of Social Media in Public Opinion Convergence: Elaboration Likelihood and Semantic Networks on Political Events</article-title>
          ,”
          <source>Ninth IEEE International Conference on Dependable, Autonomic and Secure Computing</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Asli</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          , Dilek Hakkani-Tur, Junlan Feng, “
          <article-title>Probabilistic Model-Based Sentiment Analysis of Twitter Messages,”</article-title>
          <source>Spoken Language Technology Workshop (SLT)</source>
          , 12-15 Dec.
          <year>2010</year>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Vinh Ngoc</given-names>
            <surname>Khuc</surname>
          </string-name>
          , Chaitanya Shivade, Rajiv Ramnath, Jay Ramanathan, “
          <article-title>Towards Building Large-Scale Distributed Systems for Twitter Sentiment Analysis</article-title>
          ,”
          <source>SAC'12</source>
          , Riva del Garda, Italy, March 25-29,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Georgios</given-names>
            <surname>Paltoglou</surname>
          </string-name>
          and Mike Thelwall, “
          <article-title>Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media</article-title>
          ,”
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          , Vol.
          <volume>3</volume>
          , No.
          <issue>4</issue>
          , Article 66, Publication date: September
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Akshaya</given-names>
            <surname>Iyengar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Finin</surname>
          </string-name>
          and Anupam Joshi, “
          <article-title>Content-based prediction of temporal boundaries for events in Twitter</article-title>
          ,” IEEE International Conference on Privacy, Security, Risk, Trust, and IEEE International Conference on Social Computing,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>