TEA: Episode Analytics on Short Messages

                            Prapula G                                    Soujanya Lanka                   Kamalakar Karlapalem
               Center for Data Engineering                        Center for Data Engineering             Center for Data Engineering
                     IIIT Hyderabad                                     IIIT Hyderabad                          IIIT Hyderabad
                 Andhra Pradesh, India                              Andhra Pradesh, India                   Andhra Pradesh, India
             prapula.g@research.iiit.ac.in                            soujanya@iiit.ac.in                      kamal@iiit.ac.in


 ABSTRACT                                                                              events are usually related to nouns like persons, movies and
 Twitter is a widely used micro-blogging service, which in re-                         objects in real world; these nouns are referred to as entities.
 cent times, has become a reliable source of happening news                            Each entity will have a series (one or more) of events which
 around the world [11]. Breaking news is covered on Twitter;                           are significant in its lifetime. People tweet about events
 the magnitude and volume of tweets reflect the na-                                    that are of importance to them [16][13]. People seek the lat-
 ture and intensity of the news. During events, many tweets                            est information by searching through the live tweet
 are posted either expressing sentiments about the event or                            stream. So, an event or a search phrase obtains a high fre-
 just about the occurrence of the event. Events related to                             quency of tweets, mostly due to its significance (like a trend-
 an entity that have attracted a large number of tweets can                            ing topic). Hence, the overall social interest received for an
 be considered significant in the entity’s Twitter lifetime. An en-                    event related to an entity is reflected by the number of tweets
 tity could represent a person, movie, community, electronic                           that mention the event. This streaming information about
 gadget, software product and the like. In this work, we                               various events should be identified, analyzed and visualized
 attempt to automatically detect significant events related to                         in order to make them suitable for humans to understand
 an entity. An episode is an event of importance, identified                           and interpret the causes and the consequences. Such a vi-
 by processing the volume of tweets/posts in a short time.                             sual representation is also useful in displaying search results.
    The key features implemented in Tweet Episode Analytics                            AspecTiles [10] addresses the problem of search result diversifi-
 (TEA) system are: (i) detecting episodes among the stream-                            cation. In our work, given an entity, we address event di-
 ing tweets related to a given entity over a period of time                            versification related to that entity. For instance, if a search on
 (from the entity’s birth i.e., mention in the tweet world till                        ‘Roger Federer’ is performed during the Wimbledon season,
 date), (ii) providing visual analytics (like sentiment scoring                        there could be various events related to Federer that would
 and frequency of tweets over time) of each episode through                            have been tweeted on different days of the season. Iden-
 graphical interpretation.                                                             tifying significant events and displaying sets of tweets (by
                                                                                       grouping tweets related to a particular event) with graphs
                                                                                       gives the user a chance to glance through events and explore in
 Categories and Subject Descriptors                                                    detail an event he/she is interested in.
 H.4 [Web IR and Social Media Search]: Social Network                                     A large number of Twitter users getting interested in
 Analysis(Micro-Blogging Analysis)                                                     a particular event leads to a deluge of tweets and also of
                                                                                       queries on those tweets. Mining significant events will be
 General Terms                                                                         useful in summarizing the deluge of tweets. Hence, an anal-
                                                                                       ysis system is needed, that (i) identifies important events
 Entity, Trend, Events, Sentiment, Analysis, Detection
                                                                                       related to an entity, (ii) analyzes the temporal sentiment
                                                                                       patterns of tweets during the period of increased interest
 Keywords                                                                              and provides visuals depicting the same. Large-scale pro-
 Tweets, Episode, Text Analytics                                                       cessing is done to accomplish all of this, and the results of
                                                                                       each of the above are presented in Section 5.
 1. INTRODUCTION                                                                          The importance of an event can be computed by the fre-
                                                                                       quency of tweets and re-tweet counts related to the event as
   Tweets are a source of valuable information that has the
                                                                                       done in [14]. A popular entity (like a movie star, movie, mu-
 potential of providing an overview of how the world is think-
                                                                                       sician and the likes) receives some amount of attention on a
 ing about various events/persons over a period of time. The
                                                                                       regular basis on Twitter. The amount of attention received
                                                                                       need not be constant from day to day. The attention
                                                                                       received (i.e., the number of tweets talking about the entity)
                                                                                       varies over a period of time due to various events related to
                                                                                       the entity. When there is a spike in the attention received,
                                                                                       the event associated could be a significant one.
                                                                                          For instance, let us consider ‘Lady Gaga’ as an entity.
                                                                                       There could be many tweets that mention Lady Gaga as part
                                                                                       of routine events like ‘@user432 Listening to Lady Gaga’,

 Copyright © 2014 held by author(s)/owner(s); copying permitted
 only for private and academic purposes.
 Published as part of the #Microposts2014 Workshop proceedings,
 available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141)
 #Microposts2014, April 7th, 2014, Seoul, Korea.


· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
 ‘just read article on Lady Gaga’, ‘Lady Gaga in Japan’ and        generic.
 ‘Lady Gaga’s Born this way - releasing in 2012’. Among               In [9], Gruhl et al. studied the propagation of informa-
 these, significant events for Lady Gaga could be ‘Born this       tion in environments like personal publishing using a large
 way’ album’s release and her ‘tour to Japan’. A significant       collection of web logs. They characterized top-
 event due to increased volumes of tweets related to an entity     ics as long-running “chatter” topics consisting of recursive
 is considered an episode.                                         “spike” topics. According to their theory, if there are spikes
    The sentiments expressed by Twitter users about episodes       recursively for a topic over a long period of time, it may
 change over time. For example, there could be a very posi-        be of interest. Topics are detected, classified as
 tive anticipation for a particular movie about to be released,    chatter or spike, and their propagation is studied. Our work con-
 but it might not have been well received (paving the way for      centrates on detecting events related to an entity based on
 negative sentiments expressed post-release). Analyzing and        a similar notion that spikes are the places where significant
 visualizing the accumulated sentiments about episodes over        events have occurred in an entity’s life time.
 time could be useful for market research analysis of an entity
 (movies, electronic gadgets, albums etc).
    In this paper, we introduce the concept of an episode for      3.    OVERVIEW OF TEA
 a time-line of an entity and develop a tweet episode analyt-        In this section, we introduce the concept of an episode. We
 ics system (referred to as TEA) which when given a phrase         also present the architecture of “Tweet Episode Analytics”
 of words that represent an entity as input can: (a) identify      system as a part of this section.
 episodes, (b) analyze episodes, life-spans, (c) display the cu-
 mulative sentiments expressed over a period of time.              3.1    What is an Episode?
    In section 2, we present related work. In section 3, an           Episode can be defined as a significant event in the time
 Overview of TEA is presented which is followed by Tweet           line of an entity (individual person, community, group etc)
 Episode Analytics (Section 4). Section 5 presents Results of      that has occurred due to a sudden increase of tweet volumes
 TEA with Section 6 presenting some conclusions.                   of the entity from its regular volumes.
                                                                      Among all the events that an object/entity is involved in,
 2.   RELATED WORK                                                 the events that received more attention in a particular period
    There has been a considerable amount of work done on ex-       of time are referred to as episodes. All episodes are events
 tracting trending topics from Twitter. The idea of an Episode     but not all events can be episodes. Episodes are significant
 that has been proposed in this paper is different from the        events with respect to an entity, but events are more general
 past studies on trending topics. There has been a study on        and not specifically related to entities. Episodes are always for
 how and why the topics become trending in one of the pa-          an entity. The TEA algorithm identifies prominent episodes of
 pers [6]. As a part of their study, [6] have tried to explain     an entity that have occurred over its time line, considering
 the growth of trending topics. They have concluded that           an entity has a long lifespan. An episode is different from
 most topics do not trend for long on Twitter. This conclu-        the traditional concept of “a trending topic” [12] or “topics
 sion from their study strengthens our idea of Episodes which      extracted from topic clustering” [8]. An entity is said to have
 we have defined as a significant event that may occur in the      an episode if there is a sudden spike in an activity and that is
 time line of an entity and the event will be significant only     captured as an event in the time line of the entity because of
 for a short period of time.                                       which there is a huge activity related to the entity. For each
    In [7], Becker et al identified real-world events and their    such event, there is evidence like an article or information
 associated Twitter messages that are published. An online clus-   that shows the true importance of the event. If no such
 tering and filtering framework is used to address this event      article or information exists, then it may not be an episode.
 identification problem. We have introduced the concept of            Similar to the ‘Lady Gaga’ example mentioned in Section 1,
 an episode and have presented an algorithm to identify an         we noticed a similar episode being detected in our tweet
 episode by considering accumulated significance of the tweets.    data set related to ‘Justin’(entity). A phrase formed by
    In [14], Nichols et al. extracted sporting events and sum-     ‘Justin’ and ‘Boyfriend’ put together is an episode whereas
 marized the tweets in those events. They are confined to          ‘Justin’ is not. After the release of Justin Bieber’s new song
 tweets related to sports and concentrate more on summa-           ‘Boyfriend’, there was a sudden outburst of tweets about
 rizing than on extracting events. Our framework and algorithm     this song. Even though the number of tweets about ‘Justin’
 work for a search query (to represent the entity) and detect      is large, implying that it is a trending topic, it is not an
 possible episodes in its life time.                               episode because the reason for more social activity about
    In [15], Sakaki et al believe that when a real event like      ‘Justin’ is not due to a single significant event.
 natural disasters that influence people from either one re-
 gion or some parts of the world occur, the twitter users (so-     3.2    System Architecture
 cial sensors) will tweet about the event immediately. Their          The whole tweet episode analytics system can be divided
 paper aims to recognise events in real time whereas we de-        into different modules. Tweet collection and tweet process-
 tect episodes that have already occurred and have great           ing are offline modules (modules in which processing is done
 importance in the entity’s life time. Our paper presents          beforehand), whereas episode detection and sentiment analy-
 historical coverage of an entity as a sequence of episodes.       sis are online modules (modules in which the processing
 Moreover, their paper targets events like social events (e.g.,    starts after receiving the query as input to the system).
 large parties), sports events, accidents and political cam-       The flowchart of system architecture to “Detect Episodes
 paigns and natural events like storms, heavy rainfall, tor-       of an entity from Twitter data using Episode Detection Al-
 nadoes, typhoons which influence people’s daily life whereas      gorithm” is given in Figure 1. Below is a brief explanation
 our work is not specific to any event of an entity and is more    for each of the modules.


                                                                    rithm as explained in Section 4.3. This module generates
                                                                    charts/graphs which show how the sentiment of the entity
                                                                    has been changing over the period of its twitter lifetime.
                                                                       We gave the query “Federer” to our system along with
                                                                    the output of the Tweet Collection and Tweet Processing offline
                                                                    modules. The flow was as follows: (i) we retrieved the episodes
                                                                    listed in Table 1 using the Episode Detection module, (ii)
                                                                    we merged the episodes to obtain the bubble chart, and
                                                                    (iii) we extracted sentiment scores and the trending graphs
                                                                    using the sentiment analysis module.


 Figure 1: Flow Chart to detect episodes from
 Tweets using Episode Detection Algorithm


  3.2.1   Tweet Collection Module
    The Tweet Collection module collects tweets using the Twitter Stream-
 ing API. A sample of public tweets is extracted from twit-
 ter.com every 2 minutes. We collected tweets
 from March 2012 until December 2012. Around 140                  Figure 2: Episodes strength chart of entity “Federer”
 million public tweets were collected from Twitter. Tweets        (see Equation 5)
 were collected on an hourly basis; tweets for each hour are
 stored in a separate file.
                                                                    4.    TWEET EPISODE ANALYTICS
  3.2.2   Tweet Processing Module
                                                                       In this section, we present our algorithm to detect episodes
   Tweet processing includes removing non-English tweets            from the tweet data. After the episode detection algorithm
 and tweets with incomplete details. These processed tweets         is executed on the data set, we use the information obtained
 are stored by indexing them using Lucene [1]. The details          from the algorithm to detect all the episodes of a particu-
 about a tweet that are being stored in the Lucene index are        lar entity. We also present sentiment analysis method that
 tweet id, text, retweet count of that particular tweet and its     we have used in our system. In the post processing phase,
 creation time. In addition to this, the id, name, location, url,   we present sentiment, trend and temporal analytics of each
 description, followers count, creation time of the account of      episode.
 the user who has tweeted the tweet are also stored for each
 tweet.                                                             4.1    Episode Detection Algorithm
                                                                       Given an entity/query as an input, Episode Detection Al-
  3.2.3   Episode Detection Module                                  gorithm gives episodes for an entity over a given time period.
   A query(entity) is given as input to this module along           The algorithm will detect the episodes that have occurred in
 with the processed Lucene Index from the above module.             the entity’s twitter lifetime. The time of birth for an entity
 Episode detection module will extract all the tweets that          in our twitter data set is the time stamp of the first occur-
 are related to the given query and then all the episodes that      ring tweet that mentions it. The lifetime of an entity spans from
 have occurred over the life time of the entity are detected by     this first time stamp to the present. For this, all the tweets
 applying Episode Detection Algorithm on the related tweets.        related to a given query are extracted from the Lucene in-
                                                                    dex and are processed by cleaning the text. The proper
  3.2.4   Sentiment Analysing Module                                nouns that have occurred in these tweets are determined
   Sentiment Analysis is a method of analyzing/finding the          using Stanford POS tagger[3] along with their frequency of
 opinion/sentiment that is expressed in a piece of text, a          occurrence in the tweets. Frequent bi-gram nouns are also
 tweet in our context. In this module, a very basic senti-          extracted and then using the episode detection algorithm,
 ment scoring algorithm is applied on the tweets which are          all the episodes that have occurred over the lifetime of the
 related to the given entity to get their sentiment score. This     entity are detected.
 algorithm could be replaced with any other sentiment scor-            The following are the conditions that must be satisfied for
 ing algorithm; for this paper, we used a basic scoring algo-       an episode to have occurred over a short duration of time:


                                           Table 1: Episodes detected for ‘Federer’

  Rank    Episode            Date/Duration Maximum Frequent Tweet [[Fre-             Frequency   *Related Web URL 1
                                           quent Nouns]]                             [[Tweet
                                                                                     Spike]]
    3     Entering into      07/06/12 to     RT @Wimbledon: Federer will get         3464        http://www.bbc.co.uk/sport/0/
          Wimbledon          07/07/12        a crack at his 7th #Wimbledon ti-       [[22094]]   tennis/18740443
          ’12 finals                         tle beating Djokovic 6-3 3-6 6-4 6-3
                                             to reach Sunday’s final. http://t.c
                                             ... [[Wimbledon, Federer, crack,
                                             Djokovic, title, Sunday]]
    1     Winning            07/08/12 to     RT @AndrewBloch: In 2003 a              6230        http://www.atpworldtour.com/
          Wimbledon          07/10/12        man predicted Federer would win 7       [[48919]]   News/Tennis/2012/07/27/Wimbledon-
          ’12 title                          Wimbledon titles. He died in 2009                   Sunday2-Final-Report.aspx
                                             and left the bet to charity. Today
                                             Oxfam ... [[Federer, Wimbledon, ti-
                                             tle, man, Murray, today, bet, char-
                                             ity]]
    5     Blog on Mur-       07/21/12        RT @CrowdedSounds:           Fan of     1636        http://t.co/eOeQjSbu
          ray and Fed-                       both     Federer    and      Murray?    [[7693]]
          erer in Finals                     http://t.co/eOeQjSbu           [[Fan,
                                             Federer, Murray]]
    2     About       Fed-   08/03/12 to     RT @Persie Official: Federer is the     3360        –
          erer               08/05/12        boss [[Federer, gold, Andy, Murray,     [[39646]]
                                             Wimbledon, mens, singles]]
    4     Federer’s          08/08/12        RT @ATPWorldTour:              Roger    2180        http://www.tennisnow.com/News/
          Birthday                           #Federer turns 31 today! Retweet        [[10832]]   Happy-Birthday-Mr–Federer.aspx
                                             to wish him a happy birthday!
                                             #atp #tennis [[Federer, Roger,
                                             Birthday, Today, retweet]]
    6     Winning            08/19/12        RT @ATPWorldTour: #Federer              722         http://www.espn.co.uk/tennis/sport/
          Cincy Tennis                       beats @DjokerNole 60 76(7) to           [[6022]]    story/165924.html
          title                              win fifth @CincyTennis crown, ties
                                             @RafaelNadala’s record 21 Masters
                                             1000 titles ... [[Roger, Federer,
                                             Cincinnati, Masters, title, congrats,
                                             today, Djokovic]]


 1 Note: * - not generated by our algorithm, but provided by us as a
 verification of the episode detected.

   1) The total number of tweets related to the event, taking
 retweet counts into account, should be at least minNumTweets
 (a parameter):

     T_E \geq \mathit{minNumTweets}                                                  (1)

 where T_E is the total number of tweets that are related to
 event E.
   2) For each day, the spike extent (spikeExtent) is calculated.
 Let d denote a day and D the number of days in the lifetime of
 the given entity. The number of tweets related to the event E
 on day d is NumTweets(d, E).

     \mathit{spikeExtent}(d, E) = \mathit{NumTweets}(d, E) - \mathit{NumTweets}(d-1, E)   (2)

     \max_{d=0}^{D} \mathit{spikeExtent}(d, E) \geq \mathit{spikeLimit}                  (3)

 where

     \mathit{spikeLimit} = T_E / \mathit{spikeFactor}                                 (4)

 spikeFactor (0 < spikeFactor <= T_E) is set manually. The
 maximum spikeExtent over all days should be at least the
 spikeLimit threshold. The number of days on which the
 spikeExtent is greater than the spikeLimit is also counted as
 spikeFreq. The day on which the spikeExtent is maximum is the
 spikeDay.
   3) The tweets on the spikeDay are processed and all the
 nouns in those tweets are extracted along with their occur-
 rence frequency in the tweets. If the nouns that are most
 frequent after the query words correspond to a single topic or
 at most two topics, then the event is an Episode.
   The difference between the number of tweets on a partic-
 ular day and the number of tweets on the previous day is
 calculated for each day, and the days are sorted in decreas-
 ing order of this difference. The days which also satisfy the
 above conditions are considered as spikeDays.
   The following additional information is extracted for each
 episode:
   1) Let FreqN and FreqrtN be arrays of nouns which are stored
 in decreasing order of their frequency from the tweets with-
 out and with retweet count, respectively, on the spikeDay.
 The first 20 elements of FreqN and FreqrtN are extracted.
   2) Let FreqB and FreqrtB be arrays of bigram nouns which
 are stored in decreasing order of their frequency from


 the tweets without and with retweet count correspondingly         Figure 4.(a): Sentiment Trends of ‘Federer’ and Fig-
 on the spikeDay. First 50 elements of FreqB and FreqrtB           ure 4.(b): Thresholds measures chart
 are extracted. Similarly, let us say FreqPosB , FreqNegB and
 FreqNeuB are arrays with bigrams which are extracted from
 tweets with positive, negative and neutral sentiments on the
 spikeDay correspondingly. First 50 elements from each of
 FreqPosB , FreqNegB , FreqNeuB are also extracted.
   3) Let Tmax be the tweet which has the maximum retweet
 count on the spikeDay and Tnoun be the array of nouns present
 in Tmax. Tmax is extracted and Tnoun is determined from
 Tmax. In addition to the above, the difference between the max-
 imum retweet count and the minimum retweet count of the tweets
 on the spikeDay (MaxMindiff) is also extracted.
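
 The spike conditions of Section 4.1 (Equations 1 to 4) can be
 sketched in a few lines of Python. This is only an illustration
 under stated assumptions, not the TEA implementation: the function
 name detect_spikes, the returned dictionary keys and the sample
 counts are hypothetical, day 0 is given a spike extent of 0 because
 it has no previous day, and the threshold test uses >= throughout to
 match Equation 3.

```python
from typing import Dict, List, Optional


def detect_spikes(daily_counts: List[int], min_num_tweets: int,
                  spike_factor: float) -> Optional[Dict[str, object]]:
    """Check conditions 1 and 2 for one event; return None if either fails.

    daily_counts[d] plays the role of NumTweets(d, E).
    """
    if not daily_counts or spike_factor <= 0:
        return None

    total = sum(daily_counts)                  # T_E
    if total < min_num_tweets:                 # Equation 1
        return None

    spike_limit = total / spike_factor         # Equation 4

    # Equation 2: day-over-day difference. Day 0 has no predecessor,
    # so its extent is set to 0 here (an assumption of this sketch).
    extents = [0] + [daily_counts[d] - daily_counts[d - 1]
                     for d in range(1, len(daily_counts))]

    # Sort days by decreasing extent and keep those reaching the limit.
    ranked = sorted(range(len(extents)), key=lambda d: extents[d], reverse=True)
    spike_days = [d for d in ranked if extents[d] >= spike_limit]
    if not spike_days:                         # Equation 3 not satisfied
        return None

    return {
        "spike_limit": spike_limit,
        "spike_day": spike_days[0],            # day with the maximum extent
        "spike_freq": len(spike_days),         # number of days reaching the limit
        "spike_days": spike_days,
    }


# Hypothetical daily tweet counts for one entity over eight days.
result = detect_spikes([10, 12, 11, 60, 20, 15, 55, 10],
                       min_num_tweets=100, spike_factor=5)
print(result)
```

 A smaller spikeFactor raises spikeLimit, so fewer days qualify as
 spike days; condition 3 (the topic check on frequent nouns) is not
 covered by this sketch.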

 Figures 3.(a), 3.(b): Sentiment Trends of ‘Federer’               cumulative polarity of adjectives. It is explained below in
                                                                   brief.

                                                                   4.3    Sentiment Analysis
                                                                      Given a piece of text, the sentiment analysis algorithm
                                                                   gives the sentiment score of the text. The text is split by
                                                                   sentence and then all words, such as stop words, that
                                                                   carry no sentiment or opinion are removed. The
                                                                   list of stop words used is taken from the Stanford stop word
                                                                   list [4]. The sentiment lexicon has a list of words with their polar-
                                                                   ity scores. It is taken from the MPQA Subjectivity Lexicon [2].
                                                                   The polarity scores of the remaining words from the sentence
    From the tweets, all the above information is extracted        which are present in the sentiment lexicon are added, which
 and then top k (can be set manually) of the nouns, bigrams        gives the polarity score of the sentence. The polarity scores of
 and the maximum frequent tweet, nouns in that tweet are           all the sentences in the text are added to get the sentiment
 all presented in the results as episodes.                         score of the total text. The sentiment score can be either
                                                                   positive, zero or negative, depending upon whether the text
 4.2    Episode Analytics on Tweets                                has positive opinion, neutral opinion or negative opinion.
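
 The lexicon-based scoring of Section 4.3 could look roughly like
 the following Python sketch. The STOP_WORDS set and the LEXICON
 dictionary are tiny hypothetical stand-ins for the Stanford stop
 word list [4] and the MPQA Subjectivity Lexicon [2]; the function
 names, the regular-expression sentence splitter and the +1/-1
 polarity values are likewise assumptions made for illustration.

```python
import re
from typing import Dict, Set

# Hypothetical stand-ins for the real stop word list and sentiment lexicon.
STOP_WORDS: Set[str] = {"the", "a", "is", "was", "and", "but", "it", "i"}
LEXICON: Dict[str, int] = {
    "great": 1, "love": 1, "win": 1,
    "bad": -1, "terrible": -1, "lose": -1,
}


def sentence_polarity(sentence: str, lexicon: Dict[str, int],
                      stop_words: Set[str]) -> int:
    """Sum the polarity of the lexicon words left after stop word removal."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(lexicon.get(w, 0) for w in words if w not in stop_words)


def sentiment_score(text: str, lexicon: Dict[str, int] = LEXICON,
                    stop_words: Set[str] = STOP_WORDS) -> int:
    """Split the text into sentences and add up their polarity scores."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(sentence_polarity(s, lexicon, stop_words) for s in sentences)


print(sentiment_score("Federer was great. I love it!"))   # positive text
print(sentiment_score("That was a terrible match."))      # negative text
```

 A positive, zero or negative total corresponds to positive, neutral
 or negative opinion, as described in the text.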
    As a part of episode analytics for twitter, the sentiment
 trend and cumulative trend of tweets with retweet count           5.    RESULTS AND EVALUATION
 are also presented as charts. The number of tweets with dif-         In this section, we evaluate the proposed episode detection
 ferent polarities in every 100 tweets is also shown. For all      method by analysing the episode strengths for some famous
 the episodes their strength is calculated and presented in a      personalities (entities). We have considered the Twitter data
 chart. A chart with all the episodes of the entity is generated   from March 2012 to December 2012 for our experiments, so
 and presented.                                                    the episodes detected will fall into this timeline.
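
The lexicon-based polarity scoring that feeds the sentiment trend charts can be sketched as follows. This is only an illustrative sketch: the tiny LEXICON and STOP_WORDS below are hypothetical stand-ins for the MPQA Subjectivity Lexicon and the Stanford stop-word list, the +1/-1 word scores are an assumption, and the function names are not taken from the TEA system.

```python
import re

# Hypothetical miniature resources, standing in for the MPQA
# Subjectivity Lexicon and the Stanford stop-word list.
LEXICON = {"great": 1, "win": 1, "love": 1, "bad": -1, "lose": -1, "hate": -1}
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "i"}


def sentence_score(sentence):
    """Sum the lexicon polarity of the non-stop words in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    # Words missing from the lexicon carry no sentiment and contribute 0.
    return sum(LEXICON.get(w, 0) for w in kept)


def text_score(text):
    """Split the text into sentences and add up the sentence scores."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(sentence_score(s) for s in sentences)


def polarity_label(text):
    """Map the total score to positive, negative or neutral."""
    score = text_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Counting the labels returned by polarity_label over each block of 100 tweets would give the per-100-tweet polarity counts mentioned above.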
    For an entity that has been given as input, up to a maxi-         We have experimented with some queries like “Federer”,
 mum of 10 episodes are detected based on the threshold and        “Serena Williams” and “Lumia 920”. We will be analysing the
 the number of tweets related to the entity. The episodes are      results on the entity query “Federer” in this section. Our
 ranked based on their strengths. The strength of an episode       Episode Detection algorithm has found 6 episodes related
 is calculated as the ratio of the number of tweets that are       to “Federer” over the period of consideration(March ’12 to
 tweeted about it and the time period over which the episode       December ’12) and they are presented in Table 1 in sorted
 has occurred. The strength is the average number of tweets        order of time.
 that are tweeted per day in the duration of the episode. The         Each Episode in the table has the following fields: Rank
 formula of the strength is given below:                           of the episode, episode description, date/duration of the
                                                                   episode, Maximum Frequent Tweet during the episode and
                                                                   Frequent Nouns, Frequency of the maximum frequent tweet
       S_E = \left( \sum_{i=1}^{n} N_i \right) / n         (5)     and tweet spike, finally the web URL which shows details of
                                                                   the episode on the internet.
 where S_E is the strength of an episode E, N_i is the                The rank of the episode is decided based on the strength
 number of tweets on the i-th day, and n is the number of          of the episode that is being calculated. Episode descrip-
 days over which the episode has occurred.                         tion is the description in short for the episode that is de-
   The episodes are further sorted based on the time of their      tected. Date/duration of an episode is the period in which
 occurrence and all the episodes are presented from the start      the episode has occurred. Maximum Frequent Tweet is the
 to the end of the lifetime of the entity. For us, the start and   tweet that has occurred the maximum number of times in
 end times are the start and end points of the tweet collection.   the episode time period and Frequent Nouns are the nouns
                                                                   that are related to the episode, sorted based on
   Apart from the episode detection, the trends or patterns        their frequency of occurrence. Frequency is the number of
 in the number of tweets and their sentiments are visualized.      times the tweet has occurred, whereas tweet spike is the to-
 A basic polarity scoring algorithm is implemented by using        tal number of tweets that are tweeted in the duration of the


· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
 episode. For evaluating the episode that is detected, we have     with sentiment or all in total (yellow line) until that day
 searched on the internet and then included the web URL of         from the start day with retweet count. We can see there is a
 the page which shows the details of an episode and so prov-       sudden spike in the number of tweets at several places. Fig-
 ing the occurrence of that corresponding episode. Observe         ure 4.(a) shows the number of positive (green line), negative
 that the dates of the articles in the web URLs are same as        (red line) and neutral (blue line) tweets with sentiment that
 the dates of occurrence of its corresponding episode. Each of     are present in every 100 tweets.
 the episodes detected for “Federer” is analysed further              The episodes of “Narendra Modi” were also detected. “Naren-
 below, in order of date of occurrence:                            dra Modi” is an Indian politician, Chief Minister of the state
    1) The first episode has occurred on 6th and 7th of July       Gujarat in India. Table 2 shows episodes detected for the
 2012 when Federer won the semi finals against Djokovic and        entity “Narendra Modi” with 6 episodes presented based on
 entered into Wimbledon ’12 Finals just before the day of the      their occurrence date.
 finals. The rank of this episode is 3 and the maximum fre-           A brief analysis of the episodes detected is given below
 quent tweet has been tweeted 3464 times. The frequent nouns are   based on their date of occurrence: 1) The rank of the first
 wimbledon, federer, crack, Djokovic, title, sunday. The web       episode is 1 and it occurred on 03/17/12. The episode is
 URL shows that Federer has entered into finals by winning         Modi on cover page of Time Magazine. 2) This episode oc-
 over Djokovic dated 6th of July 2012.                             curred on 07/24/12 about Modi going to Japan. The rank
    2) The second episode is after Federer winning the             of the episode is 3. 3) This episode is Modi wishing everyone
 Wimbledon ’12 Finals over Murray. This episode is ranked          on Janmastami. The rank of this episode is 4 and occurred
 number 1 and has occurred between 8th and 10th July 2012.         on 08/10/12. 4) The episode with rank 6 has occurred on
 Maximum frequent tweet has been tweeted 6230 times. Fed-          Modi’s Birthday on 09/17/12. 5) The episode occurred after
 erer, wimbledon, title, man, murray, today are frequent nouns.    Modi completed 4000 days as Gujarat’s CM and the rank
 The web page talks about Federer winning Wimbledon for            of the episode is 5. It has occurred on 09/18/12. 6) Mes-
 the 7th time.                                                     sage from Modi is the next episode whose rank is 2. It has
    3) The third episode is the blog that is written about         occurred on 10/13/12.
 the final match between Federer and Murray and how people            As a part of TEA system evaluation, we have calculated
 want both to win the match. This episode has occurred             precision, recall and F-measure of our TEA approach. For
 on 21st July 2012, 9 days after the blog was posted.              an entity, the detected episodes are classified manually to be
 Frequent nouns are fan, federer, murray. This might be            either valid or invalid episodes. An episode is valid if it is
 because this is not an event, but the opinion of a person         a significant event that has occurred in the lifespan of that
 written in the form of a blog and so it took time to tweak.       particular entity. The ratio of number of episodes that are
 It is number 5 episode and the tweet itself has the URL to        valid to the total number of episodes detected will be the
 the blog.                                                         precision of our TEA algorithm for that particular entity.
    4) Robin Van Persie tweets about Federer. Many peo-            The precision of TEA system is calculated by taking the
 ple have retweeted it as they share the same opinion and so       average precision of all the entities.
 this has become an episode. The rank is 2 and this tweet has         The recall of TEA system for a particular entity is the
 been retweeted 3360 times. Federer, gold, Andy, Murray, wimble-   ratio of number of valid episodes to the actual number of
 don, mens, singles are frequent nouns.                            episodes that have occurred over that entity’s lifespan in
    5) Federer’s 31st Birthday is the fifth episode that           twitter. The recall of our TEA algorithm is the average
 has occurred on his birthday 8th August 2012. It is rank 4        recall of all the entities. However, it is difficult to determine
 and 2180 people have tweeted the same birthday wishes tweet       how many episodes have actually occurred for an entity over
 to “Federer”. Frequent nouns are Roger, Federer, birthday,        its twitter lifespan. So, for each entity we have manually
 today.                                                            searched over the internet (mostly their Wikipedia pages)
    6) The last episode is about Federer winning the Cincy         and listed down the significant events that have occurred
 Tennis Crown on 19th August 2012. Frequent nouns are              over a period from March 2012 to December 2012.
 Roger, Federer, Cinnicati, masters, title, congrats, today.          Table 3 shows the precision and recall for each entity that
 The episode is ranked 6 and the url shows details about the       is given as input to the TEA system. The overall precision
 episode.                                                          of the system that is calculated over these 11 entities is 0.864
    All these episodes are sorted and their strengths are cal-     whereas the overall recall of the system is 0.503.
 culated and then the episodes strength of the entity is gen-         F1-score (F-measure) is a measure of a test’s accuracy.
 erated. The chart in Figure 2 shows the strength of detected      The F1-score is the harmonic mean of the precision and
 episodes of “Federer” with Time on X-axis and Number of           recall, and its formula is given by:
 days an episode has occurred on Y-axis. The radius of the
 bubble is taken as the strength of an episode. The strength         F1-score = 2 * (Precision * Recall)/(Precision + Recall)    (6)
 is divided by 50000 to mark it as radius just to scale the        Table 4 shows the F1-scores (F-measure) that are computed
 value to fit into the chart.                                      using precision and recall values from Table 3 for each of the
    Figures 3.(a), 3.(b) and 4.(a) show the sentiment trends       entities that are considered. The overall F1 score of TEA
 of tweets related to “Federer” over the time line. The senti-     system is 0.62.
 ment trends charts are generated using Zingchart javascript          The precision, recall and f-measure values that are pre-
 library[5](free branded version). Figure 3.(a) shows the num-     sented for different entities are calculated by setting different
 ber of tweets that are tweeted positive (green line), negative    thresholds (spikeFactor ) for different entities. These valida-
 (red line) or neutral (blue line) with sentiment on each day.     tion measure values change based on the threshold value
 Figure 3.(b) shows the number of tweets that are tweeted          that is set. For entity ‘Narendra Modi’, we have presented
 positive (green line), negative (red line), neutral (blue line)   values of validation measures for different thresholds. Fig-


                                    Table 2: Episodes of ‘Narendra Modi’ over its lifespan

               Rank   Episode       Date/      Maximum Frequent Tweet                Frequency *Related Web URL (see footnote 2)
                                    Duration   [[Frequent Nouns]] [[Polarity Score]]
                 1    Modi    on    03/17/12   RT @vijsimha: Here’s news more        314      http://timesofindia.indiatimes.com/
                      cover page               interesting than #Budget2012.                  india/Narendra-Modi-
                      of   Times               Time magazine puts Narendra                    on-Time-magazine-
                      Magazine                 Modi on cover as the man who                   cover/articleshow/12296366.cms
                                               could change Indi ...       [[news,
                                               time, magazine, narendra, modi,
                                               cover, man]] [[ 1 ]]
                 3    Modi going    07/24/12   RT @sardesairajdeep: Appreci-         280      http://articles.economictimes.indiati
                      to Japan                 ate Narendra Modi for going to                 mes.com/2012-07-
                                               Japan and standing by Haryana                  23/news/32804624 1 maruti-suzuki-
                                               govt.     Nation above politics.               s-manesar-manesar-plant-maruti-s-
                                               (there you go folk ... [[narendra,             manesar
                                               modi, japan, standing, haryana,
                                               govt]] [[ 1 ]]
                 4    Modi          08/10/12   RT @TOIBlogs: Janmashtami             86       http://t.co/foHZ8Qwb
                      wishes on                the protector of cows, Lord
                      Janmash-                 Krishna’s birthday :        Naren-
                      tami                     dra Modi http://t.co/foHZ8Qwb
                                               [[protector, cow, lord, krishna,
                                               birthday, narendra]] [[ 1 ]]
                 6    Modi’s        09/17/12   RT @Ohfakenews:          Narendra     27       http://en.wikipedia.org/wiki/Narendra
                      Birthday                 Modi turns 62 today. You may                    Modi
                                               remember him from his biggest
                                               hit: Naroda Patiya riots. #Hap-
                                               pyBdayNamo #NaMo [[naren-
                                               dra, modi, today, hit, #happyb-
                                               daynamo, #namo]] [[0]]
                 5    4000 days     09/18/12   RT @sardesairajdeep: Narendra         93       http://samvada.org/2012/news/4000-
                      as      Gu-              Modi completes 4000 days as Gu-                days-as-cm-narendra-modi-takes-
                      jarat’s                  jarat chief minister today. Quite              gujarat-as-model-state-of-india-in-
                      CM                       an achievement Shouldn’t that                  development/
                                               be trending? [[narendra, modi,
                                               days, gujarat, chief, minister, to-
                                               day]] [[ 0 ]]
                 2    Message       10/13/12   RT @Swamy39: Narendra Modi:           437      -
                      from Modi                UK has melted. US is not far
                                               behind. The hidden message is
                                               that if we are strong then they
                                               will come looking ... [[narendra,
                                               modi, message]] [[ 1 ]]
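
The Rank column follows from the episode strength of Equation (5), the average number of tweets per day over the duration of an episode. A minimal sketch of that calculation and of the ranking is given here; the daily tweet counts in the usage note are made-up illustrative values, not data from the experiments, and the function names are hypothetical.

```python
def episode_strength(daily_counts):
    """Equation (5): total tweets over the episode divided by its days."""
    if not daily_counts:
        raise ValueError("an episode must span at least one day")
    return sum(daily_counts) / len(daily_counts)


def rank_episodes(episodes):
    """Return {episode name: rank}, rank 1 being the strongest episode.

    `episodes` maps an episode name to its list of per-day tweet counts.
    """
    strengths = {name: episode_strength(c) for name, c in episodes.items()}
    ordered = sorted(strengths, key=strengths.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}
```

For example, with hypothetical counts, rank_episodes({"final": [4000, 5000, 3000], "birthday": [2500]}) ranks "final" first because its strength is 4000 tweets per day against 2500.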


                                               Table 3: Precision and Recall of Entities

                      Entity (query)             Precision     Recall     Entity (query)            Precision     Recall
                      Narendra Modi                 0.9        0.333      Federer                       1         0.588
                      Barack Obama                  0.9        0.642      Britney Spears               0.8          0.4
                      Sachin                         1           0.5      Serena Williams               1          0.83
                      Adele                         0.5          0.5      Andy Murray                  0.7        0.571
                      Life of Pi                    0.9         0.33      Lumia 920                     1          0.33
                      Taylor Swift                  0.8          0.5

                                                 Table 4: F-measure values of Entities

                                  Entity (query)             F1 score     Entity (query)            F1 score
                                  Narendra Modi               0.486       Federer                    0.740
                                  Barack Obama                0.749       Britney Spears             0.533
                                  Sachin                      0.667       Serena Williams            0.907
                                  Adele                        0.5        Andy Murray                0.629
                                  Life of Pi                  0.486       Lumia 920                  0.499
                                  Taylor Swift                0.615
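
The aggregation behind Tables 3 and 4 can be checked with a short sketch. The precision and recall pairs below are copied from Table 3; the function names are illustrative. Averaging the per-entity F1 values gives roughly 0.62, which suggests the overall F1 was obtained as the mean of the per-entity scores, in the same way as the overall precision and recall.

```python
# (precision, recall) per entity, copied from Table 3.
TABLE_3 = {
    "Narendra Modi": (0.9, 0.333),
    "Barack Obama": (0.9, 0.642),
    "Sachin": (1.0, 0.5),
    "Adele": (0.5, 0.5),
    "Life of Pi": (0.9, 0.33),
    "Taylor Swift": (0.8, 0.5),
    "Federer": (1.0, 0.588),
    "Britney Spears": (0.8, 0.4),
    "Serena Williams": (1.0, 0.83),
    "Andy Murray": (0.7, 0.571),
    "Lumia 920": (1.0, 0.33),
}


def f1_score(precision, recall):
    """Equation (6): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean(values):
    """Plain arithmetic mean over the entities."""
    values = list(values)
    return sum(values) / len(values)


per_entity_f1 = {e: f1_score(p, r) for e, (p, r) in TABLE_3.items()}
overall_precision = mean(p for p, _ in TABLE_3.values())
overall_recall = mean(r for _, r in TABLE_3.values())
overall_f1 = mean(per_entity_f1.values())
```

Small differences from the printed values are expected, since the table entries are already rounded.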


 ure 4.(b) shows how precision, recall and f-measure values                 to precision, maroon line corresponds to recall and green
 change with spikeFactor (threshold). The plot shows pre-                   line to F-measure. The precision started low, increased to a
 cision, recall and f-measure values on Y-axis for different                maximum value and then decreased with increase in spike-
 thresholds on X-axis. The blue line in the plot corresponds                Factor. The recall, in contrast, started even lower and increased
  2 Note: * - not generated by our algorithm, but provided by us as a verification of the episode detected.


                           Table 5: Validation measures for different thresholds of ‘Narendra Modi’

                                    spikeFactor   Number of      Precision   Recall   F1 score
                                    (Threshold)   Episodes
                                                  Detected
                                    10            3                0.67       0.08      0.15
                                    20            4                0.75       0.17      0.27
                                    30            6                0.83       0.25      0.38
                                    40            9                0.88       0.33      0.48
                                    50            9                 0.9       0.33      0.49
                                    100           20               0.84       0.58      0.69
                                    150           23               0.82       0.58      0.68
                                    200           30               0.76       0.67      0.71
                                    250           39               0.73       0.67       0.7
                                    500           61               0.63       0.67      0.65
                                    1000          85               0.56       0.75      0.64
                                    1500          105              0.55       0.75      0.63
                                    2000          120              0.51       0.75      0.61
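
The trade-off in Table 5 can be explored programmatically. The sketch below copies the rows of Table 5 and picks the spikeFactor that maximises a chosen measure; it only illustrates how the table can be read and is not the procedure used to set the thresholds in the experiments. The names TABLE_5 and best_threshold are hypothetical.

```python
# Rows copied from Table 5:
# (spikeFactor, episodes detected, precision, recall, F1 score)
TABLE_5 = [
    (10, 3, 0.67, 0.08, 0.15),
    (20, 4, 0.75, 0.17, 0.27),
    (30, 6, 0.83, 0.25, 0.38),
    (40, 9, 0.88, 0.33, 0.48),
    (50, 9, 0.90, 0.33, 0.49),
    (100, 20, 0.84, 0.58, 0.69),
    (150, 23, 0.82, 0.58, 0.68),
    (200, 30, 0.76, 0.67, 0.71),
    (250, 39, 0.73, 0.67, 0.70),
    (500, 61, 0.63, 0.67, 0.65),
    (1000, 85, 0.56, 0.75, 0.64),
    (1500, 105, 0.55, 0.75, 0.63),
    (2000, 120, 0.51, 0.75, 0.61),
]

# Column positions of the measures within each row.
PRECISION, RECALL, F1 = 2, 3, 4


def best_threshold(rows, measure=F1):
    """Return the spikeFactor of the row with the highest measure.

    On ties, the smallest spikeFactor (first row) is returned.
    """
    return max(rows, key=lambda row: row[measure])[0]
```

With these rows, precision peaks at spikeFactor 50 and F1 at spikeFactor 200.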


 with spikeFactor until it reached a maximum value and then
 it became constant from there. F-measure followed a pattern
 similar to that of the precision curve. Table 5 shows the preci-
 sion, recall and f-measure values for different spikeFactor values.
    The top 6 episodes that are detected for entity ‘Naren-
 dra Modi’ when threshold (spikeFactor ) is set to be 50 are
 presented in Table 2 and validation measures for different
 thresholds for ‘Narendra Modi’ are presented in Table 5.

 6. CONCLUSIONS
    Our intention to infer significant knowledge/insight from
 a huge number of tweets raises problems. The key issue is to
 comprehend what a set of tweets conveys about an entity.
 Our approach has been to consider the lifetime of an entity and
 determine what events can occur in it. From the events
 one can get episodes that convey a larger description of what
 the set of tweets is conveying, and then the episode strengths
 of an entity are shown. We built a system that takes any en-
 tity as a keyword and processes relevant tweets to detect the
 episodes. Our results validate our approach by providing
 episodes that provide the essence of information that can be
 gleaned from tweets. In particular, we are able to convey
 sentiments about tweets and phrases that describe tweets
 over different periods of time. Therefore, our system can
 be used to gain a short-term understanding from tweets
 about a given entity and use it to promote or rectify certain
 actions, for example, to sell more mobile phones at a discount
 or quickly send out a patch for a malfunctioning applet. As
 part of future work we will continue to improve core algo-
 rithms applied in this paper, and delve into what can be
 learned from detected episodes.

 References
  [1] Apache Lucene. https://lucene.apache.org/.

  [2] MPQA Subjectivity Lexicon. http://mpqa.cs.pitt.edu/.

  [3] Stanford Part-Of-Speech Tagger.
      http://nlp.stanford.edu/software/tagger.shtml.

  [4] Stanford Stop-Word List. http://www.wordsift.com/wordlists.

  [5] ZingChart Javascript Charting Library. http://www.zingchart.com.

  [6] S. Asur and B. A. Huberman. Trends in social media:
      Persistence and decay. AAAI, 2011.

  [7] H. Becker and M. Naaman. Beyond trending topics:
      Real-world event identification on twitter. AAAI, 2011.

  [8] M. S. Bernstein and B. S. Eddi. Interactive topic-based
      browsing of social status streams. UIST, 2010.

  [9] D. Gruhl and R. Guha. Information diffusion through
      blogspace. WWW, 2004.

 [10] M. Iwata and T. Saka. Aspectiles: Tile-based visualization
      of diversified web search results. SIGIR, 2012.

 [11] H. Kwak and C. Lee. What is twitter, a social network
      or a news media? WWW, 2010.

 [12] M. Mathioudakis and N. Koudas. Twittermonitor:
      Trend detection over the twitter stream. SIGMOD, 2010.

 [13] M. R. Morris and S. Counts. Tweeting is believing?
      understanding microblog credibility perceptions. CSCW, 2012.

 [14] J. Nichols and J. Mahmud. Summarizing sporting
      events using twitter. ACM IUI, 2012.

 [15] T. Sakaki and M. Okazaki. Earthquake shakes twitter
      users: real-time event detection by social sensors. WWW, 2010.

 [16] Teevan and Ramage. Twittersearch: A comparison of
      microblog search and web search. WSDM, 2011.

