<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Microposts</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>TEA: Episode Analytics on Short Messages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prapula G</string-name>
          <email>prapula.g@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soujanya Lanka</string-name>
          <email>soujanya@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kamalakar Karlapalem</string-name>
          <email>kamal@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Data Engineering</institution>
          ,
          <addr-line>IIIT Hyderabad, Andhra Pradesh</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>4</volume>
      <fpage>11</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>Twitter is a widely used micro-blogging service that has recently become a reliable source of news from around the world [11]. Breaking news is covered on Twitter, and the volume of tweets reflects the nature and intensity of the news. During events, many tweets are posted that either express sentiments about the event or simply report its occurrence. Events related to an entity that attract a large number of tweets can be considered significant in the entity's Twitter lifetime. An entity could be a person, a movie, a community, an electronic gadget, a software product and the like. In this work, we attempt to automatically detect significant events related to an entity. An episode is an event of importance, identified by processing the volume of tweets/posts in a short time. The key features implemented in the Tweet Episode Analytics (TEA) system are: (i) detecting episodes among the streaming tweets related to a given entity over a period of time (from the entity's birth, i.e., its first mention in the tweet world, till date), and (ii) providing visual analytics (such as sentiment scoring and frequency of tweets over time) of each episode through graphical interpretation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.4 [Web IR and Social Media Search]: Social Network
Analysis (Micro-Blogging Analysis)</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Copyright © 2014 held by author(s)/owner(s); copying permitted
only for private and academic purposes.
Published as part of the #Microposts2014 Workshop proceedings,
available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141).
#Microposts2014, April 7th, 2014, Seoul, Korea.</p>
      <p>
        Tweets are a source of valuable information that has the
potential of providing an overview of how the world is
thinking about various events/persons over a period of time. The
events are usually related to nouns like persons, movies and
objects in real world; these nouns are referred to as entities.
Each entity will have a series (one or more) of events which
are significant in its lifetime. People tweet about events
that are of importance to them[16][
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. People seek
the latest information by searching through the live
stream of tweets. So, an event or a search phrase obtains a high
frequency of tweets, mostly due to its significance (like a
trending topic). Hence, the overall social interest received for an
event related to an entity is reflected by the number of tweets
that mention the event. This streaming information about
various events should be identified, analyzed and visualized
in order to make them suitable for humans to understand
and interpret the causes and the consequences. Such a
visual representation is also useful in displaying search results.
AspecTiles[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] addresses the problem of search result
diversification. In our work, given an entity, we address the
diversification of events related to that entity. For instance, if a search on
‘Roger Federer’ is performed during the Wimbledon season,
there could be various events related to Federer that would
have been tweeted on different days of the season.
Identifying significant events and displaying sets of tweets (by
grouping tweets related to a particular event) with graphs
gives the user a chance to glance through events and explore in
detail an event he/she is interested in.
      </p>
      <p>When a large number of Twitter users become interested in
a particular event, the result is a deluge of tweets and of
queries on those tweets. Mining significant events is
useful in summarizing this deluge. Hence, an
analysis system is needed that (i) identifies important events
related to an entity, and (ii) analyzes the temporal sentiment
patterns of tweets during the period of increased interest
and provides visuals depicting them. Large-scale
processing is done to accomplish this, and the results
are presented in Section 5.</p>
      <p>
        The importance of an event can be computed by the
frequency of tweets and re-tweet counts related to the event as
done in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. A popular entity (such as a movie star, a movie or a
musician) receives some amount of attention on a
regular basis on Twitter. The amount of attention received
need not be constant from day to day. The attention
received (i.e., the number of tweets talking about the entity)
varies over a period of time due to various events related to
the entity. When there is a spike in the attention received,
the associated event could be a significant one.
      </p>
      <p>For instance, let us consider ‘Lady Gaga’ as an entity.
There could be many tweets that mention Lady Gaga as part
of routine events like ‘@user432 Listening to Lady Gaga’,
‘just read article on Lady Gaga’, ‘Lady Gaga in Japan’ and
‘Lady Gaga’s Born this way - releasing in 2012’. Among
these, significant events for Lady Gaga could be ‘Born this
way’ album’s release and her ‘tour to Japan’. A significant
event due to increased volumes of tweets related to an entity
is considered as an episode.</p>
      <p>The sentiments expressed by twitter users about episodes
change over time. For example, there could be a very
positive anticipation for a particular movie about to be released,
but it might not have been well received (paving way for
negative sentiments expressed post-release). Analyzing and
visualizing the accumulated sentiments about episodes over
time could be useful for market research analysis of an entity
(movies, electronic gadgets, albums etc).</p>
      <p>In this paper, we introduce the concept of an episode for
a time-line of an entity and develop a tweet episode
analytics system (referred to as TEA) which when given a phrase
of words that represent an entity as input can: (a) identify
episodes, (b) analyze episodes, life-spans, (c) display the
cumulative sentiments expressed over a period of time.</p>
      <p>In section 2, we present related work. In section 3, an
Overview of TEA is presented which is followed by Tweet
Episode Analytics (Section 4). Section 5 presents Results of
TEA with Section 6 presenting some conclusions.</p>
    </sec>
    <sec id="sec-3">
      <title>2. RELATED WORK</title>
      <p>
        There has been a considerable amount of work done on
extracting trending topics from twitter. The idea of an Episode
that has been proposed in this paper is different from the
past studies on trending topics. There has been a study on
how and why the topics become trending in one of the
papers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. As a part of their study, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have tried to explain
the growth of trending topics. They have concluded that
most topics do not trend for long on Twitter. This
conclusion from their study strengthens our idea of Episodes which
we have defined as a significant event that may occur in the
time line of an entity and the event will be significant only
for a short period of time.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Becker et al. identified real-world events and their
associated Twitter messages. An online
clustering and filtering framework is used to address this event
identification problem. We have introduced the concept of
an episode and have presented an algorithm to identify an
episode by considering the accumulated significance of the tweets.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Nichols et al. extracted sporting events and
summarized the tweets in those events. They are confined to
tweets related to sports and concentrate more on
summarizing than on extracting events. Our framework and algorithm
work for a search query (representing the entity) and detect
possible episodes in its lifetime.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Sakaki et al. assume that when a real event such as a
natural disaster influences people in one
region or in some parts of the world, Twitter users
(social sensors) will tweet about the event immediately. Their
paper aims to recognise events in real time, whereas we
detect episodes that have already occurred and are of high
importance in the entity’s lifetime. Our paper presents
historical coverage of an entity as a sequence of episodes.
Moreover, their paper targets social events (e.g.,
large parties), sports events, accidents and political
campaigns, as well as natural events like storms, heavy rainfall,
tornadoes and typhoons which influence people’s daily life, whereas
our work is not specific to any type of event of an entity and is more
generic.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Gruhl et al. studied the propagation of
information in environments like personal publishing using a large
collection of web logs. They characterized
topics into long-running “chatter” topics consisting of recursive
“spike” topics. According to their theory, if there are spikes
recursively for a topic over a long period of time, it may
be of interest. Topics are detected, classified as
chatter or spike, and their propagation is studied. Our work
concentrates on detecting events related to an entity based on
a similar notion that spikes are the places where significant
events have occurred in an entity’s lifetime.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. OVERVIEW OF TEA</title>
      <p>In this section, we introduce the concept of an episode. We
also present the architecture of “Tweet Episode Analytics”
system as a part of this section.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 What is an Episode?</title>
      <p>Episode can be defined as a significant event in the time
line of an entity (individual person, community, group etc)
that has occurred due to a sudden increase of tweet volumes
of the entity from its regular volumes.</p>
      <p>
        Among all the events that an object/entity is involved in,
the events that received more attention in a particular period
of time are referred to as episodes. All episodes are events,
but not all events are episodes. Episodes are significant
events with respect to an entity, whereas events are more general and
not specifically related to entities. Episodes are always for
an entity. The TEA algorithm identifies prominent episodes
that have occurred over the time line of an entity, assuming
the entity has a long lifespan. An episode is different from
the traditional concept of “a trending topic” [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or “topics
extracted from topic clustering” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. An entity is said to have
an episode if there is a sudden spike in an activity and that is
captured as an event in the time line of the entity because of
which there is a huge activity related to the entity. For each
such event, there is evidence like an article or information
that shows the true importance of the event. If no such
article or information exists, then it may not be an episode.
      </p>
      <p>Similar to the ‘Lady Gaga’ example mentioned in Section 1,
we noticed an episode being detected in our tweet
data set related to the entity ‘Justin’. The phrase formed by
‘Justin’ and ‘Boyfriend’ put together is an episode whereas
‘Justin’ is not. After the release of Justin Bieber’s new song
‘Boyfriend’, there was a sudden outburst of tweets about
this song. Even though the number of tweets about ‘Justin’
is large, implying that it is a trending topic, it is not an
episode because the increased social activity about
‘Justin’ is not due to a single significant event.</p>
    </sec>
    <sec id="sec-6">
      <title>3.2 System Architecture</title>
      <p>The tweet episode analytics system can be divided
into modules. Tweet collection and tweet
processing are offline modules (processing is done
beforehand), whereas episode detection and sentiment
analysis are online modules (processing
starts after the query is received as input to the system).
The flowchart of the system architecture to “Detect Episodes
of an entity from Twitter data using Episode Detection
Algorithm” is given in Figure 1. Below is a brief explanation
of each of the modules.</p>
      <p>The Tweet Collection module collects tweets using the Twitter
Streaming API. A sample of public tweets is extracted from
twitter.com every 2 minutes. We collected tweets
from March 2012 until December 2012. Around 140
million public tweets were collected from Twitter. Tweets
were collected on an hourly basis; tweets for each hour are
stored in a separate file.</p>
      <sec id="sec-6-1">
        <title>3.2.2 Tweet Processing Module</title>
        <p>
          Tweet processing includes removing non-English tweets
and tweets with incomplete details. The processed tweets
are stored by indexing them using Lucene [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The details
about a tweet that are being stored in the Lucene index are
tweet id, text, retweet count of that particular tweet and its
creation time. In addition to this, the id, name, location, url,
description, followers count, creation time of the account of
the user who has tweeted the tweet are also stored for each
tweet.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>3.2.3 Episode Detection Module</title>
        <p>A query (entity) is given as input to this module along
with the processed Lucene index from the above module.
The episode detection module extracts all the tweets that
are related to the given query, and then all the episodes that
have occurred over the lifetime of the entity are detected by
applying the Episode Detection Algorithm on the related tweets.</p>
      </sec>
      <sec id="sec-6-3">
        <title>3.2.4 Sentiment Analysing Module</title>
        <p>Sentiment analysis is a method of finding the
opinion/sentiment that is expressed in a piece of text, a
tweet in our context. In this module, a very basic
sentiment scoring algorithm is applied to the tweets
related to the given entity to get their sentiment scores. This
algorithm could be replaced with any other sentiment
scoring algorithm; for this paper, we used a basic scoring
algorithm as explained in Section 4.3. This module generates
charts/graphs which show how the sentiment towards the entity
has changed over its Twitter lifetime.</p>
        <p>We gave the query “Federer” to our system along with
the output of the offline Tweet Collection and Tweet Processing
modules. The flow is as follows: (i) we retrieved the episodes
listed in Table 1 using the Episode Detection module, (ii)
we merged these episodes to obtain the bubble chart, and
(iii) we extracted sentiment scores and the trend graphs
using the Sentiment Analysing module.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4. TWEET EPISODE ANALYTICS</title>
      <p>In this section, we present our algorithm to detect episodes
from the tweet data. After the episode detection algorithm
is executed on the data set, we use the information obtained
from the algorithm to detect all the episodes of a
particular entity. We also present the sentiment analysis method that
we have used in our system. In the post-processing phase,
we present sentiment, trend and temporal analytics of each
episode.</p>
    </sec>
    <sec id="sec-8">
      <title>4.1 Episode Detection Algorithm</title>
      <p>Given an entity/query as an input, Episode Detection
Algorithm gives episodes for an entity over a given time period.</p>
      <p>
        The algorithm detects the episodes that have occurred in
the entity’s Twitter lifetime. The time of birth of an entity
in our Twitter data set is the time stamp of the first
tweet that mentions it. The lifetime of an entity runs
from this first time stamp till date. All the tweets
related to a given query are extracted from the Lucene
index and are processed by cleaning the text. The proper
nouns that occur in these tweets are determined
using the Stanford POS tagger[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] along with their frequency of
occurrence in the tweets. Frequent bi-gram nouns are also
extracted and then using the episode detection algorithm,
all the episodes that have occurred over the lifetime of the
entity are detected.
      </p>
      <p>The following are the conditions to be satisfied to say that
an episode has occurred on a short duration of time:</p>
      <p>[Table 1 fragment: episode description “Entering into Wimbledon ’12 finals”; web URLs http://t.co/eOeQjSbu, http://www.tennisnow.com/News/Happy-Birthday-Mr–Federer.aspx and http://www.espn.co.uk/tennis/sport/story/165924.html.]</p>
      <p>Note: * - not generated from our algorithm, but provided
by us as a verification of the episode detected.</p>
      <p>1) The total number of tweets that are related to the
event, considering retweet count, should be greater than
minNumTweets (parameter):
        <disp-formula><label>(1)</label><tex-math><![CDATA[T_E \geq \mathit{minNumTweets}]]></tex-math></disp-formula>
where T<sub>E</sub> is the total number of tweets that are related to
event E.</p>
      <p>2) For each day, the spike extent (spikeExtent) is calculated.
Let the day be represented by d, and let D be the number of
days in the lifetime of the given entity. The number of tweets
related to the event E on a day d is NumTweets(d, E).
        <disp-formula><label>(2)</label><tex-math><![CDATA[\mathit{spikeExtent}(d, E) = \mathit{NumTweets}(d, E) - \mathit{NumTweets}(d-1, E)]]></tex-math></disp-formula>
        <disp-formula><label>(3)</label><tex-math><![CDATA[\max_{d=0}^{D} \mathit{spikeExtent}(d, E) \geq \mathit{spikeLimit}]]></tex-math></disp-formula>
whereas
        <disp-formula><label>(4)</label><tex-math><![CDATA[\mathit{spikeLimit} = T_E / \mathit{spikeFactor}]]></tex-math></disp-formula>
spikeFactor (0 &lt; spikeFactor &lt;= T<sub>E</sub>) is set manually. The
maximum spikeExtent of all days should be greater than the
spikeLimit threshold. The number of days the spikeExtent is
greater than the spikeLimit is also counted as spikeFreq. The
day on which the spikeExtent is maximum is the spikeDay.</p>
      <p>3) The tweets on the spikeDay are processed and then all the
nouns in those tweets are extracted along with their
occurrence frequency in the tweets. If the
nouns which are most frequent after the query words
correspond to a single or at most two topics, then the event is an
Episode. The difference between the number of tweets on a
particular day and the number of tweets of the previous day is
calculated for each day, and the days are sorted in
decreasing order based on this difference. The
days which also satisfy the above conditions are considered
as spikeDays.</p>
      <p>The following additional information is extracted for each
episode:</p>
      <p>1) Let FreqN and FreqrtN be arrays of nouns which are stored
in decreasing order of their frequency from the tweets
without and with retweet count correspondingly on the spikeDay.
First 20 elements of FreqN and FreqrtN are extracted.</p>
      <p>2) Let FreqB and FreqrtB be arrays of bigram nouns which
are stored in the decreasing order of their frequency from
the tweets without and with retweet count correspondingly
on the spikeDay. First 50 elements of FreqB and FreqrtB
are extracted. Similarly, let FreqPosB, FreqNegB and
FreqNeuB be arrays with bigrams which are extracted from
tweets with positive, negative and neutral sentiments on the
spikeDay correspondingly. First 50 elements from each of
FreqPosB, FreqNegB, FreqNeuB are also extracted.</p>
      <p>3) Let Tmax is the tweet which has maximum retweet
count on the spikeDay and Tnoun is array of nouns present
in Tmax. Tmax is extracted and Tnoun is determined from
Tmax. In addition to the above, the difference between
maximum retweet count and minimum retweet count of the tweet
on the spikeDay (MaxMindiff ) is also extracted.</p>
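      <p>The spike conditions (1) to (4) can be illustrated with a minimal Python sketch. It assumes the per-day tweet counts for one event are already available as a list. The function name detect_spike, the keys of the returned dictionary, the example counts and the choice of treating the first day's spikeExtent as 0 are assumptions of this sketch and are not taken from TEA. The topic check on the most frequent nouns (condition 3) is not covered.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Optional


def detect_spike(daily_counts: List[int],
                 min_num_tweets: int,
                 spike_factor: float) -> Optional[Dict[str, float]]:
    """Check conditions (1)-(4) on per-day tweet counts of one event.

    daily_counts[d] plays the role of NumTweets(d, E). Day 0 has no
    previous day, so its spikeExtent is taken to be 0 (an assumption).
    Returns None when no spike qualifies.
    """
    total = sum(daily_counts)                    # T_E
    if total < min_num_tweets:                   # condition (1)
        return None
    spike_limit = total / spike_factor           # equation (4)
    extents = [0] + [daily_counts[d] - daily_counts[d - 1]
                     for d in range(1, len(daily_counts))]  # equation (2)
    max_extent = max(extents)
    if max_extent < spike_limit:                 # condition (3)
        return None
    return {
        "spike_day": extents.index(max_extent),  # day with maximum spikeExtent
        "spike_freq": sum(1 for e in extents if e > spike_limit),
        "spike_limit": spike_limit,
    }


# Hypothetical counts for six consecutive days.
result = detect_spike([10, 12, 11, 60, 40, 15], min_num_tweets=100, spike_factor=4)
```
]]></preformat>
      <p>With these hypothetical counts, the total is 148 tweets, so spikeLimit is 37, and only the jump of 49 tweets on the fourth day (index 3) exceeds it. A smaller spikeFactor raises spikeLimit and makes the detection stricter.</p>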
      <p>From the tweets, all the above information is extracted,
and then the top k (set manually) nouns and bigrams,
the maximum frequent tweet and the nouns in that tweet are
presented in the results as episodes.</p>
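      <p>The frequency-ordered arrays (FreqN and FreqrtN, and likewise the bigram arrays) could be built along the following lines. This is a sketch under stated assumptions: the nouns or noun bigrams of each tweet are taken as already extracted by a POS tagger, the helper name top_terms is hypothetical, and "with retweet count" is interpreted here as counting a tweet once for itself plus once per retweet, which the text does not spell out.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_terms(tweets: Iterable[Tuple[List[str], int]],
              k: int,
              use_retweets: bool) -> List[str]:
    """Return the k most frequent terms among the spike-day tweets.

    Each tweet is a pair (terms, retweet_count), where terms are the
    nouns (or noun bigrams) already extracted from the tweet text.
    """
    counts: Counter = Counter()
    for terms, retweet_count in tweets:
        # Assumption: a retweeted tweet counts once per retweet in
        # addition to the original posting.
        weight = 1 + retweet_count if use_retweets else 1
        for term in terms:
            counts[term] += weight
    return [term for term, _ in counts.most_common(k)]


# Hypothetical spike-day tweets: (extracted nouns, retweet count).
spike_day_tweets = [
    (["federer", "wimbledon"], 5),
    (["federer", "murray"], 0),
    (["murray", "title"], 0),
    (["murray"], 0),
]
freq_n = top_terms(spike_day_tweets, k=2, use_retweets=False)
freq_rt_n = top_terms(spike_day_tweets, k=2, use_retweets=True)
```
]]></preformat>
      <p>In this toy input, the unweighted ranking starts with murray and federer, while the retweet-weighted ranking starts with federer and wimbledon, which shows why the two arrays are kept separately.</p>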
    </sec>
    <sec id="sec-9">
      <title>4.2 Episode Analytics on Tweets</title>
      <p>As a part of episode analytics for twitter, the sentiment
trend and cumulative trend of tweets with retweet count
are also presented as charts. Number of tweets with
different polarities in each 100 tweets are also shown. For all
the episodes their strength is calculated and presented in a
chart. A chart with all the episodes of entity is generated
and presented.</p>
      <p>For an entity given as input, up to a
maximum of 10 episodes are detected based on the threshold and
the number of tweets related to the entity. The episodes are
ranked based on their strengths. The strength of an episode
is calculated as the ratio of the number of tweets
about it to the time period over which the episode
has occurred; that is, the strength is the average number of tweets
per day over the duration of the episode. The
formula for the strength is given below:</p>
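      <p>
        <disp-formula><label>(5)</label><tex-math><![CDATA[S_E = \left( \sum_{i=1}^{n} N_i \right) / n]]></tex-math></disp-formula>
where S<sub>E</sub> is the strength of an episode E, N<sub>i</sub> is the
number of tweets on the i<sup>th</sup> day, and n is the number of
days over which the episode has occurred.</p>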
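      <p>As an illustration of Equation (5) and of ranking by strength, the following sketch computes the strength of each episode from its per-day tweet counts. The function names, the episode labels and the counts are hypothetical and serve only as an example.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List


def episode_strength(daily_counts: List[int]) -> float:
    """Equation (5): average number of tweets per day of the episode."""
    if not daily_counts:
        raise ValueError("an episode must span at least one day")
    return sum(daily_counts) / len(daily_counts)


def rank_episodes(episodes: Dict[str, List[int]]) -> List[str]:
    """Order episode labels by decreasing strength (rank 1 first)."""
    return sorted(episodes,
                  key=lambda name: episode_strength(episodes[name]),
                  reverse=True)


# Hypothetical per-day tweet counts for three episodes.
episodes = {
    "semi-final": [3000, 4000],
    "final": [9000, 6000, 3000],
    "birthday": [2500],
}
ranking = rank_episodes(episodes)
```
]]></preformat>
      <p>Here the strengths are 3500, 6000 and 2500 tweets per day, so the ranking is final, semi-final, birthday. A long episode with moderate daily volume can therefore rank below a short but intense one.</p>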
      <p>The episodes are further sorted based on the time of their
occurrence and all the episodes are presented from the start
to the end of the lifetime of the entity. For us, the start and
end times are the start and end points of the tweet collection.</p>
      <p>Apart from episode detection, the trends or patterns
in the number of tweets and their sentiments are visualized.
A basic polarity scoring algorithm is implemented using the
cumulative polarity of adjectives. It is explained
briefly below.</p>
    </sec>
    <sec id="sec-10">
      <title>4.3 Sentiment Analysis</title>
      <p>
        Given a piece of text, the sentiment analysis algorithm
gives the sentiment score of the text. The text is split into
sentences, and then stop words and other words
that carry no sentiment or opinion are removed. The
list of stop words used is taken from the Stanford stop word
list[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The sentiment lexicon is a list of words with their
polarity scores. It is taken from the MPQA Subjectivity Lexicon[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The polarity scores of the remaining words from the sentence
that are present in the sentiment lexicon are added, which
gives the polarity score of the sentence. The polarity scores of
all the sentences in the text are added to get the sentiment
score of the whole text. The sentiment score can be
positive, zero or negative, depending upon whether the text
has a positive, neutral or negative opinion.
      </p>
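      <p>The scoring procedure described above can be sketched in a few lines of Python. The tiny lexicon and stop-word set below are made-up stand-ins for the MPQA Subjectivity Lexicon and the Stanford stop word list, the function name sentiment_score is hypothetical, and the sentence splitting and tokenisation are deliberately naive.</p>
      <preformat><![CDATA[
```python
import re
from typing import Dict, Set


def sentiment_score(text: str,
                    lexicon: Dict[str, int],
                    stop_words: Set[str]) -> int:
    """Sum lexicon polarities of non-stop words over all sentences."""
    score = 0
    # Naive sentence split on terminal punctuation.
    for sentence in re.split(r"[.!?]+", text):
        words = [w for w in re.findall(r"[a-z']+", sentence.lower())
                 if w not in stop_words]
        # Words missing from the lexicon contribute no polarity.
        score += sum(lexicon.get(w, 0) for w in words)
    return score


# Toy resources, standing in for the real lexicon and stop-word list.
toy_lexicon = {"great": 1, "love": 1, "bad": -1, "terrible": -1}
toy_stop_words = {"the", "a", "was", "is", "i"}

positive = sentiment_score("Federer was great. I love the final!",
                           toy_lexicon, toy_stop_words)
negative = sentiment_score("The match was terrible.",
                           toy_lexicon, toy_stop_words)
neutral = sentiment_score("Federer plays today",
                          toy_lexicon, toy_stop_words)
```
]]></preformat>
      <p>The sign of the returned score then gives the polarity of the tweet: positive, negative, or neutral when the score is zero.</p>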
    </sec>
    <sec id="sec-11">
      <title>5. RESULTS AND EVALUATION</title>
      <p>In this section, we evaluate the proposed episode detection
method by analysing the episode strengths for some famous
personalities (entities). We considered Twitter data
from March 2012 to December 2012 for our experiments, so
the detected episodes fall within this timeline.</p>
      <p>We experimented with queries such as “Federer”,
“Serena Williams” and “Lumia 920”. We analyse the
results for the entity query “Federer” in this section. Our
Episode Detection algorithm found 6 episodes related
to “Federer” over the period of consideration (March ’12 to
December ’12), and they are presented in Table 1 sorted
by time.</p>
      <p>Each Episode in the table has the following fields: Rank
of the episode, episode description, date/duration of the
episode, Maximum Frequent Tweet during the episode and
Frequent Nouns, Frequency of the maximum frequent tweet
and tweet spike, finally the web URL which shows details of
the episode on the internet.</p>
      <p>The rank of the episode is decided based on the strength
calculated for the episode. Episode
description is a short description of the detected episode.
Date/duration of an episode is the period in which
the episode has occurred. Maximum Frequent Tweet is the
tweet which has occurred the maximum number of times in
the episode time period, and Frequent Nouns are the nouns
related to the episode, sorted by
their frequency of occurrence. Frequency is the number of
times the tweet has occurred, whereas tweet spike is the
total number of tweets that are tweeted in the duration of the
episode. To evaluate each detected episode, we
searched the internet and included the web URL of
a page which shows the details of the episode, thereby
confirming the occurrence of that episode. Observe
that the dates of the articles in the web URLs are the same as
the dates of occurrence of the corresponding episodes. Each
episode detected for “Federer” is analysed further
below, in order of date of occurrence:</p>
      <p>1) The first episode occurred on 6th and 7th of July
2012, when Federer won the semi-final against Djokovic and
entered the Wimbledon ’12 final, just before the day of the
final. The rank of this episode is 3 and the maximum
frequent tweet was tweeted 3464 times. The frequent nouns are
wimbledon, federer, crack, Djokovic, title, sunday. The web
URL, dated 6th of July 2012, shows that Federer entered the final by winning
over Djokovic.</p>
      <p>2) The second episode is after Federer winning the
Wimbledon ’12 Finals over Murray. This episode is ranked
number 1 and has occurred between 8th and 10th July 2012.
Maximum frequent tweet has been tweeted 6230 times.
Federer, wimbledon, title, man, murray, today are frequent nouns.
The web page talks about Federer winning Wimbledon for
the 7th time.</p>
      <p>3) The third episode is a blog post written about
the final match between Federer and Murray and how people
wanted both to win the match. This episode occurred
on 21st July 2012, 9 days after the blog was posted.
Frequent nouns are fan, federer, murray. The delay might be
because this is not an event but the opinion of a person
written in the form of a blog, and so it took time to tweak.
It is the number 5 episode and the tweet itself has the URL to
the blog.</p>
      <p>4) Robin Van Persie tweeted about Federer. Many
people retweeted it as they shared the same opinion, and so
this became an episode. The rank is 2 and this tweet was
retweeted 3360 times. Federer, gold, Andy, Murray,
wimbledon, mens, singles are frequent nouns.</p>
      <p>5) Federer’s 31st birthday is the fifth episode; it
occurred on his birthday, 8th August 2012. It is rank 4
and 2180 people tweeted the same birthday-wishes tweet
to “Federer”. Frequent nouns are Roger, Federer, birthday,
today.</p>
      <p>6) The last episode is about Federer winning the Cincy
Tennis Crown on 19th August 2012. Frequent nouns are
Roger, Federer, Cinnicati, masters, title, congrats, today.
The episode is ranked 6 and the url shows details about the
episode.</p>
      <p>All these episodes are sorted and their strengths are
calculated and then the episodes strength of the entity is
generated. The chart in figure 2 shows the strength of detected
episodes of “Federer” with Time on X-axis and Number of
days an episode has occurred on Y-axis. The radius of the
bubble is taken as the strength of an episode. The strength
is divided by 50000 to mark it as radius just to scale the
value to fit into the chart.</p>
      <p>
        Figures 3.(a), 3.(b) and 4.(a) show the sentiment trends
of tweets related to “Federer” over the time line. The
sentiment trend charts are generated using the Zingchart JavaScript
library[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (free branded version). Figure 3.(a) shows the
number of tweets with positive (green line), negative
(red line) or neutral (blue line) sentiment on each day.
Figure 3.(b) shows the cumulative number of tweets with
positive (green line), negative (red line) and neutral (blue line)
sentiment, and the total (yellow line), from the start day until each day,
with retweet count. We can see there is a
sudden spike in the number of tweets at several places.
Figure 4.(a) shows the number of positive (green line), negative
(red line) and neutral (blue line) tweets
in every 100 tweets.
      </p>
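      <p>The three kinds of series plotted in Figures 3.(a), 3.(b) and 4.(a) could be derived from per-tweet sentiment scores roughly as follows. This is only a sketch: the function names and the input layout (one list of scores per day) are assumptions, the example scores are made up, and retweet counts are ignored for brevity.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from itertools import accumulate
from typing import List


def polarity(score: int) -> str:
    """Map a sentiment score to its polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def daily_polarity_counts(scores_by_day: List[List[int]]) -> List[Counter]:
    """Per-day counts of positive, negative and neutral tweets."""
    return [Counter(polarity(s) for s in day) for day in scores_by_day]


def cumulative(counts: List[int]) -> List[int]:
    """Running total from the start day up to each day."""
    return list(accumulate(counts))


def per_window_counts(scores: List[int], window: int = 100) -> List[Counter]:
    """Polarity counts in each consecutive block of `window` tweets."""
    return [Counter(polarity(s) for s in scores[i:i + window])
            for i in range(0, len(scores), window)]


# Hypothetical scores for two days of tweets.
scores_by_day = [[1, -1, 0], [2, 3]]
daily = daily_polarity_counts(scores_by_day)
positive_cumulative = cumulative([c["positive"] for c in daily])
windows = per_window_counts([1, -1, 0, 2, 3], window=2)
```
]]></preformat>
      <p>The per-day counts correspond to the daily trend, the running totals to the cumulative trend, and the windowed counts to the distribution of polarities in every block of tweets (100 in the paper, 2 in this toy call).</p>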
      <p>The episodes of “Narendra Modi” were also detected.
“Narendra Modi” is an Indian Politician, Chief Minister of the state
Gujarat in India. Table 2 shows episodes detected for the
entity “Narendra Modi” with 6 episodes presented based on
their occurrence date.</p>
      <p>A brief analysis of the episodes detected is given below
based on their date of occurrence: 1) The rank of the first
episode is 1 and it occurred on 03/17/12. The episode is
Modi on cover page of Time Magazine. 2) This episode
occurred on 07/24/12 about Modi going to Japan. The rank
of the episode is 3. 3) This episode is Modi wishing everyone
on Janmastami. The rank of this episode is 4 and occurred
on 08/10/12. 4) The episode with rank 6 has occurred on
Modi’s Birthday on 09/17/12. 5) The episode occurred after
Modi completed 4000 days as Gujarat’s CM and the rank
of the episode is 5. It has occurred on 09/18/12. 6)
Message from Modi is the next episode whose rank is 2. It has
occurred on 10/13/12.</p>
      <p>As part of the evaluation of the TEA system, we calculated
the precision, recall and F-measure of our approach. For
each entity, the detected episodes are manually classified as
either valid or invalid. An episode is valid if it is
a significant event that occurred in the lifespan of that
particular entity. The precision of the TEA algorithm for an
entity is the ratio of the number of valid episodes to the
total number of episodes detected for that entity.
The precision of the TEA system is the average of the
precision over all the entities.</p>
      <p>The recall of the TEA system for a particular entity is the
ratio of the number of valid episodes to the actual number of
episodes that occurred over that entity’s lifespan on
Twitter. The recall of the TEA system is the average of the
recall over all the entities. However, it is difficult to determine
how many episodes actually occurred for an entity over
its Twitter lifespan. So, for each entity, we manually
searched the internet (mostly the entity’s Wikipedia page)
and listed the significant events that occurred
in the period from March 2012 to December 2012.</p>
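      <p>The two definitions above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names and the per-entity counts are hypothetical and are not taken from Table 3.</p>
      <preformat>
```python
from statistics import mean


def precision(valid_detected: int, total_detected: int) -> float:
    """Share of the detected episodes that were manually judged valid."""
    return valid_detected / total_detected if total_detected else 0.0


def recall(valid_detected: int, actual_events: int) -> float:
    """Share of the manually listed significant events that were detected."""
    return valid_detected / actual_events if actual_events else 0.0


# Hypothetical counts per entity: (valid detected, total detected, actual events).
# These numbers are made up for illustration only.
counts = {
    "entity_a": (5, 6, 10),
    "entity_b": (3, 3, 8),
}

per_entity = {
    name: (precision(valid, detected), recall(valid, actual))
    for name, (valid, detected, actual) in counts.items()
}

# System-level values are plain averages over the entities.
system_precision = mean(p for p, _ in per_entity.values())
system_recall = mean(r for _, r in per_entity.values())
print(round(system_precision, 3), round(system_recall, 3))
```
      </preformat>
      <p>Because the system-level values are unweighted averages, an entity with few detected episodes influences them as much as an entity with many.</p>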
      <p>Table 3 shows the precision and recall for each entity
given as input to the TEA system. The overall precision
of the system, calculated over these 11 entities, is 0.864,
whereas the overall recall of the system is 0.503.</p>
      <p>The F1-score (F-measure) is a measure of a test’s accuracy.
It is the harmonic mean of the precision and the recall,
and its formula is given by:
<disp-formula><label>(6)</label><tex-math>\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}</tex-math></disp-formula>
Table 4 shows the F1-scores (F-measures) computed
for each of the entities considered, using the precision and
recall values from Table 3. The overall F1-score of the TEA
system is 0.62.</p>
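      <p>As a worked example of Equation (6), the sketch below plugs in the overall precision (0.864) and recall (0.503) reported above, which gives about 0.636. This is slightly higher than the reported overall value of 0.62; one possible reading, which the text does not state explicitly and is therefore only an assumption, is that the overall score is an average of per-entity F1-scores. The second part of the sketch uses hypothetical (precision, recall) pairs to show that the two ways of aggregating generally differ.</p>
      <preformat>
```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, as in Equation (6)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in the text for the 11 entities.
overall = f1_score(0.864, 0.503)
print(round(overall, 3))  # 0.636

# Hypothetical per-entity (precision, recall) pairs, for illustration only.
pairs = [(1.0, 0.2), (0.7, 0.8)]

# Option 1: average the per-entity F1-scores.
mean_of_f1 = sum(f1_score(p, r) for p, r in pairs) / len(pairs)

# Option 2: compute F1 from the averaged precision and recall.
f1_of_means = f1_score(
    sum(p for p, _ in pairs) / len(pairs),
    sum(r for _, r in pairs) / len(pairs),
)
print(round(mean_of_f1, 3), round(f1_of_means, 3))
```
      </preformat>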
      <p>The precision, recall and F-measure values
presented for different entities are calculated by setting a different
threshold (spikeFactor) for each entity. These
validation measures change with the threshold value
that is set. For the entity ‘Narendra Modi’, we present the
values of the validation measures for different thresholds.
Figure 4.(b) shows how the precision, recall and F-measure values
change with spikeFactor (threshold). The plot shows the
precision, recall and F-measure values on the Y-axis for different
thresholds on the X-axis. The blue line in the plot corresponds
to precision, the maroon line to recall and the green
line to F-measure. The precision starts low, increases to a
maximum value and then decreases as spikeFactor increases.
The recall starts even lower and increases
with spikeFactor until it reaches a maximum value, after which
it remains constant. The F-measure follows a pattern similar
to that of the precision curve. Table 5 shows the
precision, recall and F-measure values for different values of spikeFactor.
      </p>
      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Episodes detected for the entity “Narendra Modi”, in order of their date of occurrence.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Episode</th>
              <th>Date/Duration</th>
              <th>Maximum Frequent Tweet [[Frequent Nouns]] [[Polarity Score]]</th>
              <th></th>
              <th></th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Modi on cover page of Time Magazine</td>
              <td>03/17/12</td>
              <td>RT @vijsimha: Here’s news more interesting than #Budget2012. Time magazine puts Narendra Modi on cover as the man who could change Indi ... [[news, time, magazine, narendra, modi, cover, man]] [[1]]</td>
              <td>314</td>
              <td>http://timesofindia.indiatimes.com/india/Narendra-Modion-Time-magazinecover/articleshow/12296366.cms</td>
            </tr>
            <tr>
              <td>Modi going to Japan</td>
              <td>07/24/12</td>
              <td>RT @sardesairajdeep: Appreciate Narendra Modi for going to Japan and standing by Haryana govt. Nation above politics. (there you go folk ... [[narendra, modi, japan, standing, haryana, govt]] [[1]]</td>
              <td>280</td>
              <td>http://articles.economictimes.indiatimes.com/2012-0723/news/32804624 1 maruti-suzukis-manesar-manesar-plant-maruti-smanesar</td>
            </tr>
            <tr>
              <td>Modi wishes on Janmashtami</td>
              <td>08/10/12</td>
              <td>RT @TOIBlogs: Janmashtami the protector of cows, Lord Krishna’s birthday : Narendra Modi http://t.co/foHZ8Qwb [[protector, cow, lord, krishna, birthday, narendra]] [[1]]</td>
              <td>86</td>
              <td>http://t.co/foHZ8Qwb</td>
            </tr>
            <tr>
              <td>Modi’s Birthday</td>
              <td>09/17/12</td>
              <td>RT @Ohfakenews: Narendra Modi turns 62 today. You may remember him from his biggest hit: Naroda Patiya riots. #HappyBdayNamo #NaMo [[narendra, modi, today, hit, #happybdaynamo, #namo]] [[0]]</td>
              <td>27</td>
              <td>http://en.wikipedia.org/wiki/Narendra Modi</td>
            </tr>
            <tr>
              <td>4000 days as Gujarat’s CM</td>
              <td>09/18/12</td>
              <td>RT @sardesairajdeep: Narendra Modi completes 4000 days as Gujarat chief minister today. Quite an achievement Shouldn’t that be trending? [[narendra, modi, days, gujarat, chief, minister, today]] [[0]]</td>
              <td>93</td>
              <td>http://samvada.org/2012/news/4000days-as-cm-narendra-modi-takesgujarat-as-model-state-of-india-indevelopment/</td>
            </tr>
            <tr>
              <td>Message from Modi</td>
              <td>10/13/12</td>
              <td>RT @Swamy39: Narendra Modi: UK has melted. US is not far behind. The hidden message is that if we are strong then they will come looking ... [[narendra, modi, message]] [[1]]</td>
              <td>437</td>
              <td></td>
            </tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <p>Note: * - not generated from our algorithm, but provided by us as a verification of the episode detected.</p>
        </table-wrap-foot>
      </table-wrap>
      <p>The top 6 episodes detected for the entity
‘Narendra Modi’ with the threshold (spikeFactor) set to 50 are
presented in Table 2, and the validation measures for different
thresholds for ‘Narendra Modi’ are presented in Table 5.</p>
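      <p>The threshold study behind Figure 4.(b) and Table 5 amounts to re-running detection for several spikeFactor values and recomputing the three measures each time. The sketch below shows only that bookkeeping. The <monospace>detect</monospace> callable, the episode identifiers and the toy outputs are all hypothetical stand-ins; they do not reproduce the TEA detection algorithm or the values in Table 5.</p>
      <preformat>
```python
from typing import Callable, Iterable, Set


def sweep_spike_factor(
    detect: Callable[[float], Set[str]],
    valid: Set[str],
    actual_events: int,
    spike_factors: Iterable[float],
):
    """Tabulate precision, recall and F-measure for each spikeFactor value.

    `detect` is a hypothetical stand-in for the episode detector: given a
    spikeFactor it returns the identifiers of the detected episodes.
    `valid` holds the identifiers that were manually judged valid.
    """
    rows = []
    for factor in spike_factors:
        detected = detect(factor)
        hits = len(detected.intersection(valid))
        p = hits / len(detected) if detected else 0.0
        r = hits / actual_events if actual_events else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        rows.append((factor, p, r, f))
    return rows


# Toy detector outputs, made up for illustration only.
toy_runs = {
    10: {"e1", "x1", "x2"},
    50: {"e1", "e2", "e3", "x1"},
    90: {"e1", "e2", "e3", "x1", "x2", "x3"},
}
rows = sweep_spike_factor(toy_runs.get, {"e1", "e2", "e3"}, 6, [10, 50, 90])
for factor, p, r, f in rows:
    print(f"spikeFactor={factor}: P={p:.2f} R={r:.2f} F={f:.2f}")
```
      </preformat>
      <p>Tabulating the rows this way makes it straightforward to pick, per entity, the spikeFactor with the highest F-measure.</p>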
    </sec>
    <sec id="sec-12">
      <title>CONCLUSIONS</title>
      <p>Inferring significant knowledge or insight from a
huge number of tweets raises several problems. The key issue is to
comprehend what a set of tweets conveys about an entity.
Our approach has been to consider the lifetime of an entity and
determine the events that occur in it. From these events
we derive episodes, which give a larger description of what the
set of tweets is conveying, and then show the strength of the
episodes of an entity. We built a system that takes any
entity as a keyword and processes the relevant tweets to detect the
episodes. Our results validate our approach by providing
episodes that capture the essence of the information that can be
gleaned from tweets. In particular, we are able to convey
the sentiments of tweets and the phrases that describe tweets
over different periods of time. Therefore, our system can
be used to gain a short-term understanding from tweets
about a given entity and to use it to promote or rectify certain
actions: for example, selling more mobile phones at a discount
or quickly sending out a patch for a malfunctioning applet. As
part of future work, we will continue to improve the core
algorithms applied in this paper and delve into what can be
learned from detected episodes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <source>Apache Lucene</source>
          . https://lucene.apache.org/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <source>MPQA Subjectivity Lexicon</source>
          . http://mpqa.cs.pitt.edu/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <source>Stanford Part-Of-Speech Tagger</source>
          . http://nlp.stanford.edu/software/tagger.shtml.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <source>Stanford Stop-Word List</source>
          . http://www.wordsift.com/wordlists.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <source>ZingChart Javascript Charting Library</source>
          . http://www.zingchart.com.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Asur</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Huberman</surname>
          </string-name>
          .
          <article-title>Trends in social media: Persistence and decay</article-title>
          .
          <source>AAAI</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Naaman</surname>
          </string-name>
          .
          <article-title>Beyond trending topics: Real-world event identification on twitter</article-title>
          .
          <source>AAAI</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Suh</surname>
          </string-name>
          .
          <article-title>Eddi: Interactive topic-based browsing of social status streams</article-title>
          .
          <source>UIST</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gruhl</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Guha</surname>
          </string-name>
          .
          <article-title>Information diffusion through blogspace</article-title>
          .
          <source>WWW</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwata</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <article-title>Aspectiles: Tile-based visualization of diversified web search results</article-title>
          .
          <source>SIGIR</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kwak</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>What is twitter, a social network or a news media?</article-title>
          <source>WWW</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathioudakis</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          .
          <article-title>Twittermonitor: Trend detection over the twitter stream</article-title>
          .
          <source>SIGMOD</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Morris</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Counts</surname>
          </string-name>
          .
          <article-title>Tweeting is believing? understanding microblog credibility perceptions</article-title>
          .
          <source>CSCW</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nichols</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mahmud</surname>
          </string-name>
          .
          <article-title>Summarizing sporting events using twitter</article-title>
          .
          <source>ACM IUI</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakaki</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          .
          <article-title>Earthquake shakes twitter users: real-time event detection by social sensors</article-title>
          .
          <source>WWW</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>