<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop, Glasgow, Scotland</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>StreamGrid: Summarization of Large Scale Events using Topic Modelling and Temporal Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emmanouil Schinas</string-name>
          <email>manosetro@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <email>ikom@iti.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <email>papadop@iti.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pericles A. Mitkas</string-name>
          <email>mitkas@eng.auth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Electrical &amp; Computer Engineering, Aristotle University of Thessaloniki; Information Technologies Institute, Centre for Research &amp; Technology Hellas</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Technologies Institute, Centre for Research &amp; Technology Hellas</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>0</volume>
      <fpage>1</fpage>
      <lpage>04</lpage>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>Due to the increasing popularity of
microblogging platforms, the number of messages
related to large-scale public events has reached
impressive levels. Although such messages can be
quite informative regarding different aspects
of the main event, there is a lot of spam and
redundancy, which makes it challenging to extract
insights regarding the event of interest. In this
work we describe a summarization framework
that captures the important moments of an
event by using a combination of topic
modelling and bursty activity detection. We
propose a data structure named StreamGrid that
maintains the information of active topics in
regular time intervals at several scales. This
structure is used for the creation of concise
summaries for any time interval. Finally, the
evaluation on a large Twitter dataset around
the Sundance Film Festival demonstrates the
potential of the proposed framework.</p>
      <p>Copyright © by the paper's authors. Copying permitted only
for private and academic purposes.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>
        Due to their increasing popularity, micro-blogging
platforms, and especially Twitter, have evolved into a
powerful means for getting connected with real-world
events. In large-scale public events, ranging from sports
events, such as football matches, to political events
and festivals, the users that are somehow involved in
the event use social media to share their experiences
and express their opinions. In many cases, these
messages are quite informative, provide real-time
coverage of the ongoing event, and may be correlated with
important variables related to the event, e.g. film
ratings [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Thus, not surprisingly, the number of
event-related messages has reached impressive levels [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>However, a significant percentage of micro-blogging
messages can be considered non-informative or
spam. This fact, combined with the huge number
of messages, makes it very challenging for interested
stakeholders, such as event organizers and enthusiasts,
to monitor the evolution of the event and understand
its important moments. In the case of long-running events,
this becomes even more difficult due to the existence of
numerous sub-events occurring within the main event.
Such sub-events have different durations and impact
on the main event. In addition, a large portion of the
messages contain conversations about other entities of
interest associated with the event. In other words, an
event-related stream of messages is quite diverse and
noisy, with different associated topics, conversations
among users, and spam messages. Thus, there is a
profound need for event-based summarization methods
that can produce concise multi-document summaries
for any time interval of the event, covering its main
aspects.</p>
      <p>The framework we propose in this work aims to
create topic-based summaries of large-scale events for
arbitrary time durations by applying post-analysis on
the stream of event-related messages. First, we
apply LDA topic modelling to discover the underlying
aspects of the event. To support summarization, we
create a 2D-array structure named StreamGrid, which
maintains the information of each topic at each time
interval. To create the grid we assign messages to the
detected topics and divide topic-associated messages
using regular time intervals. Next, we create timelines
for the set of topics and use them to detect the set
of active topics at each time interval by finding the
bursty activity periods in them. A greedy algorithm
is used to obtain a set of representative messages that
maximizes the coverage of the event, by selecting the
maximum possible number of active topics, and at the
same time minimizes redundancy across messages.
Finally, to demonstrate the potential of the proposed
framework, we perform an experimental evaluation on
a real-world dataset consisting of tweets around the
Sundance Film Festival 2013.</p>
      <p>The paper is organized as follows. Section 2
contains a brief survey of related methods and
applications. Section 3 describes in detail the proposed
framework. Section 4 presents an experimental case study
on the Sundance 2013 dataset. We conclude the paper
and describe future work in Section 5.</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        A substantial body of work exists in the literature on
the problem of micro-blogging summarization. A
notable method for multi-document summarization
relies on the computation of centroids based on content.
Namely, the summary of a set of documents,
represented as tf-idf vectors, consists of those documents
that are closest to the centroid of the set [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Sharifi et
al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] propose a method for the generation of a single
sentence from a set of tweets, by using a graph-based
technique. Nichols et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] describe an algorithm
that generates a summary of sports events. They use
a peak detection algorithm to detect important
moments and then apply the method of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] to extract
summary sentences from the tweets around these
moments. The work of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] uses linear-programming
optimization to select summary sentences from tweets
related to trending topics. Notably, they also make use
of linked Web content to extend the original sources
of information.
      </p>
      <p>
        Shen et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] present a participant-based
approach for event summarization. A mixture model is
proposed to detect sub-events at participant level, and
the tf-idf centroid approach is used to create a
summary of each sub-event. Similarly, Chakrabarti and
Punera [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] propose the use of a Hidden Markov Model
to obtain a time-based segmentation of the stream that
captures the underlying sub-events. Alonso and Shiells
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] create timelines for football games, annotated with
the key aspects of the event. Dörk et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] propose
an interface for large scale events that employs several
visualizations for interactive presentation of the event.
      </p>
      <p>
        A different problem is tackled by Wang et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
Unlike other methods, their method aims to create a
storyline from a set of event-related objects. A
multi-view graph of objects is constructed, where the two
types of edges capture the contextual similarity and
the temporal proximity among objects. Then a
time-ordered sequence of important objects is obtained via
graph optimization. Lin et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] extend the previous
work to generate storylines from a set of micro-blog
messages for arbitrary queries. To achieve this, they
use query expansion techniques to retrieve the
query-related messages and then apply the same method as
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to create the storyline.
      </p>
      <p>
        Another approach for summarizing evolving tweet
streams is proposed by the Sumblr framework [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
This relies on an online clustering algorithm for tweets
and on maintaining distilled statistics of the clusters at
specific time snapshots using a structure named
Pyramidal Time Frame. Then, a summarization technique
is employed for generating summaries of arbitrary time
durations based on the LexRank method [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Proposed Method</title>
      <p>An overview of the proposed method is illustrated in
Figure 1. The proposed framework processes a stream
of online messages around an event and extracts
informative summaries for any requested time duration. In
other words, the proposed framework identifies a set
of topics and then selects related messages based on
their importance.</p>
      <sec id="sec-4-1">
        <title>Topic Modelling</title>
        <p>
          Topic modelling is based on the assumption that each
document can be described as a random mixture of
topics and each topic as a multinomial distribution
over terms. In our approach we employ topic
modelling by using the well-known Latent Dirichlet
Allocation model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] across the whole stream of messages.
This process is applied after the end of the event, when
all the messages are available. However, topic
modelling in micro-blog messages is problematic due to the
short length of their text. Many approaches have been
proposed to overcome this. To avoid changes to
standard LDA, a relatively simple solution is message
pooling, in which messages are pooled together to form
larger documents. We experimented with four
methods of message pooling, in a similar way to [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. First,
we tried to merge messages using constant-length time
bins. Then, we merged messages of the same author to
form a single document. As a third option, we pooled
messages together based on their hashtags. Messages
with multiple hashtags were assigned to multiple documents,
and messages without any hashtag were assigned to
the document with the highest textual similarity. As
a fourth option, we used a 1NN clustering algorithm to
cluster messages with high textual similarity. Each of
those clusters formed a single document for the LDA
method. In addition, for all of the pooling methods
we filtered out messages having only one term and
removed standard stopwords to discard the
non-informative terms.
        </p>
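        <p>The hashtag-based pooling described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names pool_by_hashtag and attach_untagged are hypothetical, and the textual similarity used for messages without hashtags is assumed here to be a Jaccard overlap of word sets, since the paper does not specify the measure.</p>
        <preformat>
```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")


def tokens(text):
    """Lower-cased word set of a message (hashtag words included)."""
    return set(re.findall(r"\w+", text.lower()))


def pool_by_hashtag(messages):
    """Group messages into one pooled document per hashtag.

    A message with several hashtags is added to several pools.
    Messages without hashtags are returned separately.
    """
    pools = defaultdict(list)
    untagged = []
    for text in messages:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if not tags:
            untagged.append(text)
        for tag in tags:
            pools[tag].append(text)
    return dict(pools), untagged


def attach_untagged(pools, untagged):
    """Add each untagged message to the most similar pooled document.

    Similarity is an assumed stand-in: Jaccard overlap of word sets.
    """
    if not pools:
        return
    for text in untagged:
        words = tokens(text)

        def overlap(tag):
            doc = set().union(*(tokens(m) for m in pools[tag]))
            return len(words.intersection(doc)) / len(words.union(doc))

        pools[max(pools, key=overlap)].append(text)
```
        </preformat>
        <p>Each value of the returned dictionary would then be concatenated into a single document before LDA estimation.</p>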
        <p>
          Another drawback of LDA is that the number of
topics must be defined in advance; obviously, the number of
topics is not known a priori in the context of large events.
To determine the optimal number of topics for a given
set of documents D we calculate two metrics,
perplexity and average similarity across topics, for different
numbers of topics and choose a value that minimizes
both metrics. For the calculation of perplexity we split
D into training and test documents, we estimate LDA
over a range of possible numbers of topics using D<sub>train</sub>
and calculate the total perplexity of the documents in
the test dataset D<sub>test</sub> [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The perplexity of a
document d given a trained model is defined as follows:
          <disp-formula>
            <label>(1)</label>
            <tex-math><![CDATA[\mathrm{perplexity}(d) = \exp\left( -\frac{\log P(d \mid \theta, \phi, G)}{L_d} \right)]]></tex-math>
          </disp-formula>
where L<sub>d</sub> is the number of terms in document d, θ is
the document-specific topic distribution, φ is the word
distribution for topics, and G is the set of topics in the
trained model. The total perplexity over dataset D<sub>test</sub>
is defined as
          <disp-formula>
            <label>(2)</label>
            <tex-math><![CDATA[\mathrm{perplexity}(D_{test}) = \exp\left( -\frac{\sum_{d \in D_{test}} \log P(d \mid \theta, \phi, G)}{\sum_{d \in D_{test}} L_d} \right)]]></tex-math>
          </disp-formula>
        </p>
        <p>For the similarity between two topics, we calculate
the Jaccard coefficient on the sets of top N terms of
each topic.</p>
        <p>After the detection of topics we have to associate
messages with topics. We use the LDA model, estimated
from the merged documents, to infer the
probabilities of each message over the set of topics. We assign
each message to the topic with the highest
probability, under the condition that this probability exceeds
a predefined threshold. Although thresholding in this
step leaves some messages unassigned, this is a
desirable feature of the procedure, as most of the
unassigned messages are of low quality. In other words,
these messages can be considered as spam
that cannot contribute any valuable information to the
summary.</p>
        <p>Next, the assignments are used for the creation
of a data structure named StreamGrid. The first
dimension of this grid comprises the detected topics and
the second corresponds to time, divided into regular
time intervals. Each cell c(i, j) of StreamGrid
contains the set of messages M<sub>ij</sub> associated with topic<sub>i</sub>
at time interval j. Each message m is represented as a
tf-idf vector. The idf components are pre-computed
over the whole set of messages. The tf part is the
frequency of a term in the message normalized by the
maximum frequency. Due to the short length of the
documents in micro-blogging platforms, this
component often equals one. Using the set of associated
messages in each cell, we calculate a merged tf-idf
vector v<sub>ij</sub>. In addition, we calculate a weight for each
message and rank the messages accordingly. The weight of
a message m, associated with topic<sub>i</sub>, in a specific time
window j is defined as the sum of the weights of the
terms contained in m. To calculate the weight of each
term t, we use the following tf-idf scheme:
          <disp-formula>
            <label>(3)</label>
            <tex-math><![CDATA[W(t, i, j) = tf_{ij}(t) \cdot idf(t)]]></tex-math>
          </disp-formula>
          <disp-formula>
            <label>(4)</label>
            <tex-math><![CDATA[W(m, j) = \sum_{t \in m} W(t, i, j)]]></tex-math>
          </disp-formula>
where tf<sub>ij</sub>(t) is the frequency of term t ∈ v<sub>ij</sub> in the
cell c(i, j) of StreamGrid, idf(t) is the inverse
document frequency over the whole corpus, W(t, i, j) is
the weight of term t in c(i, j), and W(m, j) is the weight
of message m in time interval j.</p>
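        <p>A minimal sketch of how the grid and the weights above might be computed is given below. It is not the authors' code. The names assign_topic, build_grid, idf_table, term_weights and message_weight are hypothetical. The sketch assumes that tf<sub>ij</sub>(t) is the raw count of t over all messages of a cell, that idf(t) = log(N / df(t)), and that a message weight sums over its distinct terms; the paper does not fix any of these details.</p>
        <preformat>
```python
import math
from collections import Counter, defaultdict


def assign_topic(topic_probs, threshold=0.5):
    """Index of the most probable topic, or None if it does not exceed the threshold."""
    best = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    return best if topic_probs[best] > threshold else None


def build_grid(messages, interval, threshold=0.5):
    """messages: iterable of (timestamp, tokens, topic_probs).

    Returns {(topic i, interval j): [tokens, ...]}; unassigned messages are dropped.
    """
    grid = defaultdict(list)
    for timestamp, tokens, probs in messages:
        topic = assign_topic(probs, threshold)
        if topic is not None:
            grid[(topic, int(timestamp // interval))].append(tokens)
    return dict(grid)


def idf_table(messages):
    """idf over the whole message set (assumed form: log(N / df))."""
    docs = [set(tokens) for _, tokens, _ in messages]
    df = Counter(term for doc in docs for term in doc)
    return {term: math.log(len(docs) / n) for term, n in df.items()}


def term_weights(cell_messages, idf):
    """W(t, i, j) = tf_ij(t) * idf(t) for every term in one cell."""
    tf = Counter(term for tokens in cell_messages for term in tokens)
    return {term: count * idf[term] for term, count in tf.items()}


def message_weight(tokens, weights):
    """W(m, j): sum of the cell-level weights of the distinct terms of m."""
    return sum(weights[term] for term in set(tokens))
```
        </preformat>
        <p>Ranking the messages of a cell then amounts to sorting them by message_weight in decreasing order.</p>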
        <p>
          To detect the time intervals in which a specific topic<sub>i</sub> of
StreamGrid is active, we create a topic timeline by
using time intervals as bins and counting the associated
messages of topic<sub>i</sub> in bin j. Then, we apply the peak
detection algorithm used in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to detect time frames in
the timeline that exhibit bursty behaviour. The
algorithm identifies windows with high activity by finding
significant increases in the timeline, compared to the
historical mean value of activity. The time windows
reported by the algorithm are used to set the active
topics of each time interval. For example, if for a
specific topic i the algorithm identifies a time window
[a, b] with high activity, then we define all the time
intervals a ≤ j ≤ b as active moments of topic<sub>i</sub>. After
this step, each cell of StreamGrid has a flag that
indicates whether it is active or not. We use
this flag to select a summary subset of messages, as
described in the Topic-Time Summarization section. Also, for each active
topic<sub>i</sub> in a specific time interval j, we calculate a score
that captures its significance over the rest of the active
topics A in the same time interval.
        </p>
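        <p>To make the activity flags concrete, the sketch below marks bursty windows on a topic timeline. It is explicitly not the peak detection algorithm of [9]: as a simplified stand-in, a bin is treated as bursty when its count exceeds a multiple of the mean of all earlier quiet bins. The function names and the parameters factor and min_history are assumptions introduced only for illustration.</p>
        <preformat>
```python
def bursty_windows(timeline, factor=2.0, min_history=3):
    """Return inclusive (a, b) bin windows whose counts exceed factor * historical mean.

    Simplified stand-in for a real peak detector: only quiet bins
    contribute to the historical mean.
    """
    windows = []
    start = None
    history_sum, history_len = 0.0, 0
    for j, count in enumerate(timeline):
        mean = history_sum / history_len if history_len else 0.0
        is_burst = history_len >= min_history and count > factor * mean
        if is_burst:
            if start is None:
                start = j  # a new window opens at this bin
        else:
            if start is not None:
                windows.append((start, j - 1))  # the window closed at the previous bin
                start = None
            history_sum += count
            history_len += 1
    if start is not None:
        windows.append((start, len(timeline) - 1))
    return windows


def active_flags(timeline, windows):
    """One boolean per bin: True for every j with a, b such that j lies in [a, b]."""
    flags = [False] * len(timeline)
    for a, b in windows:
        for j in range(a, b + 1):
            flags[j] = True
    return flags
```
        </preformat>
        <p>Applying active_flags to every topic timeline yields one row of flags per topic, which corresponds to the per-cell flag of the grid.</p>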
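        <p>
          <disp-formula>
            <label>(5)</label>
            <tex-math><![CDATA[\mathrm{Significance}(topic_i, j) = \frac{|M_{ij}|}{\sum_{topic_k \in A} |M_{kj}|}]]></tex-math>
          </disp-formula>
        </p>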
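        <p>As a small worked example of the significance score, the sketch below computes it for all active topics of one time interval. The function name and the list-of-lists layout (counts[i][j] for the number of messages in cell c(i, j), active[i][j] for its flag) are assumptions made for illustration.</p>
        <preformat>
```python
def significance(counts, active, j):
    """Share of each active topic among the messages of all active topics at interval j.

    counts[i][j] is the number of messages in cell (i, j);
    active[i][j] is the activity flag of that cell.
    Returns {topic index: score}; inactive topics are omitted.
    """
    active_topics = [i for i in range(len(counts)) if active[i][j]]
    total = sum(counts[i][j] for i in active_topics)
    if not total:
        return {}
    return {i: counts[i][j] / total for i in active_topics}
```
        </preformat>
        <p>By construction the scores of the active topics at a given interval sum to one.</p>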
        <p>
          In addition, to have an overall estimation of the
importance of each topic throughout the event, we
calculate two measures for each topic using a
similar approach to [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. More specifically, we define the
peakiness of a topic as:
        </p>
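        <p>
          <disp-formula>
            <label>(6)</label>
            <tex-math><![CDATA[\mathrm{peakiness}(topic_i) = \frac{\max_j |M_{ij}|}{\sum_{\forall j} |M_{ij}|}]]></tex-math>
          </disp-formula>
and its persistence as
          <disp-formula>
            <label>(7)</label>
            <tex-math><![CDATA[\mathrm{persistence}(topic_i) = \frac{\mathrm{avg}_{t_{peak} < j} \left( |M_{ij}| \, / \, \sum_{j'} |M_{ij'}| \right)}{\mathrm{avg}_{j < t_{peak}} \left( |M_{ij}| \, / \, \sum_{j'} |M_{ij'}| \right)}]]></tex-math>
          </disp-formula>
where t<sub>peak</sub> is the time at which the maximum peak of the
timeline occurs.</p>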
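        <p>The two topic-level measures can be sketched as below. Peakiness follows the definition directly. For persistence, the sketch assumes the measure is the ratio of the average activity after the peak to the average activity before it (any common normalization by the total number of messages cancels in the ratio). Returning None when the peak falls on the first or last interval, or when there is no activity before it, is also an assumption.</p>
        <preformat>
```python
def peakiness(timeline):
    """Largest bin count divided by the total count of the topic timeline."""
    total = sum(timeline)
    return max(timeline) / total if total else 0.0


def persistence(timeline):
    """Average activity after the peak divided by average activity before it.

    Returns None when the ratio is undefined (assumed convention).
    """
    peak = max(range(len(timeline)), key=timeline.__getitem__)
    before, after = timeline[:peak], timeline[peak + 1:]
    if not before or not after or sum(before) == 0:
        return None
    return (sum(after) / len(after)) / (sum(before) / len(before))
```
        </preformat>
        <p>A topic with a single sharp spike has peakiness close to one, while a topic discussed evenly over many intervals has low peakiness.</p>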
      </sec>
      <sec id="sec-4-2">
        <title>Topic-Time Summarization</title>
        <p>
          Our goal is to use the StreamGrid to summarize the
event for an arbitrary time frame. By summary we
denote a set of representative messages that mention
the key aspects of the selected time period. Assuming
that topics can capture these aspects, we use the
active topics for that period to create a summary that
meets the following criteria: a) as many aspects as
possible are covered and b) redundancy due to
near-duplicate messages is minimized. To achieve this, we
use an adapted version of the greedy algorithm used in
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The algorithm selects messages that are
associated with different topics and that simultaneously have
a low degree of textual similarity to each other.
The selection process is detailed in Algorithm 1. For
an arbitrary time frame F = [a, b], we first find the
sequence of time intervals in StreamGrid that covers
F. Then we get the set of active topics. A topic<sub>i</sub> is
active in F if any cell c(i, j) contained in F is active.
Also, the significance score of an active topic in F is
defined as the maximum significance score across all
time intervals in F. The weight W(t, i, F) of a term
t for topic<sub>i</sub> in F is defined as the sum of its weights
in each cell c(i, j) ∈ F. In a similar way, we define
the weight W(m, F) of message m over F. Note that
although a message belongs to a specific time interval,
we use the term weights across the whole time frame
to calculate the weight of m.
        </p>
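        <p>The frame-level quantities defined above (active topics in F, their significance, and the term weights W(t, i, F)) might be aggregated as in the following sketch. The function name frame_view and the dictionary layout of a cell are hypothetical and chosen only for illustration.</p>
        <preformat>
```python
from collections import Counter


def frame_view(cells, a, b):
    """Aggregate grid cells over the time frame F = [a, b].

    cells: {(i, j): {"active": bool, "significance": float, "weights": {term: W(t, i, j)}}}
    Returns (active topics, {i: max significance in F}, {i: Counter of W(t, i, F)}).
    """
    span = range(a, b + 1)
    # A topic is active in F if any of its cells inside F is active.
    active = {i for (i, j), cell in cells.items() if j in span and cell["active"]}
    significance = {i: 0.0 for i in active}
    weights = {i: Counter() for i in active}
    for (i, j), cell in cells.items():
        if i in active and j in span:
            if cell["active"]:
                significance[i] = max(significance[i], cell["significance"])
            weights[i].update(cell["weights"])  # sums the weights term by term
    return active, significance, weights
```
        </preformat>
        <p>The weight W(m, F) of a message of topic i is then the sum of weights[i][t] over the terms t of the message.</p>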
        <sec id="sec-4-2-1">
          <title>Algorithm 1 Topic-Time summarization</title>
          <p>Input: StreamGrid, a time frame F, length of
summary L. Output: a summary set S.</p>
          <preformat>
 1: S = ∅
 2: A = {set of active topics in F}
 3: M_c = { argmax_m W(m, i, F), ∀ i ∈ A }
 4: while |S| &lt; L and M_c ≠ ∅ do
 5:   for each message m in M_c do
 6:     calculate score(m) according to Equation 8
 7:   end for
 8:   select m_max = argmax_m [score(m)]
 9:   S = S ∪ {m_max}
10:   M_c = M_c − {m_max}
11: end while
12: if |S| &lt; L then
13:   M = ∪ M_ij, ∀ i ∈ A, j ∈ F
14:   M′ = M − S
15:   while |S| &lt; L do
16:     for each message m in M′ do
17:       calculate score(m) according to Equation 8
          </preformat>
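          <p>The greedy selection of Algorithm 1 can be illustrated with the sketch below. Equation 8 is not reproduced in this text, so the score used here is an assumed form: alpha times (topic significance times message weight) minus (1 - alpha) times the average cosine similarity to the messages already selected. All names, the dictionary layout of a message, and the expectation that weights are normalized to [0, 1] are assumptions.</p>
          <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def score(msg, selected, alpha):
    """Assumed stand-in for Equation 8: importance minus redundancy."""
    importance = msg["significance"] * msg["weight"]
    redundancy = sum(cosine(msg["vec"], s["vec"]) for s in selected) / len(selected) if selected else 0.0
    return alpha * importance - (1 - alpha) * redundancy


def greedy_pick(candidates, summary, length, alpha):
    """Move the best-scoring candidate into the summary until it is full or the pool is empty."""
    pool = list(candidates)
    while length > len(summary) and pool:
        best = max(pool, key=lambda m: score(m, summary, alpha))
        summary.append(best)
        pool.remove(best)


def summarize(messages, length, alpha=0.5):
    """messages: dicts with keys id, topic, weight, significance, vec, restricted to frame F."""
    top_per_topic = {}
    for m in messages:
        current = top_per_topic.get(m["topic"])
        if current is None or m["weight"] > current["weight"]:
            top_per_topic[m["topic"]] = m
    summary = []
    # First pass: the highest weighted message of each active topic.
    greedy_pick(top_per_topic.values(), summary, length, alpha)
    # Second pass: the remaining messages of the active topics, if the summary is still short.
    chosen = {m["id"] for m in summary}
    greedy_pick([m for m in messages if m["id"] not in chosen], summary, length, alpha)
    return summary
```
          </preformat>
          <p>With alpha close to one the selection favours important messages; with alpha close to zero it favours messages that differ from those already chosen.</p>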
          <p>To produce a summary S of length L, the algorithm
first gets the set of active topics as described above.
Then, it collects the messages M<sub>c</sub> with the highest
weight W(m, F) in each active topic (line 3). In
lines 4-11, the algorithm, following a greedy
approach, selects the messages that maximize the score
of Equation 8. This consists of two parts weighted
by a parameter a. The first part measures the
importance of the message, while the second measures its redundancy
compared to the set of already selected messages. The
importance of a message m ∈ topic<sub>i</sub> is a combination
of two factors: a) the significance of the topic it
belongs to in this time frame, and b) the contribution
of its textual content. To measure the redundancy of
a message, we compute its average cosine similarity to
the already selected messages. If the summary length
is not reached, we perform the same selection process
on the set of tweets that belong to the active topics
(lines 12-23).</p>
          <p>We conducted an evaluation of the proposed method
on a dataset around the Sundance 2013 Film
Festival that took place between January 15th and 30th,
2013. We used the Streaming API of Twitter to
acquire tweets containing terms related to Sundance and
posted during the event. More precisely, we collected
all tweets containing the hashtags #sundance,
#sundance2013 and #sundancefest, and all the tweets that
mentioned the official account of the Sundance Film
Festival (@sundancefest). This resulted in a dataset of
201,752 tweets. Among them, 100,046 were original
tweets, while the rest were retweets. Although
using three hashtags and one mentioned account
covers only a subset of all possible tweets about the event,
we consider this subset sufficiently representative, as
the vast majority of Twitter users tend to adopt the
official hashtags provided by organizers during events.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Topic detection</title>
        <p>Figure 2 shows the perplexity and average similarity
for different numbers of topics K. Although there is
significant variance across the different values of K, the
main trend for perplexity is to decrease as K increases.
As we can see from Figure 2, the average similarity
between all pairs of topics appears to stabilize for values
of K larger than 100. However, a large
number of topics produces topics with very few
associated messages. We found that for K &gt; 200 there is
a substantial proportion of topics that have no
associated message. Taking these facts into account, we
set K = 200 for the rest of the evaluation. Regarding
the pooling scheme, merging tweets having the same
hashtags into single documents gave us the best
performance with respect to perplexity and average topic
similarity.</p>
        <p>The first part of Table 2 contains the top five topics
with respect to peakiness and the second part the
topics with the highest persistence ratio. Examining
the set of persistent topics, we conclude that they can
be divided into two main categories. The first
comprises the truly persistent topics that are regularly
discussed during the event, while the second category
is made up of multiplexed topics that LDA failed to
split further. This is due to the fact that some
topics are conceptually different but share a similar set of
related terms. This obviously affects summarization
performance, as for each topic we select only the top
weighted message. Thus, if a topic contains more
than one concept, the summarization algorithm
selects only one concept and ignores the rest.</p>
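        <p>The choice of K discussed in this section rests on two quantities: held-out perplexity and the average Jaccard similarity between the top terms of all topic pairs. A minimal sketch of both is given below. It assumes that the per-document log-likelihoods log P(d) have already been produced by some LDA implementation, and that at least two non-empty topics are supplied; the function names are hypothetical.</p>
        <preformat>
```python
import math
from itertools import combinations


def corpus_perplexity(log_likelihoods, lengths):
    """exp(-(sum of log P(d)) / (sum of L_d)) over the held-out documents."""
    return math.exp(-sum(log_likelihoods) / sum(lengths))


def jaccard(a, b):
    """Jaccard coefficient of two term collections."""
    a, b = set(a), set(b)
    return len(a.intersection(b)) / len(a.union(b))


def average_topic_similarity(topics, top_n=10):
    """Mean Jaccard coefficient of the top_n terms over all pairs of topics.

    topics: list of term lists, each ordered by decreasing probability.
    """
    pairs = list(combinations(topics, 2))
    return sum(jaccard(s[:top_n], t[:top_n]) for s, t in pairs) / len(pairs)
```
        </preformat>
        <p>Evaluating both functions for each candidate K gives the two curves that are compared when selecting the number of topics.</p>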
        <p>Figure 3 depicts the timelines of the same two sets
of topics respectively. It becomes obvious that peaky
topics are highly localized, while persistent topics
are sustained for the whole duration of the event. To provide a
visual representation of the StreamGrid structure over
the whole duration of the event, we represent it as a
heat map (Figure 4). The coloured cells in the grid
represent the time intervals in which the
corresponding topics are active, and the colour of the cell gives
the significance of each active topic at this point. As
shown in Figure 4, StreamGrid appears to be sparse,
as only a few cells in it contain active topics.
However, one can also observe several topics (rows) that
exhibit consistent activity over the whole duration of
the event.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Summarization</title>
        <p>Baselines: To evaluate the summaries produced with
StreamGrid, we used five baseline methods. Given an
arbitrary time interval, we first get the set of messages
posted during this interval and then apply the
following baselines to produce a summary of constant
length L.</p>
        <p>Random Summarizer: From the set of tweets we
randomly choose a subset of L tweets.</p>
        <p>Popularity Summarizer: We select the L most
retweeted messages to form a summary. This
favours the tweets that have attracted the
attention of the audience. However, niche topics and
potentially interesting events that gathered less
attention tend to be missed.</p>
        <p>tf-idf Summarizer: We use the tf-idf weighting
scheme described in the previous section to get
the L highest weighted tweets.</p>
        <p>Cluster-based Summarizer: Instead of active
topics, we divide the tweets of the time interval into L
clusters using k-means clustering. For each cluster
produced this way, we pick the highest weighted
tweet using the tf-idf scheme.</p>
        <p>LexRank Summarizer: We create a graph where
nodes represent tweets and the weights of edges
between nodes represent their pairwise cosine
similarity. The total weight of a tweet is the sum of
the weights of the adjacent edges. The summary
consists of the L highest weighted tweets in the
graph.</p>
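        <p>The graph-based baseline, as described here, ranks tweets by the sum of their cosine similarities to all other tweets. A minimal sketch under that reading follows; the function names and the sparse dictionary representation of the tf-idf vectors are assumptions.</p>
        <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def degree_summary(vectors, length):
    """Return the ids of the `length` tweets with the largest weighted degree.

    vectors: {tweet id: {term: weight}}. The weighted degree of a tweet is the
    sum of its cosine similarities to every other tweet.
    """
    ids = list(vectors)
    degree = {i: sum(cosine(vectors[i], vectors[k]) for k in ids if k != i) for i in ids}
    return sorted(ids, key=lambda i: degree[i], reverse=True)[:length]
```
        </preformat>
        <p>Because tweets that resemble many others score highest, near-duplicates of the same popular content tend to be selected together, which is consistent with the redundancy reported for this baseline in the evaluation.</p>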
        <p>Finally, we compare the results of the StreamGrid
Summarizer to those of the baseline methods for five
time intervals that are connected with high activity
during the main event. We detect these intervals by
applying the peak detection algorithm of the previous
section to the timeline of the whole dataset. We rank
the detected bursts according to the rate of tweets and
use the top five of them. The details of these intervals
are provided in Table 1.</p>
        <p>Table 3 contains summaries consisting of five tweets
using StreamGrid and three of the baselines for the
time period around the Awards Ceremony of the Sundance
Film Festival. Unsurprisingly, this is the time period
with the highest peak during the event. During this
period, what may reasonably be considered
important pertains to the films that won awards. Such
messages are usually posted by authoritative users
and become highly retweeted. For this reason,
summaries based on the number of retweets cover the
winning films quite effectively. However, in other cases
choosing very popular tweets does not lead to
informative summaries. For example, in the third time
interval, the summary consists of tweets like “So freaking
cool. #sundance http://t.co/C7a8rSaw” and
“#Sundance day 4- leavin for Vegas now. Bye for now
http://t.co/C2aRZnEC”. These tweets were retweeted
a lot, but may be considered non-informative for
the event. On the other hand, StreamGrid-based
summaries for the Awards Ceremony contain tweets about
winning films, even though these messages are not
very popular. This is an indication that StreamGrid
may detect an important topic even in cases where it
does not attract attention from many users.</p>
        <p>Regarding the Cluster-based Summarizer, an interesting
feature is that avoidance of redundancy is inherent in
the method, as similar messages are clustered together
and only the highest weighted of them is selected for
the summary. However, the weakness of the method
is that not all clusters represent important aspects of
the event.</p>
        <p>Another indication of how topic modelling can
improve summarization is the fact that StreamGrid,
compared to the baselines, tends to include tweets
that mention films. The reason is
that most of the topics detected by LDA are about
films, so when the proposed summarization algorithm
selects a set of tweets from the pool of active moments,
this leads to the selection of film-associated tweets. We
expect that, for other types of events, it will naturally
generalize to other pertinent entities of interest that
occur frequently, thus leading to the creation of
topics. A noticeable disadvantage of baselines such as
tf-idf and LexRank is their considerable
redundancy. For example, in the case of LexRank, four out
of five tweets are related to the film 'Fruitvale'. This
indicates that redundancy minimization is a necessary
component of any summarization approach.</p>
        <p>Finally, to evaluate how well the proposed method
can create visual summaries, we apply it to the subset
of tweets with embedded pictures. These tweets, which
comprise about 10% of the dataset, create a
considerably sparser StreamGrid, as the bursty periods in this
subset are much fewer. An example of a multimedia
summary using StreamGrid for the Awards Ceremony
is shown in Figure 5. Comparing the
StreamGrid-based multimedia summaries with the ones produced
by the popular images (Figure 6), we observe that
StreamGrid does not perform noticeably better in this task.
This can be explained by the fact that tweets with
embedded media have very short and uninformative
text, which leads LDA to inferior performance
with respect to the creation of representative topics
and the assignment of messages to them. Regarding
the redundancy in multimedia summaries, we found
that using cosine similarity on the text accompanying images as
a metric of similarity between them is not
appropriate for minimizing redundancy. This can be seen in the
LexRank-based summary in Figure 7. To this end, a
combination of visual and textual features is foreseen
as a more suitable means for discarding similar images.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>In this work, we proposed a framework for the
summarization of micro-blogging messages during large-scale
events. The framework makes use of topic modelling
to detect the underlying aspects of an event in the set
of related messages. Then, for each topic it derives
its temporal representation by associating messages with
the discovered topics. Subsequently, a burst
detection algorithm is used to find the important intervals
for each topic. Finally, a greedy summarization
algorithm generates summaries for arbitrary time intervals
using the set of active topics for the same time
duration. The results of experiments on a Twitter dataset
around the Sundance Film Festival appear promising,
demonstrating the potential of topic modelling for the
multi-document summarization problem.</p>
      <p>For future work, we first plan to compare our
approach with competing summarization algorithms in a
more systematic way, over more events and with the
help of independent evaluators, with the goal of better
capturing the subjective quality aspects of
summarization. Taking into account the large number of topic
modelling techniques that have appeared in the literature over
recent years, we plan to investigate how the
underlying model affects the summarization process.
Furthermore, we intend to create a real-time version of
StreamGrid, which could be used to get summaries of
evolving and continuous streams of messages. To this
end, we plan to employ more advanced topic modelling
methods that can detect topic drift and unseen topics
in new incoming messages. Finally, we will
investigate methods to integrate popularity and user
authority into the summarization process.</p>
      <p>Acknowledgements: This work is supported by
the SocialSensor FP7 project, partially funded by the
EC under contract number 287975.</p>
      <p>[Table 3: five-tweet summaries for the Awards Ceremony period, one row per method (tf-idf, LexRank, Popularity, StreamGrid); table contents not recovered.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Celebrating #SB48 on Twitter</article-title>
          . https://blog.twitter.com/2014/celebrating-sb48-on-twitter,
          <year>2014</year>
          . [Online; accessed 27-Feb-2014].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Alonso</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Shiells</surname>
          </string-name>
          .
          <article-title>Timelines as summaries of popular scheduled events</article-title>
          .
          <source>In Proceedings of the 22nd international conference on World Wide Web companion</source>
          , pages
          <fpage>1037</fpage>
          –
          <lpage>1044</lpage>
          . International World Wide Web Conferences Steering Committee,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          –
          <lpage>1022</lpage>
          , Mar
          .
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Punera</surname>
          </string-name>
          .
          <article-title>Event summarization using tweets</article-title>
          .
          <source>In ICWSM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dörk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gruen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Williamson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Carpendale</surname>
          </string-name>
          .
          <article-title>A visual backchannel for large-scale events</article-title>
          .
          <source>Visualization and Computer Graphics, IEEE Transactions on</source>
          ,
          <volume>16</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1129</fpage>
          –
          <lpage>1138</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Erkan</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Radev</surname>
          </string-name>
          .
          <article-title>LexRank: Graph-based lexical centrality as salience in text summarization</article-title>
          .
          <source>J. Artif. Int. Res.</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>457</fpage>
          –
          <lpage>479</lpage>
          , Dec
          .
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Generating event storylines from microblogs</article-title>
          .
          <source>In Proceedings of the 21st ACM International Conference on Information and Knowledge Management</source>
          ,
          <source>CIKM '12</source>
          , pages
          <fpage>175</fpage>
          –
          <lpage>184</lpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Weng</surname>
          </string-name>
          .
          <article-title>Why is "sxsw" trending? exploring multiple text sources for twitter topic summarization</article-title>
          .
          <source>In Proceedings of the ACL Workshop on Language in Social Media (LSM)</source>
          , pages
          <fpage>66</fpage>
          –
          <lpage>75</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Badar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Karger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Twitinfo: Aggregating and visualizing microblogs for event exploration</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11</source>
          , pages
          <fpage>227</fpage>
          –
          <lpage>236</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mehrotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Buntine</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          .
          <article-title>Improving LDA topic models for microblogs via tweet pooling and automatic labeling</article-title>
          .
          <source>In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13</source>
          , pages
          <fpage>889</fpage>
          –
          <lpage>892</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nichols</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mahmud</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Drews</surname>
          </string-name>
          .
          <article-title>Summarizing sporting events using twitter</article-title>
          .
          <source>In Proceedings of the 2012 ACM International Conference on Intelligent User Interfaces</source>
          ,
          <source>IUI '12</source>
          , pages
          <fpage>189</fpage>
          –
          <lpage>198</lpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stys</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Tam</surname>
          </string-name>
          .
          <article-title>Centroid-based summarization of multiple documents</article-title>
          .
          <source>Inf. Process. Manage.</source>
          ,
          <volume>40</volume>
          (
          <issue>6</issue>
          ):
          <fpage>919</fpage>
          –
          <lpage>938</lpage>
          , Nov
          .
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Schinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Diplaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Herzig</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Boudakidis</surname>
          </string-name>
          .
          <article-title>Eventsense: Capturing the pulse of large-scale events by mining social media streams</article-title>
          .
          <source>In Proceedings of the 17th Panhellenic Conference on Informatics, PCI '13</source>
          , pages
          <fpage>17</fpage>
          –
          <lpage>24</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Churchill</surname>
          </string-name>
          .
          <article-title>Peaks and persistence: Modeling the shape of microblog conversations</article-title>
          .
          <source>In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, CSCW '11</source>
          , pages
          <fpage>355</fpage>
          –
          <lpage>358</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sharifi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Hutton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalita</surname>
          </string-name>
          .
          <article-title>Summarizing microblogs automatically</article-title>
          .
          <source>In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10</source>
          , pages
          <fpage>685</fpage>
          –
          <lpage>688</lpage>
          , Stroudsburg, PA, USA,
          <year>2010</year>
          . Association for Computational Linguistics
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>A participant-based approach for event summarization using twitter streams</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          , pages
          <fpage>1152</fpage>
          –
          <lpage>1162</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Sumblr: Continuous summarization of evolving tweet streams</article-title>
          .
          <source>In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13</source>
          , pages
          <fpage>533</fpage>
          –
          <lpage>542</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          .
          <article-title>Evaluation methods for topic models</article-title>
          . In L. Bottou and M. Littman, editors,
          <source>Proceedings of the 26th International Conference on Machine Learning (ICML)</source>
          , pages
          <fpage>1105</fpage>
          –
          <lpage>1112</lpage>
          , Montreal, June 2009. Omnipress.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ogihara</surname>
          </string-name>
          .
          <article-title>Generating pictorial storylines via minimum-weight connected dominating set approximation in multiview graphs</article-title>
          .
          <source>In AAAI'12</source>
          , pages –1–1
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>