StreamGrid: Summarization of Large Scale Events using Topic Modelling and Temporal Analysis

Emmanouil Schinas, Symeon Papadopoulos, Yiannis Kompatsiaris, Pericles A. Mitkas (mitkas@eng.auth.gr)

Dept. of Electrical & Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece

Information Technologies Institute, Centre for Research & Technology Hellas, Thessaloniki, Greece

01-04-2014 Glasgow Scotland


Due to the increasing popularity of microblogging platforms, the number of messages related to large-scale public events reaches impressive levels. Although such messages can be quite informative regarding different aspects of the main event, there is a lot of spam and redundancy, which makes it challenging to extract insights regarding the event of interest. In this work we describe a summarization framework that captures the important moments of an event by using a combination of topic modelling and bursty activity detection. We propose a data structure named StreamGrid that maintains the information of active topics in regular time intervals at several scales. This structure is used for the creation of concise summaries for any time interval. Finally, the evaluation on a large Twitter dataset around the Sundance Film Festival demonstrates the potential of the proposed framework.

Introduction

Due to their increasing popularity, micro-blogging platforms, and especially Twitter, have evolved into a powerful means for getting connected with real-world events. In large-scale public events, ranging from sport events, such as football matches, to political events and festivals, the users that are somehow involved in the event use social media to share their experiences and express their opinions. In many cases, these messages are quite informative, provide real-time coverage of the ongoing event, and may be correlated with important variables related to the event, e.g. film ratings [13]. Thus, not surprisingly, the number of event-related messages has reached impressive levels [1].

However, a significant percentage of micro-blogging messages can be considered as non-informative or spam. This fact combined with the huge number of messages, makes it very challenging for interested stakeholders, such as event organizers and enthusiasts, to monitor the evolution of the event and understand its important moments. In case of long-running events, this becomes even more difficult due to the existence of numerous sub-events occurring within the main event. Such sub-events have different durations and impact on the main event. In addition, a large portion of the messages contain conversations about other entities of interest associated with the event. In other words, an event-related stream of messages is quite diverse and noisy, with different associated topics, conversations among users, and spam messages. Thus, there is a profound need for event-based summarization methods that can produce concise multi-document summaries for any time interval of the event, covering its main aspects.

The framework we propose in this work aims to create topic-based summaries of large-scale events for arbitrary time durations by applying post-analysis on the stream of event related messages. First, we apply LDA topic modelling to discover the underlying aspects of the event. To support summarization, we create a 2D-array structure named StreamGrid. This maintains the information of each topic at each time interval. To create the grid we assign messages to the detected topics and divide topic-associated messages using regular time intervals. Next, we create timelines for the set of topics and use them to detect the set of active topics at each time interval by finding the bursty activity periods in them. A greedy algorithm is used to obtain a set of representative messages that maximize the coverage of the event by selecting the maximum possible number of active topics and minimize redundancy across messages at the same time.

Finally, to demonstrate the potential of the proposed framework, we perform an experimental evaluation on a real-world dataset consisting of tweets around the Sundance Film Festival 2013.

The paper is organized as follows. Section 2 contains a brief survey of related methods and applications. Section 3 describes in detail the proposed framework. Section 4 presents an experimental case study on the Sundance 2013 dataset. We conclude the paper and describe future work in Section 5.

Related Work

A substantial body of work exists in literature on the problem of micro-blogging summarization. A notable method for multi-document summarization relies on the computation of centroids based on content. Namely, the summary of a set of documents, represented as tf • idf vectors, consists of those documents that are closest to the centroid of the set [12]. Sharifi et al. [15] propose a method for the generation of a single sentence from a set of tweets, by using a graph-based technique. Nichols et al. [11] describe an algorithm that generates a summary of sports events. They use a peak detection algorithm to detect important moments and then apply the method of [15] to extract summary sentences from the tweets around these moments. The work of [8] uses linear-programming optimization to select summary sentences from tweets related to trending topics. Notably, they also make use of linked Web content to extend the original sources of information.

Shen et al. [16] present a participant-based approach for event summarization. A mixture model is proposed to detect sub-events at participant level, and the tf • idf centroid approach is used to create a summary of each sub-event. Similarly, Chakrabarti and Punera [4] propose the use of a Hidden Markov Model to obtain a time-based segmentation of the stream that captures the underlying sub-events. Alonso and Shiells [2] create timelines for football games, annotated with the key aspects of the event. Dork et al. [5] propose an interface for large scale events that employs several visualizations for interactive presentation of the event.

A different problem is tackled by Wang et al. [19]. Unlike other methods, their method aims to create a storyline from a set of event-related objects. A multiview graph of objects is constructed, where two types of edges capture the contextual similarity and the temporal proximity among objects. Then a time-ordered sequence of important objects is obtained via graph optimization. Lin et al. [7] extend the previous work to generate storylines from a set of micro-blog messages for arbitrary queries. To achieve this, they use query expansion techniques to retrieve the query-related messages and then apply the same method as [19] to create the storyline.

Another approach for summarizing evolving tweet streams is proposed by the Sumblr framework [17]. This relies on an online clustering algorithm for tweets and on maintaining distilled statistics of the clusters at specific time snapshots using a structure, named Pyramidal Time Frame. Then, a summarization technique is employed for generating summaries of arbitrary time durations based on the LexRank method [6].

Proposed Method

An overview of the proposed method is illustrated in Figure 1. The proposed framework processes a stream of online messages around an event and extracts informative summaries for any requested time duration. In other words, the proposed framework identifies a set of topics and then selects related messages based on their importance.

Topic Modelling

Figure 1: The StreamGrid framework

Topic modelling is based on the assumption that each document can be described as a random mixture of topics and each topic as a multinomial distribution over terms. In our approach we employ topic modelling by applying the well-known Latent Dirichlet Allocation (LDA) model [3] across the whole stream of messages. This process is applied after the end of the event, when all the messages are available. However, topic modelling in micro-blog messages is problematic due to the short length of their text. Many approaches have been proposed to overcome this. To avoid changes to standard LDA, a relatively simple solution is message pooling, in which messages are pooled together to form larger documents. We experimented with four methods of message pooling, in a similar way to [10]. First, we merged messages using constant-length time bins. Second, we merged messages of the same author to form a single document. As a third option, we pooled messages together based on their hashtags. Messages with multiple hashtags were assigned to multiple documents, and messages without any hashtag were assigned to the document with the highest textual similarity. As a fourth option, we used a 1NN clustering algorithm to cluster messages with high textual similarity. Each of those clusters formed a single document for the LDA method. In addition, for all of the pooling methods we filtered out messages having only one term and removed standard stopwords to discard non-informative terms.
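The hashtag-based pooling scheme can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the use of cosine similarity over raw term counts as the "textual similarity", and the function names `pool_by_hashtag`, `tokens` and `cosine` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")
TOKEN = re.compile(r"[#@]?\w+")


def tokens(text):
    """Lower-cased word, hashtag and mention tokens (assumed tokenizer)."""
    return [t.lower() for t in TOKEN.findall(text)]


def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def pool_by_hashtag(messages):
    """Group messages into one pooled document per hashtag."""
    pools = defaultdict(list)
    untagged = []
    for msg in messages:
        tags = {h.lower() for h in HASHTAG.findall(msg)}
        if tags:
            # A message with several hashtags joins several pooled documents.
            for tag in tags:
                pools[tag].append(msg)
        else:
            untagged.append(msg)
    # Term profiles of the pools, built from the tagged messages only.
    profiles = {
        tag: Counter(t for m in msgs for t in tokens(m))
        for tag, msgs in pools.items()
    }
    for msg in untagged:
        if not profiles:
            break  # nothing to attach to; untagged messages stay unpooled
        vec = Counter(tokens(msg))
        best = max(profiles, key=lambda tag: cosine(vec, profiles[tag]))
        pools[best].append(msg)
    return dict(pools)
```

Each value of the returned dictionary would then be concatenated into a single document before topic model estimation.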

Another drawback of LDA is that the number of topics must be defined in advance; the number of topics is not known a priori in the context of large events. To determine the optimal number of topics for a given set of documents $D$, we calculate two metrics, perplexity and average similarity across topics, for different numbers of topics and choose a value that minimizes both metrics. For the calculation of perplexity we split $D$ into training and test documents, estimate LDA over a range of possible numbers of topics using $D_{train}$, and calculate the total perplexity of the documents in the test dataset $D_{test}$ [18]. The perplexity of a document $d$ given a trained model is defined as follows:

$$\mathrm{perplexity}(d) = \exp\left(-\frac{\log P(d \mid \theta, \phi, G)}{L_d}\right) \tag{1}$$

where $L_d$ is the number of terms in document $d$, $\theta$ is the document-specific topic distribution, $\phi$ is the word distribution for topics, and $G$ is the set of topics in the trained model. The total perplexity over dataset $D_{test}$ is defined as

$$\mathrm{perplexity}(D_{test}) = \exp\left(-\frac{\sum_{d \in D_{test}} \log P(d \mid \theta, \phi, G)}{\sum_{d \in D_{test}} L_d}\right) \tag{2}$$

For the similarity between two topics, we calculate the Jaccard coefficient on the sets of top N terms of each topic.
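As a rough illustration of the two model-selection metrics, the sketch below computes perplexity from per-document log-likelihoods and the average pairwise Jaccard similarity over the top-N term sets of the topics. It assumes the log-likelihoods $\log P(d \mid \theta, \phi, G)$ have already been obtained from a trained LDA model; the function names are hypothetical.

```python
import math
from itertools import combinations


def document_perplexity(log_likelihood, length):
    """Perplexity of one document: exp(-log P(d) / L_d)."""
    return math.exp(-log_likelihood / length)


def corpus_perplexity(log_likelihoods, lengths):
    """Total perplexity of a test set: exp(-sum log P(d) / sum L_d)."""
    return math.exp(-sum(log_likelihoods) / sum(lengths))


def jaccard(a, b):
    """Jaccard coefficient between two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_topic_similarity(topics_top_terms):
    """Mean Jaccard coefficient over all pairs of topics."""
    pairs = list(combinations(topics_top_terms, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Both functions would be evaluated once per candidate number of topics, so that the two curves can be inspected side by side.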

StreamGrid Creation

After the detection of topics we have to associate messages with topics. We use the LDA model, estimated from the merged documents, to infer the probabilities of each message over the set of topics. We assign each message to the topic with the highest probability, under the condition that this probability exceeds a predefined threshold. Although thresholding in this step leaves some messages unassigned, this is a desirable feature of the procedure, as most of the unassigned messages are of low quality. In other words, these messages can be considered as spam messages that cannot contribute any valuable information to the summary. Next, the assignments are used for the creation of a data structure named StreamGrid. The first dimension of this grid comprises the detected topics and the second corresponds to time, divided into regular time intervals. Each cell $c(i, j)$ of StreamGrid contains the set of messages $M_{ij}$ associated with topic $i$ at time interval $j$. Each message $m$ is represented as a tf • idf vector. The idf components are pre-computed over the whole set of messages. The tf part is the frequency of a term in the message normalized by the maximum frequency. Due to the short length of the documents in micro-blogging platforms, this component often equals one. Using the set of associated messages in each cell, we calculate a merged tf • idf vector $v_{ij}$. In addition, we calculate a weight for each message and rank the messages according to it. The weight of a message $m$, associated with topic $i$, in a specific time window $j$ is defined as the sum of the weights of the terms contained in $m$. To calculate the weight of each term $t$, we use the following tf • idf scheme:

$$W(t, i, j) = tf_{ij}(t) \cdot idf(t) \tag{3}$$

$$W(m, j) = \sum_{t \in m} W(t, i, j) \tag{4}$$

where $tf_{ij}(t)$ is the frequency of term $t \in v_{ij}$ in the cell $c(i, j)$ of StreamGrid, $idf(t)$ is the inverse document frequency over the whole corpus, $W(t, i, j)$ is the weight of term $t$ in $c(i, j)$, and $W(m, j)$ is the weight of message $m$ in time interval $j$.
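The grid construction and the weighting scheme of Equations 3 and 4 can be sketched as follows. This is an illustrative sketch only: messages are assumed to be already tokenized and assigned to topics, the idf is assumed to be $\log(N / df)$ (the exact variant is not stated in the text), and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_streamgrid(assignments, interval_seconds):
    """Map (topic, interval index) to the list of messages in that cell.

    `assignments` is an iterable of (topic_id, timestamp_seconds, terms).
    """
    grid = defaultdict(list)
    for topic, timestamp, terms in assignments:
        grid[(topic, int(timestamp // interval_seconds))].append(terms)
    return grid


def idf_table(all_messages):
    """Inverse document frequency over the whole corpus (assumed log(N/df))."""
    n = len(all_messages)
    df = Counter(t for terms in all_messages for t in set(terms))
    return {t: math.log(n / count) for t, count in df.items()}


def term_weights(cell_messages, idf):
    """Equation 3: frequency of each term in the cell times its idf."""
    tf = Counter(t for terms in cell_messages for t in terms)
    return {t: freq * idf.get(t, 0.0) for t, freq in tf.items()}


def message_weight(terms, weights):
    """Equation 4: sum of the cell weights of the distinct terms of a message."""
    return sum(weights.get(t, 0.0) for t in set(terms))
```

Ranking the messages of a cell then amounts to sorting them by `message_weight` in descending order.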

To detect the time intervals during which a specific topic $i$ of StreamGrid is active, we create a topic timeline by using time intervals as bins and counting the associated messages of topic $i$ in bin $j$. Then, we apply the peak detection algorithm used in [9] to detect time frames in the timeline that exhibit bursty behaviour. The algorithm identifies windows with high activity by finding significant increases in the timeline, compared to the historical mean value of activity. The time windows reported by the algorithm are used to set the active topics of each time interval. For example, if for a specific topic $i$ the algorithm identifies a time window $[a, b]$ with high activity, then we define all the time intervals $a \le j \le b$ as active moments of topic $i$. After this step, each cell of StreamGrid has a flag that indicates whether the cell is active or not. We use this flag to select a summary subset of messages, as described in the next section. Also, for each active topic $i$ in a specific time interval $j$, we calculate a score that captures its significance relative to the rest of the active topics $A$ in the same time interval:

$$Significance(topic_i, j) = \frac{|M_{ij}|}{\sum_{topic_k \in A} |M_{kj}|} \tag{5}$$
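The activity flags and the significance score of Equation 5 can be sketched in Python as below. The burst detector is a simplified stand-in that compares each bin against an exponentially weighted historical mean and mean deviation; it is not claimed to reproduce the algorithm of [9]. The parameters `tau`, `alpha` and `warmup`, the deviation floor of 1.0, and the choice to freeze the historical statistics during a burst are all assumptions.

```python
def detect_bursts(counts, tau=2.0, alpha=0.125, warmup=3):
    """Return inclusive (start, end) index windows of unusually high activity."""
    if len(counts) <= warmup:
        return []
    mean = sum(counts[:warmup]) / warmup
    dev = sum(abs(c - mean) for c in counts[:warmup]) / warmup
    windows, start = [], None
    for j in range(warmup, len(counts)):
        c = counts[j]
        # A bin is bursty when it exceeds the historical mean by tau deviations.
        is_high = c > mean + tau * max(dev, 1.0)
        if is_high and start is None:
            start = j
        elif not is_high and start is not None:
            windows.append((start, j - 1))
            start = None
        if not is_high:
            # Update the historical statistics with non-bursty bins only.
            dev = (1 - alpha) * dev + alpha * abs(c - mean)
            mean = (1 - alpha) * mean + alpha * c
    if start is not None:
        windows.append((start, len(counts) - 1))
    return windows


def active_cells(timelines, **params):
    """Set of (topic, interval) cells that fall inside a detected burst."""
    active = set()
    for topic, counts in timelines.items():
        for start, end in detect_bursts(counts, **params):
            active.update((topic, j) for j in range(start, end + 1))
    return active


def significance(timelines, active, j):
    """Equation 5: share of each active topic among active topics at interval j."""
    topics = [i for i in timelines if (i, j) in active]
    total = sum(timelines[i][j] for i in topics)
    return {i: timelines[i][j] / total for i in topics} if total else {}
```

Here `timelines` maps each topic to its list of per-interval message counts $|M_{ij}|$.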

In addition, to have an overall estimation of the importance of each topic throughout the event, we calculate two measures for each topic using a similar approach as [14]. More specifically we define the peakiness of a topic as:

$$peakiness(topic_i) = \frac{\max_j |M_{ij}|}{\sum_{\forall j} |M_{ij}|} \tag{6}$$

and its persistence as

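$$persistence(topic_i) = \frac{avg_{j > t_{peak}} \left( |M_{ij}| \, / \, \sum_{\forall j'} |M_{ij'}| \right)}{avg_{j < t_{peak}} \left( |M_{ij}| \, / \, \sum_{\forall j'} |M_{ij'}| \right)} \tag{7}$$

where $t_{peak}$ is the time at which the maximum peak of the timeline occurs.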
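A short sketch of the two topic-level measures follows. It assumes that peakiness is the share of a topic's messages that fall in its highest bin and that persistence is the ratio of the average normalized activity after the peak to the average normalized activity before it; this reading of the two formulas is an interpretation, and the handling of topics that peak in the first or last bin (returning `None`) is an added assumption.

```python
def peakiness(counts):
    """Fraction of all messages of a topic that fall in its highest bin."""
    total = sum(counts)
    return max(counts) / total if total else 0.0


def persistence(counts):
    """Average normalized activity after the peak divided by that before it."""
    total = sum(counts)
    if not total:
        return None
    t_peak = counts.index(max(counts))
    before, after = counts[:t_peak], counts[t_peak + 1:]
    if not before or not after or sum(before) == 0:
        return None  # undefined when there is no activity on one side
    avg_after = sum(c / total for c in after) / len(after)
    avg_before = sum(c / total for c in before) / len(before)
    return avg_after / avg_before
```

For a timeline such as `[1, 1, 10, 4, 4]`, half of the messages fall in the peak bin and activity after the peak is four times higher than before it.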

Topic-Time Summarization

Our goal is to use the StreamGrid to summarize the event for an arbitrary time frame. By summary we denote a set of representative messages that mention the key aspects of the selected time period. Assuming that topics can capture these aspects, we use the active topics for that period to create a summary that meets the following criteria: a) as many aspects as possible are covered and b) redundancy due to near-duplicate messages is minimized. To achieve this, we use an adapted version of the greedy algorithm used in [17]. The algorithm selects messages that are associated with different topics and that simultaneously have a low degree of textual similarity to each other. The selection process is detailed in Algorithm 1. For an arbitrary time frame $F = [a, b]$, we first find the sequence of time intervals in StreamGrid that covers $F$. Then we get the set of active topics. A topic $i$ is active in $F$ if any cell $c(i, j)$ contained in $F$ is active. Also, the significance score of an active topic in $F$ is defined as the maximum significance score across all time intervals in $F$. The weight $W(t, i, F)$ of a term $t$ for topic $i$ in $F$ is defined as the sum of the weights in each cell $c(i, j) \in F$. In a similar way, we define the weight $W(m, F)$ of message $m$ over $F$. Note that although a message belongs to a specific time interval, we use the term weights across the whole time frame to calculate the weight of $m$.
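The greedy selection can be sketched in Python as follows. This is a simplified illustration rather than the authors' code: each message is assumed to be a dictionary with hypothetical keys `text`, `topic`, `terms`, `weight` (standing for $W(m, F)$) and `significance` (the significance of its topic in $F$), the score is assumed to trade off importance against redundancy with a parameter `a`, and redundancy is taken as the average cosine similarity to the messages already selected.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def _greedy_fill(summary, pool, length, a):
    """Move the best-scoring message from pool to summary until full."""
    pool = list(pool)
    while len(summary) < length and pool:
        chosen = [Counter(m["terms"]) for m in summary]

        def score(m):
            importance = m["significance"] * m["weight"]
            vec = Counter(m["terms"])
            redundancy = (
                sum(cosine(vec, s) for s in chosen) / len(chosen) if chosen else 0.0
            )
            return a * importance - (1 - a) * redundancy

        best = max(pool, key=score)
        summary.append(best)
        pool.remove(best)


def summarize(messages, length, a=0.5):
    """Two-phase greedy summary over the messages of the active topics."""
    # Phase 1: consider only the highest-weighted message of each topic.
    top = {}
    for m in messages:
        if m["topic"] not in top or m["weight"] > top[m["topic"]]["weight"]:
            top[m["topic"]] = m
    summary = []
    _greedy_fill(summary, top.values(), length, a)
    # Phase 2: if the summary is still short, draw from all remaining messages.
    if len(summary) < length:
        rest = [m for m in messages if m not in summary]
        _greedy_fill(summary, rest, length, a)
    return summary
```

Because message weights are unbounded while cosine similarity lies in [0, 1], scaling the weights to a comparable range before scoring is probably advisable; the sketch leaves that choice to the caller.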

Algorithm 1 Topic-Time summarization
Input: StreamGrid, a time frame F, length of summary L
Output: a summary set S

1: S = ∅
2: A = {set of active topics in F}
3: M_c = {argmax_m W(m, i, F), ∀i ∈ A}
4: while |S| < L and M_c ≠ ∅ do
5:     for each message m in M_c do
6:         calculate score(m) according to Equation 8
7:     end for

To produce a summary S of length L, the algorithm first gets the set of active topics as described above. Then, it collects the messages M_c with the highest weight W(m, F) in each active topic (line 3). In lines 4-11, the algorithm, following a greedy approach, selects the messages that maximize the score of Equation 8. This score consists of two parts weighted by a parameter a. The first part measures the importance of the message, while the second measures its redundancy compared to the set of already selected messages. The importance of a message m ∈ topic i is a combination of two factors: a) the significance of the topic it belongs to in this time frame, and b) the contribution of its textual content. To measure the redundancy of a message, we compute its average cosine similarity to the already selected messages. If the summary length is not reached, we perform the same selection process on the set of tweets that belong to the active topics (lines 12-23).

Experiments

Dataset and event description

We conducted an evaluation of the proposed method on a dataset around the Sundance 2013 Film Festival, which took place between January 15th and 30th, 2013. We used the Streaming API of Twitter to acquire tweets containing terms related to Sundance and posted during the event. More precisely, we collected all tweets containing the hashtags #sundance, #sundance2013 and #sundancefest, and all the tweets that mentioned the official account of the Sundance Film Festival (@sundancefest). This resulted in a dataset of 201,752 tweets. Among them, 100,046 were original tweets, while the rest were retweets. Although using three hashtags and one mentioned account covers only a subset of all possible tweets about the event, we consider this subset sufficiently representative, as the vast majority of Twitter users tend to adopt the official hashtags provided by organizers during events.

Topic detection

Figure 2 shows the perplexity and average similarity for different numbers of topics K. Although there is significant variance for the different values of K, the main trend for perplexity is to decrease as K increases.

As we can see from Figure 2, the average similarity between all pairs of topics appears to stabilize for values of K larger than 100 topics. However, having a large number of topics creates topics with very few associated messages. We found that for K > 200 there is a substantial proportion of topics that have no associated message. Taking into account these facts, we set K = 200 for the rest of the evaluation. Regarding the pooling scheme, merging tweets having the same hashtags into single documents gave us the best performance with respect to perplexity and average topic similarity.

Figure 2: Perplexity and Average Similarity between topics for different number of topics K

StreamGrid Construction

The first part of Table 2 contains the top five topics with respect to peakiness and the second part contains the topics with the highest persistence ratio. Examining the set of persistent topics, we conclude that they can be divided into two main categories. The first comprises the truly persistent topics that are regularly discussed during the event, while the second category is made up of multiplexed topics that LDA failed to split further. This is due to the fact that some topics are conceptually different but share a similar set of related terms. This affects summarization performance, as for each topic we select only the top weighted message. Thus, if a topic contains more than one concept, the summarization algorithm selects only one concept and ignores the rest. Figure 3 depicts the timelines of the same two sets of topics respectively. It becomes obvious that peaky topics are highly localized, while persistent topics persist for the whole duration of the event. To provide a visual representation of the StreamGrid structure over the whole duration of the event, we represent it as a heat map (Figure 4). The coloured cells in the grid represent the time intervals in which the corresponding topics are active, and the colour of the cell gives the significance of each active topic at this point. As shown in Figure 4, StreamGrid appears to be sparse, as only a few cells in it contain active topics. However, one can also observe several topics (rows) that exhibit consistent activity over the whole duration of the event.

Figure 3: Timelines of the top five peaky and persistent topics

Summarization

Baselines: To evaluate the summaries produced with StreamGrid, we used five baseline methods. Given an arbitrary time interval, we first get the set of messages posted during this interval and then we apply the following baselines to produce a summary of constant length L.

• Random Summarizer: For the set of tweets we choose randomly a subset of L tweets.

• Popularity Summarizer: We select the L most retweeted messages to form a summary. This favours the tweets that have attracted the attention of the audience. However, niche topics and potentially interesting events that gathered less attention tend to be missed.

• tf • idf Summarizer: We use the tf • idf weighting scheme described in the previous section to get the L highest weighted tweets.

• Cluster-based Summarizer: Instead of active topics, we divide the tweets of the time interval into L clusters using k-means clustering. For each cluster produced this way, we pick the highest weighted tweet using the tf • idf scheme.

• LexRank Summarizer: We create a graph where nodes represent tweets and the weights of edges between nodes represent their pairwise cosine similarity. The total weight of a tweet is the sum of the weights of the adjacent edges. The summary consists of the L highest weighted tweets in the graph.

Finally, we compare the results of the StreamGrid Summarizer to those of the baseline methods for five time intervals that are connected with high activity during the main event. We detect these intervals by applying the peak detection algorithm of the previous section to the timeline of the whole dataset. We rank the detected bursts according to the rate of tweets and use the top five of them. The details of these intervals are provided in Table 1.

Figure 5: StreamGrid-based Multimedia Summary during awards ceremony (4th row in Table 1)

Figure 6: Multimedia Summary using most retweeted images during awards ceremony (4th row in Table 1)

Figure 7: Multimedia Summary using LexRank during awards ceremony (4th row in Table 1)
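The graph-based baseline listed above, in which the weight of a tweet is the sum of its cosine similarities to all other tweets, can be sketched as follows. The whitespace tokenizer, the use of raw term counts instead of tf • idf vectors, and the function name `degree_centrality_summary` are simplifying assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def degree_centrality_summary(tweets, length):
    """Return the `length` tweets with the highest summed edge weight."""
    vectors = [Counter(t.lower().split()) for t in tweets]
    scores = [
        sum(cosine(v, w) for k, w in enumerate(vectors) if k != i)
        for i, v in enumerate(vectors)
    ]
    ranked = sorted(range(len(tweets)), key=lambda i: scores[i], reverse=True)
    return [tweets[i] for i in ranked[:length]]
```

Since the ranking rewards tweets that resemble many others, the top results tend to be near-duplicates of each other, which makes the need for an explicit redundancy term easy to see.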

Table 3 contains summaries consisting of five tweets using StreamGrid and three of the baselines for the time period around the Awards Ceremony of Sundance Film Festival. Unsurprisingly, this is the time period with the highest peak during the event. During this period what may be reasonably considered as important pertains to the films that won awards. Such messages are usually posted by authoritative users and become highly retweeted. For this reason, summaries based on the number of retweets cover quite effectively the winning films. However, in other cases choosing very popular tweets does not lead to informative summaries. For example in the third time interval, the summary consists of tweets like "So freaking cool. #sundance http://t.co/C7a8rSaw" and "#Sundance day 4-leavin for Vegas now. Bye for now http://t.co/C2aRZnEC". These tweets were retweeted a lot, but may be considered as non-informative for the event. On the other hand, StreamGrid-based summaries for the Awards Ceremony contain tweets about winning films, even though these messages are not very popular. That is an indication that StreamGrid may detect an important topic even in cases that this does not attract attention from many users. Regarding the Cluster-based Summarization, an interesting feature is that avoidance of redundancy is inherent in the method, as similar messages are clustered together, and only the most weighted of them are selected for the summary. However, the weakness of the method is that not all clusters represent important aspects of the event.

Another indication of how topic modelling can improve summarization is the fact that StreamGrid, compared to the other baselines, tends to include tweets that mention films. The reason that this happens is that most of the topics detected by LDA are about films, so when the proposed summarization algorithm selects a set of tweets from the pool of active moments, this leads to the selection of film-associated tweets. We expect that, for other types of events, it will naturally generalize to other pertinent entities of interest that occur frequently, thus leading to the creation of topics. A noticeable disadvantage of baselines such as tf • idf and LexRank is the remarkable existence of redundancy. For example in case of LexRank four out of five tweets are related to the 'Fruitvale' film. This indicates that redundancy minimization is a necessary component of any summarization approach.

Finally, to evaluate how well the proposed method can create visual summaries, we apply it to the subset of tweets with embedded pictures. These tweets, which comprise about 10% of the dataset, create a considerably sparser StreamGrid, as the bursty periods in this subset are much fewer. An example of a multimedia summary using StreamGrid for the Awards Ceremony is shown in Figure 5. Comparing the StreamGrid-based multimedia summaries with the ones produced by the popular images (Figure 6), we observe that StreamGrid does not perform noticeably better in this task. This can be explained by the fact that tweets with embedded media have text of very low length and informativeness, which leads LDA to inferior performance with respect to the creation of representative topics and the assignment of messages to them. Regarding the redundancy in multimedia summaries, we found that using cosine similarity on the text of images as a metric of similarity between them is not appropriate to minimize redundancy. This can be seen in the LexRank-based summary in Figure 7. To this end, a combination of visual and textual features is foreseen as a more suitable means for discarding similar images.

Conclusion and future work

In this work, we proposed a framework for the summarization of micro-blogging messages during large-scale events. The framework makes use of topic modelling to detect the underlying aspects of an event in the set of related messages. Then, for each topic it derives its temporal representation by associating messages with the discovered topics. Subsequently, a burst detection algorithm is used to find the important intervals for each topic. Finally, a greedy summarization algorithm generates summaries for arbitrary time intervals using the set of active topics for the same time duration. The results of experiments on a Twitter dataset around the Sundance Film Festival appear promising, demonstrating the potential of topic modelling for the multi-document summarization problem.

For future work, we first plan to compare our approach with competing summarization algorithms in a more systematic way, over more events and with the help of independent evaluators, with the goal of better capturing the subjective quality aspects of summarization. Taking into account the large number of topic modelling techniques that have appeared in the literature in recent years, we plan to investigate how the underlying model affects the summarization process. Furthermore, we intend to create a real-time version of StreamGrid, which could be used to get summaries of evolving and continuous streams of messages. To this end, we plan to employ more advanced topic modelling methods that can detect topic drift and unseen topics in new incoming messages. Finally, we will investigate methods to integrate popularity and user authority into the summarization process.

Algorithm 1 Topic-Time summarization (continued)

9: S = S ∪ {m_max}
10: M_c = M_c − {m_max}
11: end while
12: if |S| < L then
13:     M = ∪ M_ij, ∀i ∈ A, j ∈ F
14:     M = M − S
15:     while |S| < L do
16:         for each message m in M do
17:             calculate score(m) according to Equation 8

$$score(m) = a \cdot Importance(m) - (1 - a) \cdot Redundancy(m, S) \tag{8}$$

$$Importance(m) = Significance(i, F) \cdot W(m, F) \tag{9}$$

$$Redundancy(m, S) = avg_{m' \in S} \, Similarity(m, m') \tag{10}$$

Figure 4: StreamGrid: Each cell of StreamGrid corresponds to a specific time interval and topic

Table 1: Details of five time intervals with the highest activity during Sundance Film Festival 2013

Start               End                 #Tweets
Thu Jan 17 23:00    Fri Jan 18 00:00    1545
Sat Jan 19 19:00    Sat Jan 19 20:00    1477
Mon Jan 21 19:00    Mon Jan 21 20:00    1247
Sun Jan 27 03:00    Sun Jan 27 08:00    3735
Wed Jan 23 18:00    Wed Jan 23 21:00    1910

Table 2: Examples of peaky and persistent topics

Peaky Topics (columns: Topic, Representative Terms, Peakiness, #tweets)
135 paris, hilton, Blackfish, cnn, films 0.358695133 death, drink, countryman, sundance, charlie 0.24758811 lovelace, amanda, seyfried, portraits, premiere 0.161129350 defeat, inevitable, pete, mister, film 0.14326729 butch, dynamite, android, worth, apps 0.123323

Persistent Topics (columns: Topic, Representative Terms, Persistence, #tweets)
63 hemingway, running, follow, crazy, marshall 3.963249475 jehane, square, girlrising, premiere, screening 2.650500108 vhs, sequel, horror, review, time 2.31846945 afar, week, enjoy, ways, kicked 1.61212717 lindsay, lohan, canyons, blame, snubbed 1.557343

Acknowledgements: This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

Table 3: Summaries during awards ceremony (4th line in Table 1)

Method: tf • idf
• Profound comment from @JoKiefer : Looper storyline echoes war on terror. Kill the terrorist before he becomes one? #Sundance13 #dirtywars
• #Sundance Institute Mahindra Global Filmmaking Award winnners include UK co-prodcution: Eva Weber: Let the Northern Lights Erase your Name
• #Sundance Institute Mahindra Global Filmmaking Award winnners include UK co-prodcution: Eva Weber: Let the Northern Lights Erase your Name
• #PussyRiot -A Punk Prayer takes home a World Cinema Doc Special Jury Award, directors Mike Lerner & Maxim Pozdorovkin
• [...] was really eye-opening. I had no idea how many brave men and women have died trying to put Bibles in hotel rooms. #Sundance

Method: LexRank
Yes! Audience Gideon'sArmy ;Award US Blood Brother' wins both Grand Jury and Audience Award for U.S. Documentary #sundance it's coming FRUITVALE wins the #Sundance Grand Jury Prize AND the Audience Award. Could not be happier. Congrats @fruitvalemovie and Ryan Coogler!

Method: Popularity
#PussyRiot -A Punk Prayer takes home a World Cinema Doc Special Jury Award Mike Lerner Predict Grand Jury Prize, too Crystal Fairy Sebastian Silva ' Wins #Sundance World Cinema Dramatic Directing Award Fruitvale Ryan Coogler ' wins the World Cinema Dramatic Audience award at the Sundance Film Festival -via @goldenglobes "The Spectacular Now" Wins #Sundance U. S. Dramatic Special Jury Award for actors Miles Teller Shailene Woodley Wins #Sundance U.S. Dramatic Audience Award s recorded speech: singing Hava Nagila while warping his face in Photo Booth. Word. #Sundance My pics of

Method: StreamGrid
Sebastian Silva The Spectacular Now" #Sundance Q&A , winner Special Jury award for acting to Miles Teller & Shailene Woodley Fruitvale" (dramatic) & "Blood Brother. doc) FilmLinc list of winners

References

[1] Celebrating #SB48 on Twitter, 2014. 27-Feb-2014.

[2] O. Alonso and K. Shiells. Timelines as summaries of popular scheduled events. In Proceedings of the 22nd International Conference on World Wide Web Companion, 2013. International World Wide Web Conferences Steering Committee.

[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. 3, Mar. 2003.

[4] D. Chakrabarti and K. Punera. Event summarization using tweets. In ICWSM, 2011.

[5] M. Dork, D. Gruen, C. Williamson, and S. Carpendale. A visual backchannel for large-scale events. IEEE Transactions on Visualization and Computer Graphics, 16(6), 2010.

[6] G. Erkan and D. R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1), Dec. 2004.

[7] C. Lin, C. Lin, J. Li, D. Wang, Y. Chen, and T. Li. Generating event storylines from microblogs. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, New York, NY, USA, 2012. ACM.

[8] F. Liu, Y. Liu, and F. Weng. Why is "sxsw" trending? Exploring multiple text sources for Twitter topic summarization. In Proceedings of the ACL Workshop on Language in Social Media (LSM), 2011.

[9] A. Marcus, M. S. Bernstein, O. Badar, D. R. Karger, S. Madden, and R. C. Miller. Twitinfo: Aggregating and visualizing microblogs for event exploration. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, New York, NY, USA, 2011. ACM.

[10] R. Mehrotra, S. Sanner, W. Buntine, and L. Xie. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, New York, NY, USA, 2013. ACM.

[11] J. Nichols, J. Mahmud, and C. Drews. Summarizing sporting events using Twitter. In Proceedings of the 2012 ACM International Conference on Intelligent User Interfaces, IUI '12, New York, NY, USA, 2012. ACM.

[12] D. R. Radev, H. Jing, M. Styś, and D. Tam. Centroid-based summarization of multiple documents. Inf. Process. Manage., 40(6), Nov. 2004.

[13] E. Schinas, S. Papadopoulos, S. Diplaris, Y. Kompatsiaris, Y. Mass, J. Herzig, and L. Boudakidis. Eventsense: Capturing the pulse of large-scale events by mining social media streams. In Proceedings of the 17th Panhellenic Conference on Informatics, PCI '13, New York, NY, USA, 2013. ACM.

[14] D. A. Shamma, L. Kennedy, and E. F. Churchill. Peaks and persistence: Modeling the shape of microblog conversations. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, CSCW '11, New York, NY, USA, 2011. ACM.

[15] B. Sharifi, M.-A. Hutton, and J. Kalita. Summarizing microblogs automatically. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[16] C. Shen, F. Liu, F. Weng, and T. Li. A participant-based approach for event summarization using Twitter streams. In Proceedings of NAACL-HLT, 2013.

[17] L. Shou, Z. Wang, K. Chen, and G. Chen. Sumblr: Continuous summarization of evolving tweet streams. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, New York, NY, USA, 2013. ACM.

[18] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In L. Bottou and M. Littman, editors, Proceedings of the 26th International Conference on Machine Learning (ICML), Montreal, June 2009. Omnipress.

[19] D. Wang, T. Li, and M. Ogihara. Generating pictorial storylines via minimum-weight connected dominating set approximation in multiview graphs. In AAAI'12, 2012.