Mining Concurrent Topical Activity in Microblog Streams A. Panisson, L. Gauvin, M. Quaggiotto, C. Cattuto Data Science Laboratory, ISI Foundation, Torino, Italy {andre.panisson},{laetitia.gauvin},{marco.quaggiotto},{ciro.cattuto}@isi.it ABSTRACT the capability of teasing apart latent signals that involve complex Streams of user-generated content in social media exhibit patterns correlations between users, topics and time intervals. of collective attention across diverse topics, with temporal struc- tures determined both by exogenous factors and endogenous fac- The motivation for the present study is twofold. On the one hand, tors. Teasing apart different topics and resolving their individ- we want to devise techniques that can reliably solve the inverse ual, concurrent, activity timelines is a key challenge in extracting problem of extracting latent signals of attention to specific topics knowledge from microblog streams. Facing this challenge requires based on a stream of posts from a micro-blogging system. That the use of methods that expose latent signals by using term cor- is, we aim at extracting the time-varying topical structure of a mi- relations across posts and over time. Here we focus on content croblog stream such as Twitter. On the other hand, we want to posted to Twitter during the London 2012 Olympics, for which a deploy these techniques in a context where temporal and semantic detailed schedule of events is independently available and can be metadata about external events driving Twitter are available, so that used for reference. We mine the temporal structure of topical ac- the relation between exogenous driving and time-varying topical tivity by using two methods based on non-negative matrix factor- responses can be elucidated. We do not regard this as a validation ization. We show that for events in the Olympics schedule that of the methods we use, because the relation between the external can be semantically matched to Twitter topics, the extracted Twit- drivers and the response of a social system is known to be com- ter activity timeline closely matches the known timeline from the plex, with memory effects, topical selectivity, and different degrees schedule. Our results show that, given appropriate techniques to de- of endogenous social amplification. Rather, we regard the com- tect latent signals, Twitter can be used as a social sensor to extract parison between the time-resolved topical structure of a microblog topical-temporal information on real-world events at high temporal stream and an independently available event schedule as an impor- resolution. tant step for understanding to what extent Twitter can be used as a social sensor to extract high-resolution information on concurrent events happening in the real world. Keywords topic detection, microblogs, matrix and tensor factorization, collec- Here we focus on content collected by the Emoto project1 from tive attention, event detection Twitter during the London 2012 Olympics, for which a daily sched- ule of the starting time and duration of sport events and social 1. INTRODUCTION events is available and can be used for reference. In this context, re- Streams of user-generated content from social media and microblog- solving topical activity over time requires to go beyond the analysis ging systems exhibit patterns of collective attention across diverse and characterization of popularity spikes. A given topic driven by topics, with temporal structures determined both by exogenous fac- external events usually displays an extended temporal structure at tors, such as driving from mass media, and endogenous factors such the hourly scale, with multiple activity spikes or alternating periods as viral propagation. Because of the openness of social media, of of high and low activity. We aim at extracting signals that consists the complexity of their interactions with other social and informa- of an association of (i) a weighted set of terms defining the topic, tion systems, and of the aggregation that typically leads to the ob- (ii) a set of tweets that are associated to the topic, together with servable stream of posts, several concurrent signals are usually si- the corresponding users, and (iii) an activity profile for the topic multaneously present in the post stream, corresponding to the activ- over time, which may comprise disjoint time intervals of nonzero ity of different user communities in the context of several different activity. We detect time-varying topics by using two independent topics. Making sense of this information stream is an inverse prob- methods, both based on non-negative matrix or tensor factoriza- lem that requires moving beyond simple frequency counts, towards tion. In the first case we build the full tweet-term-time tensor and use non-negative tensor factorization to extract the topics and their activity over time. We introduce an adapted factorization technique that can naturally deal with the special tensor structure arising from microblog streams. In the second case, which in principle affords Copyright c 2014 held by author(s)/owner(s); copying permitted on-line computation, we build tweet-term frequency matrices over only for private and academic purposes. consecutive time intervals of fixed duration. We apply non-negative Published as part of the #Microposts2014 Workshop proceedings, available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141) matrix factorization to extract topics for each time interval and we track similar topics over time by means of agglomerative hierarchi- #Microposts2014, April 7th, 2014, Seoul, Korea. 1 http://www.emoto2012.org · #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014 cal clustering. composition (CANDECOM), also called parallel factorization (PARAFAC, [9]), which can be regarded as a generalization to ten- We then apply both methods to the Twitter dataset collected dur- sors of singular value decomposition (SVD). Maintaining the in- ing the Olympics, which reflects the attention users pay to tens of terpretability of the factors usually requires to achieve factoriza- different concurrent events over the course of every day. We fo- tion under non-negativity constraints, leading to techniques such cus on topical dynamics at the hourly scale, and find that for those as non-negative matrix or tensor factorization (NMF and NTF). sport events in the schedule that can be semantically matched to Tensor factorization to detect latent structures has been extensively the topics we obtain from Twitter, the activity timeline of the de- used in several domains such as signal processing, psychometrics, tected topic in Twitter closely matches the event timeline from the brain science, linguistics and chemometrics [26, 6, 29, 28, 30]. schedule. This paper is structured as follows: Section 2 reviews the literature on collective attention, popularity, and topic detection in microblog 3. DATA AND REPRESENTATION streams. Section 3 describes the Olympics 2012 Twitter dataset Notation used for the study, the event schedule we use as an external refer- The following notations are used throughout the paper. Scalars are ence, and introduces some notations and conventions used through- denoted by lowercase letters, e.g., x, and vectors are denoted by out the paper. Section 4 and Section 5 describe the two techniques boldface lowercase letters, e.g., x, where the i-th entry is xi . Ma- we use to mine time-varying topical activity in the Twitter stream. trices are denoted by boldface capital letters, e.g., X, where the i-th Section 6 discusses the relation between the time-varying topics column of matrix X is xi , and the (i, j)-th entry is xij . Third or- we obtain and the known schedule of the Olympics events for one der tensors are denoted by calligraphic letters, e.g., A. The i-th representative day, and provides some general observations on the slice of A, denoted by Ai , is formed by setting the last mode of the behavior of the two methods. Finally, Section 7 summarizes our third order tensor to i. The (i, j)-th vector of A, denoted by aij , is findings and points to directions for further research. formed by setting the second to last and last modes of A to i and j respectively, and the (i, j, k)-th entry of A is aijk . 2. RELATED WORK The dynamics of collective attention and popularity in social media Twitter Dataset has been the object of extensive investigation in the literature. At- The Emoto dataset consists of around 14 million tweets collected tention can suddenly concentrate on a Web page [31, 22], a YouTube during the London 2012 Summer Olympics using the public Twit- video [7, 8, 20], a story in the news media [17], or a topic in Twit- ter Streaming API. All tweets have at least one of 400 keywords, ter [14, 2, 34]. Intrinsic features of the popular item under consid- including common words used in the Olympic Games – like ath- eration have been related to its popularity profile by means of se- lete, olympic, sports names and twitter accounts of high followed mantic analysis and natural language processing of user-generated athletes and media companies. Tweets were collected during all the content [1, 32, 33]. In particular, a great deal of research [7, 14, 15, interval of 17 days comprising the Olympic Games, from July 27 34, 16] has focused on characterizing the shape of peaks in popu- to August 12 2012. larity time series and in relating their properties to the popular item under consideration, to the relevant semantics, or to the process Event Schedule driving popularity. In order to investigate the relation between the extracted time- vary- ing topics and the sport events of the Olympic Games, we use the Within the broad context of social media, Twitter has emerged as schedule available on the official London 2012 Olympics page2 , a paradigmatic system for the vision of a “social sensor” that can be where the starting time and duration of most events is reported to- used to measure diverse societal processes and responses at scale [11, gether with metadata about the type of event (discipline, involved 25, 3, 19]. To date, comparatively little work has been devoted to teams or countries, etc.) extracting signals that expose complex correlations between top- ics and temporal behaviors in micro-blogging systems. Given the Data Preprocessing many factors driving Twitter, and their highly concurrent nature, For the text analysis performed in this paper, URLs are removed exposing such a topical-temporal structure may provide important from the original tweet content. The remaining text is used to build insights in using Twitter as a sensor when the social signals of inter- a vocabulary composed of the most common 30,000 terms, where est cannot be pinpointed by simply using known terms or hashtags each term can be a single word, a digram or a trigram. 352 common to select the relevant content, or when the topical structure itself, words of the English language are also removed from the vocabu- and its temporal evolution, needs to be learned from the data. Saha lary. and Sindhwani [24] adopt such as viewpoint and propose an algo- rithm based on non-negative matrix factorization that captures and In order to localize Twitter users, we examine the user profile de- tracks topics over time, but is evaluated at the daily temporal scale scriptions and use an adapted version of GeoDict3 to identify, if only, against events that mainly consist of single popularity peaks, possible, the user country. To study the relation between the ex- without concurrency. Here we aim at capturing multiple concurrent tracted topical activity and the schedule of the Olympic events, we topics and their temporal evolution at the scale of hours, in order to focus on tweets posted by users located in the UK, only. This al- be able to compare the extracted signals with a known schedule for lows us to avoid potential confusion arising from tweets posted in several concurrent events taking place during one day. countries, such as the USA, where Olympics events were broad- casted with delays of several hours due to time zone differences. As we will discuss in detail, microblog activity can be represented This selection leaves us with a still substantial amount of data (about using a tweet-term-time three-way tensor, and tensor factorization 2 techniques can be used to uncover latent structures that represent http://www.london2012.com/schedule-and-results/ 3 time-varying topics. Ref. [5] proposed in 1970 the Canonical De- https://github.com/petewarden/geodict 4 · #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014 Representation of Topics 1 Topics 2 one third of the full dataset) and simplifies the subsequent temporal the topics for one analysis, even though it probably oversamples the attention payed day to events that involved UK athletes. For the scope of this study, we represent the data as a sparse third- order tensor T ∈ RI×J×K , with I tweets, J terms and K time intervals. We aggregate the tweets over 1-hour intervals, for a total of K = 408 intervals. The tensor T is sparse: the average number of terms (also referred as features in the following) for each tweet is Figure 1: Representation of a Kruskal decomposition. The typically no more than 10, compared to the 30k terms of our term cube corresponds to the tensor to be factorized while the rect- vocabulary. Moreover, as each tweet is emitted at a given time, angles represent the vectors. In the Twitter case, each of the each interval k has a limited number of active tweets, Ik . A ten- rank-one tensor would correspond to the description of one sor slice Tk ∈ RI×J is a sparse matrix with non-zero values only topics. for Ik rows. Tk represent the sparse tweet-term matrix observed at time k. The term values tijk for each tweet i are normalized us- ing the standard Term Frequency and Inverse-Document Frequency (TF-IDF) weighting, tijk = tf(i, j) × idf(j), where tf(i, j) is the Factorization Methodology |D| frequency of term j in tweet i, and idf(j) = log 1+|{d:j∈d}| where Regarding the extraction of topics, the aim is not to decompose the |D| is the total number of tweets and |{d : j ∈ d}| is the number tensor in its exact form but to approximate the tensor by a sum of of tweets where the term j appears. rank-1 tensors with a number of terms smaller than the rank of the original tensor. This number R corresponds to the number of topics that we want to extract (see Fig. 1). Such an approximation of the Visualizing Topics over Time tensor leads to minimize the difference between T and JA, B, CK: The methods that we present in this paper are able to extract topical- min kT − JA, B, CKk2F (2) temporal structures from T . Such topical-temporal structures can A,B,C be represented a stream matrix S ∈ RR×K with R topics and K intervals. Each component R is also characterized by a term-vector where kk is the Frobenius norm. We transform the 3-dimensional h ∈ RJ that defines the most representative terms for that compo- problem (Eq. 2) in 2-dimensional sub-problems by unfolding the nent. In order to visualize such topical-temporal structures repre- tensor T in three different ways. This process called matriciza- sented as a stream matrix, we use the method described by Byron tion gives rise to three modes X(1) , X(2) , X(3) . The mode-n ma- and Wattenberg [4], which yields a layered stream-graph visualiza- tricization consists of linearizing all the indices of the tensor ex- tion. cept n. The three resulting matrices have respectively a size of I ×JK,J ×IK and K ×IJ. Each element of the matrix X(i=1,2,3) corresponds to one element of the tensor T such that each of the 4. MASKED NON-NEGATIVE TENSOR mode contains all the values of the tensor. Due to matricization, the FACTORIZATION factorization problem given by Eq.1 can be reframed in factoriza- tion of the three modes. In other terms, maximizing the likelihood Problem Statement between T and JA, B, CK is equivalent to minimizing the differ- As explained in section 3, the tensor T ∈ RI×J×K with I tweets, ence between each of the mode and their respective approximation J terms and K intervals is a natural way to represent the tweets and in terms of A, B, C. The factorization problem (PARAFAC) in their contents with respect to the time. The tensor has the advan- Eq.2 is converted to the three following sub-problems where we tage to directly encompass the relationship between tweets posted added a condition of non-negativity of the three modes: at different hours and consequently between topics of the differ- 2 ent hours. The tensor factorization as described below allows to min kX(1) − A(C B)T kF (3) A≥0 uncover topics together with their temporal pattern. 2 min kX(2) − B(C A)T kF (4) B≥0 Before describing the process of factorization itself and its out- 2 put, one needs to introduce the concept of canonical decomposition min kX(3) − C(B A)T kF (5) C≥0 (CP). CP in 3 dimensions aims at writing a tensor T ∈ RI×J×K in a factorized way that is the sum of the outer product of three where is the Khatri-Rao product which is a columnwise Kro- vectors: necker product, i.e. such that C B = [c1 ⊗ b1 c2 ⊗ b2 . . . cr ⊗ br ]. RT If C ∈ RK×R and B ∈ RJ×R , then the Khatri-Rao product X C B ∈ RKJ×R . In our case of study, A, B, C will give each T = ar ◦ br ◦ cr (1) access to a different information: A allows to know at which topic r=1 belongs a tweet, B gives the definition of the topics with respect to where the smallest value of RT for which this relation exists, is the features and C gives the temporal activity of each topic. the rank of the tensor T . In other words, the tensor T is ex- pressed with a sum of rank-1 tensors. The set of vectors a{1,2,...,R} Several algorithms have been developped to tackle the PARAFAC (resp. b{1,2,...,RT } ,c{1,2,...,RT }) can be re-written as a matrix A ∈ decomposition. The two most common are one method based on RI×RT (resp B ∈ RJ×RT ,C ∈ RK×RT ) where each of the RT the projected gradient and the Alternating Least square method vectors is a column of the matrix. The decomposition of Eq. 1 (ALS). The first one is convenient for its ease of implementation can also be represented in terms of the three matrices A, B, C as and is largely used in Singular Value Decomposition (SVD) but JA, B, CK. A visual representation of such a factorization, also converges slowly. In the ALS method, the modes are deduced suc- called Kruskal decomposition, is displayed on Fig. 1. cessively by solving Eq 5. In each iteration, for each of the sub- 5 · #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014 problem, two modes are kept fixed while the third one is computed. Non-negative Matrix Factorization This process is repeated until convergence. In our case, we use a For each tensor slice Tk , we compute a non-negative factorization nonnegativity constraint to make the factorization better posed and by minimizing the following error function, the results meaningful. One thus uses nonnegative ALS (ANLS 2 [21]) combined with a block-coordinate-descent method in order min kTk − W(k) H(k) kF , (8) W,H to reach the convergence faster. Each of the step of the algorithm needs to take into account the Karush-Kuhn-Tucker (KKT) con- where kk is the Frobenius norm, subject to the constraint that the ditions to have a stationary point. Our program is based on the values in W(k) and H(k) must be non-negative. The non-negative algorithm implemented by [12]. factorization is achieved using the projected gradient method with sparseness constraints, as described in [18, 10]. The factorization Masked Adaptation of the NTF produces a matrix of left vectors W(k) ∈ RIk ×F and a matrix of right vectors H(k) ∈ RF ×J , where F is the number of components We cannot directly perform the NTF on the tensor [Tweets × Fea- used in the decomposition. The matrix H(k) stores the term vectors tures × Interval] built as explained aboved as this tensor has a of the extracted components at interval t. The matrix W(k) is used “block-disjoint” structure peculiar to the tweets. Indeed each tweet to calculate the strength of each extracted component, which are has non-zero values only at one interval because a tweet is emitted represented in a matrix Z ∈ RF ×K given by only at a given time. Each interval k has only Ik active tweets. In each slice Tk of the tensor, only Ik rows have meaningful values. Ik X wif (k) So, we are only interested in reproducing the tensor part which con- zf k = (9) P F tains the meaningful values. In order to focus on these meaningful i=1 (k) wif 0 values, one needs to consider an adapted version of the tensor T . f 0 =1 We first consider the tensor T built as explained above. We gen- where zf k is the strength of factor f at interval k. erate a first set of matrices A, B, C which could approximate the tensor. At the next step, one tries to decompose a tensor T̄ where the values are a combination of the values of T̄ and of the values Component Clustering of JA, B, CK. More exactly, this tensor has the same size than T In order to track topics over time, we need to merge components and the same values than T for the rows Ik of each slice T̄k . The into topics depending on how similar they are. Since each com- complementary values are given by JA, B, CK. In other terms, at ponent is defined by a term vector, we can calculate a similarity each step, the tensor that we approximate is updated by: matrix of all possible pairs of term vectors using cosine similarity. This matrix is fed to a standard agglomerative hierarchical cluster- T̄ = T  W + (1 − W)JA, B, CK (6) ing algorithm, known as UPGMA [27], that at each step combines the two most similar clusters into a higher-level cluster. Cluster where  is the Hadamard product (element-wise product) and W is similarity is defined in terms of average linkage: that is, the dis- a binary tensor of the same size than T with 1-values only when the tance between two clusters c1 and c2 is defined as the average of values of T at this position are meaningful. The particular struc- all pair-wise distances between the children of c1 and those of c2 . ture of the tensor (disjoint blocks in time) could be perceived as a “missing values” problem in the tensor, this problem has been for The hierarchical clustering produces a tree that can be cut at a given example tackled in [23]. depth to yield a clustering at a chosen level of detail. That is, by varying the threshold similarity we use for the cut we can go from a Concretely, the implementation is an adaptation of a Matlab pro- coarse-grained topical structure, with few clusters that may merge gram [12] which uses the Tensor Toolbox [13]. This adaptation unrelated topics, to a fine-grained topical structure, with many clus- includes the introduction of a mask (via the tensor of weight) as ters that may separate term vectors that otherwise could be regarded mentionned above and the rewriting of some operations to avoid as the same continuous topic over time. The cut threshold needs to memory issues. This point is not detailed here as it is not part of be chosen based on criteria that depends on the application at hand. the main point of the paper. Each choice for the cut yields a number of clusters C and a map Stream Matrix Construction function C(r, f ) → k that associates the component index f at time We calculate the strength of each topic with respect to the time interval k to a topic cluster r. This function collects all components by using both the information about the link between each topic associated to cluster r in a set Crk for each interval k. and each tweet and about temporal pattern of the topics. These informations are available through A and C and the consequent Stream Matrix Construction strength of a topic r on each interval of time k is given by: When constructing the stream matrix, the number of topics R in X the stream matrix is given by the number of clusters generated by srk = air ∗ ckr (7) the clustering step. In order to calculate the entries srk of the re- i|k sulting stream matrix S, we aggregate the strengths of the clustered P where i|k is a sum over the tweets indexed by i occurring at the components. We build a stream matrix S ∈ RR×K , with R topics interval indexed by k. The set of elements s{r,k} with r = J1, RK and K intervals, given by and k = J1, KK forms the stream matrix S. Each topic is then defined by a terms vector and each of this term vector is given by a X column of B. srk = zf k (10) f ∈Crk 5. AGGLOMERATIVE NON-NEGATIVE MATRIX FACTORIZATION Finally, we extract the term vectors that are associated to each clus- 6 · #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014 (k) ter. Each cluster will be associated to a term vector hr ∈ RJ that terms appear with a high weight in its term vectors, we define our (k) is the average of all term vectors hf associated to that cluster in matching score based on the geometric average of the weights of the component clustering step. the event annotated terms in the topic’s term vectors: p hwi = n hw1 r hw2 r . . . hwn r (11) 6. ANALYSIS OF THE OLYMPICS DATA- For Masked NTF, for each event, we choose the topic with the high- est corresponding geometric average hwi. In the agglomerative SET NMF case, for each event, we choose the topic with the highest We now move to the analysis of the London 2012 Twitter dataset corresponding geometric average hwi weighted by log(n) where n and its relation with the known schedule of the Games. We focus is the number of components in the selected cluster. We use log(n) on one representative day, July 29th, during which several sport in order to favor the selection of clusters with a higher number of events took place at different times and concurrently. We use both aggregated components, otherwise the most detailed clusters which topic detection methods, show the signals they extract, and check aggregates only one component are always selected. Since the Ag- to what extent they are capable of extracting signals that we can glomerative NMF method produces a tree structure in which each understand in terms of the schedule. node agglomerates a set of components and represents topic activ- ity, we have to calculate such matching result for each node, and The topic detection methods are set up as follows. For the masked select the node for which such matching result is the highest. NTF method, we decompose the tensor using a fixed number of components, using a tolerance value of 10−4 for the stopping con- dition, and limiting the number of iterations to 50. For the agglom- Results and Observations erative NMF method, we decompose each interval matrix using a At this point, we have, for each event, a topic which was selected in fixed number of components. We use a tolerance value of 10−4 for each method, and the corresponding matching result. In Figure 3, the stopping condition, and limit the number of iterations to 20. We we show the schedule events for the top 20 highest matching re- use 250 topics for the Masked NTF, and 50 components per time sults. In the lefthand figure, we show, for each one of the top 20 intervals in the Agglomerative NMF. matching results, the topic extracted by the Masked NTF method, while in the righthand figure, we show the topic extracted by the Figure 2 shows a streamgraph representation of time-varying top- Agglomerative NMF method. The results are sorted by the corre- ics extracted using the two methods we have discussed. Two global sponding matching weight. activity peaks are visible in both streamgraphs: the peak at about 2.30pm UTC was triggered when Elizabeth Armistead won the sil- For each event, on its top left corner, we show the manually an- ver medal in road cycle; the peat at about 7pm UTC is driven by notated terms used for the matching. The shaded blue area shows the bronze medal in 400m freestyle to Rebecca Adlington. In the the exact interval during which the event was occurring accord- stream graphs, for clarity, each topic is annotated using only its ing to the official Olympics schedule. In the same area, the solid topmost weighted term. This makes it difficult to assess a visual green line represents the temporal structure of the topic with higher correspondence between the same topics across the two represen- matching result according to our matching criteria. Such values tations, as the term with top weight may be different for the two roughly represent the amount of activity for such topic and are nor- term vectors even though the vectors are overall very similar (in malized according to the peak of activity. We show the value for terms of cosine similarity). On closer inspection, many precise cor- this peak in the top right side, along with the matching results be- respondences can be established between the topics extracted by tween parenthesis. In the Agglomerative NMF graph (on the right) the Masked NTF method and those extracted by the Agglomerative we show as a dotted line the activity in time for the given terms NMF method: for example, the topic armistead in the top stream- regarding the number of tweets that have such terms (tweet count). graph matches the topic congratulation in the bottom one. An inter- We remark that by considering only the dotted line the timing of active streamgraph visualization of the London 2012 Twitter data- many events on the right side of the figure does not match the set is available at http://www.datainterfaces.org/projects/emoto/. schedule timings, i.e., merely counting tweets is not sufficient at this resolution level. We also measured the number of tweets where the terms are co-ocurring, and in this case the number of tweets is 6.1 Comparison with the Olympics Schedule so small that it does not allow the detection of any structure in time. Event Selection In order to show the possible correspondence between the extracted We evaluated these activity profiles using the CrowdFlower Web- topics and sport events, we manually annotate the schedule col- based crowdsourcing platform (restricted to Amazon Mechanical lected from the official London 2012 Olympics page for July 29th, Turk workers). Each work unit asks a worker to visually inspect 2012. As the number of events in a day can be substantial and we and compare two timelines: the one to be evaluated, and a refer- want to focus on events with higher impact on social media, we ence timeline corresponding to the known time intervals for sport retain events that are either finals or team sports match. We anno- events taken from the Olympic schedule. Each work unit looks tate each event with a set of at most three terms extracted from the like a row from Figure 3. Our evaluation was based on 100 work schedule, as described in Section 3. For a team sport, we use the units evenly distributed among 5 types: 1) (NMF) work units based sport name and the countries of the two teams, otherwise, we put on the results of Agglomerative NMF; 2) (CNT-NMF) work units the name of the sport and its characteristics, e.g., the discipline for with activity profiles generated by simply counting the number of swimming. tweets with the terms used in matching the NMF topics; 3) (NTF) work units from the Masked NTF approach; 4) (CNT-NMF) same Matching Topics and Events as (CNT-NMF) for Masked NTF; 5) synthetic work units (“gold” For each event, we use a matching criteria to select one of the ex- units) used to assess worker quality. For each work unit, we asked tracted topics from each of the set of topics produced by the meth- the workers whether the two timelines matched exactly (Yes), mat- ods. Since we want to select a topic in which all event annotated ched partially (Partially) or not at all (No). 95% of the judgments 7 · #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014 Masked NTF lizzie beckadlington armitstead gb, team teamgb, judo seat, sponsor radcliffe seat seat, ticket teamgb, giggssturridge teamgb, volleyball watch teamgb, medal good sport mail teamgb, volleyball football athlete race medal, armitstead support volleyball bronze ceremony medal, teamgb road, cycle teamgb, hockey teamgb, silver teamgb, gymnastics becky teamgb, support beckadlington swim swim, liamtancock home 800, gb woman Agglomerative NMF teamgb bronze becky paula olympic game rebecca swim beckadlington centre woman road olympic football medal sponsor seat, corporate seb coe say woman beach volley win congratulation olympic, sport seat swim team win gratitude support sure l_armitstead run tomdaley1994 liamtancock box olympic, open olympic come lizzy medal teamgb synchronise beckadlington medal silver swim represent 08:00 12:00 16:00 20:00 Figure 2: Streamgraph visualization of the stream matrices generated by the Masked NTF (top) using 50 topics, and by the Ag- glomerative NMF (bottom) using 20 components per interval and a total of 150 clusters. Interactive streamgraph visualizations for a few use cases are available at http://www.datainterfaces.org/2013/06/twitter-topic-explorer/. for gold work units were correct. We only retained those users who poral and topical information about the event schedule, pointing to correctly judged more than 80% of the gold units. Figure 4 shows more sophisticated uses of Twitter as a social sensor. the distribution of judgements for the different types of work units. The left hand side of the figure shows the distribution obtained for The work described here can be extended along several directions. all work units, while the right hand side shows the distribution re- It would be interesting to develop and characterize on-line versions stricted to work units with more than 80% of agreement across dif- of the techniques we used here, so that topic emergence and trend ferent workers. According to this evaluation, both NTF and NMF detection could be carried out on live microblog streams. Because outperform the count-based methods. of its temporal segmentation, the Agglomerative NMF case lends itself rather well to on-line incremental computation, whereas a dy- We see that for most of the events there is a close temporal align- namic version of the Masked NTF technique would be more chal- ment between the event schedule and the topic structure, at the scale lenging to achieve. of the hour or less. We see that such temporal alignment is much closer than when compared to the peaks of activity generated by Another interesting direction for future research would be to aug- counting tweets. ment the tweet-term-time tensor with a fourth dimension represent- ing the location of the users, so that the latent signals we extract We observe that the mismatches in the temporal alignment are caused could expose correlation between topics, time intervals and loca- by two different factors. The first one is due to a low matching re- tions, exposing geographical patterns of collective attention and sults, like the event annotated with (football, mexico, gabon). It their relation to delays, e.g., in the seeding by mass media across means that the term vectors for the given topic does not represent different countries. with high confidence the terms used to annotate the event. The sec- ond one is due to a different behaviour in collective attention. This happens for example in the case of swimming events, where the Acknowledgements The Authors acknowledge the Emoto project www.emoto2012.org and its first part of the event is related to eliminatories and the second part partners for access to the Twitter dataset on the London Olympics 2012. The is related to the finals. In such cases, the peak in activity arrives Authors acknowledge inspiring discussions with Moritz Stefaner and Bruno when the event finishes and the attention goes to the winner. Goncalves. The Authors aknowledge partial support from the Lagrange Project of the ISI Foundation funded by the CRT Foundation, from the Q- ARACNE project funded by the Fondazione Compagnia di San Paolo, and 7. SUMMARY AND FUTURE WORK from the FET Multiplex Project (EU-FET-317532) funded by the European The topic detection techniques we discussed here afford tracking Commission. the attention that a community of users devotes to multiple con- current topics over time, teasing apart social signals that cannot be disentangled by simply measuring frequencies of term or hash- 8. REFERENCES tags occurrences. This allows to capture the emergence of topics [1] E. Adar, D. Weld, Bershad, B.N., and S. Gribble. Why we search: visualizing and predicting user behavior. In Proc. 16th intl. conf. on and to track their popularity with a high temporal resolution and World Wide Web (WWW’07), pages 161–170, 2007. a controllable semantic granularity. The comparison with an in- [2] S. Asur, B. A. Huberman, G. Szabo, and W. C. Trends in social dependently available schedule of real-world events shows that the media : Persistence and decay. In Proc. 5th Intl. Conf. on Weblogs response of Twitter to external driving retains a great deal of tem- and Social Media (ICWSM), page 434, 2011. 8 · #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014 swim, 400m, freestyle 228.8 (0.535) football, egypt, zealand 171.4 (1.467) canoe 95.8 (0.415) basketball, usa, france 335.9 (1.293) football, spain, honduras 157.8 (0.285) canoe 477.9 (1.233) road cycle, medal 1220.3 (0.225) road cycle, medal 3330.2 (1.130) swim, 4x100m, freestyle 269.5 (0.212) swim, 4x100m, freestyle 292.3 (0.974) basketball, usa, france 269.5 (0.211) archery, korea, china 180.9 (0.921) archery, korea, china 86.2 (0.197) football, spain, honduras 1374.4 (0.844) hockey, gb, japan 260.0 (0.147) football, brazil, belarus 447.4 (0.819) judo, final 150.7 (0.123) judo, final 320.3 (0.799) volleyball, bulgaria, gb 97.7 (0.117) basketball, spain, china 323.5 (0.646) football, egypt, new zealand 108.8 (0.096) football, senegal, uruguay 199.5 (0.594) football, teamgb, uae 152.7 (0.093) swim, 400m, freestyle 384.9 (0.582) swim, 200m, freestyle 228.8 (0.091) hockey, gb, japan 584.4 (0.424) football, brazil, belarus 108.8 (0.089) football, uae, gb 742.9 (0.325) swim, breaststroke, 100m 289.2 (0.081) basketball, nigeria, tunisia 136.5 (0.304) football, senegal, uruguay 247.9 (0.074) archery, japan, russia 180.9 (0.273) archery, japan, russia 282.2 (0.056) football, japan, morocco 742.9 (0.234) basketball, russia, gb 367.3 (0.056) basketball, brazil, australia 115.3 (0.222) swim, 100m, breakstroke 126.1 (0.049) swim, 100m, backstroke 352.2 (0.217) water polo, romania, gb 123.0 (0.036) football, mexico, gabon 628.2 (0.214) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 (a) Masked NTF (b) Agglomerative NMF Figure 3: The top 20 most representative schedule events regarding the matching weight of its annotations with the term vectors of an extracted topic. In the left, for each event, we show the topic extracted by the Masked NTF method for which the matching weight is the highest, and in the right we show the topic extracted by the Agglomerative NMF method for which the matching weight is the highest. Since we are showing the topmost 20 schedule events regarding the matching weight, the events are sorted by such matching weight. On the top left corner of each event, we show its annotated terms, along with the exact interval in which the event happened according to the official Olympics schedule (shaded blue area). The solid green line shows the temporal structure of the topic with higher matching weight along the 24 hours of July 29. The values in the top right side shows the value for the peak of the temporal structure, which roughly represents the amount of activity for such topic, and, between parenthesis, the matching weight for the given topic. In the Agglomerative NMF side (on the right) we show as a dotted line the activity in time for the given terms regarding the number of tweets that have such terms (tweet count). 9 · #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014 NMF NTF No Partially CNT-NMF Yes CNT-NTF 0 10 20 30 40 50 60 0 10 20 30 40 50 60 Percentage of answers Percentage of answers ( ≥80% agreement) Figure 4: Crowdsourced evaluation of the topical activity profiles for selected sport events (see main text) of the London 2012 Olympics dataset obtained by using the different topic detection methods. [3] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock [20] M. Naaman, H. Becker, and L. Gravano. Hip and trendy: market. Journal of Computational Science, 2(1):1 – 8, 2011. Characterizing emerging trends on twitter. J. Am. Soc. Inf. Sci., [4] L. Byron and M. Wattenberg. Stacked graphs–geometry & aesthetics. 62:902–918, 2011. Visualization and Computer Graphics, IEEE Transactions on, [21] P. Paatero and U. Tapper. Positive matrix factorization: A 14(6):1245–1252, 2008. non-negative factor model with optimal utilization of error estimates [5] J. Carroll and J.-J. Chang. Analysis of individual differences in of data values. Environmetrics, 5(2):111–126, 1994. multidimensional scaling via an n-way generalization of [22] J. Ratkiewicz, F. Menczer, S. Fortunato, A. Flammini, and "eckart-young" decomposition. Psychometrika, 35(3):283–319, A. Vespignani. Traffic in social media ii: Modeling bursty popularity. September 1970. In SocialCom 2010: SIN, 2010. [6] A. Cichocki, A. H. Phan, and R. Zdunek. Nonnegative Matrix and [23] J.-P. Royer, N. Thirion-Moreau, and P. Comon. NonNegative 3-Way Tensor Factorizations: Applications to Exploratory Multi-way Data tensor Factorzation taking into account Possible Missing Data. In Analysis and Blind Source Separation. Wiley, Chichester, 2009. Eurasip, editor, EUSIPCO-2012, pages 1–5, Bucarest, Roumanie, [7] R. Crane and D. Sornette. Robust dynamic classes revealed by Aug. 2012. Elsevier. measuring the response function of a social system. PNAS, [24] A. Saha and V. Sindhwani. Learning evolving and emerging topics in 105:15649, 2008. social media: a dynamic nmf approach with temporal regularization. [8] F. Figueiredo, F. Benevenuto, and J. Almeida. The tube over time: In Proc. of the fifth ACM intl. conf. on Web search and data mining, Characterizing popularity growth of youtube videos. In Proc. ACM WSDM ’12, pages 693–702, New York, NY, USA, 2012. ACM. Intl. Conf. on Web Search and Data Mining (WSDM), pages [25] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter 745–754, 2011. users: real-time event detection by social sensors. In Proc. of the 19th [9] R. A. Harshman. Foundations of the PARAFAC procedure: Models intl. conf. on World wide web, WWW ’10, pages 851–860, New and conditions for an" explanatory" multi-modal factor analysis. York, NY, USA, 2010. ACM. UCLA Working Papers in Phonetics, 16(1):84, 1970. [26] A. Shashua and T. Hazan. Non-negative tensor factorization with [10] P. Hoyer. Non-negative matrix factorization with sparseness applications to statistics and computer vision. In Proc. of the 22nd constraints. The Journal of Machine Learning Research, intl. conf. on Machine learning, ICML ’05, pages 792–799, New 5:1457–1469, 2004. York, NY, USA, 2005. ACM. [11] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury. Twitter power: [27] R. R. Sokal and C. D. Michener. A statistical method for evaluating Tweets as electronic word of mouth. Journal of the American Society systematic relationships. University of Kansas Scientific Bulletin, for Information Science & Technology, 60(11), 2009. 28:1409–1438, 1958. [12] J. Kim and H. Park. Fast nonnegative tensor factorization with an [28] J. Sun, D. Tao, and C. Faloutsos. Beyond streams and graphs: active-set-like method. In M. W. Berry, K. A. Gallivan, dynamic tensor analysis. In Proc. of the 12th ACM SIGKDD intl. E. Gallopoulos, A. Grama, B. Philippe, Y. Saad, and F. Saied, editors, conf. on Knowledge discovery and data mining, KDD ’06, pages High-Performance Scientific Computing, pages 311–326. Springer 374–383, New York, NY, USA, 2006. ACM. London, 2012. [29] T. Van de Cruys. A non-negative tensor factorization model for [13] T. G. Kolda and B. W. Bader. Tensor decompositions and selectional preference induction. In Proc. of the Workshop on applications. SIAM Rev., 51(3):455–500, Aug. 2009. Geometrical Models of Natural Language Semantics, GEMS ’09, [14] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social pages 83–90, Stroudsburg, PA, USA, 2009. Association for network or a news media? WWW ’10 Proc. of the 19th intl. conf. on Computational Linguistics. World wide web, page 591, Feb 2010. [30] Y. Wang and E. Agichtein. Temporal latent semantic analysis for [15] D. Laniado and P. Mika. Making sense of twitter. In Semantic Web - collaboratively generated content: preliminary results. In Proc. of the ISWC, volume 6469, pages 470–485, 2010. 34th intl. ACM SIGIR conf. on Research and development in [16] J. Lehmann, B. Gonçalves, J. J. Ramasco, and C. Cattuto. Dynamical Information Retrieval, SIGIR ’11, pages 1145–1146, New York, NY, classes of collective attention in twitter. In Proc. of the 21st intl. conf. USA, 2011. ACM. on World Wide Web, WWW ’12, pages 251–260, New York, NY, [31] F. Wu and B. A. Huberman. Novelty and collective attention. Proc. USA, 2012. ACM. Nat. Acad. Sci., 104:17599, 2007. [17] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the [32] J. Wu, K. M. Thornton, and E. N. Efthimiadis. Conversational dynamics of the news cycle. In Proc. of the 15th ACM SIGKDD intl. tagging in twitter. In Proc. of the 21st ACM conf. on Hypertext and conf. on Knowledge discovery and data mining, page 497, 2009. Hypermedia, pages 173–178, 2010. [18] C. Lin. Projected gradient methods for nonnegative matrix [33] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what factorization. Neural computation, 19(10):2756–2779, 2007. to whom on twitter. WWW 2011, pages 1–10, Feb 2011. [19] D. Mocanu, A. Baronchelli, N. Perra, B. Gonçalves, Q. Zhang, and [34] J. Yang and J. Leskovec. Patterns of temporal variation in online A. Vespignani. The twitter of babel: Mapping world languages media. In Proc. of the fourth ACM intl. conf. on Web search and data through microblogging platforms. PloS one, 8(4):e61981, 2013. mining, pages 177–186, 2011. 10 · #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014