Mining Concurrent Topical Activity in Microblog Streams

                                  A. Panisson, L. Gauvin, M. Quaggiotto, C. Cattuto
                                 Data Science Laboratory, ISI Foundation, Torino, Italy
                              {andre.panisson},{laetitia.gauvin},{marco.quaggiotto},{ciro.cattuto}@isi.it


 ABSTRACT                                                                 the capability of teasing apart latent signals that involve complex
 Streams of user-generated content in social media exhibit patterns       correlations between users, topics and time intervals.
 of collective attention across diverse topics, with temporal struc-
 tures determined both by exogenous factors and endogenous fac-           The motivation for the present study is twofold. On the one hand,
 tors. Teasing apart different topics and resolving their individ-        we want to devise techniques that can reliably solve the inverse
 ual, concurrent, activity timelines is a key challenge in extracting     problem of extracting latent signals of attention to specific topics
 knowledge from microblog streams. Facing this challenge requires         based on a stream of posts from a micro-blogging system. That
 the use of methods that expose latent signals by using term cor-         is, we aim at extracting the time-varying topical structure of a mi-
 relations across posts and over time. Here we focus on content           croblog stream such as Twitter. On the other hand, we want to
 posted to Twitter during the London 2012 Olympics, for which a           deploy these techniques in a context where temporal and semantic
 detailed schedule of events is independently available and can be        metadata about external events driving Twitter are available, so that
 used for reference. We mine the temporal structure of topical ac-        the relation between exogenous driving and time-varying topical
 tivity by using two methods based on non-negative matrix factor-         responses can be elucidated. We do not regard this as a validation
 ization. We show that for events in the Olympics schedule that           of the methods we use, because the relation between the external
 can be semantically matched to Twitter topics, the extracted Twit-       drivers and the response of a social system is known to be com-
 ter activity timeline closely matches the known timeline from the        plex, with memory effects, topical selectivity, and different degrees
 schedule. Our results show that, given appropriate techniques to de-     of endogenous social amplification. Rather, we regard the com-
 tect latent signals, Twitter can be used as a social sensor to extract   parison between the time-resolved topical structure of a microblog
 topical-temporal information on real-world events at high temporal       stream and an independently available event schedule as an impor-
 resolution.                                                              tant step for understanding to what extent Twitter can be used as a
                                                                          social sensor to extract high-resolution information on concurrent
                                                                          events happening in the real world.
 Keywords
 topic detection, microblogs, matrix and tensor factorization, collec-    Here we focus on content collected by the Emoto project1 from
 tive attention, event detection                                          Twitter during the London 2012 Olympics, for which a daily sched-
                                                                          ule of the starting time and duration of sport events and social
 1. INTRODUCTION                                                          events is available and can be used for reference. In this context, re-
 Streams of user-generated content from social media and microblog-       solving topical activity over time requires to go beyond the analysis
 ging systems exhibit patterns of collective attention across diverse     and characterization of popularity spikes. A given topic driven by
 topics, with temporal structures determined both by exogenous fac-       external events usually displays an extended temporal structure at
 tors, such as driving from mass media, and endogenous factors such       the hourly scale, with multiple activity spikes or alternating periods
 as viral propagation. Because of the openness of social media, of        of high and low activity. We aim at extracting signals that consists
 the complexity of their interactions with other social and informa-      of an association of (i) a weighted set of terms defining the topic,
 tion systems, and of the aggregation that typically leads to the ob-     (ii) a set of tweets that are associated to the topic, together with
 servable stream of posts, several concurrent signals are usually si-     the corresponding users, and (iii) an activity profile for the topic
 multaneously present in the post stream, corresponding to the activ-     over time, which may comprise disjoint time intervals of nonzero
 ity of different user communities in the context of several different    activity. We detect time-varying topics by using two independent
 topics. Making sense of this information stream is an inverse prob-      methods, both based on non-negative matrix or tensor factoriza-
 lem that requires moving beyond simple frequency counts, towards         tion. In the first case we build the full tweet-term-time tensor and
                                                                          use non-negative tensor factorization to extract the topics and their
                                                                          activity over time. We introduce an adapted factorization technique
                                                                          that can naturally deal with the special tensor structure arising from
                                                                          microblog streams. In the second case, which in principle affords
 Copyright c 2014 held by author(s)/owner(s); copying permitted           on-line computation, we build tweet-term frequency matrices over
 only for private and academic purposes.                                  consecutive time intervals of fixed duration. We apply non-negative
 Published as part of the #Microposts2014 Workshop proceedings,
 available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141)          matrix factorization to extract topics for each time interval and we
                                                                          track similar topics over time by means of agglomerative hierarchi-
 #Microposts2014, April 7th, 2014, Seoul, Korea.
                                                                          1
                                                                              http://www.emoto2012.org


· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
 cal clustering.                                                            composition (CANDECOM), also called parallel factorization
                                                                            (PARAFAC, [9]), which can be regarded as a generalization to ten-
 We then apply both methods to the Twitter dataset collected dur-           sors of singular value decomposition (SVD). Maintaining the in-
 ing the Olympics, which reflects the attention users pay to tens of        terpretability of the factors usually requires to achieve factoriza-
 different concurrent events over the course of every day. We fo-           tion under non-negativity constraints, leading to techniques such
 cus on topical dynamics at the hourly scale, and find that for those       as non-negative matrix or tensor factorization (NMF and NTF).
 sport events in the schedule that can be semantically matched to           Tensor factorization to detect latent structures has been extensively
 the topics we obtain from Twitter, the activity timeline of the de-        used in several domains such as signal processing, psychometrics,
 tected topic in Twitter closely matches the event timeline from the        brain science, linguistics and chemometrics [26, 6, 29, 28, 30].
 schedule.

 This paper is structured as follows: Section 2 reviews the literature
 on collective attention, popularity, and topic detection in microblog      3.      DATA AND REPRESENTATION
 streams. Section 3 describes the Olympics 2012 Twitter dataset             Notation
 used for the study, the event schedule we use as an external refer-        The following notations are used throughout the paper. Scalars are
 ence, and introduces some notations and conventions used through-          denoted by lowercase letters, e.g., x, and vectors are denoted by
 out the paper. Section 4 and Section 5 describe the two techniques         boldface lowercase letters, e.g., x, where the i-th entry is xi . Ma-
 we use to mine time-varying topical activity in the Twitter stream.        trices are denoted by boldface capital letters, e.g., X, where the i-th
 Section 6 discusses the relation between the time-varying topics           column of matrix X is xi , and the (i, j)-th entry is xij . Third or-
 we obtain and the known schedule of the Olympics events for one            der tensors are denoted by calligraphic letters, e.g., A. The i-th
 representative day, and provides some general observations on the          slice of A, denoted by Ai , is formed by setting the last mode of the
 behavior of the two methods. Finally, Section 7 summarizes our             third order tensor to i. The (i, j)-th vector of A, denoted by aij , is
 findings and points to directions for further research.                    formed by setting the second to last and last modes of A to i and j
                                                                            respectively, and the (i, j, k)-th entry of A is aijk .
 2.    RELATED WORK
 The dynamics of collective attention and popularity in social media        Twitter Dataset
 has been the object of extensive investigation in the literature. At-      The Emoto dataset consists of around 14 million tweets collected
 tention can suddenly concentrate on a Web page [31, 22], a YouTube         during the London 2012 Summer Olympics using the public Twit-
 video [7, 8, 20], a story in the news media [17], or a topic in Twit-      ter Streaming API. All tweets have at least one of 400 keywords,
 ter [14, 2, 34]. Intrinsic features of the popular item under consid-      including common words used in the Olympic Games – like ath-
 eration have been related to its popularity profile by means of se-        lete, olympic, sports names and twitter accounts of high followed
 mantic analysis and natural language processing of user-generated          athletes and media companies. Tweets were collected during all the
 content [1, 32, 33]. In particular, a great deal of research [7, 14, 15,   interval of 17 days comprising the Olympic Games, from July 27
 34, 16] has focused on characterizing the shape of peaks in popu-          to August 12 2012.
 larity time series and in relating their properties to the popular item
 under consideration, to the relevant semantics, or to the process          Event Schedule
 driving popularity.                                                        In order to investigate the relation between the extracted time- vary-
                                                                            ing topics and the sport events of the Olympic Games, we use the
 Within the broad context of social media, Twitter has emerged as           schedule available on the official London 2012 Olympics page2 ,
 a paradigmatic system for the vision of a “social sensor” that can be      where the starting time and duration of most events is reported to-
 used to measure diverse societal processes and responses at scale [11,     gether with metadata about the type of event (discipline, involved
 25, 3, 19]. To date, comparatively little work has been devoted to         teams or countries, etc.)
 extracting signals that expose complex correlations between top-
 ics and temporal behaviors in micro-blogging systems. Given the
                                                                            Data Preprocessing
 many factors driving Twitter, and their highly concurrent nature,
                                                                            For the text analysis performed in this paper, URLs are removed
 exposing such a topical-temporal structure may provide important
                                                                            from the original tweet content. The remaining text is used to build
 insights in using Twitter as a sensor when the social signals of inter-
                                                                            a vocabulary composed of the most common 30,000 terms, where
 est cannot be pinpointed by simply using known terms or hashtags
                                                                            each term can be a single word, a digram or a trigram. 352 common
 to select the relevant content, or when the topical structure itself,
                                                                            words of the English language are also removed from the vocabu-
 and its temporal evolution, needs to be learned from the data. Saha
                                                                            lary.
 and Sindhwani [24] adopt such as viewpoint and propose an algo-
 rithm based on non-negative matrix factorization that captures and
                                                                            In order to localize Twitter users, we examine the user profile de-
 tracks topics over time, but is evaluated at the daily temporal scale
                                                                            scriptions and use an adapted version of GeoDict3 to identify, if
 only, against events that mainly consist of single popularity peaks,
                                                                            possible, the user country. To study the relation between the ex-
 without concurrency. Here we aim at capturing multiple concurrent
                                                                            tracted topical activity and the schedule of the Olympic events, we
 topics and their temporal evolution at the scale of hours, in order to
                                                                            focus on tweets posted by users located in the UK, only. This al-
 be able to compare the extracted signals with a known schedule for
                                                                            lows us to avoid potential confusion arising from tweets posted in
 several concurrent events taking place during one day.
                                                                            countries, such as the USA, where Olympics events were broad-
                                                                            casted with delays of several hours due to time zone differences.
 As we will discuss in detail, microblog activity can be represented
                                                                            This selection leaves us with a still substantial amount of data (about
 using a tweet-term-time three-way tensor, and tensor factorization
                                                                            2
 techniques can be used to uncover latent structures that represent             http://www.london2012.com/schedule-and-results/
                                                                            3
 time-varying topics. Ref. [5] proposed in 1970 the Canonical De-               https://github.com/petewarden/geodict


                                                                                                                                               4
· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
                                                                                        Representation of    Topics 1            Topics 2
 one third of the full dataset) and simplifies the subsequent temporal                  the topics for one
 analysis, even though it probably oversamples the attention payed                      day

 to events that involved UK athletes.

 For the scope of this study, we represent the data as a sparse third-
 order tensor T ∈ RI×J×K , with I tweets, J terms and K time
 intervals. We aggregate the tweets over 1-hour intervals, for a total
 of K = 408 intervals. The tensor T is sparse: the average number
 of terms (also referred as features in the following) for each tweet is    Figure 1: Representation of a Kruskal decomposition. The
 typically no more than 10, compared to the 30k terms of our term           cube corresponds to the tensor to be factorized while the rect-
 vocabulary. Moreover, as each tweet is emitted at a given time,            angles represent the vectors. In the Twitter case, each of the
 each interval k has a limited number of active tweets, Ik . A ten-         rank-one tensor would correspond to the description of one
 sor slice Tk ∈ RI×J is a sparse matrix with non-zero values only           topics.
 for Ik rows. Tk represent the sparse tweet-term matrix observed
 at time k. The term values tijk for each tweet i are normalized us-
 ing the standard Term Frequency and Inverse-Document Frequency
 (TF-IDF) weighting, tijk = tf(i, j) × idf(j), where tf(i, j) is the        Factorization Methodology
                                                         |D|
 frequency of term j in tweet i, and idf(j) = log 1+|{d:j∈d}|    where      Regarding the extraction of topics, the aim is not to decompose the
 |D| is the total number of tweets and |{d : j ∈ d}| is the number          tensor in its exact form but to approximate the tensor by a sum of
 of tweets where the term j appears.                                        rank-1 tensors with a number of terms smaller than the rank of the
                                                                            original tensor. This number R corresponds to the number of topics
                                                                            that we want to extract (see Fig. 1). Such an approximation of the
 Visualizing Topics over Time                                               tensor leads to minimize the difference between T and JA, B, CK:
 The methods that we present in this paper are able to extract topical-
                                                                                                    min kT − JA, B, CKk2F                     (2)
 temporal structures from T . Such topical-temporal structures can                                 A,B,C
 be represented a stream matrix S ∈ RR×K with R topics and K
 intervals. Each component R is also characterized by a term-vector         where kk is the Frobenius norm. We transform the 3-dimensional
 h ∈ RJ that defines the most representative terms for that compo-          problem (Eq. 2) in 2-dimensional sub-problems by unfolding the
 nent. In order to visualize such topical-temporal structures repre-        tensor T in three different ways. This process called matriciza-
 sented as a stream matrix, we use the method described by Byron            tion gives rise to three modes X(1) , X(2) , X(3) . The mode-n ma-
 and Wattenberg [4], which yields a layered stream-graph visualiza-         tricization consists of linearizing all the indices of the tensor ex-
 tion.                                                                      cept n. The three resulting matrices have respectively a size of
                                                                            I ×JK,J ×IK and K ×IJ. Each element of the matrix X(i=1,2,3)
                                                                            corresponds to one element of the tensor T such that each of the
 4. MASKED NON-NEGATIVE TENSOR                                              mode contains all the values of the tensor. Due to matricization, the
    FACTORIZATION                                                           factorization problem given by Eq.1 can be reframed in factoriza-
                                                                            tion of the three modes. In other terms, maximizing the likelihood
 Problem Statement                                                          between T and JA, B, CK is equivalent to minimizing the differ-
 As explained in section 3, the tensor T ∈ RI×J×K with I tweets,            ence between each of the mode and their respective approximation
 J terms and K intervals is a natural way to represent the tweets and       in terms of A, B, C. The factorization problem (PARAFAC) in
 their contents with respect to the time. The tensor has the advan-         Eq.2 is converted to the three following sub-problems where we
 tage to directly encompass the relationship between tweets posted          added a condition of non-negativity of the three modes:
 at different hours and consequently between topics of the differ-                                                          2
 ent hours. The tensor factorization as described below allows to                            min kX(1) − A(C            B)T kF                (3)
                                                                                             A≥0
 uncover topics together with their temporal pattern.                                                                       2
                                                                                             min kX(2) − B(C            A)T kF                (4)
                                                                                             B≥0
 Before describing the process of factorization itself and its out-                                                         2
 put, one needs to introduce the concept of canonical decomposition                          min kX(3) − C(B            A)T kF                (5)
                                                                                             C≥0
 (CP). CP in 3 dimensions aims at writing a tensor T ∈ RI×J×K
 in a factorized way that is the sum of the outer product of three          where     is the Khatri-Rao product which is a columnwise Kro-
 vectors:                                                                   necker product, i.e. such that C B = [c1 ⊗ b1 c2 ⊗ b2 . . . cr ⊗ br ].
                              RT
                                                                            If C ∈ RK×R and B ∈ RJ×R , then the Khatri-Rao product
                              X                                             C B ∈ RKJ×R . In our case of study, A, B, C will give each
                        T =         ar ◦ br ◦ cr                     (1)
                                                                            access to a different information: A allows to know at which topic
                              r=1
                                                                            belongs a tweet, B gives the definition of the topics with respect to
 where the smallest value of RT for which this relation exists, is          the features and C gives the temporal activity of each topic.
 the rank of the tensor T . In other words, the tensor T is ex-
 pressed with a sum of rank-1 tensors. The set of vectors a{1,2,...,R}      Several algorithms have been developped to tackle the PARAFAC
 (resp. b{1,2,...,RT } ,c{1,2,...,RT }) can be re-written as a matrix A ∈   decomposition. The two most common are one method based on
 RI×RT (resp B ∈ RJ×RT ,C ∈ RK×RT ) where each of the RT                    the projected gradient and the Alternating Least square method
 vectors is a column of the matrix. The decomposition of Eq. 1              (ALS). The first one is convenient for its ease of implementation
 can also be represented in terms of the three matrices A, B, C as          and is largely used in Singular Value Decomposition (SVD) but
 JA, B, CK. A visual representation of such a factorization, also           converges slowly. In the ALS method, the modes are deduced suc-
 called Kruskal decomposition, is displayed on Fig. 1.                      cessively by solving Eq 5. In each iteration, for each of the sub-


                                                                                                                                              5
· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
 problem, two modes are kept fixed while the third one is computed.     Non-negative Matrix Factorization
 This process is repeated until convergence. In our case, we use a      For each tensor slice Tk , we compute a non-negative factorization
 nonnegativity constraint to make the factorization better posed and    by minimizing the following error function,
 the results meaningful. One thus uses nonnegative ALS (ANLS                                                               2
 [21]) combined with a block-coordinate-descent method in order                              min kTk − W(k) H(k) kF ,                     (8)
                                                                                             W,H
 to reach the convergence faster. Each of the step of the algorithm
 needs to take into account the Karush-Kuhn-Tucker (KKT) con-           where kk is the Frobenius norm, subject to the constraint that the
 ditions to have a stationary point. Our program is based on the        values in W(k) and H(k) must be non-negative. The non-negative
 algorithm implemented by [12].                                         factorization is achieved using the projected gradient method with
                                                                        sparseness constraints, as described in [18, 10]. The factorization
 Masked Adaptation of the NTF                                           produces a matrix of left vectors W(k) ∈ RIk ×F and a matrix of
                                                                        right vectors H(k) ∈ RF ×J , where F is the number of components
 We cannot directly perform the NTF on the tensor [Tweets × Fea-
                                                                        used in the decomposition. The matrix H(k) stores the term vectors
 tures × Interval] built as explained aboved as this tensor has a
                                                                        of the extracted components at interval t. The matrix W(k) is used
 “block-disjoint” structure peculiar to the tweets. Indeed each tweet
                                                                        to calculate the strength of each extracted component, which are
 has non-zero values only at one interval because a tweet is emitted
                                                                        represented in a matrix Z ∈ RF ×K given by
 only at a given time. Each interval k has only Ik active tweets. In
 each slice Tk of the tensor, only Ik rows have meaningful values.                                       Ik
                                                                                                         X       wif
                                                                                                                    (k)
 So, we are only interested in reproducing the tensor part which con-                           zf k =                                    (9)
                                                                                                                P
                                                                                                                F
 tains the meaningful values. In order to focus on these meaningful                                      i=1         (k)
                                                                                                                   wif 0
 values, one needs to consider an adapted version of the tensor T .                                            f 0 =1
 We first consider the tensor T built as explained above. We gen-
                                                                        where zf k is the strength of factor f at interval k.
 erate a first set of matrices A, B, C which could approximate the
 tensor. At the next step, one tries to decompose a tensor T̄ where
 the values are a combination of the values of T̄ and of the values     Component Clustering
 of JA, B, CK. More exactly, this tensor has the same size than T       In order to track topics over time, we need to merge components
 and the same values than T for the rows Ik of each slice T̄k . The     into topics depending on how similar they are. Since each com-
 complementary values are given by JA, B, CK. In other terms, at        ponent is defined by a term vector, we can calculate a similarity
 each step, the tensor that we approximate is updated by:               matrix of all possible pairs of term vectors using cosine similarity.
                                                                        This matrix is fed to a standard agglomerative hierarchical cluster-
                T̄ = T  W + (1 − W)JA, B, CK                     (6)   ing algorithm, known as UPGMA [27], that at each step combines
                                                                        the two most similar clusters into a higher-level cluster. Cluster
 where  is the Hadamard product (element-wise product) and W is
                                                                        similarity is defined in terms of average linkage: that is, the dis-
 a binary tensor of the same size than T with 1-values only when the
                                                                        tance between two clusters c1 and c2 is defined as the average of
 values of T at this position are meaningful. The particular struc-
                                                                        all pair-wise distances between the children of c1 and those of c2 .
 ture of the tensor (disjoint blocks in time) could be perceived as a
 “missing values” problem in the tensor, this problem has been for
                                                                        The hierarchical clustering produces a tree that can be cut at a given
 example tackled in [23].
                                                                        depth to yield a clustering at a chosen level of detail. That is, by
                                                                        varying the threshold similarity we use for the cut we can go from a
 Concretely, the implementation is an adaptation of a Matlab pro-
                                                                        coarse-grained topical structure, with few clusters that may merge
 gram [12] which uses the Tensor Toolbox [13]. This adaptation
                                                                        unrelated topics, to a fine-grained topical structure, with many clus-
 includes the introduction of a mask (via the tensor of weight) as
                                                                        ters that may separate term vectors that otherwise could be regarded
 mentionned above and the rewriting of some operations to avoid
                                                                        as the same continuous topic over time. The cut threshold needs to
 memory issues. This point is not detailed here as it is not part of
                                                                        be chosen based on criteria that depends on the application at hand.
 the main point of the paper.
                                                                        Each choice for the cut yields a number of clusters C and a map
 Stream Matrix Construction                                             function C(r, f ) → k that associates the component index f at time
 We calculate the strength of each topic with respect to the time       interval k to a topic cluster r. This function collects all components
 by using both the information about the link between each topic        associated to cluster r in a set Crk for each interval k.
 and each tweet and about temporal pattern of the topics. These
 informations are available through A and C and the consequent          Stream Matrix Construction
 strength of a topic r on each interval of time k is given by:          When constructing the stream matrix, the number of topics R in
                                X                                       the stream matrix is given by the number of clusters generated by
                         srk =       air ∗ ckr                 (7)
                                                                        the clustering step. In order to calculate the entries srk of the re-
                                i|k
                                                                        sulting stream matrix S, we aggregate the strengths of the clustered
       P
 where i|k is a sum over the tweets indexed by i occurring at the       components. We build a stream matrix S ∈ RR×K , with R topics
 interval indexed by k. The set of elements s{r,k} with r = J1, RK      and K intervals, given by
 and k = J1, KK forms the stream matrix S. Each topic is then
 defined by a terms vector and each of this term vector is given by a
                                                                                                            X
 column of B.                                                                                      srk =            zf k                 (10)
                                                                                                           f ∈Crk
 5.   AGGLOMERATIVE NON-NEGATIVE
      MATRIX FACTORIZATION                                              Finally, we extract the term vectors that are associated to each clus-


                                                                                                                                          6
· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
                                                         (k)
 ter. Each cluster will be associated to a term vector hr ∈ RJ that        terms appear with a high weight in its term vectors, we define our
                                      (k)
 is the average of all term vectors hf associated to that cluster in       matching score based on the geometric average of the weights of
 the component clustering step.                                            the event annotated terms in the topic’s term vectors:
                                                                                                    p
                                                                                             hwi = n hw1 r hw2 r . . . hwn r             (11)

 6. ANALYSIS OF THE OLYMPICS DATA-                                         For Masked NTF, for each event, we choose the topic with the high-
                                                                           est corresponding geometric average hwi. In the agglomerative
    SET                                                                    NMF case, for each event, we choose the topic with the highest
 We now move to the analysis of the London 2012 Twitter dataset            corresponding geometric average hwi weighted by log(n) where n
 and its relation with the known schedule of the Games. We focus           is the number of components in the selected cluster. We use log(n)
 on one representative day, July 29th, during which several sport          in order to favor the selection of clusters with a higher number of
 events took place at different times and concurrently. We use both        aggregated components, otherwise the most detailed clusters which
 topic detection methods, show the signals they extract, and check         aggregates only one component are always selected. Since the Ag-
 to what extent they are capable of extracting signals that we can         glomerative NMF method produces a tree structure in which each
 understand in terms of the schedule.                                      node agglomerates a set of components and represents topic activ-
                                                                           ity, we have to calculate such matching result for each node, and
 The topic detection methods are set up as follows. For the masked         select the node for which such matching result is the highest.
 NTF method, we decompose the tensor using a fixed number of
 components, using a tolerance value of 10−4 for the stopping con-
 dition, and limiting the number of iterations to 50. For the agglom-
                                                                           Results and Observations
 erative NMF method, we decompose each interval matrix using a             At this point, we have, for each event, a topic which was selected in
 fixed number of components. We use a tolerance value of 10−4 for          each method, and the corresponding matching result. In Figure 3,
 the stopping condition, and limit the number of iterations to 20. We      we show the schedule events for the top 20 highest matching re-
 use 250 topics for the Masked NTF, and 50 components per time             sults. In the lefthand figure, we show, for each one of the top 20
 intervals in the Agglomerative NMF.                                       matching results, the topic extracted by the Masked NTF method,
                                                                           while in the righthand figure, we show the topic extracted by the
 Figure 2 shows a streamgraph representation of time-varying top-          Agglomerative NMF method. The results are sorted by the corre-
 ics extracted using the two methods we have discussed. Two global         sponding matching weight.
 activity peaks are visible in both streamgraphs: the peak at about
 2.30pm UTC was triggered when Elizabeth Armistead won the sil-            For each event, on its top left corner, we show the manually an-
 ver medal in road cycle; the peat at about 7pm UTC is driven by           notated terms used for the matching. The shaded blue area shows
 the bronze medal in 400m freestyle to Rebecca Adlington. In the           the exact interval during which the event was occurring accord-
 stream graphs, for clarity, each topic is annotated using only its        ing to the official Olympics schedule. In the same area, the solid
 topmost weighted term. This makes it difficult to assess a visual         green line represents the temporal structure of the topic with higher
 correspondence between the same topics across the two represen-           matching result according to our matching criteria. Such values
 tations, as the term with top weight may be different for the two         roughly represent the amount of activity for such topic and are nor-
 term vectors even though the vectors are overall very similar (in         malized according to the peak of activity. We show the value for
 terms of cosine similarity). On closer inspection, many precise cor-      this peak in the top right side, along with the matching results be-
 respondences can be established between the topics extracted by           tween parenthesis. In the Agglomerative NMF graph (on the right)
 the Masked NTF method and those extracted by the Agglomerative            we show as a dotted line the activity in time for the given terms
 NMF method: for example, the topic armistead in the top stream-           regarding the number of tweets that have such terms (tweet count).
 graph matches the topic congratulation in the bottom one. An inter-       We remark that by considering only the dotted line the timing of
 active streamgraph visualization of the London 2012 Twitter data-         many events on the right side of the figure does not match the
 set is available at http://www.datainterfaces.org/projects/emoto/.        schedule timings, i.e., merely counting tweets is not sufficient at
                                                                           this resolution level. We also measured the number of tweets where
                                                                           the terms are co-ocurring, and in this case the number of tweets is
 6.1    Comparison with the Olympics Schedule                              so small that it does not allow the detection of any structure in time.
 Event Selection
 In order to show the possible correspondence between the extracted        We evaluated these activity profiles using the CrowdFlower Web-
 topics and sport events, we manually annotate the schedule col-           based crowdsourcing platform (restricted to Amazon Mechanical
 lected from the official London 2012 Olympics page for July 29th,         Turk workers). Each work unit asks a worker to visually inspect
 2012. As the number of events in a day can be substantial and we          and compare two timelines: the one to be evaluated, and a refer-
 want to focus on events with higher impact on social media, we            ence timeline corresponding to the known time intervals for sport
 retain events that are either finals or team sports match. We anno-       events taken from the Olympic schedule. Each work unit looks
 tate each event with a set of at most three terms extracted from the      like a row from Figure 3. Our evaluation was based on 100 work
 schedule, as described in Section 3. For a team sport, we use the         units evenly distributed among 5 types: 1) (NMF) work units based
 sport name and the countries of the two teams, otherwise, we put          on the results of Agglomerative NMF; 2) (CNT-NMF) work units
 the name of the sport and its characteristics, e.g., the discipline for   with activity profiles generated by simply counting the number of
 swimming.                                                                 tweets with the terms used in matching the NMF topics; 3) (NTF)
                                                                           work units from the Masked NTF approach; 4) (CNT-NMF) same
 Matching Topics and Events                                                as (CNT-NMF) for Masked NTF; 5) synthetic work units (“gold”
 For each event, we use a matching criteria to select one of the ex-       units) used to assess worker quality. For each work unit, we asked
 tracted topics from each of the set of topics produced by the meth-       the workers whether the two timelines matched exactly (Yes), mat-
 ods. Since we want to select a topic in which all event annotated         ched partially (Partially) or not at all (No). 95% of the judgments


                                                                                                                                              7
· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
               Masked NTF
                                                                                                                lizzie

                                                                                                                                                                                                 beckadlington
                                                                                                            armitstead                                                                                  gb, team
                          teamgb, judo        seat, sponsor                                                    radcliffe
                                                                                                     seat
                                                                  seat, ticket                                                                                                                   teamgb, giggssturridge
                                          teamgb, volleyball                                      watch
                                                                                                             teamgb, medal                           good                               sport
                                                                                                                                                                                                                           mail
                                                                                                                                                                                                                                     teamgb, volleyball

                                                                                                                                                                                                                       football         athlete
                                                                                                                   race        medal, armitstead
                                                                                                                                                   support          volleyball                     bronze                            ceremony
                                                                                                                medal, teamgb road, cycle                                      teamgb, hockey
                                                                                                                                                                    teamgb, silver
                                                                                                                               teamgb, gymnastics                                                    becky teamgb, support
                                                                                                                                                                                                                     beckadlington

                                                                                                                                                                                                                         swim
                                                                                                                                                                                                  swim, liamtancock
                                                                                                                                                                                                       home
                                                                                                                                                                                                       800, gb


                                                                                                                                                                                                       woman

               Agglomerative NMF                                                                                 teamgb
                                                                                                                                                                                                       bronze
                                                                                                                                                                                                       becky
                                                                                                                  paula                                             olympic game                      rebecca
                                                                                                                                                                                                                            swim
                                                                                                                                                                                                     beckadlington
                                                                                        centre                woman road                                                     olympic football


                                                                                                                                                                                                 medal
                                               sponsor
                        seat, corporate                  seb coe say                                                                                               woman beach volley    win


                                                                                                            congratulation
                                                                       olympic, sport
                                                                                         seat
                 swim        team                                                                                                                                                                       win
                                                                                                                                                                                                     gratitude            support         sure
                                                                                                                                                    l_armitstead
                                                                                                                                   run            tomdaley1994                                     liamtancock             box
                                                                   olympic, open                 olympic


                                                                                                                come lizzy
                                                                                                               medal teamgb
                                                                                                                                  synchronise
                                                                                                                                                                                                beckadlington
                                                                                                                medal silver


                                                                                                                                                                                                   swim
                                                                                                                                                                                                     represent


               08:00                                                                    12:00                                                    16:00                                                               20:00

 Figure 2: Streamgraph visualization of the stream matrices generated by the Masked NTF (top) using 50 topics, and by the Ag-
 glomerative NMF (bottom) using 20 components per interval and a total of 150 clusters. Interactive streamgraph visualizations for
 a few use cases are available at http://www.datainterfaces.org/2013/06/twitter-topic-explorer/.


 for gold work units were correct. We only retained those users who                                                                             poral and topical information about the event schedule, pointing to
 correctly judged more than 80% of the gold units. Figure 4 shows                                                                               more sophisticated uses of Twitter as a social sensor.
 the distribution of judgements for the different types of work units.
 The left hand side of the figure shows the distribution obtained for                                                                           The work described here can be extended along several directions.
 all work units, while the right hand side shows the distribution re-                                                                           It would be interesting to develop and characterize on-line versions
 stricted to work units with more than 80% of agreement across dif-                                                                             of the techniques we used here, so that topic emergence and trend
 ferent workers. According to this evaluation, both NTF and NMF                                                                                 detection could be carried out on live microblog streams. Because
 outperform the count-based methods.                                                                                                            of its temporal segmentation, the Agglomerative NMF case lends
                                                                                                                                                itself rather well to on-line incremental computation, whereas a dy-
 We see that for most of the events there is a close temporal align-                                                                            namic version of the Masked NTF technique would be more chal-
 ment between the event schedule and the topic structure, at the scale                                                                          lenging to achieve.
 of the hour or less. We see that such temporal alignment is much
 closer than when compared to the peaks of activity generated by                                                                                Another interesting direction for future research would be to aug-
 counting tweets.                                                                                                                               ment the tweet-term-time tensor with a fourth dimension represent-
                                                                                                                                                ing the location of the users, so that the latent signals we extract
 We observe that the mismatches in the temporal alignment are caused                                                                            could expose correlation between topics, time intervals and loca-
 by two different factors. The first one is due to a low matching re-                                                                           tions, exposing geographical patterns of collective attention and
 sults, like the event annotated with (football, mexico, gabon). It                                                                             their relation to delays, e.g., in the seeding by mass media across
 means that the term vectors for the given topic does not represent                                                                             different countries.
 with high confidence the terms used to annotate the event. The sec-
 ond one is due to a different behaviour in collective attention. This
 happens for example in the case of swimming events, where the                                                                                  Acknowledgements
                                                                                                                                                The Authors acknowledge the Emoto project www.emoto2012.org and its
 first part of the event is related to eliminatories and the second part
                                                                                                                                                partners for access to the Twitter dataset on the London Olympics 2012. The
 is related to the finals. In such cases, the peak in activity arrives                                                                          Authors acknowledge inspiring discussions with Moritz Stefaner and Bruno
 when the event finishes and the attention goes to the winner.                                                                                  Goncalves. The Authors aknowledge partial support from the Lagrange
                                                                                                                                                Project of the ISI Foundation funded by the CRT Foundation, from the Q-
                                                                                                                                                ARACNE project funded by the Fondazione Compagnia di San Paolo, and
 7.    SUMMARY AND FUTURE WORK                                                                                                                  from the FET Multiplex Project (EU-FET-317532) funded by the European
 The topic detection techniques we discussed here afford tracking                                                                               Commission.
 the attention that a community of users devotes to multiple con-
 current topics over time, teasing apart social signals that cannot
 be disentangled by simply measuring frequencies of term or hash-                                                                               8.           REFERENCES
 tags occurrences. This allows to capture the emergence of topics                                                                                [1] E. Adar, D. Weld, Bershad, B.N., and S. Gribble. Why we search:
                                                                                                                                                     visualizing and predicting user behavior. In Proc. 16th intl. conf. on
 and to track their popularity with a high temporal resolution and                                                                                   World Wide Web (WWW’07), pages 161–170, 2007.
 a controllable semantic granularity. The comparison with an in-                                                                                 [2] S. Asur, B. A. Huberman, G. Szabo, and W. C. Trends in social
 dependently available schedule of real-world events shows that the                                                                                  media : Persistence and decay. In Proc. 5th Intl. Conf. on Weblogs
 response of Twitter to external driving retains a great deal of tem-                                                                                and Social Media (ICWSM), page 434, 2011.


                                                                                                                                                                                                                                                          8
· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
     swim, 400m, freestyle                              228.8 (0.535)    football, egypt, zealand                            171.4 (1.467)

     canoe                                              95.8 (0.415)     basketball, usa, france                             335.9 (1.293)

     football, spain, honduras                          157.8 (0.285)    canoe                                               477.9 (1.233)

     road cycle, medal                                  1220.3 (0.225)   road cycle, medal                                   3330.2 (1.130)

     swim, 4x100m, freestyle                            269.5 (0.212)    swim, 4x100m, freestyle                             292.3 (0.974)

     basketball, usa, france                            269.5 (0.211)    archery, korea, china                               180.9 (0.921)

     archery, korea, china                              86.2 (0.197)     football, spain, honduras                           1374.4 (0.844)

     hockey, gb, japan                                  260.0 (0.147)    football, brazil, belarus                           447.4 (0.819)

     judo, final                                        150.7 (0.123)    judo, final                                         320.3 (0.799)

     volleyball, bulgaria, gb                           97.7 (0.117)     basketball, spain, china                            323.5 (0.646)

     football, egypt, new zealand                       108.8 (0.096)    football, senegal, uruguay                          199.5 (0.594)

     football, teamgb, uae                              152.7 (0.093)    swim, 400m, freestyle                               384.9 (0.582)

     swim, 200m, freestyle                              228.8 (0.091)    hockey, gb, japan                                   584.4 (0.424)

     football, brazil, belarus                          108.8 (0.089)    football, uae, gb                                   742.9 (0.325)

     swim, breaststroke, 100m                           289.2 (0.081)    basketball, nigeria, tunisia                        136.5 (0.304)

     football, senegal, uruguay                         247.9 (0.074)    archery, japan, russia                              180.9 (0.273)

     archery, japan, russia                             282.2 (0.056)    football, japan, morocco                            742.9 (0.234)

     basketball, russia, gb                             367.3 (0.056)    basketball, brazil, australia                       115.3 (0.222)

     swim, 100m, breakstroke                            126.1 (0.049)    swim, 100m, backstroke                              352.2 (0.217)

     water polo, romania, gb                            123.0 (0.036)    football, mexico, gabon                             628.2 (0.214)

    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

                            (a) Masked NTF                                                   (b) Agglomerative NMF

 Figure 3: The top 20 most representative schedule events regarding the matching weight of its annotations with the term vectors
 of an extracted topic. In the left, for each event, we show the topic extracted by the Masked NTF method for which the matching
 weight is the highest, and in the right we show the topic extracted by the Agglomerative NMF method for which the matching weight
 is the highest. Since we are showing the topmost 20 schedule events regarding the matching weight, the events are sorted by such
 matching weight. On the top left corner of each event, we show its annotated terms, along with the exact interval in which the event
 happened according to the official Olympics schedule (shaded blue area). The solid green line shows the temporal structure of the
 topic with higher matching weight along the 24 hours of July 29. The values in the top right side shows the value for the peak of the
 temporal structure, which roughly represents the amount of activity for such topic, and, between parenthesis, the matching weight
 for the given topic. In the Agglomerative NMF side (on the right) we show as a dotted line the activity in time for the given terms
 regarding the number of tweets that have such terms (tweet count).


                                                                                                                                              9
· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
                    NMF
                    NTF                                                                                                                          No
                                                                                                                                                 Partially
              CNT-NMF                                                                                                                            Yes
               CNT-NTF
                        0       10      20     30     40     50          60          0      10    20      30      40    50    60
                                       Percentage of answers                             Percentage of answers ( ≥80% agreement)

 Figure 4: Crowdsourced evaluation of the topical activity profiles for selected sport events (see main text) of the London 2012
 Olympics dataset obtained by using the different topic detection methods.


  [3] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock                       [20] M. Naaman, H. Becker, and L. Gravano. Hip and trendy:
      market. Journal of Computational Science, 2(1):1 – 8, 2011.                                Characterizing emerging trends on twitter. J. Am. Soc. Inf. Sci.,
  [4] L. Byron and M. Wattenberg. Stacked graphs–geometry & aesthetics.                          62:902–918, 2011.
      Visualization and Computer Graphics, IEEE Transactions on,                            [21] P. Paatero and U. Tapper. Positive matrix factorization: A
      14(6):1245–1252, 2008.                                                                     non-negative factor model with optimal utilization of error estimates
  [5] J. Carroll and J.-J. Chang. Analysis of individual differences in                          of data values. Environmetrics, 5(2):111–126, 1994.
      multidimensional scaling via an n-way generalization of                               [22] J. Ratkiewicz, F. Menczer, S. Fortunato, A. Flammini, and
      "eckart-young" decomposition. Psychometrika, 35(3):283–319,                                A. Vespignani. Traffic in social media ii: Modeling bursty popularity.
      September 1970.                                                                            In SocialCom 2010: SIN, 2010.
  [6] A. Cichocki, A. H. Phan, and R. Zdunek. Nonnegative Matrix and                        [23] J.-P. Royer, N. Thirion-Moreau, and P. Comon. NonNegative 3-Way
      Tensor Factorizations: Applications to Exploratory Multi-way Data                          tensor Factorzation taking into account Possible Missing Data. In
      Analysis and Blind Source Separation. Wiley, Chichester, 2009.                             Eurasip, editor, EUSIPCO-2012, pages 1–5, Bucarest, Roumanie,
  [7] R. Crane and D. Sornette. Robust dynamic classes revealed by                               Aug. 2012. Elsevier.
      measuring the response function of a social system. PNAS,                             [24] A. Saha and V. Sindhwani. Learning evolving and emerging topics in
      105:15649, 2008.                                                                           social media: a dynamic nmf approach with temporal regularization.
  [8] F. Figueiredo, F. Benevenuto, and J. Almeida. The tube over time:                          In Proc. of the fifth ACM intl. conf. on Web search and data mining,
      Characterizing popularity growth of youtube videos. In Proc. ACM                           WSDM ’12, pages 693–702, New York, NY, USA, 2012. ACM.
      Intl. Conf. on Web Search and Data Mining (WSDM), pages                               [25] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter
      745–754, 2011.                                                                             users: real-time event detection by social sensors. In Proc. of the 19th
  [9] R. A. Harshman. Foundations of the PARAFAC procedure: Models                               intl. conf. on World wide web, WWW ’10, pages 851–860, New
      and conditions for an" explanatory" multi-modal factor analysis.                           York, NY, USA, 2010. ACM.
      UCLA Working Papers in Phonetics, 16(1):84, 1970.                                     [26] A. Shashua and T. Hazan. Non-negative tensor factorization with
 [10] P. Hoyer. Non-negative matrix factorization with sparseness                                applications to statistics and computer vision. In Proc. of the 22nd
      constraints. The Journal of Machine Learning Research,                                     intl. conf. on Machine learning, ICML ’05, pages 792–799, New
      5:1457–1469, 2004.                                                                         York, NY, USA, 2005. ACM.
 [11] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury. Twitter power:                     [27] R. R. Sokal and C. D. Michener. A statistical method for evaluating
      Tweets as electronic word of mouth. Journal of the American Society                        systematic relationships. University of Kansas Scientific Bulletin,
      for Information Science & Technology, 60(11), 2009.                                        28:1409–1438, 1958.
 [12] J. Kim and H. Park. Fast nonnegative tensor factorization with an                     [28] J. Sun, D. Tao, and C. Faloutsos. Beyond streams and graphs:
      active-set-like method. In M. W. Berry, K. A. Gallivan,                                    dynamic tensor analysis. In Proc. of the 12th ACM SIGKDD intl.
      E. Gallopoulos, A. Grama, B. Philippe, Y. Saad, and F. Saied, editors,                     conf. on Knowledge discovery and data mining, KDD ’06, pages
      High-Performance Scientific Computing, pages 311–326. Springer                             374–383, New York, NY, USA, 2006. ACM.
      London, 2012.                                                                         [29] T. Van de Cruys. A non-negative tensor factorization model for
 [13] T. G. Kolda and B. W. Bader. Tensor decompositions and                                     selectional preference induction. In Proc. of the Workshop on
      applications. SIAM Rev., 51(3):455–500, Aug. 2009.                                         Geometrical Models of Natural Language Semantics, GEMS ’09,
 [14] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social                           pages 83–90, Stroudsburg, PA, USA, 2009. Association for
      network or a news media? WWW ’10 Proc. of the 19th intl. conf. on                          Computational Linguistics.
      World wide web, page 591, Feb 2010.                                                   [30] Y. Wang and E. Agichtein. Temporal latent semantic analysis for
 [15] D. Laniado and P. Mika. Making sense of twitter. In Semantic Web -                         collaboratively generated content: preliminary results. In Proc. of the
      ISWC, volume 6469, pages 470–485, 2010.                                                    34th intl. ACM SIGIR conf. on Research and development in
 [16] J. Lehmann, B. Gonçalves, J. J. Ramasco, and C. Cattuto. Dynamical                         Information Retrieval, SIGIR ’11, pages 1145–1146, New York, NY,
      classes of collective attention in twitter. In Proc. of the 21st intl. conf.               USA, 2011. ACM.
      on World Wide Web, WWW ’12, pages 251–260, New York, NY,                              [31] F. Wu and B. A. Huberman. Novelty and collective attention. Proc.
      USA, 2012. ACM.                                                                            Nat. Acad. Sci., 104:17599, 2007.
 [17] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the                    [32] J. Wu, K. M. Thornton, and E. N. Efthimiadis. Conversational
      dynamics of the news cycle. In Proc. of the 15th ACM SIGKDD intl.                          tagging in twitter. In Proc. of the 21st ACM conf. on Hypertext and
      conf. on Knowledge discovery and data mining, page 497, 2009.                              Hypermedia, pages 173–178, 2010.
 [18] C. Lin. Projected gradient methods for nonnegative matrix                             [33] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what
      factorization. Neural computation, 19(10):2756–2779, 2007.                                 to whom on twitter. WWW 2011, pages 1–10, Feb 2011.
 [19] D. Mocanu, A. Baronchelli, N. Perra, B. Gonçalves, Q. Zhang, and                      [34] J. Yang and J. Leskovec. Patterns of temporal variation in online
      A. Vespignani. The twitter of babel: Mapping world languages                               media. In Proc. of the fourth ACM intl. conf. on Web search and data
      through microblogging platforms. PloS one, 8(4):e61981, 2013.                              mining, pages 177–186, 2011.


                                                                                                                                                                    10
· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014