Event Detection using Images of Temporal Word Patterns

Yunli Wang (Yunli.Wang@nrc-cnrc.gc.ca)
Cyril Goutte (Cyril.Goutte@nrc-cnrc.gc.ca)
Multilingual Text Processing, National Research Council Canada, Ottawa ON, Canada

Copyright © National Research Council Canada, 2019. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25-July-2019, published at http://ceur-ws.org.

Abstract

Detecting events from social media requires dealing with noisy sequences of user-generated text. Previous work typically focuses either on semantic patterns, using e.g. topic models, or on temporal patterns of word usage, e.g. using wavelet analysis. In this study, we propose a novel method to capture the temporal patterns of word usage on social media by transforming time series of word occurrence frequency into images, and clustering those images using features extracted with the convolutional neural network ResNet. The clusters are then ranked by burstiness, and the top-ranked clusters are identified as detected events. Words in the clusters are also filtered using co-occurrence similarity, in order to identify the most representative words describing each event. We test our approach on one Instagram and one Twitter dataset, and obtain a precision of up to 80% on the top five detected events for both datasets.

1 Introduction

Social media are a rich source of news data, and often report events in a more timely manner than traditional media. Event detection from social media is quite challenging because of the noisy nature of the data. We adopt the definition of event from Hasan et al. (2018): an event, in the context of social media, is an occurrence of interest in the real world which instigates a discussion of event-associated topics by various users of social media, either soon after the occurrence or, sometimes, in anticipation of it. Approaches to event detection can be classified according to event type: specified or unspecified (Farzindar and Khreich, 2015). For specified event detection, some information such as the time, type or description of the target events is known beforehand, for example when detecting earthquakes (Sakaki et al., 2010). We focus on unspecified event detection, for which no prior information is available.

Previous work on unspecified event detection typically uses topic modeling or signal processing approaches. Topic modeling methods are able to discover topics based on semantic similarities between words in an unsupervised way (Pozdnoukhov and Kaiser, 2011; Chae et al., 2012; Zhou and Chen, 2014), but the temporal similarity between words is not captured. Signal processing methods such as wavelet analysis pay more attention to the temporal correlation between words (Weng and Lee, 2011; Li et al., 2012; Schubert et al., 2014), but ignore the semantic similarity. One key challenge of unspecified event detection from social media data is to filter out noisy messages unrelated to actual events.

In recent years, deep learning approaches have revolutionized image processing, speech recognition, and most of Natural Language Processing. Convolutional Neural Networks (CNNs) have become the leading architecture for many image processing, classification, and detection tasks. The features extracted by CNNs have been shown to provide impressive baselines for various computer vision tasks (Oquab et al., 2014; Sharif Razavian et al., 2014).
CNNs were also used for specified event detection: Lee et al. (2017) used CNNs in unsupervised feature learning and supervised classification to detect adverse drug events from tweets; Bischke et al. (2016) used visual features extracted from images by X-ResNet (an extension of ResNet, He et al. (2016)) together with metadata features to detect flood events from satellite images.

We introduce the novel idea of transforming word usage patterns into images, then using features extracted from those images to detect unspecified events from social media. The event detection problem is thus addressed as an image clustering task, by transforming the time series of word occurrences into images. We adopt the deep learning model ResNet to extract features from these images, identify clusters based on those features, and rank the clusters by burstiness. Our experiments show that the performance of our system is robust across different parameter settings.

2 Method

Our method includes four steps: transforming time series of word frequencies into images; clustering those images; ranking clusters by burstiness; and filtering the words in each cluster by co-occurrence similarity. We name this proposed approach Image Co-occurrence Event detection (ICE), as it relies on representing temporal word usage by images and selecting relevant words using co-occurrence similarity. In the first step, we adopt the Gramian Angular Field (GAF) method (Wang and Oates, 2015) to transform the time series of the frequency of each individual word into an image. We then use ResNet to extract features from the images and k-means to cluster them. All clusters are ranked based on a burstiness measure. Finally, words in clusters are filtered to remove non-relevant words.

Figure 1: Temporal profiles of "justiceforfreddie" (top) and "prayforbaltimore" (bottom). #justiceforfreddie picks up after the death of Freddie Gray on April 19, 2015 (ℓ ≈ 75), with several spikes during the subsequent demonstrations. #prayforbaltimore peaks on the day of Mr Gray's burial and the following days.

Figure 2: GAF images for "justiceforfreddie" (left) and "prayforbaltimore" (right).

2.1 Transforming time series into images

Given a dataset of messages with time stamps, we first split the time range into T time intervals ℓ = 1, ..., T and merge all messages in time interval ℓ into one document. For each of the N unique words in the dataset, we build a temporal profile by counting the frequency of the word in each interval. This produces N time series of size T (Fig. 1), resulting in an N × T matrix of temporal profiles. Each row of the matrix is the time series w_1, w_2, ..., w_T of one word W.

GAF turns each time series into an image by first rescaling the time series into [−1, 1]:

$$ w'_i = \frac{(w_i - W_{\max}) + (w_i - W_{\min})}{W_{\max} - W_{\min}} \qquad (1) $$

with W_max and W_min the maximum and minimum of the time series. Then the temporal correlation within time intervals is represented as a T × T image:

$$ G_W = \begin{pmatrix}
\langle w'_1, w'_1 \rangle & \cdots & \langle w'_1, w'_T \rangle \\
\langle w'_2, w'_1 \rangle & \cdots & \langle w'_2, w'_T \rangle \\
\vdots & \ddots & \vdots \\
\langle w'_T, w'_1 \rangle & \cdots & \langle w'_T, w'_T \rangle
\end{pmatrix} \qquad (2) $$

where $\langle w'_i, w'_j \rangle = w'_j \sqrt{1 - {w'_i}^2} - w'_i \sqrt{1 - {w'_j}^2}$ is a signed dissimilarity, representing the angular dissimilarity of the time series (see Wang and Oates (2015) for details). G_W is a T × T image representing the time series for one word. For example, the GAF images of "justiceforfreddie" and "prayforbaltimore" are shown in Figure 2: although specific to each word, they both show activity in the 100-120 region. GAF preserves the temporal dependency by containing the relative correlation between different time intervals, G_{i,j}.
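To make this construction concrete, here is a minimal NumPy sketch of Eqs. (1)-(2). It is an illustration under our own naming (gaf_image is not a function from the paper's code), and the guard for constant profiles is our addition:

```python
import numpy as np

def gaf_image(series):
    """GAF image of one word's temporal profile, following Eqs. (1)-(2).

    series: 1-D array of length T with the word's counts per time interval.
    Returns a T x T matrix of signed angular dissimilarities.
    """
    w = np.asarray(series, dtype=float)
    w_min, w_max = w.min(), w.max()
    if w_max == w_min:                      # constant profile: empty image
        return np.zeros((len(w), len(w)))
    # Eq. (1): rescale the time series into [-1, 1]
    w_scaled = ((w - w_max) + (w - w_min)) / (w_max - w_min)
    root = np.sqrt(np.clip(1.0 - w_scaled ** 2, 0.0, None))
    # Eq. (2): <w'_i, w'_j> = w'_j sqrt(1 - w'_i^2) - w'_i sqrt(1 - w'_j^2)
    return np.outer(root, w_scaled) - np.outer(w_scaled, root)

# Example: a flat profile with one burst around intervals 100-120
profile = np.ones(240)
profile[100:120] = 25
image = gaf_image(profile)
print(image.shape)   # (240, 240)
```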
2.2 Clustering images

After all word time series are represented as GAF images, we use ResNet (He et al., 2016) to extract features from the images, and cluster all images into C clusters using k-means. ResNet is a deep convolutional neural network; we apply ResNet-50, pre-trained on ImageNet, to extract features from the GAF images. k-means is then used to generate clusters of words. Words in the same cluster have similar GAF images and, therefore, similar temporal patterns. For instance, the GAF images of "justiceforfreddie" and "prayforbaltimore" (Fig. 2) end up in the same cluster.
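The clustering step could be sketched as follows, assuming the GAF images are stacked in an (N, T, T) array. The resizing to 224x224, the replication of the single channel to three channels, and the omission of ImageNet normalization are simplifying assumptions on our part; torchvision's pre-trained ResNet-50 with its classification head replaced by an identity stands in for the feature extractor described above:

```python
import numpy as np
import torch
from torchvision import models
from sklearn.cluster import KMeans

def resnet_features(gaf_images, batch_size=64):
    """Extract ResNet-50 features (ImageNet pre-trained) for a stack of GAF images.

    gaf_images: NumPy array of shape (N, T, T).
    Returns an (N, 2048) matrix of features from the penultimate layer.
    """
    model = models.resnet50(pretrained=True)   # newer torchvision uses weights=...
    model.fc = torch.nn.Identity()             # drop the classification head
    model.eval()
    feats = []
    with torch.no_grad():
        for start in range(0, len(gaf_images), batch_size):
            batch = torch.tensor(gaf_images[start:start + batch_size],
                                 dtype=torch.float32)
            batch = batch.unsqueeze(1).repeat(1, 3, 1, 1)      # 1 channel -> 3 channels
            batch = torch.nn.functional.interpolate(
                batch, size=(224, 224), mode="bilinear", align_corners=False)
            feats.append(model(batch).numpy())
    return np.vstack(feats)

# Cluster words by the visual similarity of their GAF images
features = resnet_features(gaf_images)                   # gaf_images: (N, T, T) array
labels = KMeans(n_clusters=100, random_state=0).fit_predict(features)
```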
2.3 Ranking clusters by burstiness

After words are grouped into clusters, we use burstiness to rank all clusters. We use the DF-IDF score of words to measure burstiness. DF-IDF scores of words are significantly higher during a time interval that covers an event than during other time intervals, so we expect the DF-IDF score of a word to peak during the event and to be low and stable the rest of the time. The DF-IDF score for cluster C at interval ℓ is

$$ s_C(\ell) = \frac{N_C(\ell)}{N(\ell)} \, \log \frac{\sum_{i=1}^{T} N(i)}{\sum_{i=1}^{T} N_C(i)} \qquad (3) $$

where N_C(ℓ) is the number of words from cluster C used in messages from time interval ℓ, summed over all messages and divided by the number of words in C, and N(ℓ) is the number of messages in time window ℓ. The burstiness of cluster C is given by

$$ B(C) = \frac{\sigma_s(C) - \mu_s(C)}{\sigma_s(C) + \mu_s(C)} \qquad (4) $$

where μ_s (resp. σ_s) is the average (resp. standard deviation) of s_C(ℓ) over ℓ. The burstiness index is bounded between −1 and +1, and its magnitude correlates with the signal's burstiness, as bursty signals have a large standard deviation relative to their average (Goh and Barabási, 2008).
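A small NumPy illustration of Eqs. (3)-(4) is given below; variable names such as word_counts, messages_per_interval and clusters are assumed inputs for the sketch, not objects defined in the paper:

```python
import numpy as np

def burstiness(cluster_words, word_counts, messages_per_interval):
    """Burstiness index of one cluster, following Eqs. (3)-(4).

    cluster_words: list of words in cluster C.
    word_counts: dict mapping a word to its length-T array of counts per interval.
    messages_per_interval: length-T array of N(l), the number of messages per
        interval (assumed nonzero everywhere).
    """
    # N_C(l): counts of the cluster's words per interval, averaged over the cluster
    n_c = np.mean([word_counts[w] for w in cluster_words], axis=0)
    n = np.asarray(messages_per_interval, dtype=float)
    s = (n_c / n) * np.log(n.sum() / n_c.sum())   # Eq. (3): DF-IDF score per interval
    mu, sigma = s.mean(), s.std()
    return (sigma - mu) / (sigma + mu)            # Eq. (4): burstiness in [-1, 1]

# Rank clusters (a dict cluster_id -> list of words) by decreasing burstiness
ranked = sorted(clusters,
                key=lambda c: burstiness(clusters[c], word_counts, messages_per_interval),
                reverse=True)
```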
2.4 Filtering clusters by word co-occurrence similarity

Each cluster contains words with similar temporal patterns, but these words might discuss different topics. We therefore use co-occurrence similarity to represent the similarity between words at the document level: words associated with the same event are more likely to be used together. We measure the co-occurrence similarity O_wm between words w and m as the cosine similarity of two sparse vectors δ_w and δ_m, where the kth element of δ_w is 1 if w appears in message k, and 0 otherwise:

$$ O_{wm} = \frac{\sum_k \delta_{wk}\,\delta_{mk}}{\|\delta_w\|\,\|\delta_m\|}, \qquad \|\delta_\cdot\| = \sqrt{\sum_k \delta_{\cdot k}^2} \qquad (5) $$

We further remove noisy words using hierarchical clustering on the co-occurrence similarity matrix (a sketch follows the list):

• Run hierarchical clustering using the co-occurrence similarity matrix O = [O_wm];
• Cut the resulting hierarchy;
• Extract the cluster with the maximum number of words as the filtered cluster.
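A minimal sketch of this filtering step using SciPy is shown below. The binary word-message matrix, the average linkage and the cut height are our assumptions, since the paper does not specify the linkage criterion or the threshold used to cut the hierarchy:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def filter_cluster(words, word_message_matrix, cut_height=0.9):
    """Keep the largest co-occurrence-coherent subset of a bursty cluster.

    words: list of words in the cluster.
    word_message_matrix: binary array of shape (len(words), n_messages); entry
        (w, k) is 1 if word w appears in message k (the delta vectors of Eq. (5)).
        Assumes every word appears in at least one message.
    cut_height: cosine-distance threshold at which the hierarchy is cut.
    """
    d = word_message_matrix.astype(float)
    norms = np.linalg.norm(d, axis=1, keepdims=True)
    sim = (d @ d.T) / (norms * norms.T)          # Eq. (5): co-occurrence similarity O_wm
    dist = np.clip(1.0 - sim, 0.0, None)         # turn similarity into a distance
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=cut_height, criterion="distance")
    keep = np.bincount(labels).argmax()          # branch with the most words
    return [w for w, lab in zip(words, labels) if lab == keep]
```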
3 Experiments and Results

We tested our method on two social media datasets: the Baltimore dataset, collected from Instagram, and the Toronto dataset, collected from Twitter. To detect unspecified events, all messages posted within the geographical boundary of these two cities were collected during a fixed time period.

3.1 Datasets

The Baltimore dataset contains 385,595 Instagram messages collected in Baltimore, MD, USA from April 1 to May 31, 2015. After removing all non-ASCII characters, URLs, mentions of Instagram users (@username), stop words, and words with character patterns repeated more than twice (e.g. "booo", "hahahaaa"), 358,458 messages and 218,281 unique words are left. The Toronto dataset contains 312,836 Twitter messages collected in Toronto from May 17 to May 31, 2018. After removing stop words, 231,773 messages and 81,351 unique words are left.

3.2 Detected events in the Baltimore dataset

In the Baltimore dataset, we generate the time series for each individual word as its occurrence frequency within six-hour time windows. Since rare words are not likely to be associated with any event, we remove words that appear less than 40 times over all 240 time points, leaving 8,392 unique words. We then transform these 8,392 time series into images, and generate 100 clusters using k-means on the features extracted from these images by ResNet. The 100 clusters are ranked by burstiness, and the top 10 clusters are selected and filtered using hierarchical clustering. Words representing these 10 clusters are listed in Table 1.

In the Baltimore dataset, the major events are related to the 2015 Baltimore protests. They appear in two clusters that correspond to several subsequent events related to the major event. A few local music and culture events are detected as well.

Table 1: Events detected by ICE on the Baltimore dataset from 100 clusters.

| Words | Event | Date | Burst. |
|---|---|---|---|
| tigolebitties, bigtittyclub, tittietuesday | - | - | 0.635 |
| deathmetal, blackmetal, marylanddeathfest, deathfest | 2015 Maryland Deathfest | 21-May to 24-May | 0.267 |
| baltimoretattoo, blackandgreytattoo, baltimoresblackartists | 2015 Baltimore Tattoo Arts Convention | 10-Apr to 12-Apr | 0.260 |
| eastersunday, livemusic, jazz, jesus | Easter | 01-Apr | 0.233 |
| cincodemayo, metgala | - | - | 0.219 |
| justiceforblackmeneverywhere, equalrights, bmorestrong, justicefreddie, baltimoreprotests, baltimoreuprising | 2015 Baltimore protest | 18-Apr to 03-May | 0.106 |
| prayforbaltimore, mondawmin, violence, freddiegrey | Violence, students vs. police, Mondawmin Mall | 27-Apr | 0.090 |
| loyolamaryland, graduate, afterpartytoallpartys | Loyola University Maryland graduation | 19-May | 0.077 |
| nationalsiblingsday, baltimoretattooconvention | - | - | 0.069 |
| onedirection, imagine, harry, zayn, louis, sauce, fans, london | One Direction band tour, London | 12-Apr | 0.066 |

3.3 Detected events in the Toronto dataset

In the Toronto dataset, the time series of word occurrence are obtained using one-hour time windows. After removing words that appear less than 20 times over all time windows, 7,095 unique words and 264 time points are left. Similarly, we generated 200 clusters from the images of the 7,095 words, and the top 10 ranked and filtered clusters are shown in Table 2.

In the Toronto data, several entertainment, sports and political events are detected. The detected events reflect users' interests on social media in these geographical regions.

Table 2: Events detected by ICE on the Toronto dataset from 200 clusters.

| Words | Event | Date | Burst. |
|---|---|---|---|
| rebuttle, shaw, huxley | - | - | 0.327 |
| kindermorgan, pipeline | Kinder Morgan pipeline | 29-May | 0.261 |
| royalwedding, meghanandharry | Royal wedding | 19-May | 0.256 |
| nobel, pinterest | - | - | 0.228 |
| uclfinal, liverpool, championsleaguefinal | UCL final game | 26-May | 0.220 |
| stanleycupplayoffs, goldenknights, savelucifer | 2018 Stanley Cup game; Fox's cancellation of the "Lucifer" show | 28-May | 0.210 |
| roseanne's, cancelling | Roseanne show cancellation | 29-May | 0.173 |
| terryfoxrun | - | - | 0.116 |
| fireworks, victoriaday | Victoria Day | 21-May | 0.106 |
| trudeaumustgo, saveontario | 2018 Ontario General Election | 26-May | 0.055 |

3.4 Performance on the Baltimore and Toronto datasets

Reference events happening in Baltimore (Apr-May 2015) and Toronto (May 2018) are not available, therefore recall cannot be computed. As a consequence, we use precision as the performance measure for event detection, which is consistent with a number of other studies (Farzindar and Khreich, 2015). We use precision on the top-ranked detected events to evaluate the performance of ICE, measuring precision at the top five (P@5) and top ten (P@10) events for a range of 50 to 1000 clusters on the Baltimore and Toronto datasets (Tables 3 and 4). Both datasets reach a top-5 precision of 80% and a top-10 precision of 70%, which indicates that ICE is effective at detecting events from noisy social media messages. Performance decreases when the number of clusters increases. When there are fewer clusters, each cluster tends to be larger and more likely to contain event-related words; such clusters are also noisier and may contain mixed events (Tab. 1). On the other hand, with more clusters, clusters are smaller and do not contain mixed events.

Table 3: Performance of ICE on the Baltimore dataset.

| #clusters | avg #words in cluster | avg #words after filtering | P@5 (%) | P@10 (%) |
|---|---|---|---|---|
| 50 | 161.0 | 134.8 | 80 | 70 |
| 100 | 92.8 | 74.0 | 60 | 70 |
| 200 | 40.7 | 26.7 | 40 | 50 |
| 1000* | 9.6 | 9.6 | 40 | 50 |

* Clusters with less than 11 words are not filtered.

Table 4: Performance of ICE on the Toronto dataset.

| #clusters | avg #words in cluster | avg #words after filtering | P@5 (%) | P@10 (%) |
|---|---|---|---|---|
| 100 | 74.2 | 63.5 | 80 | 70 |
| 200 | 34.9 | 26.0 | 60 | 70 |
| 300* | 27.1 | 20.7 | 60 | 60 |

* Clusters with less than 11 words are not filtered.

4 Discussion

4.1 Transfer learning

Although ResNet has been used for transfer learning in many image tasks, transforming time series of words into images and using ResNet for event detection is novel as far as we know. Our work differs from Bischke et al. (2016): they used X-ResNet to extract visual features from satellite images for specified event detection, whereas we transform word occurrence frequency into images and use ResNet to extract visual features for unspecified event detection. By adopting the tool for transforming time series into images, we make it possible to use a state-of-the-art deep neural network architecture for image recognition. We also tested the use of reduced-size Piecewise Aggregation Approximation (PAA) images (Wang and Oates, 2015) as the input for clustering, but the performance on PAA images was very poor.

4.2 Word embedding

In ICE, we use the co-occurrence similarity matrix to filter non-event words in clusters. Co-occurrence similarity represents the semantic similarity between words occurring in the same document. Word embeddings have been widely used in many NLP applications. We therefore tested the combination of the temporal features extracted from images with semantic features obtained from word embeddings. We first used the 100- and 200-dimension GloVe word embeddings (Pennington et al., 2014) pre-trained on Twitter data, together with image features. The performance of the 100-dimension GloVe embeddings with image features is worse than image features alone, and using the 200-dimension GloVe embeddings does not result in any detected event (Table 5). We also trained 200-dimension fastText word embeddings (Mikolov et al., 2017) on the Baltimore dataset and combined them with image features. The results show that using fastText word embeddings trained on the Baltimore data neither hurts nor helps the overall performance. Overall, using word embeddings together with temporal features does not perform particularly well.

Table 5: Using word embeddings for clustering.

| Features | #clusters | P@5 (%) | P@10 (%) |
|---|---|---|---|
| Image+100d (GloVe) | 100 | 60 | 60 |
| Image+200d (GloVe) | 100 | 0 | 0 |
| Image+200d (fastText) | 100 | 60 | 70 |

4.3 Parameter Analysis

The parameters used in ICE are the time window ℓ and the number of clusters |C|. During the cluster filtering step, we use hierarchical clustering, and the largest branch of the clustering tree is chosen to represent events, which does not introduce additional parameters. As discussed earlier, we keep ℓ as small as possible to gain granularity of clusters. The only tuned parameter in ICE is the number of clusters |C|. As shown in Tables 3-4, increasing |C| naturally results in a decrease of the number of words in each cluster. We also observed that precision dropped as the number of clusters increased, although this effect was more pronounced on the Baltimore dataset (Table 3). Overall, these results suggest that ICE is relatively robust to mild differences in parameter settings when it comes to detecting relevant events.

5 Conclusions

Event detection from social media is a challenging task due to the noisy nature of user-generated text. In this study, we propose a novel method that transforms the time series of word occurrence frequency into images and uses ResNet to extract features from those images. The images are clustered, clusters are ranked by burstiness, and words in each cluster are filtered using the co-occurrence similarity within messages. Converting word occurrence into images allows capturing the dynamic changes in the social media environment. Clustering words with similar temporal patterns, using features extracted by the advanced convolutional neural network architecture ResNet, provides a robust method for separating real events from daily chatter on social media. The subsequent steps of ranking and filtering clusters refine the detected events.

Note that our method is not an end-to-end event detection method. End-to-end systems usually need large amounts of training samples, which are not available for unspecified event detection. For future work, we would like to explore how to combine the temporal patterns and co-occurrence patterns in images and improve the ranking of longer events.

6 Acknowledgments

We would like to thank Yuanjing Cai for writing the code for the burstiness index and co-occurrence matrix used in our method during her co-op term at NRC.
References

B. Bischke, D. Borth, C. Schulze, and A. Dengel. Contextual enrichment of remote-sensed events with social media streams. In Proc. 24th ACM intl. conf. on Multimedia, pages 1077-1081, 2016.

J. Chae, D. Thom, H. Bosch, Y. Jang, R. Maciejewski, D. S. Ebert, and T. Ertl. Spatiotemporal social media analytics for abnormal event detection and examination using seasonal-trend decomposition. In IEEE Conf. on Visual Analytics Science and Technology (VAST), pages 143-152, 2012.

A. Farzindar and W. Khreich. A survey of techniques for event detection in Twitter. Computational Intelligence, 31(1):132-164, 2015.

K.-I. Goh and A.-L. Barabási. Burstiness and memory in complex systems. Europhysics Letters, 81(4):48002, 2008.

M. Hasan, M. A. Orgun, and R. Schwitter. A survey on real-time event detection from the Twitter data stream. Journal of Information Science, 44(4):443-463, 2018. doi: 10.1177/0165551517698564.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE conf. on Computer Vision and Pattern Recognition, pages 770-778, 2016.

K. Lee, A. Qadir, S. A. Hasan, V. Datla, A. Prakash, J. Liu, and O. Farri. Adverse drug event detection in tweets with semi-supervised convolutional neural networks. In Proc. 26th Intl. Conf. on the World Wide Web, pages 705-714, 2017.

C. Li, A. Sun, and A. Datta. Twevent: segment-based event detection from tweets. In Proc. 21st ACM intl. conf. on Information and Knowledge Management, pages 155-164, 2012.

T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. Advances in pre-training distributed word representations. arXiv:1712.09405, 2017.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proc. IEEE conf. on Computer Vision and Pattern Recognition, pages 1717-1724, 2014.

J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.

A. Pozdnoukhov and C. Kaiser. Space-time dynamics of topics in streaming text. In Proc. 3rd ACM SIGSPATIAL intl. workshop on Location-Based Social Networks, pages 1-8, 2011.

T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proc. 19th intl. conf. on World Wide Web, pages 851-860, 2010.

E. Schubert, M. Weiler, and H.-P. Kriegel. SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. In Proc. 20th ACM SIGKDD intl. conf. on Knowledge Discovery and Data Mining, pages 871-880, 2014.

A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proc. IEEE conf. on Computer Vision and Pattern Recognition workshops, pages 806-813, 2014.

Z. Wang and T. Oates. Encoding time series as images for visual inspection and classification using tiled convolutional neural networks. In Workshops at the 29th AAAI Conf. on Artificial Intelligence, 2015.

J. Weng and B.-S. Lee. Event detection in Twitter. In Proc. Intl. Conf. on Web and Social Media, volume 11, pages 401-408, 2011.

X. Zhou and L. Chen. Event detection over Twitter social media streams. The International Journal on Very Large Data Bases, 23(3):381-400, 2014.