Estimating the Time between Twitter Messages and Future Events

Ali Hürriyetoğlu, Florian Kunneman, and Antal van den Bosch
Centre for Language Studies, Radboud University Nijmegen
P.O. Box 9103, NL-6500 HD Nijmegen
ali.hurriyetoglu@gmail.com, {f.kunneman,a.vandenbosch}@let.ru.nl

ABSTRACT
We describe and test three methods to estimate the remaining time between a series of microtexts (tweets) and the future event they refer to via a hashtag. Our system generates hourly forecasts. A linear and a local regression-based approach are applied to map hourly clusters of tweets directly onto time-to-event. To take changes over time into account, we develop a novel time series analysis approach that first derives word frequency time series from sets of tweets and then performs local regression to predict time-to-event from nearest-neighbor time series. We train and test on a single type of event, Dutch premier league football matches. Our results indicate that in an 'early' stage, four days or more before the event, the time series analysis produces time-to-event predictions that are about one day off; closer to the event, local regression attains a similar accuracy. Local regression also outperforms both mean and median-based baselines, but on average none of the tested systems has a consistently strong performance through time.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Spatial-Temporal Systems

General Terms
Algorithms, Performance

Keywords
Time series analysis, Event prediction, Twitter

1. INTRODUCTION
With the advent of social media, data streams of unprecedented volume have become available. These streams contain not only text, but also identity markers of the persons who generated the text, and the time at which the messages were published. The availability of massive amounts of time-stamped texts is an invitation to incorporate time series analysis methods into the natural language processing toolbox. For instance, predictive models can be built through time series analysis that estimate the likelihood and time of future events.

Our study focuses on textual data published by humans via social media about particular events. If the starting point of the event is taken as the anchor point t = 0, texts can be viewed in relation to this point, and generalizations can be made over texts at different distances in time to t = 0. The goal of this paper is to present new methods that are able to automatically estimate the time-to-event from a stream of microtext messages. These methods could serve as modules in news media mining systems (for instance, http://www.zapaday.com/) to fill upcoming event calendars. The methods should be able to work robustly in a stream of messages, and the dual goal is to make (i) reliable predictions of times-to-event (ii) as early as possible. Predicting that an event is starting imminently is arguably less useful than being able to predict its start a number of days ahead. This implies that if a method requires a sample of tweets (e.g. with the same hashtag) to be gathered during some time frame, the frame should not be too long; otherwise predictions could come in too late to be relevant.

In this paper we test the predictive capabilities of three different approaches. The first system is based on linear regression and maps sets of tweets with the same hashtag during an hour to a time-to-event estimate. The second system attempts to do the same based on local regression. The third system uses time series analysis. It takes into account more than a single set of tweets: during a certain time period it samples several sets of tweets in fixed time frames, and derives time series information from individual word frequencies in these samples. It compares these word frequency time series profiles against a labeled training set of profiles in order to find similar patterns of change in word frequencies. The method then adopts local regression: after finding a nearest-neighbor word frequency time series, the time-to-event stored with that neighbor is copied to the tested time series. With this third system, and with the comparison against the second system, we can test the hypothesis that it is useful to gather time series information (more specifically, patterns in word frequency changes) over a period of time.

This paper is structured as follows. We describe the relation of our work to earlier research in Section 2. The three systems are described in Section 3. Section 4 describes the overall experimental setup, including a description of the data, the baselines, and the evaluation method used. The results are presented and analyzed in Section 5. We conclude with a discussion of the results and future studies in Section 6.

2. RELATED RESEARCH
The growing availability of digital texts with time stamps, such as e-mails, weblogs, and online news, has spawned various types of studies on the analysis of patterns in texts over time. An early publication on the general applicability of time series analysis to time-stamped text is [2]. A more recent overview of future prediction using social media is [5]. A popular goal of time series analysis of texts is event prediction, where a correlation is sought between a point in the future and preliminary texts.

Ritter et al. train on annotated open-domain event mentions in tweets in order to create a calendar of events based on explicit date mentions and words typical of the event [3]. While we also aim to estimate the point in time at which an event will take place, our focus lies on the pattern of anticipation seen in tweets, linked to the time until the event occurs, rather than on specific time references to a future event. [4] do look at anticipation in tweets, but focus on personal activities in the very near future, while we aim to predict the time-to-event of potentially large-scale news events as early as possible.

3. METHODS
In this section we introduce the methods adopted in our study. They operate on streams of tweets, and generate hourly forecasts for the events that tweets with the same hashtag refer to. The single tweet is the smallest unit available for this task; we may also consider more than one tweet and aggregate tweets over a certain time frame. If these single tweets or sets of tweets are represented as bag-of-words vectors, the task can be cast as a regression problem: mapping a feature vector onto a continuous numeric output representing the time-to-event. In this study the smallest time unit is one hour, and all three methods work with this time frame.

3.1 Linear and local regression
In linear regression, each feature in the bag-of-words feature vector (representing the presence or frequency of occurrence of a specific word) can be regarded as a predictive variable to which a weight can be assigned that, in a simple linear function, multiplies the value of the predictive variable to generate a value for the response variable, the time-to-event. A multiple linear regression function can be approximated by finding the weights for a set of features that generate the response variable with the smallest error.

Local regression, or local learning [1], is the numeric variant of the k-nearest neighbor classifier. Given a test instance, it finds the closest k training instances based on a similarity metric, and produces a local estimate of the numeric output by taking some average of the outcomes of those k training instances.

Linear regression and local regression can be considered baseline approaches, but they are complementary. While in linear regression an overall pattern is generated to fit the whole training set, local regression only looks at local information (the characteristics of single instances). Linear regression is unfit for approximating Gaussian or other non-linear distributions; as we will see, there are reasons to believe that there are substantial differences between tweets posted in different periods of time before an event. In contrast, local regression is unbiased and will adapt to any local distribution.
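The regressors themselves are standard; the paper applies them with R and TiMBL (see Section 4.2). Purely as an illustration of the mapping from hourly bag-of-words vectors to time-to-event, and not as the authors' implementation, the two approaches can be sketched with scikit-learn on hypothetical data:

```python
# Illustrative sketch only (not the authors' implementation, which used R for
# linear regression and TiMBL for k-NN): mapping hourly bag-of-words vectors
# onto time-to-event with linear and local (k-NN) regression via scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
# Hypothetical data: one row per hour block of tweets, one column per word;
# the target is the (negative) number of hours until the event.
X_train = rng.poisson(0.5, size=(200, 300)).astype(float)
y_train = -rng.uniform(0, 192, size=200)
X_test = rng.poisson(0.5, size=(5, 300)).astype(float)

linear = LinearRegression().fit(X_train, y_train)
local = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

print(linear.predict(X_test))  # one time-to-event estimate per test hour block
print(local.predict(X_test))   # average time-to-event of the 5 nearest hour blocks
```

The contrast the paper draws is visible in this setup: the linear model commits to one global weight vector over all hours, while the k-NN regressor answers each test hour only from its most similar training hours.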
3.2 Time series analysis
Time series are data structures that contain multiple measurements of data features over time. If the values of a feature change meaningfully over time, then time series analysis can be used to capture this pattern of change. Comparing new time series with memorized time series can reveal similarities that may lead to a prediction of a subsequent value or, in our case, the time-to-event. Our time series approach extends the local regression approach by not only considering single sets of aggregated tweets in a fixed time frame (e.g. one hour in our study), but creating sequences of these sets representing several consecutive hours of gathered tweets. Using the same bag-of-words representation as the local regression approach, we find nearest neighbors of sequences of bag-of-words vectors rather than of single hour frames. The similarity between a test time series and a training time series of the same length is calculated by computing their Euclidean distance. In this study we did not further optimize any hyperparameters; we set k = 1.

The time series approach generates predictions by following the same strategy as the simple local regression approach: upon finding the nearest-neighbor training time series, the time-to-event of this training time series is taken as the time-to-event estimate of the test time series. In case of equidistant nearest neighbors, the average of their associated times-to-event is given as the prediction.
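A minimal sketch of this matching step, assuming each instance is stored as a fixed-length array of hourly word-frequency vectors (toy data and array layout are assumptions, not the authors' code):

```python
# Sketch of the nearest-neighbour time series matching described above.
import numpy as np

def predict_time_to_event(test_seq, train_seqs, train_tte):
    """test_seq: (hours, words); train_seqs: (n, hours, words); train_tte: (n,)."""
    # Euclidean distance between the test sequence and every training sequence
    dists = np.sqrt(((train_seqs - test_seq) ** 2).sum(axis=(1, 2)))
    nearest = np.isclose(dists, dists.min())
    # k = 1; equidistant nearest neighbours are averaged, as in the paper
    return train_tte[nearest].mean()

rng = np.random.default_rng(0)
train_seqs = rng.random((50, 6, 100))      # 50 training sequences of 6 hour frames
train_tte = -rng.uniform(0, 192, size=50)  # hours-to-event label per sequence
test_seq = rng.random((6, 100))
print(predict_time_to_event(test_seq, train_seqs, train_tte))
```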
4. EXPERIMENTAL SET-UP

4.1 Data collection
For this study we chose football matches as a specific type of event. They occur frequently, have a distinctive hashtag by convention ('#ajafey' for a match between Ajax and Feyenoord), and often generate a useful amount of tweets: up to tens of thousands of tweets per match. For the collection of training and test data we focused on Dutch football matches played in the Eredivisie. We harvested tweets by means of twiqs.nl, a database of Dutch tweets from December 2010 onwards. We selected the (arbitrary) top 6 teams of the league (Ajax, Feyenoord, PSV, FC Twente, AZ Alkmaar, and FC Utrecht), and queried all matches played between them in 2011 and 2012. For each query, the conventional hashtag for a match was used with a restricted search space of three weeks before the time of the match until the start time of the match, to ensure that the collected tweets were referring to that specific match, and not to an earlier match between the same home and away team and therefore carrying the same hashtag.

The queries resulted in tweets referring to 60 matches between the selected six teams in the period from January 2011 until December 2012. From these, we selected the matches with the most frequent shared starting time, Sundays at 2:30 PM, for our experiment. As we focus on the number of hours before an event, the actual time of day at which a tweet is posted (for example during the night or in the afternoon) can bias the type of tweet; with the fixed starting time this effect is neutralized. To generate training and test events that simulate a system trained on past events and tested on upcoming events, we selected tweets referring to matches played in 2011 (a calendar year comprising two halves of a football season) as training data and tweets referring to 2012 matches as test data. This resulted in 12 matches as training events (totaling 54,081 tweets) and 14 matches as test events (40,204 tweets).

The time-to-event in hours was calculated for every tweet, based on its time of posting and the known start time of the event it referred to. For this task we did not take into account tweets that were posted during and after matches. We also constrained the number of days before the event: for both training and test sets, only tweets posted within eight days before the event were kept. Although this is an artificial constraint, the eight-day window captures the vast majority, about 98%, of forward-looking tweets.
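This labelling step can be sketched as follows; the field names and data layout are hypothetical and not taken from the paper:

```python
# Hypothetical labelling step: compute the time-to-event in hours per tweet
# and keep only tweets posted within eight days before kick-off.
from datetime import datetime, timedelta

def label_tweets(tweets, match_start, max_days_before=8):
    """tweets: iterable of (posted_at: datetime, text: str) pairs."""
    labelled = []
    for posted_at, text in tweets:
        tte_hours = (posted_at - match_start).total_seconds() / 3600.0
        # discard tweets from during/after the match and beyond the 8-day window
        if -24 * max_days_before <= tte_hours < 0:
            labelled.append((text, tte_hours))
    return labelled

match_start = datetime(2012, 9, 23, 14, 30)          # Sunday, 2:30 PM kick-off
tweets = [(match_start - timedelta(days=2, hours=3), "#ajafey komt eraan!"),
          (match_start + timedelta(hours=1), "goal! #ajafey")]
print(label_tweets(tweets, match_start))             # keeps only the first tweet
```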
4.2 Generation of training and test data
The goal of the experiments was to compare systems that generate hourly forecasts of the event start time for each test event. This was done based on the information in aggregated sets of tweets within the time span of an hour. Aggregation is done by treating all training events as one collection during the extraction of features. The linear and local regression methods only operate on vectors representing hour blocks. The time series analysis approach makes use of longer sequences of six hour blocks; this number was empirically set in preliminary experiments.

The aggregated tweets were used as training instances for the linear and local regression methods. To maximize the number of training instances, we generated a sequence of overlapping instances using the minute as a finer-grained shift unit. At every minute, all tweets posted within the hour before the tweets in that minute were added to the instance.
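A sketch of this minute-shifted windowing, simplified here to iterate only over minutes in which a tweet was actually posted (the data layout is assumed, not the authors' code):

```python
# Sketch of the minute-shifted instance generation: every minute yields one
# instance containing all tweets from the preceding hour, labelled with that
# minute's time-to-event.
def sliding_hour_instances(tweets):
    """tweets: list of (tte_minutes, text) pairs, tte_minutes < 0 before the event."""
    instances = []
    for minute in sorted({m for m, _ in tweets}):
        window = [text for m, text in tweets if minute - 60 < m <= minute]
        instances.append((minute / 60.0, window))  # label the window in hours
    return instances

tweets = [(-130, "kaartje geregeld"), (-95, "#ajafey!"), (-70, "bijna zover")]
for tte_hours, window in sliding_hour_instances(tweets):
    print(round(tte_hours, 2), window)
```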
In order to reduce the feature space for the linear and local regression instances, we pruned every bag-of-words feature that occurred less than 500 times in the training set. Linear regression was applied by means of R (http://www.r-project.org/); absolute occurrence counts of features were taken into account. For local regression we made use of the k-NN implementation that is part of TiMBL (http://ilk.uvt.nl/timbl), setting k = 5, using Information Gain feature weighting, and using an overlap-based similarity metric that does not count matches on zero values (features marking words that are absent in both test and training vectors). For k-NN, binary feature values were used.

The time series analysis vectors are not filled with absolute occurrence counts, but with relative and smoothed frequencies. After having counted all words in each time frame, two frequencies are computed for each word. The first, the overall frequency of a word, is calculated as the sum of its counts in all time frames, divided by the total number of tweets in all time frames in our 8-day window. This frequency ranges between 0 (the word does not occur) and 1 (the word occurs in every tweet). The second frequency is computed per time frame for each word: the word count in that frame is divided by the number of tweets in the frame. The latter frequency is the basic element in our time series calculations.

As many time frames contain only a small number of tweets, especially the frames more than a few days before the event, word counts are sparse as well. Besides taking longer time frames of more than a single sample size, frequencies can also be smoothed through typical time series smoothing techniques such as moving average smoothing. We apply a pseudo-exponential moving average filter by replacing each word count by a weighted average of the word counts at time frames t, t-1, and t-2, where w_t = 4 (the weight at t is set to 4), w_{t-1} = 2, and w_{t-2} = 1.
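A sketch of the per-frame relative frequency and the 4-2-1 weighted smoothing; whether the weights are applied to the raw counts or to the relative frequencies is left open in the text, so applying them to the relative frequencies here is an assumption (toy data, not the authors' code):

```python
# Sketch of per-frame relative frequency plus the 4-2-1 pseudo-exponential
# moving average described above.
def smoothed_frequencies(word_counts, tweets_per_frame, weights=(1, 2, 4)):
    """word_counts, tweets_per_frame: one value per hourly frame, oldest first."""
    freqs = [c / n if n else 0.0 for c, n in zip(word_counts, tweets_per_frame)]
    smoothed = []
    for t in range(len(freqs)):
        window = freqs[max(0, t - 2):t + 1]          # frames t-2, t-1, t (where present)
        w = weights[len(weights) - len(window):]     # weights 1, 2, 4 for t-2, t-1, t
        smoothed.append(sum(wi * fi for wi, fi in zip(w, window)) / sum(w))
    return smoothed

word_counts = [0, 2, 1, 5, 3]           # occurrences of one word per hour frame
tweets_per_frame = [4, 10, 5, 20, 12]   # number of tweets in each frame
print(smoothed_frequencies(word_counts, tweets_per_frame))
```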
4.3 Evaluation and baselines
A common metric for evaluating numeric predictions is the Root Mean Squared Error (RMSE), cf. Equation 1. For all hourly forecasts made in N hour frames, the squared differences between the actual value v_i and the estimated value e_i are summed; the square root of the mean of these squared differences is then taken to produce the RMSE of the prediction series.

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (v_i - e_i)^2} \qquad (1)

We computed two straightforward baselines derived from the training set: the median and the mean of the time-to-event over all training tweets. For the median baseline, all tweets in the training set were ordered in time and the median time was identified. As we use one-hour time frames throughout our study, we round the median to the one-hour time frame it is in, which turns out to be -3 hours. The mean is computed by averaging the time-to-event of all tweets, again rounded to the hour. The mean is -26 hours.
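Equation 1 and the two baselines are simple enough to express in a few lines; the numbers below are toy values, not the study's data:

```python
# Sketch of the evaluation in Equation 1 and the two training-set baselines.
import math
import statistics

def rmse(actual, estimated):
    """Root Mean Squared Error over paired hourly forecasts (Equation 1)."""
    return math.sqrt(sum((v - e) ** 2 for v, e in zip(actual, estimated)) / len(actual))

# Baselines: the (hour-rounded) median and mean time-to-event over all training tweets.
train_tte = [-3, -1, -26, -50, -2, -70, -4]          # toy hours-to-event values
median_baseline = round(statistics.median(train_tte))
mean_baseline = round(statistics.mean(train_tte))

actual = [-10, -9, -8, -7]                           # true time-to-event per hour frame
print(rmse(actual, [median_baseline] * len(actual)))
print(rmse(actual, [mean_baseline] * len(actual)))
```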
5. RESULTS
Table 1 displays the RMSE results on the 14 test events. On average, the performance of the linear regression method is worse than both baselines, while the time series analysis outperforms the median baseline. Given that the best-performing method is still an unsatisfactory 43 hours off, there is still much room for improvement. The best method per event varies. Even linear regression, which performs below both baselines on average, yields the best RMSE for two events. It appears that a few extreme deviations (110 for 'twefey', 410 for 'tweaja') are responsible for its poor average RMSE.

                    Spring 2012                                            Fall 2012
                    azaja feyaz feyutr psvfey tweaja twefey tweutr utraz   azfey psvaz twefey utraz utrpsv utrtwe   Av (sd)
Baseline Median       63    49     54     62     38     64     96    71      62    67     62    66     61     62   63 (12)
Baseline Mean         51    40     44     51     31     52     77    58      50    55     51    53     49     51   51 (10)
Linear regression     52    42     59     54    410     41     41    33     111    31    110    54     37     68   82 (94)
Local regression      48    44     35     41     43     43     31    20      57    40     52    48     34     52   43 (9)
Time Series           48    50     42     43     45     41     63    70      48    58     46    71     59     63   54 (10)

Table 1: Overall Root Mean Squared Error scores for each method: difference in hours between the estimated time-to-event and the actual time-to-event.

The average performance of the different methods in terms of their RMSE across hourly forecasts is plotted in Figure 1. In the left half of the graph the three systems outperform the baselines, except for an error peak of the linear regression method at around t = -150. Before t = -100 the time series prediction performs rather well, with RMSE values averaging 23 hours. The linear regression and local regression methods produce larger errors at first, decreasing as time progresses. In the second half of the graph, however, only the local regression method retains fairly low RMSE values, at an average of 21 hours, while the linear regression method becomes increasingly erratic in its predictions. The time series analysis method also produces considerably higher RMSE values in the last days before the events.

Figure 1: RMSE curves for the two baselines and the three methods for the last 192 hours before t = 0.

6. CONCLUSION
In this study we explored and compared three approaches to time-to-event prediction on the basis of streams of tweets. We tested on the prediction of the time-to-event of football matches by generating hourly forecasts. When the three approaches are compared to two simplistic baselines based on the mean and median of the time-to-event of tweets sent before an event, only local regression displays better overall RMSE values on the tested prediction range of 192 to 0 hours before the event. Linear regression generates some highly erratic predictions and scores below both baselines. A novel time series approach that implements local regression based on sequences of samples of tweets performs better than the median baseline, but worse than the mean baseline.

Yet, the time series method generates fairly accurate forecasts during the first half of the test period. Before t = -100 hours, i.e. earlier than about four days before the event, predictions by the time series method are only about a day off (23 hours on average in this time range). From t = -100 hours onwards, the local regression approach based on sets of tweets in hourly time frames is the better predictor, with RMSE values that remain fairly low close to t = 0 (21 hours on average in this time range).

On the one hand, our results are not very strong: predictions that are more than two days off and at the same time only mildly better than simple baselines cannot be considered precise. However, the results indicate that if we divide the problem into an 'early' prediction system based on time series analysis and a 'late' prediction system based on local regression, we could limit the prediction error to within about a day. If we can detect the point at which the time series analysis starts increasing its predicted time-to-event (which is the wrong trend, as the event can only come closer in time), it is time to switch to the local regression system. In our data, this point lies around t = -100.

In future work we plan to extend the current study in several directions. Most importantly, we plan to extend the study to other events, moving from football to other scheduled events, and from scheduled events to unscheduled events, the ultimate goal of a forecasting system like this. A second extension is to improve on the time series analysis method, particularly to investigate why it performs well only up to several days before the future event (and what kind of patterns it matches successfully). We also plan to optimize the local regression approach, as we currently use a fairly standard k-NN approach without optimized hyperparameters, and we have not optimized the selection of features either.

Acknowledgement
This study is financed by the COMMIT program as part of the Infiniti project.

7. REFERENCES
[1] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1–5):11–73, 1997.
[2] J. Kleinberg. Temporal dynamics of on-line information streams. In Data Stream Management: Processing High-Speed Data Streams. Springer, 2006.
[3] A. Ritter, Mausam, O. Etzioni, and S. Clark. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 1104–1112. ACM, 2012.
[4] W. Weerkamp and M. de Rijke. Activity prediction: A Twitter-based exploration. In Proceedings of TAIA'12, Aug. 2012.
[5] S. Yu and S. Kak. A survey of prediction using social media. ArXiv e-prints, Mar. 2012.