Estimating the Time between Twitter Messages and Future Events

Ali Hürriyetoğlu, Florian Kunneman, and Antal van den Bosch
Centre for Language Studies, Radboud University Nijmegen
P.O. Box 9103, NL-6500 HD Nijmegen
ali.hurriyetoglu@gmail.com, {f.kunneman,a.vandenbosch}@let.ru.nl

ABSTRACT
We describe and test three methods to estimate the remaining time between a series of microtexts (tweets) and the future event they refer to via a hashtag. Our system generates hourly forecasts. A linear and a local regression-based approach are applied to map hourly clusters of tweets directly onto time-to-event. To take changes over time into account, we develop a novel time series analysis approach that first derives word frequency time series from sets of tweets and then performs local regression to predict time-to-event from nearest-neighbor time series. We train and test on a single type of event, Dutch premier league football matches. Our results indicate that in an 'early' stage, four days or more before the event, the time series analysis produces time-to-event predictions that are about one day off; closer to the event, local regression attains a similar accuracy. Local regression also outperforms both mean and median-based baselines, but on average none of the tested systems has a consistently strong performance through time.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Spatial-Temporal Systems

General Terms
Algorithms, Performance

Keywords
Time series analysis, Event prediction, Twitter

1. INTRODUCTION
With the advent of social media, data streams of unprecedented volume have become available. These streams contain not only text, but also identity markers of the persons who generated the text, and the time at which the messages were published. The availability of massive amounts of time-stamped texts is an invitation to incorporate time series analysis methods into the natural language processing toolbox. For instance, predictive models can be built through time series analysis that estimate the likelihood and time of future events.

Our study focuses on textual data published by humans via social media about particular events. If the starting point of the event is taken as the anchor point t = 0, texts can be viewed in relation to this point, and generalizations can be made over texts at different distances in time to t = 0. The goal of this paper is to present new methods that are able to automatically estimate the time-to-event from a stream of microtext messages. These methods could serve as modules in news media mining systems (for instance, http://www.zapaday.com/) to fill upcoming event calendars. The methods should be able to work robustly in a stream of messages, and the dual goal is to make (i) reliable predictions of times-to-event (ii) as early as possible. Predicting that an event is starting imminently is arguably less useful than being able to predict its start a number of days ahead. This implies that if a method requires a sample of tweets (e.g. with the same hashtag) to be gathered during some time frame, the frame should not be too long; otherwise predictions could come in too late to be relevant.

In this paper we test the predictive capabilities of three different approaches. The first system is based on linear regression and maps sets of tweets with the same hashtag during an hour to a time-to-event estimate. The second system attempts to do the same based on local regression. The third system uses time series analysis. It takes into account more than a single set of tweets: during a certain time period it samples several sets of tweets in fixed time frames, and derives time series information from individual word frequencies in these samples. It compares these word frequency time series profiles against a labeled training set of profiles in order to find similar patterns of change in word frequencies. The method then adopts local regression: after finding a nearest-neighbor word frequency time series, the time-to-event stored with that neighbor is copied to the tested time series. With this third system, and with the comparison against the second system, we can test the hypothesis that it is useful to gather time series information (more specifically, patterns in word frequency changes) over a period of time.

This paper is structured as follows. We describe the relation of our work to earlier research in Section 2. The three systems are described in Section 3. Section 4 describes the overall experimental setup, including a description of the data, the baselines, and the evaluation method used. The results are presented and analyzed in Section 5. We conclude with a discussion of the results and future studies in Section 6.

2. RELATED RESEARCH
The growing availability of digital texts with time stamps, such as e-mails, weblogs, and online news, has spawned various types of studies on the analysis of patterns in texts over time. An early publication on the general applicability of time series analysis to time-stamped text is [2]. A more recent overview of future prediction using social media is [5]. A popular goal of time series analysis of texts is event prediction, where a correlation is sought between a point in the future and preliminary texts.

Ritter et al. train on annotated open-domain event mentions in tweets in order to create a calendar of events based on explicit date mentions and words typical of the event [3]. While we also aim to estimate the point in time at which an event will take place, our focus lies on the pattern of anticipation seen in tweets, linked to the time until the event occurs, rather than on specific time references to a future event. [4] do look at anticipation in tweets, but focus on personal activities in the very near future, while we aim to predict the time-to-event of potentially large-scale news events as early as possible.

3. METHODS
In this section we introduce the methods adopted in our study. They operate on streams of tweets, and generate hourly forecasts for the events that tweets with the same hashtag refer to. The single tweet is the smallest unit available for this task; we may also consider more than one tweet and aggregate tweets over a certain time frame. If these single tweets or sets of tweets are represented as bag-of-words vectors, the task can be cast as a regression problem: mapping a feature vector onto a continuous numeric output representing the time-to-event. In this study the smallest time unit is one hour, and all three methods work with this time frame.

3.1 Linear and local regression
In linear regression, each feature in the bag-of-words feature vector (representing the presence or frequency of occurrence of a specific word) can be regarded as a predictive variable to which a weight can be assigned that, in a simple linear function, multiplies the value of the predictive variable to generate a value for the response variable, the time-to-event. A multiple linear regression function can be approximated by finding the weights for a set of features that generate the response variable with the smallest error.

Local regression, or local learning [1], is the numeric variant of the k-nearest neighbor classifier. Given a test instance, it finds the closest k training instances based on a similarity metric, and produces a local estimate of the numeric output by taking some average of the outcomes of those k training instances.

Linear regression and local regression can be considered baseline approaches, but they are complementary. While in linear regression an overall pattern is generated to fit the whole training set, local regression only looks at local information (the characteristics of single instances). Linear regression is unfit for approximating Gaussian or other non-linear distributions; as we will see, there are reasons to believe that there are substantial differences between tweets posted in different periods of time before an event. In contrast, local regression is unbiased and will adapt to any local distribution.
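The regressors themselves are standard; the paper applies them with R and TiMBL (see Section 4.2). Purely as an illustration of the mapping from hourly bag-of-words vectors to time-to-event, and not as the authors' implementation, the two approaches can be sketched with scikit-learn on hypothetical data:

```python
# Illustrative sketch only (not the authors' implementation, which used R for
# linear regression and TiMBL for k-NN): mapping hourly bag-of-words vectors
# onto time-to-event with linear and local (k-NN) regression via scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
# Hypothetical data: one row per hour block of tweets, one column per word;
# the target is the (negative) number of hours until the event.
X_train = rng.poisson(0.5, size=(200, 300)).astype(float)
y_train = -rng.uniform(0, 192, size=200)
X_test = rng.poisson(0.5, size=(5, 300)).astype(float)

linear = LinearRegression().fit(X_train, y_train)
local = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

print(linear.predict(X_test))  # one time-to-event estimate per test hour block
print(local.predict(X_test))   # average time-to-event of the 5 nearest hour blocks
```

The contrast the paper draws is visible in this setup: the linear model commits to one global weight vector over all hours, while the k-NN regressor answers each test hour only from its most similar training hours.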
3.2 Time series analysis
Time series are data structures that contain multiple measurements of data features over time. If the values of a feature change meaningfully over time, then time series analysis can be used to capture this pattern of change. Comparing new time series with memorized time series can reveal similarities that may lead to a prediction of a subsequent value or, in our case, the time-to-event. Our time series approach extends the local regression approach by not only considering single sets of aggregated tweets in a fixed time frame (e.g. one hour in our study), but creating sequences of these sets representing several consecutive hours of gathered tweets. Using the same bag-of-words representation as the local regression approach, we find nearest neighbors of sequences of bag-of-words vectors rather than of single hour frames. The similarity between a test time series and a training time series of the same length is calculated by computing their Euclidean distance. In this study we did not further optimize any hyperparameters; we set k = 1.

The time series approach generates predictions by following the same strategy as the simple local regression approach: upon finding the nearest-neighbor training time series, the time-to-event of this training time series is taken as the time-to-event estimate of the test time series. In case of equidistant nearest neighbors, the average of their associated times-to-event is given as the prediction.
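A minimal sketch of this matching step, assuming each instance is stored as a fixed-length array of hourly word-frequency vectors (toy data and array layout are assumptions, not the authors' code):

```python
# Sketch of the nearest-neighbour time series matching described above.
import numpy as np

def predict_time_to_event(test_seq, train_seqs, train_tte):
    """test_seq: (hours, words); train_seqs: (n, hours, words); train_tte: (n,)."""
    # Euclidean distance between the test sequence and every training sequence
    dists = np.sqrt(((train_seqs - test_seq) ** 2).sum(axis=(1, 2)))
    nearest = np.isclose(dists, dists.min())
    # k = 1; equidistant nearest neighbours are averaged, as in the paper
    return train_tte[nearest].mean()

rng = np.random.default_rng(0)
train_seqs = rng.random((50, 6, 100))      # 50 training sequences of 6 hour frames
train_tte = -rng.uniform(0, 192, size=50)  # hours-to-event label per sequence
test_seq = rng.random((6, 100))
print(predict_time_to_event(test_seq, train_seqs, train_tte))
```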
4. EXPERIMENTAL SET-UP

4.1 Data collection
For this study we chose football matches as a specific type of event. They occur frequently, have a distinctive hashtag by convention ('#ajafey' for a match between Ajax and Feyenoord), and often generate a useful amount of tweets: up to tens of thousands of tweets per match. For the collection of training and test data we focused on Dutch football matches played in the Eredivisie. We harvested tweets by means of twiqs.nl, a database of Dutch tweets from December 2010 onwards. We selected the (arbitrary) top 6 teams of the league (Ajax, Feyenoord, PSV, FC Twente, AZ Alkmaar, and FC Utrecht), and queried all matches played between them in 2011 and 2012. For each query, the conventional hashtag for a match was used with a restricted search space of three weeks before the time of the match until the start time of the match, to ensure that the collected tweets were referring to that specific match, and not to an earlier match between the same home and away team and therefore carrying the same hashtag.

The queries resulted in tweets referring to 60 matches between the selected six teams in the period from January 2011 until December 2012. From these, we selected the matches with the most frequent shared starting time, Sundays at 2:30 PM, for our experiment. As we focus on the number of hours before an event, the actual time of day at which a tweet is posted (for example during the night or in the afternoon) can bias the type of tweet; with the fixed starting time this effect is neutralized. To generate training and test events that simulate a system trained on past events and tested on upcoming events, we selected tweets referring to matches played in 2011 (a calendar year comprising two halves of a football season) as training data and tweets referring to 2012 matches as test data. This resulted in 12 matches as training events (totaling 54,081 tweets) and 14 matches as test events (40,204 tweets).

The time-to-event in hours was calculated for every tweet, based on its time of posting and the known start time of the event it referred to. For this task we did not take into account tweets that were posted during and after matches. We also constrained the number of days before the event: for both training and test sets, only tweets posted within eight days before the event were kept. Although this is an artificial constraint, the eight-day window captures the vast majority, about 98%, of forward-looking tweets.
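This labelling step can be sketched as follows; the field names and data layout are hypothetical and not taken from the paper:

```python
# Hypothetical labelling step: compute the time-to-event in hours per tweet
# and keep only tweets posted within eight days before kick-off.
from datetime import datetime, timedelta

def label_tweets(tweets, match_start, max_days_before=8):
    """tweets: iterable of (posted_at: datetime, text: str) pairs."""
    labelled = []
    for posted_at, text in tweets:
        tte_hours = (posted_at - match_start).total_seconds() / 3600.0
        # discard tweets from during/after the match and beyond the 8-day window
        if -24 * max_days_before <= tte_hours < 0:
            labelled.append((text, tte_hours))
    return labelled

match_start = datetime(2012, 9, 23, 14, 30)          # Sunday, 2:30 PM kick-off
tweets = [(match_start - timedelta(days=2, hours=3), "#ajafey komt eraan!"),
          (match_start + timedelta(hours=1), "goal! #ajafey")]
print(label_tweets(tweets, match_start))             # keeps only the first tweet
```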
4.2 Generation of training and test data
The goal of the experiments was to compare systems that generate hourly forecasts of the event start time for each test event. This was done based on the information in aggregated sets of tweets within the time span of an hour. Aggregation is done by treating all training events as one collection during the extraction of features. The linear and local regression methods only operate on vectors representing hour blocks. The time series analysis approach makes use of longer sequences of six hour blocks; this number was empirically set in preliminary experiments.

The aggregated tweets were used as training instances for the linear and local regression methods. To maximize the number of training instances, we generated a sequence of overlapping instances using the minute as a finer-grained shift unit. At every minute, all tweets posted within the hour before the tweets in that minute were added to the instance.
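A sketch of this minute-shifted windowing, simplified here to iterate only over minutes in which a tweet was actually posted (the data layout is assumed, not the authors' code):

```python
# Sketch of the minute-shifted instance generation: every minute yields one
# instance containing all tweets from the preceding hour, labelled with that
# minute's time-to-event.
def sliding_hour_instances(tweets):
    """tweets: list of (tte_minutes, text) pairs, tte_minutes < 0 before the event."""
    instances = []
    for minute in sorted({m for m, _ in tweets}):
        window = [text for m, text in tweets if minute - 60 < m <= minute]
        instances.append((minute / 60.0, window))  # label the window in hours
    return instances

tweets = [(-130, "kaartje geregeld"), (-95, "#ajafey!"), (-70, "bijna zover")]
for tte_hours, window in sliding_hour_instances(tweets):
    print(round(tte_hours, 2), window)
```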
In order to reduce the feature space for the linear and local regression instances, we pruned every bag-of-words feature that occurred less than 500 times in the training set. Linear regression was applied by means of R (http://www.r-project.org/); absolute occurrence counts of features were taken into account. For local regression we made use of the k-NN implementation that is part of TiMBL (http://ilk.uvt.nl/timbl), setting k = 5, using Information Gain feature weighting, and using an overlap-based similarity metric that does not count matches on zero values (features marking words that are absent in both test and training vectors). For k-NN, binary feature values were used.

The time series analysis vectors are not filled with absolute occurrence counts, but with relative and smoothed frequencies. After having counted all words in each time frame, two frequencies are computed for each word. The first, the overall frequency of a word, is calculated as the sum of its counts in all time frames, divided by the total number of tweets in all time frames in our 8-day window. This frequency ranges between 0 (the word does not occur) and 1 (the word occurs in every tweet). The second frequency is computed per time frame for each word: the word count in that frame is divided by the number of tweets in the frame. The latter frequency is the basic element in our time series calculations.

As many time frames contain only a small number of tweets, especially the frames more than a few days before the event, word counts are sparse as well. Besides taking longer time frames of more than a single sample size, frequencies can also be smoothed through typical time series smoothing techniques such as moving average smoothing. We apply a pseudo-exponential moving average filter by replacing each word count by a weighted average of the word counts at time frames t, t-1, and t-2, where w_t = 4 (the weight at t is set to 4), w_{t-1} = 2, and w_{t-2} = 1.
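A sketch of the per-frame relative frequency and the 4-2-1 weighted smoothing; whether the weights are applied to the raw counts or to the relative frequencies is left open in the text, so applying them to the relative frequencies here is an assumption (toy data, not the authors' code):

```python
# Sketch of per-frame relative frequency plus the 4-2-1 pseudo-exponential
# moving average described above.
def smoothed_frequencies(word_counts, tweets_per_frame, weights=(1, 2, 4)):
    """word_counts, tweets_per_frame: one value per hourly frame, oldest first."""
    freqs = [c / n if n else 0.0 for c, n in zip(word_counts, tweets_per_frame)]
    smoothed = []
    for t in range(len(freqs)):
        window = freqs[max(0, t - 2):t + 1]          # frames t-2, t-1, t (where present)
        w = weights[len(weights) - len(window):]     # weights 1, 2, 4 for t-2, t-1, t
        smoothed.append(sum(wi * fi for wi, fi in zip(w, window)) / sum(w))
    return smoothed

word_counts = [0, 2, 1, 5, 3]           # occurrences of one word per hour frame
tweets_per_frame = [4, 10, 5, 20, 12]   # number of tweets in each frame
print(smoothed_frequencies(word_counts, tweets_per_frame))
```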
4.3 Evaluation and baselines
A common metric for evaluating numeric predictions is the Root Mean Squared Error (RMSE), cf. Equation 1. For all hourly forecasts made in N hour frames, the squared differences between the actual value v_i and the estimated value e_i are summed; the square root of the mean of these squared differences is then taken to produce the RMSE of the prediction series.

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (v_i - e_i)^2} \qquad (1)

We computed two straightforward baselines derived from the training set: the median and the mean of the time-to-event over all training tweets. For the median baseline, all tweets in the training set were ordered in time and the median time was identified. As we use one-hour time frames throughout our study, we round the median to the one-hour time frame it is in, which turns out to be -3 hours. The mean is computed by averaging the time-to-event of all tweets, again rounded to the hour. The mean is -26 hours.
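Equation 1 and the two baselines are simple enough to express in a few lines; the numbers below are toy values, not the study's data:

```python
# Sketch of the evaluation in Equation 1 and the two training-set baselines.
import math
import statistics

def rmse(actual, estimated):
    """Root Mean Squared Error over paired hourly forecasts (Equation 1)."""
    return math.sqrt(sum((v - e) ** 2 for v, e in zip(actual, estimated)) / len(actual))

# Baselines: the (hour-rounded) median and mean time-to-event over all training tweets.
train_tte = [-3, -1, -26, -50, -2, -70, -4]          # toy hours-to-event values
median_baseline = round(statistics.median(train_tte))
mean_baseline = round(statistics.mean(train_tte))

actual = [-10, -9, -8, -7]                           # true time-to-event per hour frame
print(rmse(actual, [median_baseline] * len(actual)))
print(rmse(actual, [mean_baseline] * len(actual)))
```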
5. RESULTS
Table 1 displays the RMSE results on the 14 test events. On average, the performance of the linear regression method is worse than both baselines, while the time series analysis outperforms the median baseline. Given that the best-performing method is still an unsatisfactory 43 hours off, there is still much room for improvement. The best method per event varies. Even linear regression, which performs below both baselines on average, yields the best RMSE for two events. It appears that a few extreme deviations (110 for 'twefey', 410 for 'tweaja') are responsible for its poor average RMSE.

                    Spring 2012                                            Fall 2012
                    azaja feyaz feyutr psvfey tweaja twefey tweutr utraz   azfey psvaz twefey utraz utrpsv utrtwe   Av (sd)
Baseline Median       63    49     54     62     38     64     96    71      62    67     62    66     61     62   63 (12)
Baseline Mean         51    40     44     51     31     52     77    58      50    55     51    53     49     51   51 (10)
Linear regression     52    42     59     54    410     41     41    33     111    31    110    54     37     68   82 (94)
Local regression      48    44     35     41     43     43     31    20      57    40     52    48     34     52   43 (9)
Time Series           48    50     42     43     45     41     63    70      48    58     46    71     59     63   54 (10)

Table 1: Overall Root Mean Squared Error scores for each method: difference in hours between the estimated time-to-event and the actual time-to-event.

The average performance of the different methods in terms of their RMSE across hourly forecasts is plotted in Figure 1. In the left half of the graph the three systems outperform the baselines, except for an error peak of the linear regression method at around t = -150. Before t = -100 the time series prediction performs rather well, with RMSE values averaging 23 hours. The linear regression and local regression methods produce larger errors at first, decreasing as time progresses. In the second half of the graph, however, only the local regression method retains fairly low RMSE values, at an average of 21 hours, while the linear regression method becomes increasingly erratic in its predictions. The time series analysis method also produces considerably higher RMSE values in the last days before the events.

Figure 1: RMSE curves for the two baselines and the three methods for the last 192 hours before t = 0.

6. CONCLUSION
In this study we explored and compared three approaches to time-to-event prediction on the basis of streams of tweets. We tested on the prediction of the time-to-event of football matches by generating hourly forecasts. When the three approaches are compared to two simplistic baselines based on the mean and median of the time-to-event of tweets sent before an event, only local regression displays better overall RMSE values on the tested prediction range of 192 to 0 hours before the event. Linear regression generates some highly erratic predictions and scores below both baselines. A novel time series approach that implements local regression based on sequences of samples of tweets performs better than the median baseline, but worse than the mean baseline.

Yet, the time series method generates fairly accurate forecasts during the first half of the test period. Before t = -100 hours, i.e. earlier than about four days before the event, predictions by the time series method are only about a day off (23 hours on average in this time range). From t = -100 hours onwards, the local regression approach based on sets of tweets in hourly time frames is the better predictor, with RMSE values that remain fairly low close to t = 0 (21 hours on average in this time range).

On the one hand, our results are not very strong: predictions that are more than two days off and at the same time only mildly better than simple baselines cannot be considered precise. However, the results indicate that if we divide the problem into an 'early' prediction system based on time series analysis and a 'late' prediction system based on local regression, we could limit the prediction error to within about a day. If we can detect the point at which the time series analysis starts increasing its predicted time-to-event (which is the wrong trend, as the event can only come closer in time), it is time to switch to the local regression system. In our data, this point lies around t = -100.

In future work we plan to extend the current study in several directions. Most importantly, we plan to extend the study to other events, moving from football to other scheduled events, and from scheduled events to unscheduled events, the ultimate goal of a forecasting system like this. A second extension is to improve on the time series analysis method, particularly to investigate why it performs well only up to several days before the future event (and what kind of patterns it matches successfully). We also plan to optimize the local regression approach, as we currently use a fairly standard k-NN approach without optimized hyperparameters, and we have not optimized the selection of features either.

Acknowledgement
This study is financed by the COMMIT program as part of the Infiniti project.

7. REFERENCES
[1] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1–5):11–73, 1997.
[2] J. Kleinberg. Temporal dynamics of on-line information streams. In Data Stream Management: Processing High-Speed Data Streams. Springer, 2006.
[3] A. Ritter, Mausam, O. Etzioni, and S. Clark. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 1104–1112. ACM, 2012.
[4] W. Weerkamp and M. de Rijke. Activity prediction: A Twitter-based exploration. In Proceedings of TAIA'12, Aug. 2012.
[5] S. Yu and S. Kak. A survey of prediction using social media. ArXiv e-prints, Mar. 2012.