<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Estimating the Time between Twitter Messages and Future Events</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ali Hürriyetog˘ lu</string-name>
          <email>ali.hurriyetoglu@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Kunneman</string-name>
          <email>f.kunneman@let.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antal van den Bosch</string-name>
          <email>a.vandenbosch@let.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Language Studies, Radboud University Nijmegen P.</institution>
          <addr-line>O. Box 9103 NL-6500 HD Nijmegen</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe and test three methods to estimate the remaining time between a series of microtexts (tweets) and the future event they refer to via a hashtag. Our system generates hourly forecasts. A linear and a local regression-based approach are applied to map hourly clusters of tweets directly onto time-to-event. To take changes over time into account, we develop a novel time series analysis approach that rst derives word frequency time series from sets of tweets and then performs local regression to predict timeto-event from nearest-neighbor time series. We train and test on a single type of event, Dutch premier league football matches. Our results indicate that in an 'early' stage, four days or more before the event, the time series analysis produces time-to-event predictions that are about one day o ; closer to the event, local regression attains a similar accuracy. Local regression also outperforms both mean and median-based baselines, but on average none of the tested system has a consistently strong performance through time.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Time series analysis</kwd>
        <kwd>Event prediction</kwd>
        <kwd>Twitter</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>With the advent of social media, data streams of
unprecedented volume have become available. These streams do not
only contain text, but also identity markers of the persons
who generated the text, and the time at which the
messages were published. The availability of massive amounts
of time-stamped texts is an invitation to incorporate time
series analysis methods into the natural language
processing toolbox. For instance, predictive models can be built
through time series analysis that can estimate the likelihood
and time of future events.</p>
      <p>Our study focuses on textual data published by humans via
social media about particular events. If the starting point
of the event in time is taken as the anchor t = 0 point
in time, texts can be viewed in relation to this point, and
generalizations can be made over texts at di erent distances
in time to t = 0. The goal of this paper is to present new
methods that are able to automatically estimate the
time-toevent from a stream of microtext messages. These methods
could serve as modules in news media mining systems1 to
ll upcoming event calendars. The methods should be able
to work robustly in a stream of messages, and the dual goal
would be to make (i) reliable predictions of times-to-event
(ii) as early as possible. Predicting that an event is starting
imminently is arguably less useful than being able to predict
its start in a number of days. This implies that if a method
requires a sample of tweets (e.g. with the same hashtag) to
be gathered during some time frame, the frame should not
be too long, otherwise predictions could come in too late to
be relevant.</p>
      <p>In this paper we test the predictive capabilities of three
different approaches. The rst system is based on linear
regression and maps sets of tweets with the same hashtag during
an hour to a time-to-event estimate. The second system
attempts to do the same based on local regression. The
third system uses time series analysis. It takes into account
more than a single set of tweets: during a certain time
period it samples several sets of tweets in xed time frames,
and derives time series information from individual word
frequencies in these samples. It compares these word frequency
time series pro les against a labeled training set of pro les
in order to nd similar patterns of change in word
frequencies. The method then adopts local regression: nding a
nearest-neighbor word frequency time series, the
time-toevent stored with that neighbor is copied to the tested time
series. With this third system, and with the comparison
against the second system, we can test the hypothesis that
it is useful to gather time series information (more speci
cally, patterns in word frequency changes) over an amount
of time.</p>
      <p>This paper is structured as follows. We describe the
relation of our work to earlier research in Section 2. The three
systems are described in Section 3. Section 4 describes the
overall experimental setup, including a description of the
data, the baseline, and the evaluation method used. The
results are presented and analyzed in Section 5. We
conclude with a discussion of the results and future studies in
Section 6.
1For instance, http://www.zapaday.com/</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED RESEARCH</title>
      <p>
        The growing availability of digital texts with time stamps,
such as e-mails, weblogs, and online news, has spawned
various types of studies on the analysis of patterns in texts over
time. An early publication on the general applicability of
time series analysis on time-stamped text is [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A more
recent overview of future predictions using social media is
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A popular goal of time series analysis of texts is event
prediction, where a correlation is sought between a point in
the future and preliminary texts.
      </p>
      <p>
        Ritter et al. train on annotated open-domain event
mentions in tweets in order to create a calendar of events based
on explicit date mentions and words typical of the event
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. While we also aim to estimate the point in time at
which an event will take place, our focus lies on the pattern
of anticipation seen in tweets linked to the time until the
event occurs rather than speci c time references to a future
event. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] do look at anticipation seen in tweets, but focus
on personal activities in the very near future, while we aim
to predict the time-to-event of potentially large-scale news
events as early as possible.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. METHODS</title>
      <p>In this section we introduce the methods adopted in our
study. They operate on streams of tweets, and generate
hourly forecasts for the events that tweets with the same
hashtag refer to. The single tweet is the smallest unit
available for this task; we may also consider more than one tweet
and aggregate tweets over a certain time frame. If these
single tweets or sets of tweets are represented as bag-of-words
vectors, the task can be cast as a regression problem:
mapping a feature vector onto a continuous numeric output
representing the time-to-event. In this study the smallest time
unit is one hour, and all three methods work with this time
frame.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Linear and local regression</title>
      <p>In linear regression, each feature in the bag-of-words feature
vector (representing the presence or frequency of occurrence
of a speci c word) can be regarded as a predictive variable
to which a weight can be assigned that, in a simple linear
function, multiplies the value of the predictive variable to
generate a value for the response variable, the time-to-event.
A multiple linear regression function can be approximated
by nding the weights for a set of features that generates
the response variable with the smallest error.</p>
      <p>
        Local regression, or local learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], is the numeric variant
of the k-nearest neighbor classi er. Given a test instance, it
nds the closest k training instances based on a similarity
metric, and bases a local estimation of the numeric output
by taking some average of the outcomes of the closest k
training instances.
      </p>
      <p>Linear regression and local regression can be considered
baseline approaches, but are complementary. While in linear
regression an overall pattern is generated to t the whole
training set, local regression only looks at local information for
classi cation (the characteristics of single instances).
Linear regression is un t for approximating gaussian or other
non-linear distributions; as we will see, there are reasons
to believe that there are substantial di erences in tweets
posted in di erent periods of time before an event. In
contrast, local regression is unbiased and will adapt to any local
distribution.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Time series analysis</title>
      <p>Time series are data structures that contain multiple
measurements of data features over time. If values of a feature
change meaningfully over time, then time series analysis can
be used to capture this pattern of change. Comparing new
time series with memorized time series can reveal
similarities that may lead to a prediction of a subsequent value or,
in our case, the time-to-event. Our time series approach
extends the local regression approach by not only considering
single sets of aggregrated tweets in a xed time frame (e.g.
one hour in our study), but creating sequences of these sets
representing several consecutive hours of gathered tweets.
Using the same bag-of-words representation as the local
regression approach, we nd nearest neighbors of sequences
of bag-of-word vectors rather than single hour frames. The
similarity between a test time series and a training time
series of the same length is calculated by computing their
Euclidean distance. In this study we did not further optimize
any hyperparameters; we set k = 1.</p>
      <p>The time series approach generates predictions by following
the same strategy as the simple local regression approach:
upon nding the nearest-neighbor training time series, the
time-to-event of this training time series is taken as the
timeto-event estimate of the test time series. In case of
equidistant nearest neighbors, the average of their associated
timeto-events is given as the prediction.</p>
    </sec>
    <sec id="sec-6">
      <title>4. EXPERIMENTAL SET-UP</title>
    </sec>
    <sec id="sec-7">
      <title>4.1 Data collection</title>
      <p>For this study we chose football matches as a speci c type
of event. They occur frequently, have a distinctive
hashtag by convention (`#ajafey' for a match between Ajax and
Feyenoord) and often generate a useful amount of tweets:
up to tens of thousands of tweets per match. For the
collection of training and test data we focused on Dutch football
matches played in the Eredivisie. We harvested tweets by
means of twiqs.nl, a database of Dutch tweets from
December 2010 onwards. We selected the (arbitrary) top 6 teams
of the league2, and queried all matches played between them
in 2011 and 2012. For each query, the conventional hashtag
for a match was used with a restricted search space of three
weeks before the time of the match until the start time of the
match (to ensure that the collected tweets were referring to
that speci c match, and not to an earlier match consisting
of the same home and away team and therefore the same
hashtag).</p>
      <p>The queries resulted in tweets referring to 60 matches
between the selected six teams in the period from January 2011
until December 2012. From these, we selected the matches
with the most frequent similar starting time, Sundays at
2:30 PM, for our experiment. As we focused on the amount
of hours before an event, the actual time when a tweet is
posted (for example during the night or in the afternoon)
can bias the type of tweet; with the xed starting time this
2Ajax, Feyenoord, PSV, FC Twente, AZ Alkmaar and FC
Utrecht
e ect is neutralized. To generate training and test events
that simulate a system trained on passed events and tested
on upcoming events, we selected tweets referring to matches
played in 2011 (a calendar year comprising two halfs of a
football season) as training data and tweets referring to 2012
matches as test data. This resulted in 12 matches as
training events (totaling 54,081 tweets) and 14 matches as test
events (40,204 tweets).</p>
      <p>The time-to-event in hours was calculated for every tweet,
based on their time of posting and the known start time
of the event they referred to. For this task we did not
take tweets into account that were posted during and
after matches. We also constrained the number of days before
the event: for both training and test sets, tweets were kept
within eight days before the event. Although this is an
arti cial constraint, the eight days window captures the vast
majority, about 98%, of forward-looking tweets.</p>
    </sec>
    <sec id="sec-8">
      <title>4.2 Generation of training and test data</title>
      <p>The goal of the experiments was to compare systems that
generate hourly forecasts of the event start time for each
test event. This was done based on the information in
aggregated sets of tweets within the time span of an hour.
Aggregation is done by treating all training events as one
collection during the extraction of features. The linear and
local regression methods only operate on vectors
representing hour blocks. The time series analysis approach makes
use of longer sequences of six hour blocks - this number was
empirically set in preliminary experiments.</p>
      <p>The aggregated tweets were used as training instances for
the linear and local regression methods. To maximize the
number of training instances, we generated a sequence of
overlapping instances using the minute as a ner-grained
shift unit. At every minute, all tweets posted within the
hour before the tweets in that minute were added to the
instance.</p>
      <p>In order to reduce the feature space for the linear and local
regression instances, we pruned every bag-of-word feature
that occured less than 500 times in the training set. Linear
regression was applied by means of R3. Absolute occurrence
counts of features were taken into account. For local
regression we made use of the k-NN implementation as part
of TiMBL4, setting k = 5, using Information Gain feature
weighting, and an overlap-based metric as similary metric
that does not count matches on zero values (features
marking words that are absent in both test and training vectors).
For k-NN, the binary value of features were used.
The time series analysis vectors are not lled with absolute
occurrence counts, but with relative and smoothed
frequencies. After having counted all words in each time frame, two
frequencies are computed for each word. The rst, the
overall frequency of a word, is calculated as the sum of its counts
in all time frames, divided by the total number of tweets in
all time frames in our 8-day window. This frequency ranges
between 0 (the word does not occur) and 1 (the word occurs
in every tweet). The second frequency is computed per time
3http://www.r-project.org/
4http://ilk.uvt.nl/timbl
frame for each word, where the word count in that frame is
divided by the number of tweets in the frame. The latter
frequency is the basic element in our time series calculations.
As many time frames contain only a small number of tweets,
especially the frames more than a few days before the event,
word counts are sparse as well. Besides taking longer time
frames of more than a single sample size, frequencies can also
be smoothed through typical time series analysis smoothing
techniques such as moving average smoothing. We apply a
pseudo-exponential moving average lter by replacing each
word count by a weighted average of the word count at time
frames t, t 1, and t 2, where wt = 4 (the weight at t is
set to 4), wt 1 = 2, and wt 2 = 1.</p>
    </sec>
    <sec id="sec-9">
      <title>4.3 Evaluation and baselines</title>
      <p>A common metric for evaluating numeric predictions is the
Root Mean Squared Error (RMSE), cf. Equation 1. For
all hourly forecasts made in N hour frames, a sum is made
of the squared di erences between the actual value vi and
the estimated value ei; the (square) root is then taken to
produce the RMSE of the prediction series.</p>
      <p>v
RM SE = tuu N1 X (vi</p>
      <p>N
i=1
ei)2
(1)
We computed two straightforward baselines derived from the
training set: the median and the mean of time-to-event over
all training tweets. For the median baseline, all tweets in
the training set were ordered in time and the median time
was identi ed. As we use one-hour time frames
throughout our study, we round the median by the one-hour time
frame it is in, which turns out to be 3 hours. The mean is
computed by averaging the time-to-event of all tweets, and
again rounded at the hour. The mean is 26 hours.</p>
    </sec>
    <sec id="sec-10">
      <title>5. RESULTS</title>
      <p>Table 1 displays the averaged RMSE results on the 14 test
events. On average the performance of the linear regression
method is worse than both baselines, while the time series
analysis outperforms the median baseline. Given that the
best performing method is still an unsatisfactory 43 hours
o , there is still a lot of improvement needed. The best
method per event varies. Even linear regression, which has
a below baseline performance on average, leads to the best
RMSE for two events. It appears that some negative
deviations (110 for 'twefey', 410 for 'tweaja') lead to the poor
average RMSE.</p>
      <p>The average performance of the di erent methods in terms
of their RMSE according to hourly forecasts is plotted in
Figure 1. In the left half of the graph the three systems
outperform the baselines, except for an error peak of the linear
regression method at around t = 150. Before t = 100 the
time series prediction is performing rather well, with RMSE
values averaging 23 hours. The linear regression and local
regression methods produce larger errors at rst, decreasing
as time progresses. In the second half of the graph, however,
only the local regression method retains fairly low RMSE
Baseline Median</p>
      <p>Baseline Mean
Linear regression
Local regression</p>
      <p>Time Series
values at an average of 21 hours, while the linear
regression method becomes increasingly erratic in its predictions.
The time series analysis method also produces considerably
higher RMSE values in the last days before the events.</p>
    </sec>
    <sec id="sec-11">
      <title>6. CONCLUSION</title>
      <p>In this study we explored and compared three approaches to
time-to-event prediction on the basis of streams of tweets.
We tested on the prediction of the time-to-event of football
matches by generating hourly forecasts. When the three
approaches are compared to two simplistic baselines based on
the mean and median of the time-to-event of tweets sent
before an event, only local regression displays better
overall RMSE values on the tested prediction range of 192 : : : 0
hours before the event. Linear regression generates some
highly erratic predictions and scores below both baselines.
A novel time series approach that implements local
regression based on sequences of samples of tweets performs better
than the mean baseline, but under the median baseline.
Yet, the time series method generates fairly accurate
forecasts during the rst half of the test period. Before t &lt; 100
hours, i.e. earlier than four days before the event,
predictions by the time series method are only about a day o (23
hours on average in this time range). When t 100, the
local regression approach based on sets of tweets in hourly
time frames is the better predictor, with RMSE values that
are sometimes close to t = 0 (21 hours on average in this
time range).</p>
      <p>On the one hand, our results are not very strong:
predictions that are more than two days o and that are at the
same time only mildly better than simple baselines cannot
be considered precise. However, the results indicate that
if we divide the problem into an `early' prediction system
based on time series analysis and a `late' prediction system
based on local regression, we could limit the prediction error
to within a day. If we can detect the point at which the time
series analysis starts increasing its predicted time-to-event
(which is the wrong trend as the event can only come closer
in time), it is time to switch to the local regression system.
In our data, this point is around t = 100.</p>
      <p>In future work we plan to extend the current study in several
directions. Most importantly, we plan to extend the study
to other events, moving from football to other scheduled
events, and from scheduled events to unscheduled events,
the ultimate goal of a forecasting system like this. A second
extension is to improve on the time series analysis method,
particularly to investigate why it is performing well only up
to several days before the future event (and what kind of
patterns it matches successfully). We also plan to optimize the
local regression approach, as we now utilize a fairly standard
k-NN approach without optimized hyperparameters, and we
have not optimized the selection of features either.</p>
    </sec>
    <sec id="sec-12">
      <title>Acknowledgement</title>
      <p>This study is nanced by the COMMIT program as part of
the In niti project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Atkeson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moore</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Schaal</surname>
          </string-name>
          .
          <article-title>Locally weighted learning</article-title>
          .
          <source>Arti cial Intelligence Review</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          {5):
          <volume>11</volume>
          {
          <fpage>73</fpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <article-title>Temporal dynamics of On-Line information streams</article-title>
          .
          <source>In Data stream management: Processing high-speed data streams</source>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <surname>Mausam</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Etzioni</surname>
            , and
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
          </string-name>
          .
          <article-title>Open domain event extraction from twitter</article-title>
          .
          <source>In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <source>KDD '12</source>
          , pages
          <fpage>1104</fpage>
          {
          <fpage>1112</fpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Weerkamp and M. De Rijke</surname>
          </string-name>
          .
          <article-title>Activity prediction: A Twitter-based exploration</article-title>
          .
          <source>In Proceedings of TAIA'12</source>
          ,
          <string-name>
            <surname>Aug</surname>
          </string-name>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kak</surname>
          </string-name>
          .
          <article-title>A survey of prediction using social media</article-title>
          . In ArXiv e-prints,
          <source>Mar</source>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>