        Variable Attention and Variable Noise:
              Forecasting User Activity

    César Ojeda, Kostadin Cvejoski, Rafet Sifa, and Christian Bauckhage

                            Fraunhofer IAIS, Germany
                       {name.surname}@iais.fraunhofer.de


      Abstract. The study of collective attention is of growing interest in
      an age where mass- and social media generate massive amounts of often
      short-lived information. That is, the problem of understanding how
      particular ideas, news items, or memes grow and decline in popularity
      has become a central problem of the information age. Recent research
      efforts in this regard have mainly addressed methods and models which
      quantify the success of such memes and track their behavior over time.
      Surprisingly, however, the aggregate behavior of users over the various
      news and social media platforms where this content originates has
      largely been ignored, even though the success of memes and messages is
      linked to the way users interact with web platforms. In this paper, we
      therefore present a novel framework that allows for studying the shifts
      of attention of whole populations related to websites or blogs. The
      framework is an extension of the Gaussian process methodology, in which
      we incorporate regularization methods that improve prediction and model
      input-dependent noise. We provide comparisons with the traditional
      Gaussian process and show improved results. Our study on a real-world
      data set uncovers hidden patterns of user behavior.


      Keywords: Gaussian process, Regularization Methods



1   Introduction

Over the last couple of years, so called “question answering” (QA) sites have
gained considerable popularity. These are internet platforms where users pose
questions to a general population. Yahoo Answers, Quora and the Stack Ex-
change family establish internet communities which provide natural and seamless
ways for organizing and providing knowledge [1]. So far, dynamical aspects of
such question answering sites have been studied in different contexts. Previous
work in this area includes studying causality aspects through quasi-experimental
designs [8], user churn analysis through classification algorithms such as support
vector machines or random forests [9], and predictions of the future value of
question-answer pairs according to the initial activity of the question post [2].
In contrast to previous work, where the long-term activity of users is predicted,
our focus in this paper is time series analysis related to user-defined tags. This
approach allows a detailed daily analysis of the behavior of users, and we concen-
trate on the QA site Stackoverflow. This platform has an established reputation
on the web and boasts a community of over 5 million distinct active users who,
so far, have provided more than 18 million answers to more than 11 million
questions. Thanks to the sheer size of the corresponding data set as well as the
regular activity of the user base, we are able to mine temporal data in order to
uncover defining aspects of the dynamics of user behavior.
Due to the complexity of user-system interaction (millions of people discuss
thousands of topics), flexible and accurate models are required in order to guar-
antee reliable forecasting. In recent years, the Bayesian setting and the Gaussian
Process (GP) framework [11, 5] have been shown to provide accurate and flexible
tools for time series analysis. In particular, the possibility of incorporating error
ranges, as well as realizing different models through the selection of different
kernels, permits interpretability of the results. In this work, we model changes
in attention as variability in the fluctuation of the time series of occurrences of
user-defined tags, which can be categorized as a special case of heteroscedasticity
or input-dependent noise. We provide an extension of sparse input Gaussian
Processes [15, 14] which allows us to model functional dependence in the time
variation of the fluctuations. In practical experiments, we study the top 10 tags
of the Stackoverflow data set over different years, spanning a data set of over
2.9 million questions. We find that our model outperforms predictions made by
the simple GP model under variable noise. In particular, we uncover weekly and
seasonal periodicity patterns as well as random behavior in monthly trends. All
in all, we are able to forecast the number of questions within a 5 percent error
20 days into the future.
In the next section, we formally introduce the Gaussian Process framework and
provide details regarding our extensions towards variable noise models. We then
show an analysis of the periodicity of the time series of tag activity as appar-
ent from the Stackoverflow data set. Next, we compare our prediction results
with those of other models and discuss the advantages of introducing functional
dependencies on noise terms. Finally, we provide conclusions and directions for
future work.


2   A Model for Time Series Analysis

In this section, we propose a Gaussian process (GP) model for regression that
extends the sparse pseudo-input Gaussian process (SPGP) for regression [14].
Our model deals with the problem of overfitting that hampers the SPGP model
and makes it possible to analyze the function of the uncertainty added to every
pseudo-input. By analyzing the uncertainty function, we indirectly analyze the
effects of heteroscedastic noise.
A GP is a Bayesian model that is commonly used for regression tasks [11]. The
main advantages of this method are its non-parametric nature, the possibility
of interpreting the model through flexible kernel selection, and the confidence
intervals (error bars) obtained for every prediction. The non-parametric nature
of this method has a drawback, though: the computational cost of training is
$\mathcal{O}(N^3)$, where $N$ is the number of training points. There are many
sparse approximation methods of the full GP that try to lower the computational
cost of training to $\mathcal{O}(M^2 N)$, where $M$ is the size of the subset of
the training points that are used for the approximation (i.e. the active set) and
typically $M \ll N$ [13, 12]. The $M$ points for the approximation are chosen
according to various information criteria. This leads to difficulties w.r.t. learning
the kernel hyperparameters by maximizing the marginal likelihood of the GP
using gradient ascent: the re-selection of the active set causes non-smooth
fluctuations of the gradients of the marginal likelihood, which likely results in
convergence to sub-optimal local maxima [14].


2.1   Gaussian Process for Regression

Next, we briefly review the GP model for regression; for a detailed discussion
we refer to [11, 10].
Consider a data set $\mathcal{D}$ of size $N$ containing pairs of input vectors
$X = \{x_n\}_{n=1}^{N}$ and real-valued target points $y = \{y_n\}_{n=1}^{N}$.
In order to apply the GP model to regression problems, we need to account for
noise in the available target values, which are thus expressed as

$$y_n = f_n + \epsilon_n \tag{1}$$

where $f_n = f(x_n)$ and $\epsilon_n$ is a random noise variable which is
independently chosen for each $n$. We shall consider a noise process following
a Gaussian distribution defined as

$$p(y \mid f) = \mathcal{N}\big(y \mid f, \sigma^2 I\big), \tag{2}$$
where $\mathcal{N}(y \mid m, C)$ is a Gaussian distribution with mean $m$ and
covariance $C$. The marginal distribution $p(f)$ is then given by another
Gaussian distribution, namely $p(f \mid X, \theta) = \mathcal{N}(f \mid 0, K_N)$.
The covariance function that determines $K_N$ is chosen such that, if points
$x_n$ and $x_m$ are similar, the value $[K_N]_{nm}$ expresses this similarity.
Usually, this property of the covariance function is controlled by a small number
of hyperparameters $\theta$. Integrating out over $f$, we obtain the marginal
likelihood as

$$p(y \mid X, \theta) = \int p(y \mid f)\, p(f \mid X, \theta)\, df
  = \mathcal{N}\big(y \mid 0, K_N + \sigma^2 I_N\big), \tag{3}$$

which is used for training the GP model by maximizing it with respect to
$\theta$ and $\sigma^2$. The distribution of the target value of a new point
$x$ is then

$$p(y \mid x, \mathcal{D}, \theta)
  = \mathcal{N}\big(y \mid k_x^\top (K_N + \sigma^2 I)^{-1} y,\;
    K_{xx} - k_x^\top (K_N + \sigma^2 I)^{-1} k_x + \sigma^2\big), \tag{4}$$

where $[k_x]_n = K(x_n, x)$ and $K_{xx} = K(x, x)$. In order to predict with
the GP model, we need to have all the training data available at run-time,
which is why the GP for regression is referred to as a non-parametric model.
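To make Eqs. (3) and (4) concrete, the following minimal sketch computes the GP
predictive mean and variance in Python with NumPy. The squared-exponential
kernel and all hyperparameter values below are illustrative assumptions, not the
settings used in our experiments.

import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential covariance k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2)),
    # for 1-d inputs such as days on a time axis.
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, x_star, noise_var=1.0, **kern):
    # Predictive mean and variance of Eq. (4) for a single test input x_star.
    K_N = rbf_kernel(X, X, **kern) + noise_var * np.eye(len(X))
    k_x = rbf_kernel(X, np.array([x_star]), **kern).ravel()
    alpha = np.linalg.solve(K_N, y)               # (K_N + sigma^2 I)^{-1} y
    mean = k_x @ alpha
    var = (rbf_kernel(np.array([x_star]), np.array([x_star]), **kern)[0, 0]
           - k_x @ np.linalg.solve(K_N, k_x) + noise_var)
    return mean, var

# Toy usage: a synthetic daily-count series treated as 1-d regression.
X = np.arange(30, dtype=float)
y = 700 + 50 * np.sin(2 * np.pi * X / 5) + np.random.randn(30) * 10
mu, var = gp_predict(X, y, x_star=31.0, noise_var=100.0, lengthscale=3.0)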


2.2   SPGP and SPGP+HS Models

An approximation of the full GP model for regression is presented in [14] in which
the authors propose the sparse pseudo-input Gaussian process (SPGP) regression
model that enables a search for the kernel hyper-parameters and the active set
in a single joint optimization process. This is possible because the active set
(the $M$ pseudo-inputs) is allowed to take any position in the data space, not
only positions that coincide with the training data. Parameterizing the
covariance function of the GP by the pseudo-inputs makes it possible to learn
the pseudo-inputs using gradient ascent. This is a major advantage, because it
improves the model fit by fine-tuning the locations of the pseudo-inputs. Let
$\bar{X} = \{\bar{x}_m\}_{m=1}^{M}$ be the pseudo-inputs and
$\bar{f} = \{\bar{f}_m\}_{m=1}^{M}$ the pseudo-targets; the predictive
distribution of the model for a new input $x_*$ is then given by

$$p(y_* \mid x_*, \mathcal{D}, \bar{X})
  = \int p(y_* \mid x_*, \bar{X}, \bar{f})\, p(\bar{f} \mid \mathcal{D}, \bar{X})\, d\bar{f}
  = \mathcal{N}\big(y_* \mid \mu_*, \sigma_*^2\big), \tag{5}$$

$$\mu_* = k_*^\top Q_M^{-1} K_{MN} \big(\Lambda + \sigma^2 I\big)^{-1} y$$
$$\sigma_*^2 = K_{**} - k_*^\top \big(K_M^{-1} - Q_M^{-1}\big) k_* + \sigma^2,$$

where $K_N$ is the covariance matrix of the training data, $K_M$ is the
covariance matrix of the pseudo-inputs, $\sigma^2$ is the noise variance,
$Q_M$ is defined as

$$Q_M = K_M + K_{MN} \big(\Lambda + \sigma^2 I\big)^{-1} K_{NM} \tag{6}$$

and $\Lambda$ is defined as

$$\Lambda = \mathrm{diag}\big(K_N - K_{NM} K_M^{-1} K_{MN}\big). \tag{7}$$

Finding the pseudo-input locations $\bar{X}$ and the hyperparameters (kernel
parameters and noise) $\Theta = \{\theta, \sigma^2\}$ can be done by maximizing
the marginal likelihood (8) with respect to the parameters $\{\bar{X}, \Theta\}$:

$$p(y \mid X, \bar{X}, \Theta)
  = \int p(y \mid X, \bar{X}, \bar{f})\, p(\bar{f} \mid \bar{X})\, d\bar{f}
  = \mathcal{N}\big(y \mid 0, K_{NM} K_M^{-1} K_{MN} + \Lambda + \sigma^2 I\big). \tag{8}$$

One positive effect of the sparsity of the SPGP model is the capability of learn-
ing data sets that have variable noise, where the term variable noise refers to
noise which depends on the input. However, it is important to note that this
capability is limited, and an improvement of the SPGP model is presented in
[15]. Introducing an additional uncertainty parameter $h_m$ for every
pseudo-input point makes the model more flexible and allows for improved
representations of heteroscedastic data sets. The covariance matrix of the
pseudo-inputs is redefined as

$$K_M \to K_M + \mathrm{diag}(h), \tag{9}$$

where $h$ is a positive vector of uncertainties that needs to be learned and
$\mathrm{diag}(h)$ represents a diagonal matrix whose elements are those of the
vector $h$. This extension allows gradual control over the influence of each
pseudo-input: if the uncertainty $h_m = 0$, then pseudo-input $m$ behaves as in
the standard SPGP, yet, as $h_m$ grows, the particular pseudo-input has less
influence on the predictive distribution. This possibility of partially turning
off pseudo-inputs allows a larger noise variance in the prediction. The authors
of [15] refer to this heteroscedastic (input-dependent noise) extension as
SPGP+HS.
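To illustrate how Eqs. (5)-(7) combine with the heteroscedastic term of Eq. (9),
the following Python/NumPy sketch computes the SPGP+HS predictive distribution
for one test input. The kern argument is a placeholder covariance function, and
$Q_M$ follows the definition in Eq. (6); this is a sketch under these
assumptions, not the reference implementation of [14, 15].

import numpy as np

def spgp_hs_predict(X, y, Xbar, h, x_star, kern, noise_var):
    # X, y: training data; Xbar: pseudo-inputs; h: per-pseudo-input
    # uncertainties of Eq. (9); kern(A, B) returns a covariance matrix.
    K_M = kern(Xbar, Xbar) + np.diag(h)           # Eq. (9): heteroscedastic term
    K_MN = kern(Xbar, X)
    # Lambda of Eq. (7); only its diagonal is needed.
    V = np.linalg.solve(K_M, K_MN)                # K_M^{-1} K_MN
    lam = np.diag(kern(X, X)) - np.sum(K_MN * V, axis=0)
    A = 1.0 / (lam + noise_var)                   # diag of (Lambda + sigma^2 I)^{-1}
    Q_M = K_M + (K_MN * A) @ K_MN.T               # Eq. (6)
    k_star = kern(Xbar, np.atleast_1d(x_star)).ravel()
    mu = k_star @ np.linalg.solve(Q_M, (K_MN * A) @ y)     # predictive mean
    var = (kern(np.atleast_1d(x_star), np.atleast_1d(x_star))[0, 0]
           - k_star @ (np.linalg.solve(K_M, k_star) - np.linalg.solve(Q_M, k_star))
           + noise_var)                                     # predictive variance
    return mu, var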


2.3   SPGP+FUNC-HS

Introducing the heteroscedastic extension to the SPGP empowers the model
to learn from data sets with varying noise. However, making the model this
flexible may cause problems of overfitting. Also, using the SPGP+HS to predict
user and website activities does not allow us to interpret the behavior of the
noise, because the noise is represented as a positive vector $h$ of uncertainties,
and attempts at interpreting these values do not yield meaningful information
about the behavior of the noise.
One way of solving the problems of overfitting and lack of interpretability would
be to put a prior distribution over the vector $h$ of uncertainties. However,
taking this approach leads to computationally intractable integrals.
The solution which we propose for these problems is to make use of an uncer-
tainty function that depends on the pseudo-inputs. Our covariance function of
the pseudo-inputs is defined as

$$K_M \to K_M + \mathrm{diag}\big(f_h(\bar{x}_m)\big), \tag{10}$$

where $f_h$ is the uncertainty function and $\bar{x}_m$ is a pseudo-input. By
defining the heteroscedastic extension in this way, it is possible for the
parameters of the uncertainty function to be learned by gradient-based maximum
likelihood approaches. Hence, later on, we are able to interpret the parameters
of the heteroscedastic noise function as the parameters that govern the noise in
the model. Another advantage of having a heteroscedastic function is that it
restricts the parameter search space when learning the model. This restriction
can be beneficial because it removes unnecessary local maxima, which results in
much faster convergence when learning the model and also in improved chances of
reducing overfitting. In the following, we will refer to our new heteroscedastic
function model as SPGP+FUNC-HS.
For modeling the Stackoverflow data set, we introduce two heteroscedastic noise
functions. In general, we may use any function that can describe the noise of the
given data set. The first heteroscedastic noise function which we consider is a
simple sine function defined by

$$f_h(x_m) = a \sin(2\pi\omega x_m + \varphi), \tag{11}$$

where $a$ is the amplitude, $\omega$ is the frequency and $\varphi$ is the
phase. We refer to this model as SPGP+SIN-HS. The second heteroscedastic noise
function we investigate is a product of the sine function and an RBF kernel,
namely

$$f_h(x_m, h_m) = c^2 \, e^{-\frac{(x_m - h_m)^2}{2l^2}} \sin(2\pi\omega x_m + \varphi), \tag{12}$$

where $c$ is the variance, $h_m$ is a mean associated with every pseudo-input
$x_m$ in the RBF kernel, and $l$ is the length scale of the RBF kernel. The mean
in the RBF kernel can be initialized at random or set by the user if
corresponding prior knowledge is available. Setting a mean for every
pseudo-input point divides the whole input space into regions where, in each
region, we have a function governing the uncertainty associated with every
pseudo-input. Defined in this way, the uncertainty function behaves like a
mixture of experts, and we refer to this model as the SPGP+RBFSIN-HS model.
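The two noise functions of Eqs. (11) and (12) are straightforward to implement.
The following sketch (in Python, with illustrative parameter names) shows the
quantities that would be added to the diagonal of $K_M$ via Eq. (10).

import numpy as np

def sin_noise(x_m, a, omega, phi):
    # Sine heteroscedastic noise function of Eq. (11), used in SPGP+SIN-HS.
    return a * np.sin(2 * np.pi * omega * x_m + phi)

def rbfsin_noise(x_m, h_m, c, length, omega, phi):
    # RBF-modulated sine noise function of Eq. (12), used in SPGP+RBFSIN-HS.
    # Each pseudo-input x_m has its own RBF mean h_m, so the sine envelope is
    # localized around h_m; jointly the pseudo-inputs act like a mixture of experts.
    envelope = c ** 2 * np.exp(-((x_m - h_m) ** 2) / (2 * length ** 2))
    return envelope * np.sin(2 * np.pi * omega * x_m + phi)

# These values enter the covariance via Eq. (10):  K_M -> K_M + np.diag(f_h(xbar))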


3     Results



[Figure 1: periodogram of the daily question counts; the axes show the power
spectral density, PSD (V**2/Hz), against the time period in log(days).]

Fig. 1: Spectral density estimation on the Stackoverflow dataset using a peri-
odogram. We observe two peaks, at two and a half and at five days, where the
latter peak is the doubled period of the former.


In the previous section, we presented the Gaussian process method and two
extensions of this method, the SPGP+HS and the SPGP+FUNC-HS. In this
section, we present results we obtained when using these models on our Stack-
overflow data set.
In order to test our models, we used publicly available data dumps of Stack-
overflow (downloadable at www.archive.org/details/stackexchange). The data set
contains the number of questions and answers of postings
                 MSE                              NLPD                    NLML
                  GP RBFSIN         HS     SIN      GP RBFSIN HS SIN        GP RBFSIN           HS      SIN
     android     960.88 692.03 887.45 720.75       4.65   4.49 4.61 4.49 -1076.37 -948.40 -1149.22 -993.23
     c#         1029.06 881.11 950.64 894.61       4.70   4.62 4.64 4.62 -1003.23 -949.54 -962.43 -961.62
     c++        1216.94 533.68 5068.20 675.84      4.84   4.45 6.02 4.66 -717.14 -698.50 -756.95 -716.58
     html        681.57 678.19 774.17 754.95       4.47   4.45 4.51 4.50 -841.93 -784.78 -798.28 -820.60
     ios        2598.35 1474.72 3064.63 1660.90    5.82   4.81 5.53 4.86 -757.36 -737.24 -750.69 -740.49
     java       1917.86 1431.70 3446.30 1782.17    5.12   4.90 5.79 4.95 -1098.13 -1034.83 -1087.29 -1068.30
     javascript 2992.30 1869.61 2396.68 2102.05    6.09   4.97 5.52 5.22 -1493.42 -883.31 -1054.49 -1044.76
     jquery     808.26 825.28 989.07 1163.88       4.57   4.77 4.69 4.73 -957.31 -932.99 -866.17 -862.45
     php        5892.26 907.07 5379.89 2745.40     6.83   4.60 6.13 5.15 -1042.95 -883.65 -945.85 -853.21
     python     604.89 702.25 744.28 881.65        4.44   4.58 4.53 4.62 -782.68 -842.76 -787.24 -788.14
Table 1: Results showing the MSE and NLPD (smaller is better) on the 2014
question test set and NLML (larger is better) on the 2014 question training
set. GP indicates a pure Gaussian process, HS indicates a sparse pseudo-input
Gaussian process with heteroscedastic noise, SIN-HS refers to a sparse pseudo-
input Gaussian process with sine functional noise, and RBFSIN-HS refers to a
sparse pseudo-input Gaussian process with sine and RBF kernel functional noise.



classified by tag for every business day. The models are trained on a data set
containing information about daily postings in the period from 01.02.2014 to
31.08.2014. The evaluation of the models is done on a test set containing postings
for the first 21 working days in September 2014.

The performance of the presented models depends on the choice of the kernels
used for the covariance matrix. When working with GPs, an additional analysis
is required to select proper kernels for the covariance matrix. Because we work
with a data set that reflects user behavior, we hypothesized that it may show a
form of periodicity in the behavior of the users. Accordingly, we performed a
spectral density estimation of the time series using a periodogram [6, 4, 7].
This analysis shows the power (amplitude) of the time series as a function of
frequency, and in this way we are able to verify whether there are indeed
periodicities and, if so, at what frequency they occur.

A periodogram of the time series data that we are analyzing is shown in Figure
1. Since all our tag-related time series have almost the same periodogram, we
only show one of them. For better interpretability, we converted the frequencies
into periods in order to observe after how many days the patterns repeat. There
are two apparent peaks, the first occurring at two and a half days and the
second at five days. The period of five days appears as an echo of the two and
a half day period; we therefore dismiss the second period and only take the
first into account. Additional characteristics of this data set are minor
irregularities and a long-term rising trend in the overall time series.
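As an illustration, a periodogram such as the one in Figure 1 can be obtained
with a few lines of Python using SciPy; the input file name below is a
hypothetical placeholder for one pre-extracted daily counts series.

import numpy as np
from scipy.signal import periodogram

# counts: daily number of questions for one tag (business days).
counts = np.loadtxt("android_daily_questions.txt")  # hypothetical file

# fs = 1.0 sample per day, so frequencies come out in cycles per day.
freqs, psd = periodogram(counts, fs=1.0)

# Convert frequencies to periods (days) for interpretability; skip the
# zero-frequency (mean) component before ranking the peaks.
periods = 1.0 / freqs[1:]
psd = psd[1:]
top = periods[np.argsort(psd)[::-1][:5]]
print("Dominant periods (days):", np.round(top, 1))  # peaks near 2.5 and 5 expected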

Given these observations, the models that show the best performance use as a
covariance function a sum of four kernels:

$$k(x, x') = k_1(x, x') + k_2(x, x') + k_3(x, x') + k_4(x, x'). \tag{13}$$

The question of how to choose these kernels and the particular role of each
kernel in the learned model will be discussed in the next section.
Next, we present the results achieved for the top ten tags according to the
number of posted questions and answers in the 2014 Stackoverflow data set.
Table 1 presents the results of the different models for the posted questions
time series and Table 2 presents the results of the different models for the
posted answers time series. In order to compare the prediction models, we
considered the following measures (a sketch of how they are computed is given
after this list):
 – Mean Square Error (MSE), which accounts for the accuracy of the prediction
   of an unseen data point,
 – Negative Log Predictive Distribution (NLPD), which quantifies the confidence
   of the predicted values for an unseen data point,
 – Negative Log Marginal Likelihood (NLML), which accounts for how well
   the model fits the training data.
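For reference, the following sketch shows how the first two measures are
computed from the Gaussian predictive means and variances of a model; the NLML
is simply the negative of the marginal likelihood of Eq. (3), or Eq. (8) for
the sparse models, evaluated on the training set.

import numpy as np

def mse(y_true, mu):
    # Mean squared error of the predictive means.
    return np.mean((y_true - mu) ** 2)

def nlpd(y_true, mu, var):
    # Negative log predictive density, averaged over test points, for a
    # per-point Gaussian predictive distribution N(mu, var); smaller is better.
    return np.mean(0.5 * np.log(2 * np.pi * var)
                   + 0.5 * (y_true - mu) ** 2 / var)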


                MSE                              NLPD                    NLML
                 GP RBFSIN        HS      SIN      GP RBFSIN HS SIN        GP RBFSIN           HS      SIN
    android    1097.05 1098.29 1041.58 1031.10    4.80   4.79 4.81 4.78 -903.82 -919.78 -913.40 -927.61
    c#         2889.76 2723.95 2998.26 2878.46    5.24   5.18 5.24 5.22 -1007.62 -983.75 -989.81 -995.13
    c++        1602.27 1436.71 3491.81 3010.85    4.89   4.85 6.21 5.15 -825.76 -805.98 -886.82 -775.62
    html      1856.82 2016.96 2162.96 1904.25     4.98   4.99 5.02 4.96 -1082.09 -957.67 -907.46 -954.44
    ios        3944.90 1541.55 5156.82 5017.53    5.74   4.87 5.48 5.41 -831.93 -839.15 -778.98 -777.68
    java       3207.22 2987.25 4085.50 3090.13    5.38   5.19 5.35 5.20 -1283.56 -1016.72 -1024.00 -1047.48
    javascript 5360.20 4869.97 5434.37 5374.24    5.61   5.50 5.68 5.51 -1141.66 -1110.28 -1139.14 -1131.77
    jquery     1817.16 1728.42 1749.74 1725.81    5.12   5.03 5.07 5.00 -976.82 -1021.99 -1009.31 -1023.81
    php        2950.13 2948.65 3076.88 2982.74    5.16   5.16 5.20 5.17 -1011.84 -1015.56 -995.81 -994.36
    python      911.70 606.00 1660.13 605.22      4.64   4.64 4.90 4.64 -867.67 -820.73 -792.96 -813.29
Table 2: Results showing the MSE and NLPD (smaller is better) on the 2014
answers test set and NLML (larger is better) on the 2014 answers training set.
GP indicates a pure Gaussian process, HS indicates a sparse pseudo-input Gaus-
sian process with heteroscedastic noise, SIN-HS refers to a sparse pseudo-input
Gaussian process with sine functional noise, and RBFSIN-HS refers to a sparse
pseudo-input Gaussian process with sine and RBF kernel functional noise.



For the MSE and the NLPD measures, smaller values are better, while for the
NLML larger values are better. The best model for each tag has been chosen
using the Akaike information criterion (AIC) [3]. We observe that models with
functional noise perform better for nine of the ten tags in the answer time
series and for eight of the ten tags in the question time series. The superior
performance of the SPGP+FUNC-HS over the full GP can be attributed to the fact
that the data set contains variable noise. Note that for this data set,
SPGP+FUNC-HS performs better because of the sparsity of the model and the
additional functional noise that is added to the pseudo-inputs. SPGP+HS performs
worse than the best models because adding only a positive vector of
uncertainties increases the flexibility of the covariance function, which in the
end can lead to overfitting and convergence to bad local maxima. Using a
functional noise constraint, the optimization space shrinks, which implicitly
prunes bad local maxima. The drawback is that the noise function should follow
the distribution of the noise in the data set; otherwise the model will perform
poorly. This is probably the reason why the SPGP+FUNC-HS performs worse on one
tag for the answers and on two tags for the questions.
In Fig. 2, we present two learned models, one for the tag “Java” (Fig. 2a) and
one for the tag “iOS” (Fig. 2b). We observe that the model learned for the Java
tag strives to predict the test points using the mean. In contrast, the model
learned for the iOS tag captures the test points mainly through the noise.

[Figure 2: two panels, (a) “Java” and (b) “iOS”, each showing the number of
visits per day from Feb to Oct with training data, test data, mean, 95%
confidence interval, and pseudo-inputs.]

Fig. 2: Models learned with SPGP+SIN-HS for the tags “Java” and “iOS” in the
2014 data set.




4   Analysis

The different kernels in Eq. (13) allow us to dissect the dynamical behavior of
the population w.r.t. different scales and patterns. In order to portray these
behaviors, we calculated the mean function and variance of Eq. (5) by generating
the vector $k_*^\top$ using each kernel independently. We present the values of
each kernel on the “android” question data set in Fig. 3; a sketch implementing
the four kernels is given after the following list.
 – Mean trends (Fig. 3a) characterize the behavior of the population of users
   over scales measured in months and represent the global mean behavior of the
   population. We hypothesize that they are driven by the sheer size of the user
   base: the more people interested in the tag are visiting the site, the higher
   the average number of questions per month. Further, this overall trend might
   represent changes in the dominance of this particular question tag in the
   data set. Because the tag refers to a programming language, trends like this
   indicate changes in the attention paid to various languages. Such dynamics
   are modeled using the rational quadratic kernel

   $$k_1(x, x') = \theta_6^2 \left(1 + \frac{(x - x')^2}{2\theta_8\theta_7^2}\right)^{-\theta_8}. \tag{14}$$

 – Seasonal trends (Fig. 3b) arise on a time scale smaller than the major trends
   and show both periodic and stochastic patterns. They represent changes in the
   population behavior throughout the different months of the year, which can be
   uncovered with the Ornstein-Uhlenbeck kernel

   $$k_2(x, x') = \theta_1 \exp\left(-\frac{|x - x'|}{\theta_2}\right). \tag{15}$$

 – Weekly periods (Fig. 3c), as obtained from the periodogram, represent
   weekly usage patterns and fine-grained periods of activity in our data set.
   We hypothesize that such behaviors are related to natural work patterns
   during the working week and model them using the kernel

   $$k_3(x, x') = \theta_3^2 \exp(L_1 + L_2), \tag{16}$$

   where we define $L_1$ and $L_2$ as

   $$L_1 = -\frac{(x - x')^2}{2\theta_4^2} \tag{17}$$

   and

   $$L_2 = -\frac{2\sin^2[\pi(x - x')/P]}{\theta_5^2}. \tag{18}$$

 – Weekly noise (Fig. 3d) refers to fluctuations in the weekly behavior which
   are to be expected due to the statistical nature of our data set. Randomness
   in the behavioral pattern of each user might give rise to fluctuations, which
   we model using the kernel

   $$k_4(x, x') = \theta_9^2 \left(1 + \frac{(x - x')^2}{2\theta_{11}\theta_{10}^2}\right)^{-\theta_{11}}. \tag{19}$$
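The following sketch (Python/NumPy) implements the four kernels of Eqs. (14)-(19)
and their sum of Eq. (13). The parameter names follow the $\theta_i$ of the
text; all values are placeholders to be learned by maximum likelihood.

import numpy as np

def k1_rational_quadratic(x, xp, t6, t7, t8):
    # Mean-trend kernel, Eq. (14).
    return t6 ** 2 * (1 + (x - xp) ** 2 / (2 * t8 * t7 ** 2)) ** (-t8)

def k2_ornstein_uhlenbeck(x, xp, t1, t2):
    # Seasonal-trend kernel, Eq. (15).
    return t1 * np.exp(-np.abs(x - xp) / t2)

def k3_locally_periodic(x, xp, t3, t4, t5, P=2.5):
    # Weekly-period kernel, Eqs. (16)-(18); P = 2.5 days from the periodogram.
    L1 = -(x - xp) ** 2 / (2 * t4 ** 2)
    L2 = -2 * np.sin(np.pi * (x - xp) / P) ** 2 / t5 ** 2
    return t3 ** 2 * np.exp(L1 + L2)

def k4_weekly_noise(x, xp, t9, t10, t11):
    # Weekly-noise kernel, Eq. (19).
    return t9 ** 2 * (1 + (x - xp) ** 2 / (2 * t11 * t10 ** 2)) ** (-t11)

def k_sum(x, xp, params):
    # Composite covariance of Eq. (13) as the sum of the four kernels.
    return (k1_rational_quadratic(x, xp, *params["k1"])
            + k2_ornstein_uhlenbeck(x, xp, *params["k2"])
            + k3_locally_periodic(x, xp, *params["k3"])
            + k4_weekly_noise(x, xp, *params["k4"]))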



5   Conclusion and Future Work

In this paper, we addressed the problem of forecasting the daily posting behav-
ior of users of the Stackoverflow question answering web platform. In order to
accomplish this task, we extended the variable noise pseudo-input Gaussian
Process framework by introducing a functional noise variant.
[Figure 3: four panels showing the number of visits per day from Feb to Sep,
each with training data, mean, 95% confidence interval, and pseudo-inputs:
(a) mean trend k1, Eq. (14); (b) seasonal trends k2, Eq. (15); (c) weekly
periods k3, Eq. (16); (d) weekly noise k4, Eq. (19).]

Fig. 3: Decomposition of the SPGP+SIN-HS model for the “android” tag into the
different kernels. We observe four main behaviors: mean trends, seasonal trends,
weekly periods, and weekly noise.


The idea of using
functional descriptions of noise allowed us to study periodic patterns in collective
attention shifts and was found to act as a regularizer in model training.

Our extended Gaussian Process framework with functional representations of
various kinds of noise provides the added advantage of increased interpretability
of the results, as the different kernels defined for this purpose can uncover
different kinds of dynamics. In particular, our kernels revealed major distinct
characteristics of the question answering behavior of users. First, there are
major trends on time scales of about six months, showing growing and declining
interest in particular topics or corresponding tags. Second, these major trends
are perturbed by seasonal behavior; for example, overall activity usually drops
during the summer season. Third, on a fine-grained scale, there are weekly
patterns characterized by periods of 2.5 days. Fourth, there are noisy
fluctuations in activities on daily scales.

Given the models and results presented in this paper, there are various
directions for future work. First and foremost, we are currently working on
implementing a distributed Gaussian Process framework in order to extend our
approach towards massive amounts of behavioral data (use of tags, comments, and
likes) that can be retrieved from similar social media platforms such as Twitter
or Facebook.
References
 1. L. A. Adamic, J. Zhang, E. Bakshy, and M. S. Ackerman. Knowledge Sharing and
    Yahoo Answers: Everyone Knows Something. In Proc. of ACM WWW, 2008.
 2. A. Anderson, D. Huttenlocher, J. Kleinberg, and J. Leskovec. Discovering Value
    from Community Activity on Focused Question Answering Sites: A Case Study of
    Stack Overflow. In Proc. of ACM KDD, 2012.
 3. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 4. J. D. Hamilton. Time Series Analysis. Princeton University Press, 1994.
 5. K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. Most Likely Heteroscedastic
    Gaussian Process Regression. In Proceedings of the 24th International Conference
    on Machine Learning, pages 393–400. ACM, 2007.
 6. D. G. Manolakis, V. K. Ingle, and S. M. Kogon. Statistical and Adaptive Signal
    Processing: Spectral Estimation, Signal Modeling, Adaptive Filtering, and Array
    Processing. Artech House Norwood, 2005.
 7. C. Ojeda, R. Sifa, and C. Bauckhage. Investigating and Forecasting User Activities
    in Newsblogs: A Study of Seasonality, Volatility and Attention Burst. Work in
    Progress, 2016.
 8. H. Oktay, B. J. Taylor, and D. D. Jensen. Causal Discovery in Social Media
    Using Quasi-experimental Designs. In Proc. of ACM Workshop on Social Media
    Analytics, 2010.
 9. J. S. Pudipeddi, L. Akoglu, and H. Tong. User Churn in Focused Question An-
    swering Sites: Characterizations and Prediction. In Proc. of ACM WWW, 2014.
10. C. E. Rasmussen. Evaluation of Gaussian Processes and Other Methods for Non-
    linear Regression. PhD thesis, University of Toronto, 1996.
11. C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning.
    MIT Press, 2005.
12. M. Seeger. PAC-Bayesian Generalisation Error Bounds for Gaussian Process Clas-
    sification. J. Mach. Learn. Res., 3:233–269, 2003.
13. M. Seeger, C. Williams, and N. Lawrence. Fast Forward Selection to Speed Up
    Sparse Gaussian Process Regression. In Proc. of Workshop on Artificial Intelligence
    and Statistics, 2003.
14. E. Snelson and Z. Ghahramani. Sparse Gaussian Processes Using Pseudo-inputs.
    In Proc. of NIPS, 2005.
15. E. Snelson and Z. Ghahramani. Variable Noise and Dimensionality Reduction for
    Sparse Gaussian Processes. arXiv preprint arXiv:1206.6873, 2012.