<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Variable Attention and Variable Noise: Forecasting User Activity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cesar Ojeda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kostadin Cvejoski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafet Sifa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Bauckhage</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The study of collective attention is of growing interest in an age where mass and social media generate massive amounts of often short-lived information. That is, the problem of understanding how particular ideas, news items, or memes grow and decline in popularity has become a central problem of the information age. Recent research efforts in this regard have mainly addressed methods and models which quantify the success of such memes and track their behavior over time. Surprisingly, however, the aggregate behavior of users over the various news and social media platforms where this content originates has largely been ignored, even though the success of memes and messages is linked to the way users interact with web platforms. In this paper, we therefore present a novel framework that allows for studying the shifts of attention of whole populations related to websites or blogs. The framework is an extension of the Gaussian process methodology, into which we incorporate regularization methods that improve prediction and model input-dependent noise. We provide comparisons with the traditional Gaussian process and show improved results. Our study on a real-world data set uncovers hidden patterns of user behavior.</p>
      </abstract>
      <kwd-group>
        <kwd>Gaussian process</kwd>
        <kwd>Regularization Methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Over the last couple of years, so-called "question answering" (QA) sites have
gained considerable popularity. These are internet platforms where users pose
questions to a general population. Yahoo Answers, Quora, and the Stack
Exchange family establish internet communities which provide natural and seamless
ways of organizing and providing knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. So far, dynamical aspects of
such question answering sites have been studied in different contexts. Previous
work in this area includes studying causality aspects through quasi-experimental
designs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], user churn analysis through classification algorithms such as support
vector machines or random forests [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and predictions of the future value of
question-answer pairs according to the initial activity of the question post [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
In contrast to previous work where the long-term activity of users is predicted,
our focus in this paper is time series analysis related to user-defined tags. This
approach allows for a detailed daily analysis of the behavior of users, and we
concentrate on the QA site Stackoverflow. This platform has an established reputation
on the web and boasts a community of over 5 million distinct active users who,
so far, have provided more than 18 million answers to more than 11 million questions.
Thanks to the sheer size of the corresponding data set as well as the
regular activity of the user base, we are able to mine temporal data in order to
uncover defining aspects of the dynamics of the user behavior.
      </p>
      <p>
        Due to the complexity of user-system interaction (millions of people discuss
thousands of topics), exible and accurate models are required in order to
guarantee reliable forecasting. In recent years the Bayesian setting and the Gaussian
Process (GP) framework [
        <xref ref-type="bibr" rid="ref11 ref5">11, 5</xref>
        ] have been shown to provide an accurate and flexible
tool for time series analysis. In particular, the possibility of incorporating error
ranges, as well as of expressing different models through the selection of different kernels,
permits interpretability of the results. In this work, we model changes in attention as
variability in the fluctuation of the time series of occurrences of user-defined
tags, which can be categorized as a special case of heteroscedasticity, or
input-dependent noise. We provide an extension of sparse input Gaussian Processes [
        <xref ref-type="bibr" rid="ref14 ref15">15,
14</xref>
        ] which allows us to model functional dependence in the time variation of the
fluctuations. In practical experiments, we study the top 10 different tags of the
Stackoverflow data set over different years, spanning a data set of over 2.9 million
questions. We find that our models outperform predictions made by the simple
GP model under variable noise. In particular, we uncover weekly and seasonal
periodicity patterns as well as random behavior in monthly trends. All in all, we
are able to forecast the number of questions within a 5 percent error 20 days into
the future.
      </p>
      <p>In the next section, we formally introduce the Gaussian Process framework and
provide details regarding our extensions towards variable noise models. We then
show an analysis of the periodicity of the time series of tag activity as
apparent from the Stackoverflow data set. Next, we compare our prediction results
with those of other models and discuss the advantages of introducing functional
dependencies on noise terms. Finally, we provide conclusions and directions for
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>A Model for Time Series Analysis</title>
      <p>
        In this section, we propose a Gaussian process (GP) model for regression that
extends the sparse pseudo-input Gaussian process (SPGP) for regression [
        <xref ref-type="bibr" rid="ref14">14</xref>
          ].
Our model deals with the problem of overfitting that hampers the SPGP model
and makes it possible to analyze the function of the uncertainty added to
every pseudo-input. By analyzing this uncertainty function, we indirectly analyze the
effects of heteroscedastic noise.
      </p>
      <p>
        A GP is a Bayesian model that is commonly used for regression tasks [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
main advantages of this method are its non-parametric nature, the possibility
of interpreting the model through flexible kernel selection, and the confidence
intervals (error bars) obtained for every prediction. The non-parametric nature
of this method has a drawback, though: the computational cost of training
is $O(N^3)$, where $N$ is the number of training points. There are many sparse
approximations of the full GP that try to lower the computational cost
of training to $O(M^2 N)$, where $M$ is the size of the subset of the training
points used for the approximation (i.e., the active set) and typically $M \ll N$
[
        <xref ref-type="bibr" rid="ref12 ref13">13, 12</xref>
        ]. The $M$ points for the approximation are chosen according to various
information criteria. This leads to difficulties w.r.t. learning the kernel
hyperparameters by maximizing the marginal likelihood of the GP using gradient
ascent: the re-selection of the active set causes non-smooth fluctuations in the
gradients of the marginal likelihood, which likely results in convergence to
suboptimal local maxima [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Gaussian Process for Regression</title>
        <p>
          Next, we first briefly review the GP model for regression; for a detailed
discussion, we refer to [
          <xref ref-type="bibr" rid="ref10 ref11">11, 10</xref>
          ].
        </p>
        <p>
          Consider a data set $D$ of size $N$ containing input vectors $X = \{x_n\}_{n=1}^N$
and real-valued targets $\mathbf{y} = \{y_n\}_{n=1}^N$. In order to apply the GP model to
regression problems, we need to account for noise in the available target values,
which are thus expressed as
        </p>
        <p>
          $$y_n = f_n + \epsilon_n,$$
where $f_n = f(x_n)$ and $\epsilon_n$ is a random noise variable which is drawn independently
for each $n$. We shall consider a noise process following a Gaussian
distribution defined as
$$p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma^2 I),$$
where $\mathcal{N}(\mathbf{y} \mid \mathbf{m}, C)$ is a Gaussian distribution with mean $\mathbf{m}$ and covariance $C$.
The marginal distribution of $\mathbf{f}$ is then given by another Gaussian distribution,
namely $p(\mathbf{f} \mid X, \theta) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, K_N)$. The covariance function that determines
$K_N$ is chosen such that, if points $x_n$ and $x_m$ are similar, the
value $[K_N]_{nm}$ expresses this similarity. Usually, this property of the
covariance function is controlled by a small number of hyperparameters $\theta$.
        </p>
        <p>
          Integrating out over $\mathbf{f}$, we obtain the marginal likelihood as
$$p(\mathbf{y} \mid X, \theta) = \int p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X, \theta)\, d\mathbf{f} = \mathcal{N}(\mathbf{y} \mid \mathbf{0},\, K_N + \sigma^2 I).$$
For a new input $x_*$, the resulting predictive distribution is Gaussian with mean and variance
$$\mu_* = \mathbf{k}_*^\top (K_N + \sigma^2 I)^{-1} \mathbf{y}, \qquad \sigma_*^2 = K_{**} - \mathbf{k}_*^\top (K_N + \sigma^2 I)^{-1} \mathbf{k}_* + \sigma^2, \quad (5)$$
where $[\mathbf{k}_*]_n = K(x_n, x_*)$ and $K_{**} = K(x_*, x_*)$. In order to predict with the GP model,
we need to have all the training data available at run-time, which is why
the GP for regression is referred to as a non-parametric model.
        </p>
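        <p>To make Eq. (5) concrete, here is a minimal NumPy sketch of exact GP prediction; the RBF kernel, its hyperparameters, and the toy data are our own illustrative assumptions, not the experimental setup of this paper:</p>
        <preformat>
import numpy as np

def rbf(A, B, length_scale=1.0, variance=1.0):
    # squared-exponential kernel: variance * exp(-|a - b|^2 / (2 l^2))
    d2 = (A[:, None] - B[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X, y, X_star, noise=0.1):
    # exact GP predictive mean and variance, Eq. (5); O(N^3) in the training size
    K = rbf(X, X) + noise ** 2 * np.eye(len(X))
    k_star = rbf(X, X_star)                 # [k_*]_n = K(x_n, x_*)
    alpha = np.linalg.solve(K, y)           # (K_N + sigma^2 I)^{-1} y
    mu = k_star.T @ alpha                   # predictive mean
    v = np.linalg.solve(K, k_star)
    var = rbf(X_star, X_star).diagonal() - np.sum(k_star * v, axis=0) + noise ** 2
    return mu, var

# toy usage: a noisy 5-day cycle observed for 50 days, predicted 20 days ahead
X = np.linspace(0.0, 49.0, 50)
y = np.sin(2 * np.pi * X / 5.0) + 0.1 * np.random.randn(50)
mu, var = gp_predict(X, y, np.linspace(50.0, 69.0, 20))
        </preformat>
        <p>
          The $O(N^3)$ linear solves against $K_N + \sigma^2 I$ in this sketch are exactly the cost that the sparse approximations discussed next try to avoid.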
An approximation of the full GP model for regression is presented in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], in which
the authors propose the sparse pseudo-input Gaussian process (SPGP) regression
model that enables a search for the kernel hyperparameters and the active set
in a single joint optimization process. This is possible because the
active set (the $M$ pseudo-inputs) is allowed to take any position in the data space,
not only to be a subset of the training data. Parameterizing the covariance function
of the GP by the pseudo-inputs makes it possible to learn the
pseudo-inputs using gradient ascent. This is a major advantage, because it improves the
model fit by fine-tuning the locations of the pseudo-inputs. Let $\bar{X} = \{\bar{x}_m\}_{m=1}^M$
be the pseudo-inputs and $\bar{\mathbf{f}} = \{\bar{f}_m\}_{m=1}^M$ the pseudo-targets;
the predictive distribution of the model for a new input $x_*$ is then given by
        </p>
        <p>
          $$p(y_* \mid x_*, D, \bar{X}) = \int p(y_* \mid x_*, \bar{X}, \bar{\mathbf{f}})\, p(\bar{\mathbf{f}} \mid D, \bar{X})\, d\bar{\mathbf{f}}, \quad (6)$$
where $K_N$ denotes the covariance matrix of the training data, $K_M$ the covariance
matrix of the pseudo-inputs, and $\sigma^2$ the noise variance; $Q_M$ and $\Lambda$ are defined as
$$Q_M = K_M + K_{MN} (\Lambda + \sigma^2 I)^{-1} K_{NM} \quad (7)$$
and
$$\Lambda = \mathrm{diag}\!\left(K_N - K_{NM} K_M^{-1} K_{MN}\right).$$
Finding the pseudo-input locations $\bar{X}$ and the hyperparameters (kernel
parameters and noise) $\theta = \{\theta_k, \sigma^2\}$ can be done by maximizing the marginal likelihood
(8) with respect to the parameters $\{\bar{X}, \theta\}$:
        </p>
        <p>
          $$p(\mathbf{y} \mid X, \bar{X}, \theta) = \int p(\mathbf{y} \mid X, \bar{X}, \bar{\mathbf{f}})\, p(\bar{\mathbf{f}} \mid \bar{X})\, d\bar{\mathbf{f}} = \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{0},\; K_{NM} K_M^{-1} K_{MN} + \Lambda + \sigma^2 I\right). \quad (8)$$
        </p>
        <p>
          One positive effect of the sparsity of the SPGP model is its capability of
learning data sets that have variable noise, where the term variable noise refers to
noise which depends on the input. However, it is important to note that this
capability is limited; an improvement of the SPGP model is presented in
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Introducing an additional uncertainty parameter $h_m$ for every pseudo-input
point makes the model more flexible and allows for improved representations of
heteroscedastic data sets. The covariance matrix of the pseudo-inputs is redefined
as
        </p>
        <p>
          $$K_M \rightarrow K_M + \mathrm{diag}(\mathbf{h}), \quad (9)$$
where $\mathbf{h}$ is a positive vector of uncertainties that needs to be learned and $\mathrm{diag}(\mathbf{h})$
is a diagonal matrix whose elements are those of the vector $\mathbf{h}$. This
extension makes it possible to gradually modulate the influence of the pseudo-inputs:
if the uncertainty $h_m = 0$, then pseudo-input $m$ behaves as in the
standard SPGP, yet as $h_m$ grows, that particular pseudo-input has less influence
on the predictive distribution. This possibility of partially turning off the pseudo-inputs
allows a larger noise variance in the prediction. The authors of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] refer
to this heteroscedastic (input-dependent noise) extension as SPGP+HS.
        </p>
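        <p>As a compact sketch of how Eqs. (8) and (9) fit together, consider the following NumPy implementation of the SPGP+HS negative log marginal likelihood; this is our illustration (a practical implementation would use the matrix inversion lemma to keep training at $O(NM^2)$ instead of forming the full $N \times N$ covariance):</p>
        <preformat>
import numpy as np

def spgp_hs_nlml(X, y, X_bar, h, noise, kern):
    # kern(A, B): covariance matrix between two sets of inputs
    # h: per-pseudo-input uncertainty vector; h = 0 recovers the standard SPGP
    KM = kern(X_bar, X_bar) + np.diag(h)            # Eq. (9): K_M + diag(h)
    KMN = kern(X_bar, X)
    V = np.linalg.solve(KM, KMN)                    # K_M^{-1} K_MN
    # Lambda = diag(K_N - K_NM K_M^{-1} K_MN)
    lam = np.diag(kern(X, X)) - np.sum(KMN * V, axis=0)
    # covariance of Eq. (8): K_NM K_M^{-1} K_MN + Lambda + sigma^2 I
    C = KMN.T @ V + np.diag(lam) + noise ** 2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(C)
    alpha = np.linalg.solve(C, y)
    return 0.5 * (logdet + y @ alpha + len(X) * np.log(2.0 * np.pi))
        </preformat>
        <p>Minimizing this quantity w.r.t. the pseudo-inputs, $\mathbf{h}$, and the kernel hyperparameters (e.g., with a gradient-based optimizer) performs the joint optimization described above.</p>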
      </sec>
      <sec id="sec-2-2">
        <title>SPGP+FUNC-HS</title>
        <p>Introducing the heteroscedastic extension to the SPGP empowers the model
to learn from data sets with varying noise. However, making the model this
flexible may cause overfitting. Moreover, using the SPGP+HS to predict
user and website activities does not allow us to interpret the behavior of the
noise, because the noise is represented as a positive vector $\mathbf{h}$ of uncertainties, and
attempts at interpreting these values do not yield meaningful information about
the behavior of the noise.</p>
        <p>One way of solving the problems of overfitting and lack of interpretability would
be to place a prior distribution over the vector $\mathbf{h}$ of uncertainties. However, taking
this approach leads to computationally intractable integrals.</p>
        <p>The solution which we propose for these problems is to make use of an
uncertainty function that depends on the pseudo-inputs. Our covariance function of
the pseudo-inputs is defined as</p>
        <p>
          $$K_M \rightarrow K_M + \mathrm{diag}\!\left(f_h(\bar{x}_m)\right), \quad (10)$$
where $f_h$ is the uncertainty function and $\bar{x}_m$ is a pseudo-input. By defining the
heteroscedastic extension in this way, the parameters of the
uncertainty function can be learned by gradient-based maximum likelihood
approaches. Hence, later on, we are able to interpret the parameters of the
heteroscedastic noise function as the parameters that govern the noise in the model.
Another advantage of having a heteroscedastic function is that it restricts the
parameter search space when learning the model. This restriction can be
beneficial because it removes unnecessary local maxima,
which results in much faster convergence when learning the model and also in
improved chances of reducing overfitting. In the following, we refer to our
new heteroscedastic function model as SPGP+FUNC-HS.</p>
        <p>For modeling the Stackoverflow data set, we introduce two heteroscedastic noise
functions. In general, we may use any function that can describe the noise of the
given data set. The first heteroscedastic noise function which we consider is the
simple sine function defined by
$$f_h(\bar{x}_m) = a \sin(2\pi \omega \bar{x}_m + \varphi), \quad (11)$$
where $a$ is the amplitude, $\omega$ is the frequency, and $\varphi$ is the phase. We refer to
this model as SPGP+SIN-HS. The second heteroscedastic noise function we
investigate is the product of a sine function and an RBF kernel, namely
$$f_h(\bar{x}_m; h_m) = c^2\, e^{-\frac{(\bar{x}_m - h_m)^2}{2 l^2}} \sin(2\pi \omega \bar{x}_m + \varphi), \quad (12)$$
where $c$ is the variance, $h_m$ is a mean associated with every pseudo-input $\bar{x}_m$ in
the RBF kernel, and $l$ is the length scale of the RBF kernel. The mean of the RBF
kernel can be initialized at random or set by the user if the user has corresponding
prior knowledge. Setting a mean for every pseudo-input point divides the whole
input space into regions where, in each region, we have a function governing
the uncertainty associated with every pseudo-input. An uncertainty function
defined like this behaves like a mixture of experts, and we refer to this model
as the SPGP+RBFSIN-HS model.</p>
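        <p>Both noise functions translate directly into code. A small sketch (our illustration; the parameter values are placeholders, and keeping the added diagonal non-negative is an implementation detail not discussed above):</p>
        <preformat>
import numpy as np

def fh_sin(x_bar, a, omega, phi):
    # Eq. (11): sinusoidal uncertainty over the pseudo-inputs (SPGP+SIN-HS)
    return a * np.sin(2 * np.pi * omega * x_bar + phi)

def fh_rbfsin(x_bar, h, c, ell, omega, phi):
    # Eq. (12): sine windowed by an RBF centered at the per-pseudo-input
    # means h (SPGP+RBFSIN-HS); each region of the input space gets its
    # own "expert" governing the uncertainty
    window = np.exp(-((x_bar - h) ** 2) / (2.0 * ell ** 2))
    return c ** 2 * window * np.sin(2 * np.pi * omega * x_bar + phi)

# Eq. (10): the uncertainty function enters the pseudo-input covariance as
#   K_M  ->  K_M + np.diag(fh_sin(x_bar, a, omega, phi))
# with (a, omega, phi) learned jointly with the other hyperparameters.
        </preformat>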
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>[Fig. 1. Periodogram of a tag time series: power spectral density, PSD [V**2/Hz], on a logarithmic scale versus time period in log(days).]</p>
      <p>In the previous section, we presented the Gaussian process method and two
extensions of this method, the SPGP+HS and the SPGP+FUNC-HS. In this
section, we present the results we obtained when using these models on our
Stackoverflow data set.</p>
      <p>In order to test our models, we used publicly available data dumps of
Stackoverflow (downloadable at www.archive.org/details/stackexchange). The data set
contains the number of questions and answers of postings
classified by tag for every business day. The models are trained on a data set
containing information about daily postings in the period from 01.02.2014 to
31.08.2014. The evaluation of the models is done on a test set containing postings
for the first 21 working days of September 2014.</p>
      <p>
        The performance of the presented models depends on the choice of the kernels
used for the covariance matrix. When working with GPs, an additional analysis
is required to select proper kernels for the covariance matrix. Because we work
with a data set that reflects user behavior, we hypothesized that it may show a form
of periodicity in the behavior of the users. Accordingly, we performed a spectral
density estimation analysis [
        <xref ref-type="bibr" rid="ref4 ref6 ref7">6, 4, 7</xref>
        ] of the time series using a periodogram
[
        <xref ref-type="bibr" rid="ref4 ref6 ref7">6, 4, 7</xref>
        ]. This analysis shows the power (amplitude) of the time series as a function
of frequency, and in this way we are able to verify whether there are indeed periodicities
and, if so, at what frequencies they occur.
      </p>
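      <p>This analysis is straightforward to reproduce. A sketch using SciPy (the input file is hypothetical; only the periodogram call matters):</p>
      <preformat>
import numpy as np
from scipy.signal import periodogram

# daily question counts for one tag, one value per business day
counts = np.loadtxt("tag_daily_counts.txt")      # hypothetical input file
freqs, power = periodogram(counts, fs=1.0)       # fs = 1 sample per day
periods = 1.0 / freqs[1:]                        # skip the zero frequency
dominant = periods[np.argsort(power[1:])[::-1][:3]]
print("dominant periods (days):", dominant)      # expect peaks near 2.5 and 5
      </preformat>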
      <p>A periodogram of the time series data that we are analyzing is shown in Figure
1. Since all our tag-related time series have almost the same periodogram, we
only show one of them. For better interpretability, we converted the frequencies
into periods in order to observe after how many days the patterns repeat. There are two
apparent peaks, the first occurring at two and a half days and the second at five
days. The period of five days appears to be an echo of the two-and-a-half-day
period; we therefore dismiss the second period and only take into
account the first one. Additional characteristics of this data set are minor
irregularities and a long-term rising trend in the overall time series.
Given these observations, the models that show the best performance use a sum
of four kernels as their covariance function:
$$k(x, x') = k_1(x, x') + k_2(x, x') + k_3(x, x') + k_4(x, x'). \quad (13)$$
The question of how to choose these kernels and the particular role of each kernel
in the learned model will be discussed in the next section.</p>
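      <p>For illustration, Eq. (13) can be instantiated with off-the-shelf kernels; the concrete kernel types and initial hyperparameters below are our assumptions based on the patterns described here, not the paper's exact configuration:</p>
      <preformat>
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    ExpSineSquared, RationalQuadratic, WhiteKernel)

# k1: slow monthly trend, k2: 2.5-day weekly pattern,
# k3: seasonal variation, k4: day-to-day noise
kernel = (RationalQuadratic(length_scale=30.0, alpha=1.0)
          + ExpSineSquared(length_scale=1.0, periodicity=2.5)
          + ExpSineSquared(length_scale=10.0, periodicity=180.0)
          + WhiteKernel(noise_level=1.0))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# usage: gp.fit(days[:, None], counts); gp.predict(future_days[:, None])
      </preformat>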
      <p>
        Next, we present the results achieved for the top ten tags according to the number
of posted questions and answers in the 2014 Stackoverflow data set. Table 1
presents the results of the different models for the posted-questions time series and
Table 2 presents the results of the different models for the posted-answers time
series. In order to compare the prediction models, we considered the following
measures:
– Mean Squared Error (MSE), which accounts for the accuracy of the prediction
at an unseen data point;
– Negative Log Predictive Distribution (NLPD), which provides a confidence
for the values predicted at an unseen data point;
– Negative Log Marginal Likelihood (NLML), which accounts for how well
the model fits the training data.
For the MSE and the NLPD measures, smaller values are better, and for the
NLML larger values are better. The best model for each tag has been chosen
using the Akaike information criterion (AIC) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We observe that models with
functional noise perform better on nine of the ten tags in the answer time series,
and on eight of the ten tags in the question time series. The superior performance
of the SPGP+FUNC-HS over the full GP can be attributed to the fact that the
data set contains variable noise. Note that for this data set, SPGP+FUNC-HS
performs better because of the sparsity of the model and the additional
functional noise that is added to the pseudo-inputs. SPGP+HS performs worse than
the best models because adding only a positive vector of uncertainty increases
the flexibility of the covariance function, which in the end can lead to
overfitting and convergence to bad local maxima. Using a functional noise constraint,
the optimization space shrinks, which implicitly prunes bad local maxima. The
drawback is that the function of the noise should follow the distribution of the
noise in the data set; otherwise the model will perform poorly. This is probably
the reason why the SPGP+FUNC-HS performs worse on one tag for the answers
and on two tags for the questions.
      </p>
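      <p>For reference, the selection criterion trades goodness of fit against the number of learned parameters; its standard definition (not restated in the original) is
$$\mathrm{AIC} = 2k - 2\ln\hat{L},$$
where $k$ is the number of model parameters and $\hat{L}$ is the maximized likelihood, the model with the smallest AIC being preferred.</p>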
      <p>In Fig. 2, we present two learned models, one for the tag "Java" (Fig. 2a) and
one for the tag "iOS" (Fig. 2b). We observe that the model for the Java
tag strives to predict the test points using the mean. In contrast, the model for
the iOS tag predicts the test points in terms of noise.</p>
      <p>[Fig. 2. Learned models for the tags "Java" (a) and "iOS" (b), showing training data, test data, and the predictive mean; the vertical axes give the number of visits per day.]</p>
      <p>
        The different kernels in Eq. (13) allow us to dissect the dynamical behavior of
the population w.r.t. different scales and patterns. In order to portray these
behaviors, we calculated the mean function and variance of Eq. (5) by generating
the vector $\mathbf{k}_*^\top$ from the individual kernels. We present the values of each
kernel on the "android" question data set in Fig. 3.
– Mean trends (Fig. 3a) characterize the behavior of the population of users
over scales measured in months and represent the global mean behavior of the
population. We hypothesize that they are driven by the sheer size of the user
base: the more people interested in the tag are visiting the site, the higher
the average number of questions per month. Furthermore, this overall trend might
represent changes in the dominance of this particular tag
in the data set. Because the tag refers to a programming language, trends like
this indicate changes in attention to the various languages. Such dynamics are
modeled using the rational quadratic kernel
$$k_1(x, x') = \sigma_1^2 \left(1 + \frac{(x - x')^2}{2 \alpha \ell^2}\right)^{-\alpha}.$$
      </p>
      <p>[Fig. 3. Kernel-wise decomposition of the model learned for the "android" question data set, showing training data and the per-kernel mean; the vertical axis gives the number of visits per day, starting in February.]</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we addressed the problem of forecasting the daily posting
behavior of users of the Stackoverflow question answering web platform. In order to
accomplish this task, we extended the variable-noise pseudo-input Gaussian
Process framework by introducing a functional noise variant. The idea of using
functional descriptions of noise allowed us to study periodic patterns in collective
attention shifts and was found to act as a regularizer in model training.
Our extended Gaussian Process framework with functional representations of
various kinds of noise provides the added advantage of increased interpretability
of results, as the different kernels defined for this purpose can uncover different
kinds of dynamics. In particular, our kernels revealed major distinct
characteristics of the question answering behavior of users. First, there are major
trends on time scales of about six months showing growing and declining
interest in particular topics or corresponding tags. Second, these major trends
are perturbed by seasonal behavior; for example, overall activities usually drop
during the summer season. Third, on a fine-grained scale, there are weekly
patterns characterized by periods of 2.5 days. Fourth, there are noisy
fluctuations in activities on daily scales.</p>
      <p>Given the models and results presented in this paper, there are various directions
for future work. First and foremost, we are currently working on implementing
a distributed Gaussian Process framework in order to extend our approach
towards massive amounts of behavioral data (use of tags, comments, and likes)
that can be retrieved from similar social media platforms such as Twitter or
Facebook.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Adamic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bakshy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Ackerman</surname>
          </string-name>
          .
          <article-title>Knowledge Sharing and Yahoo Answers: Everyone Knows Something</article-title>
          .
          <source>In Proc. of ACM WWW</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huttenlocher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Discovering Value from Community Activity on Focused Question Answering Sites: A Case Study of Stack Overflow</article-title>
          .
          <source>In Proc. of ACM KDD</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Bishop</surname>
          </string-name>
          .
          <source>Pattern Recognition and Machine Learning</source>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          .
          <source>Time Series Analysis</source>
          . Princeton University Press,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>K.</given-names>
            <surname>Kersting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Plagemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pfaff</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Burgard</surname>
          </string-name>
          .
          <article-title>Most Likely Heteroscedastic Gaussian Process Regression</article-title>
          .
          <source>In Proceedings of the 24th international conference on Machine learning</source>
          , pages
          <fpage>393</fpage>
          -
          <lpage>400</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Manolakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Ingle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Kogon</surname>
          </string-name>
          .
          <source>Statistical and Adaptive Signal Processing: Spectral Estimation, Signal Modeling, Adaptive Filtering, and Array Processing</source>
          . Artech House, Norwood,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>C.</given-names>
            <surname>Ojeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sifa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bauckhage</surname>
          </string-name>
          .
          <article-title>Investigating and Forecasting User Activities in Newsblogs: A Study of Seasonality, Volatility and Attention Burst</article-title>
          . Work in Progress,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>H.</given-names>
            <surname>Oktay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Jensen</surname>
          </string-name>
          .
          <article-title>Causal Discovery in Social Media Using Quasi-experimental Designs</article-title>
          .
          <source>In Proc. of ACM Workshop on Social Media Analytics</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Pudipeddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Akoglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Tong</surname>
          </string-name>
          .
          <article-title>User Churn in Focused Question Answering Sites: Characterizations and Prediction</article-title>
          .
          <source>In Proc. of ACM WWW</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          .
          <article-title>Evaluation of Gaussian Processes and Other Methods for Nonlinear Regression</article-title>
          .
          <source>PhD thesis</source>
          , University of Toronto,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Gaussian Processes for Machine Learning</article-title>
          . MIT Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>M.</given-names>
            <surname>Seeger</surname>
          </string-name>
          .
          <article-title>PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          :
          <fpage>233</fpage>
          -
          <lpage>269</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>M.</given-names>
            <surname>Seeger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Williams</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          .
          <article-title>Fast Forward Selection to Speed Up Sparse Gaussian Process Regression</article-title>
          .
          <source>In Proc. of Workshop on Artificial Intelligence and Statistics</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>E.</given-names>
            <surname>Snelson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Sparse Gaussian Processes Using Pseudo-inputs</article-title>
          .
          <source>In Proc. of NIPS</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>E.</given-names>
            <surname>Snelson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Variable Noise and Dimensionality Reduction for Sparse Gaussian Processes</article-title>
          .
          <source>arXiv preprint arXiv:1206.6873</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>