Collusion-resistant Spatial Phenomena Crowdsourcing via Mixture of Gaussian Processes Regression

Qikun Xiang¹, Ido Nevat², Pengfei Zhang², Jie Zhang¹
¹ Nanyang Technological University, Singapore. ² Institute for Infocomm Research (I²R), Singapore.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: J. Zhang, R. Cohen, and M. Sensoy (eds.): Proceedings of the 18th International Workshop on Trust in Agent Societies, Singapore, 09-MAY-2016, published at http://ceur-ws.org

Abstract

With the rapid development of mobile devices, spatial location-based crowdsourcing applications have attracted much attention. These applications also introduce new security risks due to untrustworthy data sources. In the context of crowdsourcing applications for spatial interpolation (i.e. spatial regression) using crowdsourced data, the results can be seriously affected if malicious data sources initiate collusion (collaborative) attacks which purposely alter some of the measurements. To combat this serious detrimental effect and mitigate such attacks, we develop a robust model via a Gaussian Process mixture and a computationally efficient algorithm which utilises a Markov chain Monte Carlo (MCMC)-based methodology to produce accurate predictive inference in the presence of collusion attacks. The algorithm is fully Bayesian and produces the posterior predictive distribution for any point-of-interest in the input space. It also assesses the trustworthiness of each worker, i.e. the probability of each worker being honest (trustworthy). Simulation results demonstrate the accuracy of this algorithm.

1 Introduction

Crowdsourcing, as defined by Howe [1], is often characterized by the participation of large networks of people from the public in tasks given by companies or institutes. It has been broadly used in data acquisition, replacing traditional data-gathering processes that are costly. Recent works have shown many successful applications of crowdsourced information gathering in various domains. In [2], the authors introduce a crowdsourcing framework for the acquisition and analysis of mobile videos to monitor rapidly changing disaster sites while optimally utilising bandwidth and resources. In a study by Meier et al. [3], crowdsourced atmospheric data in Berlin, Germany was used for urban climate research. In [4], a privacy-preserving community sensing method is developed for real-time road traffic monitoring. Another study, by Kim et al. [5], uses crowdsourcing and regression techniques to automatically analyse the conflict level in television political debates. While decentralized spatial information gathering (crowdsourcing) techniques have clear benefits, such as being inexpensive and easy to implement, their quality and reliability are not as good as data collected from domain experts. The issue of trust is one of the major impediments to the success of real-world crowdsourcing applications [6]. Apart from inherent noise, untrustworthiness in crowdsourced data can originate in various ways. For example, the sensor used for measurement can be faulty (if the data are sensor readings). Data that come from humans can be subjective and highly biased [7]. For example, workers might put very little effort into the work they are assigned, in the hope that they will not be noticed and will still get the reward.
Other workers might lack the essential skills to perform the task and hence produce erroneous data [7]. In addition, there can be adversaries that strategically report malicious data to disrupt the crowdsourcing application [8].

The challenge of identifying and filtering untrustworthy information before summarization (prediction, decision making, etc.) has been addressed by a number of researchers in various domains. Some of these works focus on binary/multi-class labelling tasks. Groot et al. [9] applied a Gaussian Process regression model to merge information from multiple inaccurate annotators and produce reliable labels for objects. Tarasov et al. [10] proposed a method to dynamically estimate the reliability of workers in crowdsourcing for regression by adopting a multi-armed bandits (MAB) model. Other studies address estimation problems. Reece et al. [11] studied the problem of tackling sensor faults in applications based on sensor networks. Their method consists of two stages: first, a model-based step detects and filters sensor faults of known types; then, a consensus-based step removes unanticipated faults that passed the first step. Venanzi et al. [12] developed an algorithm called MaxTrust that jointly estimates the target attribute and the trustworthiness of each worker using the maximum likelihood paradigm.

Little attention has been given to the problem of spatial regression (i.e. the estimation of a spatial process/function) in the presence of untrustworthy data sources. Venanzi et al. [13] proposed a heteroskedastic Gaussian Process (HGP) model with independent noises and designed an algorithm called TrustHGP to jointly estimate the spatial function and the trustworthiness of each worker. Their model assumes that all reports are noisy observations from the same underlying function with heterogeneous and independent noises, and that malicious observations are associated with higher noise than honest observations. These assumptions are too simplistic to handle complex real-world situations. In particular, this method is vulnerable to collusion attacks. In collusion attack scenarios in crowdsourcing, multiple untrustworthy workers collaborate to inject deceptive false data to skew the regression results. More specifically, an effective strategy for attackers is to make a number of similar reports that are far from the ground truth (or follow a spatial function that is far from the underlying spatial function) at similar input values. This often results in the regression algorithm mistakenly accepting the false values provided by the attackers. We refer to this and similar types of attack as collusion attacks. We formally present the attacker model in the model specification in Section 3 of this paper.

1.1 The Proposed Gaussian Processes Mixture Model

In order to address the issue of collusion attacks, we propose a Gaussian Processes mixture model in which malicious data and non-malicious (honest) data are modelled as coming from two different realizations of Gaussian Processes. We develop a Markov chain Monte Carlo (MCMC)-based algorithm that takes all data as input and produces a statistical prediction of the function value at any spatial location of interest. The predictions produced by the algorithm are based on the honest data: the algorithm minimizes the impact of malicious data by excluding it before performing posterior inference, which yields high accuracy.
The algorithm also assesses the trustworthiness of each worker, in terms of the probability that the data it produces are honest. We show that this method is truly collusion-resistant by performing experiments on a synthetic dataset.

1.2 Contribution

The contributions of this paper include:

• We develop a unified statistical framework for collusion attacks in the regression setting via a Gaussian Process mixture model.

• We develop a novel MCMC-based algorithm to produce a comprehensive summary of the data in terms of the posterior predictive distribution and the trustworthiness of each worker (data point). The inference method is fully Bayesian.

• We conduct a proof of concept on a synthetic dataset and show that our method is truly resistant to collusion attacks.

2 Problem Statement

In the field of spatial monitoring, n sets of data are gathered from n different workers that are potentially untrustworthy. Each set of data comes in the form of a tuple (x_i, y_i), i = 1, ..., n, where x_i ∈ R^d represents the independent variables (spatial location, time, input features, etc.), and y_i ∈ R represents the response (dependent) variable that depends on x_i. Let x := {x_1, ..., x_n} and y := {y_1, ..., y_n}. We refer to a tuple (x_i, y_i) as a data point. Each data point is either honest or malicious, depending on whether it comes from a trustworthy or an untrustworthy worker. Honest data points are noisy observations of the underlying function f_H : R^d → R, while malicious data points are not (as defined later). The task of the regression problem is, given x and y, to produce a reliable estimate of f_H(x_*) together with its probability distribution, for any given x_* ∈ R^d, that is free from the influence of malicious data. The probability density function p(f_H(x_*) | x_*, x, y) is referred to as the posterior predictive distribution and contains all the information about f_H(x_*) given all the observations.

2.1 Illustrative Example of Spatial Phenomena Crowdsourcing

To motivate our model, we present a real-world example of the stated problem. Suppose a meteorological study aims to estimate the spatial structure of atmospheric parameters (e.g. temperature, humidity, air quality) of a particular region of interest, using crowdsourced data collected from workers in the region via hand-held devices equipped with sensors. In this example, each report (x_i, y_i) consists of x_i, the geographical location of the i-th worker (longitude and latitude), and y_i, the sensor reading of the atmospheric parameter, e.g. air quality. The underlying function of interest f_H corresponds to the inherent spatial structure of the parameter, i.e. for a given location x_*, f_H(x_*) corresponds to the ground-truth value of the parameter at that location. Suppose that, for some reason, a worker reports dishonest data at location x_m that does not correspond to the actual parameter value f_H(x_m). This piece of data is deemed malicious. There can be practical motives behind malicious data, such as defaming a spatial location by claiming that the air quality in its vicinity is poor. There can be multiple malicious workers reporting dishonestly to achieve their objectives. Thus, the data reported by these workers cannot be trusted and should be regarded as unrelated to f_H.

3 Crowdsourcing Statistical Model Development

In this section, we formally present the statistical crowdsourcing model, which uses a Gaussian Process mixture model.
3.1 Gaussian Process, Covariance Functions and Hyperparameters

We model the physical phenomena as spatially dependent continuous processes with a spatial correlation structure; the two processes introduced below are independent of each other. The degree of spatial correlation increases as the separation between two observing locations decreases, and each process can be accurately modelled as a Gaussian random field¹. A Gaussian process (GP) defines a distribution over a space of functions; it is completely specified by the equivalent of sufficient statistics for such a process, and is formally defined as follows.

Definition 1 (Gaussian process [14]): Let X ⊂ R^D be some bounded domain of a D-dimensional real-valued vector space. Denote by f(x) : X → R a stochastic process parametrized by x ∈ X. Then, the random function f(x) is a Gaussian process if all its finite-dimensional distributions are Gaussian, i.e. for any m ∈ N, the random variables (f(x_1), ..., f(x_m)) are jointly normally distributed.

We can therefore interpret a GP as formally defined by the following class of random functions:

$$\mathcal{F} := \{ f(\cdot) : \mathcal{X} \mapsto \mathbb{R} \ \text{s.t.} \ f(\cdot) \sim \mathcal{GP}(\mu(\cdot; \theta), C(\cdot, \cdot; \Psi)) \},$$
$$\mu(x; \theta) := \mathbb{E}[f(x)] : \mathcal{X} \mapsto \mathbb{R},$$
$$C(x_i, x_j; \Psi) := \mathbb{E}[(f(x_i) - \mu(x_i; \theta))(f(x_j) - \mu(x_j; \theta))] : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}_+,$$

where at each point the mean of the function is μ(·; θ), parameterised by θ, and the spatial dependence between any two points is given by the covariance function (Mercer kernel) C(·, ·; Ψ), parameterised by Ψ (see the detailed discussion in [14]).

¹ We use Gaussian Process and Gaussian random field interchangeably.

3.2 Specification of the Spatial Processes

We assume that f_H can be accurately modelled by a Gaussian Process prior with zero mean and squared exponential covariance function, parameterised by σ_H², w_{H1}², ..., w_{Hd}²:

$$f_H \sim \mathcal{GP}(0, C_H), \qquad C_H(x_i, x_{i'}) = \sigma_H^2 \exp\left(-\frac{1}{2} \sum_{m=1}^{d} \frac{(x_{im} - x_{i'm})^2}{w_{Hm}^2}\right).$$

Similarly, f_M is also modelled by a Gaussian Process prior with zero mean and squared exponential covariance function, parameterised by σ_M², w_{M1}², ..., w_{Md}²:

$$f_M \sim \mathcal{GP}(0, C_M), \qquad C_M(x_i, x_{i'}) = \sigma_M^2 \exp\left(-\frac{1}{2} \sum_{m=1}^{d} \frac{(x_{im} - x_{i'm})^2}{w_{Mm}^2}\right).$$

3.3 Partition of Data Points and Hyperparameters of Gaussian Processes

Data points are partitioned into two mutually exclusive sets, H and M, denoting honest and malicious data points respectively; that is, either i ∈ H or i ∈ M. Without prior knowledge, H is indistinguishable from M. Hence, we make the assumption that H contains the majority of data points, i.e. |H| > |M|. The case where malicious data points are the majority is unrealistic in real applications; when it happens, there is no way to distinguish which set is honest without informative prior knowledge. Since each data point belongs either to H or to M, we use c ∈ {H, M}^n to denote the configuration vector, where each component c_i denotes the set to which i belongs:

$$c_i = \begin{cases} H & \text{if } i \in \mathcal{H} \\ M & \text{if } i \in \mathcal{M} \end{cases}.$$

As mentioned above, honest data points are noisy observations of the underlying function f_H : R^d → R. Similarly, malicious data points are noisy observations of a different function f_M : R^d → R. Hence,

$$y_i = \begin{cases} f_H(x_i) + \epsilon_i, & \text{if } c_i = H \\ f_M(x_i) + \omega_i, & \text{if } c_i = M \end{cases}, \tag{1}$$

where

$$\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma_n^2), \qquad \omega_i \overset{iid}{\sim} \mathcal{N}(0, \sigma_\omega^2). \tag{2}$$

Within each set, the Gaussian noises are assumed to be independently and identically distributed (iid). The assumption of homoscedasticity is made to simplify the derivation; in future work we will explore extending the model to incorporate input-dependent noise, as in [15–17].
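To make the generative side of the model concrete, the following minimal sketch draws observations from the mixture observation model of Eqs. (1)–(2), given inputs and labels. It is illustrative only: the function and parameter names (se_cov, draw_observations, etc.) are ours, not from the paper, and a small jitter term is added for numerical stability.

```python
import numpy as np

def se_cov(X1, X2, sigma2, w2):
    """Squared exponential covariance of Sec. 3.2:
    C(x_i, x_i') = sigma2 * exp(-0.5 * sum_m (x_im - x_i'm)^2 / w2_m)."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 / w2).sum(axis=-1)
    return sigma2 * np.exp(-0.5 * d2)

def draw_observations(x, c, sigma2_H, w2_H, sigma2_M, w2_M,
                      sigma2_n, sigma2_w, rng):
    """Draw y from Eqs. (1)-(2): one GP realization per class plus iid noise.
    x has shape (n, d); c is an array of 'H'/'M' labels."""
    y = np.empty(len(x))
    for label, s2, w2, noise in [('H', sigma2_H, w2_H, sigma2_n),
                                 ('M', sigma2_M, w2_M, sigma2_w)]:
        m = (c == label)
        K = se_cov(x[m], x[m], s2, w2) + 1e-10 * np.eye(m.sum())  # jitter
        f = rng.multivariate_normal(np.zeros(m.sum()), K)         # GP realization
        y[m] = f + rng.normal(0.0, np.sqrt(noise), m.sum())       # observation noise
    return y
```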
Let θ denote the hyperparameters of the above GPs (including the noise variances),

$$\theta = (\sigma_H^2, \sigma_M^2, w_{H1}^2, \cdots, w_{Hd}^2, w_{M1}^2, \cdots, w_{Md}^2, \sigma_n^2, \sigma_\omega^2).$$

Among these hyperparameters, the signal variances σ_H², σ_M² and the noise variances σ_n², σ_ω² are given inverse-gamma priors,

$$\sigma_H^2, \sigma_M^2 \overset{iid}{\sim} \text{Inv-Gamma}(\alpha_f, \beta_f), \qquad \sigma_n^2, \sigma_\omega^2 \overset{iid}{\sim} \text{Inv-Gamma}(\alpha_n, \beta_n),$$

and the length-scales w_{H1}², ..., w_{Hd}², w_{M1}², ..., w_{Md}² are given log-normal priors,

$$w_{H1}^2, \cdots, w_{Hd}^2, w_{M1}^2, \cdots, w_{Md}^2 \overset{iid}{\sim} \ln\mathcal{N}(\mu_w, \sigma_w^2).$$

Placing priors over the hyperparameters makes the method fully Bayesian and provides extra flexibility to the model. The parameters α_f, β_f, α_n, β_n, μ_w, σ_w² are pre-set values chosen to make these priors non-informative. Alternatively, when prior knowledge about the hyperparameters is available, these parameters can be tuned to achieve better accuracy. In addition, we define the following notation, used in the remainder of the paper:

$$x_H = \{x_i : c_i = H\}, \quad x_M = \{x_i : c_i = M\}, \quad y_H = \{y_i : c_i = H\}, \quad y_M = \{y_i : c_i = M\},$$
$$n_H = |\mathcal{H}| = \sum_{i=1}^{n} \delta(c_i, H), \qquad n_M = |\mathcal{M}| = \sum_{i=1}^{n} \delta(c_i, M).$$

3.4 Distribution of Independent Variables

In our assumptions, x_H is distributed uniformly throughout the domain of x, since no region is preferred over others when gathering data. When prior knowledge about the distribution of the independent variables is available, the uniformity assumption can be replaced with no difficulty:

$$x_{Him} \overset{iid}{\sim} U(l_m, u_m), \quad \text{for } i = 1, \cdots, n_H, \; m = 1, \cdots, d,$$

where x_{Him} denotes the m-th dimension of the i-th honest data point, and l_m, u_m are the lower and upper bounds of the m-th dimension.

It is assumed that x_M follows a d-variate Gaussian distribution with mean μ_x and covariance Σ_x. This is a practical assumption, since in the presence of collusion attacks the damage-maximizing strategy for attackers is to concentrate malicious data points in a particular region. In addition, attackers are assumed to possess limited resources, making it difficult for them to inject malicious data over a broad region (for example, when the independent variables are geographical locations). In the case of a distributed attack, where the independent variables of malicious data spread over a broad region, Σ_x takes a large value that makes the distribution of x_M flat:

$$x_{Mi} \overset{iid}{\sim} \mathcal{N}(\mu_x, \Sigma_x), \quad \text{for } i = 1, \cdots, n_M,$$

where x_{Mi} denotes the i-th malicious data point. Let φ denote the parameters of the input distribution, φ = (μ_x, Σ_x), where μ_x is given a uniform prior,

$$\mu_{xm} \sim U(l_m, u_m), \quad \text{for } m = 1, \cdots, d,$$

and Σ_x is given an inverse-Wishart prior,

$$\Sigma_x \sim \mathcal{W}^{-1}(\Psi, \nu),$$

with the parameters Ψ, ν set to be non-informative.

3.5 Prior Over Configurations

We assume there is no a priori information about the trustworthiness of each worker. Hence, each data point has an identical probability ρ of being honest, and c has an n-variate Bernoulli distribution with parameter ρ (H corresponds to the positive outcome and M to the negative outcome):

$$c_i \overset{iid}{\sim} \text{Bernoulli}(\rho), \quad \text{for } i = 1, \cdots, n.$$

It follows that n_H has a binomial distribution with parameters n and ρ,

$$n_H \sim B(n, \rho).$$

Strictly speaking, since n_H > n_M, n_H should follow a binomial distribution truncated to n_H > ⌊n/2⌋. For simplicity, this condition is ignored here; as discussed later in the algorithm, if a posterior sample violates the condition, we simply discard the sample.
ρ is given a beta prior, with known parameters α_ρ and β_ρ,

$$\rho \sim \text{Beta}(\alpha_\rho, \beta_\rho).$$

Here α_ρ and β_ρ should be set such that α_ρ ≫ β_ρ, because before making any observations, all workers should be trusted as being honest.

4 Posterior Predictive Inference and MCMC Sampler Design

Based on the proposed model, the goal is to obtain the posterior predictive distribution, which is derived by marginalizing over c and θ:

$$p(f_H(x_*) \mid x_*, x, y) = \sum_{c} \int_{\theta} p(f_H(x_*) \mid x_*, x, y, c, \theta)\, p(c, \theta \mid x, y)\, d\theta.$$

The first term inside the integral, p(f_H(x_*) | x_*, x, y, c, θ), is the posterior distribution of Gaussian Process regression and is Gaussian:

$$p(f_H(x_*) \mid x_*, x, y, c, \theta) = \mathcal{N}(f_H(x_*); \mu_f, \Sigma_f), \tag{3}$$

where

$$\mu_f = C_H(x_H, x_*)^T [C_H(x_H, x_H) + \sigma_n^2 I]^{-1} y_H, \qquad \Sigma_f = C_H(x_*, x_*) - C_H(x_H, x_*)^T [C_H(x_H, x_H) + \sigma_n^2 I]^{-1} C_H(x_H, x_*). \tag{4}$$

The second term inside the integral, p(c, θ | x, y), is the marginal posterior distribution of c and θ, obtained by marginalizing out φ and ρ from the full posterior p(c, θ, φ, ρ | x, y):

$$p(c, \theta \mid x, y) = \int_0^1 \int_{\phi} p(c, \theta, \phi, \rho \mid x, y)\, d\phi\, d\rho, \qquad p(c, \theta, \phi, \rho \mid x, y) \propto p(y \mid x, c, \theta)\, p(x \mid c, \phi)\, p(c \mid \rho)\, p(\rho)\, p(\theta)\, p(\phi). \tag{5}$$

The full posterior p(c, θ, φ, ρ | x, y) is clearly intractable. However, if we could sample from it and obtain S samples {(c^(1), θ^(1), φ^(1), ρ^(1)), ..., (c^(S), θ^(S), φ^(S), ρ^(S))}, we could approximate the posterior predictive distribution by

$$p(f_H(x_*) \mid x_*, x, y) \approx \frac{1}{S} \sum_{r=1}^{S} p(f_H(x_*) \mid x_*, x, y, c^{(r)}, \theta^{(r)}). \tag{6}$$

This posterior predictive distribution is a mixture of multivariate Gaussian distributions, whose mean μ_{f_H(x_*)} and covariance matrix Σ_{f_H(x_*)} are given by

$$\mu_{f_H(x_*)} = \frac{1}{S} \sum_{r=1}^{S} \mu^{(r)}, \qquad \Sigma_{f_H(x_*)} = \frac{1}{S} \sum_{r=1}^{S} \left( \Sigma^{(r)} + \mu^{(r)} \mu^{(r)T} \right) - \mu_{f_H(x_*)} \mu_{f_H(x_*)}^T, \tag{7}$$

where μ^(r) and Σ^(r) are the mean and covariance matrix of the r-th component of the mixture, respectively. Other characteristics of the posterior predictive distribution, such as the median, mode, percentiles, etc., can be inferred by generating T samples from each of the S components of the approximate posterior predictive distribution and using the histogram of the ST samples.

We therefore develop an MCMC sampler algorithm to generate samples from the posterior distribution p(c, θ, φ, ρ | x, y). We use Gibbs sampling to generate samples from the conditional posterior distribution of each component of the parameter space.

4.1 Sampling c

Since it is difficult to sample c as a whole from its conditional posterior distribution, we sample it component by component. Let c_{−i} denote the vector resulting from removing the i-th component of c. Then

$$p(c_i = j \mid c_{-i}, x, y, \theta, \phi, \rho) \propto p(y_i \mid c_i = j, c_{-i}, x^j_{-i}, y^j_{-i}, \theta)\, p(x_i \mid x^j_{-i}, c_i = j, c_{-i}, \phi)\, p(c_i = j \mid c_{-i}, \rho)$$
$$= \mathcal{N}(y_i; \mu_j, \Sigma_j)\, p(x_i \mid x^j_{-i}, c_i = j, c_{-i}, \phi)\, p(c_i = j \mid c_{-i}, \rho), \quad \text{for } j \in \{H, M\}, \tag{8}$$

where

$$\mu_H = C_H(x_{H-i}, x_i)^T [C_H(x_{H-i}, x_{H-i}) + \sigma_n^2 I]^{-1} y_{H-i},$$
$$\Sigma_H = [C_H(x_i, x_i) + \sigma_n^2] - C_H(x_{H-i}, x_i)^T [C_H(x_{H-i}, x_{H-i}) + \sigma_n^2 I]^{-1} C_H(x_{H-i}, x_i),$$
$$\mu_M = C_M(x_{M-i}, x_i)^T [C_M(x_{M-i}, x_{M-i}) + \sigma_\omega^2 I]^{-1} y_{M-i},$$
$$\Sigma_M = [C_M(x_i, x_i) + \sigma_\omega^2] - C_M(x_{M-i}, x_i)^T [C_M(x_{M-i}, x_{M-i}) + \sigma_\omega^2 I]^{-1} C_M(x_{M-i}, x_i), \tag{9}$$
$$p(x_i \mid x^H_{-i}, c_i = H, c_{-i}, \phi) = \frac{1}{\prod_{m=1}^{d} (u_m - l_m)}, \qquad p(x_i \mid x^M_{-i}, c_i = M, c_{-i}, \phi) = \mathcal{N}(x_i; \mu_x, \Sigma_x),$$
$$p(c_i = H \mid c_{-i}, \rho) = \rho, \qquad p(c_i = M \mid c_{-i}, \rho) = 1 - \rho.$$

In the above equations, x^j_{−i} = x^j \ {x_i} and y^j_{−i} = y^j \ {y_i} for j ∈ {H, M}. Note that rank-one updates of the inverse covariance matrix can be performed to reduce the number of inversions needed.
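To make the component-wise update concrete, the following minimal sketch evaluates Eq. (8) in log space for one data point and resamples its label. All names are ours (se_cov and the dict fields follow the earlier sketch); a real implementation would use Cholesky factorizations and the rank-one updates mentioned above rather than the full inversion shown here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_step_ci(i, c, x, y, theta, phi, rho, lo, up, rng):
    """One Gibbs update of c_i via Eq. (8): leave-one-out GP predictive likelihood
    times input density times label prior. Assumes each class keeps at least one
    member besides i; lo/up are the per-dimension domain bounds."""
    logp = np.empty(2)
    for k, (label, s2, w2, noise) in enumerate(
            [('H', theta['s2_H'], theta['w2_H'], theta['s2_n']),
             ('M', theta['s2_M'], theta['w2_M'], theta['s2_w'])]):
        mask = (c == label)
        mask[i] = False                                      # leave data point i out
        Xj, yj = x[mask], y[mask]
        k_star = se_cov(Xj, x[i:i + 1], s2, w2)              # shape (n_j, 1)
        K_inv = np.linalg.inv(se_cov(Xj, Xj, s2, w2) + noise * np.eye(mask.sum()))
        mu = float(k_star.T @ K_inv @ yj)                    # Eq. (9) predictive mean
        var = s2 + noise - float(k_star.T @ K_inv @ k_star)  # Eq. (9) predictive var
        logp[k] = -0.5 * np.log(2 * np.pi * var) - 0.5 * (y[i] - mu) ** 2 / var
    logp[0] += np.log(rho) - np.sum(np.log(up - lo))         # honest: uniform input density
    logp[1] += np.log(1 - rho) + multivariate_normal.logpdf(
        x[i], phi['mu_x'], phi['Sigma_x'])                   # malicious: Gaussian inputs
    p_H = 1.0 / (1.0 + np.exp(logp[1] - logp[0]))            # normalize the two cases
    return 'H' if rng.random() < p_H else 'M'
```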
4.2 Sampling θ

We can sample the entire θ from its conditional posterior distribution,

$$p(\theta \mid x, y, c, \phi, \rho) \propto p(y \mid x, c, \theta)\, p(\theta) = p(y \mid x, c, \theta)\, p(\sigma_H^2)\, p(\sigma_M^2)\, p(\sigma_n^2)\, p(\sigma_\omega^2) \prod_{m=1}^{d} p(w_{Hm}^2)\, p(w_{Mm}^2). \tag{10}$$

The unnormalized distribution can be derived analytically, along with its partial derivatives w.r.t. each parameter. Hence, Hamiltonian Monte Carlo (HMC) [19] can be used to efficiently sample from this conditional posterior.

4.3 Complementary Metropolis-Hastings Jump

When sampling c and θ, a convergence issue may occur. Intuitively, the joint posterior distribution p(c, θ, φ, ρ | x, y) is bimodal, since the model specification of c and θ is highly symmetric, except for the condition n_H > n_M (reflected in the prior of c) and the distributions of the independent variables x_H and x_M. Thus, switching the labels of all data points (labelling all honest data points as M and all malicious data points as H) produces an inferior mode in the posterior distribution, besides the major mode we are interested in, where all data points are correctly classified. The inferior mode has much less probability mass than the major mode, and since it violates the condition n_H > n_M, it will eventually be rejected when producing posterior predictive inference. Nonetheless, its existence makes the MCMC sampler highly sensitive to the initial state. Due to the nature of Gibbs samplers, c is updated one component at a time; this slow movement through the state space makes it extremely unlikely for the chain to jump from a state close to the inferior mode to a state close to the major mode, because of the low probability mass between the two modes. Hence, apart from the Gibbs sampler for c and the HMC sampler for θ described above, an auxiliary Metropolis-Hastings sampler is introduced to facilitate movement between the two modes. The Metropolis-Hastings proposal distribution proposes a switch of the labels as well as of their associated GP hyperparameters θ. The proposal distributions q(c_* | c) and q(θ_* | θ) of the Metropolis-Hastings jump are:

$$q(c_{i*} \neq c_i \mid c_i) = p_c,$$
$$q(\ln \sigma_{H*} \mid \theta) = \mathcal{N}(\ln \sigma_{H*}; \ln \sigma_M, \sigma_F^2), \qquad q(\ln \sigma_{M*} \mid \theta) = \mathcal{N}(\ln \sigma_{M*}; \ln \sigma_H, \sigma_F^2),$$
$$q(\ln \sigma_{n*} \mid \theta) = \mathcal{N}(\ln \sigma_{n*}; \ln \sigma_\omega, \sigma_N^2), \qquad q(\ln \sigma_{\omega*} \mid \theta) = \mathcal{N}(\ln \sigma_{\omega*}; \ln \sigma_n, \sigma_N^2), \tag{11}$$
$$q(\ln w_{Hm*} \mid \theta) = \mathcal{N}(\ln w_{Hm*}; \ln w_{Mm}, \sigma_W^2), \qquad q(\ln w_{Mm*} \mid \theta) = \mathcal{N}(\ln w_{Mm*}; \ln w_{Hm}, \sigma_W^2), \quad \text{for } m = 1, \cdots, d.$$

Here, p_c is set close to 1, and σ_F², σ_N² and σ_W² are set close to 0, so that the Metropolis-Hastings jump roughly switches the labels of all data points along with their corresponding hyperparameters. The update is accepted with probability

$$p(\text{accept}) = \min\left( \frac{p(c_*, \theta_*, \phi, \rho \mid x, y)\, q(c \mid c_*)\, q(\theta \mid \theta_*)}{p(c, \theta, \phi, \rho \mid x, y)\, q(c_* \mid c)\, q(\theta_* \mid \theta)}, 1 \right). \tag{12}$$

The ratio q(θ | θ_*)/q(θ_* | θ) equals 1, since the part of the Metropolis-Hastings update related to θ is symmetric. The sampler for c and θ therefore becomes a mixture of two MCMC samplers: one is the Gibbs sampler for c combined with the HMC sampler for θ, and the other is the Metropolis-Hastings sampler that jointly updates c and θ. At each iteration of the MCMC sampler algorithm, the Metropolis-Hastings sampler is selected with probability p_MH. Since the sole purpose of this sampler is to get the chain out of the inferior mode, p_MH should be set close to 0. When the chain walks into states close to the inferior mode, the Metropolis-Hastings sampler will take the chain to states close to the major mode; when the chain is already close to the major mode, the Metropolis-Hastings jumps will almost never be accepted.
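Eq. (11) can be read as the following proposal sketch, assuming here that θ is stored as standard deviations (σ_H, σ_M, σ_n, σ_ω) and length-scale vectors (w_H, w_M); all field and argument names are ours, for illustration only. The acceptance step then evaluates Eq. (12) with the full unnormalized posterior.

```python
import numpy as np

def mh_switch_proposal(c, theta, p_c=0.95, s_F=1e-2, s_N=1e-2, s_W=1e-2, rng=None):
    """Complementary MH proposal of Eq. (11): flip each label with probability p_c
    and centre each new log-hyperparameter on its counterpart from the other class."""
    rng = rng or np.random.default_rng()
    flip = rng.random(len(c)) < p_c
    c_star = np.where(flip, np.where(c == 'H', 'M', 'H'), c)
    theta_star = {
        # signal std devs swap roles, with small jitter on the log scale
        'sigma_H': np.exp(rng.normal(np.log(theta['sigma_M']), s_F)),
        'sigma_M': np.exp(rng.normal(np.log(theta['sigma_H']), s_F)),
        # noise std devs likewise
        'sigma_n': np.exp(rng.normal(np.log(theta['sigma_w']), s_N)),
        'sigma_w': np.exp(rng.normal(np.log(theta['sigma_n']), s_N)),
        # length-scales swap dimension-wise
        'w_H': np.exp(rng.normal(np.log(theta['w_M']), s_W)),
        'w_M': np.exp(rng.normal(np.log(theta['w_H']), s_W)),
    }
    return c_star, theta_star
```

With p_c near 1 and the jitter scales near 0, a single proposal approximately exchanges the two classes wholesale, which is exactly the move needed to escape the label-switched inferior mode.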
4.4 Sampling φ

To sample φ, we generate a random sample from the conditional posterior of each component, i.e. from p(μ_x | x_M, c, Σ_x) and p(Σ_x | x_M, c, μ_x), both of which are analytically tractable. For μ_x,

$$p(\mu_x \mid x_M, c, \Sigma_x) = \frac{p(x_M \mid c, \mu_x, \Sigma_x)\, p(\mu_x)}{\int_{\mu_x} p(x_M \mid c, \mu_x, \Sigma_x)\, p(\mu_x)\, d\mu_x} = \frac{\frac{1}{C} \prod_{\{i: c_i = M\}} \mathcal{N}(\mu_x; x_i, \Sigma_x) \prod_{m=1}^{d} \frac{1}{u_m - l_m}}{\int_{\mu_x} \frac{1}{C} \prod_{\{i: c_i = M\}} \mathcal{N}(\mu_x; x_i, \Sigma_x) \prod_{m=1}^{d} \frac{1}{u_m - l_m}\, d\mu_x} = \frac{\mathcal{N}\!\left(\mu_x; \bar{x}_M, \frac{\Sigma_x}{n_M}\right)}{\int_{\mu_x} \mathcal{N}\!\left(\mu_x; \bar{x}_M, \frac{\Sigma_x}{n_M}\right) d\mu_x}, \tag{13}$$

where C = ∏_{i: c_i = M} ∫_{μ_x} N(μ_x; x_i, Σ_x) dμ_x is the product of the normalizing constants that truncate the Gaussian distributions. Since C appears in both the numerator and the denominator, it cancels out. Note that this conditional posterior is a truncated Gaussian distribution, from which samples can be generated efficiently. For Σ_x, we have

$$p(\Sigma_x \mid x_M, c, \mu_x) = \frac{p(x_M \mid c, \mu_x, \Sigma_x)\, p(\Sigma_x)}{\int_{\Sigma_x} p(x_M \mid c, \mu_x, \Sigma_x)\, p(\Sigma_x)\, d\Sigma_x} = \frac{\frac{1}{C} \left( \prod_{\{i: c_i = M\}} \mathcal{N}(\mu_x; x_i, \Sigma_x) \right) \mathcal{W}^{-1}(\Sigma_x; \Psi, \nu)}{\int_{\Sigma_x} \frac{1}{C} \left( \prod_{\{i: c_i = M\}} \mathcal{N}(\mu_x; x_i, \Sigma_x) \right) \mathcal{W}^{-1}(\Sigma_x; \Psi, \nu)\, d\Sigma_x} = \mathcal{W}^{-1}(\Sigma_x; Z + \Psi, n_M + \nu), \tag{14}$$

where Z = Σ_{i: c_i = M} (x_i − μ_x)(x_i − μ_x)^T. This conditional posterior is an inverse-Wishart distribution and can be sampled directly.

Algorithm 1: MCMC Sampler
Input: S, x, y and the parameters of the priors
Output: S samples from the posterior distribution, (c^(1), θ^(1), φ^(1), ρ^(1)), ..., (c^(S), θ^(S), φ^(S), ρ^(S))
1 Initialize θ, φ and ρ by randomly drawing from their respective priors.
2 Initialize c.
3 for r = 1, ..., S do
4   Sample u_1 ∼ U(0, 1).
5   if u_1 < p_MH (the probability of proposing a Metropolis-Hastings jump) then
6     Propose a jump to c_*, θ_* via q(c_* | c) and q(θ_* | θ).
7     Compute the acceptance probability p(accept).
8     Sample u_2 ∼ U(0, 1).
9     if u_2 < p(accept) then
10      c ← c_*, θ ← θ_*.
    else
11    for i = 1, ..., n do
12      Sample c_i from p(c_i | c_{−i}, x, y, θ, φ, ρ).
13    Sample θ from p(θ | x, y, c, φ, ρ) by HMC.
14  Sample μ_x from p(μ_x | x, y, c, θ, Σ_x, ρ).
15  Sample Σ_x from p(Σ_x | x, y, c, θ, μ_x, ρ).
16  Sample ρ from p(ρ | x, y, c, θ, φ).
17  Save the current values of c, θ, φ, ρ as (c^(r), θ^(r), φ^(r), ρ^(r)).
18 return {(c^(1), θ^(1), φ^(1), ρ^(1)), ..., (c^(S), θ^(S), φ^(S), ρ^(S))}.

4.5 Sampling ρ

Finally, to sample ρ, we generate a random sample from its conditional posterior, which is also a beta distribution:

$$p(\rho \mid x, y, c, \theta, \phi) = p(\rho \mid c) = \frac{p(c \mid \rho)\, p(\rho)}{\int_0^1 p(c \mid \rho)\, p(\rho)\, d\rho} = \text{Beta}(\rho; \alpha_\rho + n_H, \beta_\rho + n - n_H). \tag{15}$$

5 The Algorithms

Our method consists of two algorithms. The first, presented in Algorithm 1, is an MCMC sampler used to generate samples of the model parameters from the posterior distribution p(c, θ, φ, ρ | x, y). The second, presented in Algorithm 2, summarizes the posterior predictive distribution by calculating the posterior mean and variance. S is set sufficiently large to allow the Markov chain to approximately converge to the target posterior distribution; to assess convergence, we use the method introduced by Gelman et al. [18]. Apart from the mean and variance of the posterior predictive distribution, we also obtain the trustworthiness of each worker (each data point), in the form of the posterior probability of each data point being honest, p(c_i | x, y) for i = 1, ..., n. This information can be used to further model the behavior of each worker in crowdsourcing applications; however, this is not the main focus of this paper and will not be further discussed. The derivation of the posterior probability p(c_i | x, y) is straightforward and is thus omitted from the algorithm specification.
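For completeness, a minimal sketch of how this trustworthiness could be computed from the output of Algorithm 1: the estimator is simply the fraction of post-warm-up samples in which data point i is labelled honest. The helper and its names are ours, not part of the paper's algorithm specification.

```python
import numpy as np

def trustworthiness(c_samples, n_warmup):
    """Estimate p(c_i = H | x, y) for every worker as the empirical frequency of
    the label 'H' over the retained MCMC samples.
    c_samples: sequence of label vectors (entries 'H'/'M'), one per iteration."""
    kept = np.asarray(c_samples[n_warmup:])
    return (kept == 'H').mean(axis=0)   # vector of length n
```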
Assuming d ≪ n, the computational complexity of Algorithm 1 is dominated by the HMC step for sampling θ, in which L leapfrog steps are taken, each requiring a matrix inversion of O(n³) complexity. This gives Algorithm 1 an overall complexity of O(SLn³). The computational complexity of Algorithm 2 is O(S′(n_*² + n_* n²)), where n_* is the length of x_*.

Algorithm 2: Posterior Predictive Inference
Input: S samples from the posterior distribution, (c^(1), θ^(1), φ^(1), ρ^(1)), ..., (c^(S), θ^(S), φ^(S), ρ^(S)), and x_*
Output: μ_{f_H(x_*)}, Σ_{f_H(x_*)}
1 Discard the first half of the samples as warm-up; S′ denotes the number of retained samples used to approximate the posterior predictive distribution.
2 r′ ← 1.
3 for r = 1, ..., S′ do
4   n_H ← Σ_{i=1}^{n} δ(c_i, H).
5   if 2 n_H > n then
6     μ^(r′) ← C_H(x_H, x_*)^T [C_H(x_H, x_H) + σ_n² I]^{−1} y_H.
7     Σ^(r′) ← C_H(x_*, x_*) − C_H(x_H, x_*)^T [C_H(x_H, x_H) + σ_n² I]^{−1} C_H(x_H, x_*).
8     r′ ← r′ + 1.
9 μ_{f_H(x_*)} ← (1/S′) Σ_{r′=1}^{S′} μ^(r′).
10 Σ_{f_H(x_*)} ← (1/S′) Σ_{r′=1}^{S′} (Σ^(r′) + μ^(r′) μ^(r′)T) − μ_{f_H(x_*)} μ_{f_H(x_*)}^T.
11 return (μ_{f_H(x_*)}, Σ_{f_H(x_*)}).

6 Simulation Results

To demonstrate that the algorithms are able to make accurate inferences from data, we performed experiments on a one-dimensional synthetic dataset, with n = 100 and d = 1, generated in accordance with our assumptions. In the synthetic dataset, the underlying function is f_H(x) = sin(πx/2) and the malicious function is f_M(x) = (1/2) tan(πx/4). 60 noisy observations are sampled from f_H, with homogeneous Gaussian noise of standard deviation 0.10; the independent variables are distributed uniformly on [0, 1]. 40 noisy observations are sampled from f_M, with homogeneous Gaussian noise of standard deviation 0.12; the independent variables are Gaussian with mean 0.80 and standard deviation 0.10, truncated to the range [0, 1]. The parameters of the priors are:

α_f = 2.0, β_f = 2.0, α_n = 2.0, β_n = 1.0, μ_w = 0.7, σ_w² = 0.3, Ψ = 0.0025, ν = 1.

For the Metropolis-Hastings sampler, the probability of proposing a jump is p_MH = 0.05, and the parameters of the proposal distribution are:

p_c = 0.95, σ_F² = 1 × 10⁻⁴, σ_N² = 1 × 10⁻⁴, σ_W² = 1 × 10⁻⁴.

For the sampler of θ, the hyperparameters related to H and M are sampled separately, each using an HMC algorithm with L = 10 leapfrog steps. Finally, for this particular dataset, we set S = 2000, since the Markov chain converges sufficiently after 2000 iterations of the MCMC sampler; we then pass the second half of the resulting samples to the posterior predictive inference. The results are shown in Fig. 1.

Figure 1: Posterior Predictive Inference Results of Synthetic Dataset

Using the posterior mean as the estimator results in a Mean Square Error (MSE) of 0.0012. The posterior probability p(c_i | x, y) (trustworthiness) of each data point is also computed. In this example, the average trustworthiness of the truly honest data points is 0.9906, while the average trustworthiness of the truly malicious data points is 5 × 10⁻⁵. It is thus clear that, for this dataset, the accuracy of the proposed algorithm is high. As for the efficiency of the algorithm, we measured the average time cost by running the algorithm 50 times on the above dataset on a MacBook Pro equipped with an Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz; the resulting average time cost is 27.43 seconds per run.
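The synthetic dataset described above can be generated along the following lines (a minimal sketch: the seed and the rejection-based truncation of the malicious inputs are our choices, as the paper does not specify them).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; not reported in the paper

# 60 honest observations from f_H(x) = sin(pi x / 2), noise sd 0.10, x ~ U(0, 1)
x_H = rng.uniform(0.0, 1.0, 60)
y_H = np.sin(np.pi * x_H / 2) + rng.normal(0.0, 0.10, 60)

# 40 malicious observations from f_M(x) = 0.5 * tan(pi x / 4), noise sd 0.12,
# x ~ N(0.80, 0.10^2) truncated to [0, 1] (here via simple rejection sampling)
x_M = rng.normal(0.80, 0.10, 400)
x_M = x_M[(x_M >= 0.0) & (x_M <= 1.0)][:40]
y_M = 0.5 * np.tan(np.pi * x_M / 4) + rng.normal(0.0, 0.12, x_M.size)

x = np.concatenate([x_H, x_M])
y = np.concatenate([y_H, y_M])
c_true = np.array(['H'] * 60 + ['M'] * 40)  # ground truth, used only for evaluation
```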
7 Conclusion and Future Works

In this paper, we build a Gaussian Process mixture model and design an MCMC-based algorithm to address the issue of collusion attacks in the regression setting. Honest data, observed from the underlying function, and malicious data are modelled as two separate Gaussian Processes. Compared to [13], where malicious data are assumed to be observed from the same underlying function with extra noise, we claim that the mixture model is a more realistic and natural choice. We use a synthetic dataset to show that our algorithm is able to produce accurate posterior predictive inference and is computationally efficient.

As part of an ongoing study, the results of this paper suggest several directions for future work. Firstly, a single malicious function (class) is too restrictive to handle more sophisticated real-world scenarios, and the algorithm currently lacks the ability to handle cases with multiple malicious groups. It is therefore reasonable to extend the model to incorporate multiple malicious classes. One option is to build a finite mixture of Gaussian Processes with an unknown number of components and design an MCMC sampler using reversible jump techniques [20]; another is to build an infinite mixture of Gaussian Processes similar to [21]. Secondly, this model is unable to handle heteroscedasticity, as mentioned before, as well as non-Gaussian noise; we will explore approaches to handle these situations in future studies. Lastly, our model is not built specifically to model the trustworthiness of workers, which is a significant issue in crowdsourcing applications. In future work we will improve the model to make robust, collusion-resistant predictions while maintaining the trustworthiness of workers for future predictions.

References

[1] Howe, Jeff. "Crowdsourcing: A definition." Crowdsourcing: Tracking the Rise of the Amateur (2006).

[2] To, H., Kim, S.H., and Shahabi, C. "Effectively crowdsourcing the acquisition and analysis of visual data for disaster response." In Big Data (Big Data), 2015 IEEE International Conference on, pp. 697-706. IEEE, 2015.

[3] Meier, Fred, Daniel Fenner, Tom Grassmann, Britta Jänicke, Marco Otto, and Dieter Scherer. "Challenges and benefits from crowd-sourced atmospheric data for urban climate research using Berlin, Germany, as testbed."

[4] Krause, Andreas, Eric Horvitz, Aman Kansal, and Feng Zhao. "Toward community sensing." In Proceedings of the 7th International Conference on Information Processing in Sensor Networks, pp. 481-492. IEEE Computer Society, 2008.

[5] Kim, Samuel, Maurizio Filippone, Fabio Valente, and Alessandro Vinciarelli. "Predicting the conflict level in television political debates: an approach based on crowdsourcing, nonverbal communication and Gaussian processes." In Proceedings of the 20th ACM International Conference on Multimedia, pp. 793-796. ACM, 2012.

[6] Shahabi, Cyrus. "Towards a generic framework for trustworthy spatial crowdsourcing." In Proceedings of the 12th International ACM Workshop on Data Engineering for Wireless and Mobile Access, pp. 1-4. ACM, 2013.

[7] Welinder, Peter, and Pietro Perona. "Online crowdsourcing: rating annotators and obtaining cost-effective labels." (2010): 25-32.

[8] Hall, David L., and John M. Jordan. Human-Centered Information Fusion. Artech House, 2014.

[9] Groot, Perry, Adriana Birlutiu, and Tom Heskes. "Learning from multiple annotators with Gaussian processes." In Artificial Neural Networks and Machine Learning – ICANN 2011, pp. 159-164. Springer Berlin Heidelberg, 2011.
[10] Tarasov, Alexey, Sarah Jane Delany, and Brian Mac Namee. "Dynamic estimation of worker reliability in crowdsourcing for regression tasks: Making it work." Expert Systems with Applications 41, no. 14 (2014): 6190-6210.

[11] Reece, Steven, Stephen Roberts, Christopher Claxton, and David Nicholson. "Multi-sensor fault recovery in the presence of known and unknown fault types." In Information Fusion, 2009. FUSION '09. 12th International Conference on, pp. 1695-1703. IEEE, 2009.

[12] Venanzi, Matteo, Alex Rogers, and Nicholas R. Jennings. "Trust-based fusion of untrustworthy information in crowdsourcing applications." In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 829-836. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[13] Venanzi, Matteo, Alex Rogers, and Nicholas R. Jennings. "Crowdsourcing spatial phenomena using trust-based heteroskedastic Gaussian processes." In First AAAI Conference on Human Computation and Crowdsourcing, 2013.

[14] Rasmussen, Carl Edward. "Gaussian processes for machine learning." (2006).

[15] Goldberg, Paul W., Christopher K.I. Williams, and Christopher M. Bishop. "Regression with input-dependent noise: A Gaussian process treatment." Advances in Neural Information Processing Systems 10 (1997): 493-499.

[16] Titsias, Michalis K., and Miguel Lázaro-Gredilla. "Variational heteroscedastic Gaussian process regression." In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 841-848. 2011.

[17] Kersting, Kristian, Christian Plagemann, Patrick Pfaff, and Wolfram Burgard. "Most likely heteroscedastic Gaussian process regression." In Proceedings of the 24th International Conference on Machine Learning, pp. 393-400. ACM, 2007.

[18] Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Vol. 2. Boca Raton, FL, USA: Chapman & Hall/CRC, 2014.

[19] Duane, Simon, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. "Hybrid Monte Carlo." Physics Letters B 195, no. 2 (1987): 216-222.

[20] Green, Peter J. "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination." Biometrika 82, no. 4 (1995): 711-732.

[21] Rasmussen, Carl Edward, and Zoubin Ghahramani. "Infinite mixtures of Gaussian process experts." Advances in Neural Information Processing Systems 2 (2002): 881-888.