Collusion-resistant Spatial Phenomena Crowdsourcing via Mixture of Gaussian Processes Regression

Qikun Xiang¹, Ido Nevat², Pengfei Zhang², Jie Zhang¹
¹ Nanyang Technological University, Singapore. ² Institute for Infocomm Research (I²R), Singapore.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: J. Zhang, R. Cohen, and M. Sensoy (eds.): Proceedings of the 18th International Workshop on Trust in Agent Societies, Singapore, 09-MAY-2016, published at http://ceur-ws.org

Abstract

With the rapid development of mobile devices, spatial location-based crowdsourcing applications have attracted much attention. These applications also introduce new security risks due to untrustworthy data sources. In the context of crowdsourcing applications for spatial interpolation (i.e. spatial regression) using crowdsourced data, the results can be seriously affected if malicious data sources initiate collusion (collaborative) attacks which purposely alter some of the measurements. To combat this serious detrimental effect and mitigate such attacks, we develop a robust model via a Gaussian Process mixture and a computationally efficient algorithm which utilises a Markov chain Monte Carlo (MCMC)-based methodology to produce accurate predictive inference in the presence of collusion attacks. The algorithm is fully Bayesian and produces the posterior predictive distribution for any point-of-interest in the input space. It also assesses the trustworthiness of each worker, i.e. the probability of each worker being honest (trustworthy). Simulation results demonstrate the accuracy of this algorithm.

1 Introduction

Crowdsourcing, as defined by Howe [1], is often characterized by the participation of large networks of people from the public in tasks given by companies or institutes. It has been broadly used in data acquisition, replacing traditional data-gathering processes that are costly. Recent works have shown many successful applications of crowdsourced information gathering in various domains. In [2], the authors introduce a crowdsourcing framework for the acquisition and analysis of mobile videos to monitor rapidly changing disaster sites while optimally utilising bandwidth and resources. In a study by Meier et al. [3], crowdsourced atmospheric data in Berlin, Germany was used for urban climate research. In [4], a privacy-preserving community sensing method is developed for real-time road traffic monitoring. Another study, by Kim et al. [5], uses crowdsourcing and regression techniques to automatically analyse the conflict level in television political debates. While decentralized spatial information gathering (crowdsourcing) techniques have clear benefits, such as being inexpensive and easy to implement, their quality and reliability are not as good as data collected from domain experts. The issue of trust is one of the major impediments to the success of real-world crowdsourcing applications [6]. Apart from inherent noise, untrustworthiness in crowdsourced data can originate in various ways. For example, the sensor used for measurement can be faulty (if the data are sensor readings). Data that come from humans can be subjective and highly biased [7]. For example, workers might put very little effort into the work they are assigned, in the hope that they will not be noticed and will still get the reward.
Other workers might lack the essential skills to perform the task and hence produce erroneous data [7]. In addition, there can be adversaries that strategically report malicious data to disrupt the crowdsourcing application [8].

The challenge of identifying and filtering untrustworthy information before summarization (prediction, decision making, etc.) has been addressed by a number of researchers in various domains. Some of these works focus on binary/multi-class labelling tasks. Groot et al. [9] applied a Gaussian Process regression model to merge information from multiple inaccurate annotators and produce reliable labels for objects. Tarasov et al. [10] proposed a method to dynamically estimate the reliability of workers in crowdsourcing for regression by adopting a multi-armed bandits (MAB) model. Other studies address estimation problems. Reece et al. [11] studied the problem of tackling sensor faults in applications based on sensor networks. Their method consists of two stages: first, a model-based step detects and filters sensor faults of known types; then, a consensus-based step removes unanticipated faults that passed the first step. Venanzi et al. [12] developed an algorithm called MaxTrust that jointly estimates the target attribute and the trustworthiness of each worker using the maximum likelihood paradigm.

Little attention has been given to the problem of spatial regression (i.e. the estimation of a spatial process/function) in the presence of untrustworthy data sources. Venanzi et al. [13] proposed a heteroskedastic Gaussian Process (HGP) model with independent noises and designed an algorithm called TrustHGP to jointly estimate the spatial function and the trustworthiness of each worker. Their model assumes that all reports are noisy observations from the same underlying function with heterogeneous and independent noises, and that malicious observations are associated with higher noise than honest observations. These assumptions are too simplistic to handle complex real-world situations. In particular, this method is vulnerable to collusion attacks. In collusion attack scenarios in crowdsourcing, multiple untrustworthy workers collaborate to inject deceptive false data to skew the regression results. More specifically, an effective strategy for attackers is to make a number of similar reports that are far from the ground truth (or follow a spatial function that is far from the underlying spatial function) at similar input values. This often results in the regression algorithm mistakenly accepting the false values provided by the attackers. We refer to this and similar types of attack as collusion attacks. We formally present the attacker model in the model specification in Section 3 of this paper.

1.1 The Proposed Gaussian Processes Mixture Model

In order to address the issue of collusion attacks, we propose a Gaussian Processes mixture model in which malicious data and non-malicious (honest) data are modelled as coming from two different realizations of Gaussian Processes. We develop a Markov chain Monte Carlo (MCMC)-based algorithm that takes all data as input and produces a statistical prediction of the function value at any spatial location of interest. The predictions produced by the algorithm are based on the honest data: the algorithm minimizes the impact of malicious data by excluding it before performing posterior inference, which yields high accuracy.
The algorithm also assesses the trustworthiness of each worker, in terms of the probability that the data it produces are honest. We show that this method is truly collusion-resistant by performing experiments on a synthetic dataset.

1.2 Contribution

The contributions of this paper include:

• We develop a unified statistical framework for collusion attacks in the regression setting via a Gaussian Process mixture model.

• We develop a novel MCMC-based algorithm to produce a comprehensive summary of the data in terms of the posterior predictive distribution and the trustworthiness of each worker (data point). The inference method is fully Bayesian.

• We conduct a proof of concept on a synthetic dataset and show that our method is truly resistant to collusion attacks.

2 Problem Statement

In the field of spatial monitoring, n sets of data are gathered from n different workers that are potentially untrustworthy. Each set of data comes in the form of a tuple (x_i, y_i), i = 1, ..., n, where x_i ∈ R^d represents the independent variables (spatial location, time, input features, etc.), and y_i ∈ R represents the response (dependent) variable that depends on x_i. Let x := {x_1, ..., x_n} and y := {y_1, ..., y_n}. We refer to a tuple (x_i, y_i) as a data point. Each data point is either honest or malicious, depending on whether it comes from a trustworthy or an untrustworthy worker. Honest data points are noisy observations of the underlying function f_H : R^d → R, while malicious data points are not (as defined later). The task of the regression problem is, given x and y, to produce a reliable estimate of f_H(x_*) together with its probability distribution, for any given x_* ∈ R^d, that is free from the influence of malicious data. The probability density function p(f_H(x_*) | x_*, x, y) is referred to as the posterior predictive distribution and contains all the information about f_H(x_*) given all the observations.

2.1 Illustrative Example of Spatial Phenomena Crowdsourcing

To motivate our model, we present a real-world example of the stated problem. Suppose a meteorological study aims to estimate the spatial structure of atmospheric parameters (e.g. temperature, humidity, air quality) of a particular region of interest, using crowdsourced data collected from workers in the region via hand-held devices equipped with sensors. In this example, each report (x_i, y_i) consists of x_i, the geographical location of the i-th worker (longitude and latitude), and y_i, the sensor reading of the atmospheric parameter, e.g. air quality. The underlying function of interest f_H corresponds to the inherent spatial structure of the parameter, i.e. for a given location x_*, f_H(x_*) corresponds to the ground-truth value of the parameter at that location. Suppose that, for some reason, a worker reports dishonest data at location x_m that does not correspond to the actual parameter value f_H(x_m). This piece of data is deemed malicious. There can be practical motives behind malicious data, such as defaming a spatial location by claiming that the air quality in its vicinity is poor. There can be multiple malicious workers reporting dishonestly to achieve their objectives. Thus, the data reported by these workers cannot be trusted and should be regarded as unrelated to f_H.

3 Crowdsourcing Statistical Model Development

In this section, we formally present the statistical crowdsourcing model, which uses a Gaussian Process mixture model.
3.1 Gaussian Process, Covariance Functions and Hyperparameters

We model the physical phenomena as spatially dependent continuous processes with a spatial correlation structure; the two processes introduced below are independent of each other. The degree of spatial correlation increases as the separation between two observing locations decreases, and each process can be accurately modelled as a Gaussian random field¹. A Gaussian process (GP) defines a distribution over a space of functions; it is completely specified by the equivalent of sufficient statistics for such a process, and is formally defined as follows.

Definition 1 (Gaussian process [14]): Let X ⊂ R^D be some bounded domain of a D-dimensional real-valued vector space. Denote by f(x) : X → R a stochastic process parametrized by x ∈ X. Then, the random function f(x) is a Gaussian process if all its finite-dimensional distributions are Gaussian, i.e. for any m ∈ N, the random variables (f(x_1), ..., f(x_m)) are jointly normally distributed.

We can therefore interpret a GP as formally defined by the following class of random functions:

$$\mathcal{F} := \{ f(\cdot) : \mathcal{X} \mapsto \mathbb{R} \ \text{s.t.} \ f(\cdot) \sim \mathcal{GP}(\mu(\cdot; \theta), C(\cdot, \cdot; \Psi)) \},$$
$$\mu(x; \theta) := \mathbb{E}[f(x)] : \mathcal{X} \mapsto \mathbb{R},$$
$$C(x_i, x_j; \Psi) := \mathbb{E}[(f(x_i) - \mu(x_i; \theta))(f(x_j) - \mu(x_j; \theta))] : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}_+,$$

where at each point the mean of the function is μ(·; θ), parameterised by θ, and the spatial dependence between any two points is given by the covariance function (Mercer kernel) C(·, ·; Ψ), parameterised by Ψ (see the detailed discussion in [14]).

¹ We use Gaussian Process and Gaussian random field interchangeably.

3.2 Specification of the Spatial Processes

We assume that f_H can be accurately modelled by a Gaussian Process prior with zero mean and squared exponential covariance function, parameterised by σ_H², w_{H1}², ..., w_{Hd}²:

$$f_H \sim \mathcal{GP}(0, C_H), \qquad C_H(x_i, x_{i'}) = \sigma_H^2 \exp\left(-\frac{1}{2} \sum_{m=1}^{d} \frac{(x_{im} - x_{i'm})^2}{w_{Hm}^2}\right).$$

Similarly, f_M is also modelled by a Gaussian Process prior with zero mean and squared exponential covariance function, parameterised by σ_M², w_{M1}², ..., w_{Md}²:

$$f_M \sim \mathcal{GP}(0, C_M), \qquad C_M(x_i, x_{i'}) = \sigma_M^2 \exp\left(-\frac{1}{2} \sum_{m=1}^{d} \frac{(x_{im} - x_{i'm})^2}{w_{Mm}^2}\right).$$

3.3 Partition of Data Points and Hyperparameters of Gaussian Processes

Data points are partitioned into two mutually exclusive sets, H and M, denoting honest and malicious data points respectively; that is, either i ∈ H or i ∈ M. Without prior knowledge, H is indistinguishable from M. Hence, we make the assumption that H contains the majority of data points, i.e. |H| > |M|. The case where malicious data points are the majority is unrealistic in real applications; when it happens, there is no way to distinguish which set is honest without informative prior knowledge. Since each data point belongs either to H or to M, we use c ∈ {H, M}^n to denote the configuration vector, where each component c_i denotes the set to which i belongs:

$$c_i = \begin{cases} H & \text{if } i \in \mathcal{H} \\ M & \text{if } i \in \mathcal{M} \end{cases}.$$

As mentioned above, honest data points are noisy observations of the underlying function f_H : R^d → R. Similarly, malicious data points are noisy observations of a different function f_M : R^d → R. Hence,

$$y_i = \begin{cases} f_H(x_i) + \epsilon_i, & \text{if } c_i = H \\ f_M(x_i) + \omega_i, & \text{if } c_i = M \end{cases}, \tag{1}$$

where

$$\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma_n^2), \qquad \omega_i \overset{iid}{\sim} \mathcal{N}(0, \sigma_\omega^2). \tag{2}$$

Within each set, the Gaussian noises are assumed to be independently and identically distributed (iid). The assumption of homoscedasticity is made to simplify the derivation; in future work we will explore extending the model to incorporate input-dependent noise, as in [15–17].
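To make the generative side of the model concrete, the following minimal sketch draws observations from the mixture observation model of Eqs. (1)–(2), given inputs and labels. It is illustrative only: the function and parameter names (se_cov, draw_observations, etc.) are ours, not from the paper, and a small jitter term is added for numerical stability.

```python
import numpy as np

def se_cov(X1, X2, sigma2, w2):
    """Squared exponential covariance of Sec. 3.2:
    C(x_i, x_i') = sigma2 * exp(-0.5 * sum_m (x_im - x_i'm)^2 / w2_m)."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 / w2).sum(axis=-1)
    return sigma2 * np.exp(-0.5 * d2)

def draw_observations(x, c, sigma2_H, w2_H, sigma2_M, w2_M,
                      sigma2_n, sigma2_w, rng):
    """Draw y from Eqs. (1)-(2): one GP realization per class plus iid noise.
    x has shape (n, d); c is an array of 'H'/'M' labels."""
    y = np.empty(len(x))
    for label, s2, w2, noise in [('H', sigma2_H, w2_H, sigma2_n),
                                 ('M', sigma2_M, w2_M, sigma2_w)]:
        m = (c == label)
        K = se_cov(x[m], x[m], s2, w2) + 1e-10 * np.eye(m.sum())  # jitter
        f = rng.multivariate_normal(np.zeros(m.sum()), K)         # GP realization
        y[m] = f + rng.normal(0.0, np.sqrt(noise), m.sum())       # observation noise
    return y
```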
Let θ denote the hyperparameters of the above GPs (including the noise variances),

$$\theta = (\sigma_H^2, \sigma_M^2, w_{H1}^2, \cdots, w_{Hd}^2, w_{M1}^2, \cdots, w_{Md}^2, \sigma_n^2, \sigma_\omega^2).$$

Among these hyperparameters, the signal variances σ_H², σ_M² and the noise variances σ_n², σ_ω² are given inverse-gamma priors,

$$\sigma_H^2, \sigma_M^2 \overset{iid}{\sim} \text{Inv-Gamma}(\alpha_f, \beta_f), \qquad \sigma_n^2, \sigma_\omega^2 \overset{iid}{\sim} \text{Inv-Gamma}(\alpha_n, \beta_n),$$

and the length-scales w_{H1}², ..., w_{Hd}², w_{M1}², ..., w_{Md}² are given log-normal priors,

$$w_{H1}^2, \cdots, w_{Hd}^2, w_{M1}^2, \cdots, w_{Md}^2 \overset{iid}{\sim} \ln\mathcal{N}(\mu_w, \sigma_w^2).$$

Placing priors over the hyperparameters makes the method fully Bayesian and provides extra flexibility to the model. The parameters α_f, β_f, α_n, β_n, μ_w, σ_w² are pre-set values chosen to make these priors non-informative. Alternatively, when prior knowledge about the hyperparameters is available, these parameters can be tuned to achieve better accuracy. In addition, we define the following notation, used in the remainder of the paper:

$$x_H = \{x_i : c_i = H\}, \quad x_M = \{x_i : c_i = M\}, \quad y_H = \{y_i : c_i = H\}, \quad y_M = \{y_i : c_i = M\},$$
$$n_H = |\mathcal{H}| = \sum_{i=1}^{n} \delta(c_i, H), \qquad n_M = |\mathcal{M}| = \sum_{i=1}^{n} \delta(c_i, M).$$

3.4 Distribution of Independent Variables

In our assumptions, x_H is distributed uniformly throughout the domain of x, since no region is preferred over others when gathering data. When prior knowledge about the distribution of the independent variables is available, the uniformity assumption can be replaced with no difficulty:

$$x_{Him} \overset{iid}{\sim} U(l_m, u_m), \quad \text{for } i = 1, \cdots, n_H, \; m = 1, \cdots, d,$$

where x_{Him} denotes the m-th dimension of the i-th honest data point, and l_m, u_m are the lower and upper bounds of the m-th dimension.

It is assumed that x_M follows a d-variate Gaussian distribution with mean μ_x and covariance Σ_x. This is a practical assumption, since in the presence of collusion attacks the damage-maximizing strategy for attackers is to concentrate malicious data points in a particular region. In addition, attackers are assumed to possess limited resources, making it difficult for them to inject malicious data over a broad region (for example, when the independent variables are geographical locations). In the case of a distributed attack, where the independent variables of malicious data spread over a broad region, Σ_x takes a large value that makes the distribution of x_M flat:

$$x_{Mi} \overset{iid}{\sim} \mathcal{N}(\mu_x, \Sigma_x), \quad \text{for } i = 1, \cdots, n_M,$$

where x_{Mi} denotes the i-th malicious data point. Let φ denote the parameters of the input distribution, φ = (μ_x, Σ_x), where μ_x is given a uniform prior,

$$\mu_{xm} \sim U(l_m, u_m), \quad \text{for } m = 1, \cdots, d,$$

and Σ_x is given an inverse-Wishart prior,

$$\Sigma_x \sim \mathcal{W}^{-1}(\Psi, \nu),$$

with the parameters Ψ, ν set to be non-informative.

3.5 Prior Over Configurations

We assume there is no a priori information about the trustworthiness of each worker. Hence, each data point has an identical probability ρ of being honest, and c has an n-variate Bernoulli distribution with parameter ρ (H corresponds to the positive outcome and M to the negative outcome):

$$c_i \overset{iid}{\sim} \text{Bernoulli}(\rho), \quad \text{for } i = 1, \cdots, n.$$

It follows that n_H has a binomial distribution with parameters n and ρ,

$$n_H \sim B(n, \rho).$$

Strictly speaking, since n_H > n_M, n_H should follow a binomial distribution truncated to n_H > ⌊n/2⌋. For simplicity, this condition is ignored here; as discussed later in the algorithm, if a posterior sample violates the condition, we simply discard the sample.
ρ is given a beta prior, with known parameters α_ρ and β_ρ,

$$\rho \sim \text{Beta}(\alpha_\rho, \beta_\rho).$$

Here α_ρ and β_ρ should be set such that α_ρ ≫ β_ρ, because before making any observations, all workers should be trusted as being honest.

4 Posterior Predictive Inference and MCMC Sampler Design

Based on the proposed model, the goal is to obtain the posterior predictive distribution, which is derived by marginalizing over c and θ:

$$p(f_H(x_*) \mid x_*, x, y) = \sum_{c} \int_{\theta} p(f_H(x_*) \mid x_*, x, y, c, \theta)\, p(c, \theta \mid x, y)\, d\theta.$$

The first term inside the integral, p(f_H(x_*) | x_*, x, y, c, θ), is the posterior distribution of Gaussian Process regression and is Gaussian:

$$p(f_H(x_*) \mid x_*, x, y, c, \theta) = \mathcal{N}(f_H(x_*); \mu_f, \Sigma_f), \tag{3}$$

where

$$\mu_f = C_H(x_H, x_*)^T [C_H(x_H, x_H) + \sigma_n^2 I]^{-1} y_H, \qquad \Sigma_f = C_H(x_*, x_*) - C_H(x_H, x_*)^T [C_H(x_H, x_H) + \sigma_n^2 I]^{-1} C_H(x_H, x_*). \tag{4}$$

The second term inside the integral, p(c, θ | x, y), is the marginal posterior distribution of c and θ, obtained by marginalizing out φ and ρ from the full posterior p(c, θ, φ, ρ | x, y):

$$p(c, \theta \mid x, y) = \int_0^1 \int_{\phi} p(c, \theta, \phi, \rho \mid x, y)\, d\phi\, d\rho, \qquad p(c, \theta, \phi, \rho \mid x, y) \propto p(y \mid x, c, \theta)\, p(x \mid c, \phi)\, p(c \mid \rho)\, p(\rho)\, p(\theta)\, p(\phi). \tag{5}$$

The full posterior p(c, θ, φ, ρ | x, y) is clearly intractable. However, if we could sample from it and obtain S samples {(c^(1), θ^(1), φ^(1), ρ^(1)), ..., (c^(S), θ^(S), φ^(S), ρ^(S))}, we could approximate the posterior predictive distribution by

$$p(f_H(x_*) \mid x_*, x, y) \approx \frac{1}{S} \sum_{r=1}^{S} p(f_H(x_*) \mid x_*, x, y, c^{(r)}, \theta^{(r)}). \tag{6}$$

This posterior predictive distribution is a mixture of multivariate Gaussian distributions, whose mean μ_{f_H(x_*)} and covariance matrix Σ_{f_H(x_*)} are given by

$$\mu_{f_H(x_*)} = \frac{1}{S} \sum_{r=1}^{S} \mu^{(r)}, \qquad \Sigma_{f_H(x_*)} = \frac{1}{S} \sum_{r=1}^{S} \left( \Sigma^{(r)} + \mu^{(r)} \mu^{(r)T} \right) - \mu_{f_H(x_*)} \mu_{f_H(x_*)}^T, \tag{7}$$

where μ^(r) and Σ^(r) are the mean and covariance matrix of the r-th component of the mixture, respectively. Other characteristics of the posterior predictive distribution, such as the median, mode, percentiles, etc., can be inferred by generating T samples from each of the S components of the approximate posterior predictive distribution and using the histogram of the ST samples.

We therefore develop an MCMC sampler algorithm to generate samples from the posterior distribution p(c, θ, φ, ρ | x, y). We use Gibbs sampling to generate samples from the conditional posterior distribution of each component of the parameter space.

4.1 Sampling c

Since it is difficult to sample c as a whole from its conditional posterior distribution, we sample it component by component. Let c_{−i} denote the vector resulting from removing the i-th component of c. Then

$$p(c_i = j \mid c_{-i}, x, y, \theta, \phi, \rho) \propto p(y_i \mid c_i = j, c_{-i}, x^j_{-i}, y^j_{-i}, \theta)\, p(x_i \mid x^j_{-i}, c_i = j, c_{-i}, \phi)\, p(c_i = j \mid c_{-i}, \rho)$$
$$= \mathcal{N}(y_i; \mu_j, \Sigma_j)\, p(x_i \mid x^j_{-i}, c_i = j, c_{-i}, \phi)\, p(c_i = j \mid c_{-i}, \rho), \quad \text{for } j \in \{H, M\}, \tag{8}$$

where

$$\mu_H = C_H(x_{H-i}, x_i)^T [C_H(x_{H-i}, x_{H-i}) + \sigma_n^2 I]^{-1} y_{H-i},$$
$$\Sigma_H = [C_H(x_i, x_i) + \sigma_n^2] - C_H(x_{H-i}, x_i)^T [C_H(x_{H-i}, x_{H-i}) + \sigma_n^2 I]^{-1} C_H(x_{H-i}, x_i),$$
$$\mu_M = C_M(x_{M-i}, x_i)^T [C_M(x_{M-i}, x_{M-i}) + \sigma_\omega^2 I]^{-1} y_{M-i},$$
$$\Sigma_M = [C_M(x_i, x_i) + \sigma_\omega^2] - C_M(x_{M-i}, x_i)^T [C_M(x_{M-i}, x_{M-i}) + \sigma_\omega^2 I]^{-1} C_M(x_{M-i}, x_i), \tag{9}$$
$$p(x_i \mid x^H_{-i}, c_i = H, c_{-i}, \phi) = \frac{1}{\prod_{m=1}^{d} (u_m - l_m)}, \qquad p(x_i \mid x^M_{-i}, c_i = M, c_{-i}, \phi) = \mathcal{N}(x_i; \mu_x, \Sigma_x),$$
$$p(c_i = H \mid c_{-i}, \rho) = \rho, \qquad p(c_i = M \mid c_{-i}, \rho) = 1 - \rho.$$

In the above equations, x^j_{−i} = x^j \ {x_i} and y^j_{−i} = y^j \ {y_i} for j ∈ {H, M}. Note that rank-one updates of the inverse covariance matrix can be performed to reduce the number of inversions needed.
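To make the component-wise update concrete, the following minimal sketch evaluates Eq. (8) in log space for one data point and resamples its label. All names are ours (se_cov and the dict fields follow the earlier sketch); a real implementation would use Cholesky factorizations and the rank-one updates mentioned above rather than the full inversion shown here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_step_ci(i, c, x, y, theta, phi, rho, lo, up, rng):
    """One Gibbs update of c_i via Eq. (8): leave-one-out GP predictive likelihood
    times input density times label prior. Assumes each class keeps at least one
    member besides i; lo/up are the per-dimension domain bounds."""
    logp = np.empty(2)
    for k, (label, s2, w2, noise) in enumerate(
            [('H', theta['s2_H'], theta['w2_H'], theta['s2_n']),
             ('M', theta['s2_M'], theta['w2_M'], theta['s2_w'])]):
        mask = (c == label)
        mask[i] = False                                      # leave data point i out
        Xj, yj = x[mask], y[mask]
        k_star = se_cov(Xj, x[i:i + 1], s2, w2)              # shape (n_j, 1)
        K_inv = np.linalg.inv(se_cov(Xj, Xj, s2, w2) + noise * np.eye(mask.sum()))
        mu = float(k_star.T @ K_inv @ yj)                    # Eq. (9) predictive mean
        var = s2 + noise - float(k_star.T @ K_inv @ k_star)  # Eq. (9) predictive var
        logp[k] = -0.5 * np.log(2 * np.pi * var) - 0.5 * (y[i] - mu) ** 2 / var
    logp[0] += np.log(rho) - np.sum(np.log(up - lo))         # honest: uniform input density
    logp[1] += np.log(1 - rho) + multivariate_normal.logpdf(
        x[i], phi['mu_x'], phi['Sigma_x'])                   # malicious: Gaussian inputs
    p_H = 1.0 / (1.0 + np.exp(logp[1] - logp[0]))            # normalize the two cases
    return 'H' if rng.random() < p_H else 'M'
```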
4.2 Sampling θ

We can sample the entire θ from its conditional posterior distribution,

$$p(\theta \mid x, y, c, \phi, \rho) \propto p(y \mid x, c, \theta)\, p(\theta) = p(y \mid x, c, \theta)\, p(\sigma_H^2)\, p(\sigma_M^2)\, p(\sigma_n^2)\, p(\sigma_\omega^2) \prod_{m=1}^{d} p(w_{Hm}^2)\, p(w_{Mm}^2). \tag{10}$$

The unnormalized distribution can be derived analytically, along with its partial derivatives w.r.t. each parameter. Hence, Hamiltonian Monte Carlo (HMC) [19] can be used to efficiently sample from this conditional posterior.

4.3 Complementary Metropolis-Hastings Jump

When sampling c and θ, a convergence issue may occur. Intuitively, the joint posterior distribution p(c, θ, φ, ρ | x, y) is bimodal, since the model specification of c and θ is highly symmetric, except for the condition n_H > n_M (reflected in the prior of c) and the distributions of the independent variables x_H and x_M. Thus, switching the labels of all data points (labelling all honest data points as M and all malicious data points as H) produces an inferior mode in the posterior distribution, besides the major mode we are interested in, where all data points are correctly classified. The inferior mode has much less probability mass than the major mode, and since it violates the condition n_H > n_M, it will eventually be rejected when producing posterior predictive inference. Nonetheless, its existence makes the MCMC sampler highly sensitive to the initial state. Due to the nature of Gibbs samplers, c is updated one component at a time; this slow movement through the state space makes it extremely unlikely for the chain to jump from a state close to the inferior mode to a state close to the major mode, because of the low probability mass between the two modes. Hence, apart from the Gibbs sampler for c and the HMC sampler for θ described above, an auxiliary Metropolis-Hastings sampler is introduced to facilitate movement between the two modes. The Metropolis-Hastings proposal distribution proposes a switch of the labels as well as of their associated GP hyperparameters θ. The proposal distributions q(c_* | c) and q(θ_* | θ) of the Metropolis-Hastings jump are:

$$q(c_{i*} \neq c_i \mid c_i) = p_c,$$
$$q(\ln \sigma_{H*} \mid \theta) = \mathcal{N}(\ln \sigma_{H*}; \ln \sigma_M, \sigma_F^2), \qquad q(\ln \sigma_{M*} \mid \theta) = \mathcal{N}(\ln \sigma_{M*}; \ln \sigma_H, \sigma_F^2),$$
$$q(\ln \sigma_{n*} \mid \theta) = \mathcal{N}(\ln \sigma_{n*}; \ln \sigma_\omega, \sigma_N^2), \qquad q(\ln \sigma_{\omega*} \mid \theta) = \mathcal{N}(\ln \sigma_{\omega*}; \ln \sigma_n, \sigma_N^2), \tag{11}$$
$$q(\ln w_{Hm*} \mid \theta) = \mathcal{N}(\ln w_{Hm*}; \ln w_{Mm}, \sigma_W^2), \qquad q(\ln w_{Mm*} \mid \theta) = \mathcal{N}(\ln w_{Mm*}; \ln w_{Hm}, \sigma_W^2), \quad \text{for } m = 1, \cdots, d.$$

Here, p_c is set close to 1, and σ_F², σ_N² and σ_W² are set close to 0, so that the Metropolis-Hastings jump roughly switches the labels of all data points along with their corresponding hyperparameters. The update is accepted with probability

$$p(\text{accept}) = \min\left( \frac{p(c_*, \theta_*, \phi, \rho \mid x, y)\, q(c \mid c_*)\, q(\theta \mid \theta_*)}{p(c, \theta, \phi, \rho \mid x, y)\, q(c_* \mid c)\, q(\theta_* \mid \theta)}, 1 \right). \tag{12}$$

The ratio q(θ | θ_*)/q(θ_* | θ) equals 1, since the part of the Metropolis-Hastings update related to θ is symmetric. The sampler for c and θ therefore becomes a mixture of two MCMC samplers: one is the Gibbs sampler for c combined with the HMC sampler for θ, and the other is the Metropolis-Hastings sampler that jointly updates c and θ. At each iteration of the MCMC sampler algorithm, the Metropolis-Hastings sampler is selected with probability p_MH. Since the sole purpose of this sampler is to get the chain out of the inferior mode, p_MH should be set close to 0. When the chain walks into states close to the inferior mode, the Metropolis-Hastings sampler will take the chain to states close to the major mode; when the chain is already close to the major mode, the Metropolis-Hastings jumps will almost never be accepted.
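Eq. (11) can be read as the following proposal sketch, assuming here that θ is stored as standard deviations (σ_H, σ_M, σ_n, σ_ω) and length-scale vectors (w_H, w_M); all field and argument names are ours, for illustration only. The acceptance step then evaluates Eq. (12) with the full unnormalized posterior.

```python
import numpy as np

def mh_switch_proposal(c, theta, p_c=0.95, s_F=1e-2, s_N=1e-2, s_W=1e-2, rng=None):
    """Complementary MH proposal of Eq. (11): flip each label with probability p_c
    and centre each new log-hyperparameter on its counterpart from the other class."""
    rng = rng or np.random.default_rng()
    flip = rng.random(len(c)) < p_c
    c_star = np.where(flip, np.where(c == 'H', 'M', 'H'), c)
    theta_star = {
        # signal std devs swap roles, with small jitter on the log scale
        'sigma_H': np.exp(rng.normal(np.log(theta['sigma_M']), s_F)),
        'sigma_M': np.exp(rng.normal(np.log(theta['sigma_H']), s_F)),
        # noise std devs likewise
        'sigma_n': np.exp(rng.normal(np.log(theta['sigma_w']), s_N)),
        'sigma_w': np.exp(rng.normal(np.log(theta['sigma_n']), s_N)),
        # length-scales swap dimension-wise
        'w_H': np.exp(rng.normal(np.log(theta['w_M']), s_W)),
        'w_M': np.exp(rng.normal(np.log(theta['w_H']), s_W)),
    }
    return c_star, theta_star
```

With p_c near 1 and the jitter scales near 0, a single proposal approximately exchanges the two classes wholesale, which is exactly the move needed to escape the label-switched inferior mode.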
4.4 Sampling φ

To sample φ, we generate a random sample from the conditional posterior of each component, i.e. from p(μ_x | x_M, c, Σ_x) and p(Σ_x | x_M, c, μ_x), both of which are analytically tractable. For μ_x,

$$p(\mu_x \mid x_M, c, \Sigma_x) = \frac{p(x_M \mid c, \mu_x, \Sigma_x)\, p(\mu_x)}{\int_{\mu_x} p(x_M \mid c, \mu_x, \Sigma_x)\, p(\mu_x)\, d\mu_x} = \frac{\frac{1}{C} \prod_{\{i: c_i = M\}} \mathcal{N}(\mu_x; x_i, \Sigma_x) \prod_{m=1}^{d} \frac{1}{u_m - l_m}}{\int_{\mu_x} \frac{1}{C} \prod_{\{i: c_i = M\}} \mathcal{N}(\mu_x; x_i, \Sigma_x) \prod_{m=1}^{d} \frac{1}{u_m - l_m}\, d\mu_x} = \frac{\mathcal{N}\!\left(\mu_x; \bar{x}_M, \frac{\Sigma_x}{n_M}\right)}{\int_{\mu_x} \mathcal{N}\!\left(\mu_x; \bar{x}_M, \frac{\Sigma_x}{n_M}\right) d\mu_x}, \tag{13}$$

where C = ∏_{i: c_i = M} ∫_{μ_x} N(μ_x; x_i, Σ_x) dμ_x is the product of the normalizing constants that truncate the Gaussian distributions. Since C appears in both the numerator and the denominator, it cancels out. Note that this conditional posterior is a truncated Gaussian distribution, from which samples can be generated efficiently. For Σ_x, we have

$$p(\Sigma_x \mid x_M, c, \mu_x) = \frac{p(x_M \mid c, \mu_x, \Sigma_x)\, p(\Sigma_x)}{\int_{\Sigma_x} p(x_M \mid c, \mu_x, \Sigma_x)\, p(\Sigma_x)\, d\Sigma_x} = \frac{\frac{1}{C} \left( \prod_{\{i: c_i = M\}} \mathcal{N}(\mu_x; x_i, \Sigma_x) \right) \mathcal{W}^{-1}(\Sigma_x; \Psi, \nu)}{\int_{\Sigma_x} \frac{1}{C} \left( \prod_{\{i: c_i = M\}} \mathcal{N}(\mu_x; x_i, \Sigma_x) \right) \mathcal{W}^{-1}(\Sigma_x; \Psi, \nu)\, d\Sigma_x} = \mathcal{W}^{-1}(\Sigma_x; Z + \Psi, n_M + \nu), \tag{14}$$

where Z = Σ_{i: c_i = M} (x_i − μ_x)(x_i − μ_x)^T. This conditional posterior is an inverse-Wishart distribution and can be sampled directly.

Algorithm 1: MCMC Sampler
Input: S, x, y and the parameters of the priors
Output: S samples from the posterior distribution, (c^(1), θ^(1), φ^(1), ρ^(1)), ..., (c^(S), θ^(S), φ^(S), ρ^(S))
1 Initialize θ, φ and ρ by randomly drawing from their respective priors.
2 Initialize c.
3 for r = 1, ..., S do
4   Sample u_1 ∼ U(0, 1).
5   if u_1 < p_MH (the probability of proposing a Metropolis-Hastings jump) then
6     Propose a jump to c_*, θ_* via q(c_* | c) and q(θ_* | θ).
7     Compute the acceptance probability p(accept).
8     Sample u_2 ∼ U(0, 1).
9     if u_2 < p(accept) then
10      c ← c_*, θ ← θ_*.
    else
11    for i = 1, ..., n do
12      Sample c_i from p(c_i | c_{−i}, x, y, θ, φ, ρ).
13    Sample θ from p(θ | x, y, c, φ, ρ) by HMC.
14  Sample μ_x from p(μ_x | x, y, c, θ, Σ_x, ρ).
15  Sample Σ_x from p(Σ_x | x, y, c, θ, μ_x, ρ).
16  Sample ρ from p(ρ | x, y, c, θ, φ).
17  Save the current values of c, θ, φ, ρ as (c^(r), θ^(r), φ^(r), ρ^(r)).
18 return {(c^(1), θ^(1), φ^(1), ρ^(1)), ..., (c^(S), θ^(S), φ^(S), ρ^(S))}.

4.5 Sampling ρ

Finally, to sample ρ, we generate a random sample from its conditional posterior, which is also a beta distribution:

$$p(\rho \mid x, y, c, \theta, \phi) = p(\rho \mid c) = \frac{p(c \mid \rho)\, p(\rho)}{\int_0^1 p(c \mid \rho)\, p(\rho)\, d\rho} = \text{Beta}(\rho; \alpha_\rho + n_H, \beta_\rho + n - n_H). \tag{15}$$

5 The Algorithms

Our method consists of two algorithms. The first, presented in Algorithm 1, is an MCMC sampler used to generate samples of the model parameters from the posterior distribution p(c, θ, φ, ρ | x, y). The second, presented in Algorithm 2, summarizes the posterior predictive distribution by calculating the posterior mean and variance. S is set sufficiently large to allow the Markov chain to approximately converge to the target posterior distribution; to assess convergence, we use the method introduced by Gelman et al. [18]. Apart from the mean and variance of the posterior predictive distribution, we also obtain the trustworthiness of each worker (each data point), in the form of the posterior probability of each data point being honest, p(c_i | x, y) for i = 1, ..., n. This information can be used to further model the behavior of each worker in crowdsourcing applications; however, this is not the main focus of this paper and will not be further discussed. The derivation of the posterior probability p(c_i | x, y) is straightforward and is thus omitted from the algorithm specification.
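For completeness, a minimal sketch of how this trustworthiness could be computed from the output of Algorithm 1: the estimator is simply the fraction of post-warm-up samples in which data point i is labelled honest. The helper and its names are ours, not part of the paper's algorithm specification.

```python
import numpy as np

def trustworthiness(c_samples, n_warmup):
    """Estimate p(c_i = H | x, y) for every worker as the empirical frequency of
    the label 'H' over the retained MCMC samples.
    c_samples: sequence of label vectors (entries 'H'/'M'), one per iteration."""
    kept = np.asarray(c_samples[n_warmup:])
    return (kept == 'H').mean(axis=0)   # vector of length n
```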
Assuming d ≪ n, the computational complexity of Algorithm 1 is dominated by the HMC step for sampling θ, in which L leapfrog steps are taken, each requiring a matrix inversion of O(n³) complexity. This gives Algorithm 1 an overall complexity of O(SLn³). The computational complexity of Algorithm 2 is O(S′(n_*² + n_* n²)), where n_* is the length of x_*.

Algorithm 2: Posterior Predictive Inference
Input: S samples from the posterior distribution, (c^(1), θ^(1), φ^(1), ρ^(1)), ..., (c^(S), θ^(S), φ^(S), ρ^(S)), and x_*
Output: μ_{f_H(x_*)}, Σ_{f_H(x_*)}
1 Discard the first half of the samples as warm-up; S′ denotes the number of retained samples used to approximate the posterior predictive distribution.
2 r′ ← 1.
3 for r = 1, ..., S′ do
4   n_H ← Σ_{i=1}^{n} δ(c_i, H).
5   if 2 n_H > n then
6     μ^(r′) ← C_H(x_H, x_*)^T [C_H(x_H, x_H) + σ_n² I]^{−1} y_H.
7     Σ^(r′) ← C_H(x_*, x_*) − C_H(x_H, x_*)^T [C_H(x_H, x_H) + σ_n² I]^{−1} C_H(x_H, x_*).
8     r′ ← r′ + 1.
9 μ_{f_H(x_*)} ← (1/S′) Σ_{r′=1}^{S′} μ^(r′).
10 Σ_{f_H(x_*)} ← (1/S′) Σ_{r′=1}^{S′} (Σ^(r′) + μ^(r′) μ^(r′)T) − μ_{f_H(x_*)} μ_{f_H(x_*)}^T.
11 return (μ_{f_H(x_*)}, Σ_{f_H(x_*)}).

6 Simulation Results

To demonstrate that the algorithms are able to make accurate inferences from data, we performed experiments on a one-dimensional synthetic dataset, with n = 100 and d = 1, generated in accordance with our assumptions. In the synthetic dataset, the underlying function is f_H(x) = sin(πx/2) and the malicious function is f_M(x) = (1/2) tan(πx/4). 60 noisy observations are sampled from f_H, with homogeneous Gaussian noise of standard deviation 0.10; the independent variables are distributed uniformly on [0, 1]. 40 noisy observations are sampled from f_M, with homogeneous Gaussian noise of standard deviation 0.12; the independent variables are Gaussian with mean 0.80 and standard deviation 0.10, truncated to the range [0, 1]. The parameters of the priors are:

α_f = 2.0, β_f = 2.0, α_n = 2.0, β_n = 1.0, μ_w = 0.7, σ_w² = 0.3, Ψ = 0.0025, ν = 1.

For the Metropolis-Hastings sampler, the probability of proposing a jump is p_MH = 0.05, and the parameters of the proposal distribution are:

p_c = 0.95, σ_F² = 1 × 10⁻⁴, σ_N² = 1 × 10⁻⁴, σ_W² = 1 × 10⁻⁴.

For the sampler of θ, the hyperparameters related to H and M are sampled separately, each using an HMC algorithm with L = 10 leapfrog steps. Finally, for this particular dataset, we set S = 2000, since the Markov chain converges sufficiently after 2000 iterations of the MCMC sampler; we then pass the second half of the resulting samples to the posterior predictive inference. The results are shown in Fig. 1.

Figure 1: Posterior Predictive Inference Results of Synthetic Dataset

Using the posterior mean as the estimator results in a Mean Square Error (MSE) of 0.0012. The posterior probability p(c_i | x, y) (trustworthiness) of each data point is also computed. In this example, the average trustworthiness of the truly honest data points is 0.9906, while the average trustworthiness of the truly malicious data points is 5 × 10⁻⁵. It is thus clear that, for this dataset, the accuracy of the proposed algorithm is high. As for the efficiency of the algorithm, we measured the average time cost by running the algorithm 50 times on the above dataset on a MacBook Pro equipped with an Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz; the resulting average time cost is 27.43 seconds per run.
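The synthetic dataset described above can be generated along the following lines (a minimal sketch: the seed and the rejection-based truncation of the malicious inputs are our choices, as the paper does not specify them).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed; not reported in the paper

# 60 honest observations from f_H(x) = sin(pi x / 2), noise sd 0.10, x ~ U(0, 1)
x_H = rng.uniform(0.0, 1.0, 60)
y_H = np.sin(np.pi * x_H / 2) + rng.normal(0.0, 0.10, 60)

# 40 malicious observations from f_M(x) = 0.5 * tan(pi x / 4), noise sd 0.12,
# x ~ N(0.80, 0.10^2) truncated to [0, 1] (here via simple rejection sampling)
x_M = rng.normal(0.80, 0.10, 400)
x_M = x_M[(x_M >= 0.0) & (x_M <= 1.0)][:40]
y_M = 0.5 * np.tan(np.pi * x_M / 4) + rng.normal(0.0, 0.12, x_M.size)

x = np.concatenate([x_H, x_M])
y = np.concatenate([y_H, y_M])
c_true = np.array(['H'] * 60 + ['M'] * 40)  # ground truth, used only for evaluation
```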
7 Conclusion and Future Works

In this paper, we build a Gaussian Process mixture model and design an MCMC-based algorithm to address the issue of collusion attacks in the regression setting. Honest data, observed from the underlying function, and malicious data are modelled as two separate Gaussian Processes. Compared to [13], where malicious data are assumed to be observed from the same underlying function with extra noise, we claim that the mixture model is a more realistic and natural choice. We use a synthetic dataset to show that our algorithm is able to produce accurate posterior predictive inference and is computationally efficient.

As part of an ongoing study, the results of this paper suggest several directions for future work. Firstly, a single malicious function (class) is too restrictive to handle more sophisticated real-world scenarios, and the algorithm currently lacks the ability to handle cases with multiple malicious groups. It is therefore reasonable to extend the model to incorporate multiple malicious classes. One option is to build a finite mixture of Gaussian Processes with an unknown number of components and design an MCMC sampler using reversible jump techniques [20]; another is to build an infinite mixture of Gaussian Processes similar to [21]. Secondly, this model is unable to handle heteroscedasticity, as mentioned before, as well as non-Gaussian noise; we will explore approaches to handle these situations in future studies. Lastly, our model is not built specifically to model the trustworthiness of workers, which is a significant issue in crowdsourcing applications. In future work we will improve the model to make robust, collusion-resistant predictions while maintaining the trustworthiness of workers for future predictions.

References

[1] Howe, Jeff. "Crowdsourcing: A definition." Crowdsourcing: Tracking the Rise of the Amateur (2006).

[2] To, H., Kim, S.H., and Shahabi, C. "Effectively crowdsourcing the acquisition and analysis of visual data for disaster response." In Big Data (Big Data), 2015 IEEE International Conference on, pp. 697-706. IEEE, 2015.

[3] Meier, Fred, Daniel Fenner, Tom Grassmann, Britta Jänicke, Marco Otto, and Dieter Scherer. "Challenges and benefits from crowd-sourced atmospheric data for urban climate research using Berlin, Germany, as testbed."

[4] Krause, Andreas, Eric Horvitz, Aman Kansal, and Feng Zhao. "Toward community sensing." In Proceedings of the 7th International Conference on Information Processing in Sensor Networks, pp. 481-492. IEEE Computer Society, 2008.

[5] Kim, Samuel, Maurizio Filippone, Fabio Valente, and Alessandro Vinciarelli. "Predicting the conflict level in television political debates: an approach based on crowdsourcing, nonverbal communication and Gaussian processes." In Proceedings of the 20th ACM International Conference on Multimedia, pp. 793-796. ACM, 2012.

[6] Shahabi, Cyrus. "Towards a generic framework for trustworthy spatial crowdsourcing." In Proceedings of the 12th International ACM Workshop on Data Engineering for Wireless and Mobile Access, pp. 1-4. ACM, 2013.

[7] Welinder, Peter, and Pietro Perona. "Online crowdsourcing: rating annotators and obtaining cost-effective labels." (2010): 25-32.

[8] Hall, David L., and John M. Jordan. Human-Centered Information Fusion. Artech House, 2014.

[9] Groot, Perry, Adriana Birlutiu, and Tom Heskes. "Learning from multiple annotators with Gaussian processes." In Artificial Neural Networks and Machine Learning – ICANN 2011, pp. 159-164. Springer Berlin Heidelberg, 2011.
[10] Tarasov, Alexey, Sarah Jane Delany, and Brian Mac Namee. "Dynamic estimation of worker reliability in crowdsourcing for regression tasks: Making it work." Expert Systems with Applications 41, no. 14 (2014): 6190-6210.

[11] Reece, Steven, Stephen Roberts, Christopher Claxton, and David Nicholson. "Multi-sensor fault recovery in the presence of known and unknown fault types." In Information Fusion, 2009. FUSION '09. 12th International Conference on, pp. 1695-1703. IEEE, 2009.

[12] Venanzi, Matteo, Alex Rogers, and Nicholas R. Jennings. "Trust-based fusion of untrustworthy information in crowdsourcing applications." In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 829-836. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[13] Venanzi, Matteo, Alex Rogers, and Nicholas R. Jennings. "Crowdsourcing spatial phenomena using trust-based heteroskedastic Gaussian processes." In First AAAI Conference on Human Computation and Crowdsourcing, 2013.

[14] Rasmussen, Carl Edward. "Gaussian processes for machine learning." (2006).

[15] Goldberg, Paul W., Christopher K.I. Williams, and Christopher M. Bishop. "Regression with input-dependent noise: A Gaussian process treatment." Advances in Neural Information Processing Systems 10 (1997): 493-499.

[16] Titsias, Michalis K., and Miguel Lázaro-Gredilla. "Variational heteroscedastic Gaussian process regression." In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 841-848. 2011.

[17] Kersting, Kristian, Christian Plagemann, Patrick Pfaff, and Wolfram Burgard. "Most likely heteroscedastic Gaussian process regression." In Proceedings of the 24th International Conference on Machine Learning, pp. 393-400. ACM, 2007.

[18] Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Vol. 2. Boca Raton, FL, USA: Chapman & Hall/CRC, 2014.

[19] Duane, Simon, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. "Hybrid Monte Carlo." Physics Letters B 195, no. 2 (1987): 216-222.

[20] Green, Peter J. "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination." Biometrika 82, no. 4 (1995): 711-732.

[21] Rasmussen, Carl Edward, and Zoubin Ghahramani. "Infinite mixtures of Gaussian process experts." Advances in Neural Information Processing Systems 2 (2002): 881-888.