<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Collusion-resistant Spatial Phenomena Crowdsourcing via Mixture of Gaussian Processes Regression</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qikun Xiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ido Nevat</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pengfei Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jie Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nanyang Technological University</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>R)</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the rapid development of mobile devices, spatial location-based crowdsourcing applications have attracted much attention. These applications also introduce new security risks due to untrustworthy data sources. In the context of crowdsourcing applications for spatial interpolation (i.e. spatial regression) using crowdsourced data, the results can be seriously affected if malicious data sources initiate a colluding (collaborate) attacks which purposely alter some of the measurements. To combat this serious detrimental effect, and to mitigate such attacks, we develop a robust version via a Gaussian Process mixture model and develop a computationally efficient algorithm which utilises a Markov chain Monte Carlo (MCMC)-based methodology to produce an accurate predictive inference in the presence of collusion attacks. The algorithm is fully Bayesian and produces posterior predictive distribution for any point-of-interest in the input space. It also assesses the trustworthiness of each worker, i.e. the probability of each worker being honest (trustworthy). Simulation results demonstrate the accuracy of this algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Crowdsourcing, as defined by Howe [1], is often characterized by the participation of large networks of people
from the public in tasks given by companies or institutes. It has been broadly used in data acquisition, which
replaces the traditional data gathering processes that are costly. Recent works have shown many successful
applications in crowdsourced information gathering from various different domains. In [2], the authors
introduce a crowdsourcing framework for acquisition and analysis of mobile videos to monitor rapidly changing
disaster sites that optimally utilise bandwidth and resources. In a study done by Meier et al. [3], crowdsourced
atmospheric data in Berlin, Germany was used for urban climate research. In [4], a privacy-preserving
community sensing method is developed for real-time road traffic monitoring. Another study by Kim et al. [5]
uses crowdsourcing and regression techniques to automatically analyse the conflict level in television political
debates.</p>
      <p>While decentralized spatial information gathering (crowdsourcing) techniques have clear benefits, such as
being inexpensive and easy to implement, their quality and reliability are not as good as data collected from
domain experts. The issue of trust is one of the major impediments to the success of real-world
crowdsourcing applications [6]. Apart from inherent noises, untrustworthiness in crowdsourced data can originate from
various ways. For example the sensor used for measurement can be faulty (if the data is in the form of sensor
reading). Data that come from humans can be subjective and highly biased [7]. For example, workers might put
very little effort into the work they are assigned in the hope that they will not be noticed and still get the reward.
Other workers might lack the essential skills to perform the task, hence produce erroneous data [7]. In addition,
there can be adversaries that report malicious data strategically to interrupt the crowdsourcing application [8].</p>
      <p>The challenge of identifying and filtering untrustworthy information before summarization (prediction,
decision making, etc.) has been addressed by a number of researchers in various domains. Some of these works
focus on binary/multi-class labelling tasks. Groot et at. [9] applied a Gaussian Process for regression model to
merge information from multiple inaccurate annotators to produce reliable labels for objects. Tarasov et at. [10]
proposed a method to dynamically estimate reliability of workers in crowdsourcing for regression by adopting
a multi-armed bandits (MAB) model.</p>
      <p>Some studies are applied to problems of estimation. Reece et al. [11] studied the problem of tackling sensor
faults in applications based on sensor networks. Their method consists of two stages. Firstly the model-based
step detects and filters sensor faults with known types, then the consensus-based step removes unanticipated
faults that passed the first step. Venanzi et al. [12] developed an algorithm called MaxTrust that jointly estimates
the target attribute and the trustworthiness of each worker using maximum likelihood paradigm.</p>
      <p>Not much attention has been given to address the problem of spatial regression (or the estimation of a spatial
process/function) with the presence of untrustworthy data sources. Venanzi et al. [13] proposed a
heteroskedastic Gaussian Process (HGP) model with independent noises and designed an algorithm called TrustHGP to
jointly estimate the spatial function and the trustworthiness of each worker. Their model assumes that all
reports are noisy observations from the same underlying function with heterogeneous and independent noises.
Malicious observations are associated with higher noises compared to honest observations. These assumptions
of their model are too intuitive to be able to handle complex real-world situations. Especially, this method is
vulnerable against collusion attacks.</p>
      <p>In collusion attack scenarios in crowdsourcing, multiple untrustworthy workers can collaborate to inject
deceptive false data to skew the regression results. More specifically, an effective strategy for attackers is to
make a number of similar reports that are far from the ground-truth (or follow a spatial function that is far
from the underlying spatial function) with similar input values. This often results in the regression algorithm
mistakenly accepting the false value provided by attackers. We refer to this type of attacks and similar types
as collusion attacks. We will formally present the attacker model in the model specification in Section 3 of this
paper.
1.1</p>
      <sec id="sec-1-1">
        <title>The Proposed Gaussian Processes Mixture Model</title>
        <p>In order to address the issue of collusion attacks, we propose a Gaussian Processes mixture model where
malicious data and non-malicious (honest) data are modelled to be from two different realizations of Gaussian
Processes. We develop a Markov chain Monte Carlo (MCMC) based algorithm that takes all data as input and
produces statistical prediction of function value at any spatial location of interest. The predictions produced
by the algorithm is based on honest data. It minimizes the impact of malicious data by excluding them before
performing posterior inference with high accuracy. The algorithm also assesses the trustworthiness of each
worker in terms of the probability that the data it produces being honest. We show that this method is truly
collusion-resistant by performing experiment on synthetic dataset.
1.2</p>
      </sec>
      <sec id="sec-1-2">
        <title>Contribution</title>
        <sec id="sec-1-2-1">
          <title>The contribution of this paper includes:</title>
          <p>We develop a unified statistical framework for collusion attacks in the regression setting via a Gaussian
Process mixture model.</p>
          <p>We develop a novel MCMC-based algorithm to produce comprehensive summary of data in terms of
posterior predictive distribution and trustworthiness of each worker (data point). The inference method is
fully Bayesian.</p>
          <p>We conduct a proof of concept on synthetic dataset and show that our method is truly resistant to collusion
attacks.
In the field of spatial monitoring, n sets of data are gathered from n different workers that are potentially
untrustworthy. Each set of data comes in the form of a tuple (xi; yi); i = 1; : : : ; n, where xi 2 Rd represents
the independent variables (spatial location, time, input features, etc.), and yi 2 R represents the response
(dependent) variable that depends on xi. Let x := fx1; : : : ; xng, and let y := fy1; : : : ; yng. We refer to a tuple of
(xi; yi) as a data point. Each data point is either honest or malicious, depending on whether it comes from a
trustworthy or untrustworthy worker. Honest data points are noisy observations from the underlying function
fH : Rd ! R, while malicious data points are not (and will be defined later).</p>
          <p>The task of the regression problem is that given x and y, produce a reliable estimation of fH(x ) with its
probability distribution, for any given x 2 Rd, that is free from the influence of malicious data. The probability
density function p(fH(x )jx ; x; y) is referred to as posterior predictive distribution and contains all the information
about fH(x ), given all the observations.</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>2.1 Illustrative Example of Spatial Phenomena Crowdsourcing</title>
        <p>To motivate our model, we present a real-world example of the stated problem. Suppose a meteorological study
aims to estimate the spatial structure of atmospheric parameters (e.g. temperature, humidity, air quality, etc.)
of a particular region of interest by using crowdsourced data collected from workers in the region by hand-held
devices equipped with sensors. In this example, each report (xi; yi) consists of: xi, the geographical location
of the i-th worker (longitude and latitude), and yi, the reading of the atmospheric parameter from the sensor
yi, e.g. air quality. The underlying function of interest fH corresponds to the inherent spatial structure of the
parameter, i.e. for a given location x , fH(x ) corresponds to the ground-truth of the value of the parameter
at the location. Suppose that, for some reasons, a worker reports dishonest data at location xm that does not
correspond to the actual parameter value fH(xm). This piece of data is deemed as malicious. There can be
practical reasons why malicious data exists, such as to defame a spatial location by claiming that the air quality
in the vicinity is poor. There can be multiple malicious workers reporting dishonestly to achieve their objectives.
Thus, the data reported by these workers cannot be trusted and should be regarded as unrelated to fH.
3</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Crowdsourcing Statistical Model Development</title>
      <p>In this section, we formally present the statistical crowdsourcing model which uses Gaussian Process mixture
model.</p>
      <sec id="sec-2-1">
        <title>3.1 Gaussian Process, Covariance Functions and Hyperparameters</title>
        <p>We model the physical phenomena as spatially dependent continuous processes with a spatial correlation
structure and are independent from each other. The degree of the spatial correlation in the process increases with
the decrease of the separation between two observing locations and can be accurately modelled as a Gaussian
random field 1. A Gaussian process (GP) defines a distribution over a space of functions and it is completely
specified by the equivalent of sufficient statistics for such a process, and is formally defined as follows.
Definition 1 (Gaussian process [14]): Let X RD be some bounded domain of a D-dimensional real valued
vector space. Denote by f (x) : X 7! R a stochastic process parametrized by x 2 X . Then, the random function
f (x) is a Gaussian process if all its finite dimensional distributions are Gaussian, where for any m 2 N, the
random variables (f (x1) ; ; f (xm)) are normally distributed.</p>
        <p>We can therefore interpret a GP as formally defined by the following class of random functions:</p>
        <p>F := ff ( ) : X 7! R s.t. f ( )
(x; ) := E [f (x)] : X 7! R;</p>
        <p>GP ( ( ; ) ; C ( ; ; )) ; with
C (xi; xj; ) := E [(f (xi)
(xi; )) (f (xj)
(xj; ))] : X</p>
        <p>X 7! R+ ;
where at each point the mean of the function is ( ; ); parameterised by , and the spatial dependence between
any two points is given by the covariance function (Mercer kernel) C ( ; ; ), parameterised by (see detailed
discussion in [14]).</p>
        <p>1We use Gaussian Process and Gaussian random field interchangeably.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Specification of the Spatial Processes</title>
        <p>We assume that fH can be accurately modelled a Gaussian Process prior with zero mean and square exponential
covariance function parameterised by H2; wH21; ; wH2d.</p>
        <p>Similarly, fM is also modelled a Gaussian Process prior with zero mean and square exponential covariance
function parameterised by M2; wM21; ; wM2d.</p>
        <p>CH(xi; xi0 ) = H2 exp</p>
        <p>d
1 X (xim
2
m=1
wH2m</p>
        <p>xi0m)2 !
fH</p>
        <p>GP(0; CH);
fM</p>
        <p>GP(0; CM);
CM(xi; xi0 ) = M2 exp</p>
        <p>d
1 X (xim
2
m=1
wM2m
xi0m)2 !
:
:</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.3 Partition of Data Points and Hyperparameters of Gaussian Processes</title>
        <p>Data points are partitioned into 2 mutually exclusive sets, H and M, denoting honest data points and malicious
data points. In other words, either i 2 H or i 2 M. Without prior knowledge, H is indistinguishable from M.
Hence, we make the assumption that H contains the majority of data points, that is, jHj &gt; jMj. The case where
malicious data points become the majority is unrealistic in real applications. When it happens, there is no way
to distinguish which set is honest without informative prior knowledge.</p>
        <p>Since each data point belongs either to H or M, we use c 2 fH; Mgn to denote the configuration vector, where
each component ci denotes the set which i belongs to. In other words,</p>
        <p>As mentioned above, honest data points are noisy observations from underlying function fH : Rd ! R.
Similarly, malicious data points are noisy observations from a different function fM : Rd ! R. Hence,
ci =
(H if i 2 H</p>
        <p>:</p>
        <p>M if i 2 M
yi =
(fH(xi) + i; if ci = H
fM(xi) + !i; if ci = M</p>
        <p>
          ;
i iid N (0; n2); !i iid N (0; !2):
(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
(
          <xref ref-type="bibr" rid="ref2">2</xref>
          )
where
Within each set, the Gaussian noises are assumed to be independently and identically distributed (iid). The
assumption of homoscedasticity is made to simplify derivation. In future work we will explore the possibility
to extended it to incorporate input-dependent noises such as in [15–17].
        </p>
        <p>Let denote hyperparameters of above GPs (including variances of noise),
= ( H2; M2; wH21;
; wH2d; wM21;
; wM2d; n2; !2):</p>
        <p>In these hyperparameters, signal variances H2 and M2, noise variances n2 and !2 are given inverse-gamma
priors,
2 ; M2 iid Inv-Gamma( f ; f ); n2; !2 iid Inv-Gamma( n; n);</p>
        <p>H
and length-scales wH21;
; wH2d; wM21;</p>
        <p>; wM2d are given log-normal prior,
wH21;
; wH2d; wM21;</p>
        <p>; wM2d iid ln N ( w; w2):</p>
        <p>Placing priors over hyperparameters makes the method fully Bayesian and provides extra flexibility to the
model. Parameters f ; f ; n; n; w; w2 are pre-set values to make these priors non-informative. Alternatively,
when prior knowledge about hyperparameters is available, these parameters can be tuned in order to achieve
better accuracy.</p>
        <p>In addition, we define the following notations that are used in the following parts of the paper.
xH = fxi : ci = Hg; xM = fxi : ci = Mg;
yH = fyi : ci = Hg; yM = fyi : ci = Mg;</p>
        <p>n n
nH = jHj = X (ci; H); nM = jMj = X (ci; M):
i=1
i=1
3.4</p>
      </sec>
      <sec id="sec-2-4">
        <title>Distribution of Independent Variables</title>
        <p>In our assumptions, xH is distributed uniformly throughout the domain of x since no region is preferred over
other regions when gathering data. When prior knowledge about distribution of independent variables is
present, the uniformity assumption can be replaced with no difficulty.</p>
        <p>xHim
iid U (lm; um); for i = 1
n; m = 1
d;
where xHim denotes the mth dimension of ith honest data point, lm, um are the lower bound and upper bound
of the mth dimension.</p>
        <p>It is assumed that xM follows a d-variate Gaussian distribution with mean x and covariance x. This is a
practical assumption since in the presence of collusion attacks, the damage maximizing strategy for attackers
is to concentrate malicious data points in a particular region. In addition, attackers are assumed to possess
limited resources, making it difficult for them to inject malicious data from a broad region (for example, when
the independent variable is geographical locations). In the case of a distributed attack where independent
variables of malicious data spread through a broad region, x corresponds to a large value that makes the
distribution of xM flat.</p>
        <p>xMi iid N ( x; x); for i = 1
n;
where xMi denotes the ith malicious data point.</p>
        <p>Let denote parameters of the input distribution,
where x is given uniform prior,
and</p>
        <p>x is given inverse-Wishart prior,
and the parameters ; are set to be non-informative.
3.5</p>
      </sec>
      <sec id="sec-2-5">
        <title>Prior Over Configurations</title>
        <p>= ( x; x);
xm</p>
        <p>U (lm; um); for m = 1</p>
        <p>d;
x</p>
        <p>W 1( ; );
We assume there is no a priori information about the trustworthiness of each worker. Hence there is an identical
probability of each data point being honest. Therefore, c has a n-variate Bernoulli distribution with parameter
(H corresponds to the positive outcome and M corresponds to the negative outcome),
It is obvious that nH follows a binomial distribution with parameters n and .</p>
        <p>ci iid Bernoulli( ); for i = 1</p>
        <p>n:
nH</p>
        <p>B(n; ):</p>
        <p>Strictly speaking, since nH &gt; nM, nH should follow a truncated binomial distribution where nH &gt; n2 . For
simplicity, this condition is ignored here. As later discussed in the algorithm, if a posterior sample violates the
condition, we simply ignore the sample.</p>
        <p>is given a beta prior, with known parameters and ,</p>
        <sec id="sec-2-5-1">
          <title>Here and should be set such that be trusted as being honest. , because before making any observations, all workers should</title>
        </sec>
        <sec id="sec-2-5-2">
          <title>Beta( ;</title>
          <p>
            ):
This full posterior p(c; ; ; jx; y) is clearly intractable. However, if we could sample from this posterior
p(c; ; ; jx; y) and get S samples f(c(
            <xref ref-type="bibr" rid="ref1">1</xref>
            ); (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ); (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ); (
            <xref ref-type="bibr" rid="ref1">1</xref>
            )); ; (c(S); (S); (S); (S))g, we could approximate the
posterior predictive distribution by:
p(fH(x )jx ; x; y)
r=1
          </p>
          <p>S
1 X p(fH(x )jx ; x; y; c(r); (r))</p>
          <p>S
This posterior predictive distribution is a mixture of multivariate Gaussian distribution, whose mean
and covariance matrix fH(x ) are given by:
fH(x )
p(c; jx; y) =</p>
          <p>p(c; ; ; jx; y)d d ;
p(c; ; ; jx; y) / p(yjx; c; )p(xjc; )p(cj )p( )p( )p( ):</p>
          <p>S
1 X
fH(x ) = S</p>
          <p>r=1
fH(x ) =</p>
          <p>S
X( (r) +
r=1
(r);
(r) (r)T )</p>
          <p>T
fH(x ) fH(x );</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Posterior Predictive Inference and MCMC Sampler Design</title>
      <p>Based on the model we propose, the goal is to obtain the posterior predictive distribution, which is derived by
marginalizing over c and , i.e.</p>
      <p>
        p(fH(x )jx ; x; y) =
p(fH(x )jx ; x; y; c; )p(c; jx; y)d :
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
(
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
(
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
(
        <xref ref-type="bibr" rid="ref6">6</xref>
        )
(
        <xref ref-type="bibr" rid="ref7">7</xref>
        )
(
        <xref ref-type="bibr" rid="ref8">8</xref>
        )
where (r) and (r) are the mean and covariance matrix of the rth component of the mixture, respectively.
Inference of other characteristics of the posterior predictive distribution, such as median, mode, percentiles, etc.
can be done by generating T samples from each of the S components from the approximate posterior predictive
distribution and use the histogram of ST samples to approximate the posterior predictive distribution.
      </p>
      <p>Therefore, we develop a MCMC sampler algorithm to generate samples from the posterior distribution
p(c; ; ; jx; y). We can use Gibbs Sampling to generate samples from conditional posterior distribution of
each component of the parameter space.
4.1</p>
      <sec id="sec-3-1">
        <title>Sampling c</title>
        <p>Since it is difficult to sample c as a whole from its conditional posterior distribution, we sample it component
by component. Let c i denote the vector resulting from removing i-th component from c.
p(ci = jjc i; x; y; ; ; ) / p yijci = j; c i; xj i ; yj i ;
p(xijxj i ; ci = j; c i; )p(ci = jjc i; )
= N (yi; j ; j )p(xijxj i ; ci = j; c i; )p(ci = jjc i; ); for j 2 fH; Mg;
The first term inside the integral p(f (x )jx ; x; y; c; ) is the posterior distribution of Gaussian Process
regression and is Gaussian:</p>
        <p>p(fH(x )jx ; x; y; c; ) = N (fH(x ); f ; f );
where
f = CH(xH; x )T [CH(xH; xH) + n2I] 1yH;
f = CH(x ; x ) CH(xH; x )T [CH(xH; xH) +
n2I] 1CH(xH; x ):
The second term inside the integral p(c; jx; y) is the marginal posterior distribution of c and
marginalizing out and from the full posterior p(c; ; ; jx; y).</p>
        <p>obtained by
X Z
c
Z 1 Z
0
where
p(xijxj i ; ci = H; c i; ) =
p(xijxj i ; ci = M; c i; ) = N (xi; x; x);
p(ci = Hjc i; ) = ;
p(ci = Mjc i; ) = 1</p>
        <p>:
M = [CM(xi; xi) + !2]</p>
        <p>1
d
Q (um
m=1
lm)</p>
        <p>;
H = CH(xH i ; xi)T [CH(xH i ; xH i ) + n2I] 1yH i ;
H = [CH(xi; xi) + n2]</p>
        <p>CH(xH i ; xi)T [CH(xH i ; xH i ) + n2I] 1CH(xH i ; xi);
M = CM(xM i ; xi)T [CM(xM i ; xM i ) + !2I] 1yM i ;</p>
        <p>CM(xM i ; xi)T [CM(xM i ; xM i ) + !2I] 1CM(xM i ; xi);
q(ci 6= cijci) = pc;
q(ln H j ) = N (ln H ; ln M; F2 );
q(ln M j ) = N (ln M ; ln H; F2 );
q(ln n j ) = N (ln n ; ln !; N2 );
q(ln ! j ) = N (ln ! ; ln n; N2 );
q(ln wHm j ) = N (ln wHm ; ln wMm; W2 ); for m = 1
q(ln wMm j ) = N (ln wMm ; ln wHm; W2 ); for m = 1
d;
d:
In above equations, xj i = x n xi, yj i = y n yi for j 2 fH; Mg. Note that rank-one updates to the inverse
covariance matrix can be performed to reduce the number of inversion needed.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2 Sampling</title>
        <p>We can sample the entire from its conditional posterior distribution.</p>
        <p>p( jx; y; c; ; ) / p(yjx; c; )p( )
d
= p(yjx; c; )p( H2)p( M2)p( n2)p( !2) Y p(wH2m)p(wM2m)
m=1
The unnormalized distribution can be derived analytically, along with its partial derivative w.r.t. each
parameter. Hence Hamiltonian Monte Carlo (HMC) [19] can be used to efficiently sample from this conditional
posterior.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3 Complementary Metropolis-Hastings Jump</title>
        <p>
          When sampling c and , a convergence issue may occur. Intuitively, the joint posterior distribution
p(c; ; ; jx; y) should be bimodal, since the model specification of c and is highly symmetric except for the
condition that nH &gt; nM, as reflected in the prior of c, and the distribution of independent variables xH and xM.
Thus, switching the labels of all data points (labelling all honest data points as M and all malicious data points
as H) will result in an inferior mode in the posterior distribution (besides the major mode we are interested in
where all data points are correctly classified). This mode has much less probability mass as compared to the
major mode, and since its violation of the condition nH &gt; nM, it will eventually be rejected when producing
posterior predictive inference. Nonetheless, its existence will make the MCMC sampler algorithm highly
sensitive to initial state. Due to the nature of Gibbs samplers, c is updated one component at a time. The slow
movement of the Gibbs sampler in the state space will make it extremely unlikely to jump from a state close
to inferior mode to a state close to the major mode, due to low probability mass between two modes. Hence,
apart from the Gibbs sampler for c and HMC sampler for mentioned above, an auxiliary Metropolis-Hastings
sampler is introduced to facilitate the movement between two modes. The Metropolis-Hastings proposal
distribution will propose a switch of labels as well as their associated GP hyperparameters . Below shows the
proposal distribution q(c jc) and q( j ) of the Metropolis-Hastings jump.
(
          <xref ref-type="bibr" rid="ref9">9</xref>
          )
(
          <xref ref-type="bibr" rid="ref10">10</xref>
          )
(
          <xref ref-type="bibr" rid="ref11">11</xref>
          )
Here, pc is set to be close to 1 and 2 , N2 and W2 are set close to 0 so that the Metropolis-Hastings jump roughly
        </p>
        <p>F
switches labels of all data points along with their corresponding hyperparameters. The update is accepted with
probability shown below:
p(accept) = min
p(c ; ; ; jx; y)q(cjc )q( j )
p(c; ; ; jx; y)q(c jc)q( j )
q( j ) is equal to 1 since the part of the Metropolis-Hastings update related to is symmetrical. Therefore,
q( j )
the sampler for c and becomes a mixture of two MCMC samplers. One is the Gibbs sampler for c and HMC
sampler for , and the other is a Metropolis-Hastings sampler that jointly updates c and . At each iteration
of the MCMC sampler algorithm, with probability pMH, the Metropolis-Hastings sampler is selected. Since the
sole purpose of this sampler is to get the chain out of the inferior mode, pMH should be set close to 0. Now
when the chain walks into states close to the inferior mode, the Metropolis-Hastings sampler will take the chain
to states close to the major mode. When the chain is already close to the major mode, the Metropolis-Hastings
jumps will almost never be accepted.
To sample , we generate a random sample from the conditional posterior of each component, i.e. from
p( xjxM; c; x) and p( xjxM; c; x), which are analytically tractable.</p>
        <p>For x,
where Z = P (xi x)(xi</p>
        <p>fi:ci=Mg
and can be directly generated.</p>
        <p>x)T : This conditional posterior distribution is an inverse-Wishart distribution
1
C
1
x C
x</p>
        <p>N
N</p>
        <p>Q
fi:ci=Mg</p>
        <p>Q
fi:ci=Mg</p>
        <p>N ( x; xi; x) Qd 1</p>
        <p>m=1 (um lm)
N ( x; xi; x) Qd 1</p>
        <p>m=1 (um lm) d x
x; xM; nMx
x; xM; nMx d x</p>
        <p>;
p( xjxM; c; x) = R
=</p>
        <p>R</p>
        <p>x
= W</p>
        <p>p(xMjc; x; x)p( x)
x p(xMjc; x; x)p( x)d x</p>
        <p>!
where C = fi:cQi=Mg R x N ( x; xi; x)d x is the product of all the normalizing constants to truncate the
Gaussian distribution. Since C appears in both the denominator and the numerator, it cancels out. Note that this
conditional posterior distribution is a truncated Gaussian distribution and can be efficiently generated.</p>
        <p>
          For x, we have that
(
          <xref ref-type="bibr" rid="ref12">12</xref>
          )
(
          <xref ref-type="bibr" rid="ref13">13</xref>
          )
(
          <xref ref-type="bibr" rid="ref14">14</xref>
          )
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Algorithm 1: MCMC Sampler</title>
        <p>Input: S, x, y and parameters of priors</p>
        <p>
          Output: S samples of posterior distribution (c(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )),
1 Initialize , and by randomly drawing from their respective priors.
2 Initialize c.
3 for r = 1 S do
4 Randomly sample u1 U (0; 1).
5 if u1 &lt; pMH (Probability of proposing a Metropolis-Hastings jump) then
6 Propose a jump to c ; by q(c jc) and q( j ).
7 Compute acceptance probability p(accept).
8 Randomly sample u2 U (0; 1).
9 if u2 &lt; p(accept) then
10 c c .
11 .
        </p>
        <p>Sample x from p( xjx; y; c; ; x; ).</p>
        <p>Sample x from p( xjx; y; c; ; x; ).</p>
        <p>Sample from p( jx; y; c; ; ).</p>
        <p>
          Save current values of c, , , in (c(r), (r), (r), (r)).
19 return f(c(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )),
        </p>
        <p>, (c(S), (S), (S), (S))g.
4.5</p>
      </sec>
      <sec id="sec-3-5">
        <title>Sampling</title>
        <p>Finally, to sample
tion.
, (c(S), (S), (S), (S))
we generate a random sample from its conditional posterior, which is also a beta
distribup( jx; y; c; ; ) = p( jc)
=</p>
        <p>p(cj )p( )</p>
        <p>
          R01 p(cj )p( )d
=Beta ( ;
+ nH;
+ n
nH) :
(
          <xref ref-type="bibr" rid="ref15">15</xref>
          )
5
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The Algorithms</title>
      <p>Our methods consists of two algorithms. The first, presented in Algorithm 1, is a MCMC sampler used to
generate samples of model parameters from the posterior distribution p(c; ; ; jx; y). The second algorithm,
presented in Algorithm 2, is used to summarize the posterior predictive distribution by calculating the posterior
mean and variance. Details of both algorithms are shown. S is set sufficiently large to allow the Markov chain
to approximately converge to the target posterior distribution. To assess convergence, we used the method
introduced by Gelman et al. in [18].</p>
      <p>Apart from mean and variance of the posterior predictive distribution, we also obtain the trustworthiness
of each worker (each data point) in the form of the posterior probability of each data point being honest, or
alternatively, p(cijx; y) for i = 1 n. This information can be used to further model the behavior of each
worker in crowdsourcing applications. However, this is not the main focus of this paper and will not be
further discussed. Derivation of posterior probability p(cijx; y) is also straightforward and thus omitted from the
algorithm specification.</p>
      <p>Assuming d n, the computational complexity of Algorithm 1 is limited by the HMC step to sample , in
which L leapfrog steps are taken, each requires an inverse operation that has O(n3) complexity. This results
in Algorithm 1 having an overall complexity of O(SLn3). The computational complexity of Algorithm 2 is
O(S0(n2 + n n2)), where n is the length of x .</p>
      <sec id="sec-4-1">
        <title>Algorithm 2: Posterior Predictive Inference</title>
        <p>
          Input: S samples of posterior distribution (c(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )), , (c(S), (S), (S), (S)), x
Output: fH(x ), fH(x )
1 Discard the first half of samples as warm-up. The number of samples used for approximating posterior
predictive distribution is S0.
2 r0 1.
3 for r = 1 S0 do
4
        </p>
        <p>To demonstrate that the algorithms are able to make accurate inference from data, we performed experiments
on one-dimensional synthetic dataset, where n = 100 and d = 1, generated in accordance to our assumptions.</p>
        <p>In the synthetic dataset, the underlying function is fH(x) = sin 2 x and the malicious function is fM(x) =
21 tan 4 x. 60 noisy observations are sampled from fH, with homogeneous Gaussian noise with standard
deviation 0:10. The distribution of independent variables is uniform in [0; 1]. 40 noisy observations are sampled from
fM, with homogeneous Gaussian noise with standard deviation 0:12. The distribution of independent variables
is Gaussian with mean 0:80 and standard deviation 0:10, truncated to range [0; 1].</p>
        <p>The choices of priors in terms of their parameters are given by:
f = 2:0; f = 2:0;
n = 2:0; n = 1:0;
w = 0:7; w2 = 0:3;</p>
        <p>For the Metropolis-Hastings sampler, the probability of proposing a jump is pMH = 0:05. Parameters of the
proposal distribution is shown below.</p>
        <p>pc = 0:95; F2 = 1
10 4; N2 = 1
10 4; W2 = 1</p>
        <p>For the sampler of , hyperparameters related to H and M are sampled separately, each using an HMC
algorithm with L = 10 (leapfrog steps).</p>
        <p>Finally, for this particular dataset, we set S = 2000 since the Markov chain has converged sufficiently after
2000 iterations of the MCMC sampler. Then we pass the second half of the resulting samples to make posterior
predictive inference. The results are shown in Fig. 1. Using the posterior mean as the estimator results in a
Mean Square Error (MSE) of 0.0012. The posterior probability p(cijx; y) (trustworthiness) of each data point is
also computed. In this example, the average trustworthiness of truly honest data points is 0:9906, while the
average trustworthiness of truly malicious data points is 5 10 5. It is thus clear that for this set of data, the
accuracy of the proposed algorithm is quite high.</p>
        <p>As for the efficiency of the algorithm, we measure the average time cost by running the algorithm 50 times
using the above dataset on a MacBook Pro equipped with Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz. The
resulted average time cost is 27:43 seconds per run.
In this paper, we build a Gaussian Process mixture model and design a MCMC-based algorithm to address
the issue of collusion attacks in the regression setting. Honest data from observing underlying function and
malicious data are modelled as two separate Gaussian Processes. Compared to [13], where malicious data are
assumed to be observed from the same underlying function with extra noise, we claim that the mixture model
is a more realistic and natural choice. We use synthetic dataset to show that our algorithm is able to produce
accurate posterior predictive inference and is computationally efficient.</p>
        <p>As part of an ongoing study, the results of this paper suggest several directions for future studies. Firstly,
a single malicious function (class) is too restrictive to handle more sophisticated real-world scenarios and the
algorithm currently lacks the ability to handle cases with multiple malicious groups. It is therefore reasonable
to extend this model to incorporate multiple malicious classes. One way of modelling is to build a finite
mixture of Gaussian Processes model with unknown number of components, and design a MCMC sampler using
reversible jump techniques [20]. Another choice is to build an infinite mixture of Gaussian Processes model
similar to [21]. Secondly, this model is unable to handle heteroscedasticity as mentioned before, as well as
nonGaussian noises. We will explore approaches to handle these situations in future studies. Lastly, our model is
not build specifically to model trustworthiness of workers, which is a significant issue in crowdsourcing
applications. In our future work we will improve the model to make robust, collusion-resistant predictions while
maintaining trustworthiness of workers for future predictions.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Howe</surname>
          </string-name>
          , Jeff. ”
          <article-title>Crowdsourcing: A definition.” Crowdsourcing: Tracking the rise of the amateur (</article-title>
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>To</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          <article-title>and</article-title>
          <string-name>
            <surname>Shahabi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <year>2015</year>
          ,
          <string-name>
            <surname>October.</surname>
          </string-name>
          <article-title>Effectively crowdsourcing the acquisition and analysis of visual data for disaster response</article-title>
          .
          <source>In Big Data (Big Data)</source>
          ,
          <year>2015</year>
          IEEE International Conference on (pp.
          <fpage>697</fpage>
          -
          <lpage>706</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Meier</surname>
          </string-name>
          , Fred, Daniel Fenner, Tom Grassmann, Britta Jnicke, Marco Otto, and Dieter Scherer. ”
          <article-title>Challenges and benefits from crowd-sourced atmospheric data for urban climate research using Berlin, Germany, as testbed</article-title>
          .”
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Krause</surname>
            , Andreas, Eric Horvitz, Aman Kansal, and
            <given-names>Feng Zhao.</given-names>
          </string-name>
          ”
          <article-title>Toward community sensing</article-title>
          .”
          <source>In Proceedings of the 7th international conference on Information processing in sensor networks</source>
          , pp.
          <fpage>481</fpage>
          -
          <lpage>492</lpage>
          . IEEE Computer Society,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kim</surname>
          </string-name>
          , Samuel, Maurizio Filippone, Fabio Valente, and Alessandro Vinciarelli.
          <article-title>”Predicting the conflict level in television political debates: an approach based on crowdsourcing, nonverbal communication and gaussian processes</article-title>
          .”
          <source>In Proceedings of the 20th ACM international conference on Multimedia</source>
          , pp.
          <fpage>793</fpage>
          -
          <lpage>796</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Shahabi</surname>
          </string-name>
          , Cyrus. ”
          <article-title>Towards a generic framework for trustworthy spatial crowdsourcing</article-title>
          .”
          <source>In Proceedings of the 12th International ACM Workshop on Data Engineering for Wireless and Mobile Acess</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Welinder</surname>
          </string-name>
          , Peter, and Pietro Perona. ”
          <article-title>Online crowdsourcing: rating annotators and obtaining cost-effective labels</article-title>
          .
          <source>”</source>
          (
          <year>2010</year>
          ):
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Hall</surname>
            , David L.,
            <given-names>and John M.</given-names>
          </string-name>
          <string-name>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Human-centered information fusion</article-title>
          .
          <source>Artech House</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Groot</surname>
            , Perry,
            <given-names>Adriana</given-names>
          </string-name>
          <string-name>
            <surname>Birlutiu</surname>
          </string-name>
          , and Tom Heskes.
          <article-title>”Learning from multiple annotators with Gaussian processes</article-title>
          .
          <source>” In Artificial Neural Networks and Machine LearningICANN</source>
          <year>2011</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>164</lpage>
          . Springer Berlin Heidelberg,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Tarasov</surname>
          </string-name>
          , Alexey, Sarah Jane Delany, and Brian Mac Namee. ”
          <article-title>Dynamic estimation of worker reliability in crowdsourcing for regression tasks: Making it work</article-title>
          .
          <source>” Expert Systems with Applications</source>
          <volume>41</volume>
          , no.
          <volume>14</volume>
          (
          <year>2014</year>
          ):
          <fpage>6190</fpage>
          -
          <lpage>6210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Reece</surname>
          </string-name>
          , Steven, Stephen Roberts, Christopher Claxton, and David Nicholson.
          <article-title>”Multi-sensor fault recovery in the presence of known and unknown fault types</article-title>
          .” In Information Fusion,
          <year>2009</year>
          . FUSION'
          <volume>09</volume>
          . 12th International Conference on, pp.
          <fpage>1695</fpage>
          -
          <lpage>1703</lpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Venanzi</surname>
            , Matteo,
            <given-names>Alex</given-names>
          </string-name>
          <string-name>
            <surname>Rogers</surname>
          </string-name>
          , and
          <string-name>
            <surname>Nicholas</surname>
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Jennings</surname>
          </string-name>
          . ”
          <article-title>Trust-based fusion of untrustworthy information in crowdsourcing applications</article-title>
          .”
          <source>In Proceedings of the 2013 international conference on autonomous agents and multi-agent systems</source>
          , pp.
          <fpage>829</fpage>
          -
          <lpage>836</lpage>
          . International Foundation for Autonomous Agents and
          <string-name>
            <given-names>Multiagent</given-names>
            <surname>Systems</surname>
          </string-name>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Venanzi</surname>
            , Matteo,
            <given-names>Alex</given-names>
          </string-name>
          <string-name>
            <surname>Rogers</surname>
          </string-name>
          , and
          <string-name>
            <surname>Nicholas</surname>
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Jennings</surname>
          </string-name>
          . ”
          <article-title>Crowdsourcing spatial phenomena using trustbased heteroskedastic gaussian processes</article-title>
          .”
          <source>In First AAAI Conference on Human Computation and Crowdsourcing</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Rasmussen</surname>
            ,
            <given-names>Carl</given-names>
          </string-name>
          <string-name>
            <surname>Edward</surname>
          </string-name>
          . ”
          <article-title>Gaussian processes for machine learning</article-title>
          .
          <source>”</source>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Paul W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christopher KI Williams</surname>
          </string-name>
          , and
          <string-name>
            <surname>Christopher</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bishop</surname>
          </string-name>
          . ”
          <article-title>Regression with inputdependent noise: A Gaussian process treatment</article-title>
          .
          <source>” Advances in neural information processing systems</source>
          <volume>10</volume>
          (
          <year>1997</year>
          ):
          <fpage>493</fpage>
          -
          <lpage>499</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Titsias</surname>
            ,
            <given-names>Michalis K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Miguel</surname>
          </string-name>
          Lzaro-gredilla. ”
          <article-title>Variational heteroscedastic Gaussian process regression</article-title>
          .”
          <source>In Proceedings of the 28th International Conference on Machine Learning (ICML-11)</source>
          , pp.
          <fpage>841</fpage>
          -
          <lpage>848</lpage>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Kersting</surname>
          </string-name>
          , Kristian, Christian Plagemann, Patrick Pfaff, and Wolfram Burgard. ”
          <article-title>Most likely heteroscedastic Gaussian process regression</article-title>
          .”
          <source>In Proceedings of the 24th international conference on Machine learning</source>
          , pp.
          <fpage>393</fpage>
          -
          <lpage>400</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Gelman</surname>
            ,
            <given-names>Andrew</given-names>
          </string-name>
          , John B. Carlin, Hal S. Stern, and
          <string-name>
            <surname>Donald</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rubin</surname>
          </string-name>
          .
          <article-title>Bayesian data analysis</article-title>
          .
          <source>Vol</source>
          .
          <volume>2</volume>
          .
          <string-name>
            <surname>Boca</surname>
            <given-names>Raton</given-names>
          </string-name>
          , FL, USA: Chapman &amp; Hall/CRC,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Duane</surname>
            , Simon,
            <given-names>Anthony D.</given-names>
          </string-name>
          <string-name>
            <surname>Kennedy</surname>
            ,
            <given-names>Brian J.</given-names>
          </string-name>
          <string-name>
            <surname>Pendleton</surname>
          </string-name>
          , and Duncan Roweth. ”
          <article-title>Hybrid monte carlo</article-title>
          .
          <source>” Physics letters B</source>
          <volume>195</volume>
          , no.
          <issue>2</issue>
          (
          <year>1987</year>
          ):
          <fpage>216</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>Peter J</given-names>
          </string-name>
          . ”
          <article-title>Reversible jump Markov chain Monte Carlo computation and Bayesian model determination</article-title>
          .
          <source>” Biometrika</source>
          <volume>82</volume>
          , no.
          <issue>4</issue>
          (
          <year>1995</year>
          ):
          <fpage>711</fpage>
          -
          <lpage>732</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Rasmussen</surname>
            ,
            <given-names>Carl</given-names>
          </string-name>
          <string-name>
            <surname>Edward</surname>
          </string-name>
          , and Zoubin Ghahramani. ”
          <article-title>Infinite mixtures of Gaussian process experts</article-title>
          .
          <source>” Advances in neural information processing systems</source>
          <volume>2</volume>
          (
          <year>2002</year>
          ):
          <fpage>881</fpage>
          -
          <lpage>888</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>