<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>To Aggregate or Not? Learning with Separate Noisy Labels</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiaheng Wei</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhaowei Zhu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tianyi Luo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ehsan Amid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhishek Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff2">
          <label>2</label>
          <institution>Amazon Search Science</institution>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Google Research</institution>
          ,
          <addr-line>Brain Team</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>Santa Cruz</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Raw training data often come with separate noisy labels collected from multiple imperfect annotators (e.g., via crowdsourcing). A typical way of using these separate labels is to first aggregate them into one and then apply standard training methods. The literature has also extensively studied effective aggregation approaches. This paper revisits this choice and aims to answer the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. We theoretically analyze the performance of both approaches under the empirical risk minimization framework for a number of popular loss functions, including ones designed specifically for the problem of learning with noisy labels. Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high or the number of labelers/annotations is insufficient. Extensive empirical results validate our conclusions.</p>
      </abstract>
      <kwd-group>
        <kwd>Crowdsourcing</kwd>
        <kwd>Label Aggregation</kwd>
        <kwd>Label Noise</kwd>
        <kwd>Human Annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The most popular approach to learning from multiple separate labels is to aggregate
the given labels for each instance [8, 9, 10, 11, 12] through an Expectation-Maximization (EM)
inference technique. Each instance is then assigned a single label, and the standard training
procedure is applied.</p>
      <p>The primary goal of this paper is to revisit the choice of aggregating separate labels and
to give practitioners an understanding of the following question:</p>
      <p>Should the learner aggregate separate noisy labels for one instance into a single
label or not?</p>
      <p>Our main contributions can be summarized as follows:
∙ We provide theoretical insights on how separation methods and aggregation methods result in
different biases (Theorems 3.4, 4.2, 4.6) and variances (Theorems 3.6, 4.3, 4.7) of the output
classifier from training. Our analysis considers both the standard loss functions in use and
popular robust losses that are designed for the problem of learning with noisy labels.
∙ By comparing the analytical proxies of the worst-case performance bounds, our theoretical
results reveal that separating multiple noisy labels is preferred over label aggregation when
the noise rates are high or the number of labelers/annotations is insufficient. The results are
consistent for both the basic loss function ℓ and robust designs, including loss correction and
peer loss.
∙ We carry out extensive experiments using both synthetic and real-world datasets to validate
our theoretical findings.</p>
      <sec id="sec-1-1">
        <title>1.1. Related Works</title>
        <p>Label separation vs. label aggregation. Existing works mainly compare separation with
aggregation through empirical results. For example, it has been shown that label separation can
be effective in improving model performance and may be preferable to labels aggregated
through majority voting [13]. When training with the cross-entropy loss,
Sheng et al. [14] observe that label separation reduces the bias and roughness, and outperforms
majority-voting aggregated labels. However, it is unclear whether these results hold when
robust treatments are employed. Similar problems have also been studied in corrupted label
detection, with a result that leans towards separation but is not proved [15]. Another line of work
concentrates on end-to-end training schemes or ensemble methods that take all the separate
noisy labels as input during training [16, 17, 18, 19, 20], learning from the
separate noisy labels directly.</p>
        <p>Learning with noisy labels. Popular approaches to learning with noisy labels can be
broadly divided into the following categories: (i) adjusting the loss on noisy labels by using
knowledge of the noise label transition matrix [21, 22, 23, 24, 25, 26, 27, 28, 29], re-weighting the
per-sample loss by down-weighting instances with potentially wrong labels [30, 31, 32, 33, 34],
or refurbishing the noisy labels [35, 36, 37]; (ii) robust loss designs that do not require
knowledge of the noise transition matrix [38, 39, 40, 41, 42, 43, 44, 45]; (iii) regularization techniques
that prevent deep neural networks from memorizing noisy labels [46, 47, 48, 49, 50, 51]; (iv)
dynamic sample selection procedures that behave in a semi-supervised manner: they begin
with a clean sample selection procedure and then make use of the wrongly labeled samples
[52, 53, 54, 55, 56]. For example, several methods [57, 58, 59] adopt a mentor/peer network to
select small-loss samples as “clean” ones for the student/peer network. See [60, 61] for a more
detailed survey of existing noise-robust techniques.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Formulation</title>
      <p>Consider an $M$-class classification task and let $X \in \mathcal{X}$ and $Y \in \mathcal{Y} := \{1, 2, ..., M\}$ denote
the input examples and their corresponding labels, respectively. We assume that $(X, Y) \sim \mathcal{D}$,
where $\mathcal{D}$ is the joint data distribution. Samples $(x, y)$ are generated according to the random
variables $(X, Y)$. In the clean and ideal scenario, the learner has access to $N$ training data
points $D := \{(x_n, y_n)\}_{n \in [N]}$. Instead of having access to the ground-truth labels $y_n$, we only have
access to a set of noisy labels $\{\tilde{y}_{n,i}\}_{i \in [K]}$ for $n \in [N]$. For ease of presentation, we adopt the
decorator $\circ$ to denote separate labels and $\bullet$ for aggregated labels, specified later. Noisy labels $\tilde{y}_n$
are generated according to the random variable $\tilde{Y}$. We consider the class-dependent label noise
transition [30, 21], where $\tilde{Y}$ is generated according to a transition matrix $T$ with entries
$T_{i,j} := \mathbb{P}(\tilde{Y} = j \mid Y = i)$.</p>
      <p>Most of the existing results on learning with noisy labels have considered the setting where
each $x_n$ is paired with only one noisy label $\tilde{y}_n$. In practice, we often operate in a setting where
each data point $x_n$ is associated with multiple separate labels drawn from the same noisy label
generation process [62, 63]. We consider this setting and assume that for each $x_n$, there are $K$
independent noisy labels $\tilde{y}_{n,1}, ..., \tilde{y}_{n,K}$ obtained from $K$ annotators [64].</p>
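      <p>The generation process above can be illustrated with a minimal sketch, under assumptions that are not part of the paper: classes are indexed from 0 rather than 1, the helper names <monospace>sample_noisy_labels</monospace> and <monospace>symmetric_T</monospace> are hypothetical, and the symmetric transition matrix is only one convenient choice of $T$.</p>
      <preformat>
```python
import random

def sample_noisy_labels(y, T, K, rng):
    """Draw K independent noisy labels for a sample whose clean class is y.

    T[i][j] is read as P(noisy = j | clean = i), so row y of T is the
    categorical distribution that every annotator samples from.
    """
    classes = list(range(len(T)))
    return rng.choices(classes, weights=T[y], k=K)

def symmetric_T(M, eps):
    """Hypothetical helper: keep the label w.p. 1 - eps, else flip uniformly."""
    off = eps / (M - 1)
    return [[1.0 - eps if i == j else off for j in range(M)] for i in range(M)]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
T = symmetric_T(M=3, eps=0.3)
clean = [0, 1, 2, 1]            # illustrative clean labels y_n
noisy = [sample_noisy_labels(y, T, K=5, rng=rng) for y in clean]
for y, labels in zip(clean, noisy):
    print(y, labels)
```
      </preformat>
      <p>Each printed row pairs one clean label with its $K = 5$ separate noisy labels; these rows are the raw material that either stays separate or is aggregated in the two settings discussed next.</p>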
      <p>We are interested in two popular ways to leverage multiple separate noisy labels:
∙ Keep the separate labels as separate ones and apply standard learning with noisy labels
techniques to each of them.
∙ Aggregate noisy labels into one label, and then apply standard learning with noisy data
techniques.</p>
      <p>We will look into each of the above two settings separately and then answer the question:
“Should the learner aggregate multiple separate noisy labels or not?”</p>
      <sec id="sec-2-1">
        <title>2.1. Label Separation</title>
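        <p>Denote the column vector $P_{\tilde{Y}} := [\mathbb{P}(\tilde{Y} = 1), \cdots, \mathbb{P}(\tilde{Y} = M)]^{\top}$ as the marginal distribution
of $\tilde{Y}$. Accordingly, we can define $P_{Y}$ for $Y$. Clearly, we have the relation: $P_{\tilde{Y}} = T^{\top} \cdot P_{Y}$, $P_{Y} = (T^{\top})^{-1} \cdot P_{\tilde{Y}}$.
Denote by $\rho_1 := \mathbb{P}(\tilde{Y} = 0 \mid Y = 1)$, $\rho_0 := \mathbb{P}(\tilde{Y} = 1 \mid Y = 0)$. The noise transition
matrix $T$ has the following form when $M = 2$:
<disp-formula><tex-math>T = \begin{bmatrix} 1 - \rho_0 &amp; \rho_0 \\ \rho_1 &amp; 1 - \rho_1 \end{bmatrix}.</tex-math></disp-formula></p>
        <p>For label separation, we define the per-sample loss function as:
<disp-formula><tex-math>\ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K}) = \frac{1}{K} \sum_{i \in [K]} \ell(f(x_n), \tilde{y}_{n,i}).</tex-math></disp-formula></p>
        <p>For simplicity, we shorthand $\ell(f(x_n), \tilde{y}^{\circ}_n) := \ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K})$ for the loss of the label
separation method when there is no confusion.</p>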
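        <p>As a rough illustration of the per-sample separation loss, the sketch below averages a base loss over the $K$ separate noisy labels of one instance. The cross-entropy base loss, the probability vector, and the function names are assumptions made for this example only.</p>
        <preformat>
```python
import math

def cross_entropy(probs, label):
    """Base loss: negative log-probability assigned to `label`."""
    return -math.log(probs[label])

def separation_loss(probs, noisy_labels, base_loss=cross_entropy):
    """Average the base loss over the K separate noisy labels of one instance."""
    K = len(noisy_labels)
    return sum(base_loss(probs, y) for y in noisy_labels) / K

probs = [0.7, 0.2, 0.1]   # hypothetical model output f(x_n) over 3 classes
labels = [0, 0, 1]        # K = 3 separate noisy labels for x_n
print(separation_loss(probs, labels))
```
        </preformat>
        <p>With cross-entropy as the base loss, this average coincides with the cross-entropy against the empirical label distribution of the $K$ annotations, which is one way to read the separation method as soft-label training.</p>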
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Label Aggregation</title>
        <p>The other way to leverage multiple separate noisy labels is to generate a single label from the
$K$ noisy ones via a label aggregation method:</p>
        <p><disp-formula><tex-math>\tilde{y}^{\bullet}_n := \text{Aggregation}(\tilde{y}_{n,1}, \tilde{y}_{n,2}, ..., \tilde{y}_{n,K}),</tex-math></disp-formula>
where the aggregated noisy labels $\tilde{y}^{\bullet}_n$ are generated according to the random variable $\tilde{Y}^{\bullet}$.
Denote the confusion matrix for this single, aggregated noisy label as $T^{\bullet}$. Popular aggregation
methods include majority vote and EM inference, which are covered by our theoretical insights
since our analyses in later sections are built on a general label aggregation method. For
a better understanding, we introduce the majority vote as an example.</p>
        <p>An Example of Majority Vote. Given the majority-voted label, we can compute the
transition matrix between $\tilde{Y}^{\bullet}$ and the true label $Y$ using the knowledge of $T$. The lemma below
gives the closed form of $T^{\bullet}$ in terms of $T$ when adopting majority vote.</p>
        <p>Lemma 2.1. Assume $K$ is odd and recall that in the binary classification task, $T_{i,j} = \mathbb{P}(\tilde{Y} = j \mid Y = i)$.
The noise transition matrix of the (majority voting) aggregated noisy labels, $T^{\bullet}_{p,q}$, becomes:
<disp-formula><tex-math>T^{\bullet}_{p,q} = \sum_{i=0}^{\frac{K+1}{2} - 1} \binom{K}{i} (T_{p,q})^{K-i} (T_{p,1-q})^{i}, \quad p, q \in \{0, 1\}.</tex-math></disp-formula></p>
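        <p>The closed form in Lemma 2.1 can be evaluated directly. The following is a minimal sketch, assuming binary labels in {0, 1} and hypothetical noise rates; the function name <monospace>majority_vote_T</monospace> is introduced here for illustration.</p>
        <preformat>
```python
from math import comb

def majority_vote_T(T, K):
    """Binary transition matrix of the majority-voted label (K odd).

    Evaluates T_agg[p][q] = sum_{i=0}^{(K+1)/2 - 1} C(K, i)
    * T[p][q]**(K - i) * T[p][1 - q]**i, i.e. the probability that at
    most (K - 1)/2 of the K independent labels differ from q.
    """
    if K % 2 == 0:
        raise ValueError("K must be odd so that the vote has no ties")
    half = (K + 1) // 2
    return [[sum(comb(K, i) * T[p][q] ** (K - i) * T[p][1 - q] ** i
                 for i in range(half))
             for q in (0, 1)]
            for p in (0, 1)]

rho0, rho1 = 0.3, 0.2                       # hypothetical noise rates
T = [[1 - rho0, rho0], [rho1, 1 - rho1]]
for K in (1, 3, 5, 9):
    T_agg = majority_vote_T(T, K)
    # Off-diagonal entries are the aggregated noise rates.
    print(K, round(T_agg[0][1], 4), round(T_agg[1][0], 4))
```
        </preformat>
        <p>With both noise rates below 0.5, the off-diagonal entries shrink as $K$ grows, which is the denoising effect of aggregation that the later bias and variance comparisons weigh against the smaller number of training labels.</p>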
        <p>When $K = 3$, $T^{\bullet}_{1,0} = \mathbb{P}(\tilde{Y}^{\bullet} = 0 \mid Y = 1) = (T_{1,0})^3 + \binom{3}{1} (T_{1,0})^2 (T_{1,1})$. Note that it still
holds that $T^{\bullet}_{p,q} + T^{\bullet}_{p,1-q} = 1$. For the aggregation method, as illustrated in Figure 1, the x-axis
indicates the number of labelers $K$, and the y-axis denotes the aggregated noise rate given that
the overall noise rate is in [0.2, 0.4, 0.6, 0.8]. When the number of labelers is large (i.e., $K &gt; 10$)
and the noise rate is small, both majority vote and EM label aggregation methods significantly
reduce the noise rate. Although the expectation-maximization method takes much more
time to generate the aggregated label, it frequently results in a lower aggregated noise
rate than majority vote.</p>
        <p>3. Bias and Variance Analyses w.r.t. ℓ-loss</p>
        <p>In this section, we provide theoretical insights on how label separation and aggregation methods
result in different biases and variances of the classifier prediction when learning with the
standard loss function ℓ.</p>
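        <p>Suppose the clean training samples $\{(x_n, y_n)\}_{n \in [N]}$ are given by variables $(X, Y)$ such
that $(X, Y) \sim \mathcal{D}$. Recall that instead of having access to a set of clean training samples
$D = \{(x_n, y_n)\}_{n \in [N]}$, the learner only observes $K$ noisy labels $\tilde{y}_{n,1}, ..., \tilde{y}_{n,K}$ for each $x_n$,
denoted by $\tilde{D}^{\circ} := \{(x_n, \tilde{y}_{n,1}, ..., \tilde{y}_{n,K})\}_{n \in [N]}$. For separation methods, the noisy training
samples are obtained through variables $(X, \tilde{Y}_1), ..., (X, \tilde{Y}_K)$ where $(X, \tilde{Y}_i) \sim \tilde{\mathcal{D}}^{\circ}$ for $i \in [K]$.
For aggregation methods such as majority vote, we assume the data points and aggregated noisy
labels $\tilde{D}^{\bullet} := \{(x_n, \tilde{y}^{\bullet}_n)\}_{n \in [N]}$ are drawn from $(X, \tilde{Y}^{\bullet}) \sim \tilde{\mathcal{D}}^{\bullet}$, where $\tilde{Y}^{\bullet}$ is produced through
the majority voting of $\tilde{Y}_1, ..., \tilde{Y}_K$. When we mention "noise rate", we usually refer to the average
noise: $\mathbb{P}(\tilde{Y}^{u} \neq Y)$.</p>
        <p>ℓ-risk under the distribution. Given the loss ℓ, note that $\ell(f(x_n), \tilde{y}^{\circ}_n)$ denotes
$\ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K}) = \frac{1}{K} \sum_{i \in [K]} \ell(f(x_n), \tilde{y}_{n,i})$. We define the empirical ℓ-risk for
learning with separated/aggregated noisy labels as $\hat{R}_{\ell, \tilde{D}^{u}}(f) = \frac{1}{N} \sum_{n=1}^{N} \ell(f(x_n), \tilde{y}^{u}_n)$,
where $u \in \{\circ, \bullet\}$ unifies the treatment, which is either separation $\circ$ or aggregation $\bullet$.</p>
        <p>By increasing the sample size $N$, we would expect $\hat{R}_{\ell, \tilde{D}^{u}}(f)$ to be close to the following
ℓ-risk under the noisy distribution $\tilde{\mathcal{D}}^{u}$: $R_{\ell, \tilde{\mathcal{D}}^{u}}(f) = \mathbb{E}_{(X, \tilde{Y}^{u}) \sim \tilde{\mathcal{D}}^{u}}[\ell(f(X), \tilde{Y}^{u})]$.</p>
        <p>3.1. Bias of a Given Classifier w.r.t. ℓ-Loss</p>
        <p>We denote by $f^{*} \in \mathcal{F}$ the optimal classifier obtained through the clean data distribution
$(X, Y) \sim \mathcal{D}$ within the hypothesis space $\mathcal{F}$. We formally define the bias of a given classifier
$\hat{f}$ as:</p>
        <p>Definition 3.1 (Classifier Prediction Bias of ℓ-Loss). Denote by $R_{\ell, \mathcal{D}}(\hat{f}) := \mathbb{E}[\ell(\hat{f}(X), Y)]$,
$R_{\ell, \mathcal{D}}(f^{*}) := \mathbb{E}[\ell(f^{*}(X), Y)]$. The bias of classifier $\hat{f}$ writes as: $\text{Bias}(\hat{f}) = R_{\ell, \mathcal{D}}(\hat{f}) - R_{\ell, \mathcal{D}}(f^{*})$.</p>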
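        <p>The Bias term quantifies the prediction bias (excess risk) of a given classifier $\hat{f}$ on the clean
data distribution $\mathcal{D}$ w.r.t. the optimal achievable classifier $f^{*}$, which can be decomposed as [65]
<disp-formula><tex-math>\text{Bias}(\hat{f}^{u}) = \underbrace{R_{\ell, \mathcal{D}}(\hat{f}^{u}) - R_{\ell, \tilde{\mathcal{D}}^{u}}(\hat{f}^{u})}_{\text{Distribution shift}} + \underbrace{R_{\ell, \tilde{\mathcal{D}}^{u}}(\hat{f}^{u}) - R_{\ell, \mathcal{D}}(f^{*})}_{\text{Estimation error}}. \tag{1}</tex-math></disp-formula>
Now we bound the distribution shift and the estimation error in the following two lemmas.</p>
        <p>Lemma 3.2 (Distribution shift). Denote by $p_i := \mathbb{P}(Y = i)$, and assume ℓ is upper bounded by $\bar{\ell}$ and
lower bounded by $\underline{\ell}$. The distribution shift in Eqn. (<xref ref-type="bibr" rid="ref1">1</xref>) is upper bounded by
<disp-formula><tex-math>R_{\ell, \mathcal{D}}(\hat{f}^{u}) - R_{\ell, \tilde{\mathcal{D}}^{u}}(\hat{f}^{u}) \le \Delta^{u}_{1} := (\rho^{u}_0 p_0 + \rho^{u}_1 p_1) \cdot (\bar{\ell} - \underline{\ell}). \tag{2}</tex-math></disp-formula></p>
        <p>Lemma 3.3 (Estimation error). Suppose the loss function $\ell(f(x), y)$ is $L$-Lipschitz for any feasible
$y$. $\forall f \in \mathcal{F}$, with probability at least $1 - \delta$, the estimation error is upper bounded by
<disp-formula><tex-math>R_{\ell, \tilde{\mathcal{D}}^{u}}(\hat{f}^{u}) - R_{\ell, \mathcal{D}}(f^{*}) \le \Delta^{u}_{2} := 4L \cdot \mathfrak{R}(\mathcal{F}) + (\bar{\ell} - \underline{\ell}) \cdot \sqrt{\frac{2 \log(1/\delta)}{\eta^{u}_K N}},</tex-math></disp-formula>
where $u \in \{\circ, \bullet\}$ denotes either separation or aggregation methods, $\eta^{\circ}_K = \frac{K \cdot \log(1/\delta)}{2 (\log((K+1)/\delta))^2}$ and $\eta^{\bullet}_K \equiv 1$
indicate the richness factor, which characterizes the effect of the number of labelers, and $\mathfrak{R}(\mathcal{F})$ is
the Rademacher complexity of $\mathcal{F}$.</p>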
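        <p>Theorem 3.4. Denote by $\alpha_K := (\rho^{\circ}_0 p_0 + \rho^{\circ}_1 p_1) - (\rho^{\bullet}_0 p_0 + \rho^{\bullet}_1 p_1)$, $\gamma = \sqrt{\log(1/\delta)/2N}$. The
separation bias proxy $\Delta^{\circ}$ is smaller than the aggregation bias proxy $\Delta^{\bullet}$ if and only if
<disp-formula><tex-math>\alpha_K \cdot \frac{1}{1 - (\eta^{\circ}_K)^{-\frac{1}{2}}} \le \gamma. \tag{3}</tex-math></disp-formula></p>
        <p>Note that $\alpha_K$ and $\eta^{\circ}_K$ are non-decreasing w.r.t. the increase of $K$. In Section 4.3, we will
explore how the LHS of Eqn. (<xref ref-type="bibr" rid="ref3">3</xref>) is influenced by $K$: a short answer is that the LHS of Eqn. (<xref ref-type="bibr" rid="ref3">3</xref>) is
(generally) monotonically increasing w.r.t. $K$ when $K$ is small, indicating that Eqn. (<xref ref-type="bibr" rid="ref3">3</xref>) is easier
to achieve given fixed $\delta$, $N$ and a smaller $K$ than a larger one.</p>
        <p>3.2. Variance of a Given Classifier w.r.t. ℓ-Loss</p>
        <p>We now move on to explore the variance of a given classifier when learning with ℓ-loss. Prior
to the discussion, we define the variance of a given classifier as:</p>
        <p>Definition 3.5 (Classifier Prediction Variance of ℓ-Loss). The variance of a given classifier $\hat{f}^{u}$
when learned with separation ($\circ$) or aggregation ($\bullet$) is defined as:
<disp-formula><tex-math>\text{Var}(\hat{f}^{u}) = \mathbb{E}_{(X, \tilde{Y}^{u}) \sim \tilde{\mathcal{D}}^{u}} \left[ \left( \ell(\hat{f}^{u}(X), \tilde{Y}^{u}) - \mathbb{E}_{(X, \tilde{Y}^{u}) \sim \tilde{\mathcal{D}}^{u}} [\ell(\hat{f}^{u}(X), \tilde{Y}^{u})] \right)^2 \right].</tex-math></disp-formula></p>
        <p>For $g(x) = x - x^2$, we derive the closed form of Var and the corresponding upper bound as
below.</p>
        <p>Theorem 3.6. When $\eta^{u}_K N \ge 2 \log(1/\delta)$, given ℓ is the 0-1 loss, we have:
<disp-formula><tex-math>\text{Var}(\hat{f}^{u}) = g(R_{\ell, \tilde{\mathcal{D}}^{u}}(\hat{f}^{u})) \le g\left( \sqrt{\frac{2 \log(1/\delta)}{\eta^{u}_K N}} \right).</tex-math></disp-formula></p>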
        <p>
          Theorem 3.6 provides another view to decide on the choices of separation and aggregation
methods, i.e., the proxy of classifier prediction variance. To extend the theoretical conclusions
w.r.t. ℓ loss to the multi-class setting, we only need to modify the upper bound of the distribution
shift in Eqn. (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ), as specified in the following corollary.
        </p>
        <p>Corollary 3.7 (Multi-Class Extension (ℓ-Loss)). In the $M$-class classification case, the upper
bound of the distribution shift in Eqn. (<xref ref-type="bibr" rid="ref2">2</xref>) becomes:
<disp-formula><tex-math>R_{\ell, \mathcal{D}}(\hat{f}^{u}) - R_{\ell, \tilde{\mathcal{D}}^{u}}(\hat{f}^{u}) \le \Delta^{u}_{1} := \sum_{i \in [M]} p_i \cdot (1 - T^{u}_{i,i}) \cdot (\bar{\ell} - \underline{\ell}).</tex-math></disp-formula></p>
        <p>4. Bias and Variance Analyses with Robust Treatments</p>
        <p>Intuitively, the problem of learning with noisy labels could benefit from more robust loss functions
built upon the generic ℓ loss, i.e., backward correction (surrogate loss) [21, 22] and peer loss
functions [42]. We move on to explore the best way to learn with multiple copies of noisy labels
when combined with existing robust approaches.</p>
        <p>4.1. Backward Loss Correction</p>
        <p>When combined with the backward loss correction approach ($\ell \to \ell_{\leftarrow}$), the empirical ℓ-risks
become $\hat{R}_{\ell_{\leftarrow}, \tilde{D}^{u}}(f) = \frac{1}{N} \sum_{n=1}^{N} \ell_{\leftarrow}(f(x_n), \tilde{y}^{u}_n)$, where the corrected loss in the binary case is
defined as
<disp-formula><tex-math>\ell_{\leftarrow}(f(x), \tilde{y}^{u}) = \frac{(1 - \rho^{u}_{1 - \tilde{y}^{u}}) \cdot \ell(f(x), \tilde{y}^{u}) - \rho^{u}_{\tilde{y}^{u}} \cdot \ell(f(x), 1 - \tilde{y}^{u})}{1 - \rho^{u}_0 - \rho^{u}_1}.</tex-math></disp-formula></p>
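        <p>To make the binary corrected loss concrete, the sketch below implements the displayed formula and numerically checks the property that motivates it: averaged over the noisy label, the corrected loss recovers the loss on the clean label. The log loss, the noise rates, and the prediction value are illustrative assumptions.</p>
        <preformat>
```python
import math

def backward_corrected_loss(loss, pred, noisy_y, rho0, rho1):
    """Binary backward-corrected loss for a noisy label in {0, 1}.

    rho0 = P(noisy = 1 | clean = 0), rho1 = P(noisy = 0 | clean = 1).
    """
    rho = (rho0, rho1)
    num = ((1 - rho[1 - noisy_y]) * loss(pred, noisy_y)
           - rho[noisy_y] * loss(pred, 1 - noisy_y))
    return num / (1 - rho0 - rho1)

def log_loss(p1, y):
    """Cross-entropy for a predicted probability p1 of class 1."""
    return -math.log(p1 if y == 1 else 1 - p1)

rho0, rho1, p1 = 0.3, 0.2, 0.9   # hypothetical noise rates and prediction
# Expectation of the corrected loss over the noisy label, given clean label 1:
# the noisy label is 1 w.p. 1 - rho1 and 0 w.p. rho1.
expected = ((1 - rho1) * backward_corrected_loss(log_loss, p1, 1, rho0, rho1)
            + rho1 * backward_corrected_loss(log_loss, p1, 0, rho0, rho1))
print(expected, log_loss(p1, 1))
```
        </preformat>
        <p>The two printed values agree up to floating-point error. Note that the division by $1 - \rho_0 - \rho_1$ inflates the range of the loss as the noise rates grow, which is the effect that the constant $L^{u}_{\leftarrow 0}$ tracks in the bounds that follow.</p>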
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Bias of a given classifier w.r.t. $\ell_{\leftarrow}$</title>
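      <p>Suppose the loss function $\ell(f(x), y)$ is $L$-Lipschitz for any feasible $y$. Lemma 4.1 gives the
upper bound of the prediction bias of the classifier $\hat{f}$ under the clean data distribution $\mathcal{D}$ when
learning with $\ell_{\leftarrow}$ via separation or aggregation methods, with
$\hat{f} = \hat{f}^{u}_{\leftarrow} = \arg\min_{f \in \mathcal{F}} \hat{R}_{\ell_{\leftarrow}, \tilde{D}^{u}}(f)$.</p>
      <sec id="sec-3-1">
        <p>Lemma 4.1. With probability at least $1 - \delta$, we have:
<disp-formula><tex-math>R_{\ell, \mathcal{D}}(\hat{f}^{u}_{\leftarrow}) - R_{\ell, \mathcal{D}}(f^{*}) \le \Delta^{u}_{\leftarrow} := 4 L^{u}_{\leftarrow} \cdot \mathfrak{R}(\mathcal{F}) + L^{u}_{\leftarrow 0} \cdot (\bar{\ell} - \underline{\ell}) \cdot \sqrt{\frac{2 \log(1/\delta)}{\eta^{u}_K N}}.</tex-math></disp-formula></p>
        <p>Lemma 4.1 offers the upper bound of the performance gap for the given classifier $\hat{f}^{u}_{\leftarrow}$ w.r.t. the
clean distribution $\mathcal{D}$, compared to the minimum achievable risk. We consider the bound $\Delta^{u}_{\leftarrow}$
as a proxy of the bias, and we are interested in the case where training the classifier separately
yields a smaller bias proxy than the aggregation method, formally $\Delta^{\circ}_{\leftarrow} &lt; \Delta^{\bullet}_{\leftarrow}$.
For the hypothesis class $\mathcal{F} \subset \{f : \mathcal{X} \to \{0, 1\}\}$ and the sample set $S = \{x_1, ..., x_N\}$,
we give conditions under which training separately yields a smaller bias proxy.</p>
        <p>Theorem 4.2. Denote by $\alpha_K := 1 - L^{\bullet}_{\leftarrow} / L^{\circ}_{\leftarrow}$, $\gamma = 1 / \left( 1 + \frac{4L}{\bar{\ell} - \underline{\ell}} \sqrt{\frac{d \log(N)}{\log(1/\delta)}} \right)$, where $d$ is the
VC-dimension of $\mathcal{F}$. For backward loss correction, the separation bias proxy $\Delta^{\circ}_{\leftarrow}$ is smaller than
the aggregation bias proxy $\Delta^{\bullet}_{\leftarrow}$ if and only if
<disp-formula><tex-math>\alpha_K \cdot \frac{1}{1 - (\eta^{\circ}_K)^{-\frac{1}{2}}} \le \gamma. \tag{6}</tex-math></disp-formula></p>
        <p>We defer our empirical analysis of the monotonicity of the LHS in Eqn. (<xref ref-type="bibr" rid="ref6">6</xref>) to Section 4.3 as
well, which shares similar monotonicity behavior to learning w.r.t. ℓ.</p>
        <p>Variance of given classifiers with Backward Loss Correction. Similar to the previous
subsection, we now move on to check how separation and aggregation methods result in
different variances when training with loss correction.</p>
        <p>Theorem 4.3. When $L^{u}_{\leftarrow 0} (\eta^{u}_K)^{-\frac{1}{2}} &lt; \sqrt{\frac{N}{2 (\bar{\ell} - \underline{\ell})^2 \log(1/\delta)}}$, $\text{Var}(\hat{f}^{u}_{\leftarrow})$ (w.r.t. the 0-1 loss) satisfies:
<disp-formula><tex-math>\text{Var}(\hat{f}^{u}_{\leftarrow}) = g(R_{\ell, \tilde{\mathcal{D}}^{u}}(\hat{f}^{u}_{\leftarrow})) \le g\left( L^{u}_{\leftarrow 0} \cdot (\bar{\ell} - \underline{\ell}) \cdot \sqrt{\frac{2 \log(1/\delta)}{\eta^{u}_K N}} \right). \tag{7}</tex-math></disp-formula></p>
        <p>The variance proxy of $\text{Var}(\hat{f}^{\circ}_{\leftarrow})$ in Eqn. (<xref ref-type="bibr" rid="ref7">7</xref>) is smaller than that of $\text{Var}(\hat{f}^{\bullet}_{\leftarrow})$ if $\sqrt{\eta^{\circ}_K} &gt; L^{\circ}_{\leftarrow} / L^{\bullet}_{\leftarrow}$.</p>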
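        <p>Moving a bit further, when the noise transition matrix is symmetric for both methods, the
requirement $\sqrt{\eta^{\circ}_K} &gt; L^{\circ}_{\leftarrow} / L^{\bullet}_{\leftarrow}$ can be further simplified as: $\sqrt{\eta^{\circ}_K} &gt; L^{\circ}_{\leftarrow} / L^{\bullet}_{\leftarrow} = \frac{1 - \rho^{\bullet}_0 - \rho^{\bullet}_1}{1 - \rho^{\circ}_0 - \rho^{\circ}_1}$. For a fixed
$K$, a more efficient aggregation method decreases $\rho^{\bullet}_i$, which makes it harder to satisfy this
condition.</p>
        <p>Recall that $L^{u}_{\leftarrow} := L^{u}_{\leftarrow 0} \cdot L$. The theoretical insights of $\ell_{\leftarrow}$ in the binary case and the multi-class
setting can be bridged by replacing $L^{u}_{\leftarrow 0}$ with the multi-class constant specified in the following
corollary.</p>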
        <p>Corollary 4.4 (Multi-Class Extension (ℓ← -Loss)). Given a diagonal-dominant transition matrix
 , we have
where  min( ) denotes the minimal eigenvalue of the matrix  . Particularly, if  &lt; 0.5, ∀ ∈
[ ], we further have
← 0 = min
{︃
1</p>
        <p>,
1 − 2  min( )
2√
}︃
,
where  := m∈[ax](1 − ).</p>
        <sec id="sec-3-1-1">
          <title>4.2. Peer Loss Functions</title>
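          <p>Peer loss functions [42] are a family of loss functions that are shown to be robust to
label noise without requiring knowledge of the noise rates. Formally, $\ell_{\looparrowright}(f(x_n), \tilde{y}_n) :=
\ell(f(x_n), \tilde{y}_n) - \ell(f(x_{n_1}), \tilde{y}_{n_2})$, where the second term checks on mismatched data samples
with $(x_n, \tilde{y}_n)$, $(x_{n_1}, \tilde{y}_{n_1})$, $(x_{n_2}, \tilde{y}_{n_2})$, which are randomly drawn from the same data
distribution. When combined with the peer loss approach, i.e., $\ell \to \ell_{\looparrowright}$, the two risks become:
$\hat{R}_{\ell_{\looparrowright}, \tilde{D}^{u}}(f) = \frac{1}{N} \sum_{n=1}^{N} \ell_{\looparrowright}(f(x_n), \tilde{y}^{u}_n)$, $u \in \{\circ, \bullet\}$.</p>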
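          <p>A minimal sketch of the peer loss term for a single sample is given below. The toy classifier, the five data points, and the way the peer indices are drawn (uniformly with replacement) are assumptions for illustration, not the training setup used in the paper.</p>
          <preformat>
```python
import math
import random

def cross_entropy(probs, label):
    """Base loss: negative log-probability assigned to `label`."""
    return -math.log(probs[label])

def peer_loss(predict, data, n, rng, base_loss=cross_entropy):
    """Peer loss for sample n: base loss on the matched pair minus the
    base loss on a mismatched pair (x_{n1}, y_{n2}) drawn at random."""
    x_n, y_n = data[n]
    n1 = rng.randrange(len(data))
    n2 = rng.randrange(len(data))
    x_peer, y_peer = data[n1][0], data[n2][1]
    return base_loss(predict(x_n), y_n) - base_loss(predict(x_peer), y_peer)

def predict(x):
    """Hypothetical stand-in classifier: logistic score on a scalar input."""
    p1 = 1.0 / (1.0 + math.exp(-x))
    return [1.0 - p1, p1]

rng = random.Random(0)
data = [(-2.0, 0), (-1.0, 0), (0.5, 1), (1.5, 1), (2.0, 0)]  # (x, noisy y)
losses = [peer_loss(predict, data, n, rng) for n in range(len(data))]
print(sum(losses) / len(losses))    # empirical peer-loss risk on the toy set
```
          </preformat>
          <p>The subtracted term pairs an input with a label from an independently drawn sample, so a classifier that only fits the marginal label distribution gains nothing from it. No noise rates appear anywhere in the computation.</p>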
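          <p>Bias of given classifier w.r.t. $\ell_{\looparrowright}$. Suppose the loss function $\ell(f(x), y)$ is $L$-Lipschitz for any
feasible $y$. Let $L^{u}_{\looparrowright 0} := 1/(1 - \rho^{u}_0 - \rho^{u}_1)$, $L^{u}_{\looparrowright} := L^{u}_{\looparrowright 0} \cdot L$ and $\hat{f}^{u}_{\looparrowright} = \arg\min_{f \in \mathcal{F}} \hat{R}_{\ell_{\looparrowright}, \tilde{D}^{u}}(f)$.</p>
          <p>Lemma 4.5. With probability at least $1 - \delta$, we have:</p>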
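          <p>Theorem 4.6. Denote by $\alpha_K := 1 - L^{\bullet}_{\looparrowright} / L^{\circ}_{\looparrowright}$, $\gamma = \frac{1 + 2(\bar{\ell} - \underline{\ell})}{2L} \sqrt{\frac{\log(4/\delta)}{4 d \log(N)}}$, where $d$ denotes the
VC-dimension of $\mathcal{F}$. For peer loss, the separation bias proxy $\Delta^{\circ}_{\looparrowright}$ is smaller than the aggregation
bias proxy $\Delta^{\bullet}_{\looparrowright}$ if and only if
<disp-formula><tex-math>\frac{\alpha_K}{L^{\bullet}_{\looparrowright} / L^{\circ}_{\looparrowright} - (\eta^{\circ}_K)^{-\frac{1}{2}}} \le \gamma. \tag{8}</tex-math></disp-formula></p>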
          <p>[Figure: panels labeled "Loss"; x-axis: Number of Labelers.]</p>
          <p>Note that the condition in Eqn. (<xref ref-type="bibr" rid="ref8">8</xref>) shares a similar pattern to those that appeared for the
basic loss ℓ and $\ell_{\leftarrow}$; we will empirically illustrate the monotonicity of its LHS in Section 4.3.</p>
          <p>Variance of given classifiers with Peer Loss. We now move on to check how separation
and aggregation methods result in different variances when training with peer loss. Similarly,
we can obtain:</p>
          <p>Theorem 4.7. When $\sqrt{\eta^{u}_K N} \ge \sqrt{2 \log(4/\delta)} \cdot (1 + 2(\bar{\ell} - \underline{\ell}))$, $\text{Var}(\hat{f}^{u}_{\looparrowright})$ (w.r.t. the 0-1 loss) satisfies:
<disp-formula><tex-math>\text{Var}(\hat{f}^{u}_{\looparrowright}) = g(R_{\ell, \tilde{\mathcal{D}}^{u}}(\hat{f}^{u}_{\looparrowright})) \le g\Bigg( \underbrace{L^{u}_{\looparrowright 0} \cdot \sqrt{\frac{\log(4/\delta)}{2 \eta^{u}_K N}} \cdot (1 + 2(\bar{\ell} - \underline{\ell}))}_{\text{Variance proxy}} \Bigg). \tag{9}</tex-math></disp-formula></p>
          <p>The variance proxy of $\text{Var}(\hat{f}^{\circ}_{\looparrowright})$ in Eqn. (<xref ref-type="bibr" rid="ref9">9</xref>) is smaller than that of $\text{Var}(\hat{f}^{\bullet}_{\looparrowright})$ if $\sqrt{\eta^{\circ}_K} \ge L^{\circ}_{\looparrowright} / L^{\bullet}_{\looparrowright}$.</p>
          <p>Theoretical insights of $\ell_{\looparrowright}$ also have multi-class extensions: we only need to generalize
$L^{u}_{\looparrowright 0}$ to the multi-class setting, along with the additional conditions specified below.</p>
          <p>Corollary 4.8 (Multi-Class Extension ($\ell_{\looparrowright}$-Loss)). Assume $\ell_{\looparrowright}$ is classification-calibrated in
the multi-class setting, and the clean label $Y$ has equal prior $\mathbb{P}(Y = j) = \frac{1}{M}$, $\forall j \in [M]$.
For the uniform noise transition matrix [44] such that $T_{j,i} = \epsilon_i$, $\forall j \in [M]$, we have: $L^{u}_{\looparrowright 0} = 1/(1 - \sum_{i \in [M]} \epsilon_i)$.</p>
          <p>4.3. Analysis of the Theoretical Conditions</p>
          <p>Recall that the established conditions in Theorems 3.4, 4.2, 4.6 are implicitly relevant to the
number of labelers $K$, and the RHS of Eqns. (<xref ref-type="bibr" rid="ref3 ref6 ref8">3, 6, 8</xref>) are constants. We proceed to analyze the
monotonicity of the corresponding LHS (in the form of $\alpha_K \cdot \frac{1}{\beta - (\eta^{\circ}_K)^{-\frac{1}{2}}}$) w.r.t. the increase
of $K$, where $\beta = 1$ for ℓ and $\ell_{\leftarrow}$, and $\beta = L^{\bullet}_{\looparrowright} / L^{\circ}_{\looparrowright}$ for $\ell_{\looparrowright}$. Thus, we have: $\mathcal{O}(\text{LHS}) =
\mathcal{O}\left( \alpha_K \cdot \left( \beta - \mathcal{O}\left( \frac{\log(K)}{\sqrt{K}} \right) \right)^{-1} \right)$. We visualize this order under different symmetric $T^{\circ}$ in Figure 3.</p>
          <p>[Figure: panel titles "Cross-Entropy" and "Backward Correction"; caption fragment "Instance-Dependent Noise, CIFAR-10".]</p>
          <p>It can be observed that when $K$ is small (e.g., $K \le 5$), the LHS parts of these conditions increase
with $K$, while they may decrease with $K$ if $K$ is sufficiently large. Recall that separation is better
if the LHS is less than the constant value $\gamma$. Therefore, Figure 3 shows the trend that aggregation
is generally better than separation when $K$ is sufficiently large.</p>
          <p>Tightness of the bias proxies. In Theorems 3.4, 4.2, 4.6, we view the error bounds
$\Delta^{u}$, $\Delta^{u}_{\leftarrow}$, $\Delta^{u}_{\looparrowright}$ as proxies of the worst-case performance of the trained classifier. For the
standard loss function ℓ, it has been proven [66, 67] that, under mild conditions on ℓ and $\mathcal{F}$, the
lower bound of the performance gap between a trained classifier ($\hat{f}$) and the optimal achievable
one (i.e., $f^{*}$), $R_{\ell, \mathcal{D}}(\hat{f}) - R_{\ell, \mathcal{D}}(f^{*})$, is of the order $\mathcal{O}(\sqrt{1/N})$, which is of the same order as that
in Theorem 3.4. Noting that the behavior concluded from the worst-case bounds may not always
hold for each individual case, we further use experiments to validate our analyses in the next
section.</p>
          <p>aggregated labels (majority vote, EM inference), and separated labels. We highlight the results with
Green (for the separation method) and Red (for aggregation methods) if the performance gap is larger
than 0.05. ($K$ is the number of labels per training image.)</p>
          <p>UCI-Breast (symmetric), CE: $K = 5$, $K = 9$, $K = 15$</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experimental Results</title>
      <p>In this section, we empirically compare the performance of different treatments of the multiple
noisy labels when learning with robust loss functions (CE loss, backward loss correction, and
peer loss). We consider several treatments, including label aggregation methods (majority vote
and EM inference) and the label separation method. Assuming that multiple noisy labels have
different weights, EM inference can be used to solve the problem under this assumption by
treating the aggregated labels as hidden variables [68, 69, 8, 70]. In the E-step, the probabilities
of the aggregated labels are estimated using the weighted aggregation approach based on the
fixed weights of multiple noisy labels. In the M-step, the EM inference method re-estimates the
weights of multiple noisy labels based on the current aggregated labels. This iteration continues
until all aggregated labels remain unchanged. As for label separation, we adopt the mini-batch
separation method, i.e., each training sample $x_n$ is assigned $K$ noisy labels in each batch.</p>
      <p>5.1. Experiment on Synthetic Noisy Datasets</p>
      <p>Experimental results on synthetic noisy UCI datasets [71]. We adopt six UCI datasets
to empirically compare the performances of label separation and aggregation methods when
learning with CE loss, backward correction [21, 22], and Peer Loss [42]. The noisy annotations
given by multiple annotators are simulated by symmetric label noise, which assumes $T_{i,j} = \frac{\epsilon}{M - 1}$
for $j \neq i$ for each annotator, where $\epsilon$ quantifies the overall noise rate of the generated noisy
labels. In Figure 4, we adopt two UCI datasets (StatLog: ($M = 6$); Optical: ($M = 10$)) for
illustration. From the results in Figure 4, it is quite clear that the label separation method
outperforms both aggregation methods (majority vote and EM inference) consistently, and is
considered to be more beneficial on such small-scale datasets. Results on additional datasets and
more details are deferred to the Appendix.</p>
      <p>[Figure legend: MV (majority vote), EM (EM inference), Sep (separation).]</p>
      <p>Experimental results on synthetic noisy CIFAR-10 dataset [72]. On the CIFAR-10 dataset,
we consider two types of simulation for the separate noisy labels: the symmetric label noise model
and instance-dependent label noise [53, 24], where $\epsilon$ is the average noise rate and different
labelers follow different instance-dependent noise transition matrices. For a fair comparison, we
adopt the ResNet-34 model [73] and the same training procedure and batch size for all considered
treatments of the separate noisy labels.
In the low noise regime or when $K$ is large, aggregating separate noisy labels significantly reduces the
noise rates, and aggregation methods tend to perform better; while in the high
noise regime or when $K$ is small, the performance of separation methods tends to be more
promising. As $K$ increases or $\epsilon$ decreases, we can observe a preference transition from label
separation to label aggregation methods.</p>
      <p>5.2. Empirical Verification of the Theoretical Bounds</p>
      <p>To verify the comparisons of bias proxies (i.e., Theorem 3.4) from an empirical perspective,
we adopt two binary classification UCI datasets for demonstration: the Breast and German datasets,
as shown in Table 1. Clearly, on these two binary classification tasks, label aggregation methods
tend to outperform label separation, and we attribute this phenomenon to the fact that the
denoising effect of label aggregation is more significant in the binary case.</p>
      <p>For Theorem 3.4 (CE loss), the condition requires $\alpha_K / \left( 1 - (\eta^{\circ}_K)^{-\frac{1}{2}} \right) \le \gamma$, where $\alpha_K = (\rho^{\circ}_0 p_0 + \rho^{\circ}_1 p_1) - (\rho^{\bullet}_0 p_0 + \rho^{\bullet}_1 p_1)$
as in Theorem 3.4. The information is summarized in Table 2, where the column $(1 - \delta, K)$ means: when
the number of annotators belongs to the set $K$, the label separation method is likely to
underperform label aggregation (i.e., majority vote) with probability at least $1 - \delta$. For example, in
the last row of Table 2, when training on the UCI German dataset with CE loss under noise rate
0.4 (the noise rate of separate noisy labels), Theorem 3.4 reveals that with probability at least
0.98, label aggregation (with majority vote) is better than label separation when $K &gt; 23$, which
aligns well with our empirical observations (label separation is better only when $K &lt; 15$).</p>
      <p>5.3. Experiments on Realistic Noisy Datasets</p>
      <p>Note that in real-world scenarios, the label-noise pattern may differ due to the expertise of each
human annotator. We further compare the different treatments on two realistic noisy datasets:
CIFAR-10N [74] and CIFAR-10H [75]. CIFAR-10N provides each CIFAR-10 training image with 3
independent human annotations, while CIFAR-10H gives ≈ 50 annotations for each CIFAR-10
test image.</p>
      <p>In Table 3, we repeat the comparison of the three loss functions under three different
treatments of the separate noisy labels. We report the best-achieved test accuracy for the
Cross-Entropy/Backward Correction/Peer Loss methods when learning with label aggregation methods
(majority vote and EM inference) and the separation method (soft-label). We observe that the
separation method tends to perform better than the aggregation ones. This may be
attributed to the relatively high noise rate ($\epsilon \approx 0.18$) in CIFAR-10N and the insufficient number
of labelers ($K = 3$). Note that since the noise level in CIFAR-10H is low ($\epsilon \approx 0.07$ wrong
labels), label aggregation methods can infer higher-quality labels and thus result in better
performance than separation methods (red-colored cells in Tables 3 and 4).</p>
      <sec id="sec-4-1">
        <title>5.4. Hypothesis Testing</title>
        <p>We adopt the paired t-test to show which treatment of the separate noisy labels is better under
certain conditions. In Table 5, we report the statistic and $p$-value given by the hypothesis testing
results. The column “Methods” indicates the two methods we want to compare (A &amp; B). A positive
statistic means that A is better than B in terms of test accuracy. Given a specific setting,
denote by $\text{Acc}_{\text{method}}$ the list of test accuracies that belong to this setting (e.g., CIFAR-10N,
$K = 3$), including the CE, BW, and PL loss functions. The basic hypotheses can be summarized as
below:</p>
        <p>To clarify, the three cases in the above hypothesis are tested independently. For the test accuracy
comparisons on CIFAR-10N in Table 3, the setting of the hypothesis test is K = 3 and the label
noise rate is relatively high (18%). All p-values are smaller than 0.05, indicating that we should
reject the null hypothesis, and we can conclude that the performance of these three methods on</p>
        <sec id="sec-4-1-1">
          <title>CIFAR-10N (high noise, small K) satisfies: EM&lt;MV&lt;Sep.</title>
          <p>For CIFAR-10H in Tables 3 and 4, the label noise rate is relatively low. We consider two
scenarios (K &lt; 15: the number of annotators is small; K ≥ 15: the number of annotators is
large). The p-values between MV and EM are always large, which means that the denoising effect of
the advanced label aggregation method (EM) is negligible on the CIFAR-10H dataset. However, the
p-values of the remaining settings are smaller than 0.05, indicating that we should reject the null
hypothesis, and we can conclude that the performance of these three methods on CIFAR-10H (low
noise, small/large K) satisfies: EM/MV &gt; Sep.</p>
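          <p>As a rough illustration of the test statistic used above, the sketch below computes the paired t statistic for two lists of test accuracies. The accuracy values are hypothetical placeholders and are not taken from the tables of this paper. A positive statistic means that the first method has the higher mean accuracy.</p>
          <preformat>
```python
import math
import statistics


def paired_t_statistic(acc_a, acc_b):
    """t statistic of the paired t-test on two equal-length accuracy lists."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_diff / (std_diff / math.sqrt(n))


# Hypothetical accuracies, one entry per loss function (CE, BW, PL).
acc_sep = [0.90, 0.88, 0.91]
acc_mv = [0.88, 0.87, 0.88]
t_stat = paired_t_statistic(acc_sep, acc_mv)
```
          </preformat>
          <p>The p-value follows from the Student t distribution with n − 1 degrees of freedom. A library routine such as scipy.stats.ttest_rel returns both the statistic and the p-value.</p>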
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>When learning with separate noisy labels, we study the question of “whether one
should aggregate separate noisy labels into single ones or use them separately as given”. Under the
empirical risk minimization framework, we theoretically show that label separation can be
more beneficial than label aggregation when the noise rates are high or the number of labelers is
insufficient. These insights hold for a number of popular loss functions, including several robust
treatments. Empirical results on synthetic and real-world datasets validate our conclusions.</p>
      <p>machine intelligence 39 (2017) 2409–2422.
[12] T. Luo, Y. Liu, Machine truth serum, arXiv preprint arXiv:1909.13004 (2019).
[13] P. G. Ipeirotis, F. Provost, V. S. Sheng, J. Wang, Repeated labeling using multiple noisy
labelers, Data Mining and Knowledge Discovery 28 (2014) 402–441.
[14] V. S. Sheng, J. Zhang, B. Gu, X. Wu, Majority voting and pairing with multiple noisy
labeling, IEEE Transactions on Knowledge and Data Engineering 31 (2017) 1355–1368.
[15] Z. Zhu, Z. Dong, Y. Liu, Detecting corrupted labels without training a model to predict,
arXiv preprint arXiv:2110.06283 (2022).
[16] Z.-H. Zhou, Ensemble methods: foundations and algorithms, CRC press, 2012.
[17] M. Guan, V. Gulshan, A. Dai, G. Hinton, Who said what: Modeling individual labelers
improves classification, in: Proceedings of the AAAI Conference on Artificial Intelligence,
volume 32, 2018.
[18] F. Rodrigues, F. Pereira, Deep learning from crowds, in: Proceedings of the AAAI
Conference on Artificial Intelligence, volume 32, 2018.
[19] Z. Chen, H. Wang, H. Sun, P. Chen, T. Han, X. Liu, J. Yang, Structured probabilistic
end-to-end learning from crowds., in: IJCAI, 2020, pp. 1512–1518.
[20] H. Wei, R. Xie, L. Feng, B. Han, B. An, Deep learning from multiple noisy annotators as a
union, IEEE Transactions on Neural Networks and Learning Systems (2022).
[21] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in:
Advances in neural information processing systems, 2013, pp. 1196–1204.
[22] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, L. Qu, Making deep neural networks
robust to label noise: A loss correction approach, in: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952.
[23] X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, M. Sugiyama, Are anchor points really
indispensable in label-noise learning?, Advances in Neural Information Processing Systems
32 (2019).
[24] Z. Zhu, Y. Song, Y. Liu, Clusterability as an alternative to anchor points when learning
with noisy labels, in: International Conference on Machine Learning, PMLR, 2021, pp.
12912–12923.
[25] Z. Zhu, J. Wang, Y. Liu, Beyond images: Label noise transition matrix estimation for tasks
with lower-quality features, arXiv preprint arXiv:2202.01273 (2022).
[26] Z. Jiang, K. Zhou, Z. Liu, L. Li, R. Chen, S.-H. Choi, X. Hu, An information fusion approach
to learning with instance-dependent label noise, in: International Conference on Learning
Representations, 2022.
[27] Z. Zhang, Y. Li, H. Wei, K. Ma, T. Xu, Y. Zheng, Alleviating noisy-label effects in image
classification via probability transition matrix, arXiv preprint arXiv:2110.08866 (2021).
[28] S. Li, X. Xia, H. Zhang, Y. Zhan, S. Ge, T. Liu, Estimating noise transition matrix with label
correlations for noisy multi-label learning, in: Advances in Neural Information Processing
Systems, 2022.
[29] X. Xia, B. Han, N. Wang, J. Deng, J. Li, Y. Mao, T. Liu, Extended T: Learning with
mixed closed-set and open-set noisy labels, IEEE Transactions on Pattern Analysis and
Machine Intelligence (2022).
[30] T. Liu, D. Tao, Classification with noisy labels by importance reweighting, IEEE
Transactions on pattern analysis and machine intelligence 38 (2016) 447–461.
[31] H.-S. Chang, E. Learned-Miller, A. McCallum, Active bias: Training more accurate
neural networks by emphasizing high variance samples, Advances in Neural Information
Processing Systems 30 (2017).
[32] N. Bar, T. Koren, R. Giryes, Multiplicative reweighting for robust neural network
optimization, arXiv preprint arXiv:2102.12192 (2021).
[33] N. Majidi, E. Amid, H. Talebi, M. K. Warmuth, Exponentiated gradient reweighting for
robust training under label noise and beyond, arXiv preprint arXiv:2104.01493 (2021).
[34] A. Kumar, E. Amid, Constrained instance and class reweighting for robust learning under
label noise, arXiv preprint arXiv:2111.05428 (2021).
[35] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, A. Rabinovich, Training deep neural
networks on noisy labels with bootstrapping, arXiv preprint arXiv:1412.6596 (2014).
[36] M. Lukasik, S. Bhojanapalli, A. Menon, S. Kumar, Does label smoothing mitigate label
noise?, in: International Conference on Machine Learning, PMLR, 2020, pp. 6448–6458.
[37] J. Wei, H. Liu, T. Liu, G. Niu, Y. Liu, Understanding generalized label smoothing when
learning with noisy labels, arXiv preprint arXiv:2106.04149 (2021).
[38] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, J. Bailey, Symmetric cross entropy for robust
learning with noisy labels, in: Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2019, pp. 322–330.
[39] E. Amid, M. K. Warmuth, R. Anil, T. Koren, Robust bi-tempered logistic loss based on
Bregman divergences, Advances in Neural Information Processing Systems 32 (2019).
[40] J. Wang, H. Guo, Z. Zhu, Y. Liu, Policy learning using weak supervision, Advances in
Neural Information Processing Systems 34 (2021).
[41] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, J. Bailey, Normalized loss functions for
deep learning with noisy labels, in: International Conference on Machine Learning, PMLR,
2020, pp. 6543–6553.
[42] Y. Liu, H. Guo, Peer loss functions: Learning from noisy labels without knowing noise
rates, in: International Conference on Machine Learning, PMLR, 2020, pp. 6226–6236.
[43] Z. Zhu, T. Liu, Y. Liu, A second-order approach to learning with instance-dependent label
noise, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2021, pp. 10113–10123.
[44] J. Wei, Y. Liu, When optimizing  -divergence is robust with label noise, arXiv preprint
arXiv:2011.03687 (2020).
[45] H. Wei, H. Zhuang, R. Xie, L. Feng, G. Niu, B. An, Y. Li, Logit clipping for robust learning
against label noise, arXiv preprint arXiv:2212.04055 (2022).
[46] X. Xia, T. Liu, B. Han, C. Gong, N. Wang, Z. Ge, Y. Chang, Robust early-learning: Hindering
the memorization of noisy labels, in: International conference on learning representations,
2020.
[47] S. Liu, J. Niles-Weed, N. Razavian, C. Fernandez-Granda, Early-learning regularization
prevents memorization of noisy labels, Advances in neural information processing systems
33 (2020) 20331–20342.
[48] S. Liu, Z. Zhu, Q. Qu, C. You, Robust training under label noise by over-parameterization,
arXiv preprint arXiv:2202.14026 (2022).
[49] H. Cheng, Z. Zhu, X. Sun, Y. Liu, Demystifying how self-supervised features improve
training from noisy labels, arXiv preprint arXiv:2110.09022 (2021).
[50] H. Wei, L. Tao, R. Xie, B. An, Open-set label noise can improve robustness against inherent
label noise, Advances in Neural Information Processing Systems 34 (2021).
[51] H. Huang, H. Kang, S. Liu, O. Salvado, T. Rakotoarivelo, D. Wang, T. Liu, Paddles:
Phase-amplitude spectrum disentangled early stopping for learning with noisy labels (????).
[52] S. Liu, K. Liu, W. Zhu, Y. Shen, C. Fernandez-Granda, Adaptive early-learning correction
for segmentation from noisy annotations, arXiv preprint arXiv:2110.03740 (2021).
[53] H. Cheng, Z. Zhu, X. Li, Y. Gong, X. Sun, Y. Liu, Learning with instance-dependent label
noise: A sample sieve approach, in: International Conference on Learning Representations,
2021. URL: https://openreview.net/forum?id=2VXyy9mIyU3.
[54] T. Luo, X. Li, H. Wang, Y. Liu, Research replication prediction using weakly supervised
learning, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: Findings, 2020.
[55] Z. Wang, J. Jiang, B. Han, L. Feng, B. An, G. Niu, G. Long, Seminll: A framework of
noisy-label learning by semi-supervised learning, arXiv preprint arXiv:2012.00925 (2020).
[56] C. Qin, Y. Wang, Y. Fu, Robust semi-supervised domain adaptation against noisy labels,
in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge
Management, 2022, pp. 4409–4413.
[57] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, M. Sugiyama, Co-teaching: Robust
training of deep neural networks with extremely noisy labels, in: Advances in neural
information processing systems, 2018, pp. 8527–8537.
[58] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, M. Sugiyama, How does disagreement help
generalization against label corruption?, in: International Conference on Machine Learning,
PMLR, 2019, pp. 7164–7173.
[59] H. Wei, L. Feng, X. Chen, B. An, Combating noisy labels by agreement: A joint training
method with co-regularization, in: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2020, pp. 13726–13735.
[60] B. Han, Q. Yao, T. Liu, G. Niu, I. W. Tsang, J. T. Kwok, M. Sugiyama, A survey of label-noise
representation learning: Past, present and future, arXiv preprint arXiv:2011.04406 (2020).
[61] H. Song, M. Kim, D. Park, Y. Shin, J.-G. Lee, Learning from noisy labels with deep neural
networks: A survey, IEEE Transactions on Neural Networks and Learning Systems (2022).
[62] V. Feldman, Does learning require memorization? a short tale about a long tail, in:
Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020,
pp. 954–959.
[63] Y. Liu, Understanding instance-level label noise: Disparate impacts and treatments, in:
International Conference on Machine Learning, PMLR, 2021, pp. 6725–6735.
[64] W. Tang, M. Yin, C.-J. Ho, Leveraging peer communication to enhance crowdsourcing, in:
The World Wide Web Conference, 2019, pp. 1794–1805.
[65] Z. Zhu, T. Luo, Y. Liu, The rich get richer: Disparate impact of semi-supervised learning,
arXiv preprint arXiv:2110.06282 (2021).
[66] S. Mendelson, Lower bounds for the empirical minimization algorithm, IEEE Transactions
on Information Theory 54 (2008) 3797–3803.
[67] G. Lecué, S. Mendelson, Sharper lower bounds on the performance of the empirical risk
minimization algorithm, Bernoulli (2010) 605–613.
[68] A. P. Dawid, A. M. Skene, Maximum likelihood estimation of observer error-rates using
the em algorithm, Journal of the Royal Statistical Society: Series C (Applied Statistics) 28
(1979) 20–28.
[69] P. Smyth, U. Fayyad, M. Burl, P. Perona, P. Baldi, Inferring ground truth from subjective
labelling of venus images, Advances in neural information processing systems 7 (1994).
[70] N. Quoc Viet Hung, N. T. Tam, L. N. Tran, K. Aberer, An evaluation of aggregation
techniques in crowdsourcing, in: International Conference on Web Information Systems
Engineering, Springer, 2013, pp. 1–15.
[71] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[72] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images,
Technical Report, Citeseer, 2009.
[73] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in:
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.
770–778.
[74] J. Wei, Z. Zhu, H. Cheng, T. Liu, G. Niu, Y. Liu, Learning with noisy labels revisited: A
study using real-world human annotations, arXiv preprint arXiv:2110.12088 (2021).
[75] J. C. Peterson, R. M. Battleday, T. L. Griffiths, O. Russakovsky, Human uncertainty makes
classification more robust, in: Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2019, pp. 9617–9626.
[76] J. M. Varah, A lower bound for the smallest singular value of a matrix, Linear Algebra and
its applications 11 (1975) 3–5.
[77] X. Xia, T. Liu, B. Han, N. Wang, M. Gong, H. Liu, G. Niu, D. Tao, M. Sugiyama,
Part-dependent label noise: Towards instance-dependent label noise, Advances in Neural
Information Processing Systems 33 (2020) 7597–7610.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Full Proofs</title>
      <p>In this section, we present all proofs omitted from the main paper.</p>
      <p>We first give the proof of Lemma 4.1 because it is helpful for the proofs of the results in Section 3.</p>
      <sec id="sec-6-1">
        <title>A.1. Proof of Lemma 4.1</title>
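        <p>Proof. To apply Hoeffding’s inequality to the dataset of the separation method, we divide the
noisy training samples <inline-formula><tex-math>\{(x_n, \tilde{y}_{n,k})\}_{n\in[N]}</tex-math></inline-formula> into <inline-formula><tex-math>K</tex-math></inline-formula> groups, for <inline-formula><tex-math>k \in [K]</tex-math></inline-formula>, i.e., <inline-formula><tex-math>\{(x_n, \tilde{y}_{n,1})\}_{n\in[N]}, \cdots, \{(x_n, \tilde{y}_{n,K})\}_{n\in[N]}</tex-math></inline-formula>. Note that within each group, e.g., group <inline-formula><tex-math>\{(x_n, \tilde{y}_{n,1})\}_{n\in[N]}</tex-math></inline-formula>, all the <inline-formula><tex-math>N</tex-math></inline-formula> training
samples are i.i.d. Additionally, training samples from any two different groups are also i.i.d.
given the feature set <inline-formula><tex-math>\{x_n\}_{n\in[N]}</tex-math></inline-formula>. Thus, with one group <inline-formula><tex-math>\{(x_n, \tilde{y}_{n,1})\}_{n\in[N]}</tex-math></inline-formula>, w.p. <inline-formula><tex-math>1 - \delta_0</tex-math></inline-formula>, we have
<disp-formula><tex-math><![CDATA[
\left| \hat{R}_{\mathbb{1}_{\leftarrow}\mid\text{Group-1}}(f) - R_{\mathbb{1}_{\leftarrow}}(f) \right|
\le \left( \overline{\mathbb{1}}_{\leftarrow} - \underline{\mathbb{1}}_{\leftarrow} \right) \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}}, \quad \forall f.
]]></tex-math></disp-formula></p>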
        <p>Note that:</p>
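        <p>Applying the above technique to the other groups, by the union bound, we know that
w.p. at least <inline-formula><tex-math>1 - K\delta_0</tex-math></inline-formula>, <inline-formula><tex-math>\forall k \in [K]</tex-math></inline-formula>,
<disp-formula><tex-math><![CDATA[
\hat{R}_{\mathbb{1}_{\leftarrow}\mid\text{Group-}k}(f) \in \left[ R_{\mathbb{1}_{\leftarrow}}(f) - L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}},\;\; R_{\mathbb{1}_{\leftarrow}}(f) + L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}} \right].
]]></tex-math></disp-formula></p>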
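        <p>Each <inline-formula><tex-math>\hat{R}_{\mathbb{1}_{\leftarrow}\mid\text{Group-}k}(f)</tex-math></inline-formula>, <inline-formula><tex-math>k \in [K]</tex-math></inline-formula>, can be seen as a random variable within the range
<disp-formula><tex-math><![CDATA[
\left[ R_{\mathbb{1}_{\leftarrow}}(f) - L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}},\;\; R_{\mathbb{1}_{\leftarrow}}(f) + L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}} \right].
]]></tex-math></disp-formula>
The randomness comes from the noisy labels <inline-formula><tex-math>\tilde{y}_{n,k}</tex-math></inline-formula>. Recall that the samples from different groups
are i.i.d. given <inline-formula><tex-math>\{x_n\}_{n\in[N]}</tex-math></inline-formula>. Then the above <inline-formula><tex-math>K</tex-math></inline-formula> random variables are i.i.d. when the feature set is
fixed. By Hoeffding’s inequality, w.p. at least <inline-formula><tex-math>1 - K\delta_0 - \delta_1</tex-math></inline-formula>, <inline-formula><tex-math>\forall f</tex-math></inline-formula>, we have
<disp-formula><tex-math><![CDATA[
\left| \hat{R}_{\mathbb{1}_{\leftarrow}}(f) - R_{\mathbb{1}_{\leftarrow}}(f) \right|
\le 2 \cdot L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}} \cdot \sqrt{\frac{\log(1/\delta_1)}{2K}}
= L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_1)\,\log(1/\delta_0)}{NK}}.
]]></tex-math></disp-formula></p>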
        <p>For  0 =  1 = +1 , with the Rademacher bound on the maximal deviation between risks
and empirical ones, for  * ∈ ℱ and the separation method, with probability at least 1 −  , we
⃒
ℓ← ,̃︀ ( ) − 
ℓ← ,̃︀</p>
        <p>( )⃒⃒ ≤ 2R∘ (ℓ← ∘ ℱ ) + ← 0 · (ℓ − ℓ) · log
√︂ 1
 
,
=2R∙ (ℓ← ∘ ℱ ) + ∙← 0 · (ℓ − ℓ) ·
where we define ℓ, ℓ as the upper and lower bound of loss function ℓ respectively, and:
have:
max ⃒⃒ ^
 
=1 =1
∑︁ ∑︁  ℓ← ( (), ˜, )⎦
⎤
≤
1 ∑︁  ℓ← ( (), ˜, ) ,</p>
        <p>]︃
]︃</p>
        <p>R∙ (ℓ← ∘ ℱ ) := E,˜∙ ,  su∈ℱp   ℓ← ( (), ˜∙ ) .</p>
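        <p>Note that we assume the noisy labels given by the <inline-formula><tex-math>K</tex-math></inline-formula> labelers follow the same noise transition
matrix. If <inline-formula><tex-math>\ell</tex-math></inline-formula> is <inline-formula><tex-math>L</tex-math></inline-formula>-Lipschitz, then for separation and aggregation methods, <inline-formula><tex-math>\ell_{\leftarrow}</tex-math></inline-formula> is <inline-formula><tex-math>L^{u}_{\leftarrow}</tex-math></inline-formula>-Lipschitz
for <inline-formula><tex-math>u \in \{\circ, \bullet\}</tex-math></inline-formula> respectively, where
<disp-formula><tex-math><![CDATA[
L^{u}_{\leftarrow} = \frac{(1 + |\rho^{u}_0 - \rho^{u}_1|)\, L}{1 - \rho^{u}_0 - \rho^{u}_1} \le \frac{2L}{1 - \rho^{u}_0 - \rho^{u}_1}.
]]></tex-math></disp-formula>
By the Lipschitz composition property of Rademacher averages, we have <inline-formula><tex-math>\mathfrak{R}(\ell_{\leftarrow} \circ \mathcal{F}) \le L^{u}_{\leftarrow} \cdot \mathfrak{R}(\mathcal{F})</tex-math></inline-formula>. Thus, for the separation method, we have:
<disp-formula><tex-math><![CDATA[
\max_{f \in \mathcal{F}} \left| \hat{R}_{\ell_{\leftarrow}, \tilde{\mathcal{D}}^{\circ}}(f) - R_{\ell_{\leftarrow}, \tilde{\mathcal{D}}^{\circ}}(f) \right|
\le 2 L^{\circ}_{\leftarrow} \mathfrak{R}(\mathcal{F}) + \frac{(1 + |\rho^{\circ}_0 - \rho^{\circ}_1|) \cdot (\overline{\ell} - \underline{\ell})}{1 - \rho^{\circ}_0 - \rho^{\circ}_1} \cdot \log\!\left(\frac{K+1}{\delta}\right) \cdot \sqrt{\frac{1}{NK}}.
]]></tex-math></disp-formula></p>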
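        <p>With <inline-formula><tex-math>f^{*} := \arg\min_{f \in \mathcal{F}} R_{\ell, \mathcal{D}}(f)</tex-math></inline-formula>, for separation methods, we further have:
<disp-formula><label>(10)</label><tex-math><![CDATA[
\begin{aligned}
R_{\ell,\mathcal{D}}(\hat{f}^{\circ}_{\leftarrow}) - R_{\ell,\mathcal{D}}(f^{*})
&= R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(\hat{f}^{\circ}_{\leftarrow}) - R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(f^{*}) \\
&= R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(\hat{f}^{\circ}_{\leftarrow}) - \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(\hat{f}^{\circ}_{\leftarrow})
 + \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(f^{*}) - R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(f^{*})
 + \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(\hat{f}^{\circ}_{\leftarrow}) - \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(f^{*}) \\
&\le 2 \max_{f \in \mathcal{F}} \left| \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(f) - R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\circ}}(f) \right| \\
&\le 4 L^{\circ}_{\leftarrow} \mathfrak{R}(\mathcal{F}) + 2 L^{\circ}_{\leftarrow 0} \cdot (\overline{\ell} - \underline{\ell}) \cdot \log\!\left(\frac{K+1}{\delta}\right) \cdot \sqrt{\frac{1}{NK}}.
\end{aligned}
]]></tex-math></disp-formula></p>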
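        <p>Similarly, for aggregation methods, we have:
<disp-formula><tex-math><![CDATA[
\begin{aligned}
R_{\ell,\mathcal{D}}(\hat{f}^{\bullet}_{\leftarrow}) - R_{\ell,\mathcal{D}}(f^{*})
&= R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(\hat{f}^{\bullet}_{\leftarrow}) - R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(f^{*}) \\
&= R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(\hat{f}^{\bullet}_{\leftarrow}) - \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(\hat{f}^{\bullet}_{\leftarrow})
 + \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(f^{*}) - R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(f^{*})
 + \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(\hat{f}^{\bullet}_{\leftarrow}) - \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(f^{*}) \\
&\le 2 \max_{f \in \mathcal{F}} \left| \hat{R}_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(f) - R_{\ell_{\leftarrow},\tilde{\mathcal{D}}^{\bullet}}(f) \right| \\
&\le 4 L^{\bullet}_{\leftarrow} \mathfrak{R}(\mathcal{F}) + 2 L^{\bullet}_{\leftarrow 0} \cdot (\overline{\ell} - \underline{\ell}) \cdot \sqrt{\frac{\log(1/\delta)}{2N}}.
\end{aligned}
]]></tex-math></disp-formula>
Denoting <inline-formula><tex-math>\eta^{\circ}_{K} = \frac{K \log(1/\delta)}{2 (\log(\frac{K+1}{\delta}))^{2}}</tex-math></inline-formula> and <inline-formula><tex-math>\eta^{\bullet}_{K} \equiv 1</tex-math></inline-formula>, we then have, for <inline-formula><tex-math>u \in \{\circ, \bullet\}</tex-math></inline-formula>:
<disp-formula><tex-math><![CDATA[
R_{\ell,\mathcal{D}}(\hat{f}^{u}_{\leftarrow}) - R_{\ell,\mathcal{D}}(f^{*}) \le 4 L^{u}_{\leftarrow} \mathfrak{R}(\mathcal{F}) + L^{u}_{\leftarrow 0} \cdot (\overline{\ell} - \underline{\ell}) \cdot \sqrt{\frac{2 \log(1/\delta)}{\eta^{u}_{K} N}}.
]]></tex-math></disp-formula></p>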
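        <p>The role of the weight on the sample complexity can be checked numerically. The sketch below assumes that the separation weight is K·log(1/δ) / (2·log((K+1)/δ)^2) and that the aggregation weight is 1, and it verifies that the separation term 2·log((K+1)/δ)·sqrt(1/(NK)) coincides with the unified form sqrt(2·log(1/δ)/(η·N)). The common factor made of the correction constant and the loss range is left out, and the values of N, K, and δ are hypothetical.</p>
        <preformat>
```python
import math


def eta_separation(K, delta):
    """Separation weight: K * log(1/delta) / (2 * log((K+1)/delta)^2)."""
    return K * math.log(1 / delta) / (2 * math.log((K + 1) / delta) ** 2)


def deviation_two_stage(N, K, delta):
    """Separation term before introducing eta: 2 * log((K+1)/delta) * sqrt(1/(N*K))."""
    return 2 * math.log((K + 1) / delta) * math.sqrt(1 / (N * K))


def deviation_unified(N, eta, delta):
    """Unified term sqrt(2 * log(1/delta) / (eta * N)); eta = 1 gives aggregation."""
    return math.sqrt(2 * math.log(1 / delta) / (eta * N))


N, K, delta = 1000, 5, 0.05  # hypothetical values
sep_direct = deviation_two_stage(N, K, delta)
sep_unified = deviation_unified(N, eta_separation(K, delta), delta)
agg = deviation_unified(N, 1.0, delta)
```
        </preformat>
        <p>For these values the separation weight is below 1, so the separation term is larger than the aggregation term before the noise-rate-dependent constants are taken into account. Whether separation wins overall depends on how much smaller its correction constant is.</p>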
      </sec>
      <sec id="sec-6-2">
        <title>A.2. Proof of Theorem 4.2</title>
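        <p>Proof. The proof is straightforward if we follow the proof of Lemma 4.1 together with the
discussion below. Given the noise rates of both methods, recall that <inline-formula><tex-math>L^{u}_{\leftarrow 0} = \frac{1 + |\rho^{u}_0 - \rho^{u}_1|}{1 - \rho^{u}_0 - \rho^{u}_1}</tex-math></inline-formula>; we have: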

√︂ log(1/ )
2
Δ←
&lt; Δ∙←</p>
        <p>2← R(ℱ ) + ← · (ℓ − ℓ) · log(
2
2
2
2
← − ∙
ℓ</p>
        <p>− ℓ
← ·  · R(ℱ ) &lt; ∙← ·
 + 1</p>
        <p>)</p>
        <p>·
2
√︂ 1
 
− ← · log(
√︃ 1 )︃ √︂ log(1/ )
·
√︂ 1
 
← ·  · R(ℱ ) &lt; ∙← − ← ·
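For any finite concept class <inline-formula><tex-math>\mathcal{F} \subset \{f : X \to \{0, 1\}\}</tex-math></inline-formula> and the sample set <inline-formula><tex-math>S = \{x_1, ..., x_N\}</tex-math></inline-formula>, the
Rademacher complexity is upper bounded by
<inline-formula><tex-math>\sqrt{\frac{2 d \log(N)}{N}}</tex-math></inline-formula>, where <inline-formula><tex-math>d</tex-math></inline-formula> is the VC dimension of <inline-formula><tex-math>\mathcal{F}</tex-math></inline-formula>.</p>
        <p>To achieve <inline-formula><tex-math>\Delta^{\circ}_{\leftarrow} &lt; \Delta^{\bullet}_{\leftarrow}</tex-math></inline-formula>, we simply need to find the condition on <inline-formula><tex-math>K</tex-math></inline-formula> (or <inline-formula><tex-math>\eta^{\circ}_{K}</tex-math></inline-formula>) that satisfies the
inequality below: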
Δ←
&lt; Δ∙←
=⇒
2
√︃</p>
        <p>2

&lt;
∙← − ← ·
√︃ 1 )︃</p>
        <p />
      </sec>
      <sec id="sec-6-3">
        <title>A.4. Proof for Corollary 4.4</title>
        <p>For a general matrix  = ()− 1, we firstly note</p>
        <p>∑︁
∈[],&gt;0
| + | min
∈[]</p>
        <p>∑︁
∈[],&lt;0
|.
. Then
Recall 1 = 1 ⇒ 1 = ()− 11. We know the above maximum and minimum take the same</p>
        <p>∑︁
∈[],&gt;0
| + | min
∈[]</p>
        <p>∑︁
∈[],&lt;0
≤ 1 − 2</p>
        <p>1
≤ min∈[] ( − ∑︀</p>
        <p≯= )
,  := m∈[ax](1 − ),  &lt; 0.5.</p>
        <sec id="sec-6-3-1">
          <title>Now we prove the inequality () [76]. Let  satisfy</title>
          <p>and let  = ()− 1 . Then
‖()− 1‖∞ = ‖()− 1 ‖∞/‖ ‖∞</p>
          <p>‖()− 1‖∞ = ‖ ‖∞/‖ ‖∞
To bound ‖ ‖, we choose  such that   = ‖ ‖∞. Then
  =   −
∑︁    ,
̸=
which further gives
Therefore,
and
The term of distribution shift can be upper bounded by:
=E(, )∼
[︁
max ⃒ E(, )∼ [ℓ( (),  )] −
∈ℱ ⃒
= max ⃒ E(, =1)∼ [ℓ( (), 1)] + E(, =0)∼ [ℓ( (), 0)]
∈ℱ ⃒</p>
          <p>E(,̃︀)∼ ̃︀
[︁</p>
          <p>ℓ(^ (), ̃︀)]︁
E(,̃︀)∼ ̃︀
[︁
ℓ( (), ̃︀)]︁⃒⃒</p>
          <p>⃒
∑︁
̸=
||‖ ‖∞ ≤ |  | +</p>
          <p>| ||  | ≤ |  | + ‖ ‖∞
‖ ‖∞ ≤
 −
| |</p>
          <p>,
‖( )− 1
‖∞ = ‖ ‖∞/‖ ‖∞ ≤</p>
          <p>−
1
← −
1← ≤‖  ‖∞ ≤
  max( ) =
∑︁
̸=</p>
          <p>| |.
1
 min( )
.
,</p>
          <p>On the other hand, denoting by ‖ ‖max := max,∈[] | |, from eigenvalues, we know
where  min( ) denotes the minimal eigenvalue of the matrix  . Therefore,
1
← −
1 = ∘</p>
          <p>← 0 = min{
←
1 −
1</p>
          <p>,
2max  min( ) }
the matrix  .
where  := max∈[](1 − ),  &lt; 0.5, and  min( ) denotes the minimal eigenvalue of</p>
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>A.5. Proof of Lemma 3.2</title>
        <p>Proof. Note that for ^ = ^ , we have:
∈ℱ
= ℓ,(^ ) − 
 (^ ) + 
min ℓ,( ) = ℓ,(^ ) − ℓ,( * )
min 
∈ℱ
∈ℱ</p>
        <p>Estimation error
⃒
= max ⃒ E(, =1)∼ [ℓ( (), 1)] + E(, =0)∼ [ℓ( (), 0)]</p>
        <p>Combine similar terms, we then have:
=(1 1 + 0 0) · (︀ ℓ − ℓ︀) .</p>
        <p>Thus, we have:
[︁
[︁</p>
        <p>P(̃︀ = 1| = 1) · ℓ( (), 1)]︁ − E(, =1)∼
P(̃︀ = 1| = 0) · ℓ( (), 1)]︁ − E(, =0)∼
[︁
[︁</p>
        <p>P(̃︀ = 0| = 1) · ℓ( (), 0)]︁</p>
        <p>P(̃︀ = 0| = 0) · ℓ( (), 0) ⃒ .
[︁</p>
        <p>P(̃︀ = 0| = 1) · ℓ( (), 1)]︁ + E(, =0)∼
[︁</p>
        <p>P(̃︀ = 1| = 0) · ℓ( (), 0)]︁
[︁</p>
        <p>P(̃︀ = 0| = 1) · ℓ( (), 0)]︁ − E(, =0)∼
[︁</p>
        <p>P(̃︀ = 1| = 0) · ℓ( (), 1)]︁ ⃒⃒
]︁⃒
⃒
⃒
⃒
+ E(, =0)∼
∈ℱ ⃒
ℓ,(^ ) − ℓ,̃︀
 (^ ) ≤
Proof. For the term Estimation error, we have:</p>
        <p>∈ℱ
Estimation error
 ( ) + |min ℓ,̃︀
∈ℱ
 ( ) − ℓ,( * )
 ( ) − ℓ,( * )|</p>
        <p>Error 2</p>
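        <p>The upper bound of Error 1 can be derived directly from the proof of Lemma 4.1: since
the loss function makes no use of loss correction, the Lipschitz constant <inline-formula><tex-math>L</tex-math></inline-formula> does not have to be
multiplied by the correction constant, i.e., <inline-formula><tex-math>L^{u}_{\leftarrow} \to L</tex-math></inline-formula>. Besides, the constant for the variance term (square
term) reduces to <inline-formula><tex-math>(\overline{\ell} - \underline{\ell})</tex-math></inline-formula>. Thus, we have:</p>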
        <p>Error 1 ≤ 4R(ℱ) + (ℓ − ℓ) ·
 
, ∀ ∈ ℱ.
Proof. To achieve a smaller upper bound for the separation method, mathematically, we want:
4R(ℱ) + (ℓ − ℓ) ·
≤ 4R(ℱ) + (ℓ − ℓ) ·
1− ( )− 21 ≤ , where   := ( 00 +  11) − ( ∙00 +  ∙11),  = √︀log(1/ )/2.
1
( 00+ 11)− ( ∙00+ ∙11), which is mentioned as   ·</p>
        <p>(1− ( )− 21)</p>
      </sec>
      <sec id="sec-6-5">
        <title>A.8. Proof of Theorem 3.6</title>
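        <p>Proof. For <inline-formula><tex-math>u \in \{\circ, \bullet\}</tex-math></inline-formula>, we have:
<disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathrm{Var}(\hat{f}^{u})
&= \mathbb{E}_{(X,\tilde{Y})\sim\tilde{\mathcal{D}}^{u}} \Big[ \ell(\hat{f}^{u}(X), \tilde{Y}) - \mathbb{E}_{(X,\tilde{Y})\sim\tilde{\mathcal{D}}^{u}}[\ell(\hat{f}^{u}(X), \tilde{Y})] \Big]^{2} \\
&= \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \Big[ \big[\ell(\hat{f}^{u}(X), \tilde{Y})\big]^{2} + \big[\mathbb{E}_{\tilde{\mathcal{D}}^{u}}[\ell(\hat{f}^{u}(X), \tilde{Y})]\big]^{2} - 2\,\ell(\hat{f}^{u}(X), \tilde{Y})\,\mathbb{E}_{\tilde{\mathcal{D}}^{u}}[\ell(\hat{f}^{u}(X), \tilde{Y})] \Big] \\
&= \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \big[\ell(\hat{f}^{u}(X), \tilde{Y})\big]^{2} - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big)^{2}.
\end{aligned}
]]></tex-math></disp-formula></p>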
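        <p>A special case is the 0-1 loss, i.e., <inline-formula><tex-math>\ell(\cdot) = \mathbb{1}(\cdot)</tex-math></inline-formula>; we then have:
<disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathrm{Var}(\hat{f}^{u})
&= \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \big[\ell(\hat{f}^{u}(X), \tilde{Y})\big]^{2} - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big)^{2} \\
&= \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \big[\ell(\hat{f}^{u}(X), \tilde{Y})\big] - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big)^{2} \\
&= R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}) - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big)^{2} = g\big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big),
\end{aligned}
]]></tex-math></disp-formula></p>
        <p>where <inline-formula><tex-math>R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}) \in [0, 1]</tex-math></inline-formula> and <inline-formula><tex-math>g(a) = a - a^{2}</tex-math></inline-formula> is monotonically increasing when <inline-formula><tex-math>a &lt; \frac{1}{2}</tex-math></inline-formula>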
. Thus,
 (^ )</p>
        <p>ℓ) ·
reduces to 1
log(1/ )
2  
1
2 log(1/ )
,
we could derive Var(^ )
√︁ 2 log(1/ )</p>
        <p>).</p>
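        <p>The 0-1 loss case can be illustrated with a small sketch. It checks, on a hypothetical list of 0-1 losses, that the variance equals g(a) = a − a^2 evaluated at the empirical risk a, and that g is increasing on a grid over [0, 0.5]. The helper names are illustrative only.</p>
        <preformat>
```python
def g(a):
    """Variance of a 0-1 loss whose expectation (risk) is a: g(a) = a - a^2."""
    return a - a * a


def empirical_variance(losses):
    """Population variance of a list of 0-1 losses."""
    mean = sum(losses) / len(losses)
    return sum((x - mean) ** 2 for x in losses) / len(losses)


# Hypothetical 0-1 losses with empirical risk 0.25.
losses = [1, 0, 0, 0, 1, 0, 0, 0]
risk = sum(losses) / len(losses)
variance = empirical_variance(losses)

# g is increasing on [0, 0.5], so a smaller risk bound gives a smaller variance bound.
grid = [i / 20 for i in range(11)]
increasing = all(g(b) > g(a) for a, b in zip(grid, grid[1:]))
```
        </preformat>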
      </sec>
      <sec id="sec-6-6">
        <title>A.9. Proof of Corollary 3.7</title>
        <p>
          Proof. In the multi-class extension, the only diference is the upper bound of the Distribution
Shift term in Eqn. (
          <xref ref-type="bibr" rid="ref11">11</xref>
          ), which now becomes:
ℓ,(^ )
=E(, )∼
[︁
        </p>
        <p>]︁
ℓ(^ ( ),  ) −
max ⃒ E(, )∼
= max ⃒⃒ ⎣
 ∈ℱ ⃒
⃒ ⎡
⃒ ∈[ ]
⃒ ⎡
= m∈aℱx ⃒⃒⃒ ⎣
= m∈aℱx ⃒⃒⃒ ⎣
⃒ ⎡
∈[ ]
∈[ ]</p>
        <p>E
(,̃︀ )
∼ ̃︀
[︁</p>
        <p>ℓ(^ ( ), ̃︀ )]︁
[ℓ( ( ),  )] −</p>
        <p>E
(,̃︀ )
∼ ̃︀
[︁</p>
        <p>ℓ( ( ), ̃︀ )]︁ ⃒⃒
[︃
[︁
E(, =)∼</p>
        <p>ℓ( ( ), )⎦⎦ − ⎣
E(, =)∼
[ℓ( ( ), )]⎦ − ⎣</p>
        <p>∑︁
∈[ ] ∈[ ]
E(, =)∼</p>
        <p>P(̃︀  ̸= | = ) · ℓ( ( ), )
⃒
⃒
⃒
E(, =)∼
]︁
⎦ − ⎣</p>
        <p>⎡
(,̃︀ )
∼ ̃︀, =
[︁
ℓ( ( ), ̃︀ )]︁ ⃒⃒
⎤ ⃒
⎦ ⃒</p>
        <p>⃒
[︁</p>
        <p>P(̃︀  = | = ) · ℓ( ( ), )</p>
        <p>⎤ ⃒
]︁ ⃒
⎦ ⃒
⃒
⃒
∑︁</p>
        <p>∑︁
∈[ ],̸= ∈[ ]</p>
        <p>E(, =)∼
[︁</p>
        <p>P(̃︀  = | =
P(̃︀  ̸= | = ) · ℓ( (), ) −</p>
        <p>P(̃︀  = | = ) · ℓ( (), ) ⃒
P(̃︀  ̸= | = ) · (︀ ℓ − ℓ︀) ⃒⃒
]︃⃒
⃒
⃒
]︃⃒
⃒
⃒
= max ⃒⃒ ∑︁ E(, =)∼
≤ m∈aℱx ⃒⃒⃒⃒ ∑∈[︁] E(, =)∼
(Assumed uniform prior)
= ∑︁ P( = ) · (1 − , ) ︀( ℓ − ℓ︀) .
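Proof. The proof of Lemma 4.5 builds on Theorem 7 in [42]: the performance bound for
aggregation methods is the special case of Theorem 7 in [42] (adopting <inline-formula><tex-math>\alpha^{*} = 1</tex-math></inline-formula> as defined in [42]).
As for separation methods, the only difference lies in the appearance of the weight
on the sample complexity, <inline-formula><tex-math>\eta_{K}</tex-math></inline-formula>. Thus, we have:
<disp-formula><tex-math><![CDATA[
R_{\ell,\mathcal{D}}(\hat{f}^{u}_{\looparrowright}) - R_{\ell,\mathcal{D}}(f^{*}) \le \Delta^{u}_{\looparrowright},
\quad \text{where } \Delta^{u}_{\looparrowright} := 8 L^{u}_{\looparrowright} \mathfrak{R}(\mathcal{F}) + L^{u}_{\looparrowright 0} \sqrt{\frac{2 \log(4/\delta)}{\eta^{u}_{K} N}} \left( 1 + 2(\overline{\ell} - \underline{\ell}) \right).
]]></tex-math></disp-formula></p>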
        <p>8R(ℱ ) +</p>
        <p>√︃ 2 log(4/ ) ︀( 1 + 2(¯ℓ − ℓ))︀
 
√︃ 1 ]︃ 4√︂ log(4/ ) ︀( 1 + 2(¯ℓ − ℓ))︀
2</p>
      </sec>
      <sec id="sec-6-7">
        <title>A.11. Proof of Theorem 4.6</title>
        <p>Proof. Denote by Δ↬ := 18−R0(−ℱ )1 +
we require Δ↬ &lt; Δ∙↬, which is equivalent to:
4√︂ lo2g(4/) (1+2(¯ℓ− ℓ))</p>
        <p>1−  0−  1
8R(ℱ )
4
√︁ log(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀
2</p>
        <p>8R(ℱ )
1 −  ∙0 −  ∙1
which is further equivalent to:</p>
        <p>8R(ℱ )
1 −  0 −  1 −</p>
        <p>8R(ℱ )
1 −  ∙0 −  ∙1
4
√︁ log(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀
2
1 −  ∙0 −  ∙1
[︃
, in order to achieve Δ↬ &lt; Δ∙↬,
+
−
4
√︁ log(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀
2
1 −  ∙0 −  ∙1
4
√︁ log(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀
2</p>
        <p>Note that both 1 −  0 −  1 and 1 −  ∙0 −  ∙1 are positive, the above requirement then reduces to:
[( 0 +  1) − ( ∙0 +  ∙1)]8R(ℱ ) &lt; (1 −  0 −  1) − (1 −  ∙0 −  ∙1)
 
&lt; ((11 −−  ∙00 −−  ∙11)) −</p>
        <p>[( 0 +  1) − ( ∙0 +  ∙1)]8√︁ 2 log()
4(1 −  ∙0 −  ∙1)√︁ log2(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀
.</p>
        <p>Denote by   := 1 − ∙↬/↬,  = 1+22(¯ℓ− ℓ) √︁ 4lo glo(g4(/)) . The above condition is satisfied if
and only if
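Proof. Similar to the proof of Theorem 3.6, for <inline-formula><tex-math>u \in \{\circ, \bullet\}</tex-math></inline-formula>, we have:</p>
        <p><disp-formula><tex-math><![CDATA[
\mathrm{Var}(\hat{f}^{u}_{\looparrowright}) = \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \big[\ell(\hat{f}^{u}_{\looparrowright}(X), \tilde{Y})\big]^{2} - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}_{\looparrowright})\big)^{2}.
]]></tex-math></disp-formula></p>
        <p>A special case is the 0-1 loss, i.e., <inline-formula><tex-math>\ell(\cdot) = \mathbb{1}(\cdot)</tex-math></inline-formula>; we then have:
<disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathrm{Var}(\hat{f}^{u}_{\looparrowright})
&= \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \big[\ell(\hat{f}^{u}_{\looparrowright}(X), \tilde{Y})\big]^{2} - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}_{\looparrowright})\big)^{2} \\
&= \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \big[\ell(\hat{f}^{u}_{\looparrowright}(X), \tilde{Y})\big] - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}_{\looparrowright})\big)^{2} \\
&= R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}_{\looparrowright}) - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}_{\looparrowright})\big)^{2} = g\big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}_{\looparrowright})\big),
\end{aligned}
]]></tex-math></disp-formula>
where <inline-formula><tex-math>R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}_{\looparrowright}) \in [0, 1]</tex-math></inline-formula> and <inline-formula><tex-math>g(a) = a - a^{2}</tex-math></inline-formula> is monotonically increasing when <inline-formula><tex-math>a &lt; \frac{1}{2}</tex-math></inline-formula>. Note
that: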
 (^ ↬) &lt;
⇐⇒
√︀  ≥
√︂ 2 log(4/ ) 1 + 2(¯ℓ − ℓ) ,
 1 −  0 −  1
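we have:
<disp-formula><tex-math><![CDATA[
\mathrm{Var}(\hat{f}^{u}_{\looparrowright}) \le g\left( \sqrt{\frac{\log(4/\delta)}{2 \eta^{u}_{K} N}} \cdot \frac{1 + 2(\overline{\ell} - \underline{\ell})}{1 - \rho^{u}_0 - \rho^{u}_1} \right).
]]></tex-math></disp-formula>
To achieve <inline-formula><tex-math>\mathrm{Var}(\hat{f}^{\circ}_{\looparrowright}) &lt; \mathrm{Var}(\hat{f}^{\bullet}_{\looparrowright})</tex-math></inline-formula>, we simply
need:
<disp-formula><tex-math><![CDATA[
\sqrt{\frac{\log(4/\delta)}{2 \eta^{\circ}_{K} N}} \cdot \frac{1 + 2(\overline{\ell} - \underline{\ell})}{1 - \rho^{\circ}_0 - \rho^{\circ}_1}
\le \sqrt{\frac{\log(4/\delta)}{2 \eta^{\bullet}_{K} N}} \cdot \frac{1 + 2(\overline{\ell} - \underline{\ell})}{1 - \rho^{\bullet}_0 - \rho^{\bullet}_1}
\iff \sqrt{\eta^{\circ}_{K}} \ge \frac{L^{\circ}_{\looparrowright 0}}{L^{\bullet}_{\looparrowright 0}}.
]]></tex-math></disp-formula></p>
        <p>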
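Proof. Regarding the multi-class extension of Lemma 4.5, the only difference lies in the
constant <inline-formula><tex-math>L_{\looparrowright 0}</tex-math></inline-formula>. The following Lemma A.1 helps us find the multi-class form of <inline-formula><tex-math>L_{\looparrowright 0}</tex-math></inline-formula>.</p>
        <p>Lemma A.1. Assume the clean label <inline-formula><tex-math>Y</tex-math></inline-formula> has equal prior <inline-formula><tex-math>P(Y = i) = \frac{1}{M}</tex-math></inline-formula>, <inline-formula><tex-math>\forall i \in [M]</tex-math></inline-formula>. For the
uniform noise transition matrix [44] such that <inline-formula><tex-math>T_{i,j} = e_j</tex-math></inline-formula>, <inline-formula><tex-math>\forall i \in [M]</tex-math></inline-formula>, <inline-formula><tex-math>i \ne j</tex-math></inline-formula>, the expected <inline-formula><tex-math>\ell_{\looparrowright}</tex-math></inline-formula> in the
multi-class setting is invariant to label noise up to an affine transformation:
<disp-formula><label>(13)</label><tex-math><![CDATA[
\mathbb{E}_{(X,\tilde{Y})\sim\tilde{\mathcal{D}}} [\ell_{\looparrowright}(f(X), \tilde{Y})] = \Big( 1 - \sum_{j\in[M]} e_j \Big)\, \mathbb{E}_{\mathcal{D}}[\ell_{\looparrowright}(f(X), Y)].
]]></tex-math></disp-formula></p>
        <p>Proof of Lemma A.1. Recall that <inline-formula><tex-math>\mathcal{D}</tex-math></inline-formula> and <inline-formula><tex-math>\tilde{\mathcal{D}}</tex-math></inline-formula> refer to the joint distributions over <inline-formula><tex-math>(X, Y)</tex-math></inline-formula> and
<inline-formula><tex-math>(X, \tilde{Y})</tex-math></inline-formula>, respectively. We further denote the marginal distributions of <inline-formula><tex-math>X</tex-math></inline-formula>, <inline-formula><tex-math>Y</tex-math></inline-formula>, and <inline-formula><tex-math>\tilde{Y}</tex-math></inline-formula> by <inline-formula><tex-math>\mathcal{D}_X</tex-math></inline-formula>,
<inline-formula><tex-math>\mathcal{D}_Y</tex-math></inline-formula>, and <inline-formula><tex-math>\tilde{\mathcal{D}}_{\tilde{Y}}</tex-math></inline-formula>, respectively. Let <inline-formula><tex-math>X_p \sim \mathcal{D}_X</tex-math></inline-formula> and <inline-formula><tex-math>\tilde{Y}_p \sim \tilde{\mathcal{D}}_{\tilde{Y}}</tex-math></inline-formula> be the random variables corresponding
to the peer samples. The peer loss function is defined as
<disp-formula><label>(14)</label><tex-math><![CDATA[
\ell_{\looparrowright}(f(x_n), \tilde{y}_n) = \ell(f(x_n), \tilde{y}_n) - \ell(f(x_{n_1}), \tilde{y}_{n_2}),
]]></tex-math></disp-formula>
where <inline-formula><tex-math>(x_n, \tilde{y}_n)</tex-math></inline-formula> is a normal training sample pair, and <inline-formula><tex-math>x_{n_1}</tex-math></inline-formula> and <inline-formula><tex-math>\tilde{y}_{n_2}</tex-math></inline-formula> are the corresponding peer samples.</p>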
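        <p>Before the derivation, the identity claimed in Lemma A.1 can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it draws a random table standing for the conditional expected losses, assumes an equal clean prior, and assumes a transition matrix whose off-diagonal entries in column j all equal a noise rate e_j. The function name, the number of classes M, and all numeric values are hypothetical.</p>
        <preformat>
```python
import random


def check_peer_loss_invariance(M=4, seed=0):
    """Return both sides of the Lemma A.1 identity for a random loss table."""
    rng = random.Random(seed)
    # loss[i][j] stands for E_{X|Y=i}[l(f(X), j)]; hypothetical random values in [0, 1].
    loss = [[rng.random() for _ in range(M)] for _ in range(M)]
    # Noise rates e_j with total mass below 1; T[i][j] = e_j whenever i != j.
    e = [rng.random() / (2 * M) for _ in range(M)]
    T = [[e[j] if i != j else 1 - sum(e) + e[i] for j in range(M)] for i in range(M)]
    prior = 1 / M  # equal clean prior P(Y = i)

    # E_X[l(f(X), j)] under the marginal feature distribution.
    marg = [sum(prior * loss[i][j] for i in range(M)) for j in range(M)]

    # Clean peer loss: E_D[l(f(X), Y)] minus the peer term with clean labels.
    clean_first = sum(prior * loss[i][i] for i in range(M))
    clean_peer = clean_first - sum(prior * marg[j] for j in range(M))

    # Noisy peer loss: the same two terms with noisy labels.
    noisy_first = sum(prior * T[i][j] * loss[i][j] for i in range(M) for j in range(M))
    noisy_prior = [sum(prior * T[i][j] for i in range(M)) for j in range(M)]
    noisy_peer = noisy_first - sum(noisy_prior[j] * marg[j] for j in range(M))

    return noisy_peer, (1 - sum(e)) * clean_peer


lhs, rhs = check_peer_loss_invariance()
```
        </preformat>
        <p>The two returned values agree up to floating-point error, which matches the affine relation with factor 1 minus the sum of the noise rates.</p>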
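        <p>Taking the expectation of (14) yields
<disp-formula><label>(15)</label><tex-math><![CDATA[
\mathbb{E}_{\tilde{\mathcal{D}}} [\ell_{\looparrowright}(f(X), \tilde{Y})] = \mathbb{E}_{\tilde{\mathcal{D}}} [\ell(f(X), \tilde{Y})] - \mathbb{E}_{\tilde{\mathcal{D}}_{\tilde{Y}}} \Big[ \mathbb{E}_{\mathcal{D}_X} [\ell(f(X_p), \tilde{Y}_p)] \Big].
]]></tex-math></disp-formula></p>
        <p>The first term in (15) is
<disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathbb{E}_{\tilde{\mathcal{D}}} [\ell(f(X), \tilde{Y})]
&= \sum_{i\in[M]} \sum_{j\in[M]} T_{i,j} \cdot P(Y = i) \cdot \mathbb{E}_{\mathcal{D}_{X|Y=i}}[\ell(f(X), j)] \\
&= \sum_{i\in[M]} \Big[ T_{i,i} \cdot P(Y = i) \cdot \mathbb{E}_{\mathcal{D}_{X|Y=i}}[\ell(f(X), i)] + \sum_{j\ne i,\, j\in[M]} T_{i,j} \cdot P(Y = i) \cdot \mathbb{E}_{\mathcal{D}_{X|Y=i}}[\ell(f(X), j)] \Big].
\end{aligned}
]]></tex-math></disp-formula></p>
        <p>Accordingly, noting that <inline-formula><tex-math>X_p</tex-math></inline-formula> and <inline-formula><tex-math>\tilde{Y}_p</tex-math></inline-formula> are independent, the second term in (15) is
<disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathbb{E}_{\tilde{\mathcal{D}}_{\tilde{Y}}} \Big[ \mathbb{E}_{\mathcal{D}_X} [\ell(f(X_p), \tilde{Y}_p)] \Big]
&= \sum_{j\in[M]} P(\tilde{Y} = j) \cdot \mathbb{E}_{\mathcal{D}_X} [\ell(f(X), j)] \\
&= \sum_{j\in[M]} \sum_{i\in[M]} T_{i,j} \cdot P(Y = i) \cdot \mathbb{E}_{\mathcal{D}_X} [\ell(f(X), j)] \\
&= \sum_{j\in[M]} \Big[ T_{j,j} \cdot P(Y = j) \cdot \mathbb{E}_{\mathcal{D}_X} [\ell(f(X), j)] + \sum_{i\in[M],\, i\ne j} T_{i,j} \cdot P(Y = i) \cdot \mathbb{E}_{\mathcal{D}_X}[\ell(f(X), j)] \Big].
\end{aligned}
]]></tex-math></disp-formula></p>
        <p>In this case, we have <inline-formula><tex-math>T_{i,j} = e_j</tex-math></inline-formula>, <inline-formula><tex-math>\forall i, j \in [M]</tex-math></inline-formula>, <inline-formula><tex-math>i \ne j</tex-math></inline-formula>. The first term becomes
<disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(f(X), \tilde{Y})]
&= \sum_{i\in[M]} \Big[ \Big( 1 - \sum_{j\ne i,\, j\in[M]} e_j \Big) \cdot P(Y = i) \cdot \mathbb{E}_{\mathcal{D}_{X|Y=i}}[\ell(f(X), i)] + \sum_{j\in[M],\, j\ne i} e_j \cdot P(Y = i) \cdot \mathbb{E}_{\mathcal{D}_{X|Y=i}}[\ell(f(X), j)] \Big] \\
&= \Big( 1 - \sum_{j\in[M]} e_j \Big) \cdot \mathbb{E}_{\mathcal{D}}[\ell(f(X), Y)] + \sum_{j\in[M]} e_j \cdot \mathbb{E}_{\mathcal{D}_X}[\ell(f(X), j)].
\end{aligned}
]]></tex-math></disp-formula>
The second term becomes
<disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathbb{E}_{\tilde{\mathcal{D}}_{\tilde{Y}}} \Big[ \mathbb{E}_{\mathcal{D}_X}[\ell(f(X_p), \tilde{Y}_p)] \Big]
&= \sum_{j\in[M]} \Big[ \Big( 1 - \sum_{i\ne j,\, i\in[M]} e_i \Big) \cdot P(Y = j) \cdot \mathbb{E}_{\mathcal{D}_X}[\ell(f(X), j)] + \sum_{i\in[M],\, i\ne j} e_j \cdot P(Y = i) \cdot \mathbb{E}_{\mathcal{D}_X}[\ell(f(X), j)] \Big] \\
&= \Big( 1 - \sum_{j\in[M]} e_j \Big) \cdot \mathbb{E}_{\mathcal{D}_Y} \Big[ \mathbb{E}_{\mathcal{D}_X}[\ell(f(X_p), Y_p)] \Big] + \sum_{j\in[M]} e_j \cdot \mathbb{E}_{\mathcal{D}_X}[\ell(f(X), j)].
\end{aligned}
]]></tex-math></disp-formula>
Comparing the above two terms, we have:
<disp-formula><tex-math><![CDATA[
\mathbb{E}_{\tilde{\mathcal{D}}}[\ell_{\looparrowright}(f(X), \tilde{Y})] = \Big( 1 - \sum_{j\in[M]} e_j \Big)\, \mathbb{E}_{\mathcal{D}}[\ell_{\looparrowright}(f(X), Y)].
]]></tex-math></disp-formula></p>
        <p>Thus, substituting <inline-formula><tex-math>L_{\looparrowright 0} := \frac{1}{1 - \rho_0 - \rho_1}</tex-math></inline-formula> by <inline-formula><tex-math>\frac{1}{1 - \sum_{j\in[M]} e_j}</tex-math></inline-formula>, the proof of Corollary 4.8 is finished if we repeat
the corresponding proof of the binary task.</p>
        <p>B. Additional Results and Details</p>
        <p>B.1. Experiment Details on UCI Datasets</p>
        <p>Datasets</p>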
        <p>In this paper, we conducted experiments on two binary (Breast and German) and
two multi-class (StatLog and Optical) UCI classification datasets. For the training/testing split,
the original split is used when separate training and testing files are provided. The
remaining datasets provide only one data file, for which we adopt a 50/50 split so that more data is
allocated to the testing set, which improves the statistical significance of the testing results. More specifically, the
numbers of (training, testing) samples in the Breast, German, StatLog, and Optical datasets are (285,
284), (500, 500), (4435, 2000), and (3823, 1797), respectively.</p>
        <p>Generating the noisy labels on UCI datasets. For each UCI dataset adopted in this paper, the
label of each sample in the training dataset is flipped to another class with a probability
equal to the noise rate. For the multi-class classification datasets, the class that a label is flipped to
is selected uniformly at random from the other classes. For the binary and multi-class classification datasets,
the noise rates (0.1, 0.2, 0.3, 0.4) and (0.2, 0.4, 0.6, 0.8) are used, respectively.</p>
        <p>Implementation details. We implemented a simple two-layer ReLU Multi-Layer Perceptron
(MLP) for the classification task on these four UCI datasets. The Adam optimizer is used with a
learning rate of 0.001 and a batch size of 128.</p>
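        <p>The noise-generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the experiments: each label is flipped with probability equal to the noise rate, the new class is drawn uniformly from the other classes, and each of the K simulated labelers uses its own random seed. The function names and the toy labels are hypothetical.</p>
        <preformat>
```python
import random


def flip_labels(labels, num_classes, noise_rate, seed=0):
    """Flip each label with probability noise_rate to a uniformly chosen other class."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if noise_rate > rng.random():
            others = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(others))
        else:
            noisy.append(y)
    return noisy


def annotate(labels, num_classes, noise_rate, num_labelers):
    """Simulate K independent labelers; entry [n][k] is labeler k's label for sample n."""
    per_labeler = [flip_labels(labels, num_classes, noise_rate, seed=k)
                   for k in range(num_labelers)]
    return [list(ys) for ys in zip(*per_labeler)]


clean = [0, 1, 2, 3] * 250  # hypothetical clean labels, 4 classes
noisy = annotate(clean, num_classes=4, noise_rate=0.4, num_labelers=5)
```
        </preformat>
        <p>The resulting list holds K separate noisy labels per training sample and can be passed either to an aggregation rule or used as is for label separation.</p>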
        <p>B.2. Detailed Results on UCI Datasets</p>
        <p>In Table 6, we highlight the results in green (for the separation method) and red (for the aggregation
methods) if the performance gap is larger than 0.05. Clearly, the label separation method
outperforms both aggregation methods (majority vote and EM inference) consistently on the
StatLog and Optical datasets. For the two binary tasks (Breast and German), aggregation
methods tend to outperform label separation, and we attribute this phenomenon to the fact that
the “denoising effect of label aggregation is more significant in the binary case”.</p>
        <p>B.3. Experiment Details on CIFAR-10 Datasets</p>
        <p>The generation of the symmetric noisy dataset is adopted from [44]. As for the
instance-dependent label noise, the generating algorithm follows the state-of-the-art method [77]. Both
cases adopt the noise rates [0.2, 0.4, 0.6, 0.8]. The basic hyper-parameter settings for all methods
are as follows: mini-batch size (128), optimizer (SGD), initial learning rate (0.1), momentum
(0.9), weight decay (0.0005), number of epochs (120), and learning rate decay (0.1 at 50 epochs).
Standard data augmentation is applied to each dataset. All experiments run on 8 Nvidia RTX
A5000 GPUs.</p>
        <p>B.4. Detailed Results on CIFAR-10 Dataset</p>
        <p>Table 7 includes all the detailed accuracy values that appear in Figure 5. The results on
the synthetic noisy CIFAR-10 dataset align well with the theoretical observations: label
separation is preferred over label aggregation when the noise rates are high or the number of
labelers/annotations is insufficient.</p>
        <p>[Figure panels: UCI-StatLog, UCI-Optical, UCI-pop failures, and UCI-forest fire with symmetric noise (CE, BW, and PeerLoss), and CIFAR-10 with symmetric noise (BW); each panel is shown for K = 5, 9, 15, and, where available, 25.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Estellés-Arolas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>González-Ladrón-de Guevara</surname>
          </string-name>
          ,
          <article-title>Towards an integrated crowdsourcing definition</article-title>
          ,
          <source>Journal of Information Science</source>
          <volume>38</volume>
          (
          <year>2012</year>
          )
          <fpage>189</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Howe</surname>
          </string-name>
          , et al.,
          <article-title>The rise of crowdsourcing</article-title>
          ,
          <source>Wired magazine</source>
          <volume>14</volume>
          (
          <year>2006</year>
          )
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>An online learning approach to improving the quality of crowd-sourcing</article-title>
          ,
          <source>ACM SIGMETRICS Performance Evaluation Review</source>
          <volume>43</volume>
          (
          <year>2015</year>
          )
          <fpage>217</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Albarqouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Achilles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Belagiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Demirci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Navab</surname>
          </string-name>
          ,
          <article-title>Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images</article-title>
          ,
          <source>IEEE transactions on medical imaging</source>
          <volume>35</volume>
          (
          <year>2016</year>
          )
          <fpage>1313</fpage>
          -
          <lpage>1321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. A. A.</given-names>
            <surname>Setio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Traverso</surname>
          </string-name>
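          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>De Bel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Berens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Den Bogaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cerello</surname>
          </string-name>
          ,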
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Fantacci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Geurts</surname>
          </string-name>
          , et al.,
          <article-title>Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the luna16 challenge</article-title>
          ,
          <source>Medical image analysis</source>
          <volume>42</volume>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gilbert</surname>
          </string-name>
          ,
          <article-title>Credbank: A large-scale social media corpus with associated credibility annotations</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>9</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pennycook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Rand</surname>
          </string-name>
          ,
          <article-title>Fighting misinformation on social media using crowdsourced judgments of news source quality</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>116</volume>
          (
          <year>2019</year>
          )
          <fpage>2521</fpage>
          -
          <lpage>2526</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V. C.</given-names>
            <surname>Raykar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Valadez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Florin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bogoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moy</surname>
          </string-name>
          ,
          <article-title>Learning from crowds</article-title>
          ,
          <source>Journal of machine learning research</source>
          <volume>11</volume>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Whitehill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-f.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bergsma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Movellan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ruvolo</surname>
          </string-name>
          ,
          <article-title>Whose vote should count more: Optimal integration of labels from labelers of unknown expertise</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>22</volume>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <article-title>Gaussian process classification and active learning with multiple annotators</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>433</fpage>
          -
          <lpage>441</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lourenco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Learning supervised topic models for classification and regression from crowds</article-title>
          ,
          <source>IEEE transactions on pattern analysis and</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>