=Paper=
{{Paper
|id=Vol-3357/paper1
|storemode=property
|title=To Aggregate or Not? Learning with Separate Noisy Labels
|pdfUrl=https://ceur-ws.org/Vol-3357/paper1.pdf
|volume=Vol-3357
|authors=Jiaheng Wei,Zhaowei Zhu,Tianyi Luo,Ehsan Amid,Abhishek Kumar,Yang Liu
|dblpUrl=https://dblp.org/rec/conf/csw/WeiZLAKL23
}}
==To Aggregate or Not? Learning with Separate Noisy Labels==
Jiaheng Wei1,*,†, Zhaowei Zhu1,†, Tianyi Luo2, Ehsan Amid3, Abhishek Kumar3 and Yang Liu1,*
1 University of California, Santa Cruz  2 Amazon Search Science and AI  3 Google Research, Brain Team

Abstract
Raw training data often come with separate noisy labels collected from multiple imperfect annotators (e.g., via crowdsourcing). A typical way of using these separate labels is to first aggregate them into a single label and then apply standard training methods. The literature has also studied effective aggregation approaches extensively. This paper revisits this choice and aims to answer the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. We theoretically analyze the performance of both approaches under the empirical risk minimization framework for a number of popular loss functions, including the ones designed specifically for the problem of learning with noisy labels. Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high, or when the number of labelers/annotations is insufficient. Extensive empirical results validate our conclusions.

Keywords
Crowdsourcing, Label Aggregation, Label Noise, Human Annotation

1. Introduction
Training high-quality deep neural networks for classification tasks typically requires a large quantity of annotated data. The raw training data often come with separate noisy labels collected from multiple imperfect annotators. For example, the popular data collection paradigm of crowdsourcing [1, 2, 3] offers a platform to collect such annotations from the unverified crowd; medical records are often accompanied by diagnoses from multiple doctors [4, 5]; and news articles can receive multiple checks (on whether the article is fake or not) from different experts [6, 7]. This leads to the situation considered in this paper: learning with multiple separate noisy labels.

WSDM 2023 Crowd Science Workshop on Collaboration of Humans and Learning Algorithms for Data Labeling, March 3, 2023, Singapore
* Corresponding author. † These authors contributed equally.
jiahengwei@ucsc.edu (J. Wei); yangliu@ucsc.edu (Y. Liu)
https://weijiaheng.github.io/ (J. Wei); https://users.soe.ucsc.edu/~zhaoweizhu/ (Z. Zhu); https://scholar.google.com.hk/citations?user=xnAVjF8AAAAJ&hl=en (T. Luo); https://sites.google.com/view/eamid/ (E. Amid); https://abhishek.umiacs.io/ (A. Kumar); http://www.yliuu.com/ (Y. Liu)

The most popular approach to learning from multiple separate labels is to aggregate the given labels for each instance [8, 9, 10, 11, 12], e.g., through an Expectation-Maximization (EM) inference technique. Each instance is then provided with a single label, to which the standard training procedure is applied. The primary goal of this paper is to revisit the choice of aggregating separate labels and to provide practitioners with an understanding of the following question: Should the learner aggregate separate noisy labels for one instance into a single label or not?
Our main contributions can be summarized as follows: ∙ We provide theoretical insights on how separation methods and aggregation ones result in different biases (Theorem 3.4, 4.2, 4.6) and variances (Theorem 3.6, 4.3, 4.7) of the output classifier from training. Our analysis considers both the standard loss functions in use, as well as popular robust losses that are designed for the problem of learning with noisy labels. ∙ By comparing the analytical proxy of the worst-case performance bounds, our theoretical results reveal that separating multiple noisy labels is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. The results are consistent for both the basic loss function ℓ and robust designs, including loss correction and peer loss. ∙ We carry out extensive experiments using both synthetic and real-world datasets to validate our theoretical findings. 1.1. Related Works Label separation vs label aggregation Existing works mainly compare the separation with aggregation by empirical results. For example, it has been shown that label separation could be effective in improving model performance and may be potentially more preferable than aggregated labels through majority voting [13]. When training with the cross-entropy loss, Sheng et.al [14] observe that label separation reduces the bias and roughness, and outperforms majority-voting aggregated labels. However, it is unclear whether the results hold when robust treatments are employed. Similar problems have also been studied in corrupted label detection with a result leaning towards separation but not proved [15]. Another line of approach concentrates on the end-to-end training scheme or ensemble methods which take all the separate noisy labels as the input during the training process [16, 17, 18, 19, 20], and learning from separate noisy labels directly. Learning with noisy labels Popular approaches in learning with noisy labels could be broadly divided into following categories, i.e., (i) Adjusting the loss on noisy labels by: using the knowledge of noise label transition matrix [21, 22, 23, 24, 25, 26, 27, 28, 29]; re-weighting the per-sample loss by down-weighting instances with potentially wrong labels [30, 31, 32, 33, 34]; or refurbishing the noisy labels [35, 36, 37]; (ii) Robust loss designs that do not require the knowledge of noise transition matrix [38, 39, 40, 41, 42, 43, 44, 45]; (iii) Regularization techniques to prevent deep neural networks from memorizing noisy labels [46, 47, 48, 49, 50, 51]; (iv) Dynamical sample selection procedure which behaves in a semi-supervised manner and begins with a clean sample selection procedure, then makes use of the wrongly-labeled samples [52, 53, 54, 55, 56]. For example, several methods [57, 58, 59] adopt a mentor/peer network to select small-loss samples as “clean” ones for the student/peer network. See [60, 61] for a more detailed survey of existing noise-robust techniques. 2. Formulation Consider an 𝑀 -class classification task and let 𝑋 ∈ 𝒳 and 𝑌 ∈ 𝒴 := {1, 2, ..., 𝑀 } denote the input examples and their corresponding labels, respectively. We assume that (𝑋, 𝑌 ) ∼ 𝒟, where 𝒟 is the joint data distribution. Samples (𝑥, 𝑦) are generated according to random variables (𝑋, 𝑌 ). In the clean and ideal scenario, the learner has access to 𝑁 training data points 𝐷 := {(𝑥𝑛 , 𝑦𝑛 )}𝑛∈[𝑁 ] . Instead of having access to ground truth labels 𝑦𝑛 s, we only have access to a set of noisy labels {𝑦˜𝑛,𝑖 }𝑖∈[𝐾] for 𝑛 ∈ [𝑁 ]. 
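To make this setup concrete, the following minimal sketch (our own illustration, not the authors' code; the function names and toy values are ours) builds such a training set: each of the N examples receives K separate noisy labels, drawn independently from a class-dependent noise transition matrix T, which is formalized just below. The symmetric-noise T used here matches the simulation model later adopted in the experiments of Section 5.1.

```python
import numpy as np

def symmetric_transition_matrix(num_classes: int, eps: float) -> np.ndarray:
    """T[k, l] = P(noisy label = l | clean label = k); off-diagonal mass eps/(M-1)."""
    T = np.full((num_classes, num_classes), eps / (num_classes - 1))
    np.fill_diagonal(T, 1.0 - eps)
    return T

def sample_separate_noisy_labels(y_clean: np.ndarray, T: np.ndarray, K: int,
                                 rng: np.random.Generator) -> np.ndarray:
    """Return an (N, K) array: K independent noisy labels per example,
    each drawn from the row of T indexed by the clean label."""
    N, M = len(y_clean), T.shape[0]
    noisy = np.empty((N, K), dtype=int)
    for i in range(K):
        noisy[:, i] = [rng.choice(M, p=T[y]) for y in y_clean]
    return noisy

# toy usage: 1000 examples, 10 classes, noise rate 0.4, K = 5 annotators
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)
T = symmetric_transition_matrix(10, eps=0.4)
y_tilde = sample_separate_noisy_labels(y, T, K=5, rng=rng)
print("empirical noise rate:", np.mean(y_tilde != y[:, None]))  # close to 0.4
```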
For ease of presentation, we adopt the decorator ∘ to denote separate labels and ∙ for aggregated labels, as specified later. Noisy labels $\tilde{y}_n$ are generated according to the random variable $\tilde{Y}$. We consider the class-dependent label noise transition [30, 21], where $\tilde{Y}$ is generated according to a transition matrix $T$ with entries defined as $T_{k,l} := \mathbb{P}(\tilde{Y} = l \mid Y = k)$.

Most existing results on learning with noisy labels consider the setting where each $x_n$ is paired with only one noisy label $\tilde{y}_n$. In practice, we often operate in a setting where each data point $x_n$ is associated with multiple separate labels drawn from the same noisy label generation process [62, 63]. We consider this setting and assume that for each $x_n$, there are $K$ independent noisy labels $\tilde{y}_{n,1}, ..., \tilde{y}_{n,K}$ obtained from $K$ annotators [64]. We are interested in two popular ways to leverage multiple separate noisy labels:
• Keep the separate labels as separate ones and apply standard learning-with-noisy-labels techniques to each of them.
• Aggregate the noisy labels into one label, and then apply standard learning-with-noisy-labels techniques.
We will look into each of the above two settings separately and then answer the question: "Should the learner aggregate multiple separate noisy labels or not?"

2.1. Label Separation
Denote by $\mathbb{P}_{\tilde{Y}} := [\mathbb{P}(\tilde{Y} = 1), \cdots, \mathbb{P}(\tilde{Y} = M)]^\top$ the column vector giving the marginal distribution of $\tilde{Y}$. Accordingly, we can define $\mathbb{P}_Y$ for $Y$. Clearly, we have the relations $\mathbb{P}_{\tilde{Y}} = T^\top \cdot \mathbb{P}_Y$ and $\mathbb{P}_Y = (T^\top)^{-1} \cdot \mathbb{P}_{\tilde{Y}}$. Denote by $\rho_1 := \mathbb{P}(\tilde{Y} = 0 \mid Y = 1)$ and $\rho_0 := \mathbb{P}(\tilde{Y} = 1 \mid Y = 0)$. The noise transition matrix $T$ has the following form when $M = 2$:
$$T = \begin{bmatrix} 1-\rho_0 & \rho_0 \\ \rho_1 & 1-\rho_1 \end{bmatrix}.$$
For label separation, we define the per-sample loss function as
$$\ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K}) = \frac{1}{K}\sum_{i\in[K]} \ell(f(x_n), \tilde{y}_{n,i}).$$
For simplicity, we use the shorthand $\ell(f(x_n), \tilde{y}_n) := \ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K})$ for the loss of the label separation method when there is no confusion.

2.2. Label Aggregation
The other way to leverage multiple separate noisy labels is to generate a single label via a label aggregation method applied to the $K$ noisy ones: $\tilde{y}^{\bullet}_n := \mathrm{Aggregation}(\tilde{y}_{n,1}, \tilde{y}_{n,2}, ..., \tilde{y}_{n,K})$, where the aggregated noisy labels $\tilde{y}^{\bullet}_n$ are generated according to the random variable $\tilde{Y}^{\bullet}$. Denote the confusion matrix for this single, aggregated noisy label by $T^{\bullet}$. Popular aggregation methods include majority vote and EM inference, both of which are covered by our theoretical insights since the analyses in later sections are built on a general label aggregation method. For a better understanding, we introduce the majority vote as an example.

An Example of Majority Vote. Given the majority-voted label, we can compute the transition matrix between $\tilde{Y}^{\bullet}$ and the true label $Y$ using the knowledge of $T$. The lemma below gives the closed form for $T^{\bullet}$ in terms of $T$ when adopting majority vote.

Lemma 2.1. Assume $K$ is odd and recall that in the binary classification task $T_{i,j} = \mathbb{P}(\tilde{Y} = j \mid Y = i)$. The noise transition matrix of the (majority-voting) aggregated noisy labels becomes
$$T^{\bullet}_{p,q} = \sum_{i=0}^{\frac{K+1}{2}-1} \binom{K}{i} (T_{p,q})^{K-i} (T_{p,1-q})^{i}, \quad p, q \in \{0, 1\}.$$
When $K = 3$, $T^{\bullet}_{1,0} = \mathbb{P}(\tilde{Y}^{\bullet} = 0 \mid Y = 1) = (T_{1,0})^3 + 3(T_{1,0})^2(T_{1,1})$. Note that it still holds that $T^{\bullet}_{p,q} + T^{\bullet}_{p,1-q} = 1$.
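As a quick sanity check on Lemma 2.1, the sketch below (illustrative only; the noise rates rho_0 = 0.3, rho_1 = 0.2 are assumed toy values and the function name is ours) evaluates the closed form for the majority-vote transition matrix and the K = 3 special case given above; each row of the resulting matrix still sums to one.

```python
import numpy as np
from math import comb

def majority_vote_transition(T: np.ndarray, K: int) -> np.ndarray:
    """Closed form of Lemma 2.1 (binary labels, K odd):
    T*[p, q] = sum_{i=0}^{(K+1)/2 - 1} C(K, i) * T[p, q]^(K-i) * T[p, 1-q]^i."""
    assert K % 2 == 1, "majority vote needs an odd number of labelers"
    T_mv = np.zeros_like(T)
    for p in (0, 1):
        for q in (0, 1):
            T_mv[p, q] = sum(comb(K, i) * T[p, q] ** (K - i) * T[p, 1 - q] ** i
                             for i in range((K + 1) // 2))
    return T_mv

rho_0, rho_1, K = 0.3, 0.2, 3
T = np.array([[1 - rho_0, rho_0],
              [rho_1, 1 - rho_1]])
T_mv = majority_vote_transition(T, K)
print(T_mv)                      # rows still sum to one
# K = 3 special case from the text: T*[1,0] = T[1,0]^3 + 3*T[1,0]^2*T[1,1]
print(T_mv[1, 0], T[1, 0] ** 3 + 3 * T[1, 0] ** 2 * T[1, 1])
```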
For the aggregation method, as illustrated in Figure 1, the x-axis indicates the number of labelers $K$, and the y-axis denotes the aggregated noise rate, given that the overall noise rate before aggregation is in $\{0.2, 0.4, 0.6, 0.8\}$.

Figure 1: Noise rates of the aggregated labels in synthetic noisy CIFAR-10. MV: majority vote. EM: expectation maximization. 0.2–0.8: original noise rates before aggregation.

Even with a limited number of labelers (i.e., $K < 10$), when the noise rate is small, both majority vote and EM label aggregation significantly reduce the noise rate. Although the expectation-maximization method consumes much more time to generate the aggregated labels, it frequently results in a lower aggregated noise rate than the majority vote.

3. Bias and Variance Analyses w.r.t. ℓ-Loss
In this section, we provide theoretical insights on how label separation and aggregation methods result in different biases and variances of the classifier prediction when learning with the standard loss function ℓ. Suppose the clean training samples $\{(x_n, y_n)\}_{n\in[N]}$ are given by variables $(X, Y)$ such that $(X, Y) \sim \mathcal{D}$. Recall that instead of having access to a set of clean training samples $D = \{(x_n, y_n)\}_{n\in[N]}$, the learner only observes $K$ noisy labels $\tilde{y}_{n,1}, ..., \tilde{y}_{n,K}$ for each $x_n$, denoted by $\tilde{D} := \{(x_n, \tilde{y}_{n,1}, ..., \tilde{y}_{n,K})\}_{n\in[N]}$. For separation methods, the noisy training samples are obtained through variables $(X, \tilde{Y}_1), ..., (X, \tilde{Y}_K)$ where $(X, \tilde{Y}_i) \sim \tilde{\mathcal{D}}$ for $i \in [K]$. For aggregation methods such as majority vote, we assume the data points and aggregated noisy labels $\tilde{D}^{\bullet} := \{(x_n, \tilde{y}^{\bullet}_n)\}_{n\in[N]}$ are drawn from $(X, \tilde{Y}^{\bullet}) \sim \tilde{\mathcal{D}}^{\bullet}$, where $\tilde{Y}^{\bullet}$ is produced through the majority voting of $\tilde{Y}_1, ..., \tilde{Y}_K$. When we mention "noise rate", we usually refer to the average noise $\mathbb{P}(\tilde{Y}^u \neq Y)$.

ℓ-risk under the distribution. Given the loss ℓ and recalling that $\ell(f(x_n), \tilde{y}_n)$ denotes $\ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K}) = \frac{1}{K}\sum_{i\in[K]} \ell(f(x_n), \tilde{y}_{n,i})$, we define the empirical ℓ-risk for learning with separated/aggregated noisy labels as
$$\hat{R}_{\ell,\tilde{D}^u}(f) = \frac{1}{N}\sum_{i=1}^{N} \ell(f(x_i), \tilde{y}^u_i),$$
where $u \in \{\circ, \bullet\}$ unifies the treatment, which is either separation (∘) or aggregation (∙). By increasing the sample size $N$, we would expect $\hat{R}_{\ell,\tilde{D}^u}(f)$ to be close to the following ℓ-risk under the noisy distribution $\tilde{\mathcal{D}}^u$:
$$R_{\ell,\tilde{\mathcal{D}}^u}(f) = \mathbb{E}_{(X,\tilde{Y}^u)\sim\tilde{\mathcal{D}}^u}\big[\ell(f(X), \tilde{Y}^u)\big].$$
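For concreteness, here is a minimal PyTorch-style sketch of the two empirical risks defined above (our own illustration, with the cross-entropy loss standing in for ℓ and majority vote as the aggregation; it is not the paper's implementation): separation averages the per-sample loss over the K separate labels, while aggregation applies the loss to the single aggregated label.

```python
import torch
import torch.nn.functional as F

def empirical_risk_separation(logits: torch.Tensor, noisy_labels: torch.Tensor) -> torch.Tensor:
    """Separation risk: average the loss over the K separate labels of each example.
    logits: (N, M); noisy_labels: (N, K) integer labels."""
    N, K = noisy_labels.shape
    # repeat each example's logits K times so every separate label contributes one loss term
    expanded = logits.unsqueeze(1).expand(-1, K, -1).reshape(N * K, -1)
    return F.cross_entropy(expanded, noisy_labels.reshape(-1))

def empirical_risk_aggregation(logits: torch.Tensor, aggregated_labels: torch.Tensor) -> torch.Tensor:
    """Aggregation risk: standard risk on the single aggregated label per example."""
    return F.cross_entropy(logits, aggregated_labels)

# toy usage with random logits: N = 8 examples, M = 10 classes, K = 3 labels each
logits = torch.randn(8, 10)
separate = torch.randint(0, 10, (8, 3))
aggregated = separate.mode(dim=1).values   # majority vote over the K labels (ties broken arbitrarily)
print(empirical_risk_separation(logits, separate).item(),
      empirical_risk_aggregation(logits, aggregated).item())
```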
3.1. Bias of a Given Classifier w.r.t. ℓ-Loss
We denote by $f^* \in \mathcal{F}$ the optimal classifier obtained on the clean data distribution $(X, Y) \sim \mathcal{D}$ within the hypothesis space $\mathcal{F}$. We formally define the bias of a given classifier $\hat{f}$ as:

Definition 3.1 (Classifier Prediction Bias of ℓ-Loss). Denote by $R_{\ell,\mathcal{D}}(\hat{f}) := \mathbb{E}_{\mathcal{D}}[\ell(\hat{f}(X), Y)]$ and $R_{\ell,\mathcal{D}}(f^*) := \mathbb{E}_{\mathcal{D}}[\ell(f^*(X), Y)]$. The bias of classifier $\hat{f}$ writes as $\mathrm{Bias}(\hat{f}) = R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\mathcal{D}}(f^*)$.

The Bias term quantifies the prediction bias (excess risk) of a given classifier $\hat{f}$ on the clean data distribution $\mathcal{D}$ w.r.t. the optimal achievable classifier $f^*$, which can be decomposed as [65]
$$\mathrm{Bias}(\hat{f}) = \underbrace{R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f})}_{\text{Distribution shift}} + \underbrace{R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}) - R_{\ell,\mathcal{D}}(f^*)}_{\text{Estimation error}}. \quad (1)$$
Now we bound the distribution shift and the estimation error in the following two lemmas.

Lemma 3.2 (Distribution shift). Denote by $p_i := \mathbb{P}(Y = i)$, and assume ℓ is upper bounded by $\bar{\ell}$ and lower bounded by $\underline{\ell}$. The distribution shift in Eqn. (1) is upper bounded by
$$R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}) \le \Delta R^{u,1} := (\rho^u_0 p_0 + \rho^u_1 p_1)\cdot(\bar{\ell} - \underline{\ell}). \quad (2)$$

Lemma 3.3 (Estimation error). Suppose the loss function $\ell(f(x), y)$ is $L$-Lipschitz for any feasible $y$. For all $f \in \mathcal{F}$, with probability at least $1 - \delta$, the estimation error is upper bounded by
$$R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}) - R_{\ell,\mathcal{D}}(f^*) \le \Delta R^{u,2} := 4L\cdot\mathfrak{R}(\mathcal{F}) + (\bar{\ell} - \underline{\ell})\cdot\sqrt{\frac{2\log(1/\delta)}{\eta^u_K N}} + \Delta R^{u,1},$$
where $u \in \{\circ, \bullet\}$ denotes either the separation or the aggregation method, $\eta^{\circ}_K = \frac{K\log(1/\delta)}{2\left(\log\frac{K+1}{\delta}\right)^2}$ and $\eta^{\bullet}_K \equiv 1$ are the richness factors, which characterize the effect of the number of labelers, and $\mathfrak{R}(\mathcal{F})$ is the Rademacher complexity of $\mathcal{F}$.

Figure 2: The visualization of the estimated $\eta^{\circ}_K$ for varied $\delta$ (legend: $\delta \in \{10^{-1}, 10^{-2}, 10^{-4}, 10^{-8}, 10^{-12}, 10^{-20}\}$; x-axis: number of labelers, up to 100).

Noting that the number of unique instances $x_i$ is the same for both treatments, the duplicated copies of $x_i$ should introduce no fewer effective samples, i.e., the richness factor satisfies $\eta^u_K \ge 1$. Thus, we update $\eta_K := \max\{\eta_K, 1\}$, and Figure 2 visualizes the estimated $\eta^{\circ}_K$ for different numbers of labelers and values of $\delta$. It is clear that once the number of labelers is large enough (with the threshold depending on $\delta$), $\eta^{\circ}_K > \eta^{\bullet}_K$. Later we shall show how $\eta^u_K$ influences the bias and variance of the classifier prediction.

To give a more intuitive comparison of the performance of both mechanisms, we adopt the worst-case bias upper bound $\Delta R^u := \Delta R^{u,1} + \Delta R^{u,2}$ from Lemma 3.2 and Lemma 3.3 as a proxy and derive Theorem 3.4.

Theorem 3.4. Denote by $\alpha_K := (\rho^{\circ}_0 p_0 + \rho^{\circ}_1 p_1) - (\rho^{\bullet}_0 p_0 + \rho^{\bullet}_1 p_1)$ and $\gamma = \sqrt{\log(1/\delta)/2N}$. The separation bias proxy $\Delta R^{\circ}$ is smaller than the aggregation bias proxy $\Delta R^{\bullet}$ if and only if
$$\alpha_K \cdot \frac{1}{1 - (\eta^{\circ}_K)^{-\frac{1}{2}}} \le \gamma. \quad (3)$$
Note that $\alpha_K$ and $\eta^{\circ}_K$ are non-decreasing in $K$. In Section 4.3 we explore how the LHS of Eqn. (3) is influenced by $K$: in short, the LHS of Eqn. (3) is (generally) monotonically increasing in $K$ when $K$ is small, indicating that, for fixed $\delta$ and $N$, Eqn. (3) is easier to satisfy with a smaller $K$ than with a larger one.

3.2. Variance of a Given Classifier w.r.t. ℓ-Loss
We now move on to explore the variance of a given classifier when learning with the ℓ-loss. Prior to the discussion, we define the variance of a given classifier as:

Definition 3.5 (Classifier Prediction Variance of ℓ-Loss). The variance of a given classifier $\hat{f}$ when learned with separation (∘) or aggregation (∙) is defined as
$$\mathrm{Var}(\hat{f}^u) = \mathbb{E}_{(X,\tilde{Y}^u)\sim\tilde{\mathcal{D}}^u}\Big[\ell(\hat{f}^u(X), \tilde{Y}^u) - \mathbb{E}_{(X,\tilde{Y}^u)\sim\tilde{\mathcal{D}}^u}[\ell(\hat{f}^u(X), \tilde{Y}^u)]\Big]^2.$$
For $g(x) = x - x^2$, we derive the closed form of Var and the corresponding upper bound as below.

Theorem 3.6. When $\eta^u_K \ge \frac{2\log(1/\delta)}{N}$ and ℓ is the 0-1 loss, we have
$$\mathrm{Var}(\hat{f}^u) = g\big(R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}^u)\big) \le g\Bigg(\underbrace{\sqrt{\frac{2\log(1/\delta)}{\eta^u_K N}}}_{\text{Variance proxy}}\Bigg). \quad (4)$$
The variance proxy of $\mathrm{Var}(\hat{f}^{\circ})$ in Eqn. (4) is smaller than that of $\mathrm{Var}(\hat{f}^{\bullet})$.

Theorem 3.6 provides another view to decide between separation and aggregation methods, i.e., via the proxy of the classifier prediction variance. To extend the theoretical conclusions w.r.t. the ℓ-loss to the multi-class setting, we only need to modify the upper bound of the distribution shift in Eqn. (2), as specified in the following corollary.

Corollary 3.7 (Multi-Class Extension (ℓ-Loss)). In the $M$-class classification case, the upper bound of the distribution shift in Eqn. (2) becomes
$$R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}) \le \Delta R^{u,1} := \sum_{j\in[M]} p_j\cdot(1 - T^u_{j,j})\cdot(\bar{\ell} - \underline{\ell}). \quad (5)$$
4.
Bias and Variance Analyses with Robust Treatments Intuitively, the learning of noisy labels problem could benefit from more robust loss functions built upon the generic ℓ loss, i.e., backward correction (surrogate loss) [21, 22], and peer loss functions [42]. We move on to explore the best way to learn with multiple copies of noisy labels, when combined with existing robust approaches. 4.1. Backward Loss Correction When combined with the ∑︀ backward loss correction approach (ℓ → ℓ← ), the empirical ℓ risks become: 𝑅ℓ← ,𝐷̃︀ 𝑢 (𝑓 ) = 𝑁 𝑁 ^ 1 ˜𝑢𝑖 ), where the corrected loss in the binary case is 𝑖=1 ℓ← (𝑓 (𝑥𝑖 ), 𝑦 defined as (1 − 𝜌1−𝑦˜𝑢 ) · ℓ(𝑓 (𝑥), 𝑦˜𝑢 ) − 𝜌𝑦˜𝑢 · ℓ(𝑓 (𝑥), 1 − 𝑦˜𝑢 ) ℓ← (𝑓 (𝑥), 𝑦˜𝑢 ) = . 1 − 𝜌𝑢0 − 𝜌𝑢1 Bias of given classifier w.r.t. ℓ← Suppose the loss function ℓ(𝑓 (𝑥), 𝑦) is 𝐿-Lipschitz for 𝑢 𝑢 any feasible 𝑦. Define 𝐿𝑢← := 𝐿𝑢←0 · 𝐿, where 𝐿𝑢←0 := (1+|𝜌 ^ 𝑢 . Denote by 𝑅ℓ,𝒟 (𝑓 ) the ℓ-risk 0 −𝜌1 |) 1−𝜌𝑢 0 −𝜌 1 𝑢 of the classifier 𝑓^ under the clean data distribution 𝒟, with 𝑓^ = 𝑓^ = arg min ← 𝑅^ ̃︀ 𝑢 (𝑓 ). 𝑓 ∈ℱ ℓ← ,𝐷 Lemma 4.1 gives the upper bound of classifier prediction bias when learning with ℓ← via separation or aggregation methods. Lemma 4.1. With probability at least 1 − 𝛿, we have: √︃ 𝑢 𝑢 2 log(1/𝛿) 𝑅ℓ,𝒟 (𝑓^ ← ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ Δ𝑅← := 4𝐿𝑢← · R(ℱ) + 𝐿𝑢←0 · (ℓ − ℓ) · 𝑢𝑁 . 𝜂𝐾 Lemma 4.1 offers the upper bound of the performance gap for the given classifier 𝑓 w.r.t the 𝑢 clean distribution 𝒟, comparing to the minimum achievable risk. We consider the bound Δ𝑅← as a proxy of the bias, and we are interested in the case where training the classifier separately ∙ yields a smaller bias proxy compared to that of the aggregation method, formally Δ𝑅← < Δ𝑅← . For any finite hypothesis class ℱ ⊂ {𝑓 : 𝑋 → {0, 1}}, and the sample set 𝑆 = {𝑥1 , ..., 𝑥𝑁 }, denote by 𝑑 the VC-dimension of ℱ, we give conditions when training separately yields a smaller bias proxy. (︁ √︁ )︁ 𝑑 log(𝑁 ) Theorem 4.2. Denote by 𝛼𝐾 := 1 − 𝐿∙← /𝐿← , 𝛾 = 1/ 1 + ℓ−ℓ 4𝐿 log(1/𝛿) , where 𝑑 is the VC-dimension of ℱ. For backward loss correction, the separation bias proxy Δ𝑅← is smaller than ∙ the aggregation bias proxy Δ𝑅← if and only if 1 𝛼𝐾 · 1 ≤ 𝛾. (6) 1 − (𝜂𝐾 )− 2 We defer our empirical analysis of the monotonicity of the LHS in Eqn. (6) to Section 4.3 as well, which shares similar monotonicity behavior to learning w.r.t. ℓ. Variance of given classifiers with Backward Loss Correction Similar to the previous subsection, we now move on to check how separation and aggregation methods result in different variance when training with loss correction. 𝑢 )− 12 < √︁ 𝑢 Theorem 4.3. When 𝐿𝑢←0 (𝜂𝐾 𝑁 2(ℓ−ℓ)2 log(1/𝛿) , Var(𝑓^ ← ) (w.r.t. the 0-1 loss) satisfies: (︃ √︃ )︃ 𝑢 𝑢 2 log(1/𝛿) Var(𝑓^ ← ) = 𝑔(𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← )) ≤ 𝑔 𝐿𝑢←0 · (ℓ − ℓ) · 𝑢𝑁 . (7) 𝜂𝐾 Variance proxy ∙ √ The variance proxy of Var(𝑓^ ← ) in Eqn. (7) is smaller than that of Var(𝑓^ ← ) if 𝐿 ∙ . 𝜂𝐾 > 𝐿← ← Moving a bit further, when the noise transition matrix is symmetric for both methods, the 𝐿← 1−𝜌∙0 −𝜌∙1 requirement 𝜂𝐾 > 𝐿∙← could be further simplified as: 𝜂𝐾 𝐿 ∙ = 1−𝜌 −𝜌 . For a fixed √︀ 𝑢 √︀ 𝑢 > 𝐿 ← ← 0 1 𝐾, a more efficient aggregation method decreases 𝜌∙𝑖 , which makes it harder to satisfy this condition. Recall 𝐿𝑢← := 𝐿𝑢←0 · 𝐿, the theoretical insights of ℓ← between binary case and the multi-class setting could be bridged by replacing 𝐿𝑢0 with the multi-class constant specified in the following corollary. Corollary 4.4 (Multi-Class Extension (ℓ← -Loss)). 
Given a diagonal-dominant transition matrix 𝑇 𝑢 , we have √ 𝑢 2 𝑀 𝐿←0 = , 𝜆min (𝑇 𝑢 ) where 𝜆min (𝑇 𝑢 ) denotes the minimal eigenvalue of the matrix 𝑇 𝑢 . Particularly, if 𝑇𝑖𝑖𝑢 < 0.5, ∀𝑖 ∈ [𝑀 ], we further have {︃ √ }︃ 1 2 𝑀 𝐿𝑢←0 = min , , where 𝑒𝑢 := max (1 − 𝑇𝑖𝑖𝑢 ). 1 − 2𝑒𝑢 𝜆min (𝑇 𝑢 ) 𝑖∈[𝑀 ] 4.2. Peer Loss Functions Peer Loss function [42] is a family of loss functions that are shown to be robust to la- bel noise, without requiring the knowledge of noise rates. Formally, ℓ↬ (𝑓 (𝑥𝑖 ), 𝑦˜𝑖 ) := ℓ(𝑓 (𝑥𝑖 ), 𝑦˜𝑖 ) − ℓ(𝑓 (𝑥1𝑖 ), 𝑦˜2𝑖 ), where the second term checks on mismatched data samples with (𝑥𝑖 , 𝑦˜𝑖 ), (𝑥1𝑖 , 𝑦˜1𝑖 ), (𝑥2𝑖 , 𝑦˜2𝑖 ), which are randomly drawn from the same data distribu- tion. When combined with the peer loss approach, i.e., ℓ → ℓ↬ , the two risks become: ^ ̃︀ 𝑢 (𝑓 ) = 1 ∑︀𝑁 ℓ↬ (𝑓 (𝑥𝑖 ), 𝑦˜𝑢 ), 𝑢 ∈ {, ∙}. 𝑅 ℓ↬ ,𝐷 𝑁 𝑖=1 𝑖 Bias of given classifier w.r.t. ℓ↬ Suppose the loss function ℓ(𝑓 (𝑥), 𝑦) is 𝐿-Lipschitz for any 𝑢 feasible 𝑦. Let 𝐿𝑢↬0 := 1/(1 − 𝜌𝑢0 − 𝜌𝑢1 ), 𝐿𝑢↬ := 𝐿𝑢↬0 · 𝐿 and 𝑓^ ↬ = arg min𝑓 ∈ℱ 𝑅 ^ ̃︀ 𝑢 (𝑓 ). ℓ↬ ,𝐷 Lemma 4.5. With probability at least 1 − 𝛿, we have: √︃ 𝑢 𝑢 2 log(4/𝛿) (︀ 𝑅ℓ,𝒟 (𝑓^ ↬ ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ Δ𝑅↬ := 8𝐿𝑢↬ · R(ℱ) + 𝐿𝑢↬0 · )︀ 𝑢 · 1 + 2(ℓ̄ − ℓ) . 𝜂𝐾 𝑁 To evaluate the performance of a given classifier yielded by the optimization w.r.t. ℓ↬ , Lemma 𝑢 4.5 provides the bias proxy Δ𝑅↬ for both treatments. Similarly, we adopt such a proxy to analyze which treatment is more preferable. √︁ Theorem 4.6. Denote by 𝛼𝐾 := 1 − 𝐿∙↬ /𝐿↬ , 𝛾 = 1+2(ℓ̄−ℓ) 2𝐿 log(4/𝛿) 4𝑑 log(𝑁 ) , where 𝑑 denotes the VC-dimension of ℱ. For peer loss, the separation bias proxy Δ𝑅↬ is smaller than the aggregation ∙ bias proxy Δ𝑅↬ if and only if 1 𝛼𝐾 · 1 ≤ 𝛾. (8) 𝐿∙↬ /𝐿↬ − (𝜂𝐾 )− 2 Noise rate 0.2 Noise rate 0.4 Loss Loss Loss Loss Loss Loss 10 20 30 40 50 10 20 30 40 50 Number of Labelers Number of Labelers Figure 3: The monotonicity of the LHS in Eqn. (3, 6, 8) w.r.t. the increase of 𝐾. Note that the condition in Eqn. (8) shares a similar pattern to that which appeared in the basic loss ℓ and ℓ← , we will empirically illustrate the monotonicity of its LHS in Section 4.3. Variance of given classifiers with Peer Loss We now move on to check how separation and aggregation methods result in different variances when training with peer loss. Similarly, we can obtain: 𝑢 √︁ 𝜂𝐾 ≥ 2 log(4/𝛿) · 1 + 2(ℓ̄ − ℓ) , Var(𝑓^ ↬ ) (w.r.t. the 0-1 loss) satisfies: √︀ 𝑢 (︀ )︀ Theorem 4.7. When 𝑁 (︃ √︃ )︃ 𝑢 𝑢 log(4/𝛿) (︀ Var(𝑓^ ↬ ) = 𝑔(𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ )) ≤ 𝑔 𝐿𝑢↬0 · (9) )︀ 𝑢 𝑁 · 1 + 2(ℓ̄ − ℓ) 2𝜂𝐾 . Variance proxy ∙ √ The variance proxy of Var(𝑓^ ↬ ) in Eqn. (9) is smaller than that of Var(𝑓^ ↬ ) if 𝜂𝐾 ≥ 𝐿↬ 𝐿∙ . ↬ Theoretical insights of ℓ↬ also have the multi-class extensions, we only need to generate 𝐿𝑢↬0 to the multi-class setting along with additional conditions specified as below: Corollary 4.8 (Multi-Class Extension (ℓ↬ -Loss)). Assume ℓ↬ is classification-calibrated in 1 the multi-class setting, and the clean label 𝑌 has equal prior 𝑃 (𝑌 = 𝑗) = 𝑀 , ∀𝑗 ∈ [𝑀 ]. 𝑢 = 𝜌𝑢 , ∀𝑗 ∈ [𝑀 ], we have: 𝐿𝑢 uniform noise transition matrix [44] such that 𝑇𝑗,𝑖 For the ∑︀ 𝑖 ↬0 = 𝑢 1/(1 − 𝑖∈[𝑀 ] 𝜌𝑖 ). 4.3. Analysis of the Theoretical Conditions Recall that the established conditions in Theorems 3.4, 4.2, 4.6 are implicitly relevant to the number of labelers 𝐾, and the RHS of Eqns. (3, 6, 8) are constants. We proceed to analyze the monotonicity of the corresponding LHS (in the form of 𝛼𝐾 · 1 −1 ) w.r.t. the increase 𝛽𝐾 −(𝜂𝐾 ) 2 of 𝐾, where 𝛽𝐾 = 1 for ℓ and ℓ← , 𝛽𝐾 = 𝐿∙↬ /𝐿↬ for ℓ↬ . 
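As a concrete illustration of the two K-dependent quantities entering these conditions, the toy computation below (our own sketch; delta = 0.01, a symmetric per-annotator noise rate rho = 0.4, and class prior p0 = 0.5 are assumed values, and the function names are ours) evaluates the richness factor of Lemma 3.3, clipped at 1, and the majority-vote de-noising gain alpha_K of Theorem 3.4 for increasing K; both are non-decreasing in K, which is what drives the monotonicity discussed here.

```python
from math import comb, log

def eta_separation(K: int, delta: float) -> float:
    """Richness factor for separation: K*log(1/delta) / (2*(log((K+1)/delta))^2),
    clipped at 1 as in the paper; the factor for aggregation is identically 1."""
    eta = K * log(1.0 / delta) / (2.0 * log((K + 1) / delta) ** 2)
    return max(eta, 1.0)

def aggregated_error_rate(rho: float, K: int) -> float:
    """Majority-vote error rate from Lemma 2.1 (binary, symmetric noise rho, K odd)."""
    return sum(comb(K, i) * rho ** (K - i) * (1 - rho) ** i for i in range((K + 1) // 2))

delta, rho, p0 = 1e-2, 0.4, 0.5   # assumed toy values
for K in (3, 5, 9, 15, 25, 49):
    rho_mv = aggregated_error_rate(rho, K)
    # de-noising gain alpha_K = (rho*p0 + rho*p1) - (rho_mv*p0 + rho_mv*p1)
    alpha_K = (rho * p0 + rho * (1 - p0)) - (rho_mv * p0 + rho_mv * (1 - p0))
    print(f"K={K:3d}  eta_K^sep={eta_separation(K, delta):.3f}  "
          f"rho_MV={rho_mv:.3f}  alpha_K={alpha_K:.3f}")
```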
Thus, we have $O(\mathrm{LHS}) = O\big(\alpha_K \cdot (\beta_K - O(\log(K)/\sqrt{K}))^{-1}\big)$. We visualize this order under different symmetric $T$ in Figure 3.

Figure 4: The performances of Cross-Entropy, Backward Loss Correction, and Peer Loss trained on synthetic noisy Statlog-6/Optical-10 aggregated labels (we report the better result between majority vote and EM inference for each $K$ and noise rate $\epsilon$) and separated labels. X-axis: the square root of the number of labelers, $\sqrt{K}$; Y-axis: the best test accuracy achieved. (Symmetric noise, $\epsilon \in \{0.2, 0.4, 0.6\}$.)

Figure 5: The performances of Cross-Entropy, Backward Loss Correction, and Peer Loss trained on synthetic noisy CIFAR-10 aggregated labels (we report the better result between majority vote and EM inference for each $K$ and noise rate $\epsilon$) and separated labels. X-axis: $\sqrt{K}$, where $K$ denotes the number of labels per training example; Y-axis: the best achieved test accuracy. (Left panels: symmetric noise; right panels: instance-dependent noise; $\epsilon \in \{0.2, 0.4, 0.6, 0.8\}$.)

It can be observed that when $K$ is small (e.g., $K \le 5$), the LHS of these conditions increases with $K$, while it may decrease with $K$ if $K$ is sufficiently large. Recall that separation is better if the LHS is less than the constant value $\gamma$. Therefore, Figure 3 shows the trend that aggregation is generally better than separation when $K$ is sufficiently large.

Tightness of the bias proxies. In Theorems 3.4, 4.2, and 4.6, we view the error bounds ΔR^u, ΔR^u_←, ΔR^u_↬ as proxies of the worst-case performance of the trained classifier. For the standard loss function ℓ, it has been proven [66, 67] that, under mild conditions on ℓ and $\mathcal{F}$, the lower bound on the performance gap between a trained classifier $\hat{f}$ and the optimal achievable one $f^*$, i.e., $R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\mathcal{D}}(f^*)$, is of order $O(\sqrt{1/N})$, which matches the order in Theorem 3.4. Noting that the behavior concluded from the worst-case bounds may not always hold for each individual case, we further use experiments to validate our analyses in the next section.

Table 1
The performances of CE/BW/PeerLoss trained on 2 UCI datasets (Breast and German datasets), with aggregated labels (majority vote, EM inference), and separated labels.
We highlight the results with Green (for the separation method) and Red (for aggregation methods) if the performance gap is larger than 0.05. (𝐾 is the number of labels per training image) UCI-Breast (symmetric) CE UCI-German (symmetric) CE 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 96.05 96.05 96.49 96.93 97.37 97.37 MV 69.00 71.50 71.50 73.50 73.00 73.00 EM 96.93 96.05 96.49 96.93 97.37 97.37 EM 58.75 63.50 65.75 66.50 65.50 65.50 Sep 96.49 95.18 96.49 96.93 97.81 98.25 Sep 70.00 70.75 66.00 69.75 70.75 69.25 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 96.05 96.49 95.18 95.18 96.49 96.93 MV 65.75 62.25 62.75 68.50 71.75 70.50 EM 96.05 92.98 89.47 94.30 96.05 96.93 EM 61.00 60.00 61.50 54.00 62.00 63.25 Sep 92.11 94.30 95.61 96.49 96.93 96.93 Sep 68.25 65.50 65.00 64.50 64.75 69.50 UCI-Breast (symmetric) BW UCI-German (symmetric) BW 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 95.61 96.49 96.05 96.93 96.93 96.93 MV 72.75 71.50 74.00 75.50 76.50 76.50 EM 95.61 96.49 96.05 96.93 96.93 96.93 EM 62.75 61.50 59.25 64.50 62.50 62.50 Sep 95.18 93.42 96.49 96.05 97.37 98.25 Sep 70.50 70.50 73.75 68.25 70.00 72.75 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 89.91 96.05 94.74 94.30 96.05 96.49 MV 65.25 69.50 67.50 69.50 70.50 71.75 EM 81.14 94.30 92.11 94.74 92.54 96.49 EM 57.75 60.25 55.25 53.50 54.00 62.25 Sep 91.67 93.42 94.30 89.47 92.54 97.37 Sep 60.25 63.50 63.00 64.25 69.00 64.75 UCI-Breast (symmetric) PeerLoss UCI-German (symmetric) PeerLoss 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 96.05 96.49 96.49 96.93 96.93 96.93 MV 72.75 71.75 73.00 73.00 72.50 72.50 EM 96.05 96.49 96.49 96.93 96.93 96.93 EM 62.25 64.50 63.75 64.25 62.75 62.75 Sep 94.74 94.30 96.93 96.93 96.93 97.81 Sep 70.25 68.00 70.50 70.00 67.00 73.50 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.11 95.61 95.18 92.54 96.49 96.05 MV 69.50 66.25 69.50 68.75 69.00 70.00 EM 92.11 92.11 86.40 93.86 95.61 96.93 EM 62.50 61.25 64.25 57.75 59.75 65.00 Sep 92.11 94.30 95.18 95.18 95.61 96.05 Sep 64.00 61.25 66.50 68.00 69.25 69.00 5. Experimental Results In this section, we empirically compare the performance of different treatments on the multiple noisy labels when learning with robust loss functions (CE loss, forward loss correction, and peer loss). We consider several treatments including label aggregation methods (majority vote and EM inference) and the label separation method. Assuming that multiple noisy labels have different weights, EM inference can be used to solve the problem under this assumption by treating the aggregated labels as hidden variables [68, 69, 8, 70]. In the E-step, the probabilities of the aggregated labels are estimated using the weighted aggregation approach based on the fixed weights of multiple noisy labels. In the M-step, EM inference method re-estimates the weights of multiple noisy labels based on the current aggregated labels. This iteration continues until all aggregated labels remain unchanged. As for label separation, we adopted the mini-batch separation method, i.e., each training sample 𝑥𝑛 is assigned with 𝐾 noisy labels in each batch. 5.1. 
Experiment on Synthetic Noisy Datasets Experimental results on synthetic noisy UCI datasets [71] We adopt six UCI datasets to empirically compare the performances of label separation and aggregation methods when Table 2 Empirical verification of Theorem 3.4 on Breast & German UCI datasets. Dataset 𝜌∘𝑖 𝑝0 𝑁 (1 − 𝛿, 𝑆𝐾 ) Breast 0.2 0.3726 569 (0.62, {𝐾 > 49}) Breast 0.4 0.3726 569 (0.62, {𝐾 > 49}) German 0.2 0.3 1000 (0.98, {𝐾 > 15}) German 0.4 0.3 1000 (0.98, {𝐾 > 23}) learning with CE loss, backward correction [21, 22], and Peer Loss [42]. The noisy annotations given by multiple annotators are simulated by symmetric label noise, which assumes 𝑇𝑖,𝑗 = 𝑀𝜖−1 for 𝑗 ̸= 𝑖 for each annotator, where 𝜖 quantifies the overall noise rate of the generated noisy labels. In Figure 4, we adopt two UCI datasets (StatLog: (𝑀 = 6); Optical: (𝑀 = 10)) for illustration. From the results in Figure 4, it is quite clear that: the label separation method outperforms both aggregation methods (majority-vote and EM inference) consistently, and is considered to be more beneficial on such small scale datasets. Results on additional datasets and more details are deferred to the Appendix. Experimental results on synthetic noisy CIFAR-10 dataset [72] On CIFAR-10 dataset, we consider two types of simulation for the separate noisy labels: symmetric label noise model and instance-dependent label noise [53, 24], where 𝜖 is the average noise rate and different labelers follow different instance-dependent noise transition matrices. For a fair comparison, we adopt the ResNet-34 model [73], the same training procedure and batch-size for all considered treatments on the separate noisy labels. Figure 5 shares the following insights regarding the preference of the treatments: in the low noise regime or when 𝐾 is large, aggregating separate noisy labels significantly reduces the noise rates and aggregation methods tend out to have a better performance; while in the high noise regime or when 𝐾 is small, the performances of separation methods tend out to be more promising. With the increasing of 𝐾 or 𝜖, we can observe a preference transition from label separation to label aggregation methods. 5.2. Empirical Verification of the Theoretical Bounds To verify the comparisons of bias proxies (i.e., Theorem 3.4) through an empirical perspective, we adopt two binary classification UCI datasets for demonstration: Breast and German datasets, as shown in Table 1. Clearly, on these two binary classification tasks, label aggregation methods tend to outperform label separation, and we attribute this phenomenon to the fact that the ”denoising effect of label aggregation is more significant in(︁the binary case”. )︁ For Theorem 3.4 (CE loss), the condition requires 𝛼𝐾 / 1 − (𝜂𝐾 ∘ )− 12 , where 𝛼 = (𝜌∘ 𝑝 + 0 0 𝜌1 𝑝1 ) − (𝜌0 𝑝0 + 𝜌1 𝑝1 ), 𝛾 = log(1/𝛿)/2𝑁 . For two binary UCI datasets (Breast & German), ∘ ∙ ∙ √︀ the information could be summarized in Table 2, where the column (1 − 𝛿, 𝑆𝐾 ) means: when the number of annotators belongs to the set 𝑆𝐾 , the label separation method is likely to under- perform label aggregation (i.e., majority vote) with probability at least 1 − 𝛿. For example, in the last row of Table 2, when training on UCI German dataset with CE loss under noise rate Table 3 Experimental results on CIFAR-10N and CIFAR-10H dataset with 𝐾 = 3. We highlight the results with Green (for separation method) and Red (for aggregation methods) if the performance gap is large than 0.05. 
CIFAR-10N (𝜖 ≈ 0.18) CE BW PL Majority-Vote 89.52 89.23 89.84 EM-Inference 89.19 88.88 88.92 Separation 89.77 89.20 89.97 CIFAR-10H (𝜖 ≈ 0.09) CE BW PL Majority-Vote 80.86 82.72 82.11 EM-Inference 80.81 82.43 81.73 Separation 76.75 79.07 78.08 0.4 (the noise rate of separate noisy labels), Theorem 3.4 reveals that with probability at least 0.98, label aggregation (with majority vote) is better than label separation when 𝐾 > 23, which aligns well with our empirical observations (label separation is better only when 𝐾 < 15). 5.3. Experiments on Realistic Noisy Datasets Note that in real-world scenarios, the label-noise pattern may differ due to the expertise of each human annotator. We further compare the different treatments on two realistic noisy datasets: CIFAR-10N [74], and CIFAR-10H [75]. CIFAR-10N provides each CIFAR-10 train image with 3 independent human annotations, while CIFAR-10H gives ≈ 50 annotations for each CIFAR-10 test image. In Table 3, we repeat the reproduction of three robust loss functions with three different treatments on the separate noisy labels. We report the best-achieved test accuracy for Cross- Entropy/Backward Correction/Peer Loss methods when learning with label aggregation methods (majority-vote and EM inference) and the separation method (soft-label). We observe that the separation method tends to have a better performance than aggregation ones. This may be attributed to the relatively high noise rate (𝜖 ≈ 0.18) in CIFAR-N and the insufficient amount of labelers (𝐾 = 3). Note that since the noise level in CIFAR-10H is low (𝜖 ≈ 0.07 wrong labels), label aggregation methods can infer higher quality labels, and thus, result in a better performance than separation methods (Red colored cells in Table 3 and 4). 5.4. Hypothesis Testing We adopt the paired t-test to show which treatment on the separate noisy labels is better, under certain conditions. In Table 5, we report the statistic and 𝑝-value given by the hypothesis testing results. The column “Methods” indicate the two methods we want to compare (A & B). Positive statistics means that A is better than B in the metric of test accuracy. Given a specific setting, denote by Accmethod as the list of test accuracy that belongs to this setting (i.e., CIFAR-10N, 𝐾 = 3), including CE, BW, PL loss functions, the basic hypothesis could be summarized as below: Table 4 Experimental results on CIFAR10-H with 𝐾 ≥ 5. We highlight the results with Green (for separation method) and Red (for aggregation methods) if the performance gap is large than 0.05. CE 𝐾=5 𝐾=9 𝐾 = 15 𝐾 = 25 𝐾 = 49 Majority-Vote 80.69 80.73 81.37 81.79 81.66 EM-Inference 80.97 80.96 81.24 81.01 81.68 Separation 79.65 80.91 81.07 80.78 80.81 BW 𝐾=5 𝐾=9 𝐾 = 15 𝐾 = 25 𝐾 = 49 Majority-Vote 82.51 82.75 83.27 83.59 83.68 EM-Inference 82.30 82.68 82.74 82.89 83.08 Separation 82.14 82.48 81.92 81.72 81.69 PL 𝐾=5 𝐾=9 𝐾 = 15 𝐾 = 25 𝐾 = 49 Majority-Vote 81.84 81.85 82.39 82.98 82.83 EM-Inference 81.89 82.30 82.53 82.86 82.73 Separation 80.25 81.89 81.00 80.71 80.89 • Null hypothesis: there exists zero mean difference between (1) AccMV and AccEM ; or (2) AccMV and AccSep ; or (3) AccEM and AccSep ; • Alternative hypothesis: there exists non-zero mean difference between (1) AccMV and AccEM ; or (2) AccMV and AccSep ; or (3) AccEM and AccSep . To clarify, the three cases in the above hypothesis are tested independently. For test accuracy comparisons of CIFAR-10N in Table 3, the setting of the hypothesis test is 𝐾 = 3 and the label noise rate is relatively high (18%). 
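The paired t-tests behind Table 5 can be reproduced in a few lines. The sketch below is illustrative: it pairs the three per-loss accuracies of the CIFAR-10N row of Table 3, whereas the exact statistics in Table 5 may come from a slightly different pairing of runs. It uses scipy.stats.ttest_rel, whose sign convention matches the table: a positive statistic means the first method of the pair has the higher mean accuracy.

```python
from scipy.stats import ttest_rel

# Best test accuracies per robust loss (CE, BW, PL) on CIFAR-10N with K = 3 (Table 3).
acc_mv  = [89.52, 89.23, 89.84]   # majority vote
acc_em  = [89.19, 88.88, 88.92]   # EM inference
acc_sep = [89.77, 89.20, 89.97]   # label separation (soft label)

pairs = {"MV & EM": (acc_mv, acc_em),
         "MV & Sep": (acc_mv, acc_sep),
         "EM & Sep": (acc_em, acc_sep)}
for name, (a, b) in pairs.items():
    stat, pval = ttest_rel(a, b)   # paired t-test on the per-loss accuracy differences
    print(f"{name}: statistic = {stat:.3f}, p-value = {pval:.3f}")
```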
All 𝑝-values are larger than 0.05, so we cannot reject the null hypothesis at the 0.05 significance level, i.e., none of the pairwise mean differences is statistically significant. Together with the signs of the test statistics and the marginal 𝑝-values for the comparisons involving EM (0.057 and 0.060), the performance of the three treatments on CIFAR-10N (high noise, small 𝐾) roughly satisfies EM ⪅ MV ≈ Sep: EM trails slightly, while MV and Sep are comparable.

6. Conclusions
When learning with separate noisy labels, we explore the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. In the empirical risk minimization framework, we theoretically show that label separation could be more beneficial than label aggregation when the noise rates are high or the number of labelers is insufficient. These insights hold for a number of popular loss functions, including several robust treatments. Empirical results on synthetic and real-world datasets validate our conclusion.

Table 5
Hypothesis testing results of the comparisons between label aggregation methods and the label separation method on realistic noisy datasets.
Setting  |  Methods  |  Statistic  |  𝑝-value
CIFAR-10N (𝐾 = 3, high noise)  |  MV & EM  |  2.650  |  0.057
CIFAR-10N (𝐾 = 3, high noise)  |  MV & Sep  |  -0.401  |  0.708
CIFAR-10N (𝐾 = 3, high noise)  |  EM & Sep  |  -2.596  |  0.060
CIFAR-10H (𝐾 < 15, low noise)  |  MV & EM  |  -0.003  |  0.998
CIFAR-10H (𝐾 < 15, low noise)  |  MV & Sep  |  2.336  |  0.033
CIFAR-10H (𝐾 < 15, low noise)  |  EM & Sep  |  2.390  |  0.030
CIFAR-10H (𝐾 ≥ 15, low noise)  |  MV & EM  |  0.805  |  0.433
CIFAR-10H (𝐾 ≥ 15, low noise)  |  MV & Sep  |  4.426  |  0.000
CIFAR-10H (𝐾 ≥ 15, low noise)  |  EM & Sep  |  3.727  |  0.002

References [1] E. Estellés-Arolas, F. González-Ladrón-de Guevara, Towards an integrated crowdsourcing definition, Journal of Information science 38 (2012) 189–200. [2] J. Howe, et al., The rise of crowdsourcing, Wired magazine 14 (2006) 1–4. [3] Y. Liu, M. Liu, An online learning approach to improving the quality of crowd-sourcing, ACM SIGMETRICS Performance Evaluation Review 43 (2015) 217–230. [4] S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci, N. Navab, Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images, IEEE transactions on medical imaging 35 (2016) 1313–1321. [5] A. A. A. Setio, A. Traverso, T. De Bel, M. S. Berens, C. Van Den Bogaard, P. Cerello, H. Chen, Q. Dou, M. E. Fantacci, B. Geurts, et al., Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the luna16 challenge, Medical image analysis 42 (2017) 1–13. [6] T. Mitra, E. Gilbert, Credbank: A large-scale social media corpus with associated credibility annotations, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 9, 2015, pp. 258–267. [7] G. Pennycook, D. G. Rand, Fighting misinformation on social media using crowdsourced judgments of news source quality, Proceedings of the National Academy of Sciences 116 (2019) 2521–2526. [8] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, L. Moy, Learning from crowds, Journal of machine learning research 11 (2010). [9] J. Whitehill, T.-f. Wu, J. Bergsma, J. Movellan, P. Ruvolo, Whose vote should count more: Optimal integration of labels from labelers of unknown expertise, Advances in neural information processing systems 22 (2009). [10] F. Rodrigues, F. Pereira, B. Ribeiro, Gaussian process classification and active learning with multiple annotators, in: International conference on machine learning, PMLR, 2014, pp. 433–441. [11] F. Rodrigues, M. Lourenco, B. Ribeiro, F. C.
Pereira, Learning supervised topic models for classification and regression from crowds, IEEE transactions on pattern analysis and machine intelligence 39 (2017) 2409–2422. [12] T. Luo, Y. Liu, Machine truth serum, arXiv preprint arXiv:1909.13004 (2019). [13] P. G. Ipeirotis, F. Provost, V. S. Sheng, J. Wang, Repeated labeling using multiple noisy labelers, Data Mining and Knowledge Discovery 28 (2014) 402–441. [14] V. S. Sheng, J. Zhang, B. Gu, X. Wu, Majority voting and pairing with multiple noisy labeling, IEEE Transactions on Knowledge and Data Engineering 31 (2017) 1355–1368. [15] Z. Zhu, Z. Dong, Y. Liu, Detecting corrupted labels without training a model to predict, arXiv preprint arXiv:2110.06283 (2022). [16] Z.-H. Zhou, Ensemble methods: foundations and algorithms, CRC press, 2012. [17] M. Guan, V. Gulshan, A. Dai, G. Hinton, Who said what: Modeling individual labelers improves classification, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [18] F. Rodrigues, F. Pereira, Deep learning from crowds, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [19] Z. Chen, H. Wang, H. Sun, P. Chen, T. Han, X. Liu, J. Yang, Structured probabilistic end-to-end learning from crowds., in: IJCAI, 2020, pp. 1512–1518. [20] H. Wei, R. Xie, L. Feng, B. Han, B. An, Deep learning from multiple noisy annotators as a union, IEEE Transactions on Neural Networks and Learning Systems (2022). [21] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in: Advances in neural information processing systems, 2013, pp. 1196–1204. [22] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: A loss correction approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952. [23] X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, M. Sugiyama, Are anchor points really indispensable in label-noise learning?, Advances in Neural Information Processing Systems 32 (2019). [24] Z. Zhu, Y. Song, Y. Liu, Clusterability as an alternative to anchor points when learning with noisy labels, in: International Conference on Machine Learning, PMLR, 2021, pp. 12912–12923. [25] Z. Zhu, J. Wang, Y. Liu, Beyond images: Label noise transition matrix estimation for tasks with lower-quality features, arXiv preprint arXiv:2202.01273 (2022). [26] Z. Jiang, K. Zhou, Z. Liu, L. Li, R. Chen, S.-H. Choi, X. Hu, An information fusion approach to learning with instance-dependent label noise, in: International Conference on Learning Representations, 2022. [27] Z. Zhang, Y. Li, H. Wei, K. Ma, T. Xu, Y. Zheng, Alleviating noisy-label effects in image classification via probability transition matrix, arXiv preprint arXiv:2110.08866 (2021). [28] S. Li, X. Xia, H. Zhang, Y. Zhan, S. Ge, T. Liu, Estimating noise transition matrix with label correlations for noisy multi-label learning, in: Advances in Neural Information Processing Systems, 2022. [29] X. Xia, B. Han, N. Wang, J. Deng, J. Li, Y. Mao, T. Liu, Extended tex 𝑡?>: Learning with mixed closed-set and open-set noisy labels, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022). [30] T. Liu, D. Tao, Classification with noisy labels by importance reweighting, IEEE Transac- tions on pattern analysis and machine intelligence 38 (2016) 447–461. [31] H.-S. Chang, E. Learned-Miller, A. 
McCallum, Active bias: Training more accurate neu- ral networks by emphasizing high variance samples, Advances in Neural Information Processing Systems 30 (2017). [32] N. Bar, T. Koren, R. Giryes, Multiplicative reweighting for robust neural network optimiza- tion, arXiv preprint arXiv:2102.12192 (2021). [33] N. Majidi, E. Amid, H. Talebi, M. K. Warmuth, Exponentiated gradient reweighting for robust training under label noise and beyond, arXiv preprint arXiv:2104.01493 (2021). [34] A. Kumar, E. Amid, Constrained instance and class reweighting for robust learning under label noise, arXiv preprint arXiv:2111.05428 (2021). [35] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, A. Rabinovich, Training deep neural networks on noisy labels with bootstrapping, arXiv preprint arXiv:1412.6596 (2014). [36] M. Lukasik, S. Bhojanapalli, A. Menon, S. Kumar, Does label smoothing mitigate label noise?, in: International Conference on Machine Learning, PMLR, 2020, pp. 6448–6458. [37] J. Wei, H. Liu, T. Liu, G. Niu, Y. Liu, Understanding generalized label smoothing when learning with noisy labels, arXiv preprint arXiv:2106.04149 (2021). [38] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, J. Bailey, Symmetric cross entropy for robust learning with noisy labels, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 322–330. [39] E. Amid, M. K. Warmuth, R. Anil, T. Koren, Robust bi-tempered logistic loss based on Bregman divergences, Advances in Neural Information Processing Systems 32 (2019). [40] J. Wang, H. Guo, Z. Zhu, Y. Liu, Policy learning using weak supervision, Advances in Neural Information Processing Systems 34 (2021). [41] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, J. Bailey, Normalized loss functions for deep learning with noisy labels, in: International Conference on Machine Learning, PMLR, 2020, pp. 6543–6553. [42] Y. Liu, H. Guo, Peer loss functions: Learning from noisy labels without knowing noise rates, in: International Conference on Machine Learning, PMLR, 2020, pp. 6226–6236. [43] Z. Zhu, T. Liu, Y. Liu, A second-order approach to learning with instance-dependent label noise, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10113–10123. [44] J. Wei, Y. Liu, When optimizing 𝑓 -divergence is robust with label noise, arXiv preprint arXiv:2011.03687 (2020). [45] H. Wei, H. Zhuang, R. Xie, L. Feng, G. Niu, B. An, Y. Li, Logit clipping for robust learning against label noise, arXiv preprint arXiv:2212.04055 (2022). [46] X. Xia, T. Liu, B. Han, C. Gong, N. Wang, Z. Ge, Y. Chang, Robust early-learning: Hindering the memorization of noisy labels, in: International conference on learning representations, 2020. [47] S. Liu, J. Niles-Weed, N. Razavian, C. Fernandez-Granda, Early-learning regularization prevents memorization of noisy labels, Advances in neural information processing systems 33 (2020) 20331–20342. [48] S. Liu, Z. Zhu, Q. Qu, C. You, Robust training under label noise by over-parameterization, arXiv preprint arXiv:2202.14026 (2022). [49] H. Cheng, Z. Zhu, X. Sun, Y. Liu, Demystifying how self-supervised features improve training from noisy labels, arXiv preprint arXiv:2110.09022 (2021). [50] H. Wei, L. Tao, R. Xie, B. An, Open-set label noise can improve robustness against inherent label noise, Advances in Neural Information Processing Systems 34 (2021). [51] H. Huang, H. Kang, S. Liu, O. Salvado, T. Rakotoarivelo, D. Wang, T. 
Liu, Paddles: Phase- amplitude spectrum disentangled early stopping for learning with noisy labels (????). [52] S. Liu, K. Liu, W. Zhu, Y. Shen, C. Fernandez-Granda, Adaptive early-learning correction for segmentation from noisy annotations, arXiv preprint arXiv:2110.03740 (2021). [53] H. Cheng, Z. Zhu, X. Li, Y. Gong, X. Sun, Y. Liu, Learning with instance-dependent label noise: A sample sieve approach, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=2VXyy9mIyU3. [54] T. Luo, X. Li, H. Wang, Y. Liu, Research replication prediction using weakly supervised learning, in: In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, 2020. [55] Z. Wang, J. Jiang, B. Han, L. Feng, B. An, G. Niu, G. Long, Seminll: A framework of noisy-label learning by semi-supervised learning, arXiv preprint arXiv:2012.00925 (2020). [56] C. Qin, Y. Wang, Y. Fu, Robust semi-supervised domain adaptation against noisy labels, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 4409–4413. [57] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, M. Sugiyama, Co-teaching: Robust training of deep neural networks with extremely noisy labels, in: Advances in neural information processing systems, 2018, pp. 8527–8537. [58] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, M. Sugiyama, How does disagreement help generalization against label corruption?, in: International Conference on Machine Learning, PMLR, 2019, pp. 7164–7173. [59] H. Wei, L. Feng, X. Chen, B. An, Combating noisy labels by agreement: A joint training method with co-regularization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13726–13735. [60] B. Han, Q. Yao, T. Liu, G. Niu, I. W. Tsang, J. T. Kwok, M. Sugiyama, A survey of label-noise representation learning: Past, present and future, arXiv preprint arXiv:2011.04406 (2020). [61] H. Song, M. Kim, D. Park, Y. Shin, J.-G. Lee, Learning from noisy labels with deep neural networks: A survey, IEEE Transactions on Neural Networks and Learning Systems (2022). [62] V. Feldman, Does learning require memorization? a short tale about a long tail, in: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020, pp. 954–959. [63] Y. Liu, Understanding instance-level label noise: Disparate impacts and treatments, in: International Conference on Machine Learning, PMLR, 2021, pp. 6725–6735. [64] W. Tang, M. Yin, C.-J. Ho, Leveraging peer communication to enhance crowdsourcing, in: The World Wide Web Conference, 2019, pp. 1794–1805. [65] Z. Zhu, T. Luo, Y. Liu, The rich get richer: Disparate impact of semi-supervised learning, arXiv preprint arXiv:2110.06282 (2021). [66] S. Mendelson, Lower bounds for the empirical minimization algorithm, IEEE Transactions on Information Theory 54 (2008) 3797–3803. [67] G. Lecué, S. Mendelson, Sharper lower bounds on the performance of the empirical risk minimization algorithm, Bernoulli (2010) 605–613. [68] A. P. Dawid, A. M. Skene, Maximum likelihood estimation of observer error-rates using the em algorithm, Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1979) 20–28. [69] P. Smyth, U. Fayyad, M. Burl, P. Perona, P. Baldi, Inferring ground truth from subjective labelling of venus images, Advances in neural information processing systems 7 (1994). [70] N. Quoc Viet Hung, N. T. Tam, L. N. Tran, K. 
Aberer, An evaluation of aggregation techniques in crowdsourcing, in: International Conference on Web Information Systems Engineering, Springer, 2013, pp. 1–15. [71] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml. [72] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, Technical Report, Citeseer, 2009. [73] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Pro- ceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [74] J. Wei, Z. Zhu, H. Cheng, T. Liu, G. Niu, Y. Liu, Learning with noisy labels revisited: A study using real-world human annotations, arXiv preprint arXiv:2110.12088 (2021). [75] J. C. Peterson, R. M. Battleday, T. L. Griffiths, O. Russakovsky, Human uncertainty makes classification more robust, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9617–9626. [76] J. M. Varah, A lower bound for the smallest singular value of a matrix, Linear Algebra and its applications 11 (1975) 3–5. [77] X. Xia, T. Liu, B. Han, N. Wang, M. Gong, H. Liu, G. Niu, D. Tao, M. Sugiyama, Part- dependent label noise: Towards instance-dependent label noise, Advances in Neural Information Processing Systems 33 (2020) 7597–7610. Appendices A. Full Proofs In this section, we briefly introduce all omitted proofs in the main paper. We firstly give the proof of Lemma 4.1 because it is beneficial for the proofs in Section 3. A.1. Proof of Lemma 4.1 Proof. To apply Hoeffding’s inequality on the dataset of the separation method, we divide the noisy train samples {(𝑥𝑛 , 𝑦˜𝑛,𝑘 )}𝑛∈[𝑁 ] into 𝐾 groups, for 𝑘 ∈ [𝐾], i.e., {(𝑥𝑛 , 𝑦˜𝑛,1 )}𝑛∈[𝑁 ] , · · · , {(𝑥𝑛 , 𝑦˜𝑛,𝐾 )}𝑛∈[𝑁 ] . Note within each group, e.g., group {(𝑥𝑛 , 𝑦˜𝑛,1 )}𝑛∈[𝑁 ] , all the 𝑁 training samples are i.i.d. Additionally, training samples between any two different groups are also i.i.d. given feature set {𝑥𝑛 }𝑛∈[𝑁 ] . Thus, with one group {(𝑥𝑛 , 𝑦˜𝑛,1 )}𝑛∈[𝑁 ] , w.p. 1 − 𝛿0 , we have √︂ ⃒ ⃒^ ⃒ (︁ )︁ log(1/𝛿0 ) ⃒𝑅1← |Group-1 (𝑓 ) − 𝑅1← (𝑓 )⃒ ≤ 1← − 1← · , ∀𝑓. ⃒ 2𝑁 Note that: 1 (︁1 − 𝜌𝑢 −𝜌𝑢 )︁ (𝑇 𝑢 )−1 = 1 0 , for 𝑢 ∈ {, ∙}, 1 − 𝜌𝑢0 − 𝜌𝑢1 −𝜌𝑢1 1 − 𝜌𝑢0 we have: (1 + |𝜌0 − 𝜌1 |) 1← − 1← := 𝐿←0 = . 1 − 𝜌0 − 𝜌1 Applying the above technique on the other groups and by the union bound, we know that w.p. at least 1 − 𝐾𝛿0 , ∀𝑘 ∈ [𝐾], [︃ √︂ √︂ ]︃ ^ 1 |Group-k (𝑓 ) ∈ 𝑅1 (𝑓 ) − 𝐿 · log(1/𝛿 0 ) log(1/𝛿 0 ) 𝑅 ← ← ←0 , 𝑅1← (𝑓 ) + 𝐿←0 · . 2𝑁 2𝑁 Each 𝑅 ^ 1 |Group-k (𝑓 ), 𝑘 ∈ [𝐾] can be seen as a random variable within range: ← [︃ √︂ √︂ ]︃ log(1/𝛿0 ) log(1/𝛿0 ) 𝑅1← (𝑓 ) − 𝐿←0 · , 𝑅1← (𝑓 ) + 𝐿←0 · . 2𝑁 2𝑁 The randomness is from noisy labels 𝑦˜𝑛,𝑘 . Recall that the samples between different groups are i.i.d. given {𝑥𝑛 }𝑛∈[𝑁 ] . Then the above 𝐾 random variables are i.i.d. when the feature set is fixed. By Hoeffding’s inequality, w.p. at least 1 − 𝐾𝛿0 − 𝛿1 , ∀𝑓 , we have √︂ √︂ √︂ ⃒ ⃒^ ⃒ log(1/𝛿0 ) log(1/𝛿1 ) log(1/𝛿1 ) log(1/𝛿0 ) ⃒𝑅1← (𝑓 ) − 𝑅1← (𝑓 )⃒ ≤ 2 · 𝐿←0 · · = 𝐿←0 · . 
⃒ 2𝑁 2𝐾 𝑁𝐾 For 𝛿0 = 𝛿1 = 𝐾+1 𝛿 , with the Rademacher bound on the maximal deviation between risks and empirical ones, for 𝑓 * ∈ ℱ and the separation method, with probability at least 1 − 𝛿, we have: (︂ )︂ √︂ ⃒ ⃒^ ⃒ ∘ 𝐾 +1 1 max ⃒𝑅ℓ← ,𝐷̃︀ (𝑓 ) − 𝑅ℓ← ,𝒟̃︀ (𝑓 )⃒ ≤ 2R (ℓ← ∘ ℱ) + 𝐿←0 · (ℓ − ℓ) · log · , ⃒ 𝑓 ∈ℱ 𝛿 𝑁𝐾 √︂ ⃒ ⃒^ ⃒ ∙ (︁ ∙ )︁ log(1/𝛿) max ⃒𝑅ℓ← ,𝐷̃︀ ∙ (𝑓 ) − 𝑅ℓ← ,𝒟̃︀ ∙ (𝑓 )⃒ ≤ 2R (ℓ← ∘ ℱ) + ℓ∙← − ℓ← · ⃒ 𝑓 ∈ℱ 2𝑁 √︂ log(1/𝛿) =2R∙ (ℓ← ∘ ℱ) + 𝐿∙←0 · (ℓ − ℓ) · , 2𝑁 where we define ℓ, ℓ as the upper and lower bound of loss function ℓ respectively, and: ⎡ ⎤ 𝑁 ∑︁𝐾 1 ∑︁ R(ℓ← ∘ ℱ) := E𝑥𝑖 ,𝑦˜𝑖,1 ,...,𝑦˜𝑖,𝐾 ,𝜖𝑖 ⎣ sup 𝜖𝑖 ℓ← (𝑓 (𝑥𝑖 ), 𝑦˜𝑖,𝑗 )⎦ 𝑓 ∈ℱ 𝑁 𝐾 𝑖=1 𝑗=1 𝐾 𝑁 [︃ ]︃ 1 ∑︁ 1 ∑︁ ≤ E𝑥𝑖 ,𝑦˜𝑖,𝑗 ,𝜖𝑖 sup 𝜖𝑖 ℓ← (𝑓 (𝑥𝑖 ), 𝑦˜𝑖,𝑗 ) , 𝐾 𝑓 ∈ℱ 𝑁 𝑗=1 𝑖=1 [︃ ]︃ ∙ 1 ∙ R (ℓ← ∘ ℱ) := E𝑥𝑖 ,𝑦˜∙ ,𝜖𝑖 sup 𝜖𝑖 ℓ← (𝑓 (𝑥𝑖 ), 𝑦˜ ) . 𝑓 ∈ℱ 𝑁 Note that we assume the noisy labels given by the 𝐾 labelers follow the same noise transition matrix, if ℓ is 𝐿−Lipshitz, then for separation and aggregation methods, ℓ← is 𝐿𝑢← Lipshitz (1+|𝜌𝑢 −𝜌𝑢 |)𝐿 for 𝑢 ∈ {, ∙} respectively, where 𝐿𝑢← = 1−𝜌0𝑢 −𝜌1𝑢 ≤ 1−𝜌2𝐿 𝑢 −𝜌𝑢 . By the Lipshitz composition 0 1 0 1 property of Rademacher averages, we have R (ℓ← ∘ ℱ) ≤ 𝐿𝑢← · R(ℱ). Thus, we have: 𝑢 √︂ ^ (1 + |𝜌0 − 𝜌1 |) · (ℓ − ℓ) 𝐾 +1 1 max |𝑅ℓ← ,𝐷̃︀ (𝑓 ) − 𝑅ℓ← ,𝒟̃︀ (𝑓 )| ≤ 2𝐿← R(ℱ) + · log( )· , 𝑓 ∈ℱ 1 − 𝜌0 − 𝜌1 𝛿 𝑁𝐾 (10) √︂ ^ ∙ (1 + |𝜌∙0 − 𝜌∙1 |) · (ℓ − ℓ) log(1/𝛿) max |𝑅 ∙ (𝑓 ) − 𝑅 ∙ (𝑓 )| ≤ 2𝐿← R(ℱ) + · . 𝑓 ∈ℱ ℓ ← ,𝐷 ℓ ← ,𝒟 1 − 𝜌∙0 − 𝜌∙1 2𝑁 ̃︀ ̃︀ Assume 𝑓 * ← min𝑓 ∈ℱ 𝑅ℓ,𝒟 (𝑓 ), for separation methods, we further have: 𝑅ℓ,𝒟 (𝑓^ ← ) − 𝑅ℓ,𝒟 (𝑓 * ) = 𝑅ℓ← ,𝒟̃︀ (𝑓^ ← ) − 𝑅ℓ← ,𝒟̃︀ (𝑓 * ) * * * = 𝑅ℓ← ,𝒟̃︀ (𝑓^ ← ) − 𝑅 ^ ^ ^ ̃︀ (𝑓 ) − 𝑅ℓ← ,𝒟 ̃︀ (𝑓 ← ) + 𝑅ℓ← ,𝐷 ℓ ← ,𝐷 ^ ̃︀ (𝑓 ) + 𝑅ℓ← ,𝐷 ^ ^ ̃︀ (𝑓 ← ) − 𝑅ℓ← ,𝐷 ̃︀ (𝑓 ) ^ ≤ 0 + 2 max |𝑅 ̃︀ (𝑓 ) − 𝑅ℓ← ,𝒟 ̃︀ (𝑓 )| 𝑓 ∈ℱℓ← ,𝐷 √︂ 𝐾 +1 1 ≤ 4𝐿← R(ℱ) + 2𝐿← · (ℓ − ℓ) · log( )· . 𝛿 𝑁𝐾 Similarly, for aggregation methods, we have: ∙ ∙ 𝑅ℓ,𝒟 (𝑓^ ← ) − 𝑅ℓ,𝒟 (𝑓 * ) = 𝑅ℓ← ,𝒟̃︀ ∙ (𝑓^ ← ) − 𝑅ℓ← ,𝒟̃︀ ∙ (𝑓 * ) ∙ = 𝑅ℓ← ,𝒟̃︀ ∙ (𝑓^ ← ) − 𝑅 ^ ^∙ ^ * ̃︀ ∙ (𝑓 ) − 𝑅ℓ← ,𝒟 ̃︀ ∙ (𝑓 ← ) + 𝑅ℓ← ,𝐷 * ^ ^∙ ^ ̃︀ ∙ (𝑓 ← ) − 𝑅ℓ← ,𝐷 ̃︀ ∙ (𝑓 ) + 𝑅ℓ← ,𝐷 * ̃︀ ∙ (𝑓 ) ℓ ← ,𝐷 ^ ≤ 0 + 2 max |𝑅 ̃︀ ∙ (𝑓 ) − 𝑅ℓ← ,𝒟 ̃︀ ∙ (𝑓 )| 𝑓 ∈ℱ ℓ← ,𝐷 √︂ ∙ ∙ log(1/𝛿) ≤ 4𝐿← R(ℱ) + 2𝐿← · (ℓ − ℓ) · . 2𝑁 𝐾·log( 1𝛿 ) Note that 𝜂𝐾 = 2 and 𝜂𝐾 ∙ ≡ 1, we then have: 2(log( 𝐾+1 𝛿 )) √︃ 𝑢 𝐿𝑢 · (ℓ − ℓ) 2 log(1/𝛿) 𝑅ℓ,𝒟 (𝑓^ ← ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ 4𝐿𝑢← R(ℱ) + ← · 𝑢𝑁 . 𝐿 𝜂𝐾 𝑢 Defined as: Δ𝑅← A.2. Proof of Theorem 4.2 Proof. The proof is straightforward if we proceed with the proof of Lemma 4.1 with the below discussions. With the knowledge of noise rates for both methods, remember that 𝐿𝑢← = (1+|𝜌𝑢 𝑢 0 −𝜌1 |)𝐿 1−𝜌𝑢 −𝜌𝑢 , we have: 0 1 ∙ Δ𝑅← < Δ𝑅← √︂ √︂ 𝐿← 𝐾 +1 1 ∙ 𝐿∙← log(1/𝛿) =⇒ 2𝐿← R(ℱ) + · (ℓ − ℓ) · log( )· < 2𝐿← R(ℱ) + · (ℓ − ℓ) · 𝐿 𝛿 𝑁𝐾 𝐿 2𝑁 √︂ √︂ 𝐿 − 𝐿← ∙ log(1/𝛿) 𝐾 +1 1 =⇒ 2· ← · 𝐿 · R(ℱ) < 𝐿∙← · − 𝐿← · log( )· ℓ−ℓ 2𝑁 𝛿 𝑁𝐾 (︃ √︃ )︃ √︂ 𝐿 − 𝐿∙← 1 log(1/𝛿) =⇒ 2· ← · 𝐿 · R(ℱ) < 𝐿∙← − 𝐿← · ℓ−ℓ 𝜂𝐾 2𝑁 √︃ √︃ 2𝑁 𝐿← − 𝐿∙← 1 =⇒ 2 · 𝐿 · R(ℱ) < 𝐿∙← − 𝐿← · . log(1/𝛿) ℓ − ℓ 𝜂𝐾 For any finite concept class ℱ ⊂ {𝑓 : 𝑋 → {0,√︁ 1}}, and the sample set 𝑆 = {𝑥1 , ..., 𝑥𝑁 }, the 2𝑑 log(𝑁 ) Rademacher complexity is upper bounded by 𝑁 where 𝑑 is the VC dimension of ℱ. 
∙ To achieve Δ𝑅← < Δ𝑅← , we simply need to find the condition of 𝐾 (or 𝜂𝐾 ) that satisfies the below in-equation: √︃ √︂ (︃ √︃ )︃ ∙ 2𝑁 𝐿← − 𝐿∙← 2𝑑 log(𝑁 ) 1 Δ𝑅← < Δ𝑅← =⇒ 2 ·𝐿· < 𝐿∙← − 𝐿← · log(1/𝛿) ℓ − ℓ 𝑁 𝜂𝐾 √︃ (︃ √︃ )︃ 𝐿 − 𝐿∙← 𝑑 log(𝑁 ) 1 =⇒ 4 ← ·𝐿· < 𝐿∙← − 𝐿← · ℓ−ℓ log(1/𝛿) 𝜂𝐾 √︃ (︃ √︃ )︃ 4 𝑑 log(𝑁 ) 1 =⇒ (𝐿← − 𝐿∙← ) · ·𝐿· ∙ < 𝐿← − 𝐿← · ℓ−ℓ log(1/𝛿) 𝜂𝐾 denoted as 𝛼← ,which is a function of 𝑁,𝛿,𝑑,𝐿 (︃ √︃ )︃ 1 =⇒ (𝐿← − 𝐿∙← ) · 𝛼← < 𝐿∙← − 𝐿← · 𝜂𝐾 (︃ √︃ )︃ 1 =⇒ (𝐿← − 𝐿∙← ) · 𝛼← < (𝐿∙← − 𝐿← ) + 1− · 𝐿← 𝜂𝐾 (︃)︃ √︃ 1 𝐿← =⇒ 𝛼← + 1 < 1 − · 𝜂𝐾 𝐿← − 𝐿∙← 1 (︁ 1 )︁ 𝐿← =⇒ < 1 − (𝜂𝐾 )− 2 · 𝛾 𝐿← − 𝐿∙← 1 =⇒ 𝛼𝐾 · 1 ≤ 𝛾, 1 − (𝜂𝐾 )− 2 √︁ 𝑑 log(𝑁 ) where we denote by 𝛼𝐾 := 1 − 𝐿∙← /𝐿← , 𝛾 = 1/(1 + ℓ−ℓ 4𝐿 log(1/𝛿) ). A.3. Proof of Theorem 4.3 Proof. 𝑢 [︁ 𝑢 𝑢 ]︁2 Var(𝑓^ ← ) = E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 [ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )] [︃ ]︃ [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 𝑢 𝑢 ^ 𝑢 ^ 𝑢 ^ 𝑢 =E ̃︀ 𝑢 ℓ(𝑓 ← (𝑋), 𝑌̃︀ ) + E ̃︀ 𝑢 [ℓ(𝑓 ← (𝑋), 𝑌̃︀ )] − 2ℓ(𝑓 ← (𝑋), 𝑌̃︀ )E ̃︀ 𝑢 [ℓ(𝑓 ← (𝑋), 𝑌̃︀ )]^ 𝑢 𝒟 𝒟 𝒟 [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 [︁ 𝑢 𝑢 ]︁ =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) + E𝒟̃︀ 𝑢 [ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )] − E𝒟̃︀ 𝑢 2ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )E𝒟̃︀ 𝑢 [ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )] [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − E𝒟̃︀ 𝑢 [ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )] [︁ 𝑢 ]︁2 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ))2 . A special case is the 0-1 loss, i.e., ℓ(·) = 1(·), we then have: 𝑢 [︁ 𝑢 ]︁2 𝑢 Var(𝑓^ ← ) =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ))2 [︁ 𝑢 ]︁ 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ))2 𝑢 𝑢 =𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ))2 , 𝑢 where 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ) ∈ [0, 1] and 𝑔(𝑎) = 𝑎 − 𝑎2 is monotonically increasing when 𝑎 < 12 . Note 𝑢 √︁ that: 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ) < 𝐿𝑢←0 · (ℓ − ℓ) · log(1/𝛿) 2𝜂 𝑢 𝑁 , when 𝐾 √︃ √︃ log(1/𝛿) 1 − 12 𝑁 𝐿𝑢←0 · (ℓ − ℓ) · 𝑢 ≤ ⇐⇒ 𝐿𝑢←0 (𝜂𝐾 𝑢 ) < , 2𝜂𝐾 𝑁 2 2(ℓ − ℓ)2 log(1/𝛿) 𝑢 (︁ 𝑢 √︁ )︁ we have: Var(𝑓^ ← ) ≤ 𝑔 𝐿← ·(ℓ−ℓ) 𝐿 · 2 log(1/𝛿) 𝑢 . (︁ √︁ )︁𝜂𝐾 𝑁 (︁ ∙ √︁ )︁ To achieve: 𝑔 𝐿← ·(ℓ−ℓ) 𝐿 · 2 log(1/𝛿) 𝜂 𝑁 ≤ 𝑔 𝐿← ·(ℓ−ℓ) 𝐿 · 2 log(1/𝛿) ∙ 𝜂 𝑁 , we simply need: 𝐾 𝐾 √︃ √︃ 2 log(1/𝛿) 2 log(1/𝛿) 𝐿 ≤ 𝐿∙←0 · (ℓ − ℓ) · > ← √︀ 𝑢 𝐿←0 · (ℓ − ℓ) · ∙ ⇐⇒ 𝜂𝐾 . 𝜂𝐾 𝑁 𝜂𝐾 𝑁 𝐿∙← A.4. Proof for Corollary 4.4 For a general matrix 𝑈 = (𝑇 𝑢 )−1 , we firstly note 1𝑢← − 1𝑢← = max 𝑈𝑖𝑗𝑢 − min 𝑈𝑖𝑗𝑢 𝑖,𝑗∈[𝑀 ] 𝑖,𝑗∈[𝑀 ] ≤| max 𝑈𝑖𝑗𝑢 | + | min 𝑈𝑖𝑗𝑢 | 𝑖,𝑗∈[𝑀 ] 𝑖,𝑗∈[𝑀 ] ∑︁ ∑︁ ≤| max 𝑈𝑖𝑗𝑢 | + | min 𝑈𝑖𝑗𝑢 |. 𝑖∈[𝑀 ] 𝑖∈[𝑀 ] 𝑗∈[𝑀 ],𝑈𝑖𝑗 >0 𝑗∈[𝑀 ],𝑈𝑖𝑗 <0 Recall 𝑇 𝑢 1 = 1 ⇒ 1 = (𝑇 𝑢 )−1 1. We know the above maximum and minimum take the same 𝑖. Then ∑︁ ∑︁ 1𝑢← − 1𝑢← ≤| max 𝑈𝑖𝑗𝑢 | + | min 𝑈𝑖𝑗𝑢 | 𝑖∈[𝑀 ] 𝑖∈[𝑀 ] 𝑗∈[𝑀 ],𝑈𝑖𝑗 >0 𝑗∈[𝑀 ],𝑈𝑖𝑗 <0 𝑢 =‖𝑈 ‖∞ (𝑎) 1 ≤ min𝑖∈[𝑀 ] (𝑇𝑖𝑖𝑢 − 𝑢 ∑︀ 𝑗̸=𝑖 𝑇𝑖𝑗 ) 1 ≤ , 𝑒𝑢 := max (1 − 𝑇𝑖𝑖𝑢 ), 𝑒𝑢 < 0.5. 1 − 2𝑒𝑢 𝑖∈[𝑀 ] Now we prove the inequality (𝑎) [76]. Let 𝜈 satisfy ‖(𝑇 𝑢 )−1 ‖∞ = ‖(𝑇 𝑢 )−1 𝜈‖∞ /‖𝜈‖∞ and let 𝜇 = (𝑇 𝑢 )−1 𝜈. Then ‖(𝑇 𝑢 )−1 ‖∞ = ‖𝜇‖∞ /‖𝜈‖∞ To bound ‖𝜇‖, we choose 𝑖 such that 𝜇𝑖 = ‖𝜇‖∞ . Then ∑︁ 𝑇𝑖𝑖𝑢 𝜇𝑖 = 𝜈𝑖 − 𝑇𝑖𝑗𝑢 𝜇𝑗 , 𝑗̸=𝑖 which further gives ∑︁ ∑︁ |𝑇𝑖𝑖𝑢 |‖𝜇‖∞ ≤ |𝜈𝑖 | + |𝑇𝑖𝑗𝑢 ||𝜇𝑗 | ≤ |𝜈𝑖 | + ‖𝜇‖∞ |𝑇𝑖𝑗𝑢 |. 𝑗̸=𝑖 𝑗̸=𝑖 Therefore, |𝜈𝑖 | ‖𝜇‖∞ ≤ 𝑢, 𝑇𝑖𝑖𝑢 − ∑︀ 𝑗̸=𝑖 𝑇𝑖𝑗 and 1 ‖(𝑇 𝑢 )−1 ‖∞ = ‖𝜇‖∞ /‖𝜈‖∞ ≤ 𝑢. 𝑇𝑖𝑖𝑢 − ∑︀ 𝑗̸=𝑖 𝑇𝑖𝑗 On the other hand, denoting by ‖𝑈 ‖max := max𝑖,𝑗∈[𝑀 ] |𝑈𝑖𝑗 |, from eigenvalues, we know √ 𝑢 𝑢 √ 𝑀 𝑢 1← − 1← ≤‖𝑈 ‖∞ ≤ 𝑀 𝜆max (𝑈 ) = . 𝜆min (𝑇 𝑢 ) where 𝜆min (𝑇 𝑢 ) denotes the minimal eigenvalue of the matrix 𝑇 𝑢 . Therefore, √ 𝑢 𝑢 ∘ 1 𝑀 1← − 1← = 𝐿←0 = min{ 𝑢 , }, 1 − 2𝑒𝑖max 𝜆min (𝑇 𝑢 ) where 𝑒𝑢 := max𝑖∈[𝑀 ] (1 − 𝑇𝑖𝑖𝑢 ), 𝑒𝑢 < 0.5, and 𝜆min (𝑇 𝑢 ) denotes the minimal eigenvalue of the matrix 𝑇 𝑢 . A.5. Proof of Lemma 3.2 𝑢 Proof. 
Note that for 𝑓^ = 𝑓^ , we have: 𝑢 𝑢 𝑅ℓ,𝒟 (𝑓^ ) − min 𝑅ℓ,𝒟 (𝑓 ) = 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟 (𝑓 * ) 𝑓 ∈ℱ 𝑢 𝑢 𝑢 = 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) + 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) + min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) − 𝑅ℓ,𝒟 (𝑓 * ). (11) 𝑓 ∈ℱ 𝑓 ∈ℱ Distribution shift Estimation error The term of distribution shift can be upper bounded by: 𝑢 𝑢 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) [︁ 𝑢 ]︁ [︁ 𝑢 ]︁ =E(𝑋,𝑌 )∼𝒟 ℓ(𝑓^ (𝑋), 𝑌 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀𝑖𝑢 ) 𝑖 ⃒ [︁ ]︁⃒ ≤ max ⃒E(𝑋,𝑌 )∼𝒟 [ℓ(𝑓 (𝑋), 𝑌 )] − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓 (𝑋), 𝑌̃︀𝑖𝑢 ) ⃒ ⃒ ⃒ 𝑓 ∈ℱ 𝑖 ⃒ = max ⃒E(𝑋,𝑌 =1)∼𝒟 [ℓ(𝑓 (𝑋), 1)] + E(𝑋,𝑌 =0)∼𝒟 [ℓ(𝑓 (𝑋), 0)] ⃒ 𝑓 ∈ℱ [︁ ]︁ [︁ ]︁ ⃒ − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ,𝑌 =1 ℓ(𝑓 (𝑋), 𝑌̃︀𝑖𝑢 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ,𝑌 =0 ℓ(𝑓 (𝑋), 𝑌̃︀𝑖𝑢 ) ⃒ ⃒ 𝑖 𝑖 ⃒ = max ⃒E(𝑋,𝑌 =1)∼𝒟 [ℓ(𝑓 (𝑋), 1)] + E(𝑋,𝑌 =0)∼𝒟 [ℓ(𝑓 (𝑋), 0)] ⃒ 𝑓 ∈ℱ − E(𝑋,𝑌̃︀ 𝑢 =1)∼𝒟̃︀ 𝑢 ,𝑌 =1 [ℓ(𝑓 (𝑋), 1)] − E(𝑋,𝑌̃︀ 𝑢 =0)∼𝒟̃︀ 𝑢 ,𝑌 =1 [ℓ(𝑓 (𝑋), 0)] 𝑖 𝑖 ⃒ − E(𝑋,𝑌̃︀ 𝑢 =1)∼𝒟̃︀ 𝑢 ,𝑌 =0 [ℓ(𝑓 (𝑋), 1)] − E(𝑋,𝑌̃︀ 𝑢 =0)∼𝒟̃︀ 𝑢 ,𝑌 =0 [ℓ(𝑓 (𝑋), 0)] ⃒ ⃒ 𝑖 𝑖 ⃒ = max ⃒E(𝑋,𝑌 =1)∼𝒟 [ℓ(𝑓 (𝑋), 1)] + E(𝑋,𝑌 =0)∼𝒟 [ℓ(𝑓 (𝑋), 0)] ⃒ 𝑓 ∈ℱ [︁ ]︁ [︁ ]︁ − E(𝑋,𝑌 =1)∼𝒟 P(𝑌̃︀𝑖𝑢 = 1|𝑌 = 1) · ℓ(𝑓 (𝑋), 1) − E(𝑋,𝑌 =1)∼𝒟 P(𝑌̃︀𝑖𝑢 = 0|𝑌 = 1) · ℓ(𝑓 (𝑋), 0) [︁ ]︁ [︁ ]︁⃒ − E(𝑋,𝑌 =0)∼𝒟 P(𝑌̃︀𝑖𝑢 = 1|𝑌 = 0) · ℓ(𝑓 (𝑋), 1) − E(𝑋,𝑌 =0)∼𝒟 P(𝑌̃︀𝑖𝑢 = 0|𝑌 = 0) · ℓ(𝑓 (𝑋), 0) ⃒. ⃒ Combine similar terms, we then have: ⃒ [︁ ]︁ [︁ ]︁ = max ⃒E(𝑋,𝑌 =1)∼𝒟 P(𝑌̃︀𝑖𝑢 = 0|𝑌 = 1) · ℓ(𝑓 (𝑋), 1) + E(𝑋,𝑌 =0)∼𝒟 P(𝑌̃︀𝑖𝑢 = 1|𝑌 = 0) · ℓ(𝑓 (𝑋), 0) ⃒ 𝑓 ∈ℱ [︁ ]︁ [︁ ]︁ ⃒ − E(𝑋,𝑌 =1)∼𝒟 P(𝑌̃︀𝑖𝑢 = 0|𝑌 = 1) · ℓ(𝑓 (𝑋), 0) − E(𝑋,𝑌 =0)∼𝒟 P(𝑌̃︀𝑖𝑢 = 1|𝑌 = 0) · ℓ(𝑓 (𝑋), 1) ⃒ ⃒ ⃒ ⃒ 𝑢 𝑢 = max ⃒E(𝑋,𝑌 =1)∼𝒟 [𝜌1 · (ℓ(𝑓 (𝑋), 1) − ℓ(𝑓 (𝑋), 0))] + E(𝑋,𝑌 =0)∼𝒟 [𝜌0 · (ℓ(𝑓 (𝑋), 0) − ℓ(𝑓 (𝑋), 1))] ⃒ ⃒ ⃒ 𝑓 ∈ℱ ⃒ )︀]︀ ⃒⃒ ≤ max ⃒E(𝑋,𝑌 =1)∼𝒟 𝜌𝑢1 · ℓ − ℓ + E(𝑋,𝑌 =0)∼𝒟 𝜌𝑢0 · ℓ − ℓ ⃒ ⃒ [︀ (︀ )︀]︀ [︀ (︀ 𝑓 ∈ℱ =(𝑝1 𝜌𝑢1 + 𝑝0 𝜌𝑢0 ) · ℓ − ℓ . (︀ )︀ Thus, we have: 𝑢,1 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) ≤ Δ𝑅 := (𝜌𝑢0 𝑝0 + 𝜌𝑢1 𝑝1 ) · ℓ − ℓ . (︀ )︀ A.6. Proof of Lemma 3.3 Proof. For the term Estimation error, we have: 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − 𝑅ℓ,𝒟 (𝑓 * ) 𝑢 = 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) + min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) − 𝑅ℓ,𝒟 (𝑓 * ) 𝑓 ∈ℱ 𝑓 ∈ℱ Estimation error 𝑢 ≤ 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) + |min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) − 𝑅ℓ,𝒟 (𝑓 * )| 𝑓 ∈ℱ 𝑓 ∈ℱ Error 1 Error 2 The upper bound of Error 1 could be derived directly from the proof of Lemma 4.1: since the loss function makes no use of loss correction, the L-Lipschitz constant does not have to multiply with the constant and 𝐿𝑢← → 𝐿. Besides, the constant for the variance term (square term) reduces to (ℓ − ℓ). Thus, we have: √︃ 2 log(1/𝛿) Error 1 ≤ 4𝐿R(ℱ) + (ℓ − ℓ) · 𝑢𝑁 , ∀𝑓 ∈ ℱ. 𝜂𝐾 For the term Error 2, the upper bound could be derived with the same procedure as adopted in the proof of Lemma 3.2. Thus, we obtain: √︃ 2 log(1/𝛿) 𝑢,1 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ 4𝐿R(ℱ) + (ℓ − ℓ) · 𝑢 + Δ𝑅 . 𝜂𝐾 𝑁 𝑢,2 Defined as: Δ𝑅 A.7. Proof of Theorem 3.4 Proof. To achieve a smaller upper bound for the separation method, mathematically, we want: √︃ 2 log(1/𝛿) (︀ )︀ 4𝐿R(ℱ) + (ℓ − ℓ) · + 2(𝜌0 𝑝0 + 𝜌1 𝑝1 ) · ℓ − ℓ 𝜂𝐾 𝑁 √︃ 2 log(1/𝛿) ∙ ∙ (︀ )︀ ≤4𝐿R(ℱ) + (ℓ − ℓ) · ∙ 𝑁 + 2(𝜌0 𝑝0 + 𝜌1 𝑝 1 ) · ℓ − ℓ , 𝜂𝐾 which is equivalent to proving: √︂ log(1/𝛿) 1 ((𝜂𝐾 )− 2 − 1) · (ℓ − ℓ) ≤ [(𝜌∙0 𝑝0 + 𝜌∙1 𝑝1 ) − (𝜌0 𝑝0 + 𝜌1 𝑝1 )] · ℓ − ℓ (︀ )︀ √︂ 2𝑁 log(1/𝛿) 1 ⇐⇒ (1 − (𝜂𝐾 )− 2 ) ≥ [(𝜌0 𝑝0 + 𝜌1 𝑝1 ) − (𝜌∙0 𝑝0 + 𝜌∙1 𝑝1 )] . (12) 2𝑁 De-noising effect of aggregation ≥0 √︁ log(1/𝛿) (𝜌0 𝑝0 +𝜌1 𝑝1 )−(𝜌∙0 𝑝0 +𝜌∙1 𝑝1 ) Eqn. (12) then requires: 2𝑁 ≥ 1 , which is mentioned as 𝛼𝐾 · (1−(𝜂𝐾 )− 2 ) 1 ≤ 𝛾, where 𝛼𝐾 := (𝜌0 𝑝0 + 𝜌1 𝑝1 ) − (𝜌∙0 𝑝0 + 𝜌∙1 𝑝1 ), 𝛾 = log(1/𝛿)/2𝑁 . √︀ 1 1−(𝜂𝐾 )− 2 A.8. Proof of Theorem 3.6 Proof. 
For 𝑢 ∈ {, ∙}, we have: 𝑢 [︁ 𝑢 𝑢 ]︁2 Var(𝑓^ ) =E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] [︃ ]︃ [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 𝑢 𝑢 =E ̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) + E ̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] − 2ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )E ̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] 𝒟 𝒟 𝒟 [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 [︁ 𝑢 𝑢 ]︁ =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) + E𝒟̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] − E𝒟̃︀ 𝑢 2ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )E𝒟̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − E𝒟̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] [︁ 𝑢 ]︁2 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ))2 . A special case is the 0-1 loss, i.e., ℓ(·) = 1(·), we then have: 𝑢 [︁ 𝑢 ]︁2 𝑢 Var(𝑓^ ) =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ))2 [︁ 𝑢 ]︁ 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ))2 𝑢 𝑢 𝑢 (︁ )︁ =𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ))2 = 𝑔 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) . 𝑢 where 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) ∈ [0, 1] and 𝑔(𝑎) = 𝑎 − 𝑎2 is monotonically increasing when 𝑎 < 12 . Thus, when √︃ 𝑢 log(1/𝛿) 1 2 log(1/𝛿) 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) ≤ (ℓ − ℓ) · 𝑢 ≤ ⇐⇒ 𝑢 𝜂𝐾 ≥ , 2𝜂𝐾 𝑁 2 𝑁 reduces to 1 𝑢 √︁ 2 log(1/𝛿) we could derive Var(𝑓^ ) ≤ 𝑔( 𝑢𝑁 𝜂𝐾 ). A.9. Proof of Corollary 3.7 Proof. In the multi-class extension, the only difference is the upper bound of the Distribution Shift term in Eqn. (11), which now becomes: 𝑢 𝑢 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) [︁ 𝑢 ]︁ [︁ 𝑢 ]︁ =E(𝑋,𝑌 )∼𝒟 ℓ(𝑓^ (𝑋), 𝑌 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) ⃒ ⃒ ⃒ [︁ ]︁ ⃒ ≤ max ⃒E(𝑋,𝑌 )∼𝒟 [ℓ(𝑓 (𝑋), 𝑌 )] − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 ) ⃒ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ ⃒⎡ [︃ ⎤⎤ ⎡ ⎤⃒ ⃒ [︁ ]︁ ⃒ ⃒ ∑︁ ∑︁ = max ⃒⃒⎣ E(𝑋,𝑌 =𝑗)∼𝒟 ℓ(𝑓 (𝑋), 𝑗)⎦⎦ − ⎣ E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ,𝑌 =𝑗 ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 ) ⎦ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ 𝑗∈[𝑀 ] 𝑗∈[𝑀 ] ⃒ ⎡ ⎤ ⎡ ⎤⃒ ⃒ ∑︁ ∑︁ ∑︁ [︁ ]︁ ⃒ = max ⃒ ⎣ E(𝑋,𝑌 =𝑗)∼𝒟 [ℓ(𝑓 (𝑋), 𝑗)]⎦ − ⎣ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ 𝑢 = 𝑘|𝑌 = 𝑗) · ℓ(𝑓 (𝑋), 𝑘) ⎦ ⃒ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ 𝑗∈[𝑀 ] 𝑘∈[𝑀 ] 𝑗∈[𝑀 ] ⃒ ⎡ ⎤ ⎡ ⃒ ∑︁ [︁ ]︁ ∑︁ ∑︁ [︁ = max ⃒ ⎣ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ 𝑢 ̸= 𝑗|𝑌 = 𝑗) · ℓ(𝑓 (𝑋), 𝑗) ⎦ − ⎣ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ 𝑢 = 𝑘|𝑌 = 𝑗 ⃒ 𝑓 ∈ℱ ⃒ 𝑗∈[𝑀 ] 𝑘∈[𝑀 ],𝑘̸=𝑗 𝑗∈[𝑀 ] ⃒ [︃ ]︃⃒ ⃒ ∑︁ ∑︁ ⃒ 𝑢 𝑢 = max ⃒ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ ̸= 𝑗|𝑌 = 𝑗) · ℓ(𝑓 (𝑋), 𝑗) − P(𝑌̃︀ = 𝑘|𝑌 = 𝑗) · ℓ(𝑓 (𝑋), 𝑘) ⃒ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ 𝑗∈[𝑀 ] 𝑘∈[𝑀 ],𝑘̸=𝑗 ⃒ [︃ ]︃⃒ ⃒ ∑︁ )︀ ⃒⃒ 𝑢 (︀ ≤ max ⃒ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ ̸= 𝑗|𝑌 = 𝑗) · ℓ − ℓ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ 𝑗∈[𝑀 ] (Assumed uniform prior) ∑︁ 𝑢 (︀ )︀ = P(𝑌 = 𝑗) · (1 − 𝑇𝑗,𝑗 ) ℓ−ℓ . 𝑗∈[𝑀 ] A.10. Proof of Lemma 4.5 Proof. The proof of Lemma 4.5 builds on Theorem 7 in [42]: The performance bound for aggregation methods is the special case of Theorem 7 in [42] (adopting 𝛼* = 1 defined in [42]). As for that of separation methods, the incurred difference lies in the appearance of the weight of sample complexity 𝜂𝐾 . Thus, we have: (︃ √︃ )︃ 𝑢 1 2 log(4/𝛿) 𝑅ℓ,𝒟 (𝑓^ ↬ ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ (︀ )︀ 8𝐿R(ℱ) + 1 + 2(ℓ̄ − ℓ) 1 − 𝜌𝑢0 − 𝜌𝑢1 𝑢𝑁 𝜂𝐾 𝑢 𝑢 ⇐⇒ 𝑅ℓ,𝒟 (𝑓^ ↬ ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ Δ𝑅↬ , √︁ 𝑢 where Δ𝑅↬ := 8𝐿𝑢↬ R(ℱ) + 𝐿𝑢↬0 2 log(4/𝛿) . (︀ )︀ 𝑢 𝜂 𝑁 1 + 2(ℓ̄ − ℓ) 𝐾 A.11. Proof of Theorem 4.6 √︂ log(4/𝛿) 2𝜂 𝑢 𝑁 ( 4 1+2(ℓ̄−ℓ)) 𝑢 8𝐿R(ℱ ) ∙ Proof. Denote by Δ𝑅↬ := 1−𝜌𝑢 −𝜌𝑢 + 𝐾 1−𝜌𝑢 𝑢 , in order to achieve Δ𝑅↬ < Δ𝑅↬ , 0 1 0 −𝜌1 ∙ we require Δ𝑅↬ < Δ𝑅↬ , which is equivalent to: √︁ √︁ log(4/𝛿) (︀ )︀ log(4/𝛿) (︀ − )︀ 8𝐿R(ℱ) 4 2𝜂𝐾 𝑁 1 + 2(ℓ̄ ℓ) 8𝐿R(ℱ) 4 2𝑁 1 + 2(ℓ̄ − ℓ) + < + , 1 − 𝜌0 − 𝜌1 1 − 𝜌0 − 𝜌1 1 − 𝜌∙0 − 𝜌∙1 1 − 𝜌∙0 − 𝜌∙1 which is further equivalent to: √︁ √︁ log(4/𝛿) (︀ log(4/𝛿) (︀ )︀ − )︀ 8𝐿R(ℱ) 8𝐿R(ℱ) 4 2𝑁 1 + 2(ℓ̄ − ℓ) 4 2𝜂𝐾 𝑁 1 + 2(ℓ̄ ℓ) − < − . 
1 − 𝜌0 − 𝜌1 1 − 𝜌∙0 − 𝜌∙1 ∙ 1 − 𝜌0 − 𝜌1 ∙ 1 − 𝜌0 − 𝜌1 Note that both 1 − 𝜌0 − 𝜌1 and 1 − 𝜌∙0 − 𝜌∙1 are positive, the above requirement then reduces to: [︃ √︃ ]︃ √︂ 1 log(4/𝛿) (︀ [(𝜌0 + 𝜌1 ) − (𝜌∙0 + 𝜌∙1 )]8𝐿R(ℱ) < (1 − 𝜌0 − 𝜌1 ) − (1 − 𝜌∙0 − 𝜌∙1 ) )︀ 4 1 + 2(ℓ̄ − ℓ) 𝜂𝐾 2𝑁 √︃ [(𝜌0 + 𝜌1 ) − (𝜌∙0 + 𝜌∙1 )]8𝐿R(ℱ) 1 ⇐⇒ √︁ < (1 − 𝜌0 − 𝜌1 ) − (1 − 𝜌∙0 − 𝜌∙1 ) . log(4/𝛿) (︀ )︀ 𝜂𝐾 4 2𝑁 1 + 2(ℓ̄ − ℓ) Note that for any finite concept class ℱ ⊂ {𝑓 : 𝑋 → {0, 1}}, √︁ and the sample set 𝑆 = 2𝑑 log(𝑁 ) {𝑥1 , ..., 𝑥𝑁 }, the Rademacher complexity is upper bounded by 𝑁 where 𝑑 is the VC dimension of ℱ, a more strict condition to get becomes: √︁ √︃ ∙ ∙ 2𝑑 log(𝑁 ) 1 (1 − 𝜌0 − 𝜌1 ) [(𝜌 0 + 𝜌 1 ) − (𝜌 0 + 𝜌 1 )]8𝐿 𝑁 < − )︀ . (1 − 𝜌∙0 − 𝜌∙1 ) √︁ 𝜂𝐾 log(4/𝛿) 4(1 − 𝜌∙0 − 𝜌∙1 ) (︀ 2𝑁 1 + 2(ℓ̄ − ℓ) √︁ Denote by 𝛼𝐾 := 1 − 𝐿∙↬ /𝐿↬ , 𝛾 = 1+2(ℓ̄−ℓ) 2𝐿 log(4/𝛿) 4𝑑 log(𝑁 ) . The above condition is satisfied if and only if 1 𝛼𝐾 · 1 ≤ 𝛾. 𝐿∙↬ /𝐿↬ − (𝜂𝐾 )− 2 A.12. Proof of Theorem 4.7 Proof. Similar to the proof of Theorem 3.6, for 𝑢 ∈ {, ∙}, we have: 𝑢 [︁ 𝑢 ]︁2 𝑢 Var(𝑓^ ↬ ) = E𝒟̃︀ 𝑢 ℓ(𝑓^ ↬ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ))2 . A special case is the 0-1 loss, i.e., ℓ(·) = 1(·), we then have: 𝑢 [︁ 𝑢 ]︁2 𝑢 Var(𝑓^ ↬ ) =E𝒟̃︀ 𝑢 ℓ(𝑓^ ↬ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ))2 [︁ 𝑢 ]︁ 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ ↬ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ))2 𝑢 𝑢 𝑢 (︁ )︁ =𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ))2 = 𝑔 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ) 𝑢 where 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ) ∈ [0, 1] and 𝑔(𝑎) = 𝑎 − 𝑎2 is monotonically increasing when 𝑎 < 12 . Note that: √︃ 𝑢 1 log(4/𝛿) (︀ 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ) < )︀ 𝑢 𝑢 𝑢 1 + 2(ℓ̄ − ℓ) , 1 − 𝜌0 − 𝜌1 2𝜂𝐾 𝑁 when √︃ √︂ 1 log(4/𝛿) (︀ )︀ 1 √︀ 𝑢 2 log(4/𝛿) 1 + 2(ℓ̄ − ℓ) 1 + 2(ℓ̄ − ℓ) ≤ ⇐⇒ 𝜂𝐾 ≥ , 1 − 𝜌𝑢0 − 𝜌𝑢1 𝑢 2𝜂𝐾 𝑁 2 𝑁 1 − 𝜌𝑢0 − 𝜌𝑢1 𝑢 ^∙ (︁√︁ )︁ log(4/𝛿) 1+2(ℓ̄−ℓ) we have: Var(𝑓^ ← ) ≤ 𝑔 𝑢 𝑢 𝑢 ^ 2𝜂𝐾 𝑁 1−𝜌0 −𝜌1 . To achieve: Var(𝑓 ↬ ) < Var(𝑓 ↬ ), we simply need: √︃ √︃ log(4/𝛿) 1 + 2(ℓ̄ − ℓ) log(4/𝛿) 1 + 2(ℓ̄ − ℓ) √ 𝐿 ≤ ∙ ∙ ∙ ⇐⇒ 𝜂𝐾 ≥ ↬0 . 2𝜂𝐾 𝑁 1 − 𝜌0 − 𝜌1 2𝜂𝐾 𝑁 1 − 𝜌0 − 𝜌1 𝐿∙↬0 A.13. Proof of Corollary 4.8 Proof. Regarding the multi-class extension of Lemma 4.5, the only different thing lies in the constant: 𝐿𝑢↬0 . The following Lemma A.1 helps us find out the multi-class form of 𝐿𝑢↬0 . 1 Lemma A.1. Assume the clean label 𝑌 has equal prior 𝑃 (𝑌 = 𝑗) = 𝑀 , ∀𝑗 ∈ [𝑀 ]. For the 𝑢 𝑢 uniform noise transition matrix [44] such that 𝑇𝑖,𝑗 = 𝜌𝑖 , ∀𝑗 ∈ [𝑀 ], the expected ℓ↬ in the multi-class setting is invariant to label noise up to an affine transformation: ⎛ ⎞ ∑︁ E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 [ℓ↬ (𝑓 (𝑋), 𝑌̃︀ 𝑢 )] = ⎝1 − 𝜌𝑢𝑗 ⎠ E𝒟 [ℓ↬ (𝑓 (𝑋), 𝑌 )]. (13) 𝑗∈[𝑀 ] Proof of Lemma A.1 Recall that 𝒟 and 𝒟 ̃︀ 𝑢 refer to the joint distribution over (𝑋, 𝑌 ) and (𝑋, 𝑌̃︀ 𝑢 ), respectively. We further denote the marginal distributions of 𝑋, 𝑌 , and 𝑌̃︀ 𝑢 by 𝒟𝑋 , 𝒟𝑌 , and 𝒟 ̃︀ ̃︀ 𝑢 , respectively. Let 𝑋𝑝 ∼ 𝒟𝑋 , 𝑌̃︀𝑝𝑢 ∼ 𝒟 𝑌 ̃︀ ̃︀ 𝑢 be the random variables corresponding 𝑌 to the peer samples. The peer loss function is defined as ℓ↬ (𝑓 (𝑥𝑛 ), 𝑦˜𝑢𝑛 ) = ℓ(𝑓 (𝑥𝑛 ), 𝑦˜𝑢𝑛 ) − ℓ(𝑓 (𝑥𝑝,𝑛 ), 𝑦˜𝑢𝑝,𝑛 ), (14) where (𝑥𝑛 , 𝑦˜𝑢𝑛 ) is a normal training sample pair, 𝑥𝑝,𝑛 and 𝑦˜𝑢𝑝,𝑛 are corresponding peer samples. Taking expectation for (14) yields [︁ ]︁ E𝒟̃︀ 𝑢 [ℓ↬ (𝑓 (𝑋), 𝑌̃︀ 𝑢 )] = E𝒟̃︀ 𝑢 [ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 )] − E𝒟̃︀ 𝑢 E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑌̃︀𝑝𝑢 )] . (15) 𝑌 ̃︀ The first term in (15) is E𝒟̃︀ 𝑢 [ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 )] ∑︁ ∑︁ = 𝑇𝑖𝑗𝑢 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ] [︃ ]︃ ∑︁ ∑︁ 𝑢 = 𝑇𝑗𝑗 · P(𝑌 = 𝑗) · E𝒟|𝑌 =𝑗 [ℓ(𝑓 (𝑋), 𝑗)] + 𝑇𝑖𝑗𝑢 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ 𝑢⎠ = ⎝1 − 𝑇𝑗𝑖 · P(𝑌 = 𝑗) · E𝒟|𝑌 =𝑗 [ℓ(𝑓 (𝑋), 𝑗)] + 𝑇𝑖𝑗𝑢 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] . 
𝑗∈[𝑀 ] 𝑖̸=𝑗,𝑖∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 Accordingly, noting 𝑋𝑝 and 𝑌̃︀𝑝𝑢 are independent, the second term in (15) is [︁ ]︁ E𝒟̃︀ ̃︀ 𝑢 E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑌̃︀𝑝𝑢 )] 𝑌 ∑︁ = P(𝑌̃︀𝑝𝑢 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑗)] 𝑗∈[𝑀 ] ∑︁ ∑︁ = 𝑇𝑖𝑗𝑢 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ] [︃ ]︃ ∑︁ ∑︁ 𝑢 = 𝑇𝑗𝑗 · P(𝑌𝑝 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] + 𝑇𝑖𝑗𝑢 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 ⎡(︃ ⎞ ]︃ ∑︁ ∑︁ ∑︁ 𝑢⎠ = ⎣ 1− 𝑇𝑗𝑖 · P(𝑌𝑝 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] + 𝑇𝑖𝑗𝑢 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] . 𝑗∈[𝑀 ] 𝑖̸=𝑗,𝑖∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 In this case, we have 𝜌𝑢𝑖 = 𝑇𝑗𝑖 𝑢 , ∀𝑗 ∈ [𝑀 ], 𝑗 ̸= 𝑖. The first term becomes E𝒟̃︀ 𝑢 [ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 )] [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · P(𝑌 = 𝑗) · E𝒟|𝑌 =𝑗 [ℓ(𝑓 (𝑋), 𝑗)] + 𝜌𝑢𝑗 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖̸=𝑗,𝑖∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · P(𝑌 = 𝑗) · E𝒟|𝑌 =𝑗 [ℓ(𝑓 (𝑋), 𝑗)] + 𝜌𝑢𝑗 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ] 𝑖∈[𝑀 ] ⎡⎛ ⎞ ⎤ ∑︁ ∑︁ = ⎣⎝1 − 𝜌𝑢𝑖 ⎠ · E𝒟 [ℓ(𝑓 (𝑋), 𝑌 )] + 𝜌𝑢𝑗 · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)]⎦ . 𝑖∈[𝑀 ] 𝑗∈[𝑀 ] The second term becomes [︁ ]︁ E𝒟̃︀ ̃︀ 𝑢 E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑌̃︀𝑝𝑢 )] 𝑌 [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · P(𝑌𝑝 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] + 𝜌𝑢𝑗 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖̸=𝑗,𝑖∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · P(𝑌𝑝 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] + 𝜌𝑢𝑗 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ] 𝑖∈[𝑀 ] ⎛ ⎞ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · E𝒟𝑌 [E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑌𝑝 )]] + 𝜌𝑢𝑗 · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)]. 𝑖∈[𝑀 ] 𝑗∈[𝑀 ] Comparing the above two terms we have: ⎛ ⎞ ∑︁ E𝒟̃︀ 𝑢 [ℓ↬ (𝑓 (𝑋), 𝑌̃︀ 𝑢 )] = ⎝1 − 𝜌𝑢𝑖 ⎠ E𝒟 [ℓ↬ (𝑓 (𝑋), 𝑌 )]. (16) 𝑖∈[𝑀 ] Thus, substituting 𝐿𝑢↬0 := 1−𝜌𝑢1−𝜌𝑢 by 1−∑︀ 1 𝜌𝑢 , the proof of Corollary 4.8 is finished if we repeat 0 1 𝑖∈[𝑀 ] 𝑖 the corresponding proof of the binary task. B. Additional Results and Details B.1. Experiment Details on UCI Datasets Datasets In this paper, we conducted experiments on two binary (Breast and German) and two multi-class (StatLog and Optical) UCI classification datasets. As for the splitting of training and testing, the original settings are used when training and testing files are provided. The remaining datasets only give one data file. We adopt 50/50 splitting for the testing results’ statistical significance as more data is distributed to testing dataset. More specifically, the numbers of (training, testing) samples in Breast, German, StatLog, and Optical datasets are (285, 284), (500, 500), (4435, 2000), and (3823, 1797). Generating the noisy labels on UCI datasets For each UCI dataset adopted in this paper, the label of each sample in the training dataset will be flipped to the other classes with the probability 𝜖 (noise rate). For the multi-class classification datasets, the specific label which will be flipped is randomly selected with equal probabilities. For binary and multi-class classification datasets, (0.1, 0.2, 0.3, 0.4) and (0.2, 0.4, 0.6, 0.8) are used as different lists of noise rates respectively. Implementation details We implemented a simple two-layer ReLU Multi-Layer Perceptron (MLP) for the classification task on these four UCI datasets. The Adam optimizer is used with a learning rate of 0.001 and the batch size is 128. B.2. Detailed Results on UCI Datasets In Table 6, we highlight the results with Green (for separation method) and Red (for aggregation methods) if the performance gap is large than 0.05. Clearly, the label separation method outperforms both aggregation methods (majority-vote and EM inference) consistently on StatLog and Optical datasets. 
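Before turning to the two binary tasks, the data-generation and aggregation pipeline behind these comparisons (Appendix B.1) can be sketched as follows. This is a minimal NumPy illustration written by us: the function names, the random seed, and the tie-breaking rule for majority vote are our own choices, and the EM (Dawid-Skene style) aggregation baseline is omitted for brevity.

```python
import numpy as np

def make_separate_noisy_labels(y, eps, num_classes, K, seed=0):
    """Draw K independent noisy copies of the clean labels y: each entry is kept with
    probability 1 - eps and otherwise flipped to a uniformly chosen other class
    (the symmetric noise described in Appendix B.1)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    copies = np.tile(y, (K, 1))                        # shape (K, N): one row per labeler
    flip = rng.random(copies.shape) < eps              # which entries get corrupted
    offsets = rng.integers(1, num_classes, size=copies.shape)
    copies[flip] = (copies[flip] + offsets[flip]) % num_classes
    return copies

def majority_vote(copies, num_classes):
    """Aggregate the K separate labels per instance by majority vote
    (ties broken by the smallest class index in this sketch)."""
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=num_classes), 0, copies)
    return counts.argmax(axis=0)

# Example: K = 5 annotators, 40% symmetric noise, 3 classes (values are illustrative).
y_clean = np.array([0, 1, 2, 1, 0, 2, 2, 1])
sep_labels = make_separate_noisy_labels(y_clean, eps=0.4, num_classes=3, K=5)
agg_labels = majority_vote(sep_labels, num_classes=3)
print("separate labels (K x N):", sep_labels, sep="\n")
print("aggregated labels:", agg_labels)
```

The separation method trains on all K rows of `sep_labels`, while the aggregation baselines train on the single vector returned by `majority_vote` (or its EM counterpart).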
For the two binary tasks (Breast and German), aggregation methods tend to outperform label separation, and we attribute this phenomenon to the fact that the ”denoising effect of label aggregation is more significant in the binary case”. B.3. Experiment Details on CIFAR-10 Datasets The generation of the symmetric noisy dataset is adopted from [44]. As for the instance- dependent label noise, the generating algorithm follows the state-of-the-art method [77]. Both cases adopt noise rates: [0.2, 0.4, 0.6, 0.8]. The basic hyper-parameters settings for all methods are listed as follows: mini-batch size (128), optimizer (SGD), initial learning rate (0.1), momentum (0.9), weight decay (0.0005), number of epochs (120) and learning rate decay (0.1 at 50 epochs). Standard data augmentation is applied to each dataset. All experiments run on 8 Nvidia RTX A5000 GPUs. B.4. Details Results on CIFAR-10 Dataset Table 7 includes all the detailed accuracy values that appeared in Figure 5. The results on the synthetic noisy CIFAR-10 dataset align well with the theoretical observations: label sep- aration is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. UCI-StatLog (symmetric) CE UCI-Optical (symmetric) CE 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 87.94 88.88 88.69 88.69 88.69 88.69 MV 95.27 96.73 95.41 96.73 96.73 96.66 EM 87.63 87.94 88.63 88.69 88.69 88.69 EM 95.34 95.69 96.04 96.73 96.73 96.73 Sep 89.25 90.56 91.19 91.13 91.50 91.25 Sep 97.08 97.57 97.98 98.33 98.26 98.61 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 87.00 87.50 88.25 88.63 88.75 88.69 MV 92.35 96.04 96.11 96.18 96.73 96.73 EM 87.50 87.38 87.75 88.94 88.81 88.69 EM 93.25 94.78 95.68 96.03 96.73 96.73 Sep 87.75 89.31 90.38 91.19 91.19 91.00 Sep 96.59 96.80 97.44 97.64 98.26 98.05 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 80.75 85.00 86.75 87.25 87.75 88.62 MV 83.31 91.51 95.34 95.41 93.11 97.14 EM 85.06 84.37 86.68 87.00 87.81 88.93 EM 88.31 92.35 93.81 95.13 96.31 95.96 Sep 87.18 87.31 88.12 89.81 90.81 90.93 Sep 94.43 95.54 96.52 97.35 97.63 97.98 UCI-StatLog (symmetric) BW UCI-Optical (symmetric) BW 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 89.75 90.06 89.87 89.87 89.87 89.87 MV 94.43 95.41 95.61 96.94 96.94 96.94 EM 89.68 89.87 89.75 89.87 89.87 89.87 EM 94.43 95.82 95.68 96.94 96.94 96.94 Sep 89.87 91.18 92.00 91.43 91.37 91.06 Sep 96.87 97.21 97.63 97.77 98.05 97.63 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 87.62 88.25 89.75 89.62 89.87 89.87 MV 90.19 94.29 94.64 96.80 96.94 96.94 EM 87.93 88.31 89.56 89.81 89.75 89.87 EM 93.53 95.41 96.38 95.89 96.94 96.94 Sep 89.62 89.87 90.18 91.00 91.06 91.37 Sep 95.54 96.59 96.66 97.28 97.77 97.77 𝜖 = 0.6 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 82.12 87.31 87.00 88.50 89.50 89.81 MV 82.54 88.10 92.97 95.75 93.67 96.52 EM 86.68 86.50 87.31 88.25 89.12 89.81 EM 80.52 90.54 93.11 95.06 96.10 95.82 Sep 86.18 87.62 88.31 89.62 90.93 90.75 Sep 91.72 93.39 93.46 96.38 96.66 97.21 UCI-StatLog (symmetric) PeerLoss UCI-Optical (symmetric) PeerLoss 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 90.25 90.06 90.56 90.56 90.56 90.56 MV 94.71 96.03 96.38 96.52 96.52 96.38 EM 89.87 90.06 
90.56 90.56 90.56 90.56 EM 94.43 96.17 96.52 96.52 96.52 96.52 Sep 90.36 91.12 91.68 91.43 91.68 91.43 Sep 97.07 97.35 97.70 98.12 98.05 98.12 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 87.93 89.06 90.18 90.43 90.68 90.56 MV 91.86 94.57 95.54 96.87 96.52 96.52 EM 88.00 88.93 90.12 90.37 90.56 90.56 EM 91.86 93.94 95.96 96.87 96.52 96.52 Sep 89.18 90.25 90.37 91.43 91.68 91.87 Sep 96.66 96.73 96.94 97.63 98.05 98.19 𝜖 = 0.6 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 78.68 86.68 87.31 88.81 89.75 90.56 MV 76.77 87.34 94.71 95.61 93.88 96.38 EM 86.25 86.68 88.00 88.75 89.12 90.31 EM 86.92 89.98 93.11 96.10 95.82 96.10 Sep 87.50 88.25 88.93 90.18 91.31 91.31 Sep 93.32 95.54 96.10 96.73 97.77 97.91 UCI-pop failuers (symmetric) CE UCI-forest fire (symmetric) CE 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.06 93.06 93.06 93.06 93.06 MV 92.86 92.86 91.84 91.84 91.84 91.84 EM 93.06 93.06 93.06 93.06 93.06 93.06 EM 92.86 92.86 91.84 91.84 91.84 91.84 Sep 93.06 93.52 93.98 94.91 95.83 96.30 Sep 93.88 93.88 95.92 94.90 91.84 93.88 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.06 93.06 93.06 93.06 93.06 MV 81.63 81.63 90.82 91.84 91.84 91.84 EM 93.06 93.06 93.06 93.52 93.06 93.52 EM 58.16 77.55 90.82 88.78 91.84 91.84 Sep 93.06 93.52 93.98 94.91 95.37 95.37 Sep 79.59 79.59 87.76 87.76 90.82 92.86 UCI-pop failuers (symmetric) BW UCI-forest fire (symmetric) BW 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.06 93.06 93.06 93.06 93.06 MV 91.84 92.86 91.84 91.84 91.84 92.86 EM 93.06 93.06 93.06 93.06 93.06 93.06 EM 91.84 92.86 91.84 91.84 91.84 92.86 Sep 93.06 93.52 93.98 95.37 95.83 95.83 Sep 88.78 90.82 92.86 93.88 91.84 94.90 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.13 90.59 90.59 90.59 92.91 93.06 MV 84.69 86.73 89.80 91.84 91.84 92.86 EM 93.06 89.35 93.06 91.20 93.06 93.06 EM 73.47 75.51 88.78 91.84 91.84 91.84 Sep 93.06 93.06 93.06 91.20 93.98 94.91 Sep 84.69 84.69 86.73 87.76 86.73 90.82 UCI-pop failuers (symmetric) Peer Loss UCI-forest fire (symmetric) Peer Loss 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.52 94.91 94.44 94.44 94.44 MV 92.86 92.86 92.86 92.86 92.86 94.90 EM 93.06 93.06 94.44 94.44 94.44 94.44 EM 92.86 92.86 92.86 92.86 92.86 94.90 Sep 93.06 92.59 93.98 94.44 95.83 97.22 Sep 88.78 89.80 93.88 93.88 92.86 94.90 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.52 93.52 93.98 94.91 93.52 MV 78.57 78.57 89.80 91.84 89.80 91.84 EM 93.06 93.06 93.06 93.98 93.06 93.52 EM 59.18 73.47 88.78 89.80 89.80 91.84 Sep 88.89 92.59 92.59 93.98 95.83 95.37 Sep 83.67 80.61 85.71 88.78 91.84 92.86 Table 6 The performances of CE/BW/PeerLoss trained on 4 UCI datasets (StatLog, Optical, Pop-Failures, and Forest Fair datasets), with aggregated labels (majority vote, EM inference), and separated labels. 
(𝐾 is the number of labels per training image) CIFAR-10, Symmetric CE CIFAR-10, Instance CE 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.21 92.98 93.54 93.43 93.73 93.40 MV 91.99 93.29 93.57 93.47 93.68 93.60 EM 92.08 92.93 93.54 93.64 93.35 93.37 EM 91.92 93.21 93.55 93.61 93.44 93.44 Sep 92.52 92.89 93.35 93.15 93.42 93.40 Sep 92.36 92.97 93.43 93.24 93.33 93.35 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 89.09 91.59 93.18 93.43 93.26 93.44 MV 87.14 91.15 93.10 93.15 93.23 93.48 EM 88.83 91.02 92.54 93.45 93.69 93.68 EM 88.07 92.40 93.70 93.58 93.74 93.53 Sep 90.61 91.95 92.70 92.92 93.32 93.13 Sep 90.83 91.90 92.63 92.46 93.08 93.26 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 81.85 87.33 89.88 91.88 92.96 93.40 MV 49.22 83.95 89.45 91.60 92.88 93.65 EM 81.04 85.91 89.76 91.57 92.55 93.10 EM 78.34 88.79 91.95 92.97 93.46 93.65 Sep 87.00 89.19 90.70 91.97 92.40 93.17 Sep 83.79 87.55 90.15 91.58 91.86 92.74 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 20.94 44.62 70.91 79.61 84.83 89.09 MV 14.59 25.25 34.47 57.99 57.51 87.08 EM 37.91 50.78 67.19 75.26 82.97 87.97 EM 20.03 26.54 65.16 80.10 88.59 92.14 Sep 61.47 70.10 79.61 83.93 86.82 90.04 Sep 26.16 28.89 50.35 74.15 71.39 87.54 CIFAR-10, Symmetric BW CIFAR-10, Instance BW 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.08 94.09 94.92 94.90 94.79 94.90 MV 92.03 93.87 95.12 95.11 94.97 94.75 EM 92.13 93.08 94.90 94.91 94.90 94.86 EM 91.93 94.39 94.90 94.84 95.05 94.54 Sep 91.74 92.61 92.75 92.59 94.44 92.97 Sep 91.93 92.07 92.70 91.75 93.02 92.47 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 88.28 91.11 92.73 94.60 94.62 94.81 MV 86.61 90.64 93.00 94.73 94.72 94.72 EM 87.41 90.23 92.83 94.77 94.80 95.18 EM 89.83 92.04 94.74 95.00 94.94 94.80 Sep 89.14 89.68 91.07 92.46 92.26 94.24 Sep 88.86 87.89 92.09 89.92 91.05 91.96 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 81.21 86.29 89.51 91.33 93.52 94.81 MV 43.78 82.59 88.56 91.47 93.27 95.06 EM 78.13 84.33 89.44 91.17 92.45 94.60 EM 44.92 87.33 91.39 93.58 94.72 94.99 Sep 83.84 87.05 88.10 89.80 90.95 92.11 Sep 80.88 86.22 88.45 90.69 91.16 92.61 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 16.43 60.97 71.11 77.86 82.72 88.41 MV 16.00 25.03 33.80 67.91 68.52 86.49 EM 10.00 45.97 66.02 74.37 80.08 87.42 EM 16.06 22.73 53.96 76.24 86.74 92.02 Sep 58.48 69.86 76.03 79.79 82.60 86.31 Sep 27.84 26.68 32.72 37.27 54.41 83.37 CIFAR-10, Symmetric PeerLoss CIFAR-10, Instance PeerLoss 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.69 93.35 93.90 94.12 94.15 93.81 MV 92.13 93.53 94.00 93.78 94.13 94.08 EM 92.39 93.25 93.76 93.93 93.52 93.77 EM 91.93 93.51 93.78 93.88 94.03 93.82 Sep 93.15 93.51 93.77 93.51 93.56 93.73 Sep 92.86 93.23 93.56 93.72 93.63 93.95 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 89.40 91.88 93.42 93.84 93.83 94.04 MV 88.15 91.61 93.21 93.64 93.84 93.69 EM 89.23 91.41 93.06 93.83 93.85 94.11 EM 90.59 92.60 93.95 94.02 94.06 93.68 Sep 91.08 92.38 93.17 93.40 93.56 93.37 Sep 91.06 92.70 93.22 92.92 93.65 93.67 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 82.88 
87.95 90.42 92.31 93.61 93.79 MV 60.66 84.99 90.30 91.93 93.16 93.81 EM 81.64 86.45 90.09 91.98 93.23 93.58 EM 78.53 89.11 92.44 93.17 93.96 93.85 Sep 87.28 89.80 91.19 92.42 93.18 93.65 Sep 85.76 89.07 91.05 92.22 92.45 93.39 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 21.82 48.71 72.81 80.32 85.27 89.38 MV 14.35 24.83 40.49 65.47 69.28 88.05 EM 38.29 52.63 68.70 77.42 83.94 88.45 EM 26.52 28.43 66.72 80.71 89.40 92.41 Sep 64.32 72.52 80.31 84.65 87.40 90.56 Sep 33.87 37.49 57.36 77.43 80.51 89.15 Table 7 The performances of CE/BW/PeerLoss trained on (Left half: symmetric noise; right half: instance noise) CIFAR-10 aggregated labels (majority vote, EM inference), and separated labels. (Different number of labels per training image)
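For completeness, the CIFAR-10 optimization schedule listed in Appendix B.3 (SGD, initial learning rate 0.1, momentum 0.9, weight decay 0.0005, 120 epochs, learning-rate decay of 0.1 at epoch 50, mini-batch size 128) corresponds to a PyTorch setup along the following lines. This is only a sketch by us: the backbone and data loader are placeholders, and the exact training script used for Table 7 may differ.

```python
import torch
import torch.nn as nn
import torchvision

# Hyper-parameters as listed in Appendix B.3; the backbone below is a placeholder.
model = torchvision.models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)

def train(loader, num_epochs=120, device="cuda"):
    """Plain training loop over (image, noisy_label) mini-batches of size 128."""
    model.to(device).train()
    for epoch in range(num_epochs):
        for images, noisy_labels in loader:
            images, noisy_labels = images.to(device), noisy_labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), noisy_labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```

For the separation method, `loader` iterates over all K noisy copies of each image's label; for the aggregation baselines, it iterates over the single majority-vote or EM-inferred label per image.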