1. Introduction

To Aggregate or Not? Learning with Separate Noisy Labels

Jiaheng Wei

Zhaowei Zhu

Tianyi Luo

Ehsan Amid

Abhishek Kumar

Yang Liu

Amazon Search Science

Google Research, Brain Team

University of California, Santa Cruz, USA

Raw training data often comes with separate noisy labels collected from multiple imperfect annotators (e.g., via crowdsourcing). A typical way of using these separate labels is to first aggregate them into one and then apply standard training methods. The literature has also extensively studied effective aggregation approaches. This paper revisits this choice and aims to answer the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. We theoretically analyze the performance of both approaches under the empirical risk minimization framework for a number of popular loss functions, including the ones designed specifically for the problem of learning with noisy labels. Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. Extensive empirical results validate our conclusions.

Keywords: Crowdsourcing, Label Aggregation, Label Noise, Human Annotation

1. Introduction

The most popular approach to learning from multiple separate labels is to aggregate the given labels for each instance [8, 9, 10, 11, 12] through an Expectation-Maximization (EM) inference technique. Each instance is then assigned a single label, and the standard training procedure is applied.

The primary goal of this paper is to revisit the choice of aggregating separate labels and to help practitioners answer the following question:

Should the learner aggregate separate noisy labels for one instance into a single label or not?

Our main contributions can be summarized as follows:
∙ We provide theoretical insights on how separation methods and aggregation ones result in different biases (Theorems 3.4, 4.2, 4.6) and variances (Theorems 3.6, 4.3, 4.7) of the output classifier from training. Our analysis considers both the standard loss functions in use and popular robust losses that are designed for the problem of learning with noisy labels.
∙ By comparing the analytical proxy of the worst-case performance bounds, our theoretical results reveal that separating multiple noisy labels is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. The results are consistent for both the basic loss function ℓ and robust designs, including loss correction and peer loss.
∙ We carry out extensive experiments using both synthetic and real-world datasets to validate our theoretical findings.

1.1. Related Works

Label separation vs label aggregation  Existing works mainly compare separation with aggregation through empirical results. For example, it has been shown that label separation could be effective in improving model performance and may be preferable to labels aggregated through majority voting [13]. When training with the cross-entropy loss, Sheng et al. [14] observe that label separation reduces the bias and roughness, and outperforms majority-voting aggregated labels. However, it is unclear whether the results hold when robust treatments are employed. Similar problems have also been studied in corrupted label detection, with a result leaning towards separation but not proved [15]. Another line of work concentrates on end-to-end training schemes or ensemble methods that take all the separate noisy labels as input during training [16, 17, 18, 19, 20], and thus learn from separate noisy labels directly.

Learning with noisy labels  Popular approaches to learning with noisy labels can be broadly divided into the following categories: (i) Adjusting the loss on noisy labels by: using the knowledge of the noise label transition matrix [21, 22, 23, 24, 25, 26, 27, 28, 29]; re-weighting the per-sample loss by down-weighting instances with potentially wrong labels [30, 31, 32, 33, 34]; or refurbishing the noisy labels [35, 36, 37]; (ii) Robust loss designs that do not require the knowledge of the noise transition matrix [38, 39, 40, 41, 42, 43, 44, 45]; (iii) Regularization techniques to prevent deep neural networks from memorizing noisy labels [46, 47, 48, 49, 50, 51]; (iv) Dynamic sample selection procedures which behave in a semi-supervised manner: they begin with a clean sample selection procedure, then make use of the wrongly-labeled samples [52, 53, 54, 55, 56]. For example, several methods [57, 58, 59] adopt a mentor/peer network to select small-loss samples as “clean” ones for the student/peer network. See [60, 61] for a more detailed survey of existing noise-robust techniques.

2. Formulation

Consider an $M$-class classification task and let $X \in \mathcal{X}$ and $Y \in \mathcal{Y} := \{1, 2, ..., M\}$ denote the input examples and their corresponding labels, respectively. We assume that $(X, Y) \sim \mathcal{D}$, where $\mathcal{D}$ is the joint data distribution. Samples $(x, y)$ are generated according to random variables $(X, Y)$. In the clean and ideal scenario, the learner has access to $N$ training data points $D := \{(x_n, y_n)\}_{n \in [N]}$. Instead of having access to the ground-truth labels $y_n$, we only have access to a set of noisy labels $\{\tilde{y}_{n,i}\}_{i \in [K]}$ for $n \in [N]$. For ease of presentation, we adopt the decorator $\circ$ to denote separate labels and $\bullet$ for aggregated labels specified later. Noisy labels $\tilde{y}_n$ are generated according to the random variable $\tilde{Y}$. We consider the class-dependent label noise transition [30, 21], where $\tilde{Y}$ is generated according to a transition matrix $T$ with entries $T_{k,l} := P(\tilde{Y} = l \mid Y = k)$.

Most of the existing results on learning with noisy labels have considered the setting where each $x_n$ is paired with only one noisy label $\tilde{y}_n$. In practice, we often operate in a setting where each data point is associated with multiple separate labels drawn from the same noisy label generation process [62, 63]. We consider this setting and assume that for each $x_n$, there are $K$ independent noisy labels $\tilde{y}_{n,1}, ..., \tilde{y}_{n,K}$ obtained from $K$ annotators [64].
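As a rough illustration of this setting, the sketch below simulates several independent noisy labels per instance from a class-dependent transition matrix. It is only a sketch under stated assumptions: labels are 0-indexed, the helper names `symmetric_transition_matrix` and `sample_noisy_labels` are hypothetical, and the symmetric-noise matrix is just one convenient choice of transition matrix.

```python
import random

def symmetric_transition_matrix(num_classes, eps):
    """Assumed example matrix: T[k][l] = P(noisy = l | true = k),
    with the noise rate eps split evenly over the off-diagonal entries."""
    off = eps / (num_classes - 1)
    return [[1.0 - eps if k == l else off for l in range(num_classes)]
            for k in range(num_classes)]

def sample_noisy_labels(true_label, T, num_annotators, rng):
    """Draw independent noisy labels from row `true_label` of T,
    one label per (hypothetical) annotator."""
    classes = list(range(len(T)))
    return rng.choices(classes, weights=T[true_label], k=num_annotators)

rng = random.Random(0)
T = symmetric_transition_matrix(num_classes=3, eps=0.3)
clean_labels = [rng.randrange(3) for _ in range(5)]
noisy_labels = [sample_noisy_labels(y, T, num_annotators=5, rng=rng)
                for y in clean_labels]
```

Each entry of `noisy_labels` plays the role of the separate labels attached to one instance; every annotator here shares the same matrix, matching the assumption that labels come from the same noisy generation process.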

We are interested in two popular ways to leverage multiple separate noisy labels:
∙ Keep the separate labels as separate ones and apply standard learning with noisy labels techniques to each of them.
∙ Aggregate noisy labels into one label, and then apply standard learning with noisy data techniques.

We will look into each of the above two settings separately and then answer the question: “Should the learner aggregate multiple separate noisy labels or not?”

2.1. Label Separation

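Denote the column vector $P_{\tilde{Y}} := [P(\tilde{Y} = 1), \cdots, P(\tilde{Y} = M)]^{\top}$ as the marginal distribution of $\tilde{Y}$. Accordingly, we can define $P_Y$ for $Y$. Clearly, we have the relation: $P_{\tilde{Y}} = T^{\top} \cdot P_Y$, $P_Y = (T^{\top})^{-1} \cdot P_{\tilde{Y}}$. Denote by $\rho_1 := P(\tilde{Y} = 0 \mid Y = 1)$, $\rho_0 := P(\tilde{Y} = 1 \mid Y = 0)$. The noise transition matrix $T$ has the following form when $M = 2$:

$$T = \begin{bmatrix} 1 - \rho_0 & \rho_0 \\ \rho_1 & 1 - \rho_1 \end{bmatrix}.$$

For label separation, we define the per-sample loss function as:

$$\ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K}) = \frac{1}{K} \sum_{i \in [K]} \ell(f(x_n), \tilde{y}_{n,i}).$$

For simplicity, we shorthand $\ell(f(x_n), \tilde{y}^{\circ}_n) := \ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K})$ for the loss of the label separation method when there is no confusion.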
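The per-sample separation loss is simply the base loss averaged over the noisy labels attached to one instance. A minimal sketch, assuming cross-entropy as the base loss, 0-indexed labels, and hypothetical helper names:

```python
import math

def cross_entropy(probs, label):
    """Base loss: negative log-probability assigned to `label`."""
    return -math.log(probs[label])

def separation_loss(probs, noisy_labels, base_loss=cross_entropy):
    """Average the base loss over all separate noisy labels of one instance."""
    return sum(base_loss(probs, y) for y in noisy_labels) / len(noisy_labels)

# Predicted class probabilities for one instance and its three noisy labels.
probs = [0.7, 0.2, 0.1]
loss = separation_loss(probs, [0, 0, 1])
```

For cross-entropy this average coincides with the cross-entropy against the empirical label distribution (a soft label), because the loss is linear in the one-hot target.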

2.2. Label Aggregation

The other way to leverage multiple separate noisy labels is to generate a single label from the noisy ones via a label aggregation method:

$$\tilde{y}^{\bullet}_n := \text{Aggregation}(\tilde{y}_{n,1}, \tilde{y}_{n,2}, ..., \tilde{y}_{n,K}),$$

where the aggregated noisy labels $\tilde{y}^{\bullet}_n$ are generated according to the random variable $\tilde{Y}^{\bullet}$. Denote the confusion matrix for this single, aggregated noisy label as $T^{\bullet}$. Popular aggregation methods include majority vote and EM inference, which are covered by our theoretical insights since our analyses in later sections are built on a general label aggregation method. For a better understanding, we introduce the majority vote as an example.

An Example of Majority Vote  Given the majority-voted label, we can compute the transition matrix between $\tilde{Y}^{\bullet}$ and the true label $Y$ using the knowledge of $T$. The lemma below gives the closed form for $T^{\bullet}$ in terms of $T$ when adopting majority vote.

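Lemma 2.1. Assume $K$ is odd and recall that in the binary classification task, $T_{p,q} = P(\tilde{Y} = q \mid Y = p)$. The noise transition matrix of the (majority voting) aggregated noisy labels, $T^{\bullet}_{p,q}$, becomes:

$$T^{\bullet}_{p,q} = \sum_{i=0}^{\frac{K+1}{2} - 1} \binom{K}{i} (T_{p,q})^{K-i} (T_{p,1-q})^{i}, \quad p, q \in \{0, 1\}.$$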
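The closed form of Lemma 2.1 can be sanity-checked numerically. The sketch below, with hypothetical helper names and an arbitrary example matrix, evaluates the binomial sum for the binary case with an odd number of labelers and compares it against brute-force enumeration of all vote patterns.

```python
from itertools import product
from math import comb

def majority_vote_transition(T, K):
    """Closed form: T_mv[p][q] = sum_{i=0}^{(K+1)/2 - 1} C(K, i) * T[p][q]^(K-i) * T[p][1-q]^i."""
    assert K % 2 == 1, "the lemma assumes an odd number of labelers"
    half = (K + 1) // 2
    return [[sum(comb(K, i) * T[p][q] ** (K - i) * T[p][1 - q] ** i
                 for i in range(half))
             for q in (0, 1)] for p in (0, 1)]

def brute_force_transition(T, K):
    """Enumerate every pattern of K independent votes and tally the majority label."""
    out = [[0.0, 0.0], [0.0, 0.0]]
    for p in (0, 1):
        for votes in product((0, 1), repeat=K):
            prob = 1.0
            for v in votes:
                prob *= T[p][v]
            out[p][int(sum(votes) > K // 2)] += prob
    return out

T = [[0.8, 0.2], [0.3, 0.7]]  # example values only: rho_0 = 0.2, rho_1 = 0.3
T_mv = majority_vote_transition(T, K=3)
```

With these example values the aggregated flip rate for true label 1 drops from 0.3 to about 0.216 with three labelers.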

When $K = 3$, then $T^{\bullet}_{1,0} = P(\tilde{Y}^{\bullet} = 0 \mid Y = 1) = (T_{1,0})^3 + \binom{3}{1} (T_{1,0})^2 (T_{1,1})$. Note it still holds that $T^{\bullet}_{p,q} + T^{\bullet}_{p,1-q} = 1$. For the aggregation method, as illustrated in Figure 1, the x-axis indicates the number of labelers $K$, and the y-axis denotes the aggregated noise rate given that the overall noise rate is in [0.2, 0.4, 0.6, 0.8]. When the number of labelers is large (i.e., $K < 10$) and the noise rate is small, both majority vote and EM label aggregation methods significantly reduce the noise rate. Although the expectation-maximization method consumes much more time when generating the aggregated label, it frequently results in a lower aggregated noise rate than the majority vote.

3. Bias and Variance Analyses w.r.t. ℓ-loss

In this section, we provide theoretical insights on how label separation and aggregation methods result in different biases and variances of the classifier prediction when learning with the standard loss function ℓ.

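Suppose the clean training samples $\{(x_n, y_n)\}_{n \in [N]}$ are given by variables $(X, Y)$ such that $(X, Y) \sim \mathcal{D}$. Recall that instead of having access to a set of clean training samples $D = \{(x_n, y_n)\}_{n \in [N]}$, the learner only observes $K$ noisy labels $\tilde{y}_{n,1}, ..., \tilde{y}_{n,K}$ for each $x_n$, denoted by $\tilde{D}^{\circ} := \{(x_n, \tilde{y}_{n,1}, ..., \tilde{y}_{n,K})\}_{n \in [N]}$. For separation methods, the noisy training samples are obtained through variables $(X, \tilde{Y}_1), ..., (X, \tilde{Y}_K)$ where $(X, \tilde{Y}_i) \sim \tilde{\mathcal{D}}^{\circ}$ for $i \in [K]$. For aggregation methods such as majority vote, we assume the data points and aggregated noisy labels $\tilde{D}^{\bullet} := \{(x_n, \tilde{y}^{\bullet}_n)\}_{n \in [N]}$ are drawn from $(X, \tilde{Y}^{\bullet}) \sim \tilde{\mathcal{D}}^{\bullet}$, where $\tilde{Y}^{\bullet}$ is produced through the majority voting of $\tilde{Y}_1, ..., \tilde{Y}_K$. When we mention "noise rate", we usually refer to the average noise: $P(\tilde{Y}^u \neq Y)$.

ℓ-risk under the distribution  Given the loss ℓ, note that $\ell(f(x_n), \tilde{y}^{\circ}_n)$ denotes $\ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K}) = \frac{1}{K} \sum_{i \in [K]} \ell(f(x_n), \tilde{y}_{n,i})$. We define the empirical ℓ-risk for learning with separated/aggregated noisy labels as $\hat{R}_{\ell, \tilde{D}^u}(f) = \frac{1}{N} \sum_{n=1}^{N} \ell(f(x_n), \tilde{y}^u_n)$, where $u \in \{\circ, \bullet\}$ unifies the treatment, which is either separation ($\circ$) or aggregation ($\bullet$).

By increasing the sample size $N$, we would expect $\hat{R}_{\ell, \tilde{D}^u}(f)$ to be close to the following ℓ-risk under the noisy distribution $\tilde{\mathcal{D}}^u$: $R_{\ell, \tilde{\mathcal{D}}^u}(f) = E_{(X, \tilde{Y}^u) \sim \tilde{\mathcal{D}}^u}[\ell(f(X), \tilde{Y}^u)]$.

3.1. Bias of a Given Classifier w.r.t. ℓ-Loss

We denote by $f^* \in \mathcal{F}$ the optimal classifier obtained through the clean data distribution $(X, Y) \sim \mathcal{D}$ within the hypothesis space $\mathcal{F}$. We formally define the bias of a given classifier $\hat{f}$ as:

Definition 3.1 (Classifier Prediction Bias of ℓ-Loss). Denote by $R_{\ell, \mathcal{D}}(\hat{f}) := E_{\mathcal{D}}[\ell(\hat{f}(X), Y)]$, $R_{\ell, \mathcal{D}}(f^*) := E_{\mathcal{D}}[\ell(f^*(X), Y)]$. The bias of classifier $\hat{f}$ writes as: $\text{Bias}(\hat{f}) = R_{\ell, \mathcal{D}}(\hat{f}) - R_{\ell, \mathcal{D}}(f^*)$.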

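The Bias term quantifies the prediction bias (excess risk) of a given classifier $\hat{f}$ on the clean data distribution $\mathcal{D}$ w.r.t. the optimal achievable classifier $f^*$, which can be decomposed as [65]

$$\text{Bias}(\hat{f}^u) = \underbrace{R_{\ell, \mathcal{D}}(\hat{f}^u) - R_{\ell, \tilde{\mathcal{D}}^u}(\hat{f}^u)}_{\text{Distribution shift}} + \underbrace{R_{\ell, \tilde{\mathcal{D}}^u}(\hat{f}^u) - R_{\ell, \mathcal{D}}(f^*)}_{\text{Estimation error}}. \quad (1)$$

Now we bound the distribution shift and the estimation error in the following two lemmas.

Lemma 3.2 (Distribution shift). Denote by $p_i := P(Y = i)$, and assume ℓ is upper bounded by $\bar{\ell}$ and lower bounded by $\underline{\ell}$. The distribution shift in Eqn. (1) is upper bounded by

$$R_{\ell, \mathcal{D}}(\hat{f}^u) - R_{\ell, \tilde{\mathcal{D}}^u}(\hat{f}^u) \le \Delta^u_{R,1} := (\rho^u_0 p_0 + \rho^u_1 p_1) \cdot (\bar{\ell} - \underline{\ell}). \quad (2)$$

Lemma 3.3 (Estimation error). Suppose the loss function $\ell(f(x), y)$ is $L$-Lipschitz for any feasible $y$. $\forall f \in \mathcal{F}$, with probability at least $1 - \delta$, the estimation error is upper bounded by

$$R_{\ell, \tilde{\mathcal{D}}^u}(\hat{f}^u) - R_{\ell, \mathcal{D}}(f^*) \le \Delta^u_{R,2} := 4L \cdot \mathcal{R}(\mathcal{F}) + (\bar{\ell} - \underline{\ell}) \cdot \sqrt{\frac{\log(1/\delta)}{2 \eta^u_K N}},$$

where $u \in \{\circ, \bullet\}$ denotes either separation or aggregation methods, $\eta^{\circ}_K$ and $\eta^{\bullet}_K \equiv 1$ indicate the richness factor, which characterizes the effect of the number of labelers, and $\mathcal{R}(\mathcal{F})$ is the Rademacher complexity of $\mathcal{F}$.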

Theorem 3.4. Denote by $\alpha_K := (\rho^{\circ}_0 p_0 + \rho^{\circ}_1 p_1) - (\rho^{\bullet}_0 p_0 + \rho^{\bullet}_1 p_1)$, $\gamma = \sqrt{\log(1/\delta)/2N}$. The separation bias proxy $\Delta^{\circ}_R$ is smaller than the aggregation bias proxy $\Delta^{\bullet}_R$ if and only if

$$\alpha_K \cdot \frac{1}{1 - (\eta^{\circ}_K)^{-\frac{1}{2}}} \le \gamma. \quad (3)$$

Note that $\alpha_K$ and $\eta^{\circ}_K$ are non-decreasing w.r.t. the increase of $K$. In Section 4.3, we will explore how the LHS of Eqn. (3) is influenced by $K$: a short answer is that the LHS of Eqn. (3) is (generally) monotonically increasing w.r.t. $K$ when $K$ is small, indicating that, with the other quantities fixed, Eqn. (3) is easier to satisfy with a smaller $K$ than with a larger one.

3.2. Variance of a Given Classifier w.r.t. ℓ-Loss

We now move on to explore the variance of a given classifier when learning with ℓ-loss. Prior to the discussion, we define the variance of a given classifier as:

Definition 3.5 (Classifier Prediction Variance of ℓ-Loss). The variance of a given classifier $\hat{f}^u$ when learned with separation ($\circ$) or aggregation ($\bullet$) is defined as:

$$\text{Var}(\hat{f}^u) = E_{(X, \tilde{Y}^u) \sim \tilde{\mathcal{D}}^u} \Big[ \ell(\hat{f}^u(X), \tilde{Y}^u) - E_{(X, \tilde{Y}^u) \sim \tilde{\mathcal{D}}^u} [\ell(\hat{f}^u(X), \tilde{Y}^u)] \Big]^2.$$

For $g(x) = x - x^2$, we derive the closed form of Var and the corresponding upper bound as below.
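For the 0-1 loss, the closed form follows from a one-line calculation. The sketch below assumes $g(x) = x - x^2$ and writes $R$ for the risk of the classifier under the noisy distribution; it uses only the fact that $\ell \in \{0, 1\}$ implies $\ell^2 = \ell$.

```latex
\mathrm{Var} \;=\; E[\ell^2] - \big(E[\ell]\big)^2 \;=\; E[\ell] - \big(E[\ell]\big)^2 \;=\; R - R^2 \;=\; g(R).
```

Since $g$ is increasing on $[0, 1/2]$, any upper bound on $R$ that stays below $1/2$ carries over to an upper bound on the variance.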

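Theorem 3.6. When $\eta^u_K \ge \frac{2 \log(1/\delta)}{N}$, given ℓ is the 0-1 loss, we have:

$$\text{Var}(\hat{f}^u) = g\big(R_{\ell, \tilde{\mathcal{D}}^u}(\hat{f}^u)\big) \le g\left( \sqrt{\frac{\log(1/\delta)}{2 \eta^u_K N}} \right).$$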

Theorem 3.6 provides another view to decide on the choices of separation and aggregation methods, i.e., the proxy of classifier prediction variance. To extend the theoretical conclusions w.r.t. ℓ loss to the multi-class setting, we only need to modify the upper bound of the distribution shift in Eqn. ( 2 ), as specified in the following corollary.

Corollary 3.7 (Multi-Class Extension (ℓ-Loss)). In the $M$-class classification case, the upper bound of the distribution shift in Eqn. (2) becomes:

$$R_{\ell, \mathcal{D}}(\hat{f}^u) - R_{\ell, \tilde{\mathcal{D}}^u}(\hat{f}^u) \le \Delta^u_{R,1} := \sum_{i \in [M]} p_i \cdot (1 - T^u_{i,i}) \cdot (\bar{\ell} - \underline{\ell}).$$

4. Bias and Variance Analyses with Robust Treatments

Intuitively, the problem of learning with noisy labels could benefit from more robust loss functions built upon the generic ℓ loss, i.e., backward correction (surrogate loss) [21, 22] and peer loss functions [42]. We move on to explore the best way to learn with multiple copies of noisy labels when combined with existing robust approaches.

4.1. Backward Loss Correction

When combined with the backward loss correction approach ($\ell \to \ell_{\leftarrow}$), the empirical ℓ-risks become $\hat{R}_{\ell_{\leftarrow}, \tilde{D}^u}(f) = \frac{1}{N} \sum_{n=1}^{N} \ell_{\leftarrow}(f(x_n), \tilde{y}^u_n)$, where the corrected loss in the binary case is defined as

$$\ell_{\leftarrow}(f(x), \tilde{y}^u) = \frac{(1 - \rho^u_{1 - \tilde{y}^u}) \cdot \ell(f(x), \tilde{y}^u) - \rho^u_{\tilde{y}^u} \cdot \ell(f(x), 1 - \tilde{y}^u)}{1 - \rho^u_0 - \rho^u_1}.$$
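The binary backward-corrected loss can be written down directly from the formula. The sketch below uses hypothetical helper names, the 0-1 loss as the base loss, and example noise rates; it also checks the property that motivates the correction, namely that averaging the corrected loss over the noisy label recovers the loss on the clean label.

```python
def zero_one(pred, label):
    """Base loss used for the example."""
    return float(pred != label)

def backward_corrected_loss(loss, pred, noisy_label, rho0, rho1):
    """Binary backward correction with rho0 = P(noisy=1 | true=0), rho1 = P(noisy=0 | true=1)."""
    rho = (rho0, rho1)
    y = noisy_label
    numerator = (1 - rho[1 - y]) * loss(pred, y) - rho[y] * loss(pred, 1 - y)
    return numerator / (1 - rho0 - rho1)

def expected_corrected_loss(loss, pred, true_label, rho0, rho1):
    """Average the corrected loss over the noisy label, given the true label."""
    flip = rho1 if true_label == 1 else rho0
    keep = backward_corrected_loss(loss, pred, true_label, rho0, rho1)
    flipped = backward_corrected_loss(loss, pred, 1 - true_label, rho0, rho1)
    return (1 - flip) * keep + flip * flipped

# Example noise rates (assumed values); the result should match zero_one(1, 1).
value = expected_corrected_loss(zero_one, pred=1, true_label=1, rho0=0.2, rho1=0.3)
```

The division by $1 - \rho_0 - \rho_1$ is why the correction becomes unstable as the noise rates approach the point where the two sum to one.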

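Bias of given classifier w.r.t. ℓ←  Suppose the loss function $\ell(f(x), y)$ is $L$-Lipschitz for any feasible $y$. Lemma 4.1 gives the upper bound of the prediction bias of the classifier $\hat{f}^u_{\leftarrow}$ under the clean data distribution $\mathcal{D}$ when learning with $\ell_{\leftarrow}$ via $\hat{f}^u_{\leftarrow} = \arg\min_{f \in \mathcal{F}} \hat{R}_{\ell_{\leftarrow}, \tilde{D}^u}(f)$, where $u$ denotes either separation or aggregation methods.

Lemma 4.1. With probability at least $1 - \delta$, we have:

$$R_{\ell, \mathcal{D}}(\hat{f}^u_{\leftarrow}) - R_{\ell, \mathcal{D}}(f^*) \le \Delta^u_{R_{\leftarrow}} := 4 L^u_{\leftarrow} \cdot \mathcal{R}(\mathcal{F}) + L^u_{\leftarrow 0} \cdot (\bar{\ell} - \underline{\ell}) \cdot \sqrt{\frac{\log(1/\delta)}{2 \eta^u_K N}}.$$

Lemma 4.1 offers the upper bound of the performance gap for the given classifier w.r.t. the clean distribution $\mathcal{D}$, compared to the minimum achievable risk. We consider the bound $\Delta^u_{R_{\leftarrow}}$ as a proxy of the bias, and we are interested in the case where training the classifier separately yields a smaller bias proxy compared to that of the aggregation method, formally $\Delta^{\circ}_{R_{\leftarrow}} < \Delta^{\bullet}_{R_{\leftarrow}}$. For $\mathcal{F} \subset \{f : \mathcal{X} \to \{0, 1\}\}$ and the sample set $S = \{x_1, ..., x_N\}$, we give conditions under which training separately yields a smaller bias proxy.

Theorem 4.2. Denote by $\alpha_K := 1 - L^{\bullet}_{\leftarrow} / L^{\circ}_{\leftarrow}$, $\gamma = 1 \Big/ \Big( 1 + \frac{4}{\bar{\ell} - \underline{\ell}} \sqrt{\frac{d \log(N)}{\log(1/\delta)}} \Big)$, where $d$ is the VC-dimension of $\mathcal{F}$. For backward loss correction, the separation bias proxy $\Delta^{\circ}_{R_{\leftarrow}}$ is smaller than the aggregation bias proxy $\Delta^{\bullet}_{R_{\leftarrow}}$ if and only if

$$\alpha_K \cdot \frac{1}{1 - (\eta^{\circ}_K)^{-\frac{1}{2}}} \le \gamma. \quad (6)$$

We defer our empirical analysis of the monotonicity of the LHS in Eqn. (6) to Section 4.3 as well; it shares similar monotonicity behavior with learning w.r.t. ℓ.

Variance of given classifiers with Backward Loss Correction  Similar to the previous subsection, we now move on to check how separation and aggregation methods result in different variances when training with loss correction.

Theorem 4.3. When $L^u_{\leftarrow 0} (\eta^u_K)^{-\frac{1}{2}} < \sqrt{\frac{N}{2 (\bar{\ell} - \underline{\ell})^2 \log(1/\delta)}}$, $\text{Var}(\hat{f}^u_{\leftarrow})$ (w.r.t. the 0-1 loss) satisfies:

$$\text{Var}(\hat{f}^u_{\leftarrow}) = g\big(R_{\ell, \tilde{\mathcal{D}}^u}(\hat{f}^u_{\leftarrow})\big) \le g\left( L^u_{\leftarrow 0} \cdot (\bar{\ell} - \underline{\ell}) \cdot \sqrt{\frac{\log(1/\delta)}{2 \eta^u_K N}} \right). \quad (7)$$

The variance proxy of $\text{Var}(\hat{f}^{\circ}_{\leftarrow})$ in Eqn. (7) is smaller than that of $\text{Var}(\hat{f}^{\bullet}_{\leftarrow})$ if $\sqrt{\eta^{\circ}_K} > L^{\circ}_{\leftarrow} / L^{\bullet}_{\leftarrow}$.

Moving a bit further, when the noise transition matrix is symmetric for both methods, the requirement $\sqrt{\eta^{\circ}_K} > L^{\circ}_{\leftarrow} / L^{\bullet}_{\leftarrow}$ can be further simplified as: $\sqrt{\eta^{\circ}_K} > \frac{L^{\circ}_{\leftarrow}}{L^{\bullet}_{\leftarrow}} = \frac{1 - \rho^{\bullet}_0 - \rho^{\bullet}_1}{1 - \rho^{\circ}_0 - \rho^{\circ}_1}$. For a fixed $K$, a more efficient aggregation method decreases $\rho^{\bullet}_i$, which makes it harder to satisfy this condition.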

Recall $L^u_{\leftarrow} := L^u_{\leftarrow 0} \cdot L$. The theoretical insights for $\ell_{\leftarrow}$ in the binary case carry over to the multi-class setting by replacing $L^u_{\leftarrow 0}$ with the multi-class constant specified in the following corollary.

Corollary 4.4 (Multi-Class Extension (ℓ← -Loss)). Given a diagonal-dominant transition matrix , we have where min( ) denotes the minimal eigenvalue of the matrix . Particularly, if < 0.5, ∀ ∈ [ ], we further have ← 0 = min {︃ 1

, 1 − 2 min( ) 2√ }︃ , where := m∈[ax](1 − ).

4.2. Peer Loss Functions

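Peer loss [42] is a family of loss functions that are shown to be robust to label noise without requiring the knowledge of noise rates. Formally, $\ell_{\looparrowright}(f(x_n), \tilde{y}^u_n) := \ell(f(x_n), \tilde{y}^u_n) - \ell(f(x_{n_1}), \tilde{y}^u_{n_2})$, where the second term checks on mismatched data samples with $(x_n, \tilde{y}_n)$, $(x_{n_1}, \tilde{y}_{n_1})$, $(x_{n_2}, \tilde{y}_{n_2})$, which are randomly drawn from the same data distribution. When combined with the peer loss approach, i.e., $\ell \to \ell_{\looparrowright}$, the two risks become: $\hat{R}_{\ell_{\looparrowright}, \tilde{D}^u}(f) = \frac{1}{N} \sum_{n=1}^{N} \ell_{\looparrowright}(f(x_n), \tilde{y}^u_n)$, $u \in \{\circ, \bullet\}$.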
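The peer loss definition above can be sketched in a few lines. The code assumes precomputed predictions, the 0-1 loss as the base loss, uniform random sampling of the two peer indices, and hypothetical function names; it is meant to show the structure of the loss, not the reference implementation of [42].

```python
import random

def zero_one(pred, label):
    """Base loss used for the example."""
    return float(pred != label)

def peer_loss(loss, preds, noisy_labels, n, rng):
    """Loss on sample n minus the loss on a mismatched (prediction, label) pair."""
    n1 = rng.randrange(len(preds))          # index of the peer prediction
    n2 = rng.randrange(len(noisy_labels))   # index of the peer label
    return loss(preds[n], noisy_labels[n]) - loss(preds[n1], noisy_labels[n2])

def empirical_peer_risk(loss, preds, noisy_labels, rng):
    """Average peer loss over the dataset."""
    total = sum(peer_loss(loss, preds, noisy_labels, n, rng)
                for n in range(len(preds)))
    return total / len(preds)

rng = random.Random(0)
# Degenerate case: every prediction agrees with every label, so both terms vanish.
risk = empirical_peer_risk(zero_one, [1, 1, 1], [1, 1, 1], rng)
```

The subtracted term is evaluated on a prediction and a label that come from independently drawn samples, so a classifier cannot lower the peer loss merely by matching the marginal distribution of the noisy labels.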

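Bias of given classifier w.r.t. ℓ↬  Suppose the loss function $\ell(f(x), y)$ is $L$-Lipschitz for any feasible $y$. Let $L^u_{\looparrowright 0} := 1/(1 - \rho^u_0 - \rho^u_1)$, $L^u_{\looparrowright} := L^u_{\looparrowright 0} \cdot L$ and $\hat{f}^u_{\looparrowright} = \arg\min_{f \in \mathcal{F}} \hat{R}_{\ell_{\looparrowright}, \tilde{D}^u}(f)$.

Lemma 4.5. With probability at least $1 - \delta$, we have:

Theorem 4.6. Denote by $\alpha_K := 1 - L^{\bullet}_{\looparrowright} / L^{\circ}_{\looparrowright}$, and let $\gamma$ denote the corresponding constant, which depends on $\bar{\ell} - \underline{\ell}$, $\delta$, and $d$, where $d$ denotes the VC-dimension of $\mathcal{F}$. For peer loss, the separation bias proxy $\Delta^{\circ}_{R_{\looparrowright}}$ is smaller than the aggregation bias proxy $\Delta^{\bullet}_{R_{\looparrowright}}$ if and only if

$$\alpha_K \cdot \frac{1}{L^{\bullet}_{\looparrowright} / L^{\circ}_{\looparrowright} - (\eta^{\circ}_K)^{-\frac{1}{2}}} \le \gamma. \quad (8)$$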

[Figure: panels labeled "Loss"; x-axis: Number of Labelers.]

Note that the condition in Eqn. (8) shares a similar pattern with those for the basic loss ℓ and ℓ←; we will empirically illustrate the monotonicity of its LHS in Section 4.3.

Variance of given classifiers with Peer Loss  We now move on to check how separation and aggregation methods result in different variances when training with peer loss. Similarly, we can obtain:

Theorem 4.7. When $\sqrt{\eta^u_K N} \ge \sqrt{2 \log(4/\delta)} \cdot \big(1 + 2(\bar{\ell} - \underline{\ell})\big)$, $\text{Var}(\hat{f}^u_{\looparrowright})$ (w.r.t. the 0-1 loss) satisfies:

$$\text{Var}(\hat{f}^u_{\looparrowright}) = g\big(R_{\ell, \tilde{\mathcal{D}}^u}(\hat{f}^u_{\looparrowright})\big) \le g\Bigg( \underbrace{L^u_{\looparrowright 0} \cdot \sqrt{\frac{\log(4/\delta)}{2 \eta^u_K N}} \cdot \big(1 + 2(\bar{\ell} - \underline{\ell})\big)}_{\text{Variance proxy}} \Bigg). \quad (9)$$

The variance proxy of $\text{Var}(\hat{f}^{\circ}_{\looparrowright})$ in Eqn. (9) is smaller than that of $\text{Var}(\hat{f}^{\bullet}_{\looparrowright})$ if $\sqrt{\eta^{\circ}_K} \ge L^{\circ}_{\looparrowright} / L^{\bullet}_{\looparrowright}$.

The theoretical insights for ℓ↬ also have multi-class extensions; we only need to generalize $L^u_{\looparrowright 0}$ to the multi-class setting along with additional conditions specified as below:

Corollary 4.8 (Multi-Class Extension (ℓ↬-Loss)). Assume ℓ↬ is classification-calibrated in the multi-class setting, and the clean label has equal prior $P(Y = j) = \frac{1}{M}$, $\forall j \in [M]$. For the uniform noise transition matrix [44] such that $T_{j,i} = e_i$ for all $j \in [M]$ with $j \neq i$, we have: $L^u_{\looparrowright 0} = 1 / (1 - \sum_{i \in [M]} e_i)$.

4.3. Analysis of the Theoretical Conditions

Recall that the established conditions in Theorems 3.4, 4.2, 4.6 are implicitly relevant to the number of labelers $K$, and the RHS of Eqns. (3), (6), (8) are constants. We proceed to analyze the monotonicity of the corresponding LHS (in the form of $\alpha_K \cdot \frac{1}{\beta - (\eta_K)^{-\frac{1}{2}}}$) w.r.t. the increase of $K$, where $\beta = 1$ for ℓ and ℓ←, and $\beta = L^{\bullet}_{\looparrowright} / L^{\circ}_{\looparrowright}$ for ℓ↬. Thus, we have: $O(\text{LHS}) = O\big( \alpha_K \cdot ( \beta - O( \frac{\log(K)}{\sqrt{K}} ) )^{-1} \big)$. We visualize this order under different symmetric $T$ in Figure 3.

[Figure: panels "Cross-Entropy" and "Backward Correction"; panel title "Instance-Dependent Noise, CIFAR-10".]

It can be observed that when $K$ is small (e.g., $K \le 5$), the LHS parts of these conditions increase with $K$, while they may decrease with $K$ if $K$ is sufficiently large. Recall that separation is better if the LHS is less than the constant value $\gamma$. Therefore, Figure 3 shows the trend that aggregation is generally better than separation when $K$ is sufficiently large.

Tightness of the bias proxies  In Theorems 3.4, 4.2, 4.6, we view the error bounds $\Delta^u_{R}$, $\Delta^u_{R_{\leftarrow}}$, $\Delta^u_{R_{\looparrowright}}$ as proxies of the worst-case performance of the trained classifier. For the standard loss function ℓ, it has been proven [66, 67] that, under mild conditions on ℓ and $\mathcal{F}$, the lower bound of the performance gap between a trained classifier ($\hat{f}$) and the optimal achievable one ($f^*$), $R_{\ell, \mathcal{D}}(\hat{f}) - R_{\ell, \mathcal{D}}(f^*)$, is of the order $O(\sqrt{1/N})$, which is of the same order as that in Theorem 3.4. Noting that the behavior concluded from the worst-case bounds may not always hold for each individual case, we further use experiments to validate our analyses in the next section.

... aggregated labels (majority vote, EM inference), and separated labels. We highlight the results with Green (for the separation method) and Red (for aggregation methods) if the performance gap is larger than 0.05. ($K$ is the number of labels per training image.)

UCI-Breast (symmetric), CE; columns: $K = 5$, $K = 9$, $K = 15$

5. Experimental Results

In this section, we empirically compare the performance of different treatments on the multiple noisy labels when learning with robust loss functions (CE loss, forward loss correction, and peer loss). We consider several treatments, including label aggregation methods (majority vote and EM inference) and the label separation method. Assuming that multiple noisy labels have different weights, EM inference can be used to solve the problem under this assumption by treating the aggregated labels as hidden variables [68, 69, 8, 70]. In the E-step, the probabilities of the aggregated labels are estimated using the weighted aggregation approach based on the fixed weights of multiple noisy labels. In the M-step, the EM inference method re-estimates the weights of multiple noisy labels based on the current aggregated labels. This iteration continues until all aggregated labels remain unchanged. As for label separation, we adopted the mini-batch separation method, i.e., each training sample is assigned $K$ noisy labels in each batch.

5.1. Experiment on Synthetic Noisy Datasets

Experimental results on synthetic noisy UCI datasets [71]  We adopt six UCI datasets to empirically compare the performances of label separation and aggregation methods when learning with CE loss, backward correction [21, 22], and Peer Loss [42]. The noisy annotations given by multiple annotators are simulated by symmetric label noise, which assumes $T_{i,j} = \frac{\epsilon}{M - 1}$ for $j \neq i$ for each annotator, where $\epsilon$ quantifies the overall noise rate of the generated noisy labels. In Figure 4, we adopt two UCI datasets (StatLog: ($M = 6$); Optical: ($M = 10$)) for illustration. From the results in Figure 4, it is quite clear that the label separation method consistently outperforms both aggregation methods (majority vote and EM inference), and is considered to be more beneficial on such small-scale datasets.
Results on additional datasets and more details are deferred to the Appendix.
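The E-step/M-step loop described at the start of this section can be sketched as a weighted-vote iteration. This is a simplified illustration under stated assumptions, not the exact EM procedure of [68, 69, 8, 70]: each annotator's weight is taken to be its agreement rate with the current aggregated labels, ties are broken toward the smaller class index, and all function names are hypothetical.

```python
def weighted_vote(votes_per_item, weights, num_classes):
    """E-step stand-in: pick, per item, the class with the largest total annotator weight."""
    aggregated = []
    for votes in votes_per_item:
        scores = [0.0] * num_classes
        for annotator, label in enumerate(votes):
            scores[label] += weights[annotator]
        aggregated.append(max(range(num_classes), key=scores.__getitem__))
    return aggregated

def em_aggregate(votes_per_item, num_classes, max_iters=50):
    """Alternate weighted voting and weight re-estimation until labels stop changing."""
    num_annotators = len(votes_per_item[0])
    weights = [1.0] * num_annotators  # equal weights: the first pass is a majority vote
    aggregated = weighted_vote(votes_per_item, weights, num_classes)
    for _ in range(max_iters):
        # M-step stand-in: weight = agreement rate with the current aggregated labels.
        weights = [sum(votes[k] == a for votes, a in zip(votes_per_item, aggregated))
                   / len(aggregated) for k in range(num_annotators)]
        updated = weighted_vote(votes_per_item, weights, num_classes)
        if updated == aggregated:
            break
        aggregated = updated
    return aggregated, weights

# Toy data: annotators 0 and 1 always report the truth, annotator 2 always flips it.
truth = [0, 1, 0, 1, 1, 0]
votes = [[t, t, 1 - t] for t in truth]
labels, weights = em_aggregate(votes, num_classes=2)
```

On this toy input the loop converges after one re-estimation and assigns the always-wrong annotator a weight of zero.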

Experimental results on synthetic noisy CIFAR-10 dataset [72]  On the CIFAR-10 dataset, we consider two types of simulation for the separate noisy labels: the symmetric label noise model and instance-dependent label noise [53, 24], where $\epsilon$ is the average noise rate and different labelers follow different instance-dependent noise transition matrices. For a fair comparison, we adopt the ResNet-34 model [73] and the same training procedure and batch size for all considered treatments on the separate noisy labels. In the low-noise regime or when $K$ is large, aggregating separate noisy labels significantly reduces the noise rates, and aggregation methods tend to perform better; in the high-noise regime or when $K$ is small, separation methods tend to be more promising. As $K$ increases or the noise rate decreases, we can observe a preference transition from label separation to label aggregation methods.

5.2. Empirical Verification of the Theoretical Bounds

To verify the comparisons of bias proxies (i.e., Theorem 3.4) from an empirical perspective, we adopt two binary classification UCI datasets for demonstration: the Breast and German datasets, as shown in Table 1. Clearly, on these two binary classification tasks, label aggregation methods tend to outperform label separation, and we attribute this phenomenon to the fact that the denoising effect of label aggregation is more significant in the binary case.

For Theorem 3.4 (CE loss), the condition is stated in terms of $\alpha_K / \big(1 - (\eta^{\circ}_K)^{-\frac{1}{2}}\big)$, where $\alpha_K = (\rho^{\circ}_0 p_0 + \rho^{\circ}_1 p_1) - (\rho^{\bullet}_0 p_0 + \rho^{\bullet}_1 p_1)$. The information could be summarized in Table 2, where the column $(1 - \delta, K)$ means: when the number of annotators $K$ belongs to the listed set, the label separation method is likely to underperform label aggregation (i.e., majority vote) with probability at least $1 - \delta$. For example, in the last row of Table 2, when training on the UCI German dataset with CE loss under noise rate 0.4 (the noise rate of separate noisy labels), Theorem 3.4 reveals that with probability at least 0.98, label aggregation (with majority vote) is better than label separation when $K > 23$, which aligns well with our empirical observations (label separation is better only when $K < 15$).

5.3. Experiments on Realistic Noisy Datasets

Note that in real-world scenarios, the label-noise pattern may differ due to the expertise of each human annotator. We further compare the different treatments on two realistic noisy datasets: CIFAR-10N [74] and CIFAR-10H [75]. CIFAR-10N provides each CIFAR-10 train image with 3 independent human annotations, while CIFAR-10H gives ≈ 50 annotations for each CIFAR-10 test image.

In Table 3, we reproduce three robust loss functions with three different treatments on the separate noisy labels. We report the best-achieved test accuracy for the Cross-Entropy/Backward Correction/Peer Loss methods when learning with label aggregation methods (majority vote and EM inference) and the separation method (soft-label). We observe that the separation method tends to perform better than the aggregation ones. This may be attributed to the relatively high noise rate ($\epsilon \approx 0.18$) in CIFAR-10N and the insufficient number of labelers ($K = 3$). Note that since the noise level in CIFAR-10H is low ($\epsilon \approx 0.07$ wrong labels), label aggregation methods can infer higher-quality labels and thus result in better performance than separation methods (Red colored cells in Tables 3 and 4).

5.4. Hypothesis Testing

We adopt the paired t-test to show which treatment on the separate noisy labels is better under certain conditions. In Table 5, we report the statistic and $p$-value given by the hypothesis testing results. The column “Methods” indicates the two methods we want to compare (A & B). A positive statistic means that A is better than B in terms of test accuracy. Given a specific setting, denote by Acc_method the list of test accuracies that belong to this setting (i.e., CIFAR-10N, $K = 3$), including the CE, BW, and PL loss functions; the basic hypothesis could be summarized as below:

To clarify, the three cases in the above hypothesis are tested independently. For test accuracy comparisons of CIFAR-10N in Table 3, the setting of the hypothesis test is $K = 3$ and the label noise rate is relatively high (18%). All $p$-values are larger than 0.05, indicating that we should reject the null hypothesis, and we can conclude that the performance of these three methods on CIFAR-10N (high noise, small $K$) satisfies: EM < MV < Sep.

For CIFAR-10H in Tables 3 and 4, the label noise rate is relatively low. We consider two scenarios ($K < 15$: the number of annotators is small; $K \ge 15$: the number of annotators is large). The $p$-values between MV and EM are always large, which means that the denoising effect of the advanced label aggregation method (EM) is negligible on the CIFAR-10H dataset. However, the $p$-values of the remaining settings are larger than 0.05, indicating that we should reject the null hypothesis, and we can conclude that the performance of these 3 methods on CIFAR-10H (low noise, small/large $K$) satisfies: EM/MV > Sep.
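The paired t-test statistic used in this section can be computed with the standard library alone. In the sketch below the accuracy values are made-up placeholders (one per loss function), not results from the paper, and the function name is hypothetical; the $p$-value additionally requires the Student t distribution with $n - 1$ degrees of freedom, which is not computed here.

```python
import math
import statistics

def paired_t_statistic(acc_a, acc_b):
    """t = mean(d) / (stdev(d) / sqrt(n)) over the paired differences d = a - b."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Placeholder accuracies for methods A and B under three loss functions (CE, BW, PL).
acc_a = [0.90, 0.91, 0.92]
acc_b = [0.88, 0.90, 0.89]
t_stat = paired_t_statistic(acc_a, acc_b)
```

A positive statistic means that method A has the higher mean accuracy over the paired settings, matching the sign convention stated above.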

6. Conclusions

When learning with separate noisy labels, we explore the answer to the question “whether one should aggregate separate noisy labels into single ones or use them separately as given”. In the empirical risk minimization framework, we theoretically show that label separation could be more beneficial than label aggregation when the noise rates are high or the number of labelers is insufficient. These insights hold for a number of popular loss functions including several robust treatments. Empirical results on synthetic and real-world datasets validate our conclusion.

machine intelligence 39 (2017) 2409–2422.
[12] T. Luo, Y. Liu, Machine truth serum, arXiv preprint arXiv:1909.13004 (2019).
[13] P. G. Ipeirotis, F. Provost, V. S. Sheng, J. Wang, Repeated labeling using multiple noisy labelers, Data Mining and Knowledge Discovery 28 (2014) 402–441.
[14] V. S. Sheng, J. Zhang, B. Gu, X. Wu, Majority voting and pairing with multiple noisy labeling, IEEE Transactions on Knowledge and Data Engineering 31 (2017) 1355–1368.
[15] Z. Zhu, Z. Dong, Y. Liu, Detecting corrupted labels without training a model to predict, arXiv preprint arXiv:2110.06283 (2022).
[16] Z.-H. Zhou, Ensemble methods: foundations and algorithms, CRC Press, 2012.
[17] M. Guan, V. Gulshan, A. Dai, G. Hinton, Who said what: Modeling individual labelers improves classification, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[18] F. Rodrigues, F. Pereira, Deep learning from crowds, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[19] Z. Chen, H. Wang, H. Sun, P. Chen, T. Han, X. Liu, J. Yang, Structured probabilistic end-to-end learning from crowds, in: IJCAI, 2020, pp. 1512–1518.
[20] H. Wei, R. Xie, L. Feng, B. Han, B. An, Deep learning from multiple noisy annotators as a union, IEEE Transactions on Neural Networks and Learning Systems (2022).
[21] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in: Advances in Neural Information Processing Systems, 2013, pp. 1196–1204.
[22] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: A loss correction approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952.
[23] X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, M. Sugiyama, Are anchor points really indispensable in label-noise learning?, Advances in Neural Information Processing Systems 32 (2019).
[24] Z. Zhu, Y. Song, Y. Liu, Clusterability as an alternative to anchor points when learning with noisy labels, in: International Conference on Machine Learning, PMLR, 2021, pp. 12912–12923.
[25] Z. Zhu, J. Wang, Y. Liu, Beyond images: Label noise transition matrix estimation for tasks with lower-quality features, arXiv preprint arXiv:2202.01273 (2022).
[26] Z. Jiang, K. Zhou, Z. Liu, L. Li, R. Chen, S.-H. Choi, X. Hu, An information fusion approach to learning with instance-dependent label noise, in: International Conference on Learning Representations, 2022.
[27] Z. Zhang, Y. Li, H. Wei, K. Ma, T. Xu, Y. Zheng, Alleviating noisy-label effects in image classification via probability transition matrix, arXiv preprint arXiv:2110.08866 (2021).
[28] S. Li, X. Xia, H. Zhang, Y. Zhan, S. Ge, T. Liu, Estimating noise transition matrix with label correlations for noisy multi-label learning, in: Advances in Neural Information Processing Systems, 2022.
[29] X. Xia, B. Han, N. Wang, J. Deng, J. Li, Y. Mao, T. Liu, Extended $T$: Learning with mixed closed-set and open-set noisy labels, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[30] T. Liu, D. Tao, Classification with noisy labels by importance reweighting, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2016) 447–461.
[31] H.-S. Chang, E. Learned-Miller, A. McCallum, Active bias: Training more accurate neural networks by emphasizing high variance samples, Advances in Neural Information Processing Systems 30 (2017).
[32] N. Bar, T. Koren, R. Giryes, Multiplicative reweighting for robust neural network optimization, arXiv preprint arXiv:2102.12192 (2021).
[33] N. Majidi, E. Amid, H. Talebi, M. K. Warmuth, Exponentiated gradient reweighting for robust training under label noise and beyond, arXiv preprint arXiv:2104.01493 (2021).
[34] A. Kumar, E. Amid, Constrained instance and class reweighting for robust learning under label noise, arXiv preprint arXiv:2111.05428 (2021).
[35] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, A. Rabinovich, Training deep neural networks on noisy labels with bootstrapping, arXiv preprint arXiv:1412.6596 (2014).
[36] M. Lukasik, S. Bhojanapalli, A. Menon, S. Kumar, Does label smoothing mitigate label noise?, in: International Conference on Machine Learning, PMLR, 2020, pp. 6448–6458.
[37] J. Wei, H. Liu, T. Liu, G. Niu, Y. Liu, Understanding generalized label smoothing when learning with noisy labels, arXiv preprint arXiv:2106.04149 (2021).
[38] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, J. Bailey, Symmetric cross entropy for robust learning with noisy labels, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 322–330.
[39] E. Amid, M. K. Warmuth, R. Anil, T. Koren, Robust bi-tempered logistic loss based on Bregman divergences, Advances in Neural Information Processing Systems 32 (2019).
[40] J. Wang, H. Guo, Z. Zhu, Y. Liu, Policy learning using weak supervision, Advances in Neural Information Processing Systems 34 (2021).
[41] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, J. Bailey, Normalized loss functions for deep learning with noisy labels, in: International Conference on Machine Learning, PMLR, 2020, pp. 6543–6553.
[42] Y. Liu, H. Guo, Peer loss functions: Learning from noisy labels without knowing noise rates, in: International Conference on Machine Learning, PMLR, 2020, pp. 6226–6236.
[43] Z. Zhu, T. Liu, Y. Liu, A second-order approach to learning with instance-dependent label noise, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10113–10123.
[44] J. Wei, Y. Liu, When optimizing $f$-divergence is robust with label noise, arXiv preprint arXiv:2011.03687 (2020).
[45] H. Wei, H. Zhuang, R. Xie, L. Feng, G. Niu, B. An, Y. Li, Logit clipping for robust learning against label noise, arXiv preprint arXiv:2212.04055 (2022).
[46] X. Xia, T. Liu, B. Han, C. Gong, N. Wang, Z. Ge, Y. Chang, Robust early-learning: Hindering the memorization of noisy labels, in: International Conference on Learning Representations, 2020.
[47] S. Liu, J. Niles-Weed, N. Razavian, C. Fernandez-Granda, Early-learning regularization prevents memorization of noisy labels, Advances in Neural Information Processing Systems 33 (2020) 20331–20342.
[48] S. Liu, Z. Zhu, Q. Qu, C. You, Robust training under label noise by over-parameterization, arXiv preprint arXiv:2202.14026 (2022).
[49] H. Cheng, Z. Zhu, X. Sun, Y. Liu, Demystifying how self-supervised features improve training from noisy labels, arXiv preprint arXiv:2110.09022 (2021).
[50] H. Wei, L. Tao, R. Xie, B. An, Open-set label noise can improve robustness against inherent label noise, Advances in Neural Information Processing Systems 34 (2021).
[51] H. Huang, H. Kang, S. Liu, O. Salvado, T. Rakotoarivelo, D. Wang, T. Liu, Paddles: Phase-amplitude spectrum disentangled early stopping for learning with noisy labels (????).
[52] S. Liu, K. Liu, W. Zhu, Y. Shen, C.
Fernandez-Granda, Adaptive early-learning correction for segmentation from noisy annotations, arXiv preprint arXiv:2110.03740 (2021). [53] H. Cheng, Z. Zhu, X. Li, Y. Gong, X. Sun, Y. Liu, Learning with instance-dependent label noise: A sample sieve approach, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=2VXyy9mIyU3. [54] T. Luo, X. Li, H. Wang, Y. Liu, Research replication prediction using weakly supervised learning, in: In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, 2020. [55] Z. Wang, J. Jiang, B. Han, L. Feng, B. An, G. Niu, G. Long, Seminll: A framework of noisy-label learning by semi-supervised learning, arXiv preprint arXiv:2012.00925 (2020). [56] C. Qin, Y. Wang, Y. Fu, Robust semi-supervised domain adaptation against noisy labels, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 4409–4413. [57] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, M. Sugiyama, Co-teaching: Robust training of deep neural networks with extremely noisy labels, in: Advances in neural information processing systems, 2018, pp. 8527–8537. [58] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, M. Sugiyama, How does disagreement help generalization against label corruption?, in: International Conference on Machine Learning, PMLR, 2019, pp. 7164–7173. [59] H. Wei, L. Feng, X. Chen, B. An, Combating noisy labels by agreement: A joint training method with co-regularization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13726–13735. [60] B. Han, Q. Yao, T. Liu, G. Niu, I. W. Tsang, J. T. Kwok, M. Sugiyama, A survey of label-noise representation learning: Past, present and future, arXiv preprint arXiv:2011.04406 (2020). [61] H. Song, M. Kim, D. Park, Y. Shin, J.-G. 
Lee, Learning from noisy labels with deep neural networks: A survey, IEEE Transactions on Neural Networks and Learning Systems (2022). [62] V. Feldman, Does learning require memorization? a short tale about a long tail, in: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020, pp. 954–959. [63] Y. Liu, Understanding instance-level label noise: Disparate impacts and treatments, in:

International Conference on Machine Learning, PMLR, 2021, pp. 6725–6735. [64] W. Tang, M. Yin, C.-J. Ho, Leveraging peer communication to enhance crowdsourcing, in:

The World Wide Web Conference, 2019, pp. 1794–1805. [65] Z. Zhu, T. Luo, Y. Liu, The rich get richer: Disparate impact of semi-supervised learning, arXiv preprint arXiv:2110.06282 (2021). [66] S. Mendelson, Lower bounds for the empirical minimization algorithm, IEEE Transactions on Information Theory 54 (2008) 3797–3803. [67] G. Lecué, S. Mendelson, Sharper lower bounds on the performance of the empirical risk minimization algorithm, Bernoulli (2010) 605–613. [68] A. P. Dawid, A. M. Skene, Maximum likelihood estimation of observer error-rates using the em algorithm, Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1979) 20–28. [69] P. Smyth, U. Fayyad, M. Burl, P. Perona, P. Baldi, Inferring ground truth from subjective labelling of venus images, Advances in neural information processing systems 7 (1994). [70] N. Quoc Viet Hung, N. T. Tam, L. N. Tran, K. Aberer, An evaluation of aggregation techniques in crowdsourcing, in: International Conference on Web Information Systems Engineering, Springer, 2013, pp. 1–15. [71] D. Dua, C. Graf, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml. [72] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images,

Technical Report, Citeseer, 2009. [73] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [74] J. Wei, Z. Zhu, H. Cheng, T. Liu, G. Niu, Y. Liu, Learning with noisy labels revisited: A study using real-world human annotations, arXiv preprint arXiv:2110.12088 (2021). [75] J. C. Peterson, R. M. Battleday, T. L. Grifiths, O. Russakovsky, Human uncertainty makes classification more robust, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9617–9626. [76] J. M. Varah, A lower bound for the smallest singular value of a matrix, Linear Algebra and its applications 11 (1975) 3–5. [77] X. Xia, T. Liu, B. Han, N. Wang, M. Gong, H. Liu, G. Niu, D. Tao, M. Sugiyama, Partdependent label noise: Towards instance-dependent label noise, Advances in Neural Information Processing Systems 33 (2020) 7597–7610.

A. Full Proofs

In this section, we present the proofs omitted from the main paper.

We first give the proof of Lemma 4.1, because it is also useful for the proofs of the results in Section 3.

A.1. Proof of Lemma 4.1

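Proof. To apply Hoeffding's inequality to the dataset of the separation method, we divide the noisy training samples $\{(x_n, \tilde{y}_{n,k})\}_{n\in[N]}$ into $K$ groups, for $k \in [K]$, i.e., $\{(x_n, \tilde{y}_{n,1})\}_{n\in[N]}, \cdots, \{(x_n, \tilde{y}_{n,K})\}_{n\in[N]}$. Note that within each group, e.g., group $\{(x_n, \tilde{y}_{n,1})\}_{n\in[N]}$, all the $N$ training samples are i.i.d. Additionally, training samples from any two different groups are also i.i.d. given the feature set $\{x_n\}_{n\in[N]}$. Thus, with one group $\{(x_n, \tilde{y}_{n,1})\}_{n\in[N]}$, w.p. $1 - \delta_0$, we have

$$\left| \hat{R}_{\ell_\leftarrow|\text{Group-1}}(f) - R_{\ell_\leftarrow}(f) \right| \le \left( \bar{\ell}_\leftarrow - \underline{\ell}_\leftarrow \right) \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}}, \quad \forall f.$$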

Note that:

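Applying the above technique to the other groups and using the union bound, we know that w.p. at least $1 - K\delta_0$, $\forall k \in [K]$,

$$\hat{R}_{\ell_\leftarrow|\text{Group-}k}(f) \in \left[ R_{\ell_\leftarrow}(f) - L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}},\; R_{\ell_\leftarrow}(f) + L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}} \right].$$

Each $\hat{R}_{\ell_\leftarrow|\text{Group-}k}(f)$, $k \in [K]$, can therefore be seen as a random variable within the range above. The randomness is from the noisy labels $\tilde{y}_{n,k}$. Recall that the samples between different groups are i.i.d. given $\{x_n\}_{n\in[N]}$. Then the above random variables are i.i.d. when the feature set is fixed. By Hoeffding's inequality, w.p. at least $1 - K\delta_0 - \delta_1$, $\forall f$, we have

$$\left| \hat{R}_{\ell_\leftarrow}(f) - R_{\ell_\leftarrow}(f) \right| \le 2 \cdot L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_0)}{2N}} \cdot \sqrt{\frac{\log(1/\delta_1)}{2K}} = L^{\circ}_{\leftarrow 0} \cdot \sqrt{\frac{\log(1/\delta_1)\log(1/\delta_0)}{NK}}.$$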
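As a numerical sanity check on how the two Hoeffding steps above combine, the following Python sketch evaluates both sides of the final equality. It assumes N samples per group and K groups, and uses a hypothetical constant `c` as a stand-in for the factor multiplying the square roots; the function names are illustrative and do not come from the paper.

```python
import math


def per_group_radius(c, n_samples, delta0):
    """Half-width of the per-group Hoeffding interval: c * sqrt(log(1/delta0) / (2N))."""
    return c * math.sqrt(math.log(1.0 / delta0) / (2.0 * n_samples))


def combined_deviation(c, n_samples, n_groups, delta0, delta1):
    """Second Hoeffding step over K group averages, each lying in an interval of width 2 * radius."""
    radius = per_group_radius(c, n_samples, delta0)
    return 2.0 * radius * math.sqrt(math.log(1.0 / delta1) / (2.0 * n_groups))


def closed_form(c, n_samples, n_groups, delta0, delta1):
    """Simplified right-hand side: c * sqrt(log(1/delta1) * log(1/delta0) / (N * K))."""
    return c * math.sqrt(
        math.log(1.0 / delta1) * math.log(1.0 / delta0) / (n_samples * n_groups)
    )
```

With the choice delta0 = delta1 = delta / (K + 1) made in the next step of the proof, `closed_form` reduces to c * log((K + 1) / delta) / sqrt(N * K), which is the form in which the deviation term appears later.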

For 0 = 1 = +1 , with the Rademacher bound on the maximal deviation between risks and empirical ones, for * ∈ ℱ and the separation method, with probability at least 1 − , we ⃒ ℓ← ,̃︀ ( ) − ℓ← ,̃︀

( )⃒⃒ ≤ 2R∘ (ℓ← ∘ ℱ ) + ← 0 · (ℓ − ℓ) · log √︂ 1 , =2R∙ (ℓ← ∘ ℱ ) + ∙← 0 · (ℓ − ℓ) · where we define ℓ, ℓ as the upper and lower bound of loss function ℓ respectively, and: have: max ⃒⃒ ^ =1 =1 ∑︁ ∑︁ ℓ← ( (), ˜, )⎦ ⎤ ≤ 1 ∑︁ ℓ← ( (), ˜, ) ,

]︃ ]︃

R∙ (ℓ← ∘ ℱ ) := E,˜∙ , su∈ℱp ℓ← ( (), ˜∙ ) .

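Note that we assume the noisy labels given by the $K$ labelers follow the same noise transition matrix. If $\ell$ is $L$-Lipschitz, then for separation and aggregation methods, $\ell_\leftarrow$ is $L^{u}_{\leftarrow}$-Lipschitz for $u \in \{\circ, \bullet\}$ respectively, where

$$L^{u}_{\leftarrow} = \frac{(1 + |\rho^{u}_0 - \rho^{u}_1|)\, L}{1 - \rho^{u}_0 - \rho^{u}_1} \le \frac{2L}{1 - \rho^{u}_0 - \rho^{u}_1}.$$

By the Lipschitz composition property of Rademacher averages, we have $\mathfrak{R}^{u}(\ell_\leftarrow \circ \mathcal{F}) \le L^{u}_{\leftarrow} \cdot \mathfrak{R}(\mathcal{F})$. Thus, for the separation method, we have:

$$\max_{f\in\mathcal{F}} \left| \hat{R}_{\ell_\leftarrow, \tilde{D}^{\circ}}(f) - R_{\ell_\leftarrow, \tilde{D}^{\circ}}(f) \right| \le 2 L^{\circ}_{\leftarrow} \mathfrak{R}(\mathcal{F}) + \frac{(1 + |\rho^{\circ}_0 - \rho^{\circ}_1|) \cdot (\bar{\ell} - \underline{\ell})}{1 - \rho^{\circ}_0 - \rho^{\circ}_1} \cdot \log\!\left(\frac{K+1}{\delta}\right) \cdot \sqrt{\frac{1}{KN}}.$$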

min∈ℱ ℓ,( ), for separation methods, we further have: ℓ,(^ ← ) − ℓ,( * ) = ℓ← ,̃︀ (^ ← ) − ℓ← ,̃︀

( * ) = ℓ← ,̃︀ (^ ← ) − ^ ℓ← ,̃︀ (^ ← ) + ^ ℓ← ,̃︀ ( * ) − ℓ← ,̃︀ ( * ) + ^ ℓ← ,̃︀ (^ ← ) − ^ ℓ← ,̃︀ ( * ) ∈ℱ ℓ← ,̃︀ ( ) − ℓ← ,̃︀

( )| ≤ 4← R(ℱ ) + 2← · (ℓ − ℓ) · log( ) · √︂ 1

· √︂ 1 ( 10 )

, 2 .

Similarly, for aggregation methods, we have: ℓ,(^ ∙← ) − ℓ,( * ) = (^ ∙← ) −

ℓ← ,̃︀∙ ( * ) =

ℓ← ,̃︀∙ ∈ℱ ℓ← ,̃︀∙ ( ) −

ℓ← ,̃︀∙ ( )| ≤ 4∙← R(ℱ ) + 2∙← · (ℓ − ℓ) · 2(log( +1 ))2 and ∙ ≡ 1, we then have: ℓ← ,̃︀∙ ( * ) − ℓ← ,̃︀∙ ( * ) + ^ ℓ← ,̃︀∙

A.2. Proof of Theorem 4.2

(1 +1− | 00−− 11|) , we have: Proof. The proof is straightforward if we proceed with the proof of Lemma 4.1 with the below discussions. With the knowledge of noise rates for both methods, remember that = √︂ log(1/ ) 2 Δ← < Δ∙←

2← R(ℱ ) + ← · (ℓ − ℓ) · log( 2 2 2 2 ← − ∙ ℓ

− ℓ ← · · R(ℱ ) < ∙← · + 1

)

· 2 √︂ 1 − ← · log( √︃ 1 )︃ √︂ log(1/ ) · √︂ 1 ← · · R(ℱ ) < ∙← − ← · For any finite concept class ℱ ⊂ { : → {0, 1}}, and the sample set = {1, ..., }, the Rademacher complexity is upper bounded by √︁ 2 log() where is the VC dimension of ℱ .

< Δ∙← , we simply need to find the condition of (or ) that satisfies the To achieve Δ← below in-equation: Δ← < Δ∙← =⇒ 2 √︃

2 < ∙← − ← · √︃ 1 )︃

A.4. Proof for Corollary 4.4

For a general matrix = ()− 1, we firstly note

∑︁ ∈[],>0 | + | min ∈[]

∑︁ ∈[],<0 |. . Then Recall 1 = 1 ⇒ 1 = ()− 11. We know the above maximum and minimum take the same

∑︁ ∈[],>0 | + | min ∈[]

∑︁ ∈[],<0 ≤ 1 − 2

1 ≤ min∈[] ( − ∑︀

̸= ) , := m∈[ax](1 − ), < 0.5.

Now we prove the inequality () [76]. Let satisfy

and let = ()− 1 . Then ‖()− 1‖∞ = ‖()− 1 ‖∞/‖ ‖∞

‖()− 1‖∞ = ‖ ‖∞/‖ ‖∞ To bound ‖ ‖, we choose such that = ‖ ‖∞. Then = − ∑︁ , ̸= which further gives Therefore, and The term of distribution shift can be upper bounded by: =E(, )∼ [︁ max ⃒ E(, )∼ [ℓ( (), )] − ∈ℱ ⃒ = max ⃒ E(, =1)∼ [ℓ( (), 1)] + E(, =0)∼ [ℓ( (), 0)] ∈ℱ ⃒

E(,̃︀)∼ ̃︀ [︁

ℓ(^ (), ̃︀)]︁ E(,̃︀)∼ ̃︀ [︁ ℓ( (), ̃︀)]︁⃒⃒

⃒ ∑︁ ̸= ||‖ ‖∞ ≤ | | +

| || | ≤ | | + ‖ ‖∞ ‖ ‖∞ ≤ − | |

, ‖( )− 1 ‖∞ = ‖ ‖∞/‖ ‖∞ ≤

− 1 ← − 1← ≤‖ ‖∞ ≤ max( ) = ∑︁ ̸=

| |. 1 min( ) . ,

On the other hand, denoting by ‖ ‖max := max,∈[] | |, from eigenvalues, we know where min( ) denotes the minimal eigenvalue of the matrix . Therefore, 1 ← − 1 = ∘

← 0 = min{ ← 1 − 1

, 2max min( ) } the matrix . where := max∈[](1 − ), < 0.5, and min( ) denotes the minimal eigenvalue of

A.5. Proof of Lemma 3.2

Proof. Note that for ^ = ^ , we have: ∈ℱ = ℓ,(^ ) − (^ ) + min ℓ,( ) = ℓ,(^ ) − ℓ,( * ) min ∈ℱ ∈ℱ

Estimation error ⃒ = max ⃒ E(, =1)∼ [ℓ( (), 1)] + E(, =0)∼ [ℓ( (), 0)]

Combine similar terms, we then have: =(1 1 + 0 0) · (︀ ℓ − ℓ︀) .

Thus, we have: [︁ [︁

P(̃︀ = 1| = 1) · ℓ( (), 1)]︁ − E(, =1)∼ P(̃︀ = 1| = 0) · ℓ( (), 1)]︁ − E(, =0)∼ [︁ [︁

P(̃︀ = 0| = 1) · ℓ( (), 0)]︁

P(̃︀ = 0| = 0) · ℓ( (), 0) ⃒ . [︁

P(̃︀ = 0| = 1) · ℓ( (), 1)]︁ + E(, =0)∼ [︁

P(̃︀ = 1| = 0) · ℓ( (), 0)]︁ [︁

P(̃︀ = 0| = 1) · ℓ( (), 0)]︁ − E(, =0)∼ [︁

P(̃︀ = 1| = 0) · ℓ( (), 1)]︁ ⃒⃒ ]︁⃒ ⃒ ⃒ ⃒ + E(, =0)∼ ∈ℱ ⃒ ℓ,(^ ) − ℓ,̃︀ (^ ) ≤ Proof. For the term Estimation error, we have:

∈ℱ Estimation error ( ) + |min ℓ,̃︀ ∈ℱ ( ) − ℓ,( * ) ( ) − ℓ,( * )|

Error 2

The upper bound of Error 1 could be derived directly from the proof of Lemma 4.1: since the loss function makes no use of loss correction, the L-Lipschitz constant does not have to multiply with the constant and ←

→ . Besides, the constant for the variance term (square term) reduces to (ℓ − ℓ). Thus, we have:

Error 1 ≤ 4R(ℱ) + (ℓ − ℓ) · , ∀ ∈ ℱ. Proof. To achieve a smaller upper bound for the separation method, mathematically, we want: 4R(ℱ) + (ℓ − ℓ) · ≤ 4R(ℱ) + (ℓ − ℓ) · 1− ( )− 21 ≤ , where := ( 00 + 11) − ( ∙00 + ∙11), = √︀log(1/ )/2. 1 ( 00+ 11)− ( ∙00+ ∙11), which is mentioned as ·

(1− ( )− 21)

A.8. Proof of Theorem 3.6

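Proof. For $u \in \{\circ, \bullet\}$, we have:

$$\mathrm{Var}(\hat{f}^{u}) = \mathbb{E}_{(X,\tilde{Y})\sim\tilde{\mathcal{D}}^{u}} \Big[ \ell(\hat{f}^{u}(X), \tilde{Y}) - \mathbb{E}_{(X,\tilde{Y})\sim\tilde{\mathcal{D}}^{u}}[\ell(\hat{f}^{u}(X), \tilde{Y})] \Big]^2$$

$$= \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \Big[ \big[\ell(\hat{f}^{u}(X), \tilde{Y})\big]^2 + \big[\mathbb{E}_{\tilde{\mathcal{D}}^{u}}[\ell(\hat{f}^{u}(X), \tilde{Y})]\big]^2 - 2\,\ell(\hat{f}^{u}(X), \tilde{Y})\, \mathbb{E}_{\tilde{\mathcal{D}}^{u}}[\ell(\hat{f}^{u}(X), \tilde{Y})] \Big]$$

$$= \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \big[\ell(\hat{f}^{u}(X), \tilde{Y})\big]^2 - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big)^2.$$

A special case is the 0-1 loss, i.e., $\ell(\cdot) = \mathbb{1}(\cdot)$, for which we have:

$$\mathrm{Var}(\hat{f}^{u}) = \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \big[\ell(\hat{f}^{u}(X), \tilde{Y})\big]^2 - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big)^2 = \mathbb{E}_{\tilde{\mathcal{D}}^{u}} \big[\ell(\hat{f}^{u}(X), \tilde{Y})\big] - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big)^2 = R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}) - \big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big)^2 = g\big(R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})\big),$$

where $R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u}) \in [0, 1]$ and $g(a) = a - a^2$ is monotonically increasing when $a < \frac{1}{2}$. Thus, $R_{\ell,\tilde{\mathcal{D}}^{u}}(\hat{f}^{u})$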

ℓ) · reduces to 1 log(1/ ) 2 1 2 log(1/ ) , we could derive Var(^ ) √︁ 2 log(1/ )
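The 0-1 loss identity used in this proof (the variance equals the risk minus the squared risk, because a 0/1-valued loss equals its own square) can be checked numerically. The following is a minimal sketch under that reading; the helper names `empirical_risk_and_variance` and `g` are illustrative choices, not notation taken from the paper.

```python
def empirical_risk_and_variance(losses):
    """Return (risk, variance) of a list of per-sample losses.

    The variance is computed as E[loss^2] - (E[loss])^2, mirroring the proof.
    """
    n = len(losses)
    risk = sum(losses) / n
    second_moment = sum(v * v for v in losses) / n
    return risk, second_moment - risk ** 2


def g(a):
    """g(a) = a - a^2, which is increasing on [0, 1/2) and decreasing afterwards."""
    return a - a * a
```

For 0/1-valued losses, `empirical_risk_and_variance` returns a variance equal to `g(risk)`, so a smaller risk below 1/2 implies a smaller variance, which is the monotonicity property the proof relies on.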

A.9. Proof of Corollary 3.7

Proof. In the multi-class extension, the only diference is the upper bound of the Distribution Shift term in Eqn. ( 11 ), which now becomes: ℓ,(^ ) =E(, )∼ [︁

]︁ ℓ(^ ( ), ) − max ⃒ E(, )∼ = max ⃒⃒ ⎣ ∈ℱ ⃒ ⃒ ⎡ ⃒ ∈[ ] ⃒ ⎡ = m∈aℱx ⃒⃒⃒ ⎣ = m∈aℱx ⃒⃒⃒ ⎣ ⃒ ⎡ ∈[ ] ∈[ ]

E (,̃︀ ) ∼ ̃︀ [︁

ℓ(^ ( ), ̃︀ )]︁ [ℓ( ( ), )] −

E (,̃︀ ) ∼ ̃︀ [︁

ℓ( ( ), ̃︀ )]︁ ⃒⃒ [︃ [︁ E(, =)∼

ℓ( ( ), )⎦⎦ − ⎣ E(, =)∼ [ℓ( ( ), )]⎦ − ⎣

∑︁ ∈[ ] ∈[ ] E(, =)∼

P(̃︀ ̸= | = ) · ℓ( ( ), ) ⃒ ⃒ ⃒ E(, =)∼ ]︁ ⎦ − ⎣

⎡ (,̃︀ ) ∼ ̃︀, = [︁ ℓ( ( ), ̃︀ )]︁ ⃒⃒ ⎤ ⃒ ⎦ ⃒

⃒ [︁

P(̃︀ = | = ) · ℓ( ( ), )

⎤ ⃒ ]︁ ⃒ ⎦ ⃒ ⃒ ⃒ ∑︁

∑︁ ∈[ ],̸= ∈[ ]

E(, =)∼ [︁

P(̃︀ = | = P(̃︀ ̸= | = ) · ℓ( (), ) −

P(̃︀ = | = ) · ℓ( (), ) ⃒ P(̃︀ ̸= | = ) · (︀ ℓ − ℓ︀) ⃒⃒ ]︃⃒ ⃒ ⃒ ]︃⃒ ⃒ ⃒ = max ⃒⃒ ∑︁ E(, =)∼ ≤ m∈aℱx ⃒⃒⃒⃒ ∑∈[︁] E(, =)∼ (Assumed uniform prior) = ∑︁ P( = ) · (1 − , ) ︀( ℓ − ℓ︀) . Proof. The proof of Lemma 4.5 builds on Theorem 7 in [42]: The performance bound for aggregation methods is the special case of Theorem 7 in [42] (adopting * = 1 defined in [ 42]). As for that of separation methods, the incurred diference lies in the appearance of the weight of sample complexity . Thus, we have: ℓ,(^ ↬) − ℓ,( * ) ≤ where Δ↬ := 8↬R(ℱ ) + ↬0√︁ 2 log(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀ .

8R(ℱ ) +

√︃ 2 log(4/ ) ︀( 1 + 2(¯ℓ − ℓ))︀ √︃ 1 ]︃ 4√︂ log(4/ ) ︀( 1 + 2(¯ℓ − ℓ))︀ 2

A.11. Proof of Theorem 4.6

Proof. Denote by Δ↬ := 18−R0(−ℱ )1 + we require Δ↬ < Δ∙↬, which is equivalent to: 4√︂ lo2g(4/) (1+2(¯ℓ− ℓ))

1− 0− 1 8R(ℱ ) 4 √︁ log(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀ 2

8R(ℱ ) 1 − ∙0 − ∙1 which is further equivalent to:

8R(ℱ ) 1 − 0 − 1 −

8R(ℱ ) 1 − ∙0 − ∙1 4 √︁ log(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀ 2 1 − ∙0 − ∙1 [︃ , in order to achieve Δ↬ < Δ∙↬, + − 4 √︁ log(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀ 2 1 − ∙0 − ∙1 4 √︁ log(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀ 2

Note that both 1 − 0 − 1 and 1 − ∙0 − ∙1 are positive, the above requirement then reduces to: [( 0 + 1) − ( ∙0 + ∙1)]8R(ℱ ) < (1 − 0 − 1) − (1 − ∙0 − ∙1) < ((11 −− ∙00 −− ∙11)) −

[( 0 + 1) − ( ∙0 + ∙1)]8√︁ 2 log() 4(1 − ∙0 − ∙1)√︁ log2(4/ ) (︀ 1 + 2(¯ℓ − ℓ))︀ .

Denote by := 1 − ∙↬/↬, = 1+22(¯ℓ− ℓ) √︁ 4lo glo(g4(/)) . The above condition is satisfied if and only if Proof. Similar to the proof of Theorem 3.6, for ∈ {, ∙} , we have:

Var(^ ↬) = E ̃︀

[︁ℓ(^ ↬(), ̃︀ )]︁2 − (ℓ,̃︀ (^ ↬))2.

A special case is the 0-1 loss, i.e., ℓ(· ) = 1(· ), we then have:

Var(^ ↬) =E =E ̃︀ ̃︀ [︁ℓ(^ ↬(), ̃︀ )]︁2 − (ℓ,̃︀ (^ ↬))2 [︁ℓ(^ ↬(), ̃︀ )]︁ − (ℓ,̃︀ (^ ↬))2 =ℓ,̃︀ (^ ↬) − (ℓ,̃︀ (^ ↬))2 = ︁( ℓ,̃︀ (^ ↬))︁ where ℓ,̃︀ (^ ↬) ∈ [0, 1] and () = − 2 is monotonically increasing when < 12 . Note that: (^ ↬) < ⇐⇒ √︀ ≥ √︂ 2 log(4/ ) 1 + 2(¯ℓ − ℓ) , 1 − 0 − 1 we have: Var(^ ← ) ≤ ︁( √︁ lo2g(4/ ) 11−+20(¯ℓ−− ℓ1) )︁ . To achieve: Var(^ ↬) < Var(^ ∙↬), we simply need: √︃ log(4/ ) 1 + 2(¯ℓ − ℓ) 2 1 − 0 − 1 ≤

√︃ lo2g (∙4 / ) 11 +− 2 (∙0¯ℓ− − ℓ∙1) ⇐⇒ √ ≥ ∙↬↬00 . Proof. Regarding the multi-class extension of Lemma 4.5, the only diferent thing lies in the constant: ↬0. The following Lemma A.1 helps us find out the multi-class form of ↬0. Lemma A.1. Assume the clean label has equal prior ( = ) = 1 , ∀ ∈ [ ]. For the uniform noise transition matrix [44] such that , = , ∀ ∈ [ ], the expected ℓ↬ in the multi-class setting is invariant to label noise up to an afine transformation:

E(,̃︀ )∼ ̃︀ [ℓ↬( ( ), ̃︀ )] = ⎝1 − ⎛

⎞ ∑︁ ⎠ E[ℓ↬( ( ), )]. ∈[] Proof of Lemma A.1 Recall that and ̃︀ refer to the joint distribution over (, ) and (, ̃︀ ), respectively. We further denote the marginal distributions of , , and ̃︀ by , , and ̃︀̃︀ , respectively. Let ∼ , ̃︀ ∼ to the peer samples. The peer loss function is definẽ︀d̃︀ asbe the random variables corresponding ℓ↬( (), ˜) = ℓ( (), ˜) − ℓ( (,), ˜,), where (, ˜) is a normal training sample pair, , and ˜, are corresponding peer samples.

Taking expectation for (14) yields

Ẽ︀ [ℓ↬( ( ), ̃︀ )] = Ẽ︀ [ℓ( ( ), ̃︀ )] − E ̃︀̃︀ [︁

E [ℓ( (), ̃︀)]]︁ .

Accordingly, noting and ̃︀ are independent, the second term in (15) is (13) (14) (15) The first term in (15) is

Ẽ︀ [ℓ( (), ̃︀ )] = ∑︁

∑︁ · P( = ) · E| =[ℓ( (), )] ∈[] ∈[] [︃

· P( = ) · E| = [ℓ( (), )] + = ∑︁ [︃ ⎛ ⎝1 − ̸=,∈[] [︁

E [ℓ( (), ̃︀)]]︁ P(̃︀ = ) · E [ℓ( (), )] ∑︁ · P( = ) · E [ℓ( (), )] ∈[] ∈[] [︃ · P( = ) · E [ℓ( (), )] + ⎠ · P( = ) · E| = [ℓ( (), )] +

∑︁ ∈[],̸= · P( = ) · E| =[ℓ( (), )]

]︃ · P( = ) · E [ℓ( (), )] ]︃ · P( = ) · E[ℓ((), )] .

In this case, we have = , ∀ ∈ [], ̸= . The first term becomes Ẽ︀[ℓ((), ̃︀)] ∑︁ · P( = ) · E| =[ℓ((), )] ∑︁ ⎠ · E [E[ℓ((), )]] + ∑︁ · E[ℓ((), )]. ⎛ ⎞ Comparing the above two terms we have:

Ẽ︀[ℓ↬((), ̃︀)] = ⎝1 − ∑︁ ⎠ E[ℓ↬((), )]. ∈[] the corresponding proof of the binary task.

1 1 Thus, substituting ↬0 := 1− 0− 1 by 1− ∑︀∈[] , the proof of Corollary 4.8 is finished if we repeat ∈[] The second term becomes E[ℓ((), ̃︀)]]︁ ̸=,∈[] ∈[] ∈[],̸=

B. Additional Results and Details

B.1. Experiment Details on UCI Datasets

Datasets. In this paper, we conducted experiments on two binary (Breast and German) and two multi-class (StatLog and Optical) UCI classification datasets. When separate training and testing files are provided, we use the original split. The remaining datasets provide only a single data file, for which we adopt a 50/50 split so that more data is assigned to the testing set, which improves the statistical significance of the test results. More specifically, the numbers of (training, testing) samples in the Breast, German, StatLog, and Optical datasets are (285, 284), (500, 500), (4435, 2000), and (3823, 1797).

Generating the noisy labels on UCI datasets. For each UCI dataset used in this paper, the label of each training sample is flipped to another class with probability equal to the noise rate. For the multi-class classification datasets, the class to which the label is flipped is selected uniformly at random among the other classes. For binary and multi-class classification datasets, the noise rates are (0.1, 0.2, 0.3, 0.4) and (0.2, 0.4, 0.6, 0.8), respectively.

Implementation details. We implemented a simple two-layer ReLU Multi-Layer Perceptron (MLP) for the classification task on these four UCI datasets. We use the Adam optimizer with a learning rate of 0.001 and a batch size of 128.
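The noise-generation procedure described above, together with the two ways of using the resulting labels, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes each of several labelers flips the clean label independently, and the function names, the `seed` argument, and the tie-breaking rule in the majority vote are all hypothetical choices.

```python
import random
from collections import Counter


def flip_label(label, num_classes, noise_rate, rng):
    """With probability noise_rate, replace label by a class drawn uniformly from the others."""
    if rng.random() < noise_rate:
        others = [c for c in range(num_classes) if c != label]
        return rng.choice(others)
    return label


def make_noisy_labels(clean_labels, num_classes, noise_rate, num_labelers, seed=0):
    """Return one list of num_labelers noisy labels per sample (independent flips, an assumption)."""
    rng = random.Random(seed)
    return [
        [flip_label(y, num_classes, noise_rate, rng) for _ in range(num_labelers)]
        for y in clean_labels
    ]


def separate(features, noisy_labels):
    """Label separation: every (feature, single noisy label) pair is its own training sample."""
    return [(x, y) for x, ys in zip(features, noisy_labels) for y in ys]


def aggregate_majority(features, noisy_labels):
    """Aggregation by majority vote: one label per instance; ties go to the smallest class index."""
    aggregated = []
    for x, ys in zip(features, noisy_labels):
        counts = Counter(ys)
        top = max(counts.values())
        aggregated.append((x, min(c for c, v in counts.items() if v == top)))
    return aggregated
```

With N instances and K labelers, `separate` yields N * K training pairs while `aggregate_majority` yields N, which is the sample-size difference that the comparison between the two approaches turns on.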

B.2. Detailed Results on UCI Datasets

In Table 6, we highlight the results in green (for the separation method) and red (for aggregation methods) if the performance gap is larger than 0.05. Clearly, the label separation method consistently outperforms both aggregation methods (majority vote and EM inference) on the StatLog and Optical datasets. For the two binary tasks (Breast and German), aggregation methods tend to outperform label separation; we attribute this to the fact that the denoising effect of label aggregation is more significant in the binary case.

B.3. Experiment Details on CIFAR-10 Datasets

The generation of the symmetric noisy dataset is adopted from [44]. For the instance-dependent label noise, the generating algorithm follows the state-of-the-art method [77]. Both cases adopt noise rates [0.2, 0.4, 0.6, 0.8]. The basic hyper-parameter settings for all methods are as follows: mini-batch size (128), optimizer (SGD), initial learning rate (0.1), momentum (0.9), weight decay (0.0005), number of epochs (120), and learning rate decay (0.1 at 50 epochs). Standard data augmentation is applied to each dataset. All experiments run on 8 Nvidia RTX A5000 GPUs.
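For reference, the CIFAR-10 hyper-parameters listed above can be collected in a small configuration sketch. The dictionary keys and the `learning_rate` helper are hypothetical names, and the schedule assumes that "0.1 at 50 epochs" means a single multiplicative decay by 0.1 applied from epoch 50 onward with 0-indexed epochs; the text does not specify this further.

```python
CIFAR10_CONFIG = {
    "batch_size": 128,
    "optimizer": "SGD",
    "initial_lr": 0.1,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "epochs": 120,
    "lr_decay_factor": 0.1,   # assumed: multiplicative factor
    "lr_decay_epoch": 50,     # assumed: applied once, from this epoch on
}


def learning_rate(epoch, config=CIFAR10_CONFIG):
    """Step schedule: initial_lr before the decay epoch, initial_lr * factor afterwards."""
    if epoch >= config["lr_decay_epoch"]:
        return config["initial_lr"] * config["lr_decay_factor"]
    return config["initial_lr"]
```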

B.4. Detailed Results on CIFAR-10 Dataset

Table 7 includes all the detailed accuracy values that appear in Figure 5. The results on the synthetic noisy CIFAR-10 dataset align well with the theoretical observations: label separation is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient.

[Figure panels: UCI-StatLog (symmetric), CE; UCI-StatLog (symmetric), BW; UCI-Optical (symmetric), BW; UCI-StatLog (symmetric), PeerLoss; UCI-Optical (symmetric), PeerLoss; UCI-pop failures (symmetric), CE; UCI-forest fire (symmetric), CE; UCI-pop failures (symmetric), BW; CIFAR-10, Symmetric, BW. Legend values: 5, 9, 15, and 25 in some panels.]

[1] Estellés-Arolas, González-Ladrón-de Guevara, Towards an integrated crowdsourcing definition, Journal of Information science 38 (2012) 189–200.
[2] Howe, et al., The rise of crowdsourcing, Wired magazine 14 (2006) 1–4.
[3] Liu, Liu, An online learning approach to improving the quality of crowd-sourcing, ACM SIGMETRICS Performance Evaluation Review 43 (2015) 217–230.
[4] Albarqouni, Baur, Achilles, Belagiannis, Demirci, Navab, Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images, IEEE transactions on medical imaging 35 (2016) 1313–1321.
[5] A. A. A. Setio, Traverso, T. De Bel, M. S. Berens, C. Van Den Bogaard, P. Cerello, Chen, Dou, M. E. Fantacci, Geurts, et al., Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the luna16 challenge, Medical image analysis 42 (2017) 1–13.
[6] Mitra, E. Gilbert, Credbank: A large-scale social media corpus with associated credibility annotations, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 9, 2015, pp. 258–267.
[7] Pennycook, D. G. Rand, Fighting misinformation on social media using crowdsourced judgments of news source quality, Proceedings of the National Academy of Sciences 116 (2019) 2521–2526.
[8] V. C. Raykar, Yu, L. H. Zhao, G. H. Valadez, Florin, Bogoni, Moy, Learning from crowds, Journal of machine learning research 11 (2010).
[9] Whitehill, T.-f. Wu, Bergsma, Movellan, Ruvolo, Whose vote should count more: Optimal integration of labels from labelers of unknown expertise, Advances in neural information processing systems 22 (2009).
[10] Rodrigues, Pereira, Ribeiro, Gaussian process classification and active learning with multiple annotators, in: International conference on machine learning, PMLR, 2014, pp. 433–441.
[11] Rodrigues, Lourenco, Ribeiro, F. C. Pereira, Learning supervised topic models for classification and regression from crowds, IEEE transactions on pattern analysis and