=Paper=
{{Paper
|id=Vol-3357/paper1
|storemode=property
|title=To Aggregate or Not? Learning with Separate Noisy Labels
|pdfUrl=https://ceur-ws.org/Vol-3357/paper1.pdf
|volume=Vol-3357
|authors=Jiaheng Wei,Zhaowei Zhu,Tianyi Luo,Ehsan Amid,Abhishek Kumar,Yang Liu
|dblpUrl=https://dblp.org/rec/conf/csw/WeiZLAKL23
}}
==To Aggregate or Not? Learning with Separate Noisy Labels==
Jiaheng Wei1,*,†, Zhaowei Zhu1,†, Tianyi Luo2, Ehsan Amid3, Abhishek Kumar3 and Yang Liu1,*
1 University of California, Santa Cruz  2 Amazon Search Science and AI  3 Google Research, Brain Team

Abstract
Raw training data often come with separate noisy labels collected from multiple imperfect annotators (e.g., via crowdsourcing). A typical way of using these separate labels is to first aggregate them into a single label and then apply standard training methods. The literature has also studied effective aggregation approaches extensively. This paper revisits this choice and aims to answer the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. We theoretically analyze the performance of both approaches under the empirical risk minimization framework for a number of popular loss functions, including the ones designed specifically for the problem of learning with noisy labels. Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high, or when the number of labelers/annotations is insufficient. Extensive empirical results validate our conclusions.

Keywords
Crowdsourcing, Label Aggregation, Label Noise, Human Annotation

1. Introduction
Training high-quality deep neural networks for classification tasks typically requires a large quantity of annotated data. The raw training data often come with separate noisy labels collected from multiple imperfect annotators. For example, the popular data collection paradigm of crowdsourcing [1, 2, 3] offers a platform to collect such annotations from the unverified crowd; medical records are often accompanied by diagnoses from multiple doctors [4, 5]; and news articles can receive multiple checks (on whether the article is fake or not) from different experts [6, 7]. This leads to the situation considered in this paper: learning with multiple separate noisy labels.

WSDM 2023 Crowd Science Workshop on Collaboration of Humans and Learning Algorithms for Data Labeling, March 3, 2023, Singapore
* Corresponding author. † These authors contributed equally.
jiahengwei@ucsc.edu (J. Wei); yangliu@ucsc.edu (Y. Liu)
https://weijiaheng.github.io/ (J. Wei); https://users.soe.ucsc.edu/~zhaoweizhu/ (Z. Zhu); https://scholar.google.com.hk/citations?user=xnAVjF8AAAAJ&hl=en (T. Luo); https://sites.google.com/view/eamid/ (E. Amid); https://abhishek.umiacs.io/ (A. Kumar); http://www.yliuu.com/ (Y. Liu)

The most popular approach to learning from multiple separate labels is to aggregate the given labels for each instance [8, 9, 10, 11, 12], e.g., through an Expectation-Maximization (EM) inference technique. Each instance is then provided with a single label, to which the standard training procedure is applied. The primary goal of this paper is to revisit the choice of aggregating separate labels and to provide practitioners with an understanding of the following question: Should the learner aggregate separate noisy labels for one instance into a single label or not?
Our main contributions can be summarized as follows: ∙ We provide theoretical insights on how separation methods and aggregation ones result in different biases (Theorem 3.4, 4.2, 4.6) and variances (Theorem 3.6, 4.3, 4.7) of the output classifier from training. Our analysis considers both the standard loss functions in use, as well as popular robust losses that are designed for the problem of learning with noisy labels. ∙ By comparing the analytical proxy of the worst-case performance bounds, our theoretical results reveal that separating multiple noisy labels is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. The results are consistent for both the basic loss function ℓ and robust designs, including loss correction and peer loss. ∙ We carry out extensive experiments using both synthetic and real-world datasets to validate our theoretical findings. 1.1. Related Works Label separation vs label aggregation Existing works mainly compare the separation with aggregation by empirical results. For example, it has been shown that label separation could be effective in improving model performance and may be potentially more preferable than aggregated labels through majority voting [13]. When training with the cross-entropy loss, Sheng et.al [14] observe that label separation reduces the bias and roughness, and outperforms majority-voting aggregated labels. However, it is unclear whether the results hold when robust treatments are employed. Similar problems have also been studied in corrupted label detection with a result leaning towards separation but not proved [15]. Another line of approach concentrates on the end-to-end training scheme or ensemble methods which take all the separate noisy labels as the input during the training process [16, 17, 18, 19, 20], and learning from separate noisy labels directly. Learning with noisy labels Popular approaches in learning with noisy labels could be broadly divided into following categories, i.e., (i) Adjusting the loss on noisy labels by: using the knowledge of noise label transition matrix [21, 22, 23, 24, 25, 26, 27, 28, 29]; re-weighting the per-sample loss by down-weighting instances with potentially wrong labels [30, 31, 32, 33, 34]; or refurbishing the noisy labels [35, 36, 37]; (ii) Robust loss designs that do not require the knowledge of noise transition matrix [38, 39, 40, 41, 42, 43, 44, 45]; (iii) Regularization techniques to prevent deep neural networks from memorizing noisy labels [46, 47, 48, 49, 50, 51]; (iv) Dynamical sample selection procedure which behaves in a semi-supervised manner and begins with a clean sample selection procedure, then makes use of the wrongly-labeled samples [52, 53, 54, 55, 56]. For example, several methods [57, 58, 59] adopt a mentor/peer network to select small-loss samples as “clean” ones for the student/peer network. See [60, 61] for a more detailed survey of existing noise-robust techniques. 2. Formulation Consider an 𝑀 -class classification task and let 𝑋 ∈ 𝒳 and 𝑌 ∈ 𝒴 := {1, 2, ..., 𝑀 } denote the input examples and their corresponding labels, respectively. We assume that (𝑋, 𝑌 ) ∼ 𝒟, where 𝒟 is the joint data distribution. Samples (𝑥, 𝑦) are generated according to random variables (𝑋, 𝑌 ). In the clean and ideal scenario, the learner has access to 𝑁 training data points 𝐷 := {(𝑥𝑛 , 𝑦𝑛 )}𝑛∈[𝑁 ] . Instead of having access to ground truth labels 𝑦𝑛 s, we only have access to a set of noisy labels {𝑦˜𝑛,𝑖 }𝑖∈[𝐾] for 𝑛 ∈ [𝑁 ]. 
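To make this setup concrete, the following minimal sketch (our own illustration, not the authors' code; the function names and toy values are ours) builds such a training set: each of the N examples receives K separate noisy labels, drawn independently from a class-dependent noise transition matrix T, which is formalized just below. The symmetric-noise T used here matches the simulation model later adopted in the experiments of Section 5.1.

```python
import numpy as np

def symmetric_transition_matrix(num_classes: int, eps: float) -> np.ndarray:
    """T[k, l] = P(noisy label = l | clean label = k); off-diagonal mass eps/(M-1)."""
    T = np.full((num_classes, num_classes), eps / (num_classes - 1))
    np.fill_diagonal(T, 1.0 - eps)
    return T

def sample_separate_noisy_labels(y_clean: np.ndarray, T: np.ndarray, K: int,
                                 rng: np.random.Generator) -> np.ndarray:
    """Return an (N, K) array: K independent noisy labels per example,
    each drawn from the row of T indexed by the clean label."""
    N, M = len(y_clean), T.shape[0]
    noisy = np.empty((N, K), dtype=int)
    for i in range(K):
        noisy[:, i] = [rng.choice(M, p=T[y]) for y in y_clean]
    return noisy

# toy usage: 1000 examples, 10 classes, noise rate 0.4, K = 5 annotators
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)
T = symmetric_transition_matrix(10, eps=0.4)
y_tilde = sample_separate_noisy_labels(y, T, K=5, rng=rng)
print("empirical noise rate:", np.mean(y_tilde != y[:, None]))  # close to 0.4
```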
For ease of presentation, we adopt the decorator ∘ to denote separate labels and ∙ for aggregated labels, as specified later. Noisy labels $\tilde{y}_n$ are generated according to the random variable $\tilde{Y}$. We consider the class-dependent label noise transition [30, 21], where $\tilde{Y}$ is generated according to a transition matrix $T$ with entries defined as $T_{k,l} := \mathbb{P}(\tilde{Y} = l \mid Y = k)$.

Most existing results on learning with noisy labels consider the setting where each $x_n$ is paired with only one noisy label $\tilde{y}_n$. In practice, we often operate in a setting where each data point $x_n$ is associated with multiple separate labels drawn from the same noisy label generation process [62, 63]. We consider this setting and assume that for each $x_n$, there are $K$ independent noisy labels $\tilde{y}_{n,1}, ..., \tilde{y}_{n,K}$ obtained from $K$ annotators [64]. We are interested in two popular ways to leverage multiple separate noisy labels:
• Keep the separate labels as separate ones and apply standard learning-with-noisy-labels techniques to each of them.
• Aggregate the noisy labels into one label, and then apply standard learning-with-noisy-labels techniques.
We will look into each of the above two settings separately and then answer the question: "Should the learner aggregate multiple separate noisy labels or not?"

2.1. Label Separation
Denote by $\mathbb{P}_{\tilde{Y}} := [\mathbb{P}(\tilde{Y} = 1), \cdots, \mathbb{P}(\tilde{Y} = M)]^\top$ the column vector giving the marginal distribution of $\tilde{Y}$. Accordingly, we can define $\mathbb{P}_Y$ for $Y$. Clearly, we have the relations $\mathbb{P}_{\tilde{Y}} = T^\top \cdot \mathbb{P}_Y$ and $\mathbb{P}_Y = (T^\top)^{-1} \cdot \mathbb{P}_{\tilde{Y}}$. Denote by $\rho_1 := \mathbb{P}(\tilde{Y} = 0 \mid Y = 1)$ and $\rho_0 := \mathbb{P}(\tilde{Y} = 1 \mid Y = 0)$. The noise transition matrix $T$ has the following form when $M = 2$:
$$T = \begin{bmatrix} 1-\rho_0 & \rho_0 \\ \rho_1 & 1-\rho_1 \end{bmatrix}.$$
For label separation, we define the per-sample loss function as
$$\ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K}) = \frac{1}{K}\sum_{i\in[K]} \ell(f(x_n), \tilde{y}_{n,i}).$$
For simplicity, we use the shorthand $\ell(f(x_n), \tilde{y}_n) := \ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K})$ for the loss of the label separation method when there is no confusion.

2.2. Label Aggregation
The other way to leverage multiple separate noisy labels is to generate a single label via a label aggregation method applied to the $K$ noisy ones: $\tilde{y}^{\bullet}_n := \mathrm{Aggregation}(\tilde{y}_{n,1}, \tilde{y}_{n,2}, ..., \tilde{y}_{n,K})$, where the aggregated noisy labels $\tilde{y}^{\bullet}_n$ are generated according to the random variable $\tilde{Y}^{\bullet}$. Denote the confusion matrix for this single, aggregated noisy label by $T^{\bullet}$. Popular aggregation methods include majority vote and EM inference, both of which are covered by our theoretical insights since the analyses in later sections are built on a general label aggregation method. For a better understanding, we introduce the majority vote as an example.

An Example of Majority Vote. Given the majority-voted label, we can compute the transition matrix between $\tilde{Y}^{\bullet}$ and the true label $Y$ using the knowledge of $T$. The lemma below gives the closed form for $T^{\bullet}$ in terms of $T$ when adopting majority vote.

Lemma 2.1. Assume $K$ is odd and recall that in the binary classification task $T_{i,j} = \mathbb{P}(\tilde{Y} = j \mid Y = i)$. The noise transition matrix of the (majority-voting) aggregated noisy labels becomes
$$T^{\bullet}_{p,q} = \sum_{i=0}^{\frac{K+1}{2}-1} \binom{K}{i} (T_{p,q})^{K-i} (T_{p,1-q})^{i}, \quad p, q \in \{0, 1\}.$$
When $K = 3$, $T^{\bullet}_{1,0} = \mathbb{P}(\tilde{Y}^{\bullet} = 0 \mid Y = 1) = (T_{1,0})^3 + 3(T_{1,0})^2(T_{1,1})$. Note that it still holds that $T^{\bullet}_{p,q} + T^{\bullet}_{p,1-q} = 1$.
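As a quick sanity check on Lemma 2.1, the sketch below (illustrative only; the noise rates rho_0 = 0.3, rho_1 = 0.2 are assumed toy values and the function name is ours) evaluates the closed form for the majority-vote transition matrix and the K = 3 special case given above; each row of the resulting matrix still sums to one.

```python
import numpy as np
from math import comb

def majority_vote_transition(T: np.ndarray, K: int) -> np.ndarray:
    """Closed form of Lemma 2.1 (binary labels, K odd):
    T*[p, q] = sum_{i=0}^{(K+1)/2 - 1} C(K, i) * T[p, q]^(K-i) * T[p, 1-q]^i."""
    assert K % 2 == 1, "majority vote needs an odd number of labelers"
    T_mv = np.zeros_like(T)
    for p in (0, 1):
        for q in (0, 1):
            T_mv[p, q] = sum(comb(K, i) * T[p, q] ** (K - i) * T[p, 1 - q] ** i
                             for i in range((K + 1) // 2))
    return T_mv

rho_0, rho_1, K = 0.3, 0.2, 3
T = np.array([[1 - rho_0, rho_0],
              [rho_1, 1 - rho_1]])
T_mv = majority_vote_transition(T, K)
print(T_mv)                      # rows still sum to one
# K = 3 special case from the text: T*[1,0] = T[1,0]^3 + 3*T[1,0]^2*T[1,1]
print(T_mv[1, 0], T[1, 0] ** 3 + 3 * T[1, 0] ** 2 * T[1, 1])
```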
For the aggregation method, as illustrated in Figure 1, the x-axis indicates the number of labelers $K$, and the y-axis denotes the aggregated noise rate, given that the overall noise rate before aggregation is in $\{0.2, 0.4, 0.6, 0.8\}$.

Figure 1: Noise rates of the aggregated labels in synthetic noisy CIFAR-10. MV: majority vote. EM: expectation maximization. 0.2–0.8: original noise rates before aggregation.

Even with a limited number of labelers (i.e., $K < 10$), when the noise rate is small, both majority vote and EM label aggregation significantly reduce the noise rate. Although the expectation-maximization method consumes much more time to generate the aggregated labels, it frequently results in a lower aggregated noise rate than the majority vote.

3. Bias and Variance Analyses w.r.t. ℓ-Loss
In this section, we provide theoretical insights on how label separation and aggregation methods result in different biases and variances of the classifier prediction when learning with the standard loss function ℓ. Suppose the clean training samples $\{(x_n, y_n)\}_{n\in[N]}$ are given by variables $(X, Y)$ such that $(X, Y) \sim \mathcal{D}$. Recall that instead of having access to a set of clean training samples $D = \{(x_n, y_n)\}_{n\in[N]}$, the learner only observes $K$ noisy labels $\tilde{y}_{n,1}, ..., \tilde{y}_{n,K}$ for each $x_n$, denoted by $\tilde{D} := \{(x_n, \tilde{y}_{n,1}, ..., \tilde{y}_{n,K})\}_{n\in[N]}$. For separation methods, the noisy training samples are obtained through variables $(X, \tilde{Y}_1), ..., (X, \tilde{Y}_K)$ where $(X, \tilde{Y}_i) \sim \tilde{\mathcal{D}}$ for $i \in [K]$. For aggregation methods such as majority vote, we assume the data points and aggregated noisy labels $\tilde{D}^{\bullet} := \{(x_n, \tilde{y}^{\bullet}_n)\}_{n\in[N]}$ are drawn from $(X, \tilde{Y}^{\bullet}) \sim \tilde{\mathcal{D}}^{\bullet}$, where $\tilde{Y}^{\bullet}$ is produced through the majority voting of $\tilde{Y}_1, ..., \tilde{Y}_K$. When we mention "noise rate", we usually refer to the average noise $\mathbb{P}(\tilde{Y}^u \neq Y)$.

ℓ-risk under the distribution. Given the loss ℓ and recalling that $\ell(f(x_n), \tilde{y}_n)$ denotes $\ell(f(x_n), \tilde{y}_{n,1}, ..., \tilde{y}_{n,K}) = \frac{1}{K}\sum_{i\in[K]} \ell(f(x_n), \tilde{y}_{n,i})$, we define the empirical ℓ-risk for learning with separated/aggregated noisy labels as
$$\hat{R}_{\ell,\tilde{D}^u}(f) = \frac{1}{N}\sum_{i=1}^{N} \ell(f(x_i), \tilde{y}^u_i),$$
where $u \in \{\circ, \bullet\}$ unifies the treatment, which is either separation (∘) or aggregation (∙). By increasing the sample size $N$, we would expect $\hat{R}_{\ell,\tilde{D}^u}(f)$ to be close to the following ℓ-risk under the noisy distribution $\tilde{\mathcal{D}}^u$:
$$R_{\ell,\tilde{\mathcal{D}}^u}(f) = \mathbb{E}_{(X,\tilde{Y}^u)\sim\tilde{\mathcal{D}}^u}\big[\ell(f(X), \tilde{Y}^u)\big].$$
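For concreteness, here is a minimal PyTorch-style sketch of the two empirical risks defined above (our own illustration, with the cross-entropy loss standing in for ℓ and majority vote as the aggregation; it is not the paper's implementation): separation averages the per-sample loss over the K separate labels, while aggregation applies the loss to the single aggregated label.

```python
import torch
import torch.nn.functional as F

def empirical_risk_separation(logits: torch.Tensor, noisy_labels: torch.Tensor) -> torch.Tensor:
    """Separation risk: average the loss over the K separate labels of each example.
    logits: (N, M); noisy_labels: (N, K) integer labels."""
    N, K = noisy_labels.shape
    # repeat each example's logits K times so every separate label contributes one loss term
    expanded = logits.unsqueeze(1).expand(-1, K, -1).reshape(N * K, -1)
    return F.cross_entropy(expanded, noisy_labels.reshape(-1))

def empirical_risk_aggregation(logits: torch.Tensor, aggregated_labels: torch.Tensor) -> torch.Tensor:
    """Aggregation risk: standard risk on the single aggregated label per example."""
    return F.cross_entropy(logits, aggregated_labels)

# toy usage with random logits: N = 8 examples, M = 10 classes, K = 3 labels each
logits = torch.randn(8, 10)
separate = torch.randint(0, 10, (8, 3))
aggregated = separate.mode(dim=1).values   # majority vote over the K labels (ties broken arbitrarily)
print(empirical_risk_separation(logits, separate).item(),
      empirical_risk_aggregation(logits, aggregated).item())
```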
3.1. Bias of a Given Classifier w.r.t. ℓ-Loss
We denote by $f^* \in \mathcal{F}$ the optimal classifier obtained on the clean data distribution $(X, Y) \sim \mathcal{D}$ within the hypothesis space $\mathcal{F}$. We formally define the bias of a given classifier $\hat{f}$ as:

Definition 3.1 (Classifier Prediction Bias of ℓ-Loss). Denote by $R_{\ell,\mathcal{D}}(\hat{f}) := \mathbb{E}_{\mathcal{D}}[\ell(\hat{f}(X), Y)]$ and $R_{\ell,\mathcal{D}}(f^*) := \mathbb{E}_{\mathcal{D}}[\ell(f^*(X), Y)]$. The bias of classifier $\hat{f}$ writes as $\mathrm{Bias}(\hat{f}) = R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\mathcal{D}}(f^*)$.

The Bias term quantifies the prediction bias (excess risk) of a given classifier $\hat{f}$ on the clean data distribution $\mathcal{D}$ w.r.t. the optimal achievable classifier $f^*$, which can be decomposed as [65]
$$\mathrm{Bias}(\hat{f}) = \underbrace{R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f})}_{\text{Distribution shift}} + \underbrace{R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}) - R_{\ell,\mathcal{D}}(f^*)}_{\text{Estimation error}}. \quad (1)$$
Now we bound the distribution shift and the estimation error in the following two lemmas.

Lemma 3.2 (Distribution shift). Denote by $p_i := \mathbb{P}(Y = i)$, and assume ℓ is upper bounded by $\bar{\ell}$ and lower bounded by $\underline{\ell}$. The distribution shift in Eqn. (1) is upper bounded by
$$R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}) \le \Delta R^{u,1} := (\rho^u_0 p_0 + \rho^u_1 p_1)\cdot(\bar{\ell} - \underline{\ell}). \quad (2)$$

Lemma 3.3 (Estimation error). Suppose the loss function $\ell(f(x), y)$ is $L$-Lipschitz for any feasible $y$. For all $f \in \mathcal{F}$, with probability at least $1 - \delta$, the estimation error is upper bounded by
$$R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}) - R_{\ell,\mathcal{D}}(f^*) \le \Delta R^{u,2} := 4L\cdot\mathfrak{R}(\mathcal{F}) + (\bar{\ell} - \underline{\ell})\cdot\sqrt{\frac{2\log(1/\delta)}{\eta^u_K N}} + \Delta R^{u,1},$$
where $u \in \{\circ, \bullet\}$ denotes either the separation or the aggregation method, $\eta^{\circ}_K = \frac{K\log(1/\delta)}{2\left(\log\frac{K+1}{\delta}\right)^2}$ and $\eta^{\bullet}_K \equiv 1$ are the richness factors, which characterize the effect of the number of labelers, and $\mathfrak{R}(\mathcal{F})$ is the Rademacher complexity of $\mathcal{F}$.

Figure 2: The visualization of the estimated $\eta^{\circ}_K$ for varied $\delta$ (legend: $\delta \in \{10^{-1}, 10^{-2}, 10^{-4}, 10^{-8}, 10^{-12}, 10^{-20}\}$; x-axis: number of labelers, up to 100).

Noting that the number of unique instances $x_i$ is the same for both treatments, the duplicated copies of $x_i$ should introduce no fewer effective samples, i.e., the richness factor satisfies $\eta^u_K \ge 1$. Thus, we update $\eta_K := \max\{\eta_K, 1\}$, and Figure 2 visualizes the estimated $\eta^{\circ}_K$ for different numbers of labelers and values of $\delta$. It is clear that once the number of labelers is large enough (with the threshold depending on $\delta$), $\eta^{\circ}_K > \eta^{\bullet}_K$. Later we shall show how $\eta^u_K$ influences the bias and variance of the classifier prediction.

To give a more intuitive comparison of the performance of both mechanisms, we adopt the worst-case bias upper bound $\Delta R^u := \Delta R^{u,1} + \Delta R^{u,2}$ from Lemma 3.2 and Lemma 3.3 as a proxy and derive Theorem 3.4.

Theorem 3.4. Denote by $\alpha_K := (\rho^{\circ}_0 p_0 + \rho^{\circ}_1 p_1) - (\rho^{\bullet}_0 p_0 + \rho^{\bullet}_1 p_1)$ and $\gamma = \sqrt{\log(1/\delta)/2N}$. The separation bias proxy $\Delta R^{\circ}$ is smaller than the aggregation bias proxy $\Delta R^{\bullet}$ if and only if
$$\alpha_K \cdot \frac{1}{1 - (\eta^{\circ}_K)^{-\frac{1}{2}}} \le \gamma. \quad (3)$$
Note that $\alpha_K$ and $\eta^{\circ}_K$ are non-decreasing in $K$. In Section 4.3 we explore how the LHS of Eqn. (3) is influenced by $K$: in short, the LHS of Eqn. (3) is (generally) monotonically increasing in $K$ when $K$ is small, indicating that, for fixed $\delta$ and $N$, Eqn. (3) is easier to satisfy with a smaller $K$ than with a larger one.

3.2. Variance of a Given Classifier w.r.t. ℓ-Loss
We now move on to explore the variance of a given classifier when learning with the ℓ-loss. Prior to the discussion, we define the variance of a given classifier as:

Definition 3.5 (Classifier Prediction Variance of ℓ-Loss). The variance of a given classifier $\hat{f}$ when learned with separation (∘) or aggregation (∙) is defined as
$$\mathrm{Var}(\hat{f}^u) = \mathbb{E}_{(X,\tilde{Y}^u)\sim\tilde{\mathcal{D}}^u}\Big[\ell(\hat{f}^u(X), \tilde{Y}^u) - \mathbb{E}_{(X,\tilde{Y}^u)\sim\tilde{\mathcal{D}}^u}[\ell(\hat{f}^u(X), \tilde{Y}^u)]\Big]^2.$$
For $g(x) = x - x^2$, we derive the closed form of Var and the corresponding upper bound as below.

Theorem 3.6. When $\eta^u_K \ge \frac{2\log(1/\delta)}{N}$ and ℓ is the 0-1 loss, we have
$$\mathrm{Var}(\hat{f}^u) = g\big(R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}^u)\big) \le g\Bigg(\underbrace{\sqrt{\frac{2\log(1/\delta)}{\eta^u_K N}}}_{\text{Variance proxy}}\Bigg). \quad (4)$$
The variance proxy of $\mathrm{Var}(\hat{f}^{\circ})$ in Eqn. (4) is smaller than that of $\mathrm{Var}(\hat{f}^{\bullet})$.

Theorem 3.6 provides another view to decide between separation and aggregation methods, i.e., via the proxy of the classifier prediction variance. To extend the theoretical conclusions w.r.t. the ℓ-loss to the multi-class setting, we only need to modify the upper bound of the distribution shift in Eqn. (2), as specified in the following corollary.

Corollary 3.7 (Multi-Class Extension (ℓ-Loss)). In the $M$-class classification case, the upper bound of the distribution shift in Eqn. (2) becomes
$$R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\tilde{\mathcal{D}}^u}(\hat{f}) \le \Delta R^{u,1} := \sum_{j\in[M]} p_j\cdot(1 - T^u_{j,j})\cdot(\bar{\ell} - \underline{\ell}). \quad (5)$$
4.
Bias and Variance Analyses with Robust Treatments Intuitively, the learning of noisy labels problem could benefit from more robust loss functions built upon the generic ℓ loss, i.e., backward correction (surrogate loss) [21, 22], and peer loss functions [42]. We move on to explore the best way to learn with multiple copies of noisy labels, when combined with existing robust approaches. 4.1. Backward Loss Correction When combined with the ∑︀ backward loss correction approach (ℓ → ℓ← ), the empirical ℓ risks become: 𝑅ℓ← ,𝐷̃︀ 𝑢 (𝑓 ) = 𝑁 𝑁 ^ 1 ˜𝑢𝑖 ), where the corrected loss in the binary case is 𝑖=1 ℓ← (𝑓 (𝑥𝑖 ), 𝑦 defined as (1 − 𝜌1−𝑦˜𝑢 ) · ℓ(𝑓 (𝑥), 𝑦˜𝑢 ) − 𝜌𝑦˜𝑢 · ℓ(𝑓 (𝑥), 1 − 𝑦˜𝑢 ) ℓ← (𝑓 (𝑥), 𝑦˜𝑢 ) = . 1 − 𝜌𝑢0 − 𝜌𝑢1 Bias of given classifier w.r.t. ℓ← Suppose the loss function ℓ(𝑓 (𝑥), 𝑦) is 𝐿-Lipschitz for 𝑢 𝑢 any feasible 𝑦. Define 𝐿𝑢← := 𝐿𝑢←0 · 𝐿, where 𝐿𝑢←0 := (1+|𝜌 ^ 𝑢 . Denote by 𝑅ℓ,𝒟 (𝑓 ) the ℓ-risk 0 −𝜌1 |) 1−𝜌𝑢 0 −𝜌 1 𝑢 of the classifier 𝑓^ under the clean data distribution 𝒟, with 𝑓^ = 𝑓^ = arg min ← 𝑅^ ̃︀ 𝑢 (𝑓 ). 𝑓 ∈ℱ ℓ← ,𝐷 Lemma 4.1 gives the upper bound of classifier prediction bias when learning with ℓ← via separation or aggregation methods. Lemma 4.1. With probability at least 1 − 𝛿, we have: √︃ 𝑢 𝑢 2 log(1/𝛿) 𝑅ℓ,𝒟 (𝑓^ ← ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ Δ𝑅← := 4𝐿𝑢← · R(ℱ) + 𝐿𝑢←0 · (ℓ − ℓ) · 𝑢𝑁 . 𝜂𝐾 Lemma 4.1 offers the upper bound of the performance gap for the given classifier 𝑓 w.r.t the 𝑢 clean distribution 𝒟, comparing to the minimum achievable risk. We consider the bound Δ𝑅← as a proxy of the bias, and we are interested in the case where training the classifier separately ∙ yields a smaller bias proxy compared to that of the aggregation method, formally Δ𝑅← < Δ𝑅← . For any finite hypothesis class ℱ ⊂ {𝑓 : 𝑋 → {0, 1}}, and the sample set 𝑆 = {𝑥1 , ..., 𝑥𝑁 }, denote by 𝑑 the VC-dimension of ℱ, we give conditions when training separately yields a smaller bias proxy. (︁ √︁ )︁ 𝑑 log(𝑁 ) Theorem 4.2. Denote by 𝛼𝐾 := 1 − 𝐿∙← /𝐿← , 𝛾 = 1/ 1 + ℓ−ℓ 4𝐿 log(1/𝛿) , where 𝑑 is the VC-dimension of ℱ. For backward loss correction, the separation bias proxy Δ𝑅← is smaller than ∙ the aggregation bias proxy Δ𝑅← if and only if 1 𝛼𝐾 · 1 ≤ 𝛾. (6) 1 − (𝜂𝐾 )− 2 We defer our empirical analysis of the monotonicity of the LHS in Eqn. (6) to Section 4.3 as well, which shares similar monotonicity behavior to learning w.r.t. ℓ. Variance of given classifiers with Backward Loss Correction Similar to the previous subsection, we now move on to check how separation and aggregation methods result in different variance when training with loss correction. 𝑢 )− 12 < √︁ 𝑢 Theorem 4.3. When 𝐿𝑢←0 (𝜂𝐾 𝑁 2(ℓ−ℓ)2 log(1/𝛿) , Var(𝑓^ ← ) (w.r.t. the 0-1 loss) satisfies: (︃ √︃ )︃ 𝑢 𝑢 2 log(1/𝛿) Var(𝑓^ ← ) = 𝑔(𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← )) ≤ 𝑔 𝐿𝑢←0 · (ℓ − ℓ) · 𝑢𝑁 . (7) 𝜂𝐾 Variance proxy ∙ √ The variance proxy of Var(𝑓^ ← ) in Eqn. (7) is smaller than that of Var(𝑓^ ← ) if 𝐿 ∙ . 𝜂𝐾 > 𝐿← ← Moving a bit further, when the noise transition matrix is symmetric for both methods, the 𝐿← 1−𝜌∙0 −𝜌∙1 requirement 𝜂𝐾 > 𝐿∙← could be further simplified as: 𝜂𝐾 𝐿 ∙ = 1−𝜌 −𝜌 . For a fixed √︀ 𝑢 √︀ 𝑢 > 𝐿 ← ← 0 1 𝐾, a more efficient aggregation method decreases 𝜌∙𝑖 , which makes it harder to satisfy this condition. Recall 𝐿𝑢← := 𝐿𝑢←0 · 𝐿, the theoretical insights of ℓ← between binary case and the multi-class setting could be bridged by replacing 𝐿𝑢0 with the multi-class constant specified in the following corollary. Corollary 4.4 (Multi-Class Extension (ℓ← -Loss)). 
Given a diagonal-dominant transition matrix 𝑇 𝑢 , we have √ 𝑢 2 𝑀 𝐿←0 = , 𝜆min (𝑇 𝑢 ) where 𝜆min (𝑇 𝑢 ) denotes the minimal eigenvalue of the matrix 𝑇 𝑢 . Particularly, if 𝑇𝑖𝑖𝑢 < 0.5, ∀𝑖 ∈ [𝑀 ], we further have {︃ √ }︃ 1 2 𝑀 𝐿𝑢←0 = min , , where 𝑒𝑢 := max (1 − 𝑇𝑖𝑖𝑢 ). 1 − 2𝑒𝑢 𝜆min (𝑇 𝑢 ) 𝑖∈[𝑀 ] 4.2. Peer Loss Functions Peer Loss function [42] is a family of loss functions that are shown to be robust to la- bel noise, without requiring the knowledge of noise rates. Formally, ℓ↬ (𝑓 (𝑥𝑖 ), 𝑦˜𝑖 ) := ℓ(𝑓 (𝑥𝑖 ), 𝑦˜𝑖 ) − ℓ(𝑓 (𝑥1𝑖 ), 𝑦˜2𝑖 ), where the second term checks on mismatched data samples with (𝑥𝑖 , 𝑦˜𝑖 ), (𝑥1𝑖 , 𝑦˜1𝑖 ), (𝑥2𝑖 , 𝑦˜2𝑖 ), which are randomly drawn from the same data distribu- tion. When combined with the peer loss approach, i.e., ℓ → ℓ↬ , the two risks become: ^ ̃︀ 𝑢 (𝑓 ) = 1 ∑︀𝑁 ℓ↬ (𝑓 (𝑥𝑖 ), 𝑦˜𝑢 ), 𝑢 ∈ {, ∙}. 𝑅 ℓ↬ ,𝐷 𝑁 𝑖=1 𝑖 Bias of given classifier w.r.t. ℓ↬ Suppose the loss function ℓ(𝑓 (𝑥), 𝑦) is 𝐿-Lipschitz for any 𝑢 feasible 𝑦. Let 𝐿𝑢↬0 := 1/(1 − 𝜌𝑢0 − 𝜌𝑢1 ), 𝐿𝑢↬ := 𝐿𝑢↬0 · 𝐿 and 𝑓^ ↬ = arg min𝑓 ∈ℱ 𝑅 ^ ̃︀ 𝑢 (𝑓 ). ℓ↬ ,𝐷 Lemma 4.5. With probability at least 1 − 𝛿, we have: √︃ 𝑢 𝑢 2 log(4/𝛿) (︀ 𝑅ℓ,𝒟 (𝑓^ ↬ ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ Δ𝑅↬ := 8𝐿𝑢↬ · R(ℱ) + 𝐿𝑢↬0 · )︀ 𝑢 · 1 + 2(ℓ̄ − ℓ) . 𝜂𝐾 𝑁 To evaluate the performance of a given classifier yielded by the optimization w.r.t. ℓ↬ , Lemma 𝑢 4.5 provides the bias proxy Δ𝑅↬ for both treatments. Similarly, we adopt such a proxy to analyze which treatment is more preferable. √︁ Theorem 4.6. Denote by 𝛼𝐾 := 1 − 𝐿∙↬ /𝐿↬ , 𝛾 = 1+2(ℓ̄−ℓ) 2𝐿 log(4/𝛿) 4𝑑 log(𝑁 ) , where 𝑑 denotes the VC-dimension of ℱ. For peer loss, the separation bias proxy Δ𝑅↬ is smaller than the aggregation ∙ bias proxy Δ𝑅↬ if and only if 1 𝛼𝐾 · 1 ≤ 𝛾. (8) 𝐿∙↬ /𝐿↬ − (𝜂𝐾 )− 2 Noise rate 0.2 Noise rate 0.4 Loss Loss Loss Loss Loss Loss 10 20 30 40 50 10 20 30 40 50 Number of Labelers Number of Labelers Figure 3: The monotonicity of the LHS in Eqn. (3, 6, 8) w.r.t. the increase of 𝐾. Note that the condition in Eqn. (8) shares a similar pattern to that which appeared in the basic loss ℓ and ℓ← , we will empirically illustrate the monotonicity of its LHS in Section 4.3. Variance of given classifiers with Peer Loss We now move on to check how separation and aggregation methods result in different variances when training with peer loss. Similarly, we can obtain: 𝑢 √︁ 𝜂𝐾 ≥ 2 log(4/𝛿) · 1 + 2(ℓ̄ − ℓ) , Var(𝑓^ ↬ ) (w.r.t. the 0-1 loss) satisfies: √︀ 𝑢 (︀ )︀ Theorem 4.7. When 𝑁 (︃ √︃ )︃ 𝑢 𝑢 log(4/𝛿) (︀ Var(𝑓^ ↬ ) = 𝑔(𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ )) ≤ 𝑔 𝐿𝑢↬0 · (9) )︀ 𝑢 𝑁 · 1 + 2(ℓ̄ − ℓ) 2𝜂𝐾 . Variance proxy ∙ √ The variance proxy of Var(𝑓^ ↬ ) in Eqn. (9) is smaller than that of Var(𝑓^ ↬ ) if 𝜂𝐾 ≥ 𝐿↬ 𝐿∙ . ↬ Theoretical insights of ℓ↬ also have the multi-class extensions, we only need to generate 𝐿𝑢↬0 to the multi-class setting along with additional conditions specified as below: Corollary 4.8 (Multi-Class Extension (ℓ↬ -Loss)). Assume ℓ↬ is classification-calibrated in 1 the multi-class setting, and the clean label 𝑌 has equal prior 𝑃 (𝑌 = 𝑗) = 𝑀 , ∀𝑗 ∈ [𝑀 ]. 𝑢 = 𝜌𝑢 , ∀𝑗 ∈ [𝑀 ], we have: 𝐿𝑢 uniform noise transition matrix [44] such that 𝑇𝑗,𝑖 For the ∑︀ 𝑖 ↬0 = 𝑢 1/(1 − 𝑖∈[𝑀 ] 𝜌𝑖 ). 4.3. Analysis of the Theoretical Conditions Recall that the established conditions in Theorems 3.4, 4.2, 4.6 are implicitly relevant to the number of labelers 𝐾, and the RHS of Eqns. (3, 6, 8) are constants. We proceed to analyze the monotonicity of the corresponding LHS (in the form of 𝛼𝐾 · 1 −1 ) w.r.t. the increase 𝛽𝐾 −(𝜂𝐾 ) 2 of 𝐾, where 𝛽𝐾 = 1 for ℓ and ℓ← , 𝛽𝐾 = 𝐿∙↬ /𝐿↬ for ℓ↬ . 
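As a concrete illustration of the two K-dependent quantities entering these conditions, the toy computation below (our own sketch; delta = 0.01, a symmetric per-annotator noise rate rho = 0.4, and class prior p0 = 0.5 are assumed values, and the function names are ours) evaluates the richness factor of Lemma 3.3, clipped at 1, and the majority-vote de-noising gain alpha_K of Theorem 3.4 for increasing K; both are non-decreasing in K, which is what drives the monotonicity discussed here.

```python
from math import comb, log

def eta_separation(K: int, delta: float) -> float:
    """Richness factor for separation: K*log(1/delta) / (2*(log((K+1)/delta))^2),
    clipped at 1 as in the paper; the factor for aggregation is identically 1."""
    eta = K * log(1.0 / delta) / (2.0 * log((K + 1) / delta) ** 2)
    return max(eta, 1.0)

def aggregated_error_rate(rho: float, K: int) -> float:
    """Majority-vote error rate from Lemma 2.1 (binary, symmetric noise rho, K odd)."""
    return sum(comb(K, i) * rho ** (K - i) * (1 - rho) ** i for i in range((K + 1) // 2))

delta, rho, p0 = 1e-2, 0.4, 0.5   # assumed toy values
for K in (3, 5, 9, 15, 25, 49):
    rho_mv = aggregated_error_rate(rho, K)
    # de-noising gain alpha_K = (rho*p0 + rho*p1) - (rho_mv*p0 + rho_mv*p1)
    alpha_K = (rho * p0 + rho * (1 - p0)) - (rho_mv * p0 + rho_mv * (1 - p0))
    print(f"K={K:3d}  eta_K^sep={eta_separation(K, delta):.3f}  "
          f"rho_MV={rho_mv:.3f}  alpha_K={alpha_K:.3f}")
```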
Thus, we have $O(\mathrm{LHS}) = O\big(\alpha_K \cdot (\beta_K - O(\log(K)/\sqrt{K}))^{-1}\big)$. We visualize this order under different symmetric $T$ in Figure 3.

Figure 4: The performances of Cross-Entropy, Backward Loss Correction, and Peer Loss trained on synthetic noisy Statlog-6/Optical-10 aggregated labels (we report the better result between majority vote and EM inference for each $K$ and noise rate $\epsilon$) and separated labels. X-axis: the square root of the number of labelers, $\sqrt{K}$; Y-axis: the best test accuracy achieved. (Symmetric noise, $\epsilon \in \{0.2, 0.4, 0.6\}$.)

Figure 5: The performances of Cross-Entropy, Backward Loss Correction, and Peer Loss trained on synthetic noisy CIFAR-10 aggregated labels (we report the better result between majority vote and EM inference for each $K$ and noise rate $\epsilon$) and separated labels. X-axis: $\sqrt{K}$, where $K$ denotes the number of labels per training example; Y-axis: the best achieved test accuracy. (Left panels: symmetric noise; right panels: instance-dependent noise; $\epsilon \in \{0.2, 0.4, 0.6, 0.8\}$.)

It can be observed that when $K$ is small (e.g., $K \le 5$), the LHS of these conditions increases with $K$, while it may decrease with $K$ if $K$ is sufficiently large. Recall that separation is better if the LHS is less than the constant value $\gamma$. Therefore, Figure 3 shows the trend that aggregation is generally better than separation when $K$ is sufficiently large.

Tightness of the bias proxies. In Theorems 3.4, 4.2, and 4.6, we view the error bounds ΔR^u, ΔR^u_←, ΔR^u_↬ as proxies of the worst-case performance of the trained classifier. For the standard loss function ℓ, it has been proven [66, 67] that, under mild conditions on ℓ and $\mathcal{F}$, the lower bound on the performance gap between a trained classifier $\hat{f}$ and the optimal achievable one $f^*$, i.e., $R_{\ell,\mathcal{D}}(\hat{f}) - R_{\ell,\mathcal{D}}(f^*)$, is of order $O(\sqrt{1/N})$, which matches the order in Theorem 3.4. Noting that the behavior concluded from the worst-case bounds may not always hold for each individual case, we further use experiments to validate our analyses in the next section.

Table 1
The performances of CE/BW/PeerLoss trained on 2 UCI datasets (Breast and German datasets), with aggregated labels (majority vote, EM inference), and separated labels.
We highlight the results with Green (for the separation method) and Red (for aggregation methods) if the performance gap is larger than 0.05. (𝐾 is the number of labels per training image) UCI-Breast (symmetric) CE UCI-German (symmetric) CE 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 96.05 96.05 96.49 96.93 97.37 97.37 MV 69.00 71.50 71.50 73.50 73.00 73.00 EM 96.93 96.05 96.49 96.93 97.37 97.37 EM 58.75 63.50 65.75 66.50 65.50 65.50 Sep 96.49 95.18 96.49 96.93 97.81 98.25 Sep 70.00 70.75 66.00 69.75 70.75 69.25 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 96.05 96.49 95.18 95.18 96.49 96.93 MV 65.75 62.25 62.75 68.50 71.75 70.50 EM 96.05 92.98 89.47 94.30 96.05 96.93 EM 61.00 60.00 61.50 54.00 62.00 63.25 Sep 92.11 94.30 95.61 96.49 96.93 96.93 Sep 68.25 65.50 65.00 64.50 64.75 69.50 UCI-Breast (symmetric) BW UCI-German (symmetric) BW 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 95.61 96.49 96.05 96.93 96.93 96.93 MV 72.75 71.50 74.00 75.50 76.50 76.50 EM 95.61 96.49 96.05 96.93 96.93 96.93 EM 62.75 61.50 59.25 64.50 62.50 62.50 Sep 95.18 93.42 96.49 96.05 97.37 98.25 Sep 70.50 70.50 73.75 68.25 70.00 72.75 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 89.91 96.05 94.74 94.30 96.05 96.49 MV 65.25 69.50 67.50 69.50 70.50 71.75 EM 81.14 94.30 92.11 94.74 92.54 96.49 EM 57.75 60.25 55.25 53.50 54.00 62.25 Sep 91.67 93.42 94.30 89.47 92.54 97.37 Sep 60.25 63.50 63.00 64.25 69.00 64.75 UCI-Breast (symmetric) PeerLoss UCI-German (symmetric) PeerLoss 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 96.05 96.49 96.49 96.93 96.93 96.93 MV 72.75 71.75 73.00 73.00 72.50 72.50 EM 96.05 96.49 96.49 96.93 96.93 96.93 EM 62.25 64.50 63.75 64.25 62.75 62.75 Sep 94.74 94.30 96.93 96.93 96.93 97.81 Sep 70.25 68.00 70.50 70.00 67.00 73.50 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.11 95.61 95.18 92.54 96.49 96.05 MV 69.50 66.25 69.50 68.75 69.00 70.00 EM 92.11 92.11 86.40 93.86 95.61 96.93 EM 62.50 61.25 64.25 57.75 59.75 65.00 Sep 92.11 94.30 95.18 95.18 95.61 96.05 Sep 64.00 61.25 66.50 68.00 69.25 69.00 5. Experimental Results In this section, we empirically compare the performance of different treatments on the multiple noisy labels when learning with robust loss functions (CE loss, forward loss correction, and peer loss). We consider several treatments including label aggregation methods (majority vote and EM inference) and the label separation method. Assuming that multiple noisy labels have different weights, EM inference can be used to solve the problem under this assumption by treating the aggregated labels as hidden variables [68, 69, 8, 70]. In the E-step, the probabilities of the aggregated labels are estimated using the weighted aggregation approach based on the fixed weights of multiple noisy labels. In the M-step, EM inference method re-estimates the weights of multiple noisy labels based on the current aggregated labels. This iteration continues until all aggregated labels remain unchanged. As for label separation, we adopted the mini-batch separation method, i.e., each training sample 𝑥𝑛 is assigned with 𝐾 noisy labels in each batch. 5.1. 
Experiment on Synthetic Noisy Datasets Experimental results on synthetic noisy UCI datasets [71] We adopt six UCI datasets to empirically compare the performances of label separation and aggregation methods when Table 2 Empirical verification of Theorem 3.4 on Breast & German UCI datasets. Dataset 𝜌∘𝑖 𝑝0 𝑁 (1 − 𝛿, 𝑆𝐾 ) Breast 0.2 0.3726 569 (0.62, {𝐾 > 49}) Breast 0.4 0.3726 569 (0.62, {𝐾 > 49}) German 0.2 0.3 1000 (0.98, {𝐾 > 15}) German 0.4 0.3 1000 (0.98, {𝐾 > 23}) learning with CE loss, backward correction [21, 22], and Peer Loss [42]. The noisy annotations given by multiple annotators are simulated by symmetric label noise, which assumes 𝑇𝑖,𝑗 = 𝑀𝜖−1 for 𝑗 ̸= 𝑖 for each annotator, where 𝜖 quantifies the overall noise rate of the generated noisy labels. In Figure 4, we adopt two UCI datasets (StatLog: (𝑀 = 6); Optical: (𝑀 = 10)) for illustration. From the results in Figure 4, it is quite clear that: the label separation method outperforms both aggregation methods (majority-vote and EM inference) consistently, and is considered to be more beneficial on such small scale datasets. Results on additional datasets and more details are deferred to the Appendix. Experimental results on synthetic noisy CIFAR-10 dataset [72] On CIFAR-10 dataset, we consider two types of simulation for the separate noisy labels: symmetric label noise model and instance-dependent label noise [53, 24], where 𝜖 is the average noise rate and different labelers follow different instance-dependent noise transition matrices. For a fair comparison, we adopt the ResNet-34 model [73], the same training procedure and batch-size for all considered treatments on the separate noisy labels. Figure 5 shares the following insights regarding the preference of the treatments: in the low noise regime or when 𝐾 is large, aggregating separate noisy labels significantly reduces the noise rates and aggregation methods tend out to have a better performance; while in the high noise regime or when 𝐾 is small, the performances of separation methods tend out to be more promising. With the increasing of 𝐾 or 𝜖, we can observe a preference transition from label separation to label aggregation methods. 5.2. Empirical Verification of the Theoretical Bounds To verify the comparisons of bias proxies (i.e., Theorem 3.4) through an empirical perspective, we adopt two binary classification UCI datasets for demonstration: Breast and German datasets, as shown in Table 1. Clearly, on these two binary classification tasks, label aggregation methods tend to outperform label separation, and we attribute this phenomenon to the fact that the ”denoising effect of label aggregation is more significant in(︁the binary case”. )︁ For Theorem 3.4 (CE loss), the condition requires 𝛼𝐾 / 1 − (𝜂𝐾 ∘ )− 12 , where 𝛼 = (𝜌∘ 𝑝 + 0 0 𝜌1 𝑝1 ) − (𝜌0 𝑝0 + 𝜌1 𝑝1 ), 𝛾 = log(1/𝛿)/2𝑁 . For two binary UCI datasets (Breast & German), ∘ ∙ ∙ √︀ the information could be summarized in Table 2, where the column (1 − 𝛿, 𝑆𝐾 ) means: when the number of annotators belongs to the set 𝑆𝐾 , the label separation method is likely to under- perform label aggregation (i.e., majority vote) with probability at least 1 − 𝛿. For example, in the last row of Table 2, when training on UCI German dataset with CE loss under noise rate Table 3 Experimental results on CIFAR-10N and CIFAR-10H dataset with 𝐾 = 3. We highlight the results with Green (for separation method) and Red (for aggregation methods) if the performance gap is large than 0.05. 
CIFAR-10N (𝜖 ≈ 0.18) CE BW PL Majority-Vote 89.52 89.23 89.84 EM-Inference 89.19 88.88 88.92 Separation 89.77 89.20 89.97 CIFAR-10H (𝜖 ≈ 0.09) CE BW PL Majority-Vote 80.86 82.72 82.11 EM-Inference 80.81 82.43 81.73 Separation 76.75 79.07 78.08 0.4 (the noise rate of separate noisy labels), Theorem 3.4 reveals that with probability at least 0.98, label aggregation (with majority vote) is better than label separation when 𝐾 > 23, which aligns well with our empirical observations (label separation is better only when 𝐾 < 15). 5.3. Experiments on Realistic Noisy Datasets Note that in real-world scenarios, the label-noise pattern may differ due to the expertise of each human annotator. We further compare the different treatments on two realistic noisy datasets: CIFAR-10N [74], and CIFAR-10H [75]. CIFAR-10N provides each CIFAR-10 train image with 3 independent human annotations, while CIFAR-10H gives ≈ 50 annotations for each CIFAR-10 test image. In Table 3, we repeat the reproduction of three robust loss functions with three different treatments on the separate noisy labels. We report the best-achieved test accuracy for Cross- Entropy/Backward Correction/Peer Loss methods when learning with label aggregation methods (majority-vote and EM inference) and the separation method (soft-label). We observe that the separation method tends to have a better performance than aggregation ones. This may be attributed to the relatively high noise rate (𝜖 ≈ 0.18) in CIFAR-N and the insufficient amount of labelers (𝐾 = 3). Note that since the noise level in CIFAR-10H is low (𝜖 ≈ 0.07 wrong labels), label aggregation methods can infer higher quality labels, and thus, result in a better performance than separation methods (Red colored cells in Table 3 and 4). 5.4. Hypothesis Testing We adopt the paired t-test to show which treatment on the separate noisy labels is better, under certain conditions. In Table 5, we report the statistic and 𝑝-value given by the hypothesis testing results. The column “Methods” indicate the two methods we want to compare (A & B). Positive statistics means that A is better than B in the metric of test accuracy. Given a specific setting, denote by Accmethod as the list of test accuracy that belongs to this setting (i.e., CIFAR-10N, 𝐾 = 3), including CE, BW, PL loss functions, the basic hypothesis could be summarized as below: Table 4 Experimental results on CIFAR10-H with 𝐾 ≥ 5. We highlight the results with Green (for separation method) and Red (for aggregation methods) if the performance gap is large than 0.05. CE 𝐾=5 𝐾=9 𝐾 = 15 𝐾 = 25 𝐾 = 49 Majority-Vote 80.69 80.73 81.37 81.79 81.66 EM-Inference 80.97 80.96 81.24 81.01 81.68 Separation 79.65 80.91 81.07 80.78 80.81 BW 𝐾=5 𝐾=9 𝐾 = 15 𝐾 = 25 𝐾 = 49 Majority-Vote 82.51 82.75 83.27 83.59 83.68 EM-Inference 82.30 82.68 82.74 82.89 83.08 Separation 82.14 82.48 81.92 81.72 81.69 PL 𝐾=5 𝐾=9 𝐾 = 15 𝐾 = 25 𝐾 = 49 Majority-Vote 81.84 81.85 82.39 82.98 82.83 EM-Inference 81.89 82.30 82.53 82.86 82.73 Separation 80.25 81.89 81.00 80.71 80.89 • Null hypothesis: there exists zero mean difference between (1) AccMV and AccEM ; or (2) AccMV and AccSep ; or (3) AccEM and AccSep ; • Alternative hypothesis: there exists non-zero mean difference between (1) AccMV and AccEM ; or (2) AccMV and AccSep ; or (3) AccEM and AccSep . To clarify, the three cases in the above hypothesis are tested independently. For test accuracy comparisons of CIFAR-10N in Table 3, the setting of the hypothesis test is 𝐾 = 3 and the label noise rate is relatively high (18%). 
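The paired t-tests behind Table 5 can be reproduced in a few lines. The sketch below is illustrative: it pairs the three per-loss accuracies of the CIFAR-10N row of Table 3, whereas the exact statistics in Table 5 may come from a slightly different pairing of runs. It uses scipy.stats.ttest_rel, whose sign convention matches the table: a positive statistic means the first method of the pair has the higher mean accuracy.

```python
from scipy.stats import ttest_rel

# Best test accuracies per robust loss (CE, BW, PL) on CIFAR-10N with K = 3 (Table 3).
acc_mv  = [89.52, 89.23, 89.84]   # majority vote
acc_em  = [89.19, 88.88, 88.92]   # EM inference
acc_sep = [89.77, 89.20, 89.97]   # label separation (soft label)

pairs = {"MV & EM": (acc_mv, acc_em),
         "MV & Sep": (acc_mv, acc_sep),
         "EM & Sep": (acc_em, acc_sep)}
for name, (a, b) in pairs.items():
    stat, pval = ttest_rel(a, b)   # paired t-test on the per-loss accuracy differences
    print(f"{name}: statistic = {stat:.3f}, p-value = {pval:.3f}")
```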
All 𝑝-values are larger than 0.05, so we cannot reject the null hypothesis at the 0.05 significance level, i.e., none of the pairwise mean differences is statistically significant. Together with the signs of the test statistics and the marginal 𝑝-values for the comparisons involving EM (0.057 and 0.060), the performance of the three treatments on CIFAR-10N (high noise, small 𝐾) roughly satisfies EM ⪅ MV ≈ Sep: EM trails slightly, while MV and Sep are comparable.

6. Conclusions
When learning with separate noisy labels, we explore the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. In the empirical risk minimization framework, we theoretically show that label separation could be more beneficial than label aggregation when the noise rates are high or the number of labelers is insufficient. These insights hold for a number of popular loss functions, including several robust treatments. Empirical results on synthetic and real-world datasets validate our conclusion.

Table 5
Hypothesis testing results of the comparisons between label aggregation methods and the label separation method on realistic noisy datasets.
Setting  |  Methods  |  Statistic  |  𝑝-value
CIFAR-10N (𝐾 = 3, high noise)  |  MV & EM  |  2.650  |  0.057
CIFAR-10N (𝐾 = 3, high noise)  |  MV & Sep  |  -0.401  |  0.708
CIFAR-10N (𝐾 = 3, high noise)  |  EM & Sep  |  -2.596  |  0.060
CIFAR-10H (𝐾 < 15, low noise)  |  MV & EM  |  -0.003  |  0.998
CIFAR-10H (𝐾 < 15, low noise)  |  MV & Sep  |  2.336  |  0.033
CIFAR-10H (𝐾 < 15, low noise)  |  EM & Sep  |  2.390  |  0.030
CIFAR-10H (𝐾 ≥ 15, low noise)  |  MV & EM  |  0.805  |  0.433
CIFAR-10H (𝐾 ≥ 15, low noise)  |  MV & Sep  |  4.426  |  0.000
CIFAR-10H (𝐾 ≥ 15, low noise)  |  EM & Sep  |  3.727  |  0.002

References [1] E. Estellés-Arolas, F. González-Ladrón-de Guevara, Towards an integrated crowdsourcing definition, Journal of Information science 38 (2012) 189–200. [2] J. Howe, et al., The rise of crowdsourcing, Wired magazine 14 (2006) 1–4. [3] Y. Liu, M. Liu, An online learning approach to improving the quality of crowd-sourcing, ACM SIGMETRICS Performance Evaluation Review 43 (2015) 217–230. [4] S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci, N. Navab, Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images, IEEE transactions on medical imaging 35 (2016) 1313–1321. [5] A. A. A. Setio, A. Traverso, T. De Bel, M. S. Berens, C. Van Den Bogaard, P. Cerello, H. Chen, Q. Dou, M. E. Fantacci, B. Geurts, et al., Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the luna16 challenge, Medical image analysis 42 (2017) 1–13. [6] T. Mitra, E. Gilbert, Credbank: A large-scale social media corpus with associated credibility annotations, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 9, 2015, pp. 258–267. [7] G. Pennycook, D. G. Rand, Fighting misinformation on social media using crowdsourced judgments of news source quality, Proceedings of the National Academy of Sciences 116 (2019) 2521–2526. [8] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, L. Moy, Learning from crowds, Journal of machine learning research 11 (2010). [9] J. Whitehill, T.-f. Wu, J. Bergsma, J. Movellan, P. Ruvolo, Whose vote should count more: Optimal integration of labels from labelers of unknown expertise, Advances in neural information processing systems 22 (2009). [10] F. Rodrigues, F. Pereira, B. Ribeiro, Gaussian process classification and active learning with multiple annotators, in: International conference on machine learning, PMLR, 2014, pp. 433–441. [11] F. Rodrigues, M. Lourenco, B. Ribeiro, F. C.
Pereira, Learning supervised topic models for classification and regression from crowds, IEEE transactions on pattern analysis and machine intelligence 39 (2017) 2409–2422. [12] T. Luo, Y. Liu, Machine truth serum, arXiv preprint arXiv:1909.13004 (2019). [13] P. G. Ipeirotis, F. Provost, V. S. Sheng, J. Wang, Repeated labeling using multiple noisy labelers, Data Mining and Knowledge Discovery 28 (2014) 402–441. [14] V. S. Sheng, J. Zhang, B. Gu, X. Wu, Majority voting and pairing with multiple noisy labeling, IEEE Transactions on Knowledge and Data Engineering 31 (2017) 1355–1368. [15] Z. Zhu, Z. Dong, Y. Liu, Detecting corrupted labels without training a model to predict, arXiv preprint arXiv:2110.06283 (2022). [16] Z.-H. Zhou, Ensemble methods: foundations and algorithms, CRC press, 2012. [17] M. Guan, V. Gulshan, A. Dai, G. Hinton, Who said what: Modeling individual labelers improves classification, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [18] F. Rodrigues, F. Pereira, Deep learning from crowds, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [19] Z. Chen, H. Wang, H. Sun, P. Chen, T. Han, X. Liu, J. Yang, Structured probabilistic end-to-end learning from crowds., in: IJCAI, 2020, pp. 1512–1518. [20] H. Wei, R. Xie, L. Feng, B. Han, B. An, Deep learning from multiple noisy annotators as a union, IEEE Transactions on Neural Networks and Learning Systems (2022). [21] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in: Advances in neural information processing systems, 2013, pp. 1196–1204. [22] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: A loss correction approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952. [23] X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, M. Sugiyama, Are anchor points really indispensable in label-noise learning?, Advances in Neural Information Processing Systems 32 (2019). [24] Z. Zhu, Y. Song, Y. Liu, Clusterability as an alternative to anchor points when learning with noisy labels, in: International Conference on Machine Learning, PMLR, 2021, pp. 12912–12923. [25] Z. Zhu, J. Wang, Y. Liu, Beyond images: Label noise transition matrix estimation for tasks with lower-quality features, arXiv preprint arXiv:2202.01273 (2022). [26] Z. Jiang, K. Zhou, Z. Liu, L. Li, R. Chen, S.-H. Choi, X. Hu, An information fusion approach to learning with instance-dependent label noise, in: International Conference on Learning Representations, 2022. [27] Z. Zhang, Y. Li, H. Wei, K. Ma, T. Xu, Y. Zheng, Alleviating noisy-label effects in image classification via probability transition matrix, arXiv preprint arXiv:2110.08866 (2021). [28] S. Li, X. Xia, H. Zhang, Y. Zhan, S. Ge, T. Liu, Estimating noise transition matrix with label correlations for noisy multi-label learning, in: Advances in Neural Information Processing Systems, 2022. [29] X. Xia, B. Han, N. Wang, J. Deng, J. Li, Y. Mao, T. Liu, Extended tex 𝑡?>: Learning with mixed closed-set and open-set noisy labels, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022). [30] T. Liu, D. Tao, Classification with noisy labels by importance reweighting, IEEE Transac- tions on pattern analysis and machine intelligence 38 (2016) 447–461. [31] H.-S. Chang, E. Learned-Miller, A. 
McCallum, Active bias: Training more accurate neu- ral networks by emphasizing high variance samples, Advances in Neural Information Processing Systems 30 (2017). [32] N. Bar, T. Koren, R. Giryes, Multiplicative reweighting for robust neural network optimiza- tion, arXiv preprint arXiv:2102.12192 (2021). [33] N. Majidi, E. Amid, H. Talebi, M. K. Warmuth, Exponentiated gradient reweighting for robust training under label noise and beyond, arXiv preprint arXiv:2104.01493 (2021). [34] A. Kumar, E. Amid, Constrained instance and class reweighting for robust learning under label noise, arXiv preprint arXiv:2111.05428 (2021). [35] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, A. Rabinovich, Training deep neural networks on noisy labels with bootstrapping, arXiv preprint arXiv:1412.6596 (2014). [36] M. Lukasik, S. Bhojanapalli, A. Menon, S. Kumar, Does label smoothing mitigate label noise?, in: International Conference on Machine Learning, PMLR, 2020, pp. 6448–6458. [37] J. Wei, H. Liu, T. Liu, G. Niu, Y. Liu, Understanding generalized label smoothing when learning with noisy labels, arXiv preprint arXiv:2106.04149 (2021). [38] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, J. Bailey, Symmetric cross entropy for robust learning with noisy labels, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 322–330. [39] E. Amid, M. K. Warmuth, R. Anil, T. Koren, Robust bi-tempered logistic loss based on Bregman divergences, Advances in Neural Information Processing Systems 32 (2019). [40] J. Wang, H. Guo, Z. Zhu, Y. Liu, Policy learning using weak supervision, Advances in Neural Information Processing Systems 34 (2021). [41] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, J. Bailey, Normalized loss functions for deep learning with noisy labels, in: International Conference on Machine Learning, PMLR, 2020, pp. 6543–6553. [42] Y. Liu, H. Guo, Peer loss functions: Learning from noisy labels without knowing noise rates, in: International Conference on Machine Learning, PMLR, 2020, pp. 6226–6236. [43] Z. Zhu, T. Liu, Y. Liu, A second-order approach to learning with instance-dependent label noise, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10113–10123. [44] J. Wei, Y. Liu, When optimizing 𝑓 -divergence is robust with label noise, arXiv preprint arXiv:2011.03687 (2020). [45] H. Wei, H. Zhuang, R. Xie, L. Feng, G. Niu, B. An, Y. Li, Logit clipping for robust learning against label noise, arXiv preprint arXiv:2212.04055 (2022). [46] X. Xia, T. Liu, B. Han, C. Gong, N. Wang, Z. Ge, Y. Chang, Robust early-learning: Hindering the memorization of noisy labels, in: International conference on learning representations, 2020. [47] S. Liu, J. Niles-Weed, N. Razavian, C. Fernandez-Granda, Early-learning regularization prevents memorization of noisy labels, Advances in neural information processing systems 33 (2020) 20331–20342. [48] S. Liu, Z. Zhu, Q. Qu, C. You, Robust training under label noise by over-parameterization, arXiv preprint arXiv:2202.14026 (2022). [49] H. Cheng, Z. Zhu, X. Sun, Y. Liu, Demystifying how self-supervised features improve training from noisy labels, arXiv preprint arXiv:2110.09022 (2021). [50] H. Wei, L. Tao, R. Xie, B. An, Open-set label noise can improve robustness against inherent label noise, Advances in Neural Information Processing Systems 34 (2021). [51] H. Huang, H. Kang, S. Liu, O. Salvado, T. Rakotoarivelo, D. Wang, T. 
Liu, Paddles: Phase- amplitude spectrum disentangled early stopping for learning with noisy labels (????). [52] S. Liu, K. Liu, W. Zhu, Y. Shen, C. Fernandez-Granda, Adaptive early-learning correction for segmentation from noisy annotations, arXiv preprint arXiv:2110.03740 (2021). [53] H. Cheng, Z. Zhu, X. Li, Y. Gong, X. Sun, Y. Liu, Learning with instance-dependent label noise: A sample sieve approach, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=2VXyy9mIyU3. [54] T. Luo, X. Li, H. Wang, Y. Liu, Research replication prediction using weakly supervised learning, in: In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, 2020. [55] Z. Wang, J. Jiang, B. Han, L. Feng, B. An, G. Niu, G. Long, Seminll: A framework of noisy-label learning by semi-supervised learning, arXiv preprint arXiv:2012.00925 (2020). [56] C. Qin, Y. Wang, Y. Fu, Robust semi-supervised domain adaptation against noisy labels, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 4409–4413. [57] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, M. Sugiyama, Co-teaching: Robust training of deep neural networks with extremely noisy labels, in: Advances in neural information processing systems, 2018, pp. 8527–8537. [58] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, M. Sugiyama, How does disagreement help generalization against label corruption?, in: International Conference on Machine Learning, PMLR, 2019, pp. 7164–7173. [59] H. Wei, L. Feng, X. Chen, B. An, Combating noisy labels by agreement: A joint training method with co-regularization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13726–13735. [60] B. Han, Q. Yao, T. Liu, G. Niu, I. W. Tsang, J. T. Kwok, M. Sugiyama, A survey of label-noise representation learning: Past, present and future, arXiv preprint arXiv:2011.04406 (2020). [61] H. Song, M. Kim, D. Park, Y. Shin, J.-G. Lee, Learning from noisy labels with deep neural networks: A survey, IEEE Transactions on Neural Networks and Learning Systems (2022). [62] V. Feldman, Does learning require memorization? a short tale about a long tail, in: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020, pp. 954–959. [63] Y. Liu, Understanding instance-level label noise: Disparate impacts and treatments, in: International Conference on Machine Learning, PMLR, 2021, pp. 6725–6735. [64] W. Tang, M. Yin, C.-J. Ho, Leveraging peer communication to enhance crowdsourcing, in: The World Wide Web Conference, 2019, pp. 1794–1805. [65] Z. Zhu, T. Luo, Y. Liu, The rich get richer: Disparate impact of semi-supervised learning, arXiv preprint arXiv:2110.06282 (2021). [66] S. Mendelson, Lower bounds for the empirical minimization algorithm, IEEE Transactions on Information Theory 54 (2008) 3797–3803. [67] G. Lecué, S. Mendelson, Sharper lower bounds on the performance of the empirical risk minimization algorithm, Bernoulli (2010) 605–613. [68] A. P. Dawid, A. M. Skene, Maximum likelihood estimation of observer error-rates using the em algorithm, Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1979) 20–28. [69] P. Smyth, U. Fayyad, M. Burl, P. Perona, P. Baldi, Inferring ground truth from subjective labelling of venus images, Advances in neural information processing systems 7 (1994). [70] N. Quoc Viet Hung, N. T. Tam, L. N. Tran, K. 
Aberer, An evaluation of aggregation techniques in crowdsourcing, in: International Conference on Web Information Systems Engineering, Springer, 2013, pp. 1–15. [71] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml. [72] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, Technical Report, Citeseer, 2009. [73] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Pro- ceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [74] J. Wei, Z. Zhu, H. Cheng, T. Liu, G. Niu, Y. Liu, Learning with noisy labels revisited: A study using real-world human annotations, arXiv preprint arXiv:2110.12088 (2021). [75] J. C. Peterson, R. M. Battleday, T. L. Griffiths, O. Russakovsky, Human uncertainty makes classification more robust, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9617–9626. [76] J. M. Varah, A lower bound for the smallest singular value of a matrix, Linear Algebra and its applications 11 (1975) 3–5. [77] X. Xia, T. Liu, B. Han, N. Wang, M. Gong, H. Liu, G. Niu, D. Tao, M. Sugiyama, Part- dependent label noise: Towards instance-dependent label noise, Advances in Neural Information Processing Systems 33 (2020) 7597–7610. Appendices A. Full Proofs In this section, we briefly introduce all omitted proofs in the main paper. We firstly give the proof of Lemma 4.1 because it is beneficial for the proofs in Section 3. A.1. Proof of Lemma 4.1 Proof. To apply Hoeffding’s inequality on the dataset of the separation method, we divide the noisy train samples {(𝑥𝑛 , 𝑦˜𝑛,𝑘 )}𝑛∈[𝑁 ] into 𝐾 groups, for 𝑘 ∈ [𝐾], i.e., {(𝑥𝑛 , 𝑦˜𝑛,1 )}𝑛∈[𝑁 ] , · · · , {(𝑥𝑛 , 𝑦˜𝑛,𝐾 )}𝑛∈[𝑁 ] . Note within each group, e.g., group {(𝑥𝑛 , 𝑦˜𝑛,1 )}𝑛∈[𝑁 ] , all the 𝑁 training samples are i.i.d. Additionally, training samples between any two different groups are also i.i.d. given feature set {𝑥𝑛 }𝑛∈[𝑁 ] . Thus, with one group {(𝑥𝑛 , 𝑦˜𝑛,1 )}𝑛∈[𝑁 ] , w.p. 1 − 𝛿0 , we have √︂ ⃒ ⃒^ ⃒ (︁ )︁ log(1/𝛿0 ) ⃒𝑅1← |Group-1 (𝑓 ) − 𝑅1← (𝑓 )⃒ ≤ 1← − 1← · , ∀𝑓. ⃒ 2𝑁 Note that: 1 (︁1 − 𝜌𝑢 −𝜌𝑢 )︁ (𝑇 𝑢 )−1 = 1 0 , for 𝑢 ∈ {, ∙}, 1 − 𝜌𝑢0 − 𝜌𝑢1 −𝜌𝑢1 1 − 𝜌𝑢0 we have: (1 + |𝜌0 − 𝜌1 |) 1← − 1← := 𝐿←0 = . 1 − 𝜌0 − 𝜌1 Applying the above technique on the other groups and by the union bound, we know that w.p. at least 1 − 𝐾𝛿0 , ∀𝑘 ∈ [𝐾], [︃ √︂ √︂ ]︃ ^ 1 |Group-k (𝑓 ) ∈ 𝑅1 (𝑓 ) − 𝐿 · log(1/𝛿 0 ) log(1/𝛿 0 ) 𝑅 ← ← ←0 , 𝑅1← (𝑓 ) + 𝐿←0 · . 2𝑁 2𝑁 Each 𝑅 ^ 1 |Group-k (𝑓 ), 𝑘 ∈ [𝐾] can be seen as a random variable within range: ← [︃ √︂ √︂ ]︃ log(1/𝛿0 ) log(1/𝛿0 ) 𝑅1← (𝑓 ) − 𝐿←0 · , 𝑅1← (𝑓 ) + 𝐿←0 · . 2𝑁 2𝑁 The randomness is from noisy labels 𝑦˜𝑛,𝑘 . Recall that the samples between different groups are i.i.d. given {𝑥𝑛 }𝑛∈[𝑁 ] . Then the above 𝐾 random variables are i.i.d. when the feature set is fixed. By Hoeffding’s inequality, w.p. at least 1 − 𝐾𝛿0 − 𝛿1 , ∀𝑓 , we have √︂ √︂ √︂ ⃒ ⃒^ ⃒ log(1/𝛿0 ) log(1/𝛿1 ) log(1/𝛿1 ) log(1/𝛿0 ) ⃒𝑅1← (𝑓 ) − 𝑅1← (𝑓 )⃒ ≤ 2 · 𝐿←0 · · = 𝐿←0 · . 
⃒ 2𝑁 2𝐾 𝑁𝐾 For 𝛿0 = 𝛿1 = 𝐾+1 𝛿 , with the Rademacher bound on the maximal deviation between risks and empirical ones, for 𝑓 * ∈ ℱ and the separation method, with probability at least 1 − 𝛿, we have: (︂ )︂ √︂ ⃒ ⃒^ ⃒ ∘ 𝐾 +1 1 max ⃒𝑅ℓ← ,𝐷̃︀ (𝑓 ) − 𝑅ℓ← ,𝒟̃︀ (𝑓 )⃒ ≤ 2R (ℓ← ∘ ℱ) + 𝐿←0 · (ℓ − ℓ) · log · , ⃒ 𝑓 ∈ℱ 𝛿 𝑁𝐾 √︂ ⃒ ⃒^ ⃒ ∙ (︁ ∙ )︁ log(1/𝛿) max ⃒𝑅ℓ← ,𝐷̃︀ ∙ (𝑓 ) − 𝑅ℓ← ,𝒟̃︀ ∙ (𝑓 )⃒ ≤ 2R (ℓ← ∘ ℱ) + ℓ∙← − ℓ← · ⃒ 𝑓 ∈ℱ 2𝑁 √︂ log(1/𝛿) =2R∙ (ℓ← ∘ ℱ) + 𝐿∙←0 · (ℓ − ℓ) · , 2𝑁 where we define ℓ, ℓ as the upper and lower bound of loss function ℓ respectively, and: ⎡ ⎤ 𝑁 ∑︁𝐾 1 ∑︁ R(ℓ← ∘ ℱ) := E𝑥𝑖 ,𝑦˜𝑖,1 ,...,𝑦˜𝑖,𝐾 ,𝜖𝑖 ⎣ sup 𝜖𝑖 ℓ← (𝑓 (𝑥𝑖 ), 𝑦˜𝑖,𝑗 )⎦ 𝑓 ∈ℱ 𝑁 𝐾 𝑖=1 𝑗=1 𝐾 𝑁 [︃ ]︃ 1 ∑︁ 1 ∑︁ ≤ E𝑥𝑖 ,𝑦˜𝑖,𝑗 ,𝜖𝑖 sup 𝜖𝑖 ℓ← (𝑓 (𝑥𝑖 ), 𝑦˜𝑖,𝑗 ) , 𝐾 𝑓 ∈ℱ 𝑁 𝑗=1 𝑖=1 [︃ ]︃ ∙ 1 ∙ R (ℓ← ∘ ℱ) := E𝑥𝑖 ,𝑦˜∙ ,𝜖𝑖 sup 𝜖𝑖 ℓ← (𝑓 (𝑥𝑖 ), 𝑦˜ ) . 𝑓 ∈ℱ 𝑁 Note that we assume the noisy labels given by the 𝐾 labelers follow the same noise transition matrix, if ℓ is 𝐿−Lipshitz, then for separation and aggregation methods, ℓ← is 𝐿𝑢← Lipshitz (1+|𝜌𝑢 −𝜌𝑢 |)𝐿 for 𝑢 ∈ {, ∙} respectively, where 𝐿𝑢← = 1−𝜌0𝑢 −𝜌1𝑢 ≤ 1−𝜌2𝐿 𝑢 −𝜌𝑢 . By the Lipshitz composition 0 1 0 1 property of Rademacher averages, we have R (ℓ← ∘ ℱ) ≤ 𝐿𝑢← · R(ℱ). Thus, we have: 𝑢 √︂ ^ (1 + |𝜌0 − 𝜌1 |) · (ℓ − ℓ) 𝐾 +1 1 max |𝑅ℓ← ,𝐷̃︀ (𝑓 ) − 𝑅ℓ← ,𝒟̃︀ (𝑓 )| ≤ 2𝐿← R(ℱ) + · log( )· , 𝑓 ∈ℱ 1 − 𝜌0 − 𝜌1 𝛿 𝑁𝐾 (10) √︂ ^ ∙ (1 + |𝜌∙0 − 𝜌∙1 |) · (ℓ − ℓ) log(1/𝛿) max |𝑅 ∙ (𝑓 ) − 𝑅 ∙ (𝑓 )| ≤ 2𝐿← R(ℱ) + · . 𝑓 ∈ℱ ℓ ← ,𝐷 ℓ ← ,𝒟 1 − 𝜌∙0 − 𝜌∙1 2𝑁 ̃︀ ̃︀ Assume 𝑓 * ← min𝑓 ∈ℱ 𝑅ℓ,𝒟 (𝑓 ), for separation methods, we further have: 𝑅ℓ,𝒟 (𝑓^ ← ) − 𝑅ℓ,𝒟 (𝑓 * ) = 𝑅ℓ← ,𝒟̃︀ (𝑓^ ← ) − 𝑅ℓ← ,𝒟̃︀ (𝑓 * ) * * * = 𝑅ℓ← ,𝒟̃︀ (𝑓^ ← ) − 𝑅 ^ ^ ^ ̃︀ (𝑓 ) − 𝑅ℓ← ,𝒟 ̃︀ (𝑓 ← ) + 𝑅ℓ← ,𝐷 ℓ ← ,𝐷 ^ ̃︀ (𝑓 ) + 𝑅ℓ← ,𝐷 ^ ^ ̃︀ (𝑓 ← ) − 𝑅ℓ← ,𝐷 ̃︀ (𝑓 ) ^ ≤ 0 + 2 max |𝑅 ̃︀ (𝑓 ) − 𝑅ℓ← ,𝒟 ̃︀ (𝑓 )| 𝑓 ∈ℱℓ← ,𝐷 √︂ 𝐾 +1 1 ≤ 4𝐿← R(ℱ) + 2𝐿← · (ℓ − ℓ) · log( )· . 𝛿 𝑁𝐾 Similarly, for aggregation methods, we have: ∙ ∙ 𝑅ℓ,𝒟 (𝑓^ ← ) − 𝑅ℓ,𝒟 (𝑓 * ) = 𝑅ℓ← ,𝒟̃︀ ∙ (𝑓^ ← ) − 𝑅ℓ← ,𝒟̃︀ ∙ (𝑓 * ) ∙ = 𝑅ℓ← ,𝒟̃︀ ∙ (𝑓^ ← ) − 𝑅 ^ ^∙ ^ * ̃︀ ∙ (𝑓 ) − 𝑅ℓ← ,𝒟 ̃︀ ∙ (𝑓 ← ) + 𝑅ℓ← ,𝐷 * ^ ^∙ ^ ̃︀ ∙ (𝑓 ← ) − 𝑅ℓ← ,𝐷 ̃︀ ∙ (𝑓 ) + 𝑅ℓ← ,𝐷 * ̃︀ ∙ (𝑓 ) ℓ ← ,𝐷 ^ ≤ 0 + 2 max |𝑅 ̃︀ ∙ (𝑓 ) − 𝑅ℓ← ,𝒟 ̃︀ ∙ (𝑓 )| 𝑓 ∈ℱ ℓ← ,𝐷 √︂ ∙ ∙ log(1/𝛿) ≤ 4𝐿← R(ℱ) + 2𝐿← · (ℓ − ℓ) · . 2𝑁 𝐾·log( 1𝛿 ) Note that 𝜂𝐾 = 2 and 𝜂𝐾 ∙ ≡ 1, we then have: 2(log( 𝐾+1 𝛿 )) √︃ 𝑢 𝐿𝑢 · (ℓ − ℓ) 2 log(1/𝛿) 𝑅ℓ,𝒟 (𝑓^ ← ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ 4𝐿𝑢← R(ℱ) + ← · 𝑢𝑁 . 𝐿 𝜂𝐾 𝑢 Defined as: Δ𝑅← A.2. Proof of Theorem 4.2 Proof. The proof is straightforward if we proceed with the proof of Lemma 4.1 with the below discussions. With the knowledge of noise rates for both methods, remember that 𝐿𝑢← = (1+|𝜌𝑢 𝑢 0 −𝜌1 |)𝐿 1−𝜌𝑢 −𝜌𝑢 , we have: 0 1 ∙ Δ𝑅← < Δ𝑅← √︂ √︂ 𝐿← 𝐾 +1 1 ∙ 𝐿∙← log(1/𝛿) =⇒ 2𝐿← R(ℱ) + · (ℓ − ℓ) · log( )· < 2𝐿← R(ℱ) + · (ℓ − ℓ) · 𝐿 𝛿 𝑁𝐾 𝐿 2𝑁 √︂ √︂ 𝐿 − 𝐿← ∙ log(1/𝛿) 𝐾 +1 1 =⇒ 2· ← · 𝐿 · R(ℱ) < 𝐿∙← · − 𝐿← · log( )· ℓ−ℓ 2𝑁 𝛿 𝑁𝐾 (︃ √︃ )︃ √︂ 𝐿 − 𝐿∙← 1 log(1/𝛿) =⇒ 2· ← · 𝐿 · R(ℱ) < 𝐿∙← − 𝐿← · ℓ−ℓ 𝜂𝐾 2𝑁 √︃ √︃ 2𝑁 𝐿← − 𝐿∙← 1 =⇒ 2 · 𝐿 · R(ℱ) < 𝐿∙← − 𝐿← · . log(1/𝛿) ℓ − ℓ 𝜂𝐾 For any finite concept class ℱ ⊂ {𝑓 : 𝑋 → {0,√︁ 1}}, and the sample set 𝑆 = {𝑥1 , ..., 𝑥𝑁 }, the 2𝑑 log(𝑁 ) Rademacher complexity is upper bounded by 𝑁 where 𝑑 is the VC dimension of ℱ. 
∙ To achieve Δ𝑅← < Δ𝑅← , we simply need to find the condition of 𝐾 (or 𝜂𝐾 ) that satisfies the below in-equation: √︃ √︂ (︃ √︃ )︃ ∙ 2𝑁 𝐿← − 𝐿∙← 2𝑑 log(𝑁 ) 1 Δ𝑅← < Δ𝑅← =⇒ 2 ·𝐿· < 𝐿∙← − 𝐿← · log(1/𝛿) ℓ − ℓ 𝑁 𝜂𝐾 √︃ (︃ √︃ )︃ 𝐿 − 𝐿∙← 𝑑 log(𝑁 ) 1 =⇒ 4 ← ·𝐿· < 𝐿∙← − 𝐿← · ℓ−ℓ log(1/𝛿) 𝜂𝐾 √︃ (︃ √︃ )︃ 4 𝑑 log(𝑁 ) 1 =⇒ (𝐿← − 𝐿∙← ) · ·𝐿· ∙ < 𝐿← − 𝐿← · ℓ−ℓ log(1/𝛿) 𝜂𝐾 denoted as 𝛼← ,which is a function of 𝑁,𝛿,𝑑,𝐿 (︃ √︃ )︃ 1 =⇒ (𝐿← − 𝐿∙← ) · 𝛼← < 𝐿∙← − 𝐿← · 𝜂𝐾 (︃ √︃ )︃ 1 =⇒ (𝐿← − 𝐿∙← ) · 𝛼← < (𝐿∙← − 𝐿← ) + 1− · 𝐿← 𝜂𝐾 (︃)︃ √︃ 1 𝐿← =⇒ 𝛼← + 1 < 1 − · 𝜂𝐾 𝐿← − 𝐿∙← 1 (︁ 1 )︁ 𝐿← =⇒ < 1 − (𝜂𝐾 )− 2 · 𝛾 𝐿← − 𝐿∙← 1 =⇒ 𝛼𝐾 · 1 ≤ 𝛾, 1 − (𝜂𝐾 )− 2 √︁ 𝑑 log(𝑁 ) where we denote by 𝛼𝐾 := 1 − 𝐿∙← /𝐿← , 𝛾 = 1/(1 + ℓ−ℓ 4𝐿 log(1/𝛿) ). A.3. Proof of Theorem 4.3 Proof. 𝑢 [︁ 𝑢 𝑢 ]︁2 Var(𝑓^ ← ) = E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 [ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )] [︃ ]︃ [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 𝑢 𝑢 ^ 𝑢 ^ 𝑢 ^ 𝑢 =E ̃︀ 𝑢 ℓ(𝑓 ← (𝑋), 𝑌̃︀ ) + E ̃︀ 𝑢 [ℓ(𝑓 ← (𝑋), 𝑌̃︀ )] − 2ℓ(𝑓 ← (𝑋), 𝑌̃︀ )E ̃︀ 𝑢 [ℓ(𝑓 ← (𝑋), 𝑌̃︀ )]^ 𝑢 𝒟 𝒟 𝒟 [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 [︁ 𝑢 𝑢 ]︁ =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) + E𝒟̃︀ 𝑢 [ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )] − E𝒟̃︀ 𝑢 2ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )E𝒟̃︀ 𝑢 [ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )] [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − E𝒟̃︀ 𝑢 [ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 )] [︁ 𝑢 ]︁2 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ))2 . A special case is the 0-1 loss, i.e., ℓ(·) = 1(·), we then have: 𝑢 [︁ 𝑢 ]︁2 𝑢 Var(𝑓^ ← ) =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ))2 [︁ 𝑢 ]︁ 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ ← (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ))2 𝑢 𝑢 =𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ))2 , 𝑢 where 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ) ∈ [0, 1] and 𝑔(𝑎) = 𝑎 − 𝑎2 is monotonically increasing when 𝑎 < 12 . Note 𝑢 √︁ that: 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ← ) < 𝐿𝑢←0 · (ℓ − ℓ) · log(1/𝛿) 2𝜂 𝑢 𝑁 , when 𝐾 √︃ √︃ log(1/𝛿) 1 − 12 𝑁 𝐿𝑢←0 · (ℓ − ℓ) · 𝑢 ≤ ⇐⇒ 𝐿𝑢←0 (𝜂𝐾 𝑢 ) < , 2𝜂𝐾 𝑁 2 2(ℓ − ℓ)2 log(1/𝛿) 𝑢 (︁ 𝑢 √︁ )︁ we have: Var(𝑓^ ← ) ≤ 𝑔 𝐿← ·(ℓ−ℓ) 𝐿 · 2 log(1/𝛿) 𝑢 . (︁ √︁ )︁𝜂𝐾 𝑁 (︁ ∙ √︁ )︁ To achieve: 𝑔 𝐿← ·(ℓ−ℓ) 𝐿 · 2 log(1/𝛿) 𝜂 𝑁 ≤ 𝑔 𝐿← ·(ℓ−ℓ) 𝐿 · 2 log(1/𝛿) ∙ 𝜂 𝑁 , we simply need: 𝐾 𝐾 √︃ √︃ 2 log(1/𝛿) 2 log(1/𝛿) 𝐿 ≤ 𝐿∙←0 · (ℓ − ℓ) · > ← √︀ 𝑢 𝐿←0 · (ℓ − ℓ) · ∙ ⇐⇒ 𝜂𝐾 . 𝜂𝐾 𝑁 𝜂𝐾 𝑁 𝐿∙← A.4. Proof for Corollary 4.4 For a general matrix 𝑈 = (𝑇 𝑢 )−1 , we firstly note 1𝑢← − 1𝑢← = max 𝑈𝑖𝑗𝑢 − min 𝑈𝑖𝑗𝑢 𝑖,𝑗∈[𝑀 ] 𝑖,𝑗∈[𝑀 ] ≤| max 𝑈𝑖𝑗𝑢 | + | min 𝑈𝑖𝑗𝑢 | 𝑖,𝑗∈[𝑀 ] 𝑖,𝑗∈[𝑀 ] ∑︁ ∑︁ ≤| max 𝑈𝑖𝑗𝑢 | + | min 𝑈𝑖𝑗𝑢 |. 𝑖∈[𝑀 ] 𝑖∈[𝑀 ] 𝑗∈[𝑀 ],𝑈𝑖𝑗 >0 𝑗∈[𝑀 ],𝑈𝑖𝑗 <0 Recall 𝑇 𝑢 1 = 1 ⇒ 1 = (𝑇 𝑢 )−1 1. We know the above maximum and minimum take the same 𝑖. Then ∑︁ ∑︁ 1𝑢← − 1𝑢← ≤| max 𝑈𝑖𝑗𝑢 | + | min 𝑈𝑖𝑗𝑢 | 𝑖∈[𝑀 ] 𝑖∈[𝑀 ] 𝑗∈[𝑀 ],𝑈𝑖𝑗 >0 𝑗∈[𝑀 ],𝑈𝑖𝑗 <0 𝑢 =‖𝑈 ‖∞ (𝑎) 1 ≤ min𝑖∈[𝑀 ] (𝑇𝑖𝑖𝑢 − 𝑢 ∑︀ 𝑗̸=𝑖 𝑇𝑖𝑗 ) 1 ≤ , 𝑒𝑢 := max (1 − 𝑇𝑖𝑖𝑢 ), 𝑒𝑢 < 0.5. 1 − 2𝑒𝑢 𝑖∈[𝑀 ] Now we prove the inequality (𝑎) [76]. Let 𝜈 satisfy ‖(𝑇 𝑢 )−1 ‖∞ = ‖(𝑇 𝑢 )−1 𝜈‖∞ /‖𝜈‖∞ and let 𝜇 = (𝑇 𝑢 )−1 𝜈. Then ‖(𝑇 𝑢 )−1 ‖∞ = ‖𝜇‖∞ /‖𝜈‖∞ To bound ‖𝜇‖, we choose 𝑖 such that 𝜇𝑖 = ‖𝜇‖∞ . Then ∑︁ 𝑇𝑖𝑖𝑢 𝜇𝑖 = 𝜈𝑖 − 𝑇𝑖𝑗𝑢 𝜇𝑗 , 𝑗̸=𝑖 which further gives ∑︁ ∑︁ |𝑇𝑖𝑖𝑢 |‖𝜇‖∞ ≤ |𝜈𝑖 | + |𝑇𝑖𝑗𝑢 ||𝜇𝑗 | ≤ |𝜈𝑖 | + ‖𝜇‖∞ |𝑇𝑖𝑗𝑢 |. 𝑗̸=𝑖 𝑗̸=𝑖 Therefore, |𝜈𝑖 | ‖𝜇‖∞ ≤ 𝑢, 𝑇𝑖𝑖𝑢 − ∑︀ 𝑗̸=𝑖 𝑇𝑖𝑗 and 1 ‖(𝑇 𝑢 )−1 ‖∞ = ‖𝜇‖∞ /‖𝜈‖∞ ≤ 𝑢. 𝑇𝑖𝑖𝑢 − ∑︀ 𝑗̸=𝑖 𝑇𝑖𝑗 On the other hand, denoting by ‖𝑈 ‖max := max𝑖,𝑗∈[𝑀 ] |𝑈𝑖𝑗 |, from eigenvalues, we know √ 𝑢 𝑢 √ 𝑀 𝑢 1← − 1← ≤‖𝑈 ‖∞ ≤ 𝑀 𝜆max (𝑈 ) = . 𝜆min (𝑇 𝑢 ) where 𝜆min (𝑇 𝑢 ) denotes the minimal eigenvalue of the matrix 𝑇 𝑢 . Therefore, √ 𝑢 𝑢 ∘ 1 𝑀 1← − 1← = 𝐿←0 = min{ 𝑢 , }, 1 − 2𝑒𝑖max 𝜆min (𝑇 𝑢 ) where 𝑒𝑢 := max𝑖∈[𝑀 ] (1 − 𝑇𝑖𝑖𝑢 ), 𝑒𝑢 < 0.5, and 𝜆min (𝑇 𝑢 ) denotes the minimal eigenvalue of the matrix 𝑇 𝑢 . A.5. Proof of Lemma 3.2 𝑢 Proof. 
Note that for 𝑓^ = 𝑓^ , we have: 𝑢 𝑢 𝑅ℓ,𝒟 (𝑓^ ) − min 𝑅ℓ,𝒟 (𝑓 ) = 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟 (𝑓 * ) 𝑓 ∈ℱ 𝑢 𝑢 𝑢 = 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) + 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) + min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) − 𝑅ℓ,𝒟 (𝑓 * ). (11) 𝑓 ∈ℱ 𝑓 ∈ℱ Distribution shift Estimation error The term of distribution shift can be upper bounded by: 𝑢 𝑢 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) [︁ 𝑢 ]︁ [︁ 𝑢 ]︁ =E(𝑋,𝑌 )∼𝒟 ℓ(𝑓^ (𝑋), 𝑌 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀𝑖𝑢 ) 𝑖 ⃒ [︁ ]︁⃒ ≤ max ⃒E(𝑋,𝑌 )∼𝒟 [ℓ(𝑓 (𝑋), 𝑌 )] − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓 (𝑋), 𝑌̃︀𝑖𝑢 ) ⃒ ⃒ ⃒ 𝑓 ∈ℱ 𝑖 ⃒ = max ⃒E(𝑋,𝑌 =1)∼𝒟 [ℓ(𝑓 (𝑋), 1)] + E(𝑋,𝑌 =0)∼𝒟 [ℓ(𝑓 (𝑋), 0)] ⃒ 𝑓 ∈ℱ [︁ ]︁ [︁ ]︁ ⃒ − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ,𝑌 =1 ℓ(𝑓 (𝑋), 𝑌̃︀𝑖𝑢 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ,𝑌 =0 ℓ(𝑓 (𝑋), 𝑌̃︀𝑖𝑢 ) ⃒ ⃒ 𝑖 𝑖 ⃒ = max ⃒E(𝑋,𝑌 =1)∼𝒟 [ℓ(𝑓 (𝑋), 1)] + E(𝑋,𝑌 =0)∼𝒟 [ℓ(𝑓 (𝑋), 0)] ⃒ 𝑓 ∈ℱ − E(𝑋,𝑌̃︀ 𝑢 =1)∼𝒟̃︀ 𝑢 ,𝑌 =1 [ℓ(𝑓 (𝑋), 1)] − E(𝑋,𝑌̃︀ 𝑢 =0)∼𝒟̃︀ 𝑢 ,𝑌 =1 [ℓ(𝑓 (𝑋), 0)] 𝑖 𝑖 ⃒ − E(𝑋,𝑌̃︀ 𝑢 =1)∼𝒟̃︀ 𝑢 ,𝑌 =0 [ℓ(𝑓 (𝑋), 1)] − E(𝑋,𝑌̃︀ 𝑢 =0)∼𝒟̃︀ 𝑢 ,𝑌 =0 [ℓ(𝑓 (𝑋), 0)] ⃒ ⃒ 𝑖 𝑖 ⃒ = max ⃒E(𝑋,𝑌 =1)∼𝒟 [ℓ(𝑓 (𝑋), 1)] + E(𝑋,𝑌 =0)∼𝒟 [ℓ(𝑓 (𝑋), 0)] ⃒ 𝑓 ∈ℱ [︁ ]︁ [︁ ]︁ − E(𝑋,𝑌 =1)∼𝒟 P(𝑌̃︀𝑖𝑢 = 1|𝑌 = 1) · ℓ(𝑓 (𝑋), 1) − E(𝑋,𝑌 =1)∼𝒟 P(𝑌̃︀𝑖𝑢 = 0|𝑌 = 1) · ℓ(𝑓 (𝑋), 0) [︁ ]︁ [︁ ]︁⃒ − E(𝑋,𝑌 =0)∼𝒟 P(𝑌̃︀𝑖𝑢 = 1|𝑌 = 0) · ℓ(𝑓 (𝑋), 1) − E(𝑋,𝑌 =0)∼𝒟 P(𝑌̃︀𝑖𝑢 = 0|𝑌 = 0) · ℓ(𝑓 (𝑋), 0) ⃒. ⃒ Combine similar terms, we then have: ⃒ [︁ ]︁ [︁ ]︁ = max ⃒E(𝑋,𝑌 =1)∼𝒟 P(𝑌̃︀𝑖𝑢 = 0|𝑌 = 1) · ℓ(𝑓 (𝑋), 1) + E(𝑋,𝑌 =0)∼𝒟 P(𝑌̃︀𝑖𝑢 = 1|𝑌 = 0) · ℓ(𝑓 (𝑋), 0) ⃒ 𝑓 ∈ℱ [︁ ]︁ [︁ ]︁ ⃒ − E(𝑋,𝑌 =1)∼𝒟 P(𝑌̃︀𝑖𝑢 = 0|𝑌 = 1) · ℓ(𝑓 (𝑋), 0) − E(𝑋,𝑌 =0)∼𝒟 P(𝑌̃︀𝑖𝑢 = 1|𝑌 = 0) · ℓ(𝑓 (𝑋), 1) ⃒ ⃒ ⃒ ⃒ 𝑢 𝑢 = max ⃒E(𝑋,𝑌 =1)∼𝒟 [𝜌1 · (ℓ(𝑓 (𝑋), 1) − ℓ(𝑓 (𝑋), 0))] + E(𝑋,𝑌 =0)∼𝒟 [𝜌0 · (ℓ(𝑓 (𝑋), 0) − ℓ(𝑓 (𝑋), 1))] ⃒ ⃒ ⃒ 𝑓 ∈ℱ ⃒ )︀]︀ ⃒⃒ ≤ max ⃒E(𝑋,𝑌 =1)∼𝒟 𝜌𝑢1 · ℓ − ℓ + E(𝑋,𝑌 =0)∼𝒟 𝜌𝑢0 · ℓ − ℓ ⃒ ⃒ [︀ (︀ )︀]︀ [︀ (︀ 𝑓 ∈ℱ =(𝑝1 𝜌𝑢1 + 𝑝0 𝜌𝑢0 ) · ℓ − ℓ . (︀ )︀ Thus, we have: 𝑢,1 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) ≤ Δ𝑅 := (𝜌𝑢0 𝑝0 + 𝜌𝑢1 𝑝1 ) · ℓ − ℓ . (︀ )︀ A.6. Proof of Lemma 3.3 Proof. For the term Estimation error, we have: 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − 𝑅ℓ,𝒟 (𝑓 * ) 𝑢 = 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) + min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) − 𝑅ℓ,𝒟 (𝑓 * ) 𝑓 ∈ℱ 𝑓 ∈ℱ Estimation error 𝑢 ≤ 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) + |min 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓 ) − 𝑅ℓ,𝒟 (𝑓 * )| 𝑓 ∈ℱ 𝑓 ∈ℱ Error 1 Error 2 The upper bound of Error 1 could be derived directly from the proof of Lemma 4.1: since the loss function makes no use of loss correction, the L-Lipschitz constant does not have to multiply with the constant and 𝐿𝑢← → 𝐿. Besides, the constant for the variance term (square term) reduces to (ℓ − ℓ). Thus, we have: √︃ 2 log(1/𝛿) Error 1 ≤ 4𝐿R(ℱ) + (ℓ − ℓ) · 𝑢𝑁 , ∀𝑓 ∈ ℱ. 𝜂𝐾 For the term Error 2, the upper bound could be derived with the same procedure as adopted in the proof of Lemma 3.2. Thus, we obtain: √︃ 2 log(1/𝛿) 𝑢,1 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ 4𝐿R(ℱ) + (ℓ − ℓ) · 𝑢 + Δ𝑅 . 𝜂𝐾 𝑁 𝑢,2 Defined as: Δ𝑅 A.7. Proof of Theorem 3.4 Proof. To achieve a smaller upper bound for the separation method, mathematically, we want: √︃ 2 log(1/𝛿) (︀ )︀ 4𝐿R(ℱ) + (ℓ − ℓ) · + 2(𝜌0 𝑝0 + 𝜌1 𝑝1 ) · ℓ − ℓ 𝜂𝐾 𝑁 √︃ 2 log(1/𝛿) ∙ ∙ (︀ )︀ ≤4𝐿R(ℱ) + (ℓ − ℓ) · ∙ 𝑁 + 2(𝜌0 𝑝0 + 𝜌1 𝑝 1 ) · ℓ − ℓ , 𝜂𝐾 which is equivalent to proving: √︂ log(1/𝛿) 1 ((𝜂𝐾 )− 2 − 1) · (ℓ − ℓ) ≤ [(𝜌∙0 𝑝0 + 𝜌∙1 𝑝1 ) − (𝜌0 𝑝0 + 𝜌1 𝑝1 )] · ℓ − ℓ (︀ )︀ √︂ 2𝑁 log(1/𝛿) 1 ⇐⇒ (1 − (𝜂𝐾 )− 2 ) ≥ [(𝜌0 𝑝0 + 𝜌1 𝑝1 ) − (𝜌∙0 𝑝0 + 𝜌∙1 𝑝1 )] . (12) 2𝑁 De-noising effect of aggregation ≥0 √︁ log(1/𝛿) (𝜌0 𝑝0 +𝜌1 𝑝1 )−(𝜌∙0 𝑝0 +𝜌∙1 𝑝1 ) Eqn. (12) then requires: 2𝑁 ≥ 1 , which is mentioned as 𝛼𝐾 · (1−(𝜂𝐾 )− 2 ) 1 ≤ 𝛾, where 𝛼𝐾 := (𝜌0 𝑝0 + 𝜌1 𝑝1 ) − (𝜌∙0 𝑝0 + 𝜌∙1 𝑝1 ), 𝛾 = log(1/𝛿)/2𝑁 . √︀ 1 1−(𝜂𝐾 )− 2 A.8. Proof of Theorem 3.6 Proof. 
For 𝑢 ∈ {, ∙}, we have: 𝑢 [︁ 𝑢 𝑢 ]︁2 Var(𝑓^ ) =E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] [︃ ]︃ [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 𝑢 𝑢 =E ̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) + E ̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] − 2ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )E ̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] 𝒟 𝒟 𝒟 [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 [︁ 𝑢 𝑢 ]︁ =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) + E𝒟̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] − E𝒟̃︀ 𝑢 2ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )E𝒟̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] [︁ 𝑢 ]︁2 [︁ 𝑢 ]︁2 =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − E𝒟̃︀ 𝑢 [ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 )] [︁ 𝑢 ]︁2 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ))2 . A special case is the 0-1 loss, i.e., ℓ(·) = 1(·), we then have: 𝑢 [︁ 𝑢 ]︁2 𝑢 Var(𝑓^ ) =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ))2 [︁ 𝑢 ]︁ 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ))2 𝑢 𝑢 𝑢 (︁ )︁ =𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ))2 = 𝑔 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) . 𝑢 where 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) ∈ [0, 1] and 𝑔(𝑎) = 𝑎 − 𝑎2 is monotonically increasing when 𝑎 < 12 . Thus, when √︃ 𝑢 log(1/𝛿) 1 2 log(1/𝛿) 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) ≤ (ℓ − ℓ) · 𝑢 ≤ ⇐⇒ 𝑢 𝜂𝐾 ≥ , 2𝜂𝐾 𝑁 2 𝑁 reduces to 1 𝑢 √︁ 2 log(1/𝛿) we could derive Var(𝑓^ ) ≤ 𝑔( 𝑢𝑁 𝜂𝐾 ). A.9. Proof of Corollary 3.7 Proof. In the multi-class extension, the only difference is the upper bound of the Distribution Shift term in Eqn. (11), which now becomes: 𝑢 𝑢 𝑅ℓ,𝒟 (𝑓^ ) − 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ) [︁ 𝑢 ]︁ [︁ 𝑢 ]︁ =E(𝑋,𝑌 )∼𝒟 ℓ(𝑓^ (𝑋), 𝑌 ) − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓^ (𝑋), 𝑌̃︀ 𝑢 ) ⃒ ⃒ ⃒ [︁ ]︁ ⃒ ≤ max ⃒E(𝑋,𝑌 )∼𝒟 [ℓ(𝑓 (𝑋), 𝑌 )] − E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 ) ⃒ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ ⃒⎡ [︃ ⎤⎤ ⎡ ⎤⃒ ⃒ [︁ ]︁ ⃒ ⃒ ∑︁ ∑︁ = max ⃒⃒⎣ E(𝑋,𝑌 =𝑗)∼𝒟 ℓ(𝑓 (𝑋), 𝑗)⎦⎦ − ⎣ E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 ,𝑌 =𝑗 ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 ) ⎦ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ 𝑗∈[𝑀 ] 𝑗∈[𝑀 ] ⃒ ⎡ ⎤ ⎡ ⎤⃒ ⃒ ∑︁ ∑︁ ∑︁ [︁ ]︁ ⃒ = max ⃒ ⎣ E(𝑋,𝑌 =𝑗)∼𝒟 [ℓ(𝑓 (𝑋), 𝑗)]⎦ − ⎣ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ 𝑢 = 𝑘|𝑌 = 𝑗) · ℓ(𝑓 (𝑋), 𝑘) ⎦ ⃒ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ 𝑗∈[𝑀 ] 𝑘∈[𝑀 ] 𝑗∈[𝑀 ] ⃒ ⎡ ⎤ ⎡ ⃒ ∑︁ [︁ ]︁ ∑︁ ∑︁ [︁ = max ⃒ ⎣ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ 𝑢 ̸= 𝑗|𝑌 = 𝑗) · ℓ(𝑓 (𝑋), 𝑗) ⎦ − ⎣ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ 𝑢 = 𝑘|𝑌 = 𝑗 ⃒ 𝑓 ∈ℱ ⃒ 𝑗∈[𝑀 ] 𝑘∈[𝑀 ],𝑘̸=𝑗 𝑗∈[𝑀 ] ⃒ [︃ ]︃⃒ ⃒ ∑︁ ∑︁ ⃒ 𝑢 𝑢 = max ⃒ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ ̸= 𝑗|𝑌 = 𝑗) · ℓ(𝑓 (𝑋), 𝑗) − P(𝑌̃︀ = 𝑘|𝑌 = 𝑗) · ℓ(𝑓 (𝑋), 𝑘) ⃒ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ 𝑗∈[𝑀 ] 𝑘∈[𝑀 ],𝑘̸=𝑗 ⃒ [︃ ]︃⃒ ⃒ ∑︁ )︀ ⃒⃒ 𝑢 (︀ ≤ max ⃒ E(𝑋,𝑌 =𝑗)∼𝒟 P(𝑌̃︀ ̸= 𝑗|𝑌 = 𝑗) · ℓ − ℓ ⃒ ⃒ 𝑓 ∈ℱ ⃒ ⃒ 𝑗∈[𝑀 ] (Assumed uniform prior) ∑︁ 𝑢 (︀ )︀ = P(𝑌 = 𝑗) · (1 − 𝑇𝑗,𝑗 ) ℓ−ℓ . 𝑗∈[𝑀 ] A.10. Proof of Lemma 4.5 Proof. The proof of Lemma 4.5 builds on Theorem 7 in [42]: The performance bound for aggregation methods is the special case of Theorem 7 in [42] (adopting 𝛼* = 1 defined in [42]). As for that of separation methods, the incurred difference lies in the appearance of the weight of sample complexity 𝜂𝐾 . Thus, we have: (︃ √︃ )︃ 𝑢 1 2 log(4/𝛿) 𝑅ℓ,𝒟 (𝑓^ ↬ ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ (︀ )︀ 8𝐿R(ℱ) + 1 + 2(ℓ̄ − ℓ) 1 − 𝜌𝑢0 − 𝜌𝑢1 𝑢𝑁 𝜂𝐾 𝑢 𝑢 ⇐⇒ 𝑅ℓ,𝒟 (𝑓^ ↬ ) − 𝑅ℓ,𝒟 (𝑓 * ) ≤ Δ𝑅↬ , √︁ 𝑢 where Δ𝑅↬ := 8𝐿𝑢↬ R(ℱ) + 𝐿𝑢↬0 2 log(4/𝛿) . (︀ )︀ 𝑢 𝜂 𝑁 1 + 2(ℓ̄ − ℓ) 𝐾 A.11. Proof of Theorem 4.6 √︂ log(4/𝛿) 2𝜂 𝑢 𝑁 ( 4 1+2(ℓ̄−ℓ)) 𝑢 8𝐿R(ℱ ) ∙ Proof. Denote by Δ𝑅↬ := 1−𝜌𝑢 −𝜌𝑢 + 𝐾 1−𝜌𝑢 𝑢 , in order to achieve Δ𝑅↬ < Δ𝑅↬ , 0 1 0 −𝜌1 ∙ we require Δ𝑅↬ < Δ𝑅↬ , which is equivalent to: √︁ √︁ log(4/𝛿) (︀ )︀ log(4/𝛿) (︀ − )︀ 8𝐿R(ℱ) 4 2𝜂𝐾 𝑁 1 + 2(ℓ̄ ℓ) 8𝐿R(ℱ) 4 2𝑁 1 + 2(ℓ̄ − ℓ) + < + , 1 − 𝜌0 − 𝜌1 1 − 𝜌0 − 𝜌1 1 − 𝜌∙0 − 𝜌∙1 1 − 𝜌∙0 − 𝜌∙1 which is further equivalent to: √︁ √︁ log(4/𝛿) (︀ log(4/𝛿) (︀ )︀ − )︀ 8𝐿R(ℱ) 8𝐿R(ℱ) 4 2𝑁 1 + 2(ℓ̄ − ℓ) 4 2𝜂𝐾 𝑁 1 + 2(ℓ̄ ℓ) − < − . 
1 − 𝜌0 − 𝜌1 1 − 𝜌∙0 − 𝜌∙1 ∙ 1 − 𝜌0 − 𝜌1 ∙ 1 − 𝜌0 − 𝜌1 Note that both 1 − 𝜌0 − 𝜌1 and 1 − 𝜌∙0 − 𝜌∙1 are positive, the above requirement then reduces to: [︃ √︃ ]︃ √︂ 1 log(4/𝛿) (︀ [(𝜌0 + 𝜌1 ) − (𝜌∙0 + 𝜌∙1 )]8𝐿R(ℱ) < (1 − 𝜌0 − 𝜌1 ) − (1 − 𝜌∙0 − 𝜌∙1 ) )︀ 4 1 + 2(ℓ̄ − ℓ) 𝜂𝐾 2𝑁 √︃ [(𝜌0 + 𝜌1 ) − (𝜌∙0 + 𝜌∙1 )]8𝐿R(ℱ) 1 ⇐⇒ √︁ < (1 − 𝜌0 − 𝜌1 ) − (1 − 𝜌∙0 − 𝜌∙1 ) . log(4/𝛿) (︀ )︀ 𝜂𝐾 4 2𝑁 1 + 2(ℓ̄ − ℓ) Note that for any finite concept class ℱ ⊂ {𝑓 : 𝑋 → {0, 1}}, √︁ and the sample set 𝑆 = 2𝑑 log(𝑁 ) {𝑥1 , ..., 𝑥𝑁 }, the Rademacher complexity is upper bounded by 𝑁 where 𝑑 is the VC dimension of ℱ, a more strict condition to get becomes: √︁ √︃ ∙ ∙ 2𝑑 log(𝑁 ) 1 (1 − 𝜌0 − 𝜌1 ) [(𝜌 0 + 𝜌 1 ) − (𝜌 0 + 𝜌 1 )]8𝐿 𝑁 < − )︀ . (1 − 𝜌∙0 − 𝜌∙1 ) √︁ 𝜂𝐾 log(4/𝛿) 4(1 − 𝜌∙0 − 𝜌∙1 ) (︀ 2𝑁 1 + 2(ℓ̄ − ℓ) √︁ Denote by 𝛼𝐾 := 1 − 𝐿∙↬ /𝐿↬ , 𝛾 = 1+2(ℓ̄−ℓ) 2𝐿 log(4/𝛿) 4𝑑 log(𝑁 ) . The above condition is satisfied if and only if 1 𝛼𝐾 · 1 ≤ 𝛾. 𝐿∙↬ /𝐿↬ − (𝜂𝐾 )− 2 A.12. Proof of Theorem 4.7 Proof. Similar to the proof of Theorem 3.6, for 𝑢 ∈ {, ∙}, we have: 𝑢 [︁ 𝑢 ]︁2 𝑢 Var(𝑓^ ↬ ) = E𝒟̃︀ 𝑢 ℓ(𝑓^ ↬ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ))2 . A special case is the 0-1 loss, i.e., ℓ(·) = 1(·), we then have: 𝑢 [︁ 𝑢 ]︁2 𝑢 Var(𝑓^ ↬ ) =E𝒟̃︀ 𝑢 ℓ(𝑓^ ↬ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ))2 [︁ 𝑢 ]︁ 𝑢 =E𝒟̃︀ 𝑢 ℓ(𝑓^ ↬ (𝑋), 𝑌̃︀ 𝑢 ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ))2 𝑢 𝑢 𝑢 (︁ )︁ =𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ) − (𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ))2 = 𝑔 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ) 𝑢 where 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ) ∈ [0, 1] and 𝑔(𝑎) = 𝑎 − 𝑎2 is monotonically increasing when 𝑎 < 12 . Note that: √︃ 𝑢 1 log(4/𝛿) (︀ 𝑅ℓ,𝒟̃︀ 𝑢 (𝑓^ ↬ ) < )︀ 𝑢 𝑢 𝑢 1 + 2(ℓ̄ − ℓ) , 1 − 𝜌0 − 𝜌1 2𝜂𝐾 𝑁 when √︃ √︂ 1 log(4/𝛿) (︀ )︀ 1 √︀ 𝑢 2 log(4/𝛿) 1 + 2(ℓ̄ − ℓ) 1 + 2(ℓ̄ − ℓ) ≤ ⇐⇒ 𝜂𝐾 ≥ , 1 − 𝜌𝑢0 − 𝜌𝑢1 𝑢 2𝜂𝐾 𝑁 2 𝑁 1 − 𝜌𝑢0 − 𝜌𝑢1 𝑢 ^∙ (︁√︁ )︁ log(4/𝛿) 1+2(ℓ̄−ℓ) we have: Var(𝑓^ ← ) ≤ 𝑔 𝑢 𝑢 𝑢 ^ 2𝜂𝐾 𝑁 1−𝜌0 −𝜌1 . To achieve: Var(𝑓 ↬ ) < Var(𝑓 ↬ ), we simply need: √︃ √︃ log(4/𝛿) 1 + 2(ℓ̄ − ℓ) log(4/𝛿) 1 + 2(ℓ̄ − ℓ) √ 𝐿 ≤ ∙ ∙ ∙ ⇐⇒ 𝜂𝐾 ≥ ↬0 . 2𝜂𝐾 𝑁 1 − 𝜌0 − 𝜌1 2𝜂𝐾 𝑁 1 − 𝜌0 − 𝜌1 𝐿∙↬0 A.13. Proof of Corollary 4.8 Proof. Regarding the multi-class extension of Lemma 4.5, the only different thing lies in the constant: 𝐿𝑢↬0 . The following Lemma A.1 helps us find out the multi-class form of 𝐿𝑢↬0 . 1 Lemma A.1. Assume the clean label 𝑌 has equal prior 𝑃 (𝑌 = 𝑗) = 𝑀 , ∀𝑗 ∈ [𝑀 ]. For the 𝑢 𝑢 uniform noise transition matrix [44] such that 𝑇𝑖,𝑗 = 𝜌𝑖 , ∀𝑗 ∈ [𝑀 ], the expected ℓ↬ in the multi-class setting is invariant to label noise up to an affine transformation: ⎛ ⎞ ∑︁ E(𝑋,𝑌̃︀ 𝑢 )∼𝒟̃︀ 𝑢 [ℓ↬ (𝑓 (𝑋), 𝑌̃︀ 𝑢 )] = ⎝1 − 𝜌𝑢𝑗 ⎠ E𝒟 [ℓ↬ (𝑓 (𝑋), 𝑌 )]. (13) 𝑗∈[𝑀 ] Proof of Lemma A.1 Recall that 𝒟 and 𝒟 ̃︀ 𝑢 refer to the joint distribution over (𝑋, 𝑌 ) and (𝑋, 𝑌̃︀ 𝑢 ), respectively. We further denote the marginal distributions of 𝑋, 𝑌 , and 𝑌̃︀ 𝑢 by 𝒟𝑋 , 𝒟𝑌 , and 𝒟 ̃︀ ̃︀ 𝑢 , respectively. Let 𝑋𝑝 ∼ 𝒟𝑋 , 𝑌̃︀𝑝𝑢 ∼ 𝒟 𝑌 ̃︀ ̃︀ 𝑢 be the random variables corresponding 𝑌 to the peer samples. The peer loss function is defined as ℓ↬ (𝑓 (𝑥𝑛 ), 𝑦˜𝑢𝑛 ) = ℓ(𝑓 (𝑥𝑛 ), 𝑦˜𝑢𝑛 ) − ℓ(𝑓 (𝑥𝑝,𝑛 ), 𝑦˜𝑢𝑝,𝑛 ), (14) where (𝑥𝑛 , 𝑦˜𝑢𝑛 ) is a normal training sample pair, 𝑥𝑝,𝑛 and 𝑦˜𝑢𝑝,𝑛 are corresponding peer samples. Taking expectation for (14) yields [︁ ]︁ E𝒟̃︀ 𝑢 [ℓ↬ (𝑓 (𝑋), 𝑌̃︀ 𝑢 )] = E𝒟̃︀ 𝑢 [ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 )] − E𝒟̃︀ 𝑢 E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑌̃︀𝑝𝑢 )] . (15) 𝑌 ̃︀ The first term in (15) is E𝒟̃︀ 𝑢 [ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 )] ∑︁ ∑︁ = 𝑇𝑖𝑗𝑢 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ] [︃ ]︃ ∑︁ ∑︁ 𝑢 = 𝑇𝑗𝑗 · P(𝑌 = 𝑗) · E𝒟|𝑌 =𝑗 [ℓ(𝑓 (𝑋), 𝑗)] + 𝑇𝑖𝑗𝑢 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ 𝑢⎠ = ⎝1 − 𝑇𝑗𝑖 · P(𝑌 = 𝑗) · E𝒟|𝑌 =𝑗 [ℓ(𝑓 (𝑋), 𝑗)] + 𝑇𝑖𝑗𝑢 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] . 
𝑗∈[𝑀 ] 𝑖̸=𝑗,𝑖∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 Accordingly, noting 𝑋𝑝 and 𝑌̃︀𝑝𝑢 are independent, the second term in (15) is [︁ ]︁ E𝒟̃︀ ̃︀ 𝑢 E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑌̃︀𝑝𝑢 )] 𝑌 ∑︁ = P(𝑌̃︀𝑝𝑢 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑗)] 𝑗∈[𝑀 ] ∑︁ ∑︁ = 𝑇𝑖𝑗𝑢 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ] [︃ ]︃ ∑︁ ∑︁ 𝑢 = 𝑇𝑗𝑗 · P(𝑌𝑝 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] + 𝑇𝑖𝑗𝑢 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 ⎡(︃ ⎞ ]︃ ∑︁ ∑︁ ∑︁ 𝑢⎠ = ⎣ 1− 𝑇𝑗𝑖 · P(𝑌𝑝 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] + 𝑇𝑖𝑗𝑢 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] . 𝑗∈[𝑀 ] 𝑖̸=𝑗,𝑖∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 In this case, we have 𝜌𝑢𝑖 = 𝑇𝑗𝑖 𝑢 , ∀𝑗 ∈ [𝑀 ], 𝑗 ̸= 𝑖. The first term becomes E𝒟̃︀ 𝑢 [ℓ(𝑓 (𝑋), 𝑌̃︀ 𝑢 )] [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · P(𝑌 = 𝑗) · E𝒟|𝑌 =𝑗 [ℓ(𝑓 (𝑋), 𝑗)] + 𝜌𝑢𝑗 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖̸=𝑗,𝑖∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · P(𝑌 = 𝑗) · E𝒟|𝑌 =𝑗 [ℓ(𝑓 (𝑋), 𝑗)] + 𝜌𝑢𝑗 · P(𝑌 = 𝑖) · E𝒟|𝑌 =𝑖 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ] 𝑖∈[𝑀 ] ⎡⎛ ⎞ ⎤ ∑︁ ∑︁ = ⎣⎝1 − 𝜌𝑢𝑖 ⎠ · E𝒟 [ℓ(𝑓 (𝑋), 𝑌 )] + 𝜌𝑢𝑗 · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)]⎦ . 𝑖∈[𝑀 ] 𝑗∈[𝑀 ] The second term becomes [︁ ]︁ E𝒟̃︀ ̃︀ 𝑢 E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑌̃︀𝑝𝑢 )] 𝑌 [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · P(𝑌𝑝 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] + 𝜌𝑢𝑗 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖̸=𝑗,𝑖∈[𝑀 ] 𝑖∈[𝑀 ],𝑖̸=𝑗 [︃ ⎛ ⎞ ]︃ ∑︁ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · P(𝑌𝑝 = 𝑗) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] + 𝜌𝑢𝑗 · P(𝑌𝑝 = 𝑖) · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)] 𝑗∈[𝑀 ] 𝑖∈[𝑀 ] 𝑖∈[𝑀 ] ⎛ ⎞ ∑︁ ∑︁ = ⎝1 − 𝜌𝑢𝑖 ⎠ · E𝒟𝑌 [E𝒟𝑋 [ℓ(𝑓 (𝑋𝑝 ), 𝑌𝑝 )]] + 𝜌𝑢𝑗 · E𝒟𝑋 [ℓ(𝑓 (𝑋), 𝑗)]. 𝑖∈[𝑀 ] 𝑗∈[𝑀 ] Comparing the above two terms we have: ⎛ ⎞ ∑︁ E𝒟̃︀ 𝑢 [ℓ↬ (𝑓 (𝑋), 𝑌̃︀ 𝑢 )] = ⎝1 − 𝜌𝑢𝑖 ⎠ E𝒟 [ℓ↬ (𝑓 (𝑋), 𝑌 )]. (16) 𝑖∈[𝑀 ] Thus, substituting 𝐿𝑢↬0 := 1−𝜌𝑢1−𝜌𝑢 by 1−∑︀ 1 𝜌𝑢 , the proof of Corollary 4.8 is finished if we repeat 0 1 𝑖∈[𝑀 ] 𝑖 the corresponding proof of the binary task. B. Additional Results and Details B.1. Experiment Details on UCI Datasets Datasets In this paper, we conducted experiments on two binary (Breast and German) and two multi-class (StatLog and Optical) UCI classification datasets. As for the splitting of training and testing, the original settings are used when training and testing files are provided. The remaining datasets only give one data file. We adopt 50/50 splitting for the testing results’ statistical significance as more data is distributed to testing dataset. More specifically, the numbers of (training, testing) samples in Breast, German, StatLog, and Optical datasets are (285, 284), (500, 500), (4435, 2000), and (3823, 1797). Generating the noisy labels on UCI datasets For each UCI dataset adopted in this paper, the label of each sample in the training dataset will be flipped to the other classes with the probability 𝜖 (noise rate). For the multi-class classification datasets, the specific label which will be flipped is randomly selected with equal probabilities. For binary and multi-class classification datasets, (0.1, 0.2, 0.3, 0.4) and (0.2, 0.4, 0.6, 0.8) are used as different lists of noise rates respectively. Implementation details We implemented a simple two-layer ReLU Multi-Layer Perceptron (MLP) for the classification task on these four UCI datasets. The Adam optimizer is used with a learning rate of 0.001 and the batch size is 128. B.2. Detailed Results on UCI Datasets In Table 6, we highlight the results with Green (for separation method) and Red (for aggregation methods) if the performance gap is large than 0.05. Clearly, the label separation method outperforms both aggregation methods (majority-vote and EM inference) consistently on StatLog and Optical datasets. 
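Before turning to the two binary tasks, the data-generation and aggregation pipeline behind these comparisons (Appendix B.1) can be sketched as follows. This is a minimal NumPy illustration written by us: the function names, the random seed, and the tie-breaking rule for majority vote are our own choices, and the EM (Dawid-Skene style) aggregation baseline is omitted for brevity.

```python
import numpy as np

def make_separate_noisy_labels(y, eps, num_classes, K, seed=0):
    """Draw K independent noisy copies of the clean labels y: each entry is kept with
    probability 1 - eps and otherwise flipped to a uniformly chosen other class
    (the symmetric noise described in Appendix B.1)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    copies = np.tile(y, (K, 1))                        # shape (K, N): one row per labeler
    flip = rng.random(copies.shape) < eps              # which entries get corrupted
    offsets = rng.integers(1, num_classes, size=copies.shape)
    copies[flip] = (copies[flip] + offsets[flip]) % num_classes
    return copies

def majority_vote(copies, num_classes):
    """Aggregate the K separate labels per instance by majority vote
    (ties broken by the smallest class index in this sketch)."""
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=num_classes), 0, copies)
    return counts.argmax(axis=0)

# Example: K = 5 annotators, 40% symmetric noise, 3 classes (values are illustrative).
y_clean = np.array([0, 1, 2, 1, 0, 2, 2, 1])
sep_labels = make_separate_noisy_labels(y_clean, eps=0.4, num_classes=3, K=5)
agg_labels = majority_vote(sep_labels, num_classes=3)
print("separate labels (K x N):", sep_labels, sep="\n")
print("aggregated labels:", agg_labels)
```

The separation method trains on all K rows of `sep_labels`, while the aggregation baselines train on the single vector returned by `majority_vote` (or its EM counterpart).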
For the two binary tasks (Breast and German), aggregation methods tend to outperform label separation, and we attribute this phenomenon to the fact that the ”denoising effect of label aggregation is more significant in the binary case”. B.3. Experiment Details on CIFAR-10 Datasets The generation of the symmetric noisy dataset is adopted from [44]. As for the instance- dependent label noise, the generating algorithm follows the state-of-the-art method [77]. Both cases adopt noise rates: [0.2, 0.4, 0.6, 0.8]. The basic hyper-parameters settings for all methods are listed as follows: mini-batch size (128), optimizer (SGD), initial learning rate (0.1), momentum (0.9), weight decay (0.0005), number of epochs (120) and learning rate decay (0.1 at 50 epochs). Standard data augmentation is applied to each dataset. All experiments run on 8 Nvidia RTX A5000 GPUs. B.4. Details Results on CIFAR-10 Dataset Table 7 includes all the detailed accuracy values that appeared in Figure 5. The results on the synthetic noisy CIFAR-10 dataset align well with the theoretical observations: label sep- aration is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. UCI-StatLog (symmetric) CE UCI-Optical (symmetric) CE 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 87.94 88.88 88.69 88.69 88.69 88.69 MV 95.27 96.73 95.41 96.73 96.73 96.66 EM 87.63 87.94 88.63 88.69 88.69 88.69 EM 95.34 95.69 96.04 96.73 96.73 96.73 Sep 89.25 90.56 91.19 91.13 91.50 91.25 Sep 97.08 97.57 97.98 98.33 98.26 98.61 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 87.00 87.50 88.25 88.63 88.75 88.69 MV 92.35 96.04 96.11 96.18 96.73 96.73 EM 87.50 87.38 87.75 88.94 88.81 88.69 EM 93.25 94.78 95.68 96.03 96.73 96.73 Sep 87.75 89.31 90.38 91.19 91.19 91.00 Sep 96.59 96.80 97.44 97.64 98.26 98.05 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 80.75 85.00 86.75 87.25 87.75 88.62 MV 83.31 91.51 95.34 95.41 93.11 97.14 EM 85.06 84.37 86.68 87.00 87.81 88.93 EM 88.31 92.35 93.81 95.13 96.31 95.96 Sep 87.18 87.31 88.12 89.81 90.81 90.93 Sep 94.43 95.54 96.52 97.35 97.63 97.98 UCI-StatLog (symmetric) BW UCI-Optical (symmetric) BW 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 89.75 90.06 89.87 89.87 89.87 89.87 MV 94.43 95.41 95.61 96.94 96.94 96.94 EM 89.68 89.87 89.75 89.87 89.87 89.87 EM 94.43 95.82 95.68 96.94 96.94 96.94 Sep 89.87 91.18 92.00 91.43 91.37 91.06 Sep 96.87 97.21 97.63 97.77 98.05 97.63 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 87.62 88.25 89.75 89.62 89.87 89.87 MV 90.19 94.29 94.64 96.80 96.94 96.94 EM 87.93 88.31 89.56 89.81 89.75 89.87 EM 93.53 95.41 96.38 95.89 96.94 96.94 Sep 89.62 89.87 90.18 91.00 91.06 91.37 Sep 95.54 96.59 96.66 97.28 97.77 97.77 𝜖 = 0.6 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 82.12 87.31 87.00 88.50 89.50 89.81 MV 82.54 88.10 92.97 95.75 93.67 96.52 EM 86.68 86.50 87.31 88.25 89.12 89.81 EM 80.52 90.54 93.11 95.06 96.10 95.82 Sep 86.18 87.62 88.31 89.62 90.93 90.75 Sep 91.72 93.39 93.46 96.38 96.66 97.21 UCI-StatLog (symmetric) PeerLoss UCI-Optical (symmetric) PeerLoss 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 90.25 90.06 90.56 90.56 90.56 90.56 MV 94.71 96.03 96.38 96.52 96.52 96.38 EM 89.87 90.06 
90.56 90.56 90.56 90.56 EM 94.43 96.17 96.52 96.52 96.52 96.52 Sep 90.36 91.12 91.68 91.43 91.68 91.43 Sep 97.07 97.35 97.70 98.12 98.05 98.12 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 87.93 89.06 90.18 90.43 90.68 90.56 MV 91.86 94.57 95.54 96.87 96.52 96.52 EM 88.00 88.93 90.12 90.37 90.56 90.56 EM 91.86 93.94 95.96 96.87 96.52 96.52 Sep 89.18 90.25 90.37 91.43 91.68 91.87 Sep 96.66 96.73 96.94 97.63 98.05 98.19 𝜖 = 0.6 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 78.68 86.68 87.31 88.81 89.75 90.56 MV 76.77 87.34 94.71 95.61 93.88 96.38 EM 86.25 86.68 88.00 88.75 89.12 90.31 EM 86.92 89.98 93.11 96.10 95.82 96.10 Sep 87.50 88.25 88.93 90.18 91.31 91.31 Sep 93.32 95.54 96.10 96.73 97.77 97.91 UCI-pop failuers (symmetric) CE UCI-forest fire (symmetric) CE 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.06 93.06 93.06 93.06 93.06 MV 92.86 92.86 91.84 91.84 91.84 91.84 EM 93.06 93.06 93.06 93.06 93.06 93.06 EM 92.86 92.86 91.84 91.84 91.84 91.84 Sep 93.06 93.52 93.98 94.91 95.83 96.30 Sep 93.88 93.88 95.92 94.90 91.84 93.88 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.06 93.06 93.06 93.06 93.06 MV 81.63 81.63 90.82 91.84 91.84 91.84 EM 93.06 93.06 93.06 93.52 93.06 93.52 EM 58.16 77.55 90.82 88.78 91.84 91.84 Sep 93.06 93.52 93.98 94.91 95.37 95.37 Sep 79.59 79.59 87.76 87.76 90.82 92.86 UCI-pop failuers (symmetric) BW UCI-forest fire (symmetric) BW 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.06 93.06 93.06 93.06 93.06 MV 91.84 92.86 91.84 91.84 91.84 92.86 EM 93.06 93.06 93.06 93.06 93.06 93.06 EM 91.84 92.86 91.84 91.84 91.84 92.86 Sep 93.06 93.52 93.98 95.37 95.83 95.83 Sep 88.78 90.82 92.86 93.88 91.84 94.90 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.13 90.59 90.59 90.59 92.91 93.06 MV 84.69 86.73 89.80 91.84 91.84 92.86 EM 93.06 89.35 93.06 91.20 93.06 93.06 EM 73.47 75.51 88.78 91.84 91.84 91.84 Sep 93.06 93.06 93.06 91.20 93.98 94.91 Sep 84.69 84.69 86.73 87.76 86.73 90.82 UCI-pop failuers (symmetric) Peer Loss UCI-forest fire (symmetric) Peer Loss 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.52 94.91 94.44 94.44 94.44 MV 92.86 92.86 92.86 92.86 92.86 94.90 EM 93.06 93.06 94.44 94.44 94.44 94.44 EM 92.86 92.86 92.86 92.86 92.86 94.90 Sep 93.06 92.59 93.98 94.44 95.83 97.22 Sep 88.78 89.80 93.88 93.88 92.86 94.90 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾 = 3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 93.06 93.52 93.52 93.98 94.91 93.52 MV 78.57 78.57 89.80 91.84 89.80 91.84 EM 93.06 93.06 93.06 93.98 93.06 93.52 EM 59.18 73.47 88.78 89.80 89.80 91.84 Sep 88.89 92.59 92.59 93.98 95.83 95.37 Sep 83.67 80.61 85.71 88.78 91.84 92.86 Table 6 The performances of CE/BW/PeerLoss trained on 4 UCI datasets (StatLog, Optical, Pop-Failures, and Forest Fair datasets), with aggregated labels (majority vote, EM inference), and separated labels. 
(𝐾 is the number of labels per training image) CIFAR-10, Symmetric CE CIFAR-10, Instance CE 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.21 92.98 93.54 93.43 93.73 93.40 MV 91.99 93.29 93.57 93.47 93.68 93.60 EM 92.08 92.93 93.54 93.64 93.35 93.37 EM 91.92 93.21 93.55 93.61 93.44 93.44 Sep 92.52 92.89 93.35 93.15 93.42 93.40 Sep 92.36 92.97 93.43 93.24 93.33 93.35 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 89.09 91.59 93.18 93.43 93.26 93.44 MV 87.14 91.15 93.10 93.15 93.23 93.48 EM 88.83 91.02 92.54 93.45 93.69 93.68 EM 88.07 92.40 93.70 93.58 93.74 93.53 Sep 90.61 91.95 92.70 92.92 93.32 93.13 Sep 90.83 91.90 92.63 92.46 93.08 93.26 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 81.85 87.33 89.88 91.88 92.96 93.40 MV 49.22 83.95 89.45 91.60 92.88 93.65 EM 81.04 85.91 89.76 91.57 92.55 93.10 EM 78.34 88.79 91.95 92.97 93.46 93.65 Sep 87.00 89.19 90.70 91.97 92.40 93.17 Sep 83.79 87.55 90.15 91.58 91.86 92.74 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 20.94 44.62 70.91 79.61 84.83 89.09 MV 14.59 25.25 34.47 57.99 57.51 87.08 EM 37.91 50.78 67.19 75.26 82.97 87.97 EM 20.03 26.54 65.16 80.10 88.59 92.14 Sep 61.47 70.10 79.61 83.93 86.82 90.04 Sep 26.16 28.89 50.35 74.15 71.39 87.54 CIFAR-10, Symmetric BW CIFAR-10, Instance BW 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.08 94.09 94.92 94.90 94.79 94.90 MV 92.03 93.87 95.12 95.11 94.97 94.75 EM 92.13 93.08 94.90 94.91 94.90 94.86 EM 91.93 94.39 94.90 94.84 95.05 94.54 Sep 91.74 92.61 92.75 92.59 94.44 92.97 Sep 91.93 92.07 92.70 91.75 93.02 92.47 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 88.28 91.11 92.73 94.60 94.62 94.81 MV 86.61 90.64 93.00 94.73 94.72 94.72 EM 87.41 90.23 92.83 94.77 94.80 95.18 EM 89.83 92.04 94.74 95.00 94.94 94.80 Sep 89.14 89.68 91.07 92.46 92.26 94.24 Sep 88.86 87.89 92.09 89.92 91.05 91.96 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 81.21 86.29 89.51 91.33 93.52 94.81 MV 43.78 82.59 88.56 91.47 93.27 95.06 EM 78.13 84.33 89.44 91.17 92.45 94.60 EM 44.92 87.33 91.39 93.58 94.72 94.99 Sep 83.84 87.05 88.10 89.80 90.95 92.11 Sep 80.88 86.22 88.45 90.69 91.16 92.61 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 16.43 60.97 71.11 77.86 82.72 88.41 MV 16.00 25.03 33.80 67.91 68.52 86.49 EM 10.00 45.97 66.02 74.37 80.08 87.42 EM 16.06 22.73 53.96 76.24 86.74 92.02 Sep 58.48 69.86 76.03 79.79 82.60 86.31 Sep 27.84 26.68 32.72 37.27 54.41 83.37 CIFAR-10, Symmetric PeerLoss CIFAR-10, Instance PeerLoss 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.2 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 92.69 93.35 93.90 94.12 94.15 93.81 MV 92.13 93.53 94.00 93.78 94.13 94.08 EM 92.39 93.25 93.76 93.93 93.52 93.77 EM 91.93 93.51 93.78 93.88 94.03 93.82 Sep 93.15 93.51 93.77 93.51 93.56 93.73 Sep 92.86 93.23 93.56 93.72 93.63 93.95 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.4 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 89.40 91.88 93.42 93.84 93.83 94.04 MV 88.15 91.61 93.21 93.64 93.84 93.69 EM 89.23 91.41 93.06 93.83 93.85 94.11 EM 90.59 92.60 93.95 94.02 94.06 93.68 Sep 91.08 92.38 93.17 93.40 93.56 93.37 Sep 91.06 92.70 93.22 92.92 93.65 93.67 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.6 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 82.88 
87.95 90.42 92.31 93.61 93.79 MV 60.66 84.99 90.30 91.93 93.16 93.81 EM 81.64 86.45 90.09 91.98 93.23 93.58 EM 78.53 89.11 92.44 93.17 93.96 93.85 Sep 87.28 89.80 91.19 92.42 93.18 93.65 Sep 85.76 89.07 91.05 92.22 92.45 93.39 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 𝜖 = 0.8 𝐾=3 𝐾 = 5 𝐾 = 9 𝐾 = 15 𝐾 = 25 𝐾 = 49 MV 21.82 48.71 72.81 80.32 85.27 89.38 MV 14.35 24.83 40.49 65.47 69.28 88.05 EM 38.29 52.63 68.70 77.42 83.94 88.45 EM 26.52 28.43 66.72 80.71 89.40 92.41 Sep 64.32 72.52 80.31 84.65 87.40 90.56 Sep 33.87 37.49 57.36 77.43 80.51 89.15 Table 7 The performances of CE/BW/PeerLoss trained on (Left half: symmetric noise; right half: instance noise) CIFAR-10 aggregated labels (majority vote, EM inference), and separated labels. (Different number of labels per training image)
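For completeness, the CIFAR-10 optimization schedule listed in Appendix B.3 (SGD, initial learning rate 0.1, momentum 0.9, weight decay 0.0005, 120 epochs, learning-rate decay of 0.1 at epoch 50, mini-batch size 128) corresponds to a PyTorch setup along the following lines. This is only a sketch by us: the backbone and data loader are placeholders, and the exact training script used for Table 7 may differ.

```python
import torch
import torch.nn as nn
import torchvision

# Hyper-parameters as listed in Appendix B.3; the backbone below is a placeholder.
model = torchvision.models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)

def train(loader, num_epochs=120, device="cuda"):
    """Plain training loop over (image, noisy_label) mini-batches of size 128."""
    model.to(device).train()
    for epoch in range(num_epochs):
        for images, noisy_labels in loader:
            images, noisy_labels = images.to(device), noisy_labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), noisy_labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```

For the separation method, `loader` iterates over all K noisy copies of each image's label; for the aggregation baselines, it iterates over the single majority-vote or EM-inferred label per image.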